Transcriptformer Gene Embedding Tool

Overview

The Transcriptformer tool provides access to contextualized gene embeddings learned from single-cell RNA sequencing data. Transcriptformer uses a transformer architecture to capture cell-type-specific and disease-state-specific gene expression patterns, enabling analysis of gene behavior in the relevant biological context. In Prism ToolSpace, we precomputed Transcriptformer contextualized gene embeddings (CGE) across 5 single-cell disease atlases, which can be easily accessed and retrieved via this tool.

Data Acquisition

1. Download Transcriptformer Embeddings

The Transcriptformer embeddings are hosted on Hugging Face at: https://huggingface.co/datasets/mims-harvard/ToolSpace

Use the following shell commands to download only the Transcriptformer files from the transcriptformer_cge directory:

# Install the CLI if it is not already available
uvx --from huggingface_hub hf

# Download only the transcriptformer_cge folder
uvx --from huggingface_hub hf download mims-harvard/ToolSpace \
  --repo-type dataset \
  --include "transcriptformer_cge/*" \
  --local-dir ./ToolSpace/
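For scripted setups, the same filtered download can be sketched in Python. This assumes the `huggingface_hub` package is installed; `snapshot_download` and its `allow_patterns` argument belong to that library, while the helper names `cge_download_kwargs` and `download_cge` are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict

REPO_ID = "mims-harvard/ToolSpace"


def cge_download_kwargs(local_dir: str = "./ToolSpace/") -> Dict[str, Any]:
    """Build arguments that mirror the flags of the shell command."""
    return {
        "repo_id": REPO_ID,
        "repo_type": "dataset",                        # --repo-type dataset
        "allow_patterns": ["transcriptformer_cge/*"],  # --include
        "local_dir": local_dir,                        # --local-dir
    }


def download_cge(local_dir: str = "./ToolSpace/") -> str:
    """Download only the transcriptformer_cge folder and return the local path."""
    # Imported lazily so the argument helper works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**cge_download_kwargs(local_dir))
```

Calling `download_cge()` once should leave the files under `./ToolSpace/transcriptformer_cge/`, the same location the CLI command writes to.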

File Structure

The transcriptformer_cge directory contains disease-specific embedding stores:

transcriptformer_cge/
├── follicular_lymphoma/
│   ├── metadata.json.gz
│   ├── b_cell_normal.npy
│   ├── b_cell_follicular_lymphoma.npy
│   ├── t_cell_normal.npy
│   ├── t_cell_follicular_lymphoma.npy
│   └── ... (other cell type × disease state combinations)
├── rheumatoid_arthritis/
├── type_1_diabetes_mellitus/
├── sjogren_syndrome/
└── hepatoblastoma/
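The layout above suggests that each .npy file is named after a cell type followed by a disease state. A minimal sketch for enumerating the available contexts of one disease follows. It assumes the naming pattern `<cell_type>_<state>.npy`, with the state being either `normal` or the disease identifier, which is inferred from the listing rather than documented. The helpers `split_context` and `list_contexts` are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def split_context(stem: str, disease: str) -> Tuple[str, str]:
    """Split a file stem such as 'b_cell_normal' into (cell_type, state)."""
    # Assumption: the state suffix is either 'normal' or the disease identifier.
    for state in ("normal", disease):
        suffix = "_" + state
        if stem.endswith(suffix):
            return stem[: -len(suffix)], state
    raise ValueError(f"Unrecognized context file name: {stem!r}")


def list_contexts(cge_root: str, disease: str) -> List[Tuple[str, str]]:
    """Return sorted (cell_type, state) pairs for the .npy files of one disease."""
    disease_dir = Path(cge_root) / disease
    return sorted(split_context(p.stem, disease) for p in disease_dir.glob("*.npy"))
```

For example, `list_contexts("./ToolSpace/transcriptformer_cge", "follicular_lymphoma")` would list the cell type and state combinations present on disk; `metadata.json.gz` is skipped because only `.npy` files are matched.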

2. Set Environment Variable

After the download completes, set the TRANSCRIPTFORMER_DATA_PATH environment variable:

# Set environment variable to point to your data directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"

Tool Input and Output

Input Parameters

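Parameter  | Type      | Required | Description
-----------|-----------|----------|------------
disease    | string    |          | Disease/dataset identifier (e.g., "follicular_lymphoma")
state      | string    |          | Disease state context ("normal", "disease_name", etc.)
cell_type  | string    |          | Cell type context for the embeddings
gene_names | list[str] |          | Gene identifiers (symbols or Ensembl IDs)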
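As an illustration of how the four inputs fit together, the sketch below assembles and validates a request on the client side. The argument names `disease`, `state`, `cell_type`, and `gene_names` are assumed from the parameter table, and the `EmbeddingQuery` class is hypothetical; the tool's actual schema may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EmbeddingQuery:
    """Client-side container for the four tool inputs (names are assumptions)."""

    disease: str
    state: str
    cell_type: str
    gene_names: List[str] = field(default_factory=list)

    def to_arguments(self) -> Dict[str, object]:
        """Validate basic types and return a plain dict of arguments."""
        for name in ("disease", "state", "cell_type"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        if not all(isinstance(g, str) for g in self.gene_names):
            raise TypeError("gene_names must be a list of strings")
        return asdict(self)


query = EmbeddingQuery("follicular_lymphoma", "normal", "b_cell", ["TP53", "BRCA1"])
arguments = query.to_arguments()
```

Validating before sending keeps obviously malformed requests from reaching the server, where they would only come back as an error response.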

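Supported Disease Contexts

Available disease datasets include:

  • follicular_lymphoma - Follicular lymphoma vs. normal tissue

  • rheumatoid_arthritis - Rheumatoid arthritis vs. healthy controls

  • type_1_diabetes_mellitus - Type 1 diabetes vs. normal pancreatic tissue

  • sjogren_syndrome - Sjögren's syndrome vs. healthy controls

  • hepatoblastoma - Hepatoblastoma vs. normal liver tissue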

Disease State Options

  • normal - Normal/control state

  • [disease_name] - Disease-affected state (matches the disease identifier)

Gene Identifier Formats

  • Gene symbols: ["TP53", "BRCA1", "EGFR", "MYC"]

  • Ensembl IDs: ["ENSG00000141510", "ENSG00000139618"]

  • Mixed formats: supported within the same request

  • Empty list: retrieves all available genes
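Because symbols and Ensembl IDs can be mixed in one request, it can help to see which is which before sending. The sketch below partitions a list on the client side. It assumes unversioned human Ensembl gene IDs of the form `ENSG` plus 11 digits, as in the examples above; how the tool itself tells the two formats apart is not documented here, and `partition_gene_ids` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Assumption: unversioned human Ensembl gene IDs, "ENSG" followed by 11 digits.
ENSEMBL_GENE_RE = re.compile(r"^ENSG\d{11}$")


def partition_gene_ids(gene_names: List[str]) -> Dict[str, List[str]]:
    """Group identifiers into Ensembl IDs and gene symbols, preserving order."""
    groups: Dict[str, List[str]] = {"ensembl": [], "symbol": []}
    for name in gene_names:
        key = "ensembl" if ENSEMBL_GENE_RE.match(name) else "symbol"
        groups[key].append(name)
    return groups


mixed = ["TP53", "ENSG00000141510", "BRCA1", "ENSG00000139618"]
groups = partition_gene_ids(mixed)
```

Anything that does not match the Ensembl pattern is treated as a symbol, so typos in an Ensembl ID would end up in the `symbol` group.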

Output Format

The tool returns a JSON object with the following structure:

Successful Response

{
  "embeddings": {
    "TP53": [0.1234, -0.5678, 0.9012, ...],
    "BRCA1": [-0.2345, 0.6789, -0.1234, ...],
    "EGFR": [0.3456, -0.7890, 0.2345, ...],
    "...": "..."
  },
  "context_info": [
    "Successfully retrieved 1247 gene embeddings for context: follicular_lymphoma - normal - b_cell",
    "Embedding dimensionality: 512 features per gene",
    "Disease context: follicular_lymphoma (validated and processed)"
  ]
}

Error Response

{
  "error": "Disease 'unknown_disease' not found in available stores",
  "context_info": [
    "Available diseases: ['follicular_lymphoma', 'rheumatoid_arthritis', 'type_1_diabetes_mellitus', 'sjogren_syndrome', 'hepatoblastoma']",
    "Please check disease identifier and ensure data is downloaded"
  ]
}
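A caller has to distinguish the two response shapes shown above. The following sketch does so using only the keys that appear in the examples (`embeddings`, `error`, `context_info`); the function `parse_response` and the exception `EmbeddingToolError` are hypothetical names introduced for illustration.

```python
import json
from typing import Dict, List


class EmbeddingToolError(RuntimeError):
    """Raised when the tool returns an error response."""


def parse_response(raw: str) -> Dict[str, List[float]]:
    """Return the gene-to-vector mapping, or raise if the response is an error."""
    payload = json.loads(raw)
    if "error" in payload:
        # Append the context_info hints so the caller sees the suggested fix.
        hints = "; ".join(payload.get("context_info", []))
        message = f"{payload['error']} ({hints})" if hints else payload["error"]
        raise EmbeddingToolError(message)
    return payload["embeddings"]


ok = parse_response('{"embeddings": {"TP53": [0.1, -0.5]}, "context_info": []}')
```

Raising an exception for the error shape keeps downstream code from silently working with a missing `embeddings` key.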

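Embedding Properties

  • Dimensionality: one 512-dimensional vector per gene

  • Format: dense numeric vectors (lists of float32 values)

  • Context-specific: embedding representations vary by cell type and disease state

  • Precision: float32, balancing accuracy and efficiency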
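Since the embeddings differ by context, one natural use is to compare the same gene across two contexts, for example normal versus disease in the same cell type. The sketch below uses cosine similarity for that comparison. This metric is an illustrative choice and not something the tool prescribes, and the function names are hypothetical. The inputs are assumed to be the `embeddings` mappings from two separate requests.

```python
import math
from typing import Dict, Sequence

EXPECTED_DIM = 512  # dimensionality stated in the embedding properties


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def context_similarity(
    normal: Dict[str, Sequence[float]],
    disease: Dict[str, Sequence[float]],
    dim: int = EXPECTED_DIM,
) -> Dict[str, float]:
    """Per-gene cosine similarity for genes present in both contexts."""
    result = {}
    for gene in sorted(set(normal) & set(disease)):
        if len(normal[gene]) != dim or len(disease[gene]) != dim:
            raise ValueError(f"{gene}: expected {dim}-dimensional vectors")
        result[gene] = cosine_similarity(normal[gene], disease[gene])
    return result
```

Genes with low similarity between the two contexts are candidates for a closer look, though what counts as "low" depends on the dataset and would need to be calibrated empirically.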

MCP Server Setup

Prerequisites

# Create a uv virtual environment
uv venv transcriptformer --python 3.10
source transcriptformer/bin/activate
uv pip install -r requirements.txt

Configuration

  1. Set the environment variable

# Ensure TRANSCRIPTFORMER_DATA_PATH points to your ToolSpace directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"

  2. Verify that the embedding files exist

ls -la $TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/
ls -la $TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/follicular_lymphoma/
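The same check can be scripted so that it fails early when the data path is wrong. This is a minimal sketch that reads TRANSCRIPTFORMER_DATA_PATH and counts the .npy files in each of the five documented disease directories; `check_data_layout` is a hypothetical helper, not part of the tool.

```python
import os
from pathlib import Path
from typing import Dict, Optional

DISEASES = [
    "follicular_lymphoma",
    "rheumatoid_arthritis",
    "type_1_diabetes_mellitus",
    "sjogren_syndrome",
    "hepatoblastoma",
]


def check_data_layout(data_path: Optional[str] = None) -> Dict[str, int]:
    """Return the number of .npy files found for each disease directory."""
    root = data_path or os.environ.get("TRANSCRIPTFORMER_DATA_PATH")
    if not root:
        raise EnvironmentError("TRANSCRIPTFORMER_DATA_PATH is not set")
    cge_dir = Path(root) / "transcriptformer_cge"
    if not cge_dir.is_dir():
        raise FileNotFoundError(f"Missing directory: {cge_dir}")
    counts = {}
    for disease in DISEASES:
        disease_dir = cge_dir / disease
        # A missing disease directory is reported as zero files.
        counts[disease] = len(list(disease_dir.glob("*.npy"))) if disease_dir.is_dir() else 0
    return counts
```

A count of zero for any disease suggests that the download was incomplete or that the variable points at the wrong directory.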

Running the MCP Server

# Run the MCP server
python transcriptformer_tool.py

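Server Configuration

  • Host: 0.0.0.0 (accepts connections from any IP)

  • Port: 7002 (configured to avoid conflicts with other tools)

  • Transport: streamable-http

  • Mode: stateless HTTP for scalability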
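Once the server is running, a client can connect over streamable HTTP and list the tools it exposes. The sketch below assumes the official `mcp` Python SDK is installed and that the server uses that SDK's default endpoint path `/mcp`; neither is confirmed by this README, and the SDK's client function names may differ between versions. `server_url` and `list_server_tools` are hypothetical helpers.

```python
from typing import List, Optional


def server_url(host: str = "localhost", port: int = 7002, path: str = "/mcp") -> str:
    """Build the endpoint URL; the '/mcp' path is an assumed SDK default."""
    return f"http://{host}:{port}{path}"


async def list_server_tools(url: Optional[str] = None) -> List[str]:
    """Connect to the server and return the names of the tools it exposes."""
    # Imported lazily so server_url() can be used without the SDK installed.
    from mcp import ClientSession
    from mcp.client.streamable_http import streamablehttp_client

    async with streamablehttp_client(url or server_url()) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.list_tools()
            return [tool.name for tool in result.tools]
```

Running `asyncio.run(list_server_tools())` after `import asyncio` should print nothing by itself but return the tool names, which is a quick way to confirm that the host, port, and transport settings above are in effect.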