Transcriptformer Gene Embedding Tool

Overview

The Transcriptformer tool provides access to contextualized gene embeddings learned from single-cell RNA sequencing data. Transcriptformer uses a transformer architecture to capture cell-type-specific and disease-state-specific gene expression patterns, so that gene behavior can be analyzed within a specific biological context. In Prism ToolSpace, we precomputed Transcriptformer contextualized gene embeddings (CGE) across 5 single-cell disease atlases, which can be easily accessed and retrieved via this tool.

Data Acquisition

1. Download Transcriptformer Embeddings

The Transcriptformer embeddings are hosted on Hugging Face at: https://huggingface.co/datasets/mims-harvard/ToolSpace

Use the following shell commands to download only the Transcriptformer files in the transcriptformer_cge directory:

# Run the Hugging Face CLI (hf) through uvx; it is fetched on first use
uvx --from huggingface_hub hf

# Download only the transcriptformer_cge folder
uvx --from huggingface_hub hf download mims-harvard/ToolSpace \
  --repo-type dataset \
  --include "transcriptformer_cge/*" \
  --local-dir ./ToolSpace/
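
The same filtered download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and relies on its `snapshot_download` function; the helper names `download_kwargs` and `download_cge` are illustrative and are not part of ToolSpace.

```python
def download_kwargs(local_dir="./ToolSpace"):
    """Build snapshot_download arguments that mirror the CLI flags above."""
    return {
        "repo_id": "mims-harvard/ToolSpace",
        "repo_type": "dataset",                        # --repo-type dataset
        "allow_patterns": ["transcriptformer_cge/*"],  # --include
        "local_dir": local_dir,                        # --local-dir
    }


def download_cge(local_dir="./ToolSpace"):
    """Download only the transcriptformer_cge folder (needs network access)."""
    # Imported lazily so download_kwargs works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir))
```

Calling `download_cge()` returns the local path of the downloaded snapshot, which can then be used as the value of `TRANSCRIPTFORMER_DATA_PATH`.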

File Structure

The transcriptformer_cge directory contains disease-specific embedding stores:

transcriptformer_cge/
├── follicular_lymphoma/
│   ├── metadata.json.gz
│   ├── b_cell_normal.npy
│   ├── b_cell_follicular_lymphoma.npy
│   ├── t_cell_normal.npy
│   ├── t_cell_follicular_lymphoma.npy
│   └── ... (other cell type × disease state combinations)
├── rheumatoid_arthritis/
├── type_1_diabetes_mellitus/
├── sjogren_syndrome/
└── hepatoblastoma/
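
To see which cell type and disease state combinations were actually downloaded, the folder can be scanned from Python. This is a minimal sketch that assumes files are named `<cell_type>_<state>.npy`, with `<state>` being either `normal` or the disease identifier, as the listing above suggests; `list_contexts` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path


def list_contexts(disease_dir, disease):
    """Return sorted (cell_type, state) pairs found in one disease folder.

    Assumes the <cell_type>_<state>.npy naming pattern, where <state> is
    "normal" or the disease identifier. Other files are ignored.
    """
    contexts = []
    for path in Path(disease_dir).glob("*.npy"):
        stem = path.stem
        for state in ("normal", disease):
            suffix = "_" + state
            if stem.endswith(suffix):
                # Everything before the state suffix is the cell type.
                contexts.append((stem[: -len(suffix)], state))
                break
    return sorted(contexts)
```

For example, `list_contexts("./ToolSpace/transcriptformer_cge/follicular_lymphoma", "follicular_lymphoma")` would report one pair per `.npy` file and skip `metadata.json.gz`.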

2. Set Environment Variable

After the download completes, set the TRANSCRIPTFORMER_DATA_PATH environment variable:

# Set environment variable to point to your data directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
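
A quick way to confirm the variable is picked up is to resolve it from Python. The sketch below assumes the embeddings live in a `transcriptformer_cge` subfolder of that path, as in the file structure above; `resolve_cge_dir` is a hypothetical helper and may differ from how the tool itself reads the variable.

```python
import os
from pathlib import Path


def resolve_cge_dir(env=os.environ):
    """Return the transcriptformer_cge directory under TRANSCRIPTFORMER_DATA_PATH."""
    root = env.get("TRANSCRIPTFORMER_DATA_PATH")
    if not root:
        raise RuntimeError("TRANSCRIPTFORMER_DATA_PATH is not set")
    cge_dir = Path(root).expanduser() / "transcriptformer_cge"
    if not cge_dir.is_dir():
        raise FileNotFoundError(f"Expected embeddings under {cge_dir}")
    return cge_dir
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.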

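Tool Inputs and Outputs

Input Parameters

Parameter	Type	Required	Description
`disease`	str	Yes	Disease/dataset identifier (e.g. "follicular_lymphoma")
`state`	str	Yes	Disease state context ("normal", the disease name, etc.)
`cell_type`	str	Yes	Cell type context for the embeddings
`gene_names`	List[str]	Yes	Gene identifiers (symbols or Ensembl IDs)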

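Supported Disease Contexts

The available disease datasets are:

follicular_lymphoma - Follicular lymphoma vs. normal tissue
rheumatoid_arthritis - Rheumatoid arthritis vs. healthy controls
type_1_diabetes_mellitus - Type 1 diabetes vs. normal pancreatic tissue
sjogren_syndrome - Sjögren's syndrome vs. healthy controls
hepatoblastoma - Hepatoblastoma vs. normal liver tissue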

Disease State Options

normal - Normal/control state
[disease_name] - Disease-affected state (matches the disease identifier)
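
Checking a disease and state pair before querying avoids a round trip that ends in an error response. The following is a minimal sketch built from the five dataset identifiers listed earlier; it assumes the disease state is spelled exactly like the disease identifier, and `validate_context` is a hypothetical helper rather than part of the tool.

```python
SUPPORTED_DISEASES = (
    "follicular_lymphoma",
    "rheumatoid_arthritis",
    "type_1_diabetes_mellitus",
    "sjogren_syndrome",
    "hepatoblastoma",
)


def validate_context(disease, state):
    """Check a (disease, state) pair against the documented options."""
    if disease not in SUPPORTED_DISEASES:
        raise ValueError(
            f"Unknown disease {disease!r}; expected one of {SUPPORTED_DISEASES}"
        )
    # Assumption: the disease state reuses the disease identifier verbatim.
    allowed_states = ("normal", disease)
    if state not in allowed_states:
        raise ValueError(
            f"Unknown state {state!r}; expected one of {allowed_states}"
        )
    return disease, state
```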

Gene Identifier Formats

Gene symbols: ["TP53", "BRCA1", "EGFR", "MYC"]
Ensembl IDs: ["ENSG00000141510", "ENSG00000139618"]
Mixed formats: supported within the same request
Empty list: retrieves all available genes
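
When preparing a mixed request it can help to see which entries look like Ensembl IDs and which look like symbols. The sketch below is a client-side heuristic only: it treats `ENSG` followed by 11 digits as a human Ensembl gene ID. How the tool itself tells the two formats apart is not documented here, and `split_gene_names` is a hypothetical helper.

```python
import re

# Human Ensembl gene IDs consist of "ENSG" followed by 11 digits.
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}$")


def split_gene_names(gene_names):
    """Partition a mixed list into (symbols, ensembl_ids), preserving order."""
    symbols, ensembl_ids = [], []
    for name in gene_names:
        if ENSEMBL_GENE_ID.match(name):
            ensembl_ids.append(name)
        else:
            symbols.append(name)
    return symbols, ensembl_ids
```

An empty input yields two empty lists, which matches the "retrieve all genes" request in which no identifiers are given.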

Output Format

The tool returns a JSON object with the following structure:

Success Response

{
  "embeddings": {
    "TP53": [0.1234, -0.5678, 0.9012, ...],
    "BRCA1": [-0.2345, 0.6789, -0.1234, ...],
    "EGFR": [0.3456, -0.7890, 0.2345, ...],
    "...": "..."
  },
  "context_info": [
    "Successfully retrieved 1247 gene embeddings for context: follicular_lymphoma - normal - b_cell",
    "Embedding dimensionality: 512 features per gene",
    "Disease context: follicular_lymphoma (validated and processed)"
  ]
}
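
Once a response of this shape is parsed, the vectors can be compared directly. The sketch below uses only the Python standard library and a toy response with 3-dimensional vectors standing in for the real 512-dimensional ones; the numbers are shortened from the example above and carry no biological meaning.

```python
import json
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy response: same keys as the success response, truncated vectors.
response = json.loads("""
{
  "embeddings": {
    "TP53": [0.1234, -0.5678, 0.9012],
    "BRCA1": [-0.2345, 0.6789, -0.1234]
  },
  "context_info": ["toy example"]
}
""")

embeddings = response["embeddings"]
similarity = cosine_similarity(embeddings["TP53"], embeddings["BRCA1"])
```

Because embeddings are context-specific, comparing the same gene across two contexts (for example normal vs. disease B cells) follows the same pattern with two separate responses.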

Error Response

{
  "error": "Disease 'unknown_disease' not found in available stores",
  "context_info": [
    "Available diseases: ['follicular_lymphoma', 'rheumatoid_arthritis', 'type_1_diabetes_mellitus', 'sjogren_syndrome', 'hepatoblastoma']",
    "Please check disease identifier and ensure data is downloaded"
  ]
}
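
Since both response shapes are plain JSON objects, a caller needs to check for the "error" key before reading "embeddings". A minimal sketch of that check follows; `ToolError` and `unwrap_embeddings` are hypothetical names introduced for illustration.

```python
class ToolError(RuntimeError):
    """Raised when a response carries an "error" field (illustrative helper)."""


def unwrap_embeddings(response):
    """Return the embeddings dict, or raise ToolError with the context hints."""
    if "error" in response:
        hints = "; ".join(response.get("context_info", []))
        message = response["error"]
        if hints:
            message = f"{message} ({hints})"
        raise ToolError(message)
    return response["embeddings"]
```

Folding the "context_info" hints into the exception message keeps the list of available diseases visible in logs and tracebacks.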

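Embedding Properties

Dimensionality: 512-dimensional vector per gene
Format: dense numeric vectors (lists of float32 values)
Context-specific: embedding representations vary by cell type and disease state
Precision: float32, chosen to balance accuracy and efficiency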
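
These properties make it straightforward to estimate how much memory a result occupies once loaded as a dense float32 array. The sketch below is plain arithmetic with the standard library; the helper names are illustrative.

```python
import struct

EMBEDDING_DIM = 512                        # values per gene, as listed above
BYTES_PER_FLOAT32 = struct.calcsize("<f")  # 4 bytes


def embedding_bytes(n_genes, dim=EMBEDDING_DIM):
    """Size in bytes of n_genes float32 vectors stored as a dense array."""
    return n_genes * dim * BYTES_PER_FLOAT32


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]
```

For the 1247 genes reported in the example success response, `embedding_bytes(1247)` gives 1247 × 512 × 4 = 2,553,856 bytes, roughly 2.4 MiB. The `to_float32` helper shows the rounding involved: float32 keeps about seven significant decimal digits.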

MCP Server Setup

Prerequisites

# Create a uv virtual environment
uv venv transcriptformer --python 3.10
source transcriptformer/bin/activate
uv pip install -r requirements.txt

Configuration

Set the environment:

# Ensure TRANSCRIPTFORMER_DATA_PATH points to your ToolSpace directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"

Verify that the embedding files exist:

ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/"
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/follicular_lymphoma/"

Running the MCP Server

# Run the MCP server
python transcriptformer_tool.py

Server Configuration

Host: 0.0.0.0 (accepts connections from any IP)
Port: 7002 (configured to avoid conflicts with other tools)
Transport: streamable-http
Mode: stateless HTTP for scalability
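
After starting the server, a simple TCP check confirms that something is listening on the configured port. Note that 0.0.0.0 is a bind address; a client on the same machine would normally connect to 127.0.0.1 instead. The sketch below uses only the standard library and does not speak the MCP protocol itself; `is_port_open` is a hypothetical helper.

```python
import socket


def is_port_open(host="127.0.0.1", port=7002, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A `True` result only shows that the port accepts connections; an MCP client is still needed to call the tool over streamable HTTP.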