Transcriptformer Gene Embedding Tool
Overview
The Transcriptformer tool provides access to contextualized gene embeddings learned from single-cell RNA sequencing data. Transcriptformer uses a transformer architecture to capture cell-type-specific and disease-state-specific gene expression patterns, so that gene behavior can be analyzed in the relevant biological context. In Prism ToolSpace, we precomputed Transcriptformer contextualized gene embeddings (CGE) across five single-cell disease atlases; they can be easily accessed and retrieved through this tool.
Data Acquisition
1. Download Transcriptformer Embeddings
The Transcriptformer embeddings are hosted on Hugging Face at: https://huggingface.co/datasets/mims-harvard/ToolSpace
Use the following shell commands to download only the Transcriptformer files from the transcriptformer_cge directory:
# Optional: confirm the Hugging Face CLI runs through uvx (uvx fetches it on demand, so no separate install is needed)
uvx --from huggingface_hub hf --help
# Download only the transcriptformer_cge folder
uvx --from huggingface_hub hf download mims-harvard/ToolSpace \
  --repo-type dataset \
  --include "transcriptformer_cge/*" \
  --local-dir ./ToolSpace/
File Structure
The transcriptformer_cge directory contains disease-specific embedding stores:
transcriptformer_cge/
├── follicular_lymphoma/
│ ├── metadata.json.gz
│ ├── b_cell_normal.npy
│ ├── b_cell_follicular_lymphoma.npy
│ ├── t_cell_normal.npy
│ ├── t_cell_follicular_lymphoma.npy
│ └── ... (other cell type × disease state combinations)
├── rheumatoid_arthritis/
├── type_1_diabetes_mellitus/
├── sjogren_syndrome/
└── hepatoblastoma/
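The layout above suggests that each `.npy` file name combines a cell type and a disease state. As a minimal sketch, assuming the naming pattern is `<cell_type>_<disease_state>.npy` (inferred from the listing, not confirmed by the tool), the available contexts for one disease could be enumerated like this. The helper name `list_contexts` is hypothetical.

```python
from pathlib import Path


def list_contexts(disease_dir, states):
    """Return (cell_type, state) pairs found in one disease directory.

    Assumes files are named '<cell_type>_<state>.npy'; this pattern is
    inferred from the directory listing and is not a documented contract.
    """
    contexts = []
    for npy in sorted(Path(disease_dir).glob("*.npy")):
        stem = npy.stem
        # Try longer state names first so a multi-word state is matched whole.
        for state in sorted(states, key=len, reverse=True):
            suffix = "_" + state
            if stem.endswith(suffix) and len(stem) > len(suffix):
                contexts.append((stem[: -len(suffix)], state))
                break
    return contexts
```

For example, `list_contexts("ToolSpace/transcriptformer_cge/follicular_lymphoma", ["normal", "follicular_lymphoma"])` would pair `b_cell` with each of the two states, provided the files shown in the listing are present.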
2. Set Environment Variable
After the download completes, set the TRANSCRIPTFORMER_DATA_PATH environment variable:
# Set environment variable to point to your data directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
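A quick check from Python can catch a missing or mistyped path before the server starts. The sketch below is illustrative only: the function name `resolve_cge_dir` is hypothetical, and it assumes the embeddings live in a `transcriptformer_cge` subdirectory of the configured path, as in the download step.

```python
import os
from pathlib import Path


def resolve_cge_dir(env=None):
    """Locate the transcriptformer_cge directory from TRANSCRIPTFORMER_DATA_PATH.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    root = env.get("TRANSCRIPTFORMER_DATA_PATH")
    if not root:
        raise RuntimeError("TRANSCRIPTFORMER_DATA_PATH is not set")
    cge_dir = Path(root).expanduser() / "transcriptformer_cge"
    if not cge_dir.is_dir():
        raise FileNotFoundError(f"{cge_dir} does not exist; has the download been run?")
    return cge_dir
```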
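Tool Inputs and Outputs
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| disease | string | Yes | Disease/dataset identifier (e.g., "follicular_lymphoma") |
|  | string | Yes | Disease state context ("normal", the disease name, etc.) |
|  | string | Yes | Cell type context for the embeddings |
|  | List[str] | Yes | Gene identifiers (symbols or Ensembl IDs) |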
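Supported Disease Contexts
The available disease datasets are:
follicular_lymphoma - follicular lymphoma vs. normal tissue
rheumatoid_arthritis - rheumatoid arthritis vs. healthy controls
type_1_diabetes_mellitus - type 1 diabetes vs. normal pancreatic tissue
sjogren_syndrome - Sjögren's syndrome vs. healthy controls
hepatoblastoma - hepatoblastoma vs. normal liver tissue
Disease State Options
normal - normal/control state
[disease_name] - disease-affected state (matches the disease identifier)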
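Gene Identifier Formats
Gene symbols: ["TP53", "BRCA1", "EGFR", "MYC"]
Ensembl IDs: ["ENSG00000141510", "ENSG00000139618"]
Mixed formats: symbols and Ensembl IDs can be combined in the same request
Empty list: retrieves all available genes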
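Because symbols and Ensembl IDs can be mixed in one request, it can help to see which kind each entry is before sending it. The following is a minimal client-side sketch, not part of the tool: `split_identifiers` is a hypothetical helper, and it relies only on the standard form of human Ensembl gene IDs (`ENSG` followed by 11 digits).

```python
import re

# Human Ensembl gene IDs have the form ENSG + 11 digits, e.g. ENSG00000141510.
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}$")


def split_identifiers(genes):
    """Partition a gene list into (ensembl_ids, symbols), preserving order."""
    ensembl_ids, symbols = [], []
    for gene in genes:
        gene = gene.strip()
        if ENSEMBL_GENE_ID.match(gene):
            ensembl_ids.append(gene)
        else:
            symbols.append(gene)
    return ensembl_ids, symbols
```

Anything that does not match the Ensembl pattern is treated as a symbol here; the sketch does not check that the symbol actually exists in the embedding store.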
Output Format
The tool returns a JSON object with the following structure:
Success Response
{
"embeddings": {
"TP53": [0.1234, -0.5678, 0.9012, ...],
"BRCA1": [-0.2345, 0.6789, -0.1234, ...],
"EGFR": [0.3456, -0.7890, 0.2345, ...],
"...": "..."
},
"context_info": [
"Successfully retrieved 1247 gene embeddings for context: follicular_lymphoma - normal - b_cell",
"Embedding dimensionality: 512 features per gene",
"Disease context: follicular_lymphoma (validated and processed)"
]
}
Error Response
{
"error": "Disease 'unknown_disease' not found in available stores",
"context_info": [
"Available diseases: ['follicular_lymphoma', 'rheumatoid_arthritis', 'type_1_diabetes_mellitus', 'sjogren_syndrome', 'hepatoblastoma']",
"Please check disease identifier and ensure data is downloaded"
]
}
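A caller has to distinguish the two response shapes before using the vectors. The sketch below shows one way to do that, assuming only the keys visible in the examples (`embeddings`, `error`, `context_info`); the function name `parse_response` is hypothetical and the consistency check is an added safeguard, not documented tool behavior.

```python
import json


def parse_response(raw):
    """Return {gene: vector} from a tool response, raising on an error payload.

    Accepts either a JSON string or an already-decoded dict.
    """
    payload = json.loads(raw) if isinstance(raw, str) else raw
    if "error" in payload:
        # Surface the server-provided hints together with the error message.
        hints = " | ".join(payload.get("context_info", []))
        raise RuntimeError(f"{payload['error']} [{hints}]")
    embeddings = payload["embeddings"]
    # Added safeguard: all vectors in one response should share a length.
    lengths = {len(vec) for vec in embeddings.values()}
    if len(lengths) > 1:
        raise ValueError(f"inconsistent embedding lengths: {sorted(lengths)}")
    return embeddings
```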
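Embedding Properties
Dimensionality: 512-dimensional vector per gene
Format: dense numeric vectors (lists of float32 values)
Context-specific: embedding representations vary by cell type and disease state
Precision: float32, balancing accuracy and efficiency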
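Since the same gene has a different vector in each cell type and disease state, one common way to quantify that difference is cosine similarity between two of its embeddings. This is an illustrative analysis sketch, not something the tool provides; it uses plain Python lists, matching the JSON output format.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

Given the embedding of a gene retrieved once with the normal state and once with the disease state for the same cell type, a value near 1 indicates similar representations, and lower values indicate that the gene's context differs more between the two states.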
MCP Server Setup
Prerequisites
# Create a uv virtual environment
uv venv transcriptformer --python 3.10
source transcriptformer/bin/activate
uv pip install -r requirements.txt
Configuration
Set the environment:
# Ensure TRANSCRIPTFORMER_DATA_PATH points to your ToolSpace directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
Verify that the embedding files exist:
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/"
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/follicular_lymphoma/"
Running the MCP Server
# Run the MCP server
python transcriptformer_tool.py
Server Configuration
Host: 0.0.0.0 (accepts connections from any IP)
Port: 7002 (configured to avoid conflicts with other tools)
Transport: streamable-http
Mode: stateless HTTP for scalability
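MCP tool invocations are JSON-RPC 2.0 messages with the method `tools/call`. The sketch below only builds such a request body so its shape is visible. Everything specific in it is an assumption: the `/mcp` endpoint path, the tool name, and the argument keys are not given in this document, so check them against the running server (for example through its tool listing). In practice an MCP client library would normally send the request and handle session initialization.

```python
import json

# Assumption: the URL path is hypothetical; only the port (7002) comes from the docs.
SERVER_URL = "http://localhost:7002/mcp"


def build_tool_call(tool_name, arguments, request_id=1):
    """Build a JSON-RPC 2.0 'tools/call' request body for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    "transcriptformer_embeddings",  # hypothetical tool name
    {
        # Hypothetical argument keys; the values mirror the documented examples.
        "disease": "follicular_lymphoma",
        "state": "normal",
        "cell_type": "b_cell",
        "genes": ["TP53", "ENSG00000139618"],
    },
)
body = json.dumps(request)  # JSON string that would be sent to SERVER_URL
```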