Transcriptformer Gene Embedding Tool¶
Overview¶
The Transcriptformer tool provides access to contextualized gene embeddings learned from single-cell RNA sequencing data. Transcriptformer uses a transformer architecture to capture cell-type-specific and disease-state-specific gene expression patterns, so that gene behavior can be analyzed in the relevant biological context. In Prism ToolSpace, we pre-computed the Transcriptformer contextualized gene embeddings (CGE) across five single-cell disease atlases, which can be easily accessed and retrieved via this tool.
Data Acquisition¶
1. Download Transcriptformer Embeddings¶
The Transcriptformer embeddings are hosted on Hugging Face at: https://huggingface.co/datasets/mims-harvard/ToolSpace
Use the following shell commands to download only the Transcriptformer files in the transcriptformer_cge directory:
# Confirm the Hugging Face CLI runs through uvx (uvx fetches huggingface_hub on first use, so no separate install is needed)
uvx --from huggingface_hub hf
# Download only the transcriptformer_cge folder
uvx --from huggingface_hub hf download mims-harvard/ToolSpace \
--repo-type dataset \
--include "transcriptformer_cge/*" \
--local-dir ./ToolSpace/
File Structure¶
The Transcriptformer directory contains disease-specific embedding stores:
transcriptformer_cge/
├── follicular_lymphoma/
│ ├── metadata.json.gz
│ ├── b_cell_normal.npy
│ ├── b_cell_follicular_lymphoma.npy
│ ├── t_cell_normal.npy
│ ├── t_cell_follicular_lymphoma.npy
│ └── ... (other cell type × disease state combinations)
├── rheumatoid_arthritis/
├── type_1_diabetes_mellitus/
├── sjogren_syndrome/
└── hepatoblastoma/
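To see which cell type × disease state combinations were actually downloaded, the layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the tool: the helper name `list_context_files` is hypothetical, and it assumes only what the tree shows, namely one subdirectory per disease containing `.npy` files.

```python
from pathlib import Path


def list_context_files(cge_root: Path) -> dict[str, list[str]]:
    """Map each disease directory to the sorted stems of its .npy files.

    `cge_root` is the local transcriptformer_cge/ directory. Non-.npy files
    such as metadata.json.gz are ignored.
    """
    contexts: dict[str, list[str]] = {}
    for disease_dir in sorted(p for p in cge_root.iterdir() if p.is_dir()):
        contexts[disease_dir.name] = sorted(f.stem for f in disease_dir.glob("*.npy"))
    return contexts
```

Calling `list_context_files(Path("./ToolSpace/transcriptformer_cge"))` after the download returns a dictionary keyed by disease, which makes a missing or partially downloaded store easy to spot.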
2. Set Environment Variable¶
After downloading, set the TRANSCRIPTFORMER_DATA_PATH environment variable:
# Set environment variable to point to your data directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
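A script that consumes the embeddings directly can fail early when the variable is missing or points at the wrong place. The sketch below is illustrative only: `resolve_cge_root` is a hypothetical helper, and it assumes the embeddings live in a `transcriptformer_cge` subdirectory of the configured path, as the download command and file structure suggest.

```python
import os
from pathlib import Path


def resolve_cge_root(env_var: str = "TRANSCRIPTFORMER_DATA_PATH") -> Path:
    """Return the transcriptformer_cge directory under the configured data path."""
    base = os.environ.get(env_var)
    if not base:
        raise RuntimeError(f"{env_var} is not set")
    # Assumption: embeddings sit in <data path>/transcriptformer_cge/
    root = Path(base).expanduser() / "transcriptformer_cge"
    if not root.is_dir():
        raise FileNotFoundError(f"Expected embedding directory at {root}")
    return root
```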
Tool Input and Output¶
Input Parameters¶
| Parameter | Type | Required | Description |
|---|---|---|---|
|  | string | Yes | Disease/dataset identifier (e.g., “follicular_lymphoma”) |
|  | string | Yes | Disease state context (“normal”, “disease_name”, etc.) |
|  | string | Yes | Cell type context for embeddings |
|  | List[str] | Yes | Gene identifiers (symbols or Ensembl IDs) |
Supported Disease Contexts¶
Available disease datasets include:
follicular_lymphoma - Follicular lymphoma vs normal tissue
rheumatoid_arthritis - Rheumatoid arthritis vs healthy controls
type_1_diabetes_mellitus - Type 1 diabetes vs normal pancreatic tissue
sjogren_syndrome - Sjögren’s syndrome vs healthy controls
hepatoblastoma - Hepatoblastoma vs normal liver tissue
Disease State Options¶
normal - Healthy/control condition
[disease_name] - Disease-affected state (matches the disease identifier)
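The rule above (the state is either `normal` or the disease identifier itself) can be checked before any file is touched. The following is a rough sketch under stated assumptions: `context_filename` is a hypothetical helper, and the `<cell_type>_<state>.npy` naming is inferred from file names such as `b_cell_normal.npy` and `b_cell_follicular_lymphoma.npy`, not from the tool's source.

```python
def context_filename(disease: str, cell_type: str, state: str) -> str:
    """Build the assumed embedding file name for one cell type and disease state."""
    # The state must be "normal" or match the disease identifier.
    if state not in ("normal", disease):
        raise ValueError(
            f"Disease state must be 'normal' or '{disease}', got '{state}'"
        )
    # Assumption: files are named <cell_type>_<state>.npy
    return f"{cell_type}_{state}.npy"
```

For example, `context_filename("follicular_lymphoma", "b_cell", "normal")` yields `b_cell_normal.npy`, while passing `"rheumatoid_arthritis"` as the state for that disease raises a `ValueError`.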
Gene Identifier Formats¶
Gene symbols:
["TP53", "BRCA1", "EGFR", "MYC"]
Ensembl IDs:
["ENSG00000141510", "ENSG00000139618"]
Mixed formats: Supported in the same request
Empty list: Retrieves all available genes
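When a request mixes both formats, it can help to see which entries will be read as Ensembl IDs. The sketch below is a client-side illustration, not the tool's own parsing: `split_gene_identifiers` is a hypothetical helper, and it assumes unversioned human Ensembl gene IDs (`ENSG` followed by 11 digits). Whether the tool accepts versioned IDs such as `ENSG00000141510.18` is not stated here, so they are treated as symbols.

```python
import re

# Unversioned human Ensembl gene ID: "ENSG" plus exactly 11 digits.
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def split_gene_identifiers(genes: list[str]) -> tuple[list[str], list[str]]:
    """Split identifiers into (ensembl_ids, symbols), preserving input order."""
    ensembl_ids: list[str] = []
    symbols: list[str] = []
    for gene in genes:
        if ENSEMBL_GENE_ID.fullmatch(gene):
            ensembl_ids.append(gene)
        else:
            symbols.append(gene)
    return ensembl_ids, symbols
```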
Output Format¶
The tool returns a JSON object with the following structure:
Successful Response¶
{
"embeddings": {
"TP53": [0.1234, -0.5678, 0.9012, ...],
"BRCA1": [-0.2345, 0.6789, -0.1234, ...],
"EGFR": [0.3456, -0.7890, 0.2345, ...],
"...": "..."
},
"context_info": [
"Successfully retrieved 1247 gene embeddings for context: follicular_lymphoma - normal - b_cell",
"Embedding dimensionality: 512 features per gene",
"Disease context: follicular_lymphoma (validated and processed)"
]
}
Error Response¶
{
"error": "Disease 'unknown_disease' not found in available stores",
"context_info": [
"Available diseases: ['follicular_lymphoma', 'rheumatoid_arthritis', 'type_1_diabetes_mellitus', 'sjogren_syndrome', 'hepatoblastoma']",
"Please check disease identifier and ensure data is downloaded"
]
}
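Because both response shapes share the `context_info` field and differ only in `embeddings` versus `error`, a caller can branch on the presence of `error`. This is a minimal sketch of such handling: `parse_response` and `ToolError` are hypothetical names, and the code relies only on the keys shown in the two examples above.

```python
import json


class ToolError(RuntimeError):
    """Raised when the tool response carries an "error" field."""


def parse_response(raw: str) -> dict[str, list[float]]:
    """Return the gene -> embedding mapping, or raise ToolError on an error response."""
    payload = json.loads(raw)
    if "error" in payload:
        # Append the context_info hints so the caller sees the suggested fix.
        hints = "; ".join(payload.get("context_info", []))
        message = f"{payload['error']} ({hints})" if hints else payload["error"]
        raise ToolError(message)
    return payload["embeddings"]
```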
Embedding Properties¶
Dimensionality: 512-dimensional vectors per gene
Format: Dense numerical vectors (list of float32 values)
Context-specific: Embeddings vary by cell type and disease state
Precision: float32, which halves storage relative to float64 at the cost of some numerical precision
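These properties make the raw size of a result easy to estimate: each gene contributes 512 values of 4 bytes. The helper below is a back-of-the-envelope sketch (`embedding_bytes` is a hypothetical name) and counts only the float32 payload, not JSON or file-format overhead.

```python
import struct

EMBEDDING_DIM = 512                      # features per gene, as stated above
BYTES_PER_VALUE = struct.calcsize("<f")  # a float32 occupies 4 bytes


def embedding_bytes(n_genes: int, dim: int = EMBEDDING_DIM) -> int:
    """Raw float32 payload size in bytes for n_genes embeddings of length dim."""
    if n_genes < 0 or dim < 0:
        raise ValueError("n_genes and dim must be non-negative")
    return n_genes * dim * BYTES_PER_VALUE
```

For the 1247 genes in the sample response, this gives 1247 × 2048 = 2,553,856 bytes, roughly 2.4 MiB. The same data serialized as JSON lists of decimal numbers will be noticeably larger.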
MCP Server Setup¶
Prerequisites¶
# Create a uv virtual environment
uv venv transcriptformer --python 3.10
source transcriptformer/bin/activate
uv pip install -r requirements.txt
Configuration¶
Set up the environment:
# Ensure TRANSCRIPTFORMER_DATA_PATH points to your ToolSpace directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
Verify embedding files exist:
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/"
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/follicular_lymphoma/"
Running the MCP Server¶
# Run the MCP server
python transcriptformer_tool.py
Server Configuration¶
Host: 0.0.0.0 (accepts connections from any IP)
Port: 7002 (configured to avoid conflicts with other tools)
Transport: streamable-http
Mode: Stateless HTTP for scalability
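Before pointing an MCP client at the server, a quick TCP check confirms that something is listening on the configured port. This sketch uses only the standard library and does not speak the MCP protocol. `is_port_open` is a hypothetical helper, and the default of `127.0.0.1` assumes the check runs on the same machine as the server.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7002, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A `True` result only shows that the port accepts connections. It does not verify that the process behind it is the Transcriptformer server or that the embedding data loaded correctly.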