Transcriptformer Gene Embedding Tool¶

Overview¶

The Transcriptformer tool provides access to contextualized gene embeddings (CGE) learned from single-cell RNA sequencing data. Transcriptformer uses a transformer architecture to capture cell-type-specific and disease-state-specific gene expression patterns, so gene behavior can be analyzed in the relevant biological context. In Prism ToolSpace, we precomputed Transcriptformer CGEs across five single-cell disease atlases; they can be easily accessed and retrieved via this tool.

Data Acquisition¶

1. Download Transcriptformer Embeddings¶

The Transcriptformer embeddings are hosted on Hugging Face at: https://huggingface.co/datasets/mims-harvard/ToolSpace

Use the following shell commands to download only the Transcriptformer files from the transcriptformer_cge directory:

# Install the Hugging Face CLI if it is not already installed
uvx --from huggingface_hub hf

# Download only the transcriptformer_cge folder
uvx --from huggingface_hub hf download mims-harvard/ToolSpace \
  --repo-type dataset \
  --include "transcriptformer_cge/*" \
  --local-dir ./ToolSpace/

File Structure¶

The Transcriptformer directory contains disease-specific embedding stores:

transcriptformer_cge/
├── follicular_lymphoma/
│   ├── metadata.json.gz
│   ├── b_cell_normal.npy
│   ├── b_cell_follicular_lymphoma.npy
│   ├── t_cell_normal.npy
│   ├── t_cell_follicular_lymphoma.npy
│   └── ... (other cell type × disease state combinations)
├── rheumatoid_arthritis/
├── type_1_diabetes_mellitus/
├── sjogren_syndrome/
└── hepatoblastoma/
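
The listing above suggests that each array is stored as `<cell_type>_<state>.npy` inside its disease folder. The following is a minimal sketch of resolving such a path, assuming that naming pattern holds for every combination. The helper name `embedding_path` is hypothetical and is not part of the tool.

```python
from pathlib import Path


def embedding_path(root, disease, cell_type, state):
    """Build the expected .npy path for one cell type x disease state context.

    Assumption: files are named <cell_type>_<state>.npy, as in the listing
    above. `root` is the directory that contains transcriptformer_cge/.
    """
    return Path(root) / "transcriptformer_cge" / disease / f"{cell_type}_{state}.npy"


path = embedding_path("./ToolSpace", "follicular_lymphoma", "b_cell", "normal")
print(path.as_posix())
```

The layout of `metadata.json.gz` is not described here, so the sketch does not attempt to read it.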

2. Set Environment Variable¶

After downloading, set the TRANSCRIPTFORMER_DATA_PATH environment variable:

# Set environment variable to point to your data directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
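
Before starting the tool, it can help to confirm from Python that the variable is set and points at a directory containing `transcriptformer_cge`. This is a sketch only; the function name `resolve_data_dir` is hypothetical, and the tool itself may perform its own checks differently.

```python
import os
from pathlib import Path


def resolve_data_dir(env=os.environ):
    """Return the transcriptformer_cge directory, or raise a descriptive error."""
    root = env.get("TRANSCRIPTFORMER_DATA_PATH")
    if not root:
        raise RuntimeError("TRANSCRIPTFORMER_DATA_PATH is not set")
    cge_dir = Path(root) / "transcriptformer_cge"
    if not cge_dir.is_dir():
        raise FileNotFoundError(f"Expected embeddings under {cge_dir}")
    return cge_dir


if __name__ == "__main__":
    try:
        print(resolve_data_dir())
    except (RuntimeError, FileNotFoundError) as exc:
        print(f"Data directory problem: {exc}")
```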

Tool Input and Output¶

Input Parameters¶

Parameter	Type	Required	Description
`disease`	string	Yes	Disease/dataset identifier (e.g., “follicular_lymphoma”)
`state`	string	Yes	Disease state context (“normal”, “disease_name”, etc.)
`cell_type`	string	Yes	Cell type context for embeddings
`gene_names`	List[str]	Yes	Gene identifiers (symbols or Ensembl IDs)
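
As an illustration of the parameters above, a request's arguments can be assembled as a plain dictionary. The values are taken from examples elsewhere on this page; the helper `missing_parameters` is hypothetical and only sketches a client-side check.

```python
import json

REQUIRED_PARAMETERS = ("disease", "state", "cell_type", "gene_names")


def missing_parameters(arguments):
    """List required parameters that are absent from a request.

    Key presence is checked rather than truthiness, because an empty
    gene_names list is documented as a meaningful value.
    """
    return [name for name in REQUIRED_PARAMETERS if name not in arguments]


arguments = {
    "disease": "follicular_lymphoma",
    "state": "normal",
    "cell_type": "b_cell",
    "gene_names": ["TP53", "ENSG00000139618"],
}

print(missing_parameters(arguments))
print(json.dumps(arguments, indent=2))
```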

Supported Disease Contexts¶

Available disease datasets include:

follicular_lymphoma - Follicular lymphoma vs normal tissue
rheumatoid_arthritis - Rheumatoid arthritis vs healthy controls
type_1_diabetes_mellitus - Type 1 diabetes vs normal pancreatic tissue
sjogren_syndrome - Sjögren’s syndrome vs healthy controls
hepatoblastoma - Hepatoblastoma vs normal liver tissue

Disease State Options¶

normal - Healthy/control condition
[disease_name] - Disease-affected state (matches the disease identifier)
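
The pairing of diseases and states described above can be sketched as a small lookup. It assumes each dataset exposes exactly two states, `normal` and the disease identifier itself; the function `valid_states` is hypothetical, and the actual stores may contain additional states.

```python
SUPPORTED_DISEASES = (
    "follicular_lymphoma",
    "rheumatoid_arthritis",
    "type_1_diabetes_mellitus",
    "sjogren_syndrome",
    "hepatoblastoma",
)


def valid_states(disease):
    """Return the states assumed to exist for a disease dataset."""
    if disease not in SUPPORTED_DISEASES:
        raise ValueError(f"Unknown disease identifier: {disease!r}")
    # Assumption: the disease-affected state reuses the disease identifier.
    return ("normal", disease)


print(valid_states("rheumatoid_arthritis"))
```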

Gene Identifier Formats¶

Gene symbols: ["TP53", "BRCA1", "EGFR", "MYC"]
Ensembl IDs: ["ENSG00000141510", "ENSG00000139618"]
Mixed formats: Supported in the same request
Empty list: Retrieves all available genes
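
To make the mixed-format case concrete, the sketch below separates a list into symbols and Ensembl IDs on the client side. It assumes unversioned human Ensembl gene IDs (`ENSG` followed by 11 digits), as in the examples above. It is not a description of how the tool resolves identifiers internally, and `split_gene_identifiers` is a hypothetical helper.

```python
import re

# Assumption: unversioned human Ensembl gene IDs, e.g. ENSG00000141510.
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}$")


def split_gene_identifiers(gene_names):
    """Split a mixed list into (symbols, ensembl_ids), preserving order."""
    ensembl_ids = [g for g in gene_names if ENSEMBL_GENE_ID.match(g)]
    symbols = [g for g in gene_names if not ENSEMBL_GENE_ID.match(g)]
    return symbols, ensembl_ids


symbols, ensembl_ids = split_gene_identifiers(["TP53", "ENSG00000139618", "MYC"])
print(symbols)      # ['TP53', 'MYC']
print(ensembl_ids)  # ['ENSG00000139618']
```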

Output Format¶

The tool returns a JSON object with the following structure:

Successful Response¶

{
  "embeddings": {
    "TP53": [0.1234, -0.5678, 0.9012, ...],
    "BRCA1": [-0.2345, 0.6789, -0.1234, ...],
    "EGFR": [0.3456, -0.7890, 0.2345, ...],
    "...": "..."
  },
  "context_info": [
    "Successfully retrieved 1247 gene embeddings for context: follicular_lymphoma - normal - b_cell",
    "Embedding dimensionality: 512 features per gene",
    "Disease context: follicular_lymphoma (validated and processed)"
  ]
}

Error Response¶

{
  "error": "Disease 'unknown_disease' not found in available stores",
  "context_info": [
    "Available diseases: ['follicular_lymphoma', 'rheumatoid_arthritis', 'type_1_diabetes_mellitus', 'sjogren_syndrome', 'hepatoblastoma']",
    "Please check disease identifier and ensure data is downloaded"
  ]
}
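
Because both response shapes share the `context_info` field and differ in whether `error` is present, a caller can branch on that key. The following is a minimal sketch that assumes the response arrives as a JSON string; `parse_response` is a hypothetical helper, and the sample payload is shortened for illustration.

```python
import json


def parse_response(raw):
    """Return (embeddings, context_info); raise if the tool reported an error."""
    payload = json.loads(raw)
    context = payload.get("context_info", [])
    if "error" in payload:
        details = "; ".join(context)
        raise LookupError(f"{payload['error']} ({details})")
    return payload["embeddings"], context


# Shortened sample payload; real vectors have many more values per gene.
raw = '{"embeddings": {"TP53": [0.1234, -0.5678, 0.9012]}, "context_info": ["ok"]}'
embeddings, context = parse_response(raw)
print(sorted(embeddings))
```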

Embedding Properties¶

Dimensionality: 512-dimensional vectors per gene
Format: Dense numerical vectors (list of float32 values)
Context-specific: Embeddings vary by cell type and disease state
Precision: Float32, balancing numerical accuracy against memory and storage cost
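
Since embeddings vary by context, one common way to compare the same gene across two contexts is cosine similarity. This is offered as an illustrative sketch, not as a method prescribed by the tool. The short vectors below are toy values standing in for real 512-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy stand-ins for one gene's embedding in two contexts.
normal = [1.0, 0.0, 1.0]
disease = [1.0, 1.0, 0.0]
print(round(cosine_similarity(normal, disease), 3))  # 0.5
```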

MCP Server Setup¶

Prerequisites¶

# Create a uv virtual environment
uv venv transcriptformer --python 3.10
source transcriptformer/bin/activate
uv pip install -r requirements.txt

Configuration¶

Set up the environment:

# Ensure TRANSCRIPTFORMER_DATA_PATH points to your ToolSpace directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"

Verify embedding files exist:

ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/"
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/follicular_lymphoma/"

Running the MCP Server¶

# Run the MCP server
python transcriptformer_tool.py

Server Configuration¶

Host: 0.0.0.0 (listens on all network interfaces, so it accepts connections from any IP)
Port: 7002 (configured to avoid conflicts with other tools)
Transport: streamable-http
Mode: Stateless HTTP for scalability