Transcriptformer Gene Embedding Tool

Overview

The Transcriptformer tool provides access to contextualized gene embeddings learned from single-cell RNA sequencing data. Transcriptformer uses a transformer architecture to capture cell-type-specific and disease-state-specific gene expression patterns, so gene behavior can be analyzed in the relevant biological context. In Prism ToolSpace, Transcriptformer contextualized gene embeddings (CGE) have been precomputed across five single-cell disease atlases and can be easily accessed and retrieved through this tool.

Data Acquisition

1. Download Transcriptformer Embeddings

The Transcriptformer embeddings are hosted on Hugging Face at: https://huggingface.co/datasets/mims-harvard/ToolSpace

Use the following shell commands to download only the Transcriptformer files from the transcriptformer_cge directory:

# Fetch the Hugging Face CLI through uvx if it is not already cached
uvx --from huggingface_hub hf

# Download only the transcriptformer_cge folder
uvx --from huggingface_hub hf download mims-harvard/ToolSpace \
  --repo-type dataset \
  --include "transcriptformer_cge/*" \
  --local-dir ./ToolSpace/

File Structure

The Transcriptformer directory contains disease-specific embedding stores:

transcriptformer_cge/
├── follicular_lymphoma/
│   ├── metadata.json.gz
│   ├── b_cell_normal.npy
│   ├── b_cell_follicular_lymphoma.npy
│   ├── t_cell_normal.npy
│   ├── t_cell_follicular_lymphoma.npy
│   └── ... (other cell type × disease state combinations)
├── rheumatoid_arthritis/
├── type_1_diabetes_mellitus/
├── sjogren_syndrome/
└── hepatoblastoma/
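
The listing above suggests that each embedding file is named after its cell type and disease state. As a minimal sketch under that assumption (the naming pattern `<cell_type>_<state>.npy` is inferred from the example file names, and the helper names `embedding_path` and `metadata_path` are hypothetical, not part of the tool), the expected paths could be built like this:

```python
from pathlib import Path


def embedding_path(data_root, disease, cell_type, state):
    """Build the expected .npy path for one cell type x disease state combination.

    Assumption: files are named "<cell_type>_<state>.npy", as the listing suggests.
    """
    store = Path(data_root) / "transcriptformer_cge" / disease
    return store / f"{cell_type}_{state}.npy"


def metadata_path(data_root, disease):
    """Path to the per-disease metadata file shown in the listing."""
    return Path(data_root) / "transcriptformer_cge" / disease / "metadata.json.gz"
```

For example, `embedding_path("./ToolSpace", "follicular_lymphoma", "b_cell", "normal")` points at `b_cell_normal.npy` inside the `follicular_lymphoma` store.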

2. Set Environment Variable

After downloading, set the TRANSCRIPTFORMER_DATA_PATH environment variable:

# Set environment variable to point to your data directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
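
Before starting the tool, it can help to confirm that the variable is set and points at a directory containing transcriptformer_cge. The following is a rough sketch of such a check; the function name `list_disease_stores` is hypothetical and is not something the tool itself provides:

```python
import os
from pathlib import Path


def list_disease_stores(env_var="TRANSCRIPTFORMER_DATA_PATH"):
    """Return the disease directory names found under <data path>/transcriptformer_cge."""
    data_path = os.environ.get(env_var)
    if not data_path:
        raise RuntimeError(f"{env_var} is not set")
    cge_dir = Path(data_path) / "transcriptformer_cge"
    if not cge_dir.is_dir():
        raise FileNotFoundError(f"{cge_dir} does not exist; was the download completed?")
    # Each subdirectory is treated as one disease store.
    return sorted(p.name for p in cge_dir.iterdir() if p.is_dir())
```

If the download completed, the returned names should match the five disease directories shown in the file structure.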

Tool Input and Output

Input Parameters

| Parameter  | Type      | Required | Description                                              |
|------------|-----------|----------|----------------------------------------------------------|
| disease    | string    | Yes      | Disease/dataset identifier (e.g., “follicular_lymphoma”) |
| state      | string    | Yes      | Disease state context (“normal”, “disease_name”, etc.)   |
| cell_type  | string    | Yes      | Cell type context for embeddings                         |
| gene_names | List[str] | Yes      | Gene identifiers (symbols or Ensembl IDs)                |

Supported Disease Contexts

Available disease datasets include:

  • follicular_lymphoma - Follicular lymphoma vs normal tissue

  • rheumatoid_arthritis - Rheumatoid arthritis vs healthy controls

  • type_1_diabetes_mellitus - Type 1 diabetes vs normal pancreatic tissue

  • sjogren_syndrome - Sjögren’s syndrome vs healthy controls

  • hepatoblastoma - Hepatoblastoma vs normal liver tissue

Disease State Options

  • normal - Healthy/control condition

  • [disease_name] - Disease-affected state (matches the disease identifier)

Gene Identifier Formats

  • Gene symbols: ["TP53", "BRCA1", "EGFR", "MYC"]

  • Ensembl IDs: ["ENSG00000141510", "ENSG00000139618"]

  • Mixed formats: Supported in the same request

  • Empty list: Retrieves all available genes
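
To make the input conventions above concrete, here is a small sketch that assembles the four parameters and separates Ensembl IDs from gene symbols. The helper names and the regular expression are assumptions for illustration: the pattern matches unversioned human Ensembl gene IDs such as those in the examples, and the tool's own identifier handling may differ.

```python
import re

# Assumption: human Ensembl gene IDs are "ENSG" followed by 11 digits, no version suffix.
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}$")


def split_gene_identifiers(gene_names):
    """Partition identifiers into (ensembl_ids, symbols), preserving input order."""
    ensembl_ids = [g for g in gene_names if ENSEMBL_GENE_ID.match(g)]
    symbols = [g for g in gene_names if not ENSEMBL_GENE_ID.match(g)]
    return ensembl_ids, symbols


def build_request(disease, state, cell_type, gene_names=()):
    """Assemble the four required parameters; an empty gene list requests all genes."""
    return {
        "disease": disease,
        "state": state,
        "cell_type": cell_type,
        "gene_names": list(gene_names),
    }
```

A mixed request would then look like `build_request("follicular_lymphoma", "normal", "b_cell", ["TP53", "ENSG00000139618"])`.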

Output Format

The tool returns a JSON object with the following structure:

Successful Response

{
  "embeddings": {
    "TP53": [0.1234, -0.5678, 0.9012, ...],
    "BRCA1": [-0.2345, 0.6789, -0.1234, ...],
    "EGFR": [0.3456, -0.7890, 0.2345, ...],
    "...": "..."
  },
  "context_info": [
    "Successfully retrieved 1247 gene embeddings for context: follicular_lymphoma - normal - b_cell",
    "Embedding dimensionality: 512 features per gene",
    "Disease context: follicular_lymphoma (validated and processed)"
  ]
}

Error Response

{
  "error": "Disease 'unknown_disease' not found in available stores",
  "context_info": [
    "Available diseases: ['follicular_lymphoma', 'rheumatoid_arthritis', 'type_1_diabetes_mellitus', 'sjogren_syndrome', 'hepatoblastoma']",
    "Please check disease identifier and ensure data is downloaded"
  ]
}
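
Because successful and error responses share the context_info field but differ in their top-level key, a caller may want to branch on the presence of "error". This is one possible sketch; `TranscriptformerError` and `parse_response` are hypothetical names, not part of the tool:

```python
import json


class TranscriptformerError(Exception):
    """Raised when the tool response contains an "error" field (hypothetical helper)."""


def parse_response(raw):
    """Return (embeddings, context_info) from a JSON string or an already-parsed dict."""
    response = json.loads(raw) if isinstance(raw, str) else raw
    context_info = response.get("context_info", [])
    if "error" in response:
        # Include the context hints so the caller sees the available diseases.
        details = "; ".join(context_info)
        raise TranscriptformerError(f"{response['error']} ({details})")
    return response["embeddings"], context_info
```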

Embedding Properties

  • Dimensionality: 512-dimensional vectors per gene

  • Format: Dense numerical vectors (list of float32 values)

  • Context-specific: Embeddings vary by cell type and disease state

  • Precision: Float32, which balances numerical accuracy against storage and memory cost
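
As an illustration of working with these vectors, the sketch below checks the stated dimensionality and compares two gene embeddings by cosine similarity. Cosine similarity is a common choice for comparing embeddings and is an assumption here, not something this tool prescribes. The code uses plain Python lists, matching the JSON output, and the function names are hypothetical.

```python
import math

EXPECTED_DIM = 512  # dimensionality stated above


def check_dimensions(embeddings, expected_dim=EXPECTED_DIM):
    """Return the gene names whose vectors do not have the expected length."""
    return [gene for gene, vec in embeddings.items() if len(vec) != expected_dim]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Comparing the same gene across two contexts (for example, normal versus disease state in one cell type) is one way such a similarity could be used.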

MCP Server Setup

Prerequisites

# Create a uv virtual environment
uv venv transcriptformer --python 3.10
source transcriptformer/bin/activate
uv pip install -r requirements.txt

Configuration

  1. Set up the environment:

# Ensure TRANSCRIPTFORMER_DATA_PATH points to your ToolSpace directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"

  2. Verify embedding files exist:

ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/"
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/follicular_lymphoma/"

Running the MCP Server

# Run the MCP server
python transcriptformer_tool.py

Server Configuration

  • Host: 0.0.0.0 (listens on all network interfaces, so the server accepts connections from any IP that can reach the machine)

  • Port: 7002 (configured to avoid conflicts with other tools)

  • Transport: streamable-http

  • Mode: Stateless HTTP for scalability
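
Once the server is running, a quick TCP check can confirm that something is listening on the configured port. This sketch uses only the standard library and does not speak the MCP protocol; the function name `is_port_open` and the default host 127.0.0.1 (for a server on the same machine) are assumptions for illustration.

```python
import socket


def is_port_open(host="127.0.0.1", port=7002, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

A True result only shows that the port accepts connections; it does not verify that the Transcriptformer tool is the process behind it.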