# Transcriptformer Gene Embedding Tool

## Overview

The [Transcriptformer](https://github.com/czi-ai/transcriptformer) tool provides access to contextualized gene embeddings (CGEs) learned from single-cell RNA sequencing data. Transcriptformer uses a transformer architecture to capture cell-type-specific and disease-state-specific gene expression patterns, so that gene behavior can be analyzed in the relevant biological context. In Prism ToolSpace, Transcriptformer CGEs have been precomputed across five single-cell disease atlases and can be retrieved through this tool.

## Data Acquisition

### 1. Download Transcriptformer Embeddings

The Transcriptformer embeddings are hosted on Hugging Face at: https://huggingface.co/datasets/mims-harvard/ToolSpace

Use the following shell commands to download only the Transcriptformer files from the `transcriptformer_cge` directory:

```bash
# Optional: check that the Hugging Face CLI runs (uvx fetches it on first use, so no separate install step is needed)
uvx --from huggingface_hub hf

# Download only the transcriptformer_cge folder
uvx --from huggingface_hub hf download mims-harvard/ToolSpace \
  --repo-type dataset \
  --include "transcriptformer_cge/*" \
  --local-dir ./ToolSpace/
```

### File Structure

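The `transcriptformer_cge/` directory contains one embedding store per disease. Each store holds a `metadata.json.gz` file and one `.npy` file per cell type × disease state combination: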

```
transcriptformer_cge/
├── follicular_lymphoma/
│   ├── metadata.json.gz
│   ├── b_cell_normal.npy
│   ├── b_cell_follicular_lymphoma.npy
│   ├── t_cell_normal.npy
│   ├── t_cell_follicular_lymphoma.npy
│   └── ... (other cell type × disease state combinations)
├── rheumatoid_arthritis/
├── type_1_diabetes_mellitus/
├── sjogren_syndrome/
└── hepatoblastoma/
```
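
To see which contexts were actually downloaded, the layout above can be walked from Python. The following is a minimal sketch, not part of the tool itself: the helper names `list_context_files` and `load_metadata` are hypothetical, and the schema of `metadata.json.gz` is not documented here, so the loader simply returns whatever JSON the file contains.

```python
import gzip
import json
from pathlib import Path


def list_context_files(data_root: str) -> dict[str, list[str]]:
    """Map each disease folder to the sorted stems of its .npy files."""
    cge_dir = Path(data_root) / "transcriptformer_cge"
    contexts: dict[str, list[str]] = {}
    for disease_dir in sorted(p for p in cge_dir.iterdir() if p.is_dir()):
        contexts[disease_dir.name] = sorted(f.stem for f in disease_dir.glob("*.npy"))
    return contexts


def load_metadata(data_root: str, disease: str):
    """Read metadata.json.gz for one disease (schema is an unknown here)."""
    path = Path(data_root) / "transcriptformer_cge" / disease / "metadata.json.gz"
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `list_context_files("./ToolSpace")` after the download should return one key per disease folder, each with stems such as `b_cell_normal`.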

### 2. Set Environment Variable

After downloading, set the `TRANSCRIPTFORMER_DATA_PATH` environment variable to the `ToolSpace` directory, that is, the parent of `transcriptformer_cge/`:

```bash
# Set environment variable to point to your data directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
```
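
A script that depends on this variable can fail early with a clear message when it is missing or points at the wrong directory. The sketch below assumes only what is stated above (the variable points at the parent of `transcriptformer_cge/`); the function name `resolve_cge_dir` is hypothetical and is not part of the tool.

```python
import os
from pathlib import Path


def resolve_cge_dir(env_var: str = "TRANSCRIPTFORMER_DATA_PATH") -> Path:
    """Return the transcriptformer_cge directory or raise a descriptive error."""
    root = os.environ.get(env_var)
    if not root:
        raise RuntimeError(f"{env_var} is not set")
    cge_dir = Path(root).expanduser() / "transcriptformer_cge"
    if not cge_dir.is_dir():
        raise FileNotFoundError(f"Expected embeddings under {cge_dir}")
    return cge_dir
```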

## Tool Input and Output

### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `disease` | string | Yes | Disease/dataset identifier (e.g., "follicular_lymphoma") |
| `state` | string | Yes | Disease state context: `"normal"` or the disease identifier itself (e.g., `"follicular_lymphoma"`) |
| `cell_type` | string | Yes | Cell type context for the embeddings (e.g., `"b_cell"`) |
| `gene_names` | List[str] | Yes | Gene identifiers (symbols or Ensembl IDs); an empty list retrieves all available genes |

#### Supported Disease Contexts
Available disease datasets include:
- `follicular_lymphoma` - Follicular lymphoma vs normal tissue
- `rheumatoid_arthritis` - Rheumatoid arthritis vs healthy controls
- `type_1_diabetes_mellitus` - Type 1 diabetes vs normal pancreatic tissue
- `sjogren_syndrome` - Sjögren's syndrome vs healthy controls
- `hepatoblastoma` - Hepatoblastoma vs normal liver tissue

#### Disease State Options
- `normal` - Healthy/control condition
- `[disease_name]` - Disease-affected state; pass the disease identifier itself (e.g., `follicular_lymphoma`)
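
The parameter rules above can be checked before a request is sent. This is a minimal sketch under stated assumptions: `build_arguments` is a hypothetical helper, the disease list is copied from this README, and the state check assumes the two documented options are the only valid ones, which the tool itself may not enforce.

```python
# Disease identifiers as listed in this README.
SUPPORTED_DISEASES = (
    "follicular_lymphoma",
    "rheumatoid_arthritis",
    "type_1_diabetes_mellitus",
    "sjogren_syndrome",
    "hepatoblastoma",
)


def build_arguments(disease: str, state: str, cell_type: str,
                    gene_names: list[str]) -> dict:
    """Validate the documented parameters and return them as a dict."""
    if disease not in SUPPORTED_DISEASES:
        raise ValueError(f"Unknown disease {disease!r}")
    # Assumption: state is either "normal" or the disease identifier itself.
    if state not in ("normal", disease):
        raise ValueError(f"State must be 'normal' or {disease!r}, got {state!r}")
    return {
        "disease": disease,
        "state": state,
        "cell_type": cell_type,
        "gene_names": list(gene_names),
    }
```

For example, `build_arguments("follicular_lymphoma", "normal", "b_cell", ["TP53"])` returns a dict with the four documented keys.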


#### Gene Identifier Formats
- **Gene symbols**: `["TP53", "BRCA1", "EGFR", "MYC"]`
- **Ensembl IDs**: `["ENSG00000141510", "ENSG00000139618"]`
- **Mixed formats**: Supported in the same request
- **Empty list**: Retrieves embeddings for all available genes in the selected context
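
When a request mixes both formats, it can help to know which identifiers are Ensembl IDs and which are symbols, for instance when logging or deduplicating. The sketch below is illustrative only: `split_gene_identifiers` is a hypothetical helper, and the pattern assumes unversioned human Ensembl gene IDs (`ENSG` followed by 11 digits), as in the examples above. How the tool itself distinguishes the formats is not documented here.

```python
import re

# Assumption: unversioned human Ensembl gene IDs, e.g. ENSG00000141510.
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def split_gene_identifiers(gene_names: list[str]) -> tuple[list[str], list[str]]:
    """Split identifiers into (ensembl_ids, symbols), preserving input order."""
    ensembl_ids: list[str] = []
    symbols: list[str] = []
    for name in gene_names:
        cleaned = name.strip()
        if ENSEMBL_GENE_ID.fullmatch(cleaned):
            ensembl_ids.append(cleaned)
        else:
            symbols.append(cleaned)
    return ensembl_ids, symbols
```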

### Output Format

The tool returns a JSON object with the following structure:

#### Successful Response
```json
{
  "embeddings": {
    "TP53": [0.1234, -0.5678, 0.9012, ...],
    "BRCA1": [-0.2345, 0.6789, -0.1234, ...],
    "EGFR": [0.3456, -0.7890, 0.2345, ...],
    "...": "..."
  },
  "context_info": [
    "Successfully retrieved 1247 gene embeddings for context: follicular_lymphoma - normal - b_cell",
    "Embedding dimensionality: 512 features per gene",
    "Disease context: follicular_lymphoma (validated and processed)"
  ]
}
```

#### Error Response
```json
{
  "error": "Disease 'unknown_disease' not found in available stores",
  "context_info": [
    "Available diseases: ['follicular_lymphoma', 'rheumatoid_arthritis', 'type_1_diabetes_mellitus', 'sjogren_syndrome', 'hepatoblastoma']",
    "Please check disease identifier and ensure data is downloaded"
  ]
}
```
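
Client code has to distinguish the two response shapes shown above. A minimal sketch follows, assuming the response arrives as a JSON string with either an `error` key or an `embeddings` key; the names `TranscriptformerToolError` and `parse_tool_response` are hypothetical and not part of the tool.

```python
import json


class TranscriptformerToolError(RuntimeError):
    """Raised when the response carries an "error" field."""


def parse_tool_response(raw: str) -> dict[str, list[float]]:
    """Return {gene: embedding} or raise with the server's error message."""
    payload = json.loads(raw)
    if "error" in payload:
        # context_info holds human-readable hints, so surface them too.
        hints = "; ".join(payload.get("context_info", []))
        message = payload["error"]
        raise TranscriptformerToolError(f"{message} ({hints})" if hints else message)
    return {
        gene: [float(value) for value in vector]
        for gene, vector in payload["embeddings"].items()
    }
```

Raising on the error shape keeps downstream code from silently treating a failed lookup as an empty result.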

### Embedding Properties

- **Dimensionality**: 512-dimensional vectors per gene
- **Format**: Dense numerical vectors (list of float32 values)
- **Context-specific**: Embeddings vary by cell type and disease state
- **Precision**: Float32, chosen to balance numerical accuracy against storage and memory cost
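
Because the embeddings are context-specific, one common use is to compare the same gene across two contexts, for example `normal` versus the disease state. The sketch below uses cosine similarity, which is a choice made here for illustration and is not prescribed by the tool; it relies only on the standard library and works on the float lists returned in `embeddings`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value near 1 suggests the gene's representation is similar in both contexts, while a lower value suggests a context-dependent shift. How to interpret the magnitude is left to the analysis at hand.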

## MCP Server Setup

### Prerequisites

```bash
# Create a uv virtual environment
uv venv transcriptformer --python 3.10
source transcriptformer/bin/activate
uv pip install -r requirements.txt
```

### Configuration

1. **Set up the environment**:
```bash
# Ensure TRANSCRIPTFORMER_DATA_PATH points to your ToolSpace directory
export TRANSCRIPTFORMER_DATA_PATH="/path/to/ToolSpace"
```

2. **Verify embedding files exist**:
```bash
# Quote the variable so paths containing spaces still work
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/"
ls -la "$TRANSCRIPTFORMER_DATA_PATH/transcriptformer_cge/follicular_lymphoma/"
```

### Running the MCP Server

```bash
# Run the MCP server
python transcriptformer_tool.py
```

### Server Configuration

- **Host**: `0.0.0.0` (listens on all network interfaces, so the server accepts connections from any IP)
- **Port**: `7002` (configured to avoid conflicts with other tools)
- **Transport**: `streamable-http`
- **Mode**: Stateless HTTP for scalability
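
After starting the server, a quick way to confirm it is listening is to open a TCP connection to the configured port. This sketch checks reachability only and does not speak the MCP protocol; the helper name `server_is_reachable` is hypothetical, and `127.0.0.1` assumes the check runs on the same machine as the server.

```python
import socket


def server_is_reachable(host: str = "127.0.0.1", port: int = 7002,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A `False` result usually means the server is not running, is bound to a different port, or is blocked by a firewall.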