Multimodal Medical Code Tokenizer

Medical foundation models rely on tokenizing electronic health records (EHRs) into discrete sequences, but existing tokenizers treat medical codes as isolated text tokens, ignoring their structured relationships. With over 600,000 codes spanning diagnoses, treatments, and medications, capturing these dependencies is essential for accurate clinical reasoning.

We introduce MedTok, a multimodal medical code tokenizer that integrates both text descriptions and relational context. MedTok encodes medical codes using a language model for textual meaning and a graph encoder for hierarchical structure, unifying both into a shared token space. This approach ensures that medical codes are represented with richer, more context-aware embeddings.

Replacing standard tokenizers with MedTok significantly improves predictive performance across five EHR models, increasing AUPRC by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with the largest gains in drug recommendation. Beyond EHR modeling, MedTok enhances medical QA systems, demonstrating its potential as a unified tokenizer for medical foundation models.

Motivation

Medical codes are the foundation of structured electronic health records (EHRs), encoding diagnoses, treatments, medications, and procedures in a standardized format. However, existing tokenization strategies treat these codes as isolated text tokens, disregarding their hierarchical relationships, co-occurrences, and domain-specific dependencies. With over 600,000 unique codes spanning multiple coding systems like ICD-10, SNOMED CT, and ATC, conventional tokenization leads to inefficient vocabulary expansion, loss of relational context, and fragmented representations that hinder medical AI models. These limitations affect the performance of foundation models trained on EHR data, limiting their ability to make accurate predictions, generate meaningful representations, and generalize across clinical tasks.

MedTok addresses this challenge by introducing a multimodal tokenizer that integrates both textual descriptions and relational knowledge from medical ontologies. Unlike traditional approaches that rely solely on text-based embeddings, MedTok combines language models with graph-based encodings to preserve the structure and semantics of medical codes. This unified tokenization framework enables foundation models to leverage both linguistic and ontological context, leading to more expressive token representations.

Overview of MedTok

MedTok is a multimodal tokenizer designed to transform medical codes into structured token representations that preserve both textual semantics and hierarchical relationships. It achieves this by combining a text encoder, which processes the natural language descriptions of medical codes, with a graph encoder, which captures the dependencies and structured relations among codes. By jointly optimizing these two modalities, MedTok generates token embeddings that are both semantically rich and context-aware, allowing AI models to better interpret and utilize medical data.

MedTok operates through a three-step process:

  1. First, it encodes the textual descriptions of medical codes using a pre-trained language model, extracting meaningful embeddings.
  2. Next, it integrates graph-based representations from biomedical knowledge graphs, capturing relationships such as disease co-occurrence, drug interactions, and procedural hierarchies.
  3. Finally, it applies vector quantization to map both modalities into a shared token space, ensuring that information is retained in a format compatible with transformer-based models.

This process allows MedTok to serve as a drop-in replacement for existing tokenizers, improving model performance across a wide range of EHR-based applications.

MedTok tokenizer integrated with in-patient EHR transformer models

In the inpatient setting, MedTok is integrated with predictive models trained on ICU patient records from MIMIC-III and MIMIC-IV datasets. These datasets contain structured medical codes representing diagnoses, treatments, and lab results, which are critical for predicting clinical outcomes such as mortality risk, readmission likelihood, and length of stay. Standard tokenization approaches fail to capture dependencies between medical codes, leading to suboptimal performance in these predictive tasks.

By replacing traditional tokenization methods with MedTok, we observe significant improvements across all clinical tasks. MedTok enhances AUPRC scores for mortality prediction, readmission prediction, and phenotype classification by up to 4.10% in MIMIC-III and 4.78% in MIMIC-IV. The largest gains are seen in drug recommendation tasks, where capturing the relationships between medications and diagnoses is crucial. These results demonstrate that MedTok enables more structured and context-aware tokenization, allowing transformer-based models to better understand inpatient EHR data.

MedTok tokenizer integrated with out-patient EHR transformer models

MedTok is also evaluated in outpatient settings using the EHRShot dataset, which includes longitudinal patient records spanning multiple visits across different healthcare facilities. Unlike inpatient datasets, outpatient records contain more diverse sequences of medical events, making it essential to capture long-term dependencies between medical codes. Standard tokenization methods often struggle with this variability, leading to inconsistencies in predictive performance.

Integrating MedTok into outpatient models results in substantial improvements in both operational and clinical tasks. The most significant impact is observed in mortality prediction, where MedTok increases AUPRC by 11.30%, demonstrating its ability to better capture disease trajectories over time. MedTok also improves new disease diagnosis assignments, particularly in detecting conditions such as acute myocardial infarction (8.80% improvement).

MedTok and alternative tokenization strategies

Standard tokenization methods in EHR modeling either treat medical codes as isolated text tokens or rely on manually curated embeddings that lack adaptability. Traditional text-based tokenizers, such as BERT-based models, struggle with the sheer scale of medical vocabularies and fail to capture relational dependencies between codes. Graph-based tokenizers offer some advantages by incorporating structured relationships but often lose important textual semantics in the process.

MedTok bridges this gap by integrating both modalities, ensuring that token representations retain the advantages of text-based semantics while embedding relational structures from medical ontologies. Compared to alternative tokenization strategies, MedTok consistently outperforms both text-only and graph-only approaches. Experimental results show that MedTok achieves higher AUPRC scores across inpatient and outpatient tasks, outperforming text-based tokenizers by up to 5.24% and graph-based tokenizers by up to 6.00%.

Benefits of multimodal tokenization

Ablation studies confirm the importance of both modalities in MedTok’s design. Removing the graph-based component leads to a decline in performance on drug recommendation and disease co-occurrence tasks, while removing the text-based component degrades performance in readmission prediction and phenotype classification.

MedTok is a general purpose tokenizer: Medical QA systems

Beyond EHR-based tasks, MedTok demonstrates its versatility by improving performance in medical question-answering (QA) systems. Many medical QA models rely on large language models trained on free-text clinical data but lack structured knowledge from medical coding systems.

We evaluate MedTok’s impact on the MedDDx dataset, which contains multiple-choice medical questions at different difficulty levels. Integrating MedTok with an LLM (LLaMA3.1-8B) leads to a 2.3% accuracy improvement on expert-level medical questions, demonstrating its ability to provide structured context for clinical reasoning.

Publication

Multimodal Medical Code Tokenizer
Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
In Review 2025 [arXiv]

@article{su2025multimodal,
  title={Multimodal Medical Code Tokenizer},
  author={Su, Xiaorui and Messica, Shvat and Huang, Yepeng and Johnson, Ruth and Fesser, Lukas and Gao, Shanghua and Sahneh, Faryad and Zitnik, Marinka},
  journal={arXiv:2502.04397},
  year={2025}
}

Authors

Latest News

Feb 2025:   MedTok: Unlocking Medical Codes for GenAI

Meet MedTok, a multimodal medical code tokenizer that transforms how AI understands structured medical data. By integrating textual descriptions and relational contexts, MedTok enhances tokenization for transformer-based models—powering everything from EHR foundation models to medical QA. [Project website]

Feb 2025:   What If You Could Rewrite Biology? Meet CLEF

What if we could anticipate molecular and medical changes before they happen? Introducing CLEF, an approach for counterfactual generation in biological and medical sequence models. [Project website]

Feb 2025:   Digital Twins as Global Health and Disease Models

Jan 2025:   LLM and KG+LLM agent papers at ICLR

Jan 2025:   Artificial Intelligence in Medicine 2

Excited to share our new graduate course on Artificial Intelligence in Medicine 2.

Jan 2025:   ProCyon AI Highlighted by Kempner

Thanks to Kempner Institute for highlighting our latest research, ProCyon, our protein-text foundation model for modeling protein functions.

Jan 2025:   AI Design of Proteins for Therapeutics

Dec 2024:   Unified Clinical Vocabulary Embeddings

New paper: A unified resource provides a new representation of clinical knowledge by unifying medical vocabularies. (1) Phenotype risk score analysis across 4.57 million patients, (2) Inter-institutional clinician panels evaluate alignment with clinical knowledge across 90 diseases and 3,000 clinical codes.

Dec 2024:   SPECTRA in Nature Machine Intelligence

Are biomedical AI models truly as smart as they seem? SPECTRA is a framework that evaluates models by considering the full spectrum of cross-split overlap: train-test similarity. SPECTRA reveals gaps in benchmarks for molecular sequence data across 19 models, including LLMs, GNNs, diffusion models, and conv nets.

Nov 2024:   Ayush Noori Selected as a Rhodes Scholar

Congratulations to Ayush Noori on being named a Rhodes Scholar! Such an incredible achievement!

Nov 2024:   PocketGen in Nature Machine Intelligence

Oct 2024:   Activity Cliffs in Molecular Properties

Oct 2024:   Knowledge Graph Agent for Medical Reasoning

Sep 2024:   Three Papers Accepted to NeurIPS

Exciting projects include a unified multi-task time series model, a flow-matching approach for generating protein pockets using geometric priors, and a tokenization method that produces invariant molecular representations for integration into large language models.

Sep 2024:   TxGNN Published in Nature Medicine

Aug 2024:   Graph AI in Medicine

Excited to share a new perspective on Graph Artificial Intelligence in Medicine in Annual Reviews.

Aug 2024:   How Proteins Behave in Context

Harvard Medicine News on our new AI tool that captures how proteins behave in context. Kempner Institute on how context matters for foundation models in biology.

Jul 2024:   PINNACLE in Nature Methods

PINNACLE contextual AI model is published in Nature Methods. Paper. Research Briefing. Project website.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics