Datasets
COMPASS Immunotherapy Datasets
A Foundation Model for Predicting Immunotherapy Outcomes Across Cancers and Treatments
COMPASS is a foundation model for predicting immunotherapy response from pan-cancer transcriptomic data using a concept bottleneck architecture.
ProCyon-Instruct
A Foundation Model for Protein Phenotypes
ProCyon is a groundbreaking foundation model for modeling, generating, and predicting protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. To train ProCyon, we created ProCyon-Instruct, a dataset of 33 million protein phenotype instructions, representing a comprehensive resource for multiscale protein phenotypes.
ClinGraph and ClinVec - Unified Clinical Vocabulary Embeddings
Unified Embeddings of Clinical Codes Enable Knowledge-Grounded AI in Medicine
Integrating structured clinical knowledge into artificial intelligence (AI) models remains a major challenge. Medical codes primarily reflect administrative workflows rather than clinical reasoning, limiting AI models’ ability to capture true clinical relationships and undermining their generalizability.
To address this, we introduce ClinGraph, a clinical knowledge graph that integrates eight EHR-based vocabularies, and ClinVec, a set of 153,166 clinical code embeddings derived from ClinGraph using a graph transformer neural network. ClinVec provides a machine-readable representation of clinical knowledge that captures semantic relationships among diagnoses, medications, laboratory tests, and procedures. Panels of clinicians from multiple institutions evaluated the embeddings across 96 diseases and more than 3,000 clinical codes, confirming their alignment with expert knowledge.
In a retrospective analysis of 4.57 million patients from Clalit Health Services, we show that ClinVec supports phenotype risk scoring and stratifies individuals by survival outcomes. We further demonstrate that injecting ClinVec into large language models improves performance on medical question answering, including for region-specific clinical scenarios. ClinVec enables structured clinical knowledge to be injected into predictive and generative AI models, bridging the gap between EHR codes and clinical reasoning.
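As a minimal sketch of how code embeddings of this kind are commonly compared, the snippet below ranks clinical codes by cosine similarity. The three-dimensional vectors and the code labels are invented toy values, not actual ClinVec embeddings, and the `nearest` helper is hypothetical.

```python
import math

# Toy embeddings keyed by made-up code labels (NOT real ClinVec vectors).
toy_embeddings = {
    "dx:type_2_diabetes": [0.9, 0.1, 0.0],
    "rx:metformin": [0.8, 0.2, 0.1],
    "px:knee_replacement": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, embeddings):
    """Return the other codes sorted by similarity to `query`, most similar first."""
    q = embeddings[query]
    scored = [
        (code, cosine(q, vec))
        for code, vec in embeddings.items()
        if code != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = nearest("dx:type_2_diabetes", toy_embeddings)
print(ranking[0][0])
```

With real embeddings, the same ranking step would operate on the released vectors instead of the toy dictionary.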
PrimeKG
Precision Medicine Oriented Knowledge Graph
PrimeKG is a precision medicine-oriented knowledge graph that provides a holistic view of diseases. It integrates 20 high-quality resources to describe 17,080 diseases with 5,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scale, and the entire range of approved and experimental drugs with their therapeutic action.
PrimeKG supports drug-disease prediction by including an abundance of ‘indications’, ‘contraindications’, and ‘off-label use’ edges, which are usually missing in other knowledge graphs. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multi-modal analyses.
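The following sketch shows one way to pull drug-disease edges out of a knowledge-graph edge list. The column names (`relation`, `x_name`, `y_name`), the relation strings, and the four toy rows are assumptions made for illustration; check the released files for the actual schema.

```python
import csv
import io
from collections import defaultdict

# Toy edge list; column names and rows are illustrative assumptions.
toy_csv = """relation,x_name,y_name
indication,drug_a,disease_1
contraindication,drug_a,disease_2
off-label use,drug_b,disease_1
protein_protein,protein_x,protein_y
"""

# Relation labels assumed to mark drug-disease edges.
DRUG_DISEASE_RELATIONS = {"indication", "contraindication", "off-label use"}


def drug_disease_edges(handle):
    """Group (drug, disease) pairs by relation, skipping all other edge types."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        if row["relation"] in DRUG_DISEASE_RELATIONS:
            grouped[row["relation"]].append((row["x_name"], row["y_name"]))
    return dict(grouped)


edges = drug_disease_edges(io.StringIO(toy_csv))
print(edges)
```

Replacing `io.StringIO(toy_csv)` with an open file handle would apply the same filter to a full edge list.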
GraphXAI
Evaluating Explainability for Graph Neural Networks
GraphXAI is a resource to systematically evaluate and benchmark the quality of GNN explanations. A key component is a novel and flexible synthetic dataset generator called ShapeGGen that automatically generates a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) together with ground-truth explanations that address all known pitfalls of explainability methods.
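One simple way to score an explanation against a ground-truth explanation is the Jaccard index over the edges each one marks as important. The sketch below assumes explanations are given as lists of undirected edges; the `jaccard` helper and the toy edges are illustrative and are not the GraphXAI API.

```python
def normalize(edges):
    """Treat edges as undirected by sorting each endpoint pair."""
    return {tuple(sorted(edge)) for edge in edges}


def jaccard(predicted, ground_truth):
    """Jaccard index between predicted and ground-truth explanation edges."""
    pred, truth = normalize(predicted), normalize(ground_truth)
    union = pred | truth
    if not union:
        return 1.0  # two empty explanations agree trivially
    return len(pred & truth) / len(union)


# Toy example: the ground truth is a triangle; the prediction recovers two of its edges.
ground_truth = [(0, 1), (1, 2), (2, 0)]
predicted = [(1, 0), (2, 1), (2, 3)]
score = jaccard(predicted, ground_truth)
print(score)
```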
Physical Activity Monitoring Dataset
Dataset for Irregular Time Series Research
We are developing representation learning techniques for complex time series datasets. In the Raindrop study (ICLR’22), we introduced a graph-guided network for irregularly sampled multivariate time series. The study includes a processed sensor dataset recording daily living activities of individuals.
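To make "irregularly sampled multivariate time series" concrete, the sketch below stores readings as (time, sensor, value) observations and converts them into a value grid with an observation mask. This is a common generic representation, not the Raindrop model or the dataset's actual file format; the sensor names and readings are invented.

```python
# Each sensor reports at its own times, so readings do not line up on a shared clock.
observations = [
    (0.0, "accel", 0.12),
    (0.7, "gyro", 1.50),
    (1.1, "accel", 0.10),
    (2.4, "gyro", 1.42),
    (2.4, "accel", 0.31),
]


def to_masked_grid(obs):
    """Build a (time x sensor) value grid plus a mask marking observed cells."""
    times = sorted({t for t, _, _ in obs})
    sensors = sorted({s for _, s, _ in obs})
    lookup = {(t, s): v for t, s, v in obs}
    # Missing cells get a placeholder 0.0; the mask records that they were not observed.
    values = [[lookup.get((t, s), 0.0) for s in sensors] for t in times]
    mask = [[(t, s) in lookup for s in sensors] for t in times]
    return times, sensors, values, mask


times, sensors, values, mask = to_masked_grid(observations)
for t, row in zip(times, mask):
    print(t, row)
```

The mask matters because a placeholder value alone cannot distinguish "sensor read zero" from "sensor did not report".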
Population-Scale Patient Safety Dataset
Adverse Events of Medications across Patient Groups and the Entire Range of Human Diseases and Approved Drugs
We present a comprehensive catalog of 10,443,476 adverse event reports (involving 19,193 adverse events and 3,624 drugs) from the U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS), collected from January 2013 to September 2020. The new resource can help discover relationships between drugs and safety events, especially in cases of rare events and effects within population subgroups that differ in their risks of specific clinical outcomes and are disproportionately affected by preventable inequities.
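As an illustration of how drug-event relationships can be screened in spontaneous report data, the sketch below computes the proportional reporting ratio (PRR), a standard disproportionality measure in pharmacovigilance. It is not necessarily the analysis used for this resource, and the counts are invented.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: reports mentioning the drug and the event
    b: reports mentioning the drug without the event
    c: reports mentioning the event without the drug
    d: reports mentioning neither
    """
    rate_with_drug = a / (a + b)
    rate_without_drug = c / (c + d)
    return rate_with_drug / rate_without_drug


# Invented counts: the event appears in 20% of the drug's reports
# and in about 1% of all other reports.
value = prr(a=20, b=80, c=100, d=9800)
print(round(value, 2))
```

A PRR well above 1 flags a drug-event pair for closer review; it does not by itself establish causation.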
Subgraph Datasets
Datasets for Subgraph Representation Learning Research
We design novel synthetic and real-world social and biological datasets, consisting of underlying base graphs and many labeled subgraphs. These datasets are ready to be used for benchmarking, systematic model evaluation and comparison.
Therapeutics Data Commons
Machine Learning Datasets and Tasks for Drug Discovery and Development
TDC is the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics.
At its core, TDC is a collection of AI/ML-ready datasets and learning tasks to serve as a meeting point for domain and ML scientists. TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles. All datasets and learning tasks are integrated and accessible via an open-source library.
BioSNAP
Stanford Biomedical Network Dataset Collection
BioSNAP is a collection of diverse biomedical networks, including protein-protein interaction networks, single-cell similarity networks, and drug-drug interaction networks.
BioSNAP datasets contain metadata on graphs and node features, and can be easily linked to external repositories of biological knowledge.
Fair Graph Datasets
Graph datasets comprising high-stakes decisions in criminal justice and financial lending domains
The graph datasets capture critical decisions in the criminal justice (whether a defendant should be released on bail) and financial lending (whether an individual should be given a loan) domains. These attributed graphs contain sensitive/protected attributes, which makes them suitable for studying algorithmic fairness.
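As a hedged example of the kind of fairness analysis that sensitive attributes make possible, the sketch below computes the statistical parity difference between two groups of nodes. The labels and the binary sensitive attribute are invented, and the metric is one common choice rather than one prescribed by these datasets.

```python
def statistical_parity_difference(labels, sensitive):
    """P(label = 1 | sensitive = 1) minus P(label = 1 | sensitive = 0)."""
    group_1 = [y for y, s in zip(labels, sensitive) if s == 1]
    group_0 = [y for y, s in zip(labels, sensitive) if s == 0]
    return sum(group_1) / len(group_1) - sum(group_0) / len(group_0)


# Invented node-level data: 1 = favorable decision (e.g. loan granted).
labels = [1, 1, 0, 1, 0, 0, 1, 0]
sensitive = [1, 1, 1, 1, 0, 0, 0, 0]
gap = statistical_parity_difference(labels, sensitive)
print(gap)
```

A gap of zero means both groups receive the favorable decision at the same rate; the same function can be applied to model predictions instead of labels.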
OGB
The Open Graph Benchmark
OGB is a collection of benchmark datasets, data loaders, and evaluators for graph machine learning. Datasets cover a variety of graph machine learning tasks and real-world applications.
The OGB data loaders are fully compatible with popular graph deep learning frameworks, including PyTorch Geometric and DGL. They provide automatic dataset downloading, standardized dataset splits, and unified performance evaluation.
Multimodal cancer network
Multimodal network centered on genes frequently mutated in cancer patients
The multimodal cancer network integrates information on chemicals, diseases, molecular functions, genes, and proteins.
The dataset has 21 types of biologically meaningful associations (edge types): chemical-chemical, chemical-protein, disease-chemical, disease-disease, disease-function, disease-gene, function-function, gene-gene (split into 6 edge types by interaction type), gene-protein, protein-function, and protein-protein interactions.
The network has 20 K nodes and 3.4 M edges.
Giga-scale biological network
The giga-scale biological network is one of the largest networks ever constructed in biology. The network integrates protein and genetic interaction data from more than two thousand species.
The network has 10 M nodes and 2.3 B edges.
Tree of life
Protein interactomes across the tree of life
The dataset contains protein interactomes from 1,840 species across the tree of life. It includes rich metadata about proteins, including their homology relationships.
The dataset also contains metadata about species, including taxonomy of species, phylogenetic relationships, and ecological information on environments and habitats in which species live.
Polypharmacy network
Network of drugs, proteins, and side effects
The polypharmacy network is a highly multi-relational network, consisting of protein-protein interactions, drug-protein targets, and drug-drug interactions encoded by polypharmacy side effects.
The network has 20 K nodes and 5 M edges, which are split into 1 K distinct edge types.
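One way to picture drug-drug edges split into many edge types is as (drug, side effect, drug) triples indexed by side effect. The sketch below assumes that representation; the drug names, side effects, and the `index_by_edge_type` helper are made up for illustration and do not reflect the dataset's file format.

```python
from collections import defaultdict

# Toy triples: each side effect acts as its own edge type between a pair of drugs.
triples = [
    ("drug_a", "nausea", "drug_b"),
    ("drug_a", "headache", "drug_b"),
    ("drug_b", "nausea", "drug_c"),
]


def index_by_edge_type(triples):
    """Map each side effect to the set of unordered drug pairs it connects."""
    by_type = defaultdict(set)
    for left, side_effect, right in triples:
        by_type[side_effect].add(frozenset((left, right)))
    return by_type


index = index_by_edge_type(triples)
for side_effect, pairs in sorted(index.items()):
    print(side_effect, len(pairs))
```

The same drug pair can appear under several edge types, which is what makes the network multi-relational.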
Tissue-specific protein dataset
The dataset contains protein-protein interaction networks specific to 107 human tissues, a tissue hierarchy of anatomical relationships between tissues, and tissue-specific gene-function annotations.
Human knowledge network
The human knowledge network contains interactions between proteins, diseases, biological processes, side effects, and drugs.
The network has 98 K nodes and 8 M edges, which are split into 42 distinct types of biologically relevant molecular interactions.
Latest News
May 2025: COMPASS: Immunotherapy Outcome Prediction
Apr 2025: ATOMICA - A Universal Model of Molecular Interactions
Mar 2025: On Biomedical AI in Harvard Gazette
Read about AI in medicine in the latest Harvard Gazette and New York Times.
Mar 2025: TxAgent: AI Agent for Therapeutic Reasoning
TxAgent is an AI agent for therapeutic reasoning that consolidates 211 tools from trusted sources, including all US FDA-approved drugs since 1939 and validated clinical insights. [Project website] [TxAgent] [ToolUniverse]
Mar 2025: Multimodal AI predicts clinical outcomes of drug combinations from preclinical data
Mar 2025: KGARevion: AI Agent for Knowledge-Intensive Biomedical QA
KGARevion is an AI agent designed for complex biomedical QA that integrates the non-codified knowledge of LLMs with the structured, codified knowledge found in knowledge graphs. [ICLR 2025 publication]
Feb 2025: MedTok: Unlocking Medical Codes for GenAI
Meet MedTok, a multimodal medical code tokenizer that transforms how AI understands structured medical data. By integrating textual descriptions and relational contexts, MedTok enhances tokenization for transformer-based models—powering everything from EHR foundation models to medical QA. [Project website]
Feb 2025: What If You Could Rewrite Biology? Meet CLEF
What if we could anticipate molecular and medical changes before they happen? Introducing CLEF, an approach for counterfactual generation in biological and medical sequence models. [Project website]
Feb 2025: Digital Twins as Global Health and Disease Models
New paper on the role of digital twins as global health and disease learning models for preventive and personalized medicine.
Jan 2025: LLM and KG+LLM agent papers at ICLR
New papers on test-time interventions in language models and knowledge graph based LLM agents accepted to ICLR. [KGARevion]
Jan 2025: Artificial Intelligence in Medicine 2
Excited to share our new graduate course on Artificial Intelligence in Medicine 2.
Jan 2025: ProCyon AI Highlighted by Kempner
Thanks to Kempner Institute for highlighting our latest research, ProCyon, our protein-text foundation model for modeling protein functions.
Jan 2025: AI Design of Proteins for Therapeutics
New Voices piece in Cell Systems: How will computational protein design change biotechnology and therapeutic development?
Dec 2024: Foundation Model for Protein Phenotypes
Dec 2024: Unified Clinical Vocabulary Embeddings
New paper: A unified resource provides a new representation of clinical knowledge by unifying medical vocabularies. (1) Phenotype risk score analysis across 4.57 million patients, (2) Inter-institutional clinician panels evaluate alignment with clinical knowledge across 90 diseases and 3,000 clinical codes.
Dec 2024: SPECTRA in Nature Machine Intelligence
Are biomedical AI models truly as smart as they seem? SPECTRA is a framework that evaluates models by considering the full spectrum of cross-split overlap: train-test similarity. SPECTRA reveals gaps in benchmarks for molecular sequence data across 19 models, including LLMs, GNNs, diffusion models, and conv nets.
Nov 2024: Ayush Noori Selected as a Rhodes Scholar
Congratulations to Ayush Noori on being named a Rhodes Scholar! Such an incredible achievement!
Nov 2024: PocketGen in Nature Machine Intelligence
Nov 2024: Biomedical AI Agents in Cell