Evaluating Generalizability of Molecular AI Models

Deep learning has made rapid advances in modelling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata- or sequence similarity-based train and test splits of input data before assessing model performance.

Here we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, that is, similarity between train and test splits. We introduce SPECTRA, the spectral framework for model evaluation. Given a model and a dataset, SPECTRA plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability.
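As a rough illustration of the summary statistic described above, the sketch below integrates a performance-versus-overlap curve with the trapezoidal rule. It is not the SPECTRA implementation: the function name `area_under_spectral_curve`, the choice of the trapezoidal rule, and the example numbers are all assumptions made for illustration.

```python
from typing import Sequence


def area_under_spectral_curve(
    overlaps: Sequence[float], scores: Sequence[float]
) -> float:
    """Trapezoidal area under a performance vs. cross-split overlap curve.

    `overlaps` are cross-split overlap values (assumed here to lie in [0, 1]);
    `scores` are the model's performance at each overlap level.
    Hypothetical helper, not part of the SPECTRA codebase.
    """
    if len(overlaps) != len(scores):
        raise ValueError("overlaps and scores must have the same length")
    if len(overlaps) < 2:
        raise ValueError("at least two points are needed to form a curve")

    # Sort by overlap so the integration runs left to right,
    # regardless of the order in which the splits were evaluated.
    points = sorted(zip(overlaps, scores))

    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up numbers: performance falls as cross-split overlap decreases.
example_overlaps = [1.0, 0.75, 0.5, 0.25, 0.0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
example_area = area_under_spectral_curve(example_overlaps, example_scores)
```

Under this sketch, a model whose performance holds steady as overlap shrinks yields a larger area than one whose performance collapses, which is the intuition behind using the area as a generalizability measure.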

We use SPECTRA with 18 sequencing datasets and phenotypes ranging from antibiotic resistance in tuberculosis to protein–ligand binding and evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models and convolutional neural networks. We show that sequence similarity- and metadata-based splits provide an incomplete assessment of model generalizability.
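To make the notion of cross-split overlap concrete, here is a minimal sketch that scores a train/test split by the fraction of test sequences that have at least one similar training sequence. This definition, the position-wise identity measure, the 0.8 threshold, and the names `hamming_identity` and `cross_split_overlap` are illustrative assumptions; the text above only describes overlap as similarity between train and test splits.

```python
from typing import Callable, Sequence


def hamming_identity(a: str, b: str) -> float:
    """Fraction of matching positions; unequal-length or empty inputs score 0."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cross_split_overlap(
    train: Sequence[str],
    test: Sequence[str],
    similarity: Callable[[str, str], float] = hamming_identity,
    threshold: float = 0.8,
) -> float:
    """Fraction of test sequences similar to at least one training sequence.

    Hypothetical definition for illustration: a test sequence counts as
    overlapping when its similarity to any training sequence meets `threshold`.
    """
    if not test:
        raise ValueError("test split must not be empty")
    hits = sum(
        any(similarity(t, s) >= threshold for s in train) for t in test
    )
    return hits / len(test)


# Toy sequences: the first test sequence differs from a training
# sequence at one position; the second resembles neither.
toy_train = ["ACGTACGT", "TTTTCCCC"]
toy_test = ["ACGTACGA", "GGGGAAAA"]
toy_overlap = cross_split_overlap(toy_train, toy_test)
```

A split built only at one similarity threshold or one metadata boundary corresponds to a single value of this quantity, which is why evaluating across a range of overlap values gives a fuller picture.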

Using SPECTRA, we find that as cross-split overlap decreases, deep learning models consistently show reduced performance, varying by task and model. Although no model consistently achieved the highest performance across all tasks, deep learning models can, in some cases, generalize to previously unseen sequences on specific tasks. SPECTRA advances our understanding of how foundation models generalize in biological applications.

Publication

Evaluating Generalizability of Artificial Intelligence Models for Molecular Datasets
Yasha Ektefaie, Andrew Shen, Daria Bykova, Maximillian Marin, Marinka Zitnik* and Maha Farhat*
Nature Machine Intelligence 2024 [bioRxiv]

@article{ektefaie2024evaluating,
  title={Evaluating Generalizability of Artificial Intelligence Models for Molecular Datasets},
  author={Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin, Maximillian and Zitnik, Marinka* and Farhat, Maha*},
  journal={Nature Machine Intelligence},
  url={https://rdcu.be/d2D0z},
  year={2024}
}

Code Availability

A PyTorch implementation of SPECTRA is available in the GitHub repository.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics