Evaluating Generalizability of Molecular AI Models

Deep learning has made rapid advances in modelling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata- or sequence similarity-based train and test splits of input data before assessing model performance.

Here we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, that is, similarity between train and test splits. We introduce SPECTRA, the spectral framework for model evaluation. Given a model and a dataset, SPECTRA plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability.
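The area-under-the-curve summary described above can be sketched in a few lines. This is a minimal illustration, not the official SPECTRA implementation: the spectral parameter values and performance numbers below are hypothetical, and the curve is integrated with a simple trapezoidal rule.

```python
import numpy as np

# Hypothetical example: spectral parameter values controlling cross-split
# overlap (0 = full overlap permitted, 1 = no overlap) and the model's
# performance (e.g., AUROC) measured on each corresponding split.
spectral_params = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
performance = np.array([0.92, 0.85, 0.74, 0.61, 0.50])

# Area under the spectral performance curve via the trapezoidal rule,
# normalized by the parameter range so values are comparable across models.
widths = np.diff(spectral_params)
auspc = float(np.sum((performance[:-1] + performance[1:]) / 2 * widths))
auspc /= spectral_params[-1] - spectral_params[0]
print(round(auspc, 4))  # area under the curve for this toy example
```

A model whose performance stays flat as overlap decreases yields a larger area, indicating stronger generalizability; a steep decline shrinks the area.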

We use SPECTRA with 18 sequencing datasets and phenotypes ranging from antibiotic resistance in tuberculosis to protein–ligand binding and evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models and convolutional neural networks. We show that sequence similarity- and metadata-based splits provide an incomplete assessment of model generalizability.

Using SPECTRA, we find that as cross-split overlap decreases, deep learning models consistently show reduced performance, with the extent of the decline varying by task and model. Although no model consistently achieved the highest performance across all tasks, deep learning models can, in some cases, generalize to previously unseen sequences on specific tasks. SPECTRA advances our understanding of how foundation models generalize in biological applications.

Publication

Evaluating Generalizability of Artificial Intelligence Models for Molecular Datasets
Yasha Ektefaie, Andrew Shen, Daria Bykova, Maximillian Marin, Marinka Zitnik* and Maha Farhat*
Nature Machine Intelligence 2024 [bioRxiv]

@article{ektefaie2024evaluating,
  title={Evaluating Generalizability of Artificial Intelligence Models for Molecular Datasets},
  author={Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin, Maximillian and Zitnik, Marinka and Farhat, Maha},
  journal={Nature Machine Intelligence},
  url={https://rdcu.be/d2D0z},
  year={2024}
}

Code Availability

A PyTorch implementation of SPECTRA is available in the GitHub repository.

Latest News

Oct 2025:   A Scientist's Guide to AI Agents in Nature

A piece on AI agents in Nature highlights ongoing projects in our group, including methods for evaluating scientific hypotheses, challenges in benchmarking AI agents, and the open ToolUniverse ecosystem.

Sep 2025:   ToolUniverse: AI Agents for Science and Medicine

New paper: ToolUniverse introduces an open ecosystem for building AI scientists with 600+ scientific and biomedical tools. Build your AI co-scientists at https://aiscientist.tools.

Sep 2025:   Democratizing "AI Scientists" with ToolUniverse

Our new initiative: Use ToolUniverse to build an AI scientist for yourself from any language or reasoning model, whether open or closed. https://aiscientist.tools

Sep 2025:   InfEHR in Nature Communications

Collaboration with Ben and Girish on clinical phenotype resolution through deep geometric learning on electronic health records published in Nature Communications.

Sep 2025:   PDGrapher in Nature Biomedical Engineering

New paper in Nature Biomedical Engineering introducing PDGrapher, a model for phenotype-based target discovery. [Harvard Medicine News]

Sep 2025:   AI and Net Medicine: Path to Precision Medicine

Aug 2025:   CUREBench - Reasoning for Therapeutics

Update from CUREBench: 650+ entrants, 100+ teams and 500+ submissions. Thank you to the CUREBench community. Working on AI for drug discovery and reasoning in medicine? New teams welcome. Tasks, rules, and leaderboard: https://curebench.ai.

Aug 2025:   Drug Discovery Workshop at NeurIPS 2025

Excited to organize a NeurIPS workshop on Virtual Cells and Digital Instruments. Submit your papers.

Aug 2025:   AI for Science Workshop at NeurIPS

Excited to organize a NeurIPS workshop on AI for Science. This is our 6th workshop in the AI for Science series. Submit your papers.

Jul 2025:   Launching CUREBench

Launched CUREBench, the first competition in AI reasoning for therapeutics. Colocated with NeurIPS 2025. Start at https://curebench.ai.

Jul 2025:   Launching TxAgent Evaluation Portal

Launched TxAgent evaluation portal, our global evaluation of AI for drug decision-making and therapeutic reasoning. Participate in TxAgent evaluations! [TxAgent project]

Jul 2025:   SPATIA Model of Spatial Cell Phenotypes

Jul 2025:   AI-Enabled Drug Discovery Reaches Clinical Milestone

Jun 2025:   Knowledge Tracing for Biomedical AI Education

New preprint on a biologically inspired architecture for knowledge tracing. The study examines the use of generative AI in education, with a prospective evaluation of knowledge tracing in the classroom.

Jun 2025:   Few shot learning for rare disease diagnosis

Jun 2025:   One Patient, Many Contexts: Scaling Medical AI

Jun 2025:   ToolUniverse - 211+ Tools for "AI Scientist" Agents

ToolUniverse now offers access to over 211 cutting-edge biological and medical tools, all integrated with Model Context Protocol (MCP). Any “AI Scientist” agent can tap into these tools for biomedical research. [Tutorial] [ToolUniverse] [TxAgent]

May 2025:   What Perturbation Can Reverse Disease Effects?

In press at Nature Biomedical Engineering: PDGrapher AI predicts chemicals to reverse disease phenotypic effects — with applications to drug target identification.

May 2025:   Decision Transformers for Cell Reprogramming

New preprint: Decision transformers for generating reach-avoid policies in sequential decision making — with applications from robotics to cell reprogramming.

May 2025:   COMPASS: Immunotherapy Outcome Prediction

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics