ATOMICA - Universal Geometric AI for Molecular Interactions across Biomolecular Modalities

Molecular interactions fundamentally influence all aspects of chemistry and biology. Prevailing machine learning approaches model molecules in isolation or are restricted to a single type of interaction, such as protein-ligand or protein-protein binding.

Here, we present ATOMICA, a geometric deep learning model that learns representations of the atomic structure of intermolecular interfaces that generalize across many biomolecular modalities (small molecules, metals, amino acids, and nucleic acids) and is pretrained on 2,105,703 molecular interaction interfaces. We leverage an all-atom denoising and masking pre-training objective to learn representations at the scale of atoms, blocks (amino acids, nucleic acids, chemical motifs), and whole interaction interfaces between molecular entities.

We demonstrate synergies from pre-training on molecules of different modalities in the quality of learned representations, observing for the first time scaling laws when integrating datasets from across molecular modalities. In a self-supervised manner, ATOMICA representations capture critical residues at interaction interfaces, the composition of interacting molecules, and chemical similarity.

ATOMICA is a general-purpose model that we adapt to fingerprint protein interfaces binding ions, ligands, nucleic acids, lipids, and proteins; we show that proteins with similar interfaces are likely to be associated with the same diseases. This motivates ATOMICANets, networks that capture protein disease pathways for 27 diseases across the ion, small-molecule, and lipid interfaceomes and are predictive for identifying disease proteins in autoimmune neuropathies and lymphoma. Applying ATOMICA to the dark proteome, we predict 2,646 high-confidence ligands and explore novel functions and binding motifs of previously uncharacterized protein clusters, including a cluster of bacterial C4 zinc fingers and clusters of transmembrane cytochrome subunits, illustrating how ATOMICA can help with understanding biomolecular interactions across the protein universe.

Motivation

Current machine learning models in molecular biology often treat molecules in isolation or focus narrowly on specific types of molecular interactions, such as protein-ligand or protein-protein binding. These siloed models use separate architectures for different biomolecular classes, limiting their ability to transfer knowledge across modalities. This restriction hampers generalizability and reduces performance in low-data domains like rare interactions or uncharacterized proteins.

ATOMICA addresses these limitations by unifying the modeling of intermolecular interactions across small molecules, ions, nucleic acids, peptides, and proteins. It leverages fundamental physicochemical principles common to all biomolecular interactions, such as hydrogen bonding, van der Waals forces, and π-stacking, to build a universal representation. The model is designed to operate at the atomic scale and capture multi-scale structural and chemical relationships, enabling accurate and transferable representations of molecular interfaces across diverse contexts.

ATOMICA Model

ATOMICA is a hierarchical geometric deep learning model trained on over 2.1 million molecular interaction interfaces. It represents interaction complexes using an all-atom graph structure, where nodes correspond to atoms or grouped chemical blocks, and edges reflect both intra- and intermolecular spatial relationships. The model uses SE(3)-equivariant message passing to ensure that learned embeddings are invariant to rotations and translations of molecular structures.
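To make the graph construction concrete, here is a minimal sketch of how intermolecular edges could be derived from atomic coordinates with a distance cutoff. This is an illustration only, not the ATOMICA implementation; the function name and the 4.5 Å cutoff are assumptions for the example.

```python
import numpy as np

def build_interface_edges(coords_a, coords_b, cutoff=4.5):
    """Sketch: connect atom pairs from two molecules that lie within
    `cutoff` Angstroms of each other (hypothetical helper, not the
    actual ATOMICA code). Returns intermolecular edges as (i, j)
    index pairs into coords_a and coords_b respectively."""
    # Pairwise distances between the two molecules' atoms: (N, M)
    diff = coords_a[:, None, :] - coords_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.nonzero(dist < cutoff)
    return list(zip(i.tolist(), j.tolist()))

# Toy example: two 3-atom fragments near each other
a = np.array([[0.0, 0, 0], [1.5, 0, 0], [10.0, 0, 0]])
b = np.array([[0.0, 3, 0], [20.0, 0, 0], [1.5, 2, 0]])
edges = build_interface_edges(a, b)  # -> [(0, 0), (0, 2), (1, 0), (1, 2)]
```

In the real model, such edges would additionally carry geometric features and be combined with intramolecular edges before SE(3)-equivariant message passing.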

The architecture produces embeddings at multiple scales—atom, block, and graph—that capture fine-grained structural detail and broader functional motifs. The pretraining strategy involves denoising transformations (rotation, translation, torsion) and masked block-type prediction, enabling the model to learn chemically grounded, transferable features. ATOMICA supports downstream tasks via plug-and-play adaptation with task-specific heads, including binding site prediction and protein interface fingerprinting.
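The denoising part of the pretraining objective can be sketched as follows: corrupt a block's coordinates with a random rotation and translation, then learn to undo the transform. This is a simplified illustration under stated assumptions (the helper names and noise scale are invented; ATOMICA also perturbs torsions and masks block types, which are omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(rng):
    # Sample a random 3D rotation via QR decomposition of a Gaussian
    # matrix, fixing signs so the result is a proper rotation (det = +1).
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

def corrupt(coords, rng, sigma_t=1.0):
    """Perturb coordinates with a random rigid transform; the
    pretraining target is to recover the applied transform."""
    R = random_rotation(rng)
    t = rng.normal(scale=sigma_t, size=3)
    return coords @ R.T + t, (R, t)

coords = rng.normal(size=(5, 3))          # toy "block" of 5 atoms
noisy, (R, t) = corrupt(coords, rng)
# A perfect denoiser would invert the transform exactly:
recovered = (noisy - t) @ R
```

During training, the model sees only `noisy` and is optimized to predict the perturbation, which forces the learned features to encode local geometry.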

Universal Representation of Molecular Interactions

ATOMICA learns a shared latent space that encodes interaction features across all biomolecular modalities. Without explicit supervision on modality labels, the model organizes protein-protein, protein-ligand, protein-RNA, and protein-DNA interactions into chemically meaningful clusters in latent space. Embeddings show continuity across modalities—e.g., protein-peptide interactions lie close to protein-protein ones—demonstrating the model’s capacity to capture chemical and functional similarity.

This universal embedding space supports compositional reasoning, akin to word analogies in NLP. For example, the model approximates a protein-small molecule complex by algebraically combining embeddings of related complexes, reflecting learned compositionality. ATOMICA also ranks interface residues by importance using ATOMICAScore, and these predictions align well with residues involved in intermolecular bonds, outperforming sequence-only language models like ESM2 in zero-shot settings.
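The analogy-style composition described above can be sketched with simple vector arithmetic on interface embeddings. The embeddings below are random stand-ins for illustration only (real ATOMICA embeddings come from the pretrained encoder), and the complex names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 128-d embeddings of three interaction complexes
rng = np.random.default_rng(1)
z = {name: rng.normal(size=128)
     for name in ["protA-lig", "protA-protB", "protC-protB"]}

# Word-analogy-style composition: swap out the shared protein partner
query = z["protA-lig"] - z["protA-protB"] + z["protC-protB"]

# Rank known complexes by similarity to the composed query
candidates = {k: cosine(query, v) for k, v in z.items()}
```

In the paper's setting, the nearest neighbor of such a composed query in latent space approximates the embedding of the corresponding unseen complex (e.g., protC bound to the ligand).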

ATOMICA’s Interfaceome Networks

We use ATOMICA representations to construct interfaceome networks, which are graphs linking proteins based on the similarity of their interaction interfaces with ions, small molecules, nucleic acids, lipids, and proteins. These modality-specific ATOMICANets reveal that proteins sharing similar interface features often participate in the same disease pathways, even across different interaction types. This provides a new molecular-level view of disease mechanisms.
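A minimal sketch of such a network construction: link each protein to its most similar interface embeddings by cosine similarity. This is illustrative only, assuming precomputed per-protein interface embeddings; the function name, protein labels, and k value are invented for the example.

```python
import numpy as np

def build_interfaceome(embeddings, names, k=2):
    """Sketch: build an undirected k-nearest-neighbor graph over
    protein interface embeddings using cosine similarity
    (hypothetical helper, not the actual ATOMICANet pipeline)."""
    # Normalize rows so dot products equal cosine similarities
    Z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = Z @ Z.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-edges
    edges = set()
    for i in range(len(names)):
        for j in np.argsort(sim[i])[-k:]:   # top-k neighbors of node i
            edges.add(tuple(sorted((names[i], names[int(j)]))))
    return edges

# Toy example: four proteins with random 16-d interface embeddings
rng = np.random.default_rng(2)
emb = rng.normal(size=(4, 16))
net = build_interfaceome(emb, ["P1", "P2", "P3", "P4"], k=1)
```

Disease-pathway analysis then asks whether proteins associated with the same disease are significantly closer in such a network than expected by chance.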

By analyzing disease-associated proteins across 82 diseases, ATOMICANets uncover statistically significant pathway modules in lipid, ion, and small molecule networks. The model predicts disease-relevant interactions for autoimmune neurological disorders and cancer, identifying ion channels in multiple sclerosis and lymphoma-associated proteins across all five interface types. This demonstrates ATOMICA’s value in understanding protein function and disease involvement in a modality-specific, interpretable manner.

Publication

ATOMICA: Universal Geometric AI for Molecular Interactions across Biomolecular Modalities
Ada Fang, Zaixi Zhang, Andrew Zhou, and Marinka Zitnik
In Review 2025 [arXiv]

@article{fang2025atomica,
  title={ATOMICA: Universal Geometric AI for Molecular Interactions across Biomolecular Modalities},
  author={Fang, Ada and Zhang, Zaixi and Zhou, Andrew and Zitnik, Marinka},
  journal={},
  url={},
  year={2025}
}

Code and Data Availability

A PyTorch implementation of ATOMICA is available in the GitHub repository. Datasets are available in the Harvard Dataverse repository.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics