Molecular interactions fundamentally influence all aspects of chemistry and biology. Prevailing machine learning approaches either model molecules in isolation or restrict themselves to a single type of interaction, such as protein-ligand or protein-protein interactions.
Here, we present ATOMICA, a geometric deep learning model that learns representations of the atomic structures of intermolecular interactions. These representations generalize across interactions between many biomolecular modalities (small molecules, metals, amino acids, and nucleic acids), and the model is pretrained on 2,105,703 molecular interaction interfaces. We use an all-atom denoising and masking pretraining objective to learn representations at the scale of atoms, blocks (amino acids, nucleic acids, chemical motifs), and whole interaction interfaces between molecular entities.
We demonstrate that pretraining on molecules of different modalities synergistically improves the quality of the learned representations and, for the first time, observe scaling laws when integrating datasets from across molecular modalities. In a self-supervised manner, ATOMICA representations capture critical residues on interaction interfaces, the composition of interacting molecules, and chemical similarity.
ATOMICA is a general-purpose model that we adapt to fingerprint protein interfaces that bind ions, ligands, nucleic acids, lipids, and proteins, and we show that proteins with similar interactions are likely to be associated with the same diseases. This motivates ATOMICANet networks, which capture protein disease pathways for 27 diseases across ion, small molecule, and lipid interfaceomes and are predictive of disease proteins for autoimmune neuropathies and lymphoma. Applying ATOMICA to the dark proteome, we predict 2,646 high-confidence ligands and explore novel functions and binding motifs of protein clusters with previously unknown function, including a cluster of bacterial C4 zinc fingers and clusters of transmembrane cytochrome subunits. These results illustrate how ATOMICA can help in understanding biomolecular interactions across the protein universe.
Motivation
Current machine learning models in molecular biology often treat molecules in isolation or focus narrowly on specific types of molecular interactions, such as protein-ligand or protein-protein binding. These siloed models use separate architectures for different biomolecular classes, limiting their ability to transfer knowledge across modalities. This restriction hampers generalizability and reduces performance in low-data domains like rare interactions or uncharacterized proteins.
ATOMICA addresses these limitations by unifying the modeling of intermolecular interactions across small molecules, ions, nucleic acids, peptides, and proteins. It leverages fundamental physicochemical principles common to all biomolecular interactions, such as hydrogen bonding, van der Waals forces, and π-stacking, to build a universal representation. The model is designed to operate at the atomic scale and capture multi-scale structural and chemical relationships, enabling accurate and transferable representations of molecular interfaces across diverse contexts.

ATOMICA Model
ATOMICA is a hierarchical geometric deep learning model trained on over 2.1 million molecular interaction interfaces. It represents interaction complexes using an all-atom graph structure, where nodes correspond to atoms or grouped chemical blocks, and edges reflect both intra- and intermolecular spatial relationships. The model uses SE(3)-equivariant message passing to ensure that learned embeddings are invariant to rotations and translations of molecular structures.
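The graph construction and the symmetry property described above can be illustrated with a minimal sketch. It is not ATOMICA's implementation: the function names, the 4.0 Å distance cutoff, and the toy coordinates are all assumptions made for illustration. It only shows that edges built from interatomic distances, labeled as intra- or intermolecular, are unchanged when the whole complex is rotated and translated.

```python
import math
from itertools import combinations

def build_edges(coords, mol_ids, cutoff=4.0):
    """Connect atom pairs closer than `cutoff` (hypothetical value, in angstroms)."""
    edges = []
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])
        if d <= cutoff:
            # An edge is intramolecular if both atoms belong to the same molecule.
            kind = "intra" if mol_ids[i] == mol_ids[j] else "inter"
            edges.append((i, j, kind, d))
    return edges

def rigid_transform(coords, angle, shift):
    """Rotate about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]

# Toy complex (made-up coordinates): atoms 0-1 in molecule 0, atoms 2-3 in molecule 1.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 3.0, 0.0), (1.5, 4.4, 0.0)]
mol_ids = [0, 0, 1, 1]

edges = build_edges(coords, mol_ids)
moved = build_edges(rigid_transform(coords, 0.7, (5.0, -2.0, 1.0)), mol_ids)

for (i, j, kind, d), (_, _, _, d_moved) in zip(edges, moved):
    print(i, j, kind, round(d, 3), round(d_moved, 3))
```

Because the edge list depends only on pairwise distances, any feature computed from it is invariant to rigid motions of the complex. An SE(3)-equivariant network generalizes this idea to learned features that may also carry directional information.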

The architecture produces embeddings at multiple scales—atom, block, and graph—that capture fine-grained structural detail and broader functional motifs. The pretraining strategy involves denoising transformations (rotation, translation, torsion) and masked block-type prediction, enabling the model to learn chemically grounded, transferable features. ATOMICA supports downstream tasks via plug-and-play adaptation with task-specific heads, including binding site prediction and protein interface fingerprinting.
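As a rough sketch of how such a corruption step might look, the example below applies a random rigid perturbation to one molecule and masks a fraction of block types. It is an illustration under stated assumptions, not the authors' training code: the helper names, the `<mask>` token, the noise ranges, and the mask rate are hypothetical, the rotation is restricted to the z-axis for brevity, and torsion noise is omitted.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token

def perturb_rigid(coords, max_angle, max_shift, rng):
    """Rotate about the z-axis through the centroid, then translate.

    Returns the noisy coordinates and the applied (angle, shift), which a
    denoising objective could use as the regression target.
    """
    angle = rng.uniform(-max_angle, max_angle)
    shift = tuple(rng.uniform(-max_shift, max_shift) for _ in range(3))
    n = len(coords)
    cx, cy, _ = (sum(p[k] for p in coords) / n for k in range(3))
    c, s = math.cos(angle), math.sin(angle)
    noisy = []
    for x, y, z in coords:
        dx, dy = x - cx, y - cy
        noisy.append((cx + c * dx - s * dy + shift[0],
                      cy + s * dx + c * dy + shift[1],
                      z + shift[2]))
    return noisy, (angle, shift)

def mask_blocks(block_types, mask_rate, rng):
    """Replace a random subset of block types with MASK; return targets by index."""
    masked, targets = [], {}
    for idx, block in enumerate(block_types):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[idx] = block  # the label the model would have to recover
        else:
            masked.append(block)
    return masked, targets

rng = random.Random(0)
ligand = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.0)]  # made-up coordinates
blocks = ["ALA", "GLY", "SER", "LYS"]

noisy, (angle, shift) = perturb_rigid(ligand, max_angle=0.5, max_shift=1.0, rng=rng)
masked, targets = mask_blocks(blocks, mask_rate=0.25, rng=rng)
print(noisy, masked, targets)
```

The perturbation is rigid, so the internal geometry of the perturbed molecule is preserved and only its pose relative to its partner changes. Recovering the applied noise and the masked block types are the two self-supervised signals this sketch is meant to suggest.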

Universal Representation of Molecular Interactions
ATOMICA learns a shared latent space that encodes interaction features across all biomolecular modalities. Without explicit supervision on modality labels, the model organizes protein-protein, protein-ligand, protein-RNA, and protein-DNA interactions into chemically meaningful clusters in latent space. Embeddings show continuity across modalities—e.g., protein-peptide interactions lie close to protein-protein ones—demonstrating the model’s capacity to capture chemical and functional similarity.
This universal embedding space supports compositional reasoning, akin to word analogies in NLP. For example, the model approximates a protein-small molecule complex by algebraically combining embeddings of related complexes, reflecting learned compositionality. ATOMICA also ranks interface residues by importance using ATOMICAScore, and these predictions align well with residues involved in intermolecular bonds, outperforming sequence-only language models like ESM2 in zero-shot settings.
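The analogy-style composition can be sketched with toy vectors. Everything here is hypothetical: the complex names, the three-dimensional embeddings, and the specific combination `a - b + c` are assumptions chosen to mirror word-analogy arithmetic, not values or formulas taken from ATOMICA.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def combine(a, b, c):
    """Return a - b + c elementwise (analogy-style composition)."""
    return [x - y + z for x, y, z in zip(a, b, c)]

# Made-up embeddings of four complexes.
emb = {
    "protein_A+ligand": [0.9, 0.8, 0.1],
    "protein_A+ion":    [0.9, 0.1, 0.7],
    "protein_B+ion":    [0.2, 0.1, 0.8],
    "protein_B+ligand": [0.2, 0.8, 0.1],
}

# "B with ion" minus "A with ion" plus "A with ligand" should land near "B with ligand".
query = combine(emb["protein_B+ion"], emb["protein_A+ion"], emb["protein_A+ligand"])
ranked = sorted(emb, key=lambda name: cosine(query, emb[name]), reverse=True)
print(ranked[0])  # → protein_B+ligand
```

In this toy setup the composed vector is closest to the held-out protein-small molecule complex, which is the kind of behavior the compositionality claim describes.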

ATOMICA’s Interfaceome Networks
We use ATOMICA representations to construct interfaceome networks, which are graphs linking proteins based on similarity in their interaction interfaces with ions, small molecules, nucleic acids, lipids, and proteins. These modality-specific ATOMICANet networks reveal that proteins sharing similar interface features often participate in the same disease pathways, even across different interaction types. This provides a new molecular-level view of disease mechanisms.
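One simple way to turn interface embeddings into such a network is to link proteins whose embeddings are sufficiently similar. The sketch below assumes cosine similarity and a fixed threshold of 0.9; both choices, along with the protein names and vectors, are hypothetical and are not taken from the ATOMICANet construction.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def build_network(embeddings, threshold=0.9):
    """Adjacency sets linking proteins whose interface embeddings are similar.

    `threshold` is an assumed cutoff on cosine similarity.
    """
    graph = {name: set() for name in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Made-up embeddings of ion-binding interfaces for four proteins.
ion_interfaces = {
    "P1": [1.0, 0.1, 0.0],
    "P2": [0.9, 0.2, 0.1],
    "P3": [0.0, 1.0, 0.2],
    "P4": [0.1, 0.9, 0.3],
}

net = build_network(ion_interfaces)
print(net)
```

One such graph would be built per modality (ion, small molecule, nucleic acid, lipid, protein), after which disease-associated proteins can be examined for whether they fall in the same neighborhood.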
By analyzing disease-associated proteins across 82 diseases, ATOMICANets uncover statistically significant pathway modules in lipid, ion, and small molecule networks. The model predicts disease-relevant interactions for autoimmune neurological disorders and cancer, identifying ion channels in multiple sclerosis and lymphoma-associated proteins across all five interface types. This demonstrates ATOMICA’s value in understanding protein function and disease involvement in a modality-specific, interpretable manner.

Publication
ATOMICA: Universal Geometric AI for Molecular Interactions across Biomolecular Modalities
Ada Fang, Zaixi Zhang, Andrew Zhou, and Marinka Zitnik
In Review 2025 [arXiv]
@article{fang2025atomica,
title={ATOMICA: Universal Geometric AI for Molecular Interactions across Biomolecular Modalities},
author={Fang, Ada and Zhang, Zaixi and Zhou, Andrew and Zitnik, Marinka},
journal={},
url={},
year={2025}
}
Code and Data Availability
A PyTorch implementation of ATOMICA is available in the GitHub repository. Datasets are available in the Harvard Dataverse repository.