
A multimodal foundation model for protein phenotypes

Author Affiliations
1Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA 2Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA 3Department of Brain Sciences, Imperial College London, London, UK 4Centre for Neuroimaging Sciences, King's College London, London, UK 5Department of Chemistry, MIT, Cambridge, MA, USA 6Department of Computing, Imperial College London, London, UK 7Institute of Computational Biology, Computational Health Center, Helmholtz Munich, Munich, Germany 8TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany 9School of Data Science, University of Virginia, VA, USA 10School of Computation, Information and Technology, Technical University of Munich, Garching, Germany 11Department of Neurology, Brigham and Women's Hospital, Boston, MA, USA 12Harvard Stem Cell Institute, Cambridge, MA, USA 13Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, MA, USA 14Broad Institute of MIT and Harvard, Cambridge, MA, USA 15Harvard Data Science Initiative, Cambridge, MA, USA
*Co-first authors

+Present address: Department of Computer Science, Stanford University, Stanford, CA, USA

Present address: Acceleration Consortium, University of Toronto, Toronto, ON, Canada

Corresponding author. Email: marinka@hms.harvard.edu
ProCyon capabilities

We present ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes. ProCyon supports flexible queries with interleaved protein and natural text inputs, enabling a vast array of applications, including:

  • Functional annotation for proteins
    • Supports completely unseen or poorly characterized proteins, i.e. zero-shot generalization to novel proteins
    • Generates annotations beyond pre-defined vocabularies and ontologies, i.e. zero-shot generalization to novel phenotypes
  • Generation of detailed phenotype descriptions for arbitrary proteins ("protein captioning")
  • Protein retrieval from flexible natural language prompts, without restriction to specific keywords or pre-defined terms
  • Generalization to novel protein-phenotype tasks unseen during training, or zero-shot task transfer, such as
    • Complex queries combining phenotypes from distinct knowledge domains, e.g. disease association and therapeutic interactions
    • Identifying protein domains targeted by a given small molecule drug
    • Modeling the phenotypic effect of protein coding mutations


These capabilities are enabled by instruction-tuning on a wide range of biological knowledge, allowing ProCyon to reason over protein phenotypes across scales, ranging from molecular functions up to organism-level disease associations.

Importantly, ProCyon is an open model, with open and accessible training data, open-source training code, reproducible training recipes, transparent evaluations, intermediate checkpoints, and more.

Abstract

Understanding the roles of human proteins remains a major challenge, with approximately 20% of human proteins lacking known functions and more than 40% missing context-specific functional insights. Even well-annotated proteins are often poorly characterized in diverse biological contexts, disease states, and perturbations. We present ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. To support this, we created ProCyon-Instruct, a dataset of 33 million protein phenotype instructions, representing a comprehensive resource for multiscale protein phenotypes. By co-training a large language model with multimodal molecular encoders, ProCyon integrates phenotypic and protein data. A novel architecture and instruction tuning strategy allow ProCyon to process arbitrarily interleaved protein-and-phenotype inputs, achieve zero-shot task transfer, and generate free-form text phenotypes interleaved with retrieved protein sequence, structure, and drug modalities in a single unified model. ProCyon achieves strong performance compared with single-modality models, multimodal models such as ESM3, and text-only LLMs on dozens of benchmarking tasks, such as contextual protein retrieval and question answering. We extensively evaluate ProCyon on biological applications, including identifying protein domains that bind small molecule drugs, predicting peptide binding with enzymes, and assessing the functional impact of Alzheimer's disease mutations. ProCyon enables conditional retrieval of proteins linked to small molecules through complementary mechanisms of action. It generates candidate phenotypes for under-characterized proteins recently implicated in Parkinson's disease, facilitating hypothesis generation for poorly understood proteins and biological processes.
ProCyon paves the way toward an effective, general solution for functional protein biology that can enable new insights into the human proteome.

ProCyon Model

ProCyon model architecture

ProCyon is an 11-billion parameter multimodal model fusing state-of-the-art large language models and protein representation learning methods. ProCyon supports multimodal protein inputs interleaved within textual prompts, enabling diverse queries about protein phenotypes and function. By using dedicated protein representation modules rather than a controlled vocabulary, ProCyon generalizes effectively to zero-shot proteins, novel therapeutic modalities, and protein regions, including domains and peptides.
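To make the interleaving concrete, here is a minimal sketch (not ProCyon's actual implementation; the placeholder token and function names are assumptions) of the kind of preprocessing a multimodal fusion model needs: splitting a prompt with protein placeholders into ordered text segments and protein slots, so protein embeddings can later be spliced into the language model's token stream.

```python
import re

PROT_TOKEN = "<PROT>"  # hypothetical placeholder token, for illustration only

def split_interleaved(prompt, proteins):
    """Return an ordered list of ("text", str) and ("protein", id) chunks.

    Each occurrence of PROT_TOKEN in the prompt is paired, in order, with the
    next protein identifier from `proteins`."""
    chunks, prot_iter = [], iter(proteins)
    # A capturing group in re.split keeps the delimiter in the result list.
    for piece in re.split(f"({re.escape(PROT_TOKEN)})", prompt):
        if piece == PROT_TOKEN:
            chunks.append(("protein", next(prot_iter)))
        elif piece:
            chunks.append(("text", piece))
    return chunks

chunks = split_interleaved(
    f"Which diseases involve {PROT_TOKEN} and how does it interact with {PROT_TOKEN}?",
    ["Q86WV6", "O14920"],  # STING1 and IKBKB, as an illustration
)
```

Downstream, the "text" chunks would go through the language model's tokenizer while each "protein" chunk is routed to a dedicated protein encoder, which is what frees the model from a controlled vocabulary.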

ProCyon-Instruct Dataset

To train ProCyon, we create the ProCyon-Instruct dataset. Some highlights:
  • 677,154 protein-phenotype pairs
  • 48,920 unique phenotypes and 56,753 proteins, domains, and peptides
  • Captures phenotypic information for proteins across five knowledge domains
We transform this data into an instruction tuning dataset, expressing each protein-phenotype pair as a natural language instruction. To generate a larger diversity of text in the instructions, we leverage an external LLM to rephrase the raw descriptions, resulting in a final dataset of 33,899,528 instructions for training ProCyon models.
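The transformation above can be sketched as simple templating: each protein-phenotype pair is slotted into multiple instruction phrasings, multiplying the pair count into tens of millions of instructions. The templates and field names below are illustrative assumptions, not the actual ProCyon-Instruct format.

```python
# Hypothetical instruction templates; in practice an external LLM rephrases
# the raw descriptions to diversify the text.
INSTRUCTION_TEMPLATES = [
    "Describe the {domain} phenotype of the protein [PROT].",
    "What is known about the {domain} role of [PROT]?",
]

def build_instruction(pair, template):
    """Turn one (protein, phenotype) record into an instruction/response pair.

    The protein is referenced by a placeholder ([PROT]) that a multimodal
    model would later replace with sequence/structure embeddings."""
    return {
        "protein_id": pair["protein_id"],
        "instruction": template.format(domain=pair["knowledge_domain"]),
        "response": pair["phenotype_text"],
    }

pair = {
    "protein_id": "P05067",  # example UniProt accession (APP), for illustration
    "knowledge_domain": "disease association",
    "phenotype_text": "Associated with Alzheimer's disease pathology.",
}

examples = [build_instruction(pair, t) for t in INSTRUCTION_TEMPLATES]
```

With roughly 677K pairs and many rephrasings per pair, this style of expansion is consistent with the ~33.9M final instructions reported above.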

Performance Comparison

ProCyon benchmarking results

ProCyon shows strong performance on a benchmark of fourteen biologically relevant tasks constructed from ProCyon-Instruct and framed as either question-answering or protein-retrieval tasks. ProCyon is the only model to consistently outperform both single-modality and multi-modality models across tasks. We also find that ProCyon maintains strong performance on 3,250 completely unseen phenotypes across knowledge domains, demonstrating its ability to reason over novel scientific concepts.

ProCyon Capabilities

Free-text protein retrieval

ProCyon STING figure

ProCyon is able to successfully retrieve the STING protein given functional queries related to neuronal inflammatory stress response, a role of STING that was only described in scientific literature published after ProCyon's training data cutoff date. Increasingly precise and functionally-relevant descriptions increase the retrieval rank of STING, showing ProCyon's ability to assist in the scientific discovery process.
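A retrieval rank like the one described above is typically computed by scoring every candidate protein against the query and reporting the target's position. This is a generic sketch with toy embeddings and a plain cosine similarity, not ProCyon's actual scoring function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieval_rank(query_emb, protein_embs, target_id):
    """1-indexed rank of target_id when proteins are sorted by query similarity."""
    scores = {pid: cosine(query_emb, emb) for pid, emb in protein_embs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(target_id) + 1

# Toy embeddings for illustration only.
protein_embs = {
    "STING1": [0.9, 0.1, 0.2],
    "TP53":   [0.1, 0.8, 0.3],
    "APP":    [0.2, 0.2, 0.9],
}
rank = retrieval_rank([1.0, 0.0, 0.1], protein_embs, "STING1")
```

Under this framing, "increasingly precise descriptions increase the retrieval rank" means the query embedding moves closer to the target protein's embedding, pushing its rank toward 1.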

Functional annotation of under-studied proteins

ProCyon AKNAD1 figure

We generate a UniProt-style description for AKNAD1, a poorly-characterized protein that does not appear in the training data for ProCyon. Two of the three final descriptions are supported by evidence in the Human Protein Atlas from subcellular localization assays and scRNA-seq studies. ProCyon's free-text outputs allow generation of descriptions unbounded by a controlled vocabulary.

Zero-shot task transfer

ProCyon cross-KD figure

We show that ProCyon exhibits zero-shot task transfer and cross-knowledge-domain reasoning, performing tasks beyond those it was explicitly trained on, including tasks that require reasoning across knowledge domains. Here we show ProCyon's ability to conditionally retrieve two distinct proteins targeted by the same drug, depending on which disease description is provided in the input prompt.

We hope you enjoyed this preview of how ProCyon can push the boundaries of AI for protein understanding. For more details and additional experiments, please see our full manuscript!

BibTeX


        @article{Queen2024.12.10.627665,
          author = {Queen, Owen and Huang, Yepeng and Calef, Robert and Giunchiglia, Valentina and Chen, Tianlong and Dasoulas, George and Tai, LeAnn and Ektefaie, Yasha and Noori, Ayush and Brown, Joseph and Cobley, Tom and Hrovatin, Karin and Hartvigsen, Tom and Theis, Fabian and Pentelute, Bradley L. and Khurana, Vikram and Kellis, Manolis and Zitnik, Marinka},
          title = {ProCyon: A multimodal foundation model for protein phenotypes},
          elocation-id = {2024.12.10.627665},
          year = {2024},
          doi = {10.1101/2024.12.10.627665},
          URL = {https://www.biorxiv.org/content/early/2024/12/15/2024.12.10.627665},
          eprint = {https://www.biorxiv.org/content/early/2024/12/15/2024.12.10.627665.full.pdf},
          journal = {bioRxiv}
        }