
A multimodal foundation model for protein phenotypes

Author Affiliations
1Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA 2Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA 3Department of Brain Sciences, Imperial College London, London, UK 4Centre for Neuroimaging Sciences, King's College London, London, UK 5Department of Chemistry, MIT, Cambridge, MA, USA 6Department of Computing, Imperial College London, London, UK 7Institute of Computational Biology, Computational Health Center, Helmholtz Munich, Munich, Germany 8TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany 9School of Data Science, University of Virginia, VA, USA 10School of Computation, Information and Technology, Technical University of Munich, Garching, Germany 11Department of Neurology, Brigham and Women's Hospital, Boston, MA, USA 12Harvard Stem Cell Institute, Cambridge, MA, USA 13Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, MA, USA 14Broad Institute of MIT and Harvard, Cambridge, MA, USA 15Harvard Data Science Initiative, Cambridge, MA, USA
*Co-first authors

+Present address: Department of Computer Science, Stanford University, Stanford, CA, USA

Present address: Acceleration Consortium, University of Toronto, Toronto, ON, Canada

Corresponding author. Email: marinka@hms.harvard.edu
ProCyon capabilities

We present ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes. ProCyon supports flexible queries with interleaved protein and natural text inputs, enabling a vast array of applications, including:

  • Functional annotation for proteins
    • Supports completely unseen or poorly characterized proteins, i.e. zero-shot generalization to novel proteins
    • Generates annotations beyond pre-defined vocabularies and ontologies, i.e. zero-shot generalization to novel phenotypes
  • Generation of detailed phenotype descriptions for arbitrary proteins ("protein captioning")
  • Protein retrieval from flexible natural language prompts, without restriction to specific keywords or pre-defined terms
  • Generalization to novel protein-phenotype tasks unseen during training, or zero-shot task transfer, such as
    • Complex queries combining phenotypes from distinct knowledge domains, e.g. disease association and therapeutic interactions
    • Identifying protein domains targeted by a given small molecule drug
    • Modeling the phenotypic effect of protein coding mutations


These capabilities are enabled by instruction-tuning on a wide range of biological knowledge, allowing ProCyon to reason over protein phenotypes across scales, ranging from molecular functions up to organism-level disease associations.

Importantly, ProCyon is an open model, with open and accessible training data, open-source training code, reproducible training recipes, transparent evaluations, intermediate checkpoints, and more.

Abstract

Understanding the roles of human proteins remains a major challenge, with approximately 20% of human proteins lacking known functions and more than 40% missing context-specific functional insights. Even well-annotated proteins are often poorly characterized in diverse biological contexts, disease states, and perturbations. We present ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. To support this, we created ProCyon-Instruct, a dataset of 33 million protein phenotype instructions, representing a comprehensive resource for multiscale protein phenotypes. By co-training a large language model with multimodal molecular encoders, ProCyon integrates phenotypic and protein data. A novel architecture and instruction tuning strategy allow ProCyon to process arbitrarily interleaved protein-and-phenotype inputs, achieve zero-shot task transfer, and generate free-form text phenotypes interleaved with retrieved protein sequence, structure, and drug modalities in a single unified model. ProCyon achieves strong performance compared with single-modality models, multimodal models such as ESM3, and text-only LLMs on dozens of benchmarking tasks, such as contextual protein retrieval and question answering. We extensively evaluate ProCyon on biological applications, including identifying protein domains that bind small molecule drugs, predicting peptide binding with enzymes, and assessing the functional impact of Alzheimer's disease mutations. ProCyon enables conditional retrieval of proteins linked to small molecules through complementary mechanisms of action. It generates candidate phenotypes for under-characterized proteins recently implicated in Parkinson's disease, facilitating hypothesis generation for poorly understood proteins and biological processes.
ProCyon paves the way toward an effective, general solution for functional protein biology that can enable new insights into the human proteome.

ProCyon Model

ProCyon model architecture

ProCyon is an 11-billion parameter multimodal model fusing state-of-the-art large language models and protein representation learning methods. ProCyon supports multimodal protein inputs interleaved within textual prompts, enabling diverse queries about protein phenotypes and function. By using dedicated protein representation modules rather than a controlled vocabulary, ProCyon generalizes effectively to zero-shot proteins, novel therapeutic modalities, and protein regions, including domains and peptides.
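To make the interleaving concrete, here is a minimal sketch (not ProCyon's actual implementation; the placeholder token and function names are assumptions) of the kind of preprocessing a multimodal fusion model needs: splitting a prompt with protein placeholders into ordered text segments and protein slots, so protein embeddings can later be spliced into the language model's token stream.

```python
import re

PROT_TOKEN = "<PROT>"  # hypothetical placeholder token, for illustration only

def split_interleaved(prompt, proteins):
    """Return an ordered list of ("text", str) and ("protein", id) chunks.

    Each occurrence of PROT_TOKEN in the prompt is paired, in order, with the
    next protein identifier from `proteins`."""
    chunks, prot_iter = [], iter(proteins)
    # A capturing group in re.split keeps the delimiter in the result list.
    for piece in re.split(f"({re.escape(PROT_TOKEN)})", prompt):
        if piece == PROT_TOKEN:
            chunks.append(("protein", next(prot_iter)))
        elif piece:
            chunks.append(("text", piece))
    return chunks

chunks = split_interleaved(
    f"Which diseases involve {PROT_TOKEN} and how does it interact with {PROT_TOKEN}?",
    ["Q86WV6", "O14920"],  # STING1 and IKBKB, as an illustration
)
```

Downstream, the "text" chunks would go through the language model's tokenizer while each "protein" chunk is routed to a dedicated protein encoder, which is what frees the model from a controlled vocabulary.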

ProCyon-Instruct Dataset

To train ProCyon, we create the ProCyon-Instruct dataset. Some highlights:
  • 677,154 protein-phenotype pairs
  • 48,920 unique phenotypes and 56,753 proteins, domains, and peptides
  • Captures phenotypic information for proteins across five knowledge domains
We transform this data into an instruction tuning dataset, expressing each protein-phenotype pair as a natural language instruction. To generate a larger diversity of text in the instructions, we leverage an external LLM to rephrase the raw descriptions, resulting in a final dataset of 33,899,528 instructions for training ProCyon models.
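The transformation above can be sketched as simple templating: each protein-phenotype pair is slotted into multiple instruction phrasings, multiplying the pair count into tens of millions of instructions. The templates and field names below are illustrative assumptions, not the actual ProCyon-Instruct format.

```python
# Hypothetical instruction templates; in practice an external LLM rephrases
# the raw descriptions to diversify the text.
INSTRUCTION_TEMPLATES = [
    "Describe the {domain} phenotype of the protein [PROT].",
    "What is known about the {domain} role of [PROT]?",
]

def build_instruction(pair, template):
    """Turn one (protein, phenotype) record into an instruction/response pair.

    The protein is referenced by a placeholder ([PROT]) that a multimodal
    model would later replace with sequence/structure embeddings."""
    return {
        "protein_id": pair["protein_id"],
        "instruction": template.format(domain=pair["knowledge_domain"]),
        "response": pair["phenotype_text"],
    }

pair = {
    "protein_id": "P05067",  # example UniProt accession (APP), for illustration
    "knowledge_domain": "disease association",
    "phenotype_text": "Associated with Alzheimer's disease pathology.",
}

examples = [build_instruction(pair, t) for t in INSTRUCTION_TEMPLATES]
```

With roughly 677K pairs and many rephrasings per pair, this style of expansion is consistent with the ~33.9M final instructions reported above.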

Performance Comparison

ProCyon benchmarking results

ProCyon shows strong performance on a benchmark of fourteen biologically relevant tasks constructed from ProCyon-Instruct and framed as either question-answering or protein-retrieval tasks. ProCyon is the only model to consistently outperform both single-modality and multi-modality models across tasks. We also find that ProCyon maintains strong performance on 3,250 completely unseen phenotypes across knowledge domains, demonstrating its ability to reason over novel scientific concepts.

ProCyon Capabilities

Free-text protein retrieval

ProCyon STING figure

ProCyon is able to successfully retrieve the STING protein given functional queries related to neuronal inflammatory stress response, a role of STING that was only described in scientific literature published after ProCyon's training data cutoff date. Increasingly precise and functionally-relevant descriptions increase the retrieval rank of STING, showing ProCyon's ability to assist in the scientific discovery process.
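A retrieval rank like the one described above is typically computed by scoring every candidate protein against the query and reporting the target's position. This is a generic sketch with toy embeddings and a plain cosine similarity, not ProCyon's actual scoring function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieval_rank(query_emb, protein_embs, target_id):
    """1-indexed rank of target_id when proteins are sorted by query similarity."""
    scores = {pid: cosine(query_emb, emb) for pid, emb in protein_embs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(target_id) + 1

# Toy embeddings for illustration only.
protein_embs = {
    "STING1": [0.9, 0.1, 0.2],
    "TP53":   [0.1, 0.8, 0.3],
    "APP":    [0.2, 0.2, 0.9],
}
rank = retrieval_rank([1.0, 0.0, 0.1], protein_embs, "STING1")
```

Under this framing, "increasingly precise descriptions increase the retrieval rank" means the query embedding moves closer to the target protein's embedding, pushing its rank toward 1.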

Functional annotation of under-studied proteins

ProCyon AKNAD1 figure

We generate a UniProt-style description for AKNAD1, a poorly-characterized protein that does not appear in the training data for ProCyon. Two of the three final descriptions are supported by evidence in the Human Protein Atlas from subcellular localization assays and scRNA-seq studies. ProCyon's free-text outputs allow generation of descriptions unbounded by a controlled vocabulary.

Zero-shot task transfer

ProCyon cross-KD figure

We show that ProCyon exhibits zero-shot task transfer and cross-knowledge-domain reasoning, performing tasks beyond those it was explicitly trained on, including tasks that require reasoning across knowledge domains. Here we show ProCyon's ability to conditionally retrieve two distinct proteins targeted by the same drug, depending on which disease description is provided in the input prompt.

We hope you enjoyed this preview of how ProCyon can push the boundaries of AI for protein understanding. For more details and additional experiments, please see our full manuscript!

BibTeX


        @article{Queen2024.12.10.627665,
          author = {Queen, Owen and Huang, Yepeng and Calef, Robert and Giunchiglia, Valentina and Chen, Tianlong and Dasoulas, George and Tai, LeAnn and Ektefaie, Yasha and Noori, Ayush and Brown, Joseph and Cobley, Tom and Hrovatin, Karin and Hartvigsen, Tom and Theis, Fabian and Pentelute, Bradley L. and Khurana, Vikram and Kellis, Manolis and Zitnik, Marinka},
          title = {ProCyon: A multimodal foundation model for protein phenotypes},
          elocation-id = {2024.12.10.627665},
          year = {2024},
          doi = {10.1101/2024.12.10.627665},
          URL = {https://www.biorxiv.org/content/early/2024/12/15/2024.12.10.627665},
          eprint = {https://www.biorxiv.org/content/early/2024/12/15/2024.12.10.627665.full.pdf},
          journal = {bioRxiv}
        }