Contextualizing Protein Representations Using
Deep Learning on Interactomes and Single-Cell Experiments

Protein interaction networks are a critical component in studying the function and therapeutic potential of proteins. However, accurately modeling protein interactions across diverse biological contexts, such as tissues and cell types, remains a significant challenge for existing algorithms.

We introduce AWARE, a flexible geometric deep learning approach that trains on contextualized protein interaction networks to generate context-aware protein representations. Leveraging a multi-organ single-cell transcriptomic atlas of humans, AWARE provides 394,760 protein representations split across 156 cell type contexts from 24 tissues and organs. We demonstrate that AWARE's contextualized protein representations reflect cellular and tissue organization, and that its tissue representations enable zero-shot retrieval of the tissue hierarchy. Our contextualized protein representations, infused with cellular and tissue organization, can easily be adapted for diverse downstream tasks.

We fine-tune AWARE to study the genomic effects of drugs in multiple cellular contexts and show that our context-aware model significantly outperforms state-of-the-art, yet context-agnostic, models. Enabled by our context-aware modeling of proteins, AWARE is able to nominate promising protein targets and cell-type contexts for further investigation. AWARE exemplifies and empowers the long-standing paradigm of incorporating context-specific effects for studying biological systems, especially the impact of disease and therapeutics.

Modeling interactions between proteins has been crucial for uncovering the structure, function, and therapeutic potential of proteins. Extensive efforts to develop experimental and computational technologies for constructing and analyzing protein interaction networks have improved the characterization of proteins. However, protein interaction networks are typically presented as generic maps without contextual information about tissues or cell types. Despite the development of high-throughput methods for screening protein-protein interactions and sequencing technologies that measure gene expression at single-cell resolution, accurately modeling protein interactions across diverse biological contexts, such as tissues and cell types, remains a critical experimental and computational challenge.

The roles of proteins are influenced by the biological contexts in which they are found:

  • While nearly every cell contains the same genome, the expression and function of a protein depends on the cell or tissue. Further, protein expression and function can differ significantly between healthy and diseased cells/tissues.
  • The ability to model and interrogate proteins in diverse biological contexts can improve the characterization of a disease’s mechanism of action and the design of safe and efficacious drugs.

There is a growing need for methodologies that can effectively inject and leverage contextual information about proteins. Still, existing algorithms are limited in their capacity to model proteins with cell type or tissue specificity.

Overview of AWARE

AWARE is a self-supervised geometric deep learning model that generates protein representations in different cell type contexts. AWARE integrates single-cell transcriptomic data with a protein interaction network, a cell type interaction network, and a tissue hierarchy to generate protein representations at cell type resolution.

In this work, we focus on protein-coding genes and do not encode differences among protein isoforms (e.g., due to alternative splicing). Unlike existing methods that provide only one representation per protein (assuming each protein-coding gene encodes only one protein), and thus fewer than 22,000 protein representations in total, AWARE generates a unique representation for each cell type in which a protein is activated.

With our 394,760 contextualized protein representations (i.e., protein representations injected with cell type specificity), we demonstrate AWARE's ability to integrate structured and transcriptomic data, perform transfer learning across proteins, cell types, and tissues, and generate contextualized predictions for diverse biomedical tasks.
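To make the idea of contextualized representations concrete, the sketch below shows one way a per-(protein, cell type) embedding lookup could be organized. This is a minimal illustration, not AWARE's actual architecture; the class and parameter names (`ContextualEmbeddings`, `n_proteins`, `n_celltypes`) are hypothetical, and a real implementation would only store rows for proteins activated in each context.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: embeddings keyed by (protein, cell type context).
# Dense tables are used here for simplicity; AWARE learns these
# representations from cell type specific interaction networks.
class ContextualEmbeddings(nn.Module):
    def __init__(self, n_proteins: int, n_celltypes: int, dim: int = 128):
        super().__init__()
        # One embedding table per cell type context.
        self.tables = nn.ModuleList(
            nn.Embedding(n_proteins, dim) for _ in range(n_celltypes)
        )

    def forward(self, protein_idx: torch.Tensor, celltype: int) -> torch.Tensor:
        # Representation of the given proteins *in* that cell type context.
        return self.tables[celltype](protein_idx)

emb = ContextualEmbeddings(n_proteins=100, n_celltypes=5, dim=16)
p = torch.tensor([0, 1])
# The same protein gets a different vector in each context.
z_a = emb(p, celltype=0)
z_b = emb(p, celltype=1)
```

The key contrast with context-agnostic methods is that the lookup is indexed by the pair (protein, cell type) rather than by protein alone.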

AWARE Algorithm

AWARE is trained on a set of cell type specific protein interaction networks, unified by a cellular and tissue network, to produce contextualized protein representations based on cell type activation. Unlike existing approaches, which do not consider biological context, AWARE produces multiple representations of each protein based on context, representations of the cell types themselves, and representations of the tissues from which the cell types are derived, organized by the tissue hierarchy.

Given the multi-scale nature of its inputs, AWARE is equipped to learn protein-level, cell type-level, and tissue-level topology in a single unified embedding space. To fully leverage these multi-scale inputs, AWARE uses protein-, cell type-, and tissue-level attention mechanisms and objective functions to inject cellular and tissue organization into the embedding space. AWARE is designed such that pairs of nodes that share an edge are embedded near each other, protein representations from the same cell type are embedded close together (and far from protein representations of other cell types), and protein representations are embedded close to the representation of their corresponding cell type (and far from other cell type representations).
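The three proximity criteria above can be sketched as simple dot-product losses. This is an illustrative sketch only: AWARE's actual objective functions and attention mechanisms are more involved, and all tensor names here are assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the three embedding-space criteria, using
# dot-product scores with a logistic loss. Inputs are batches of
# embedding vectors (all hypothetical placeholders).
def embedding_space_losses(z_src, z_dst, z_same_ct, z_diff_ct, z_ct, z_other_ct):
    # 1) Nodes that share an edge should be embedded nearby.
    edge_loss = -F.logsigmoid((z_src * z_dst).sum(-1)).mean()
    # 2) Proteins of the same cell type close; different cell types far.
    ct_loss = (-F.logsigmoid((z_src * z_same_ct).sum(-1))
               - F.logsigmoid(-(z_src * z_diff_ct).sum(-1))).mean()
    # 3) Proteins close to their own cell type representation, far from others.
    anchor_loss = (-F.logsigmoid((z_src * z_ct).sum(-1))
                   - F.logsigmoid(-(z_src * z_other_ct).sum(-1))).mean()
    return edge_loss + ct_loss + anchor_loss

d = 16
loss = embedding_space_losses(*(torch.randn(8, d) for _ in range(6)))
```

Minimizing a combined loss of this shape pulls the three kinds of positive pairs together while pushing the negatives apart in one shared space.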

AWARE propagates messages across proteins, cell types, and tissues using attention mechanisms specific to each node and relationship type:

  1. The protein-level objective function, which combines self-supervised link prediction on protein interactions with cell type identity classification on protein nodes, enables AWARE to produce an embedding space that captures both the topology of the cell type specific protein interaction networks and the cell type identity of proteins.
  2. The cell type- and tissue-level objective functions are based solely on self-supervised link prediction, enabling AWARE to learn cellular and tissue organization.
  3. This information is propagated to the protein representations through an attention bridge, imposing cellular and tissue organization on the protein representations.
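The protein-level objective described in step 1 can be sketched as a sum of a link prediction loss and a cell type classification loss. The function and the linear classification head below are assumptions for illustration, not AWARE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the protein-level objective: self-supervised link
# prediction on protein interactions plus cell type identity
# classification on protein nodes. Shapes and the classifier head
# are illustrative assumptions.
def protein_level_objective(z, pos_edges, neg_edges, ct_logits, ct_labels):
    # Link prediction: score node pairs by dot product; true edges
    # should score high, sampled negative edges low.
    pos = (z[pos_edges[0]] * z[pos_edges[1]]).sum(-1)
    neg = (z[neg_edges[0]] * z[neg_edges[1]]).sum(-1)
    link_loss = -F.logsigmoid(pos).mean() - F.logsigmoid(-neg).mean()
    # Cell type identity classification on each protein node.
    clf_loss = F.cross_entropy(ct_logits, ct_labels)
    return link_loss + clf_loss

n, d, n_ct = 20, 16, 4
z = torch.randn(n, d)                      # protein embeddings (placeholder)
head = nn.Linear(d, n_ct)                  # assumed classification head
pos_edges = torch.randint(0, n, (2, 30))   # observed interactions (random here)
neg_edges = torch.randint(0, n, (2, 30))   # sampled non-edges (random here)
labels = torch.randint(0, n_ct, (n,))
loss = protein_level_objective(z, pos_edges, neg_edges, head(z), labels)
```

Because both terms are computed from the same protein embeddings, minimizing this objective encodes network topology and cell type identity in a single space.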

Publication

Contextualizing Protein Representations Using Deep Learning on Interactomes and Single-Cell Experiments
Michelle M. Li, Yepeng Huang, Marissa Sumathipala, Man Qing Liang, Alberto Valdeolivas, Ashwin N. Ananthakrishnan, Katherine Liao, Daniel Marbach and Marinka Zitnik
In Review 2023

@article{li2023contextualizing,
  title={Contextualizing Protein Representations Using Deep Learning on Interactomes and Single-Cell Experiments},
  author={Li, Michelle M and Huang, Yepeng and Sumathipala, Marissa and Liang, Man Qing and Valdeolivas, Alberto and Ananthakrishnan, Ashwin N and Marbach, Daniel and Zitnik, Marinka},
  journal={bioRxiv},
  url={},
  year={2023}
}

Code Availability

A PyTorch implementation of AWARE is available in the GitHub repository.

We provide an interactive demo for exploring AWARE's protein representations through a visual interface in the Hugging Face Space.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics