Research Directions

Machine learning methods and applications

The overarching goal of our research is to develop the next generation of machine learning for data in medicine and science. Our research realizes an end-to-end scientific approach in which we:

  1. Invent ways to combine rich, heterogeneous data in their broadest sense to reduce redundancy and uncertainty and to make them amenable to comprehensive analyses.
  2. Develop methods for reasoning over rich, interconnected data, and design architectures for learning actionable representations.
  3. Translate machine learning research into innovative applications and solutions for pressing biomedical questions.

Our research shows that this approach not only opens up new avenues for understanding nature, analyzing health, and developing new medicines to help people, but can also fundamentally change how predictive modeling is performed today.

Our research strategy results in a suite of foundation models (e.g., pre-trained, self-supervised, general-purpose, multi-purpose, and multi-modal models) that are trained on broad data at scale and can be adapted (e.g., fine-tuned) to a wide range of downstream tasks.
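The pretrain-then-adapt pattern described above can be sketched in a few lines: a large pre-trained encoder is kept frozen, and only a small task-specific head is fitted to the downstream labels. All names here (`pretrained_encoder`, `fit_linear_head`) and the toy task are illustrative, not the lab's actual models.

```python
# Minimal sketch of adapting a frozen, pre-trained model to a downstream task.
# The "encoder" stands in for a foundation model whose weights are frozen;
# only a small linear head is trained on the new task's labels.

def pretrained_encoder(x):
    """Frozen feature extractor: maps a raw input to a fixed embedding."""
    # Toy stand-in: two hand-fixed features of a scalar input.
    return [x, x * x]

def fit_linear_head(examples, lr=0.05, epochs=500):
    """Fit weights w and bias b of a linear head on frozen embeddings (squared loss, SGD)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = pretrained_encoder(x)           # encoder is never updated
            pred = w[0] * z[0] + w[1] * z[1] + b
            err = pred - y
            w = [wi - lr * err * zi for wi, zi in zip(w, z)]
            b -= lr * err
    return w, b

# Downstream task: y = 2x + 1, learnable from the frozen features alone.
data = [(x / 10.0, 2 * (x / 10.0) + 1) for x in range(-10, 11)]
w, b = fit_linear_head(data)
```

Because only the head's handful of parameters are updated, adaptation needs far less data and compute than training the encoder itself, which is the point of reusing one foundation model across many tasks.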

Fusing biomedical knowledge and patient data

[Algorithms and methods: Knowledge graphs, multi-modal learning, data integration, multi-scale modeling]

Soon, the state of a person will be characterized with increasing precision by incorporating data modalities such as genetic code, behavior, therapeutics, nutrients, and the environment. The challenge is how to computationally operationalize these data to make them amenable to decision making.

Further, data are of many different types, including experimental readouts, curated annotations, and metadata; no single data type can capture all the factors necessary to understand a phenomenon such as a disease. These high-dimensional datasets lead to far more complex characterizations than those currently in use, requiring fundamentally new approaches.

To this end, we invent methods that fuse rich, heterogeneous data into knowledge graphs in an effort to bridge the divide between biomedical research and patient data. With this approach, we combine data in their broadest sense to reduce redundancy, resolve contradictory observations, and model uncertainty. Using our methods, we were able, for example, to construct one of the largest biological networks ever created, with over 2.3 billion edges and more than 2,000 modes.
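At its simplest, fusing heterogeneous sources into a knowledge graph means merging typed (head, relation, tail) triples from different sources while deduplicating overlaps and tracking provenance. The following sketch illustrates that idea; the entity names, relations, and source labels are all hypothetical.

```python
# Minimal sketch of fusing heterogeneous sources into one typed knowledge graph.
# Each source contributes (head, relation, tail) triples; fusion here reduces
# redundancy by deduplicating triples while recording which sources support each.

from collections import defaultdict

def fuse(*sources):
    """Merge (name, triples) source pairs into a graph keyed by typed edge."""
    kg = defaultdict(set)            # triple -> set of supporting sources
    for name, triples in sources:
        for triple in triples:
            kg[triple].add(name)
    return kg

# Two illustrative sources with one redundant observation between them.
literature = ("literature", [
    ("aspirin", "treats", "headache"),
    ("aspirin", "binds", "PTGS2"),
])
clinical = ("clinical", [
    ("aspirin", "treats", "headache"),   # redundant across sources
    ("aspirin", "causes", "ulcer"),
])

kg = fuse(literature, clinical)
```

Edges supported by multiple independent sources carry more evidence, which is one simple way contradictory or uncertain observations can later be weighed.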

Ultimately, we believe that science and medicine are among the most exciting areas for machine learning with many hard problems and applications of immense impact. For this reason, we build high-quality open-source data repositories whenever possible to bring biomedical data closer to other scientists who can now readily use these datasets in their research.

Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities
A Comprehensive Structural, Biochemical and Biological Profiling of the Human NUDIX Hydrolase Family
Gene Prioritization by Compressive Data Fusion and Chaining
Gene Network Inference by Fusing Data from Diverse Distributions
Data Fusion by Matrix Factorization

Learning trustworthy representations for complex systems and never-before-seen phenomena

[Algorithms and methods: Graph representation learning, graph neural networks, few-shot learning, transfer learning]

The success of machine learning depends heavily on the choice of data features to which the methods are applied. For that reason, much of the actual effort in deploying algorithms goes into engineering features that support effective machine learning. We have already made substantial progress on developing representation learning methods that expand the scope and ease the applicability of machine learning in science and medicine.

The challenge, however, is that prevailing deep and representation learning algorithms are designed for data with a regular, grid-like structure (e.g., images have a 2D grid structure and sequences have a linear 1D structure). These algorithms are unable to truly exploit complex, interconnected data with irregular interactions between entities, i.e., edges, the essence of graphs. We are developing methods to address these challenges. At the technical core of our methods is the notion of vector space embeddings. We formalize this idea by specifying deep transformation functions, or graph neural networks, that map nodes, or larger graph structures, to points in a low-dimensional space, termed embeddings. Importantly, these functions are optimized to embed the input network so that performing algebraic operations in this learned space reflects the topology of the network.
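The core operation behind such graph neural networks is neighborhood aggregation: each node's embedding is updated from its own features together with an aggregate of its neighbors' features. A bare-bones round of this, with identity weights for clarity (a trained model would learn transformation matrices), looks like:

```python
# One round of neighborhood aggregation, the core operation of a graph
# neural network: each node's new embedding combines its own features
# with the mean of its neighbors' features.

def message_pass(adj, feats):
    """adj: node -> list of neighbors; feats: node -> feature vector."""
    new_feats = {}
    for v, x in feats.items():
        nbrs = adj.get(v, [])
        if nbrs:
            agg = [sum(feats[u][i] for u in nbrs) / len(nbrs)
                   for i in range(len(x))]
        else:
            agg = [0.0] * len(x)
        # Combine self and neighborhood information (simple average here).
        new_feats[v] = [(xi + ai) / 2 for xi, ai in zip(x, agg)]
    return new_feats

# Toy 3-node path graph a - b - c with 1-D node features.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
feats = {"a": [1.0], "b": [0.0], "c": [1.0]}
emb = message_pass(adj, feats)
```

Stacking several such rounds lets information flow across multi-hop neighborhoods, which is how embeddings come to reflect the network's topology.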

Our research has pioneered graph neural networks in bioinformatics and deep learning for network biology and medicine. This allowed us to apply neural networks much more broadly and to set our sights on new frontiers beyond classic applications of neural networks to images and sequences. We show, for example, how embeddings enable the repurposing of drugs for new indications and the discovery of dozens of drug combinations that are safe in patients, with considerably fewer unwanted side effects than today’s treatments. Further, we have shown that embeddings allow for accurate molecular phenotyping, identifying drug targets, disease proteins, molecular functions, and other phenotypes better than much more complex algorithms.

We are also actively developing machine learning methods for learning representations that are actionable—lend themselves to actionable hypotheses—and allow users of our models to ask what-if questions and receive predictions that are accurate, precise, robust and can be interpreted meaningfully.

Within the context of biomedical data, we are looking to advance algorithms that learn more from less data, exploit the ability of models to transfer predictive power acquired on one data type to another, and design contextually adaptive AI that can learn and reason about never-before-seen systems as it encounters new tasks and situations (e.g., new patients, diseases, or cell types).
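One standard way to learn from very few labeled examples, as in the few-shot setting mentioned above, is a prototype (nearest-class-mean) classifier over learned embeddings: each class is summarized by the mean of its handful of labeled embeddings, and a new point is assigned to the nearest prototype. The class names and data below are illustrative.

```python
# Prototype (nearest-class-mean) classification: a simple few-shot scheme
# that needs only a handful of labeled embedding vectors per class.

def class_prototypes(support):
    """support: class -> list of embedding vectors; returns per-class means."""
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs)
                         for i in range(dim)]
    return protos

def classify(protos, x):
    """Assign x to the class whose prototype is nearest (squared Euclidean)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(protos, key=lambda c: dist2(protos[c], x))

# Two hypothetical "cell types", each with three labeled 2-D embeddings.
support = {
    "type_A": [[0.9, 0.1], [1.1, 0.0], [1.0, 0.2]],
    "type_B": [[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]],
}
protos = class_prototypes(support)
```

Because prototypes are just averages, a never-before-seen class can be added at inference time from a few examples, without retraining the embedding model.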

Modeling Polypharmacy Side Effects with Graph Convolutional Networks
To Embed or Not: Network Embedding as a Paradigm in Computational Biology
Predicting Multicellular Function Through Multi-Layer Tissue Networks
GNNExplainer: Generating Explanations for Graph Neural Networks
Embedding Logical Queries on Knowledge Graphs
Learning Structural Node Embeddings via Diffusion Wavelets


MITxHarvard Women in Artificial Intelligence [YouTube Interview]


Slides: Graph Neural Networks in Computational Biology
Recording: Graph Neural Networks in Computational Biology

Reasoning about interconnected biology and medicine

[Algorithms and methods: Network science, model interpretability and explanations]

Networks, or graphs, pervade biomedical data—from the molecular level to connections between diseases in a person, all the way to the societal level encompassing all human interactions. These interactions at different levels give rise to a bewildering degree of complexity that is likely to be fully understood only through a holistic, integrated systems view and the study of combined, multi-level networks.

We are focusing on designing algorithmic solutions to optimize and manipulate networked systems for useful purposes and to predict their behavior, such as how genomics—nature’s experiments on people—influences human traits in the context of a particular environment. For example, using protein-protein interaction data that have only recently become available, we composed and analyzed interactome networks from 1,840 species across the tree of life, up from about five species in previous studies. This unique dataset allowed us to conduct the largest study of protein interactomes to date and to quantify their resilience, a critical property because the breakdown of proteins can lead to cell death or disease. The analysis shows, for the first time, how interactomes change during evolution and how these changes affect their response to environmental unpredictability.
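The notion of resilience used above can be illustrated with a standard percolation-style measure: remove a fraction of proteins (nodes) at random and check how much of the surviving interactome remains in the largest connected component. This self-contained sketch is an illustration of that measure, not the study's exact formulation.

```python
# Sketch of interactome resilience: after removing a fraction of nodes,
# what fraction of the surviving nodes still sit in the largest
# connected component? Pure-Python DFS keeps it self-contained.

import random

def largest_component_size(nodes, edges):
    """Size of the largest connected component of an undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    seen, best = set(), 0
    for s in nodes:
        if s in seen:
            continue
        stack, comp = [s], 0
        seen.add(s)
        while stack:
            v = stack.pop()
            comp += 1
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        best = max(best, comp)
    return best

def resilience(nodes, edges, frac_removed, seed=0):
    """Fraction of surviving nodes that remain in the giant component."""
    rng = random.Random(seed)
    keep = [v for v in nodes if rng.random() > frac_removed]
    if not keep:
        return 0.0
    kept = set(keep)
    kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return largest_component_size(keep, kept_edges) / len(keep)

# Toy interactome: a 10-node ring.
ring = list(range(10))
ring_edges = [(i, (i + 1) % 10) for i in ring]
```

A resilient network keeps this fraction near 1 even as more nodes fail; a fragile one fragments quickly.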

We are also actively developing methods that harness rich interaction data and network dynamics, scaling up our analyses to reveal structure in massive amounts of data that is too complex to be detected with other methods.

Evolution of Resilience in Protein Interactomes Across the Tree of Life
Network Enhancement as a General Method to Denoise Weighted Biological Networks
Large-Scale Analysis of Disease Pathways in the Human Interactome

Accelerating discovery of safer and more effective medicines

For centuries, the scientific method—the fundamental practice of science that scientists use to explain the natural world systematically and logically—has remained largely the same. We have already made substantial progress on using machine learning to change that.

Within the context of drug development, we develop methods that accelerate the discovery of safer and more effective medicines. It can take 15 years and cost $1 billion for a new drug to reach patients, in part because identifying which diseases a new drug (compound) could treat is tremendously complex. However, diseases are not independent of each other, and many genes are shared between often quite distinct diseases. Similarly, the effects of drugs are not limited to the molecules to which they directly bind in the body; instead, these effects spread throughout the biological networks in which they act. The effect of a drug on a disease is therefore inherently a network phenomenon. We are leveraging this understanding to develop AI assistants for drug discovery and development.
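One widely used way to make the "drug action is a network phenomenon" idea concrete is network proximity: measure the average shortest-path distance in the interactome from a drug's targets to a disease's proteins, with nearby drugs being better repurposing candidates. This is an illustrative sketch of that idea with hypothetical protein names, not the lab's exact formulation.

```python
# Network proximity sketch: for each drug target, find the graph distance
# to the closest disease protein, then average over targets.

from collections import deque

def bfs_distances(adj, source):
    """Breadth-first search: hop distances from source to all reachable nodes."""
    dist = {source: 0}
    q = deque([source])
    while q:
        v = q.popleft()
        for u in adj.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

def proximity(adj, drug_targets, disease_proteins):
    """Mean over targets of the distance to the nearest disease protein."""
    total = 0.0
    for t in drug_targets:
        d = bfs_distances(adj, t)
        total += min(d.get(p, float("inf")) for p in disease_proteins)
    return total / len(drug_targets)

# Toy interactome (protein names are illustrative).
adj = {
    "T1": ["A"],
    "A": ["T1", "D1"],
    "D1": ["A", "D2"],
    "D2": ["D1"],
}
score = proximity(adj, ["T1"], ["D1", "D2"])
```

In practice such raw distances are compared against a random expectation over degree-matched node sets, so that proximity scores are comparable across drugs and diseases.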

Network Medicine Framework for Identifying Drug Repurposing Opportunities for COVID-19
Modeling Polypharmacy Side Effects with Graph Convolutional Networks
Prioritizing Network Communities
Pre-training Graph Neural Networks
Discovering Novel Cell Types Across Heterogeneous Single-Cell Experiments
Gene Prioritization by Compressive Data Fusion and Chaining
Matrix Factorization-based Data Fusion for Drug-induced Liver Injury Prediction
Discovering Disease-disease Associations by Fusing Systems-level Molecular Data
Collective Pairwise Classification for Multi-way Analysis of Disease and Drug Data


Bayer Foundation Report (Bayer’s Early Excellence in Science Award)
Machine Learning for Drug Development
FutureDose Tech
MIT AI Cures Blog

AI infrastructure and AI-ready datasets for biomedical discovery & actionable AI

Compared to application areas such as natural language processing and computer vision, the ML community's attention to therapeutics remains relatively limited, even though therapeutics offers many hard algorithmic problems and applications of immense impact. We posit that this is due to the following key challenges:

  1. High-quality therapeutic datasets and biomedical knowledge are scarce, which prevents researchers from formulating relevant therapeutic tasks as solvable machine-learning problems; the challenge is how to computationally operationalize these data to make them amenable to ML.
  2. Datasets are of many different types, including experimental readouts, curated annotations, and metadata, and are scattered across biorepositories; the challenge for non-domain experts is how to identify, process, and curate datasets relevant to a task of interest.
  3. Despite promising prediction accuracies, the practical use of computational models, such as for rare diseases and novel drugs in development, remains hindered; the challenge is how to assess algorithmic advances in a manner that allows for robust and fair model comparison and reflects what one would expect in real-world deployment or clinical implementation.

To address these challenges, we founded Therapeutics Data Commons (TDC) [Slides], an open-science platform with AI/ML-ready datasets and learning tasks spanning the entire range of therapeutics. Datasets and benchmarks in TDC provide a systematic model development and evaluation framework that allows more machine learning researchers to contribute to the field. We envision that TDC can considerably accelerate machine-learning model development, validation, and transition into production and clinical implementation.
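The benchmarking discipline that TDC standardizes, fixed deterministic splits and a shared metric so that any two models are compared fairly, can be sketched in a self-contained way. The toy dataset, split fractions, and models below are illustrative; the actual datasets and API are provided by the `tdc` Python package.

```python
# Sketch of benchmark-style evaluation: every model sees the same
# deterministic train/valid/test split and is scored with the same metric,
# which is what makes reported numbers comparable across papers.

import random

def fixed_split(records, seed=42, frac=(0.7, 0.1, 0.2)):
    """Deterministic train/valid/test split; the fixed seed is the key."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(frac[0] * n)
    n_valid = int(frac[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def mae(model, test):
    """Shared metric: mean absolute error on the held-out test split."""
    return sum(abs(model(x) - y) for x, y in test) / len(test)

# Toy regression task: predict y = 3x from scalar inputs.
data = [(x, 3.0 * x) for x in range(100)]
train, valid, test = fixed_split(data)

baseline = lambda x: 0.0        # trivial reference model
better = lambda x: 3.0 * x      # model that solves the toy task
scores = {"baseline": mae(baseline, test), "better": mae(better, test)}
```

Because the split and metric are fixed once per benchmark rather than chosen per paper, improvements in `scores` reflect modeling progress rather than evaluation choices.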

Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics
NIMFA: A Python Library for Non-negative Matrix Factorization


Therapeutics Data Commons

Zitnik Lab  ·  Harvard  ·  Department of Biomedical Informatics