Research Directions

Our research develops AI and machine learning methods to address grand challenges in science and medicine. In doing so, we:

  1. Invent ways to infuse knowledge & structure into AI models to reduce uncertainty and enable generalization to new scenarios not seen during training.
  2. Develop methods that produce actionable & trustworthy representations and can reason over massive datasets.
  3. Translate machine learning research into innovative applications.

Our research strategy is to create foundational models (e.g., pre-trained models, self-supervised models, general-purpose models, multi-purpose models, and multi-modal models) that are trained on broad data at scale and can be adapted to a wide range of downstream tasks. This research opens new avenues for understanding network and disease biology and developing safe & effective medicines, and it can fundamentally change how predictive modeling is performed today.

Knowledge-guided AI: Fusing biomedical knowledge and patient data

Methods: Knowledge graphs, multi-modal learning, foundational models

Soon, the state of a person will be characterized with increasing precision by incorporating data modalities like genetic code, behaviors, therapeutics, nutrients, and the environment—the challenge is how to computationally operationalize these data to make them amenable to decision making.

Further, data are of many different types, including experimental readouts, curated annotations, and metadata—no single data type can capture all the factors necessary to understand a phenomenon such as a disease. These high-dimensional datasets lead to far more complex characterizations than are currently in use, requiring fundamentally new approaches.

To this end, we invent methods to fuse rich, heterogeneous data into knowledge graphs to bridge the divide between biomedical research and patient data. This approach combines data in their broadest sense to reduce redundancy, resolve contradictory observations, and model uncertainty. Using our methods, we have, for example, constructed one of the largest biological networks ever created, with over 2.3 billion edges and more than 2,000 modes.
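The fusion step can be sketched in miniature. The following is an illustrative simplification, not the lab's actual pipeline: heterogeneous (source, head, relation, tail) records are merged into one multi-modal knowledge graph, with redundant observations collapsed into single edges that track their supporting sources. All node and edge types below (gene, drug, disease; "associates", "treats") are hypothetical examples.

```python
from collections import defaultdict

def build_knowledge_graph(records):
    """Merge typed triples from multiple sources into one knowledge graph."""
    nodes = {}                 # node id -> node type
    edges = defaultdict(set)   # (head, relation, tail) -> supporting sources
    for source, head, head_type, relation, tail, tail_type in records:
        nodes[head] = head_type
        nodes[tail] = tail_type
        # The same triple seen in several sources collapses to one edge,
        # so redundant observations are merged rather than duplicated.
        edges[(head, relation, tail)].add(source)
    return nodes, dict(edges)

records = [
    ("GWAS",     "BRCA1",    "gene", "associates", "breast cancer", "disease"),
    ("curated",  "BRCA1",    "gene", "associates", "breast cancer", "disease"),
    ("DrugBank", "olaparib", "drug", "treats",     "breast cancer", "disease"),
]
nodes, edges = build_knowledge_graph(records)
print(len(nodes), len(edges))  # 3 nodes, 2 merged edges
```

Tracking the set of supporting sources per edge is also what makes it possible to down-weight edges backed by a single noisy source when modeling uncertainty.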

Ultimately, we believe that science and medicine are among the most exciting areas for machine learning, with many hard problems and applications of immense impact. For this reason, we build high-quality open-source data repositories whenever possible to bring biomedical data closer to other scientists who can now readily use these datasets in their research.

Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities
A Comprehensive Structural, Biochemical and Biological Profiling of the Human NUDIX Hydrolase Family
Gene Prioritization by Compressive Data Fusion and Chaining
Gene Network Inference by Fusing Data from Diverse Distributions
Data Fusion by Matrix Factorization

Graph AI: Learning trustworthy representations for complex and networked systems

Methods: Graph representation learning, graph neural networks, geometric deep learning, few-shot learning, transfer learning

The success of machine learning is heavily dependent on the choice of data features to which the methods are applied. For that reason, much of the actual effort in deploying algorithms goes into engineering features that support effective machine learning. We have already made substantial progress in developing representation learning methods that expand the scope and ease the applicability of machine learning in science and medicine.

The challenge, however, is that prevailing deep and representation learning algorithms are designed for data with a regular, grid-like structure (e.g., images have a 2D grid structure, and sequences have a linear 1D structure). As a result, these algorithms cannot truly exploit complex, interconnected data with irregular interactions between entities, i.e., edges, the essence of graphs. We are developing methods to address these challenges. The notion of vector space embeddings is at the technical core of these methods. We formalize this idea by specifying deep transformation functions, or graph neural networks, that map nodes, or larger graph structures, to points in a low-dimensional space, termed embeddings. Importantly, these functions are optimized so that algebraic operations in the learned embedding space reflect the topology of the input network.
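The core transformation can be illustrated with one layer of neighborhood aggregation, written here in plain Python for clarity. This is a minimal sketch of mean-aggregation message passing, not a full graph neural network: real models stack several such layers with learned weight matrices and nonlinearities, whereas this example uses fixed averaging.

```python
def gnn_layer(adjacency, embeddings):
    """Update each node's embedding by averaging it with its neighbors'.

    adjacency:  node -> list of neighbor nodes
    embeddings: node -> list of floats (current embedding vector)
    """
    updated = {}
    for node, h in embeddings.items():
        # Messages come from the node's neighbors, plus the node itself
        # (a self-loop), so information flows along the graph's edges.
        msgs = [embeddings[n] for n in adjacency.get(node, [])] + [h]
        dim = len(h)
        updated[node] = [sum(m[i] for m in msgs) / len(msgs) for i in range(dim)]
    return updated

# Toy undirected chain: A - B - C
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
h0 = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 0.0]}
h1 = gnn_layer(adj, h0)
print(h1["A"])  # [0.5, 0.5] -- A now reflects its 1-hop neighborhood
```

Stacking k such layers lets each node's embedding summarize its k-hop neighborhood, which is how the learned space comes to reflect network topology.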

Our research has pioneered graph neural networks in bioinformatics and deep learning for network biology and medicine. This allowed us to apply neural networks much more broadly and to set our sights on new frontiers beyond classic applications of neural networks that learn from images and sequences. We show, for example, how embeddings enable the repurposing of drugs for new indications and the discovery of dozens of safe drug combinations in patients, with considerably fewer unwanted side effects than today's treatments. Further, embeddings allow for accurate molecular phenotyping, identifying drug targets, disease proteins, molecular functions, and other phenotypes better than much more complex algorithms.
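The repurposing use case reduces to link prediction in embedding space: a candidate (drug, disease) pair is scored by the similarity of the two embeddings. The sketch below is a hedged illustration of that scoring step only; the embeddings are made-up toy vectors, not outputs of a trained model, and the drug/disease names are placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings standing in for the output of a trained graph model.
drug_emb = {"drug_X": [0.9, 0.1], "drug_Y": [0.1, 0.9]}
disease_emb = {"disease_Z": [0.8, 0.2]}

# Rank all drugs against the disease; a higher score suggests a
# stronger repurposing candidate for downstream validation.
ranked = sorted(drug_emb,
                key=lambda d: cosine(drug_emb[d], disease_emb["disease_Z"]),
                reverse=True)
print(ranked)  # ['drug_X', 'drug_Y']
```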

We are also actively developing machine learning methods for learning actionable representations—representations that lend themselves to actionable hypotheses—allowing users of our models to ask what-if questions and receive predictions that are accurate, precise, robust, and meaningfully interpretable.

Within the context of biomedical data, we aim to advance algorithms that learn from less data, transfer predictive power acquired on one data type to another, and realize contextually adaptive AI that can learn and reason about never-before-seen systems as it encounters new tasks and situations (e.g., new patients, diseases, or cell types).
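One way to make the "learn from less data" goal concrete is a few-shot classification sketch in the style of prototypical networks (an illustrative stand-in, not the lab's specific method): a never-before-seen class, such as a new cell type, is recognized from only a handful of labeled examples by comparing a query embedding to per-class centroids. The embeddings and class names below are toy placeholders.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def classify(query, support):
    """Assign `query` to the class whose prototype (centroid of a few
    labeled support examples) is nearest in squared Euclidean distance."""
    prototypes = {label: centroid(examples) for label, examples in support.items()}
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))

# Two labeled embeddings per new cell type are enough to form prototypes.
support = {"cell_type_A": [[0.9, 0.1], [1.1, 0.0]],
           "cell_type_B": [[0.0, 1.0], [0.2, 0.8]]}
print(classify([0.8, 0.2], support))  # cell_type_A
```

Because the prototypes are computed on the fly, adding a brand-new class requires no retraining, only a few labeled examples.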

Modeling Polypharmacy Side Effects with Graph Convolutional Networks
To Embed or Not: Network Embedding as a Paradigm in Computational Biology
Predicting Multicellular Function Through Multi-Layer Tissue Networks
GNNExplainer: Generating Explanations for Graph Neural Networks
Embedding Logical Queries on Knowledge Graphs
Learning Structural Node Embeddings via Diffusion Wavelets


MITxHarvard Women in Artificial Intelligence [YouTube Interview]


Slides: Graph Neural Networks in Computational Biology
Recording: Graph Neural Networks in Computational Biology

Actionable AI: Reasoning about interconnected biology and medicine

Methods: Heterogeneous and multi-scale networks, model interpretability, graph explainability

Networks, or graphs, pervade biomedical data—from the molecular level to the level of connections between diseases in a person and the societal level encompassing all human interactions. These interactions at different levels give rise to a bewildering degree of complexity that is likely to be fully understood only through a holistic, integrated systems view and the study of combined, multi-level networks.

We focus on designing algorithmic solutions to optimize and manipulate networked systems for useful purposes and to predict their behavior, such as how genomics influences human traits in a particular environment. For example, using protein-protein interaction data that have only recently become available, we composed and analyzed interactome networks from 1,840 species across the tree of life, up from roughly five species in previous studies. This unique dataset allowed us to conduct the largest-ever study of protein interactomes and to quantify interactome resilience—a critical property, as the breakdown of proteins can lead to cell death or disease. The study showed, for the first time, how interactomes change during evolution and how these changes shape their response to environmental unpredictability.
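One common way to make resilience concrete is to ask how much of the network stays connected as proteins are removed. The sketch below is an illustrative simplification (the published study uses a more refined resilience measure), run on a toy hub-and-spoke interaction network: losing a peripheral protein barely matters, while losing the hub shatters the network.

```python
def largest_component_fraction(adjacency, removed):
    """Fraction of surviving nodes in the largest connected component
    after the nodes in `removed` fail."""
    alive = set(adjacency) - removed
    best, seen = 0, set()
    for start in alive:
        if start in seen:
            continue
        # Depth-first traversal over surviving nodes only.
        stack, comp = [start], {start}
        while stack:
            node = stack.pop()
            for nb in adjacency[node]:
                if nb in alive and nb not in comp:
                    comp.add(nb)
                    stack.append(nb)
        seen |= comp
        best = max(best, len(comp))
    return best / len(alive) if alive else 0.0

# Toy protein-protein interaction network: one hub bound to four proteins.
ppi = {"hub": {"p1", "p2", "p3", "p4"},
       "p1": {"hub"}, "p2": {"hub"}, "p3": {"hub"}, "p4": {"hub"}}
print(largest_component_fraction(ppi, removed=set()))    # 1.0
print(largest_component_fraction(ppi, removed={"hub"}))  # 0.25 -- network shatters
```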

We also actively develop methods to harness rich interaction data and network dynamics. We scale up the analyses to see structure in massive amounts of data that are too complex to be detected with other methods.

Evolution of Resilience in Protein Interactomes Across the Tree of Life
Network Enhancement as a General Method to Denoise Weighted Biological Networks
Large-Scale Analysis of Disease Pathways in the Human Interactome

AI4Science: Accelerating the discovery of safe and effective medicines

For centuries, the scientific method—the fundamental practice of science that scientists use to explain the natural world systematically and logically—has remained essentially the same. We have already made substantial progress in using machine learning to change that.

Within the context of drug development, we develop methods that accelerate the discovery of safer and more effective medicines. It can take 15 years and cost $1 billion for a new drug to reach patients, in part because identifying which diseases a new drug (compound) could treat is tremendously complex. However, diseases are not independent of each other, and many genes are shared between often quite distinct conditions. Similarly, the effects of drugs are not limited to the molecules to which they directly bind in the body; instead, these effects spread throughout the molecular networks in which they act. The effect of a drug on a disease is therefore inherently a network phenomenon. We leverage this understanding to develop AI assistants for drug discovery and development.
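The "network phenomenon" intuition can be sketched with a simplified proximity score: a drug-disease pair is rated by the average shortest-path distance from the drug's protein targets to the nearest disease protein in the interactome. This is a hypothetical, stripped-down version of network-proximity measures, not the lab's specific method, and the interactome and gene sets below are toy data.

```python
from collections import deque

def shortest_dist(adjacency, source, targets):
    """Breadth-first-search distance from `source` to the nearest node in `targets`."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        if node in targets:
            return d
        for nb in adjacency[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return float("inf")  # no path: drug targets and disease module are disconnected

def proximity(adjacency, drug_targets, disease_genes):
    """Average distance from each drug target to the closest disease protein;
    smaller values suggest the drug acts near the disease module."""
    return sum(shortest_dist(adjacency, t, disease_genes)
               for t in drug_targets) / len(drug_targets)

# Toy interactome: a chain a - b - c - d.
ppi = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(proximity(ppi, drug_targets={"a"}, disease_genes={"c"}))  # 2.0
```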

Network Medicine Framework for Identifying Drug Repurposing Opportunities for COVID-19
Modeling Polypharmacy Side Effects with Graph Convolutional Networks
Prioritizing Network Communities
Pre-training Graph Neural Networks
Discovering Novel Cell Types Across Heterogeneous Single-Cell Experiments
Gene Prioritization by Compressive Data Fusion and Chaining
Matrix Factorization-based Data Fusion for Drug-induced Liver Injury Prediction
Discovering Disease-disease Associations by Fusing Systems-level Molecular Data
Collective Pairwise Classification for Multi-way Analysis of Disease and Drug Data


Bayer Foundation Report (Bayer’s Early Excellence in Science Award)
Machine Learning for Drug Development
FutureDose Tech
MIT AI Cures Blog

Modern data management: AI infrastructure and AI-ready datasets for biomedical discovery

The attention of the ML community to therapeutics remains relatively limited compared to application areas such as natural language processing and computer vision, even though therapeutics offer many challenging problems and applications of immense impact. We posit that this is due to the following key challenges:

  1. The lack of high-quality therapeutic datasets and biomedical knowledge prevents researchers from formulating relevant therapeutic tasks as solvable machine-learning problems—the challenge is how to computationally operationalize these data to make them amenable to ML.
  2. Datasets are of many different types, including experimental readouts, curated annotations, and metadata, and are scattered across biorepositories—the challenge for non-domain experts is how to identify, process, and curate datasets relevant to a task of interest.
  3. Despite promising prediction accuracies of computational models, their use in practice, such as for rare diseases and novel drugs in development, is hindered—the challenge is how to assess algorithmic advances in a manner that allows for robust and fair model comparison and represents what one would expect in a real-world deployment or clinical implementation.

To address these challenges, we founded Therapeutics Data Commons (TDC) [Slides], an open-science data platform with AI/ML-ready datasets and learning tasks for therapeutics. Therapeutics Data Commons is a collection of machine learning tasks spread across the entire range of therapeutics. Datasets and benchmarks in TDC provide a systematic model development and evaluation framework that allows more machine learning researchers to contribute to the field. We envision that TDC can considerably accelerate machine-learning model development, validation, and transition into production and clinical implementation.
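The kind of systematic evaluation such benchmarks enable can be sketched in miniature: every model is scored on the same fixed test split across several random seeds and reported as mean ± standard deviation, so comparisons are robust and fair. The model, data, and seeds below are stand-ins for illustration, not TDC's actual API or datasets.

```python
import random
import statistics

def evaluate(model_fn, test_set, seeds=(0, 1, 2)):
    """Score a model on a fixed test split over several seeds."""
    scores = []
    for seed in seeds:
        random.seed(seed)          # the seed controls any stochastic step
        predict = model_fn(seed)
        correct = sum(predict(x) == y for x, y in test_set)
        scores.append(correct / len(test_set))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy binary task with a deliberately noisy baseline model.
test_set = [((i,), i % 2) for i in range(100)]

def noisy_model(seed):
    def predict(x):
        # Flip the true label ~20% of the time to mimic model error.
        return x[0] % 2 if random.random() > 0.2 else 1 - x[0] % 2
    return predict

mean, std = evaluate(noisy_model, test_set)
print(f"accuracy: {mean:.2f} +/- {std:.2f}")
```

Fixing both the split and the seed set is what makes a leaderboard entry reproducible: two groups running the same protocol obtain the same numbers.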

Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics
NIMFA: A Python Library for Non-negative Matrix Factorization


Therapeutics Data Commons

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics