Tutorials, Workshops, and Symposia

Research Tutorials

Machine Learning for Drug Development

Machine learning methods leverage big datasets to support decision-making in all stages of drug development, predict how drugs affect the human body and how they interact with each other, and seek ways to boost clinical trials and detect unwanted side effects. This tutorial covers generative modeling, reinforcement learning, and representation learning with a focus on theoretical foundations of methods and their use for key drug-related problems.

A variety of machine learning methods are demonstrating their utility at all stages of drug development. These methods use big datasets created from high-throughput screening data and allow prediction of bioactivities for targets and molecular properties, identification of new molecules and repurposing of old drugs with increased levels of accuracy.

We have only just begun to realize the potential of these techniques. If methods were available for all aspects of drug development, they could be used seamlessly to predict whether a chemical compound is likely to ultimately become a drug used in patients. Although much research needs to be done before this vision can be realized, modern machine learning may have a fundamental impact on the way drug development is done.

The general process of drug development involves five steps. In short, molecular compounds are filtered through a progressive series of tests, which determine their properties, toxicity, and effectiveness for later stages. Machine learning is increasingly being used to accelerate each of the steps, creating opportunities for reducing resources and time needed to develop new drugs. In this tutorial, we cover key problems in drug development that are amenable to machine learning. In doing so, we present a toolbox of AI algorithms for end-to-end drug development.

This tutorial was presented at the International Joint Conference on Artificial Intelligence (IJCAI).


Deep Learning for Network Biology

Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from molecular to the biome. This tutorial investigates key advancements in representation learning for networks over the last few years, with an emphasis on fundamentally new opportunities in network biology enabled by these advancements.

Biological networks are powerful resources for the discovery of interactions and emergent properties in biological systems, ranging from single-cell to population level. Network approaches have been used many times to combine and amplify signals from individual genes, and have led to remarkable discoveries in biology, including drug discovery, protein function prediction, disease diagnosis, and precision medicine. Furthermore, these approaches have shown broad utility in uncovering new biology and have contributed to new discoveries in wet laboratory experiments.

The mathematical machinery central to these approaches is machine learning on networks. The main challenge in machine learning on networks is to find a way to extract information about interactions between nodes and to incorporate that information into a machine learning model. To extract this information from networks, classic machine learning approaches often rely on summary statistics (e.g., degrees or clustering coefficients) or carefully engineered features to measure local neighborhood structures (e.g., network motifs). These classic approaches can be limited because hand-engineered features are inflexible, often do not generalize to networks derived from other organisms, tissues, and experimental technologies, and can fail on datasets with low experimental coverage.
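As a minimal sketch of the summary statistics mentioned above, the following Python snippet computes node degree and the local clustering coefficient on a small undirected network. The toy edge list and the helper names (`build_adjacency`, `degree`, `clustering_coefficient`) are assumptions made for illustration and are not taken from the tutorial materials.

```python
from itertools import combinations

# Hypothetical toy interaction network (undirected); not real biological data.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def build_adjacency(edge_list):
    """Store the undirected graph as a dict mapping node -> set of neighbors."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, node):
    """Number of neighbors of a node."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbors; 0.0 by convention here
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


adj = build_adjacency(edges)
# Hand-engineered feature vector per node: (degree, clustering coefficient).
features = {n: (degree(adj, n), clustering_coefficient(adj, n)) for n in sorted(adj)}
for node, feats in features.items():
    print(node, feats)
```

Features like these are fixed in advance, which is one way to see the inflexibility described above: nothing in them adapts to the prediction task or to a network from a different organism or technology.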

Recent years have seen a surge in graph neural network (GNN) approaches that automatically learn to encode network structure into low-dimensional representations, using transformation techniques based on deep learning and nonlinear dimensionality reduction. The idea behind these representation learning approaches is to learn a data transformation function that maps nodes to points in a low-dimensional vector space; these points are termed embeddings. Representation learning methods have revolutionized the state of the art in network science, and the goal of this tutorial is to open the door for these methods to computational biology and bioinformatics.
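To make the idea of mapping nodes to embeddings concrete, here is a rough Python sketch of a single neighborhood-aggregation step of the kind used in graph neural networks. It is an illustration under stated assumptions, not the method presented in the tutorial: the toy graph, the one-hot input features, and the weight matrix `W` are all hypothetical, and `W` is fixed by hand here, whereas a real GNN would learn it from data.

```python
import math

# Hypothetical toy network as adjacency lists (undirected).
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
nodes = sorted(adj)

# One-hot input features: node i gets a 1.0 in position i.
x = {n: [1.0 if i == j else 0.0 for j in range(len(nodes))]
     for i, n in enumerate(nodes)}

# Hypothetical fixed 4 x 2 weight matrix; a trained model would learn these values.
W = [[0.5, -0.2],
     [0.1, 0.4],
     [-0.3, 0.2],
     [0.6, 0.1]]


def mean_aggregate(node):
    """Average the feature vectors of a node and its neighbors."""
    group = [node] + adj[node]
    dim = len(x[node])
    return [sum(x[m][d] for m in group) / len(group) for d in range(dim)]


def embed(node):
    """One aggregation step: mean over the neighborhood, linear map, tanh."""
    h = mean_aggregate(node)
    z = [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]
    return [math.tanh(v) for v in z]


# Each node is mapped to a point in a 2-dimensional vector space.
embeddings = {n: embed(n) for n in nodes}
for node, vec in embeddings.items():
    print(node, [round(v, 3) for v in vec])
```

In this sketch, nodes A and B receive the same embedding because they share the same closed neighborhood, which shows how the representation is driven by network structure rather than by hand-picked statistics.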

This tutorial was presented at the International Conference on Intelligent Systems for Molecular Biology (ISMB).


Biomedical Data Fusion

Because of the complex and interconnected nature of biomedical systems, any single model trained on any single dataset can touch only a small part of the entire biomedical knowledge. It is thus critical to integrate diverse sources of information to gain a comprehensive understanding of the system.

New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches.

The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation.

This tutorial was presented at the International Engineering in Medicine and Biology Conference (EMBC) and at the Basel Computational Biology Conference ([BC]^2).


International Workshops and Conferences

AI for Science

Machine learning has advanced a wide array of scientific disciplines and addressed many problems that previously could not be tackled computationally. Despite this promise, several key challenges remain open, and this workshop brings those gaps to the foreground of AI research.
  • Gap 1: Unrealistic methodological assumptions. While ML researchers strive for methodology advances, they often make unrealistic assumptions that limit real-world adoption. For example, most state-of-the-art molecule generation ML models generate molecules that have low synthesizability.

  • Gap 2: Overlooked scientific questions. Scientific communities contend with crucial and unsolved problems, but they are not yet formulated as solvable ML tasks and are thus overlooked by the ML community.

  • Gap 3: Limited exploration at the intersection of multiple disciplines. Solutions to grand challenges often stretch across multiple disciplines. For example, protein structure prediction requires collaboration across physics, chemistry and biology.

  • Gap 4: Science of science. Core principles of the scientific method have not changed since the 17th century. Can AI reason about the organizing principles of our world in a way that is complementary to the hypothesis-experiment cycle to understand a phenomenon?

  • Gap 5: Responsible use and development of AI for science. Interest in ML across scientific disciplines has surged, but few ML models have transitioned into practical scientific applications. We plan to present a roadmap and ultimately guidelines for accelerating the translation of ML in science. Translation requires a team of engaged stakeholders and a systematic process from the beginning (problem formulation) to the end (widespread deployment) of the ML-based research lifecycle.

This workshop was presented at the International Conference on Neural Information Processing Systems (NeurIPS).

National Symposium on Drug Repurposing for Future Pandemics

Pandemics demand safe and effective therapies developed and deployed at an unprecedented speed. This symposium, organized on behalf of the National Science Foundation (NSF), provides a forum for scientists and researchers from a variety of fields relevant to therapeutics. Participants discuss ways to expedite the development of therapies by compressing years of work into months or even weeks through automation, artificial intelligence and machine learning, novel data sources, and most recent biotechnology advancements.


The symposium brings together leading experts in computer science, biology, statistics, medicine, automation, and regulation. While these areas of expertise are necessary for rapid therapeutic innovation, there is seldom an opportunity for these experts to interact with each other.

Bearing in mind new opportunities and pressing challenges, the symposium provides a roadmap and puts forward recommendations on transforming today’s tools into ready-to-use solutions to fight future pathogens.

We announce a new initiative, Therapeutics Data Commons (TDC), at the symposium [Slides].


Representation Learning on Graphs and Manifolds

Many scientific fields study data with an underlying graph or manifold structure—such as social networks, sensor networks, biomedical knowledge graphs, and meshed surfaces in computer graphics. Recent years have seen a surge in research on these problems—often under the umbrella terms of graph representation learning and geometric deep learning.

The need for new optimization methods and neural network architectures that can accommodate these relational and non-Euclidean structures is becoming increasingly clear. In parallel, there is a growing interest in how we can leverage insights from these domains to incorporate new kinds of relational and non-Euclidean inductive biases into deep learning.

This workshop was presented at the International Conference on Learning Representations (ICLR).


Graph Representation Learning and Beyond

Recent years have seen a surge in research on graph representation learning, including techniques for deep graph embeddings, generalizations of CNNs to graph-structured data, and neural message-passing approaches. These advances in graph neural networks and related techniques have led to new state-of-the-art results in numerous domains: chemical synthesis, 3D-vision, recommender systems, question answering, continuous control, self-driving, and social network analysis.

This workshop was presented at the International Conference on Machine Learning (ICML).

Trustworthy AI for Healthcare

Artificial intelligence for healthcare has emerged as an active research area that has made considerable progress, including achieving human-level performance for skin cancer classification, diabetic eye disease detection, chest radiograph diagnosis, and sepsis treatment. While the trends are encouraging, many open challenges prevent us from directly deploying AI solutions in hospitals and clinical environments. A major open problem is the lack of trust of biomedical practitioners in AI methods. Many AI methods make predictions in a black-box way, making decisions challenging to understand and interpret. Further, today's methods are sensitive to small perturbations and adversarial attacks, raising numerous security and privacy concerns. Finally, AI methods learn to make decisions based on training data, which can include biased human decisions or reflect historical or social inequities. These challenges raise numerous trustworthiness issues that we need to address to realize the potential of AI in healthcare.

This workshop was presented at the AAAI Conference on Artificial Intelligence (AAAI).

AI in Health: Transferring and Integrating Knowledge for Better Health

Rich healthcare data connected by semantic relationships and integrated into knowledge graphs can drive biomedical discovery. Biomedical knowledge graphs can support better cohort identification for clinical trials, risk prediction, and precision diagnosis, and can inform new and better decision support workflows. The dramatic increase in healthcare data offers unprecedented opportunities for evidence-based care, yet challenges related to interoperability, learning, and reasoning over healthcare data remain open.

This workshop was presented at the Web Conference (WWW).

Research and Scholarship Meetings

PhD Forum

The PhD Forum provides an environment for junior PhD students to exchange ideas and experiences with peers in an interactive atmosphere and to get constructive feedback from senior researchers in data science, machine learning, and related areas.

This meeting took place at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).

Latest News

Oct 2021:   Adverse Drug Effects During the Pandemic

The COVID-19 pandemic has reshaped health and medicine in ways both dramatic and subtle. Some of the less obvious shifts can only emerge from analysis of millions of pieces of data—patient records, medical notes, clinical encounter reports. Check out the story in Harvard Medicine News highlighting our research.

Oct 2021:   Graph-Guided Networks for Time Series

New preprint! We introduce Raindrop, a graph-guided network for learning representations of irregularly sampled multivariate time series.

Oct 2021:   Massive Analysis of Differential Adverse Events

Hot off the press in Nature Computational Science! We develop an algorithmic approach for massive analysis of drug adverse events. Our analyses of 10,443,476 adverse event reports have implications for safe medication use and public health policy, and can enable comparison of the COVID-19 pandemic to other health emergencies.

Sep 2021:   Leveraging Cell Ontology to Classify Cell Types

Hot off the press in Nature Communications! We developed OnClass, an algorithm and accompanying software for automatically classifying cells into cell types that are part of the controlled vocabulary that forms the Cell Ontology.

Sep 2021:   Major New Release of TDC

We are very excited to announce a major release of Therapeutics Data Commons! In the 0.3.0 release we restructured the codebase, simplified the backend and kept user interfaces the same. We also provide detailed documentation for our TDC package.

Aug 2021:   Trustworthy AI for Healthcare at AAAI

We will be organizing a meeting on Trustworthy AI for Healthcare at AAAI 2022. Stay tuned for details and call for papers.

Aug 2021:   New Paper on Therapeutics Data Commons

Our latest paper on Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development will appear at NeurIPS. We are excited to contribute novel datasets and benchmarks in the broad area of therapeutics.

Aug 2021:   AI for Science at NeurIPS

We are organizing the AI for Science workshop at NeurIPS 2021 and have a stellar lineup of invited speakers.

Aug 2021:   Best Poster Award at ICML Comp Biology

Congratulations to Michelle for winning the Best Poster Award for her work on deep contextual learners for protein networks at the ICML Workshop on Computational Biology.

Jul 2021:   Best Paper Award at ICML Interpretable ML

Our short paper on Interactive Visual Explanations for Deep Drug Repurposing received the Best Paper Award at the ICML Interpretable ML in Healthcare Workshop. Stay tuned for more news on this evolving project.

Jul 2021:   Five presentations at ICML 2021

Jun 2021:   Theory and Evaluation for Explanations

We introduce the first axiomatic framework for theoretically analyzing, evaluating, and comparing GNN explanation methods. We formalize key properties that all methods should satisfy to generate reliable explanations: faithfulness, stability, and fairness.

Jun 2021:   Deep Contextual Learners for Protein Networks

New preprint on contextualized protein embeddings aims to characterize genes with disease-specific interactions and elucidate disease manifestation in specific cell types.

May 2021:   New Paper Accepted at UAI

Our unified framework for fair and stable graph representation learning has just been accepted at UAI. We establish a theoretical connection between counterfactual fairness and stability and use it in a framework that can be used with any GNN to learn fair and stable embeddings.

Apr 2021:   Hot Off the Press: COVID-19 Repurposing in PNAS

Hot off the press! We deployed AI/ML and network medicine algorithms to rank 6,340 drugs for their expected efficacy against SARS-CoV-2. We screened in human cells the top-ranked drugs, identifying six drugs that reduced viral infection, four of which could be repurposed to treat COVID-19.

Apr 2021:   Representation Learning for Biomedical Nets

In our survey on representation learning for biomedical networks we discuss how long-standing principles of network biology and medicine provide the conceptual grounding for representation learning, explain its successes, and inform future advances.

Mar 2021:   Receiving Amazon Research Award

We are excited about receiving an Amazon Faculty Research Award on Actionable Graph Learning for Finding Cures for Emerging Diseases. Thank you to Amazon Science for supporting our research.

Mar 2021:   Michelle's Graduate Research Fellowship

Michelle M. Li won the NSF Graduate Research Fellowship Award. Congratulations!

Mar 2021:   Hot Off the Press: Multiscale Interactome

Hot off the press! We develop a multiscale interactome approach to explain disease treatments. The approach can predict drug-disease treatments, identify proteins and biological functions related to treatment, and identify genes that alter a treatment’s efficacy and adverse reactions.

Mar 2021:   Graph Networks in Computational Biology

We are excited to share slides from our recent lecture on Graph Neural Networks in Computational Biology, which we gave at Stanford's ML for Graphs course.

Zitnik Lab  ·  Harvard  ·  Department of Biomedical Informatics