Machine Learning for Drug Development

IJCAI 2020 Tutorial

With remarkable successes of machine learning in a variety of application areas, we are witnessing an increasing interest in applications of machine learning to drug discovery and development.

In this tutorial, we cover key advancements in machine learning over the last few years, with an emphasis on fundamentally new opportunities in drug development enabled by these advancements. We are interested in why and how these advances can help drug-related tasks. We elaborate on the uses of machine learning in drug development through six key tasks: (a) synthesis prediction and de novo drug design, (b) molecular property prediction, (c) virtual drug screening and drug-target interactions, (d) clinical trial recruitment, (e) drug repurposing, and (f) adverse drug effects and polypharmacy.

We discuss theoretical foundations behind methods for these key drug-related tasks, illustrate various approaches based on different formulations, and summarize representative applications. We cover generative models, reinforcement learning, as well as very recent advancements in deep representation learning and embeddings. In doing so, we present a toolbox of AI algorithms for end-to-end drug development.



Drug discovery and development is a long and expensive process. It usually starts with experimental discovery of molecules and targets (i.e., de novo drug design), followed by validation of discoveries with in vitro experiments on cell lines and organoids and in vivo experiments on animals, before moving to clinical testing. The entire process from discovery to the regulatory approval of a new drug can take as long as 12 years and cost upwards of US$2.8 billion. Furthermore, huge uncertainty (1:5000 success rate) is associated with each drug development stage.

Machine learning methods have emerged as a promising tool to address these challenges and accelerate drug development. In the tutorial, we cover the following key drug-related tasks:

  1. Synthesis prediction and de novo drug design (i.e., designing an entirely new molecule from scratch) aim to generate chemically correct structures to assist in complex molecule synthesis.
  2. Molecular property prediction aims to identify therapeutic effects of molecules by predicting properties, such as potency, bioactivity, and toxicity, from the molecular data.
  3. Virtual drug screening and drug target identification aim to predict how drugs affect the human body by binding to target proteins and affecting their downstream activity.
  4. Clinical trial recruitment aims to identify the right doctors to help conduct the trials as well as find qualified patients to participate in the trials.
  5. Drug repurposing seeks to find new uses for known drugs, as well as for novel molecules, by exploiting chemical, target, and side-effect similarities among drugs and diseases.
  6. Adverse drug effects, polypharmacy, and drug-food interaction prediction aim to predict mechanisms causing adverse drug effects, suggest alternative drugs that achieve the intended pharmacological effects without negative health effects, and predict the effects of food constituents on interacting drugs.
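The similarity idea behind tasks such as drug repurposing can be sketched in a few lines. The example below is only an illustration, not a method taken from the tutorial materials: it ranks molecules by Tanimoto similarity over sets of feature IDs. The drug names, the feature sets, and the helper names `tanimoto` and `most_similar` are all hypothetical toy choices.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical toy "fingerprints": sets of substructure feature IDs.
# Real fingerprints would come from a cheminformatics toolkit.
fingerprints = {
    "drug_A": {1, 4, 7, 9},
    "drug_B": {1, 4, 7, 8},
    "drug_C": {2, 3, 5},
}


def most_similar(query: str, library: dict) -> list:
    """Rank every other molecule by similarity to the query, highest first."""
    q = library[query]
    scores = [
        (name, tanimoto(q, fp))
        for name, fp in library.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("drug_A", fingerprints)
print(ranking)
```

In this toy library, `drug_B` shares three of five distinct features with `drug_A` and therefore ranks first, while `drug_C` shares none.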

We then discuss key classes of methods for tackling these drug-related tasks:

  • Generative models. We focus on variational autoencoders (VAE) and generative adversarial networks (GAN), which are well suited for de novo molecule design. They take as input line-notation (e.g., SMILES strings) or graph-based representations of compounds with known therapeutic properties, encode the compounds into latent spaces, and then decode them into new drug samples.
  • Reinforcement learning. We mainly discuss policy gradient methods, which are state-of-the-art for molecule generation and can incorporate domain-specific knowledge about molecule synthesis.
  • Deep representation learning. We present major neural architectures for learning representations of drug-related data. These methods are relevant because they achieve state-of-the-art performance on drug-related tasks. For example, they have been used to automatically learn drug fingerprints, learn drug-protein binding affinity, and recruit patients into clinical trials. Further, graph embedding methods are used to study drug combinations and predict drug effects as they spread throughout biological networks beyond the molecules to which they directly bind.
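To make the policy gradient idea in the reinforcement learning item more concrete, here is a minimal REINFORCE sketch using only the Python standard library. It is a toy under stated assumptions, not the tutorial's implementation: the three-token vocabulary, the sequence length, and `toy_reward` (a stand-in for a real property scorer) are hypothetical, and a single categorical policy is shared across all positions instead of a neural sequence model.

```python
import math
import random

VOCAB = ["C", "N", "O"]  # hypothetical token set, not a real SMILES vocabulary
SEQ_LEN = 5              # hypothetical fixed sequence length


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(tokens):
    """Hypothetical stand-in for a property scorer: fraction of 'N' tokens."""
    return tokens.count("N") / len(tokens)


def train(episodes=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(VOCAB)  # one shared policy for every position
    baseline = 0.0               # running mean reward, reduces variance
    for _ in range(episodes):
        probs = softmax(logits)
        actions = rng.choices(range(len(VOCAB)), weights=probs, k=SEQ_LEN)
        reward = toy_reward([VOCAB[a] for a in actions])
        advantage = reward - baseline
        # REINFORCE: d log pi(a) / d logit_i = onehot(a)_i - probs_i
        for a in actions:
            for i in range(len(VOCAB)):
                grad = (1.0 if i == a else 0.0) - probs[i]
                logits[i] += lr * advantage * grad
        baseline += 0.05 * (reward - baseline)
    return dict(zip(VOCAB, softmax(logits)))


final_probs = train()
print(final_probs)
```

Because the reward favors "N", the learned policy shifts probability mass toward that token. In a realistic setting the reward would instead encode domain knowledge, such as chemical validity or a predicted property.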


Our half-day tutorial has the following agenda:

  • (30 min): Overview and introduction to machine learning for drug development
  • (60 min): Methods: Generative models, reinforcement learning, and deep representation learning and embeddings
  • (90 min): Applications to problems in drug development
  • (15 min): Future directions and Q&A session
  • (15 min): Hands-on exercise with demos, implementation details, tools, and tips

Tutorial materials

The tutorial slides and materials for hands-on exercises (e.g., code implementation, datasets) will be posted on this website and made available to all participants.

Tutorial info

The tutorial will be held at the IJCAI conference in Yokohama, Japan, in July 2020.

The primary goal for this tutorial is to introduce AI audiences to drug development, an external topic that can motivate and use AI research.

The target audience for this tutorial is entry-level participants with knowledge of the fundamentals of data mining and machine learning and some experience in deep learning (Intermediate). Although the first half of the tutorial focuses on introducing the tasks and AI research used in this area, a preliminary understanding of deep learning is helpful.

No special software or other package installation is needed to follow this tutorial.


Marinka Zitnik is Assistant Professor at Harvard and Associate Member at the Broad Institute of MIT and Harvard. Her research investigates artificial intelligence and machine learning to advance science, medicine, and health. Before Harvard, she was a postdoc in Computer Science at Stanford. She received her Ph.D. in Computer Science from University of Ljubljana while also researching at Imperial College London, University of Toronto, Baylor College of Medicine, and Stanford. Her work received several best paper, poster, and research awards from the International Society for Computational Biology. She has been named a Rising Star in EECS by MIT and also a Next Generation in Biomedicine by The Broad Institute. She has published in top ML venues (e.g., NeurIPS, ICLR) and top journals (e.g., Nature Communications, Proceedings of the National Academy of Sciences (PNAS)). She co-organized a tutorial in the area at ISMB 2018 and a related workshop at ICLR 2019, has given invited talks on this topic in big pharma and at major conferences, and co-edits a Special Issue on these topics at IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Cao (Danica) Xiao is the Director of Machine Learning at the Analytics Center of Excellence of IQVIA. She leads IQVIA’s North America machine learning team to drive next generation healthcare AI. Her team works on various projects on disease prediction, in silico drug modeling (e.g., adverse drug reaction detection, drug repositioning, and de novo design), and clinical trial recruitment prediction. Her research focuses on using ML/AI approaches to solve diverse real-world healthcare challenges. In particular, she is interested in phenotyping on electronic health records, data mining for in silico drug modeling, and patient segmentation for neurodegenerative diseases. Her research has been published in leading AI conferences including KDD, NIPS, ICLR, AAAI, IJCAI, SDM, ICDM, and WWW, and in top health informatics journals such as Nature Scientific Reports and JAMIA. Prior to IQVIA, she was a research staff member in the AI for Healthcare team at IBM Research from 2017 to 2019 and served as a member of the IBM Global Technology Outlook Committee from 2018 to 2019. She received her Ph.D. from University of Washington, Seattle in 2016.

Jimeng Sun is Professor in the Computer Science Department at University of Illinois Urbana-Champaign (UIUC). Prior to UIUC, he worked at Georgia Tech and IBM Research. His research is on artificial intelligence (AI) for healthcare, with core topics including 1) deep learning for drug discovery, 2) clinical trial optimization, 3) computational phenotyping, 4) clinical predictive modeling, 5) treatment recommendation, and 6) health monitoring. Dr. Sun has been collaborating with many healthcare organizations. He has published over 200 papers with an h-index of 59 and filed over 20 patents. He received the SDM/IBM early career research award in 2017, the ICDM best research paper award in 2008, the SDM best research paper award in 2007, and the KDD Dissertation runner-up award in 2008. In 2019, he was recognized as one of the Top 100 AI Leaders in Drug Discovery and Advanced Healthcare. Dr. Sun received his B.S. and M.Phil. in Computer Science from Hong Kong University of Science and Technology in 2002 and 2003, and his M.Sc. and Ph.D. in Computer Science from Carnegie Mellon University in 2006 and 2007.

Zitnik Lab  ·  Harvard  ·  Department of Biomedical Informatics