Open Source & AI-Ready Datasets


Precision Medicine Oriented Knowledge Graph

PrimeKG is a precision medicine-oriented knowledge graph that provides a holistic view of diseases. It integrates 20 high-quality resources to describe 17,080 diseases with 5,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved and experimental drugs with their therapeutic action.

PrimeKG supports drug-disease prediction by including an abundance of ‘indication’, ‘contraindication’, and ‘off-label use’ edges, which are usually missing in other knowledge graphs. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multi-modal analyses.
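As a rough illustration of how typed drug-disease edges can be queried, the sketch below filters a small edge list by relation and groups the drugs linked to one disease. The column names, the relation labels as spelled here, and every row are hypothetical placeholders, not PrimeKG's actual schema or contents.

```python
import csv
import io

# Toy edge list. Column names and rows are hypothetical placeholders,
# not PrimeKG's real schema or data.
TOY_EDGES = """relation,source,source_type,target,target_type
indication,drug_A,drug,disease_X,disease
contraindication,drug_A,drug,disease_Y,disease
off-label use,drug_B,drug,disease_X,disease
protein_protein,protein_P,protein,protein_Q,protein
"""

# Relation labels assumed for this sketch.
DRUG_DISEASE_RELATIONS = {"indication", "contraindication", "off-label use"}


def drug_disease_edges(csv_text):
    """Return (drug, relation, disease) triples for drug-disease relations."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["source"], row["relation"], row["target"])
        for row in reader
        if row["relation"] in DRUG_DISEASE_RELATIONS
    ]


def candidates_for(disease, triples):
    """Group the drugs linked to one disease by relation type."""
    by_relation = {}
    for drug, relation, target in triples:
        if target == disease:
            by_relation.setdefault(relation, []).append(drug)
    return by_relation


triples = drug_disease_edges(TOY_EDGES)
print(candidates_for("disease_X", triples))
```

Keeping the relation type on every edge is what lets a drug-disease prediction task treat the three kinds of link as separate labels rather than a single "associated with" edge.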

View the PrimeKG Website

Physical Activity Monitoring Dataset

Dataset for Irregular Time Series Research

We are developing representation learning techniques for complex time series datasets. In the Raindrop study (ICLR’22), we introduced a graph-guided network for irregularly sampled multivariate time series. The study includes a processed sensor dataset recording daily living activities of individuals.
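To make "irregularly sampled multivariate time series" concrete, the sketch below stores readings as (timestamp, sensor, value) triples and aligns them on the union of observed timestamps with an observation mask. This is a common generic representation, not the Raindrop model or the dataset's actual file format; the sensor names and values are made up for illustration.

```python
# Hypothetical readings as (timestamp_seconds, sensor_name, value).
# Sensors report at different, unevenly spaced times.
observations = [
    (0.0, "heart_rate", 72.0),
    (0.4, "accel_x", 0.10),
    (1.1, "accel_x", 0.30),
    (2.5, "heart_rate", 75.0),
    (2.5, "accel_x", 0.20),
]


def to_masked_grid(obs):
    """Align readings on the union of observed timestamps.

    Returns (times, sensors, values, mask). mask[i][j] is 1 when sensor j
    was observed at times[i]; otherwise it is 0 and values[i][j] holds a
    0.0 placeholder that downstream code should ignore.
    """
    times = sorted({t for t, _, _ in obs})
    sensors = sorted({s for _, s, _ in obs})
    t_index = {t: i for i, t in enumerate(times)}
    s_index = {s: j for j, s in enumerate(sensors)}
    values = [[0.0] * len(sensors) for _ in times]
    mask = [[0] * len(sensors) for _ in times]
    for t, s, v in obs:
        values[t_index[t]][s_index[s]] = v
        mask[t_index[t]][s_index[s]] = 1
    return times, sensors, values, mask


times, sensors, values, mask = to_masked_grid(observations)
for t, row in zip(times, mask):
    print(t, dict(zip(sensors, row)))
```

The mask makes the missingness explicit, so a model can distinguish "sensor read zero" from "sensor did not report at this time".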

View the Physical Activity Monitoring Dataset

Population-Scale Patient Safety Dataset

Adverse Events of Medications across Patient Groups and the Entire Range of Human Diseases and Approved Drugs

We present a comprehensive catalog of 10,443,476 adverse event reports (involving 19,193 adverse events and 3,624 drugs) from the U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS), collected from January 2013 to September 2020. The new resource can help discover relationships between drugs and safety events, especially in cases of rare events and effects within population subgroups that differ in their risks of specific clinical outcomes and are disproportionately affected by preventable inequities.

View the Patient Safety Dataset

Subgraph Datasets

Datasets for Subgraph Representation Learning Research

We design novel synthetic and real-world social and biological datasets, consisting of underlying base graphs and many labeled subgraphs. These datasets are ready to be used for benchmarking, systematic model evaluation and comparison.

View the SubGNN Website

Therapeutics Data Commons

Machine Learning Datasets and Tasks for Drug Discovery and Development

TDC is the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics.

At its core, TDC is a collection of AI/ML-ready datasets and learning tasks to serve as a meeting point for domain and ML scientists. TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles. All datasets and learning tasks are integrated and accessible via an open-source library.
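One common reading of a "meaningful" data split is one that keeps related items together, so that a model is evaluated on groups it has never seen. The sketch below is a minimal, generic group-aware split written for illustration; it is not TDC's implementation, and the `group_split` function, the record names, and the scaffold identifiers are all hypothetical.

```python
import random


def group_split(items, group_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/valid/test so that no group spans two splits."""
    # Bucket items by their group key.
    groups = {}
    for item in items:
        groups.setdefault(group_of(item), []).append(item)

    # Shuffle group order deterministically for reproducibility.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    names = ("train", "valid", "test")
    targets = dict(zip(names, (f * len(items) for f in fractions)))
    splits = {name: [] for name in names}
    for key in keys:
        # Send the whole group to the split furthest below its target size.
        neediest = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[neediest].extend(groups[key])
    return splits


# Hypothetical records: (molecule id, scaffold id), two molecules per scaffold.
records = [(f"mol_{i}", f"scaffold_{i // 2}") for i in range(20)]
splits = group_split(records, group_of=lambda r: r[1])
print({name: len(part) for name, part in splits.items()})
```

Because whole groups move together, split sizes only approximate the requested fractions when group sizes are uneven; that trade-off is usually accepted in exchange for a stricter test of generalization.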

View the TDC Website


Stanford Biomedical Network Dataset Collection

BioSNAP is a collection of diverse biomedical networks, including protein-protein interaction networks, single-cell similarity networks, and drug-drug interaction networks.

BioSNAP datasets contain metadata on graphs and node features, and can be easily linked to external repositories of biological knowledge.

View the BioSNAP Website

Fair Graph Datasets

Graph datasets comprising high-stakes decisions in criminal justice and financial lending domains

Graph datasets comprise critical decisions in the criminal justice (whether a defendant should be released on bail) and financial lending (whether an individual should be given a loan) domains. These attributed graphs contain sensitive/protected attributes, which makes them suitable for studying algorithmic fairness.
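One widely used group-fairness measure that such sensitive attributes make possible is statistical parity: the gap in positive-prediction rates between the two groups. The sketch below computes it for binary predictions and a binary sensitive attribute. The function name and the example values are hypothetical and are not drawn from these datasets.

```python
def statistical_parity_difference(predictions, sensitive):
    """Return |P(pred = 1 | s = 0) - P(pred = 1 | s = 1)|.

    Both arguments are equal-length lists of 0/1 values, one entry per node.
    """
    rates = []
    for group in (0, 1):
        members = [p for p, s in zip(predictions, sensitive) if s == group]
        if not members:
            raise ValueError(f"no nodes with sensitive attribute {group}")
        rates.append(sum(members) / len(members))
    return abs(rates[0] - rates[1])


# Hypothetical per-node model outputs and a binary sensitive attribute.
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
sensitive = [0, 0, 0, 0, 1, 1, 1, 1]
gap = statistical_parity_difference(predictions, sensitive)
print(gap)
```

A gap of 0 means both groups receive positive predictions at the same rate; larger values indicate a larger disparity.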

View the NIFTY website


The Open Graph Benchmark

OGB is a collection of benchmark datasets, data loaders, and evaluators for graph machine learning. Datasets cover a variety of graph machine learning tasks and real-world applications.

The OGB data loaders are fully compatible with popular graph deep learning frameworks, including PyTorch Geometric and DGL. They provide automatic dataset downloading, standardized dataset splits, and unified performance evaluation.

View the OGB Website

Disease pathways

Disease pathways overlaid on the human interactome

View Disease Pathway Dataset

Multimodal cancer network

Multimodal network centered on genes frequently mutated in cancer patients

The multimodal cancer network integrates information on chemicals, diseases, molecular functions, genes, and proteins.

The dataset has 21 types of biologically meaningful associations (edge types): chemical-chemical, chemical-protein, disease-chemical, disease-disease, disease-function, disease-gene, function-function, gene-gene (split into 6 edge types by interaction type), gene-protein, protein-function, and protein-protein interactions.

The network has 20 K nodes and 3.4 M edges.

View the Multimodal Cancer Network

Giga-scale biological network

The giga-scale biological network is one of the largest networks ever constructed in biology. The network integrates protein and genetic interaction data from more than two thousand species.

The network has 10 M nodes and 2.3 B edges.

View the Giga-Scale Biological Network

Tree of life

Protein interactomes across the tree of life

The dataset contains protein interactomes from 1,840 species across the tree of life. The dataset contains rich metadata about proteins, including their homology relationships.

The dataset also contains metadata about species, including taxonomy of species, phylogenetic relationships, and ecological information on environments and habitats in which species live.

View the Tree of Life dataset

Polypharmacy network

Network of drugs, proteins, and side effects

The polypharmacy network is a highly multi-relational network, consisting of protein-protein interactions, drug-protein targets, and drug-drug interactions encoded by polypharmacy side effects.

The network has 20 K nodes and 5 M edges, which are split into 1 K distinct edge types.

View the Polypharmacy Network

Tissue-specific protein dataset

The dataset contains protein-protein interaction networks specific to 107 human tissues, a tissue hierarchy of anatomical relationships between tissues, and tissue-specific gene-function annotations.

View the Tissue-Specific Protein Dataset

Human knowledge network

The human knowledge network contains interactions between proteins, diseases, biological processes, side effects, and drugs.

The network has 98 K nodes and 8 M edges, which are split into 42 distinct types of biologically relevant molecular interactions.

View the Human Knowledge Network

Latest News

May 2022:   George Named the 2022 Wojcicki Troper Fellow

May 2022:   New preprint on PrimeKG

New preprint on building knowledge graphs to enable precision medicine applications.

Apr 2022:   Webster on the Cover of Cell Systems

Webster is on the cover of the April issue of Cell Systems. Webster uses cell viability changes following gene perturbation to automatically learn cellular functions and pathways from data.

Apr 2022:   NASA Space Biology

Dr. Zitnik will serve on the Science Working Group at NASA Space Biology.

Mar 2022:   Yasha's Graduate Research Fellowship

Yasha won the National Defense Science and Engineering Graduate (NDSEG) Fellowship. Congratulations!

Mar 2022:   AI4Science at ICML 2022

We are excited to be selected to organize the AI4Science meeting at ICML 2022. Stay tuned for details.

Mar 2022:   Graph Algorithms in Biomedicine at PSB 2023

Excited to be organizing a session on Graph Algorithms at PSB 2023. Stay tuned for details.

Mar 2022:   Multimodal Learning on Graphs

New preprint! We introduce REMAP, a multimodal AI approach for disease relation extraction and classification. Project website.

Feb 2022:   Explainable Graph AI on Capitol Hill

Owen has been selected to present our research on explainable biomedical AI to members of the US Congress at the “Posters on the Hill” symposium. Congrats Owen!

Feb 2022:   Graph Neural Networks for Time Series

Hot off the press at ICLR 2022. Check out Raindrop, our graph neural network with unique predictive capability to learn from irregular time series. Project website.

Feb 2022:   Biomedical Graph ML Tutorial Accepted to ISMB

Excited to present a tutorial at ISMB 2022 on graph representation learning for precision medicine. Congratulations, Michelle!

Feb 2022:   Marissa Won the Gates Cambridge Scholarship

Marissa Sumathipala is among the 23 outstanding US scholars selected to be part of the 2022 class of Gates Cambridge Scholars at the University of Cambridge. Congratulations, Marissa!

Jan 2022:   Inferring Gene Multifunctionality

Jan 2022:   Deep Graph AI for Time Series Accepted to ICLR

Paper on graph representation learning for time series accepted to ICLR. Congratulations, Xiang!

Jan 2022:   Probing GNN Explainers Accepted to AISTATS

Jan 2022:   Marissa Sumathipala selected as Churchill Scholar

Marissa Sumathipala is selected for the prestigious Churchill Scholarship. Congratulations, Marissa!

Jan 2022:   Therapeutics Data Commons User Meetup

We invite you to join the growing open-science community at the User Group Meetup of Therapeutics Data Commons! Register for the first live user group meeting on Tuesday, January 25 at 11:00 AM EST.

Jan 2022:   Workshop on Graph Learning Benchmarks

Dec 2021:   NASA: Precision Space Health System

Human space exploration beyond low Earth orbit will involve missions of significant distance and duration. To effectively mitigate myriad space health hazards, paradigm shifts in data and space health systems are necessary to enable Earth independence. Delighted to be working with NASA and to share our recommendations!

Zitnik Lab  ·  Harvard  ·  Department of Biomedical Informatics