Datasets
PrimeKG
Precision Medicine Oriented Knowledge Graph
PrimeKG is a precision medicine-oriented knowledge graph that provides a holistic view of diseases. It integrates 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved and experimental drugs with their therapeutic action.
PrimeKG supports drug-disease prediction by including an abundance of 'indication', 'contraindication', and 'off-label use' edges, which are usually missing from other knowledge graphs. We accompany PrimeKG's graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multi-modal analyses.
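As a minimal sketch of how drug-disease edges might be pulled out of a knowledge graph for a prediction task, the snippet below filters a toy edge list by relation type. The rows, the column names (`relation`, `x_name`, `y_name`), and the relation labels are illustrative assumptions, not PrimeKG's confirmed schema; check the released files before adapting it.

```python
import csv
import io
from collections import defaultdict

# Toy edge list standing in for a knowledge-graph export.
# Column names and relation labels are assumptions made for illustration.
TOY_EDGES = """relation,x_name,y_name
indication,drug_A,disease_1
contraindication,drug_A,disease_2
off-label use,drug_B,disease_1
protein_protein,protein_X,protein_Y
indication,drug_B,disease_3
"""

DRUG_DISEASE_RELATIONS = {"indication", "contraindication", "off-label use"}


def drug_disease_edges(handle):
    """Group (drug, disease) pairs by relation, skipping all other edge types."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        if row["relation"] in DRUG_DISEASE_RELATIONS:
            grouped[row["relation"]].append((row["x_name"], row["y_name"]))
    return dict(grouped)


edges_by_relation = drug_disease_edges(io.StringIO(TOY_EDGES))
for relation, pairs in sorted(edges_by_relation.items()):
    print(relation, pairs)
```

Keeping the three relation types separate, rather than merging them into one "treats" edge, lets a model learn to distinguish drugs that help from drugs that should be avoided.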
GraphXAI
Evaluating Explainability for Graph Neural Networks
GraphXAI is a resource to systematically evaluate and benchmark the quality of GNN explanations. A key component is a novel and flexible synthetic dataset generator called ShapeGGen that automatically generates a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) together with ground-truth explanations that address all known pitfalls of explainability methods.
Physical Activity Monitoring Dataset
Dataset for Irregular Time Series Research
We are developing representation learning techniques for complex time series datasets. In the Raindrop study (ICLR'22), we introduced a graph-guided network for irregularly sampled multivariate time series. The study includes a processed sensor dataset recording daily living activities of individuals.
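To make "irregularly sampled multivariate time series" concrete, here is a small sketch of one common way to store such data: each sensor keeps its own timestamps, and a mask records which entries were actually observed once the sensors are aligned. The sensor names and readings are made up, and this layout is an assumption for illustration, not the Raindrop input format.

```python
# Each sensor reports at its own timestamps (in seconds); values are made up.
observations = {
    "heart_rate": [(0.0, 72.0), (1.5, 75.0), (4.0, 80.0)],
    "accel_x": [(0.5, 0.1), (1.5, 0.3)],
}


def to_masked_grid(obs):
    """Align sensors on the union of their timestamps.

    Returns the sorted timestamps, the sorted sensor names, a value matrix
    (0.0 is a placeholder where a sensor has no reading), and a 0/1 mask
    marking which entries were observed.
    """
    times = sorted({t for series in obs.values() for t, _ in series})
    sensors = sorted(obs)
    lookup = {s: dict(obs[s]) for s in sensors}
    values = [[lookup[s].get(t, 0.0) for s in sensors] for t in times]
    mask = [[1 if t in lookup[s] else 0 for s in sensors] for t in times]
    return times, sensors, values, mask


times, sensors, values, mask = to_masked_grid(observations)
for t, row, observed in zip(times, values, mask):
    print(t, row, observed)
```

The mask matters because a placeholder value is not a measurement: a model that ignores the mask cannot tell a missing reading from a true zero.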
Population-Scale Patient Safety Dataset
Adverse Events of Medications across Patient Groups and the Entire Range of Human Diseases and Approved Drugs
We present a comprehensive catalog of 10,443,476 adverse event reports (involving 19,193 adverse events and 3,624 drugs) from the U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS), collected from January 2013 to September 2020. The new resource can help discover relationships between drugs and safety events, especially in cases of rare events and effects within population subgroups that differ in their risks of specific clinical outcomes and are disproportionately affected by preventable inequities.
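One standard way to look for drug-event relationships in spontaneous reports is a disproportionality statistic such as the proportional reporting ratio (PRR). The sketch below computes it from a 2x2 table of report counts. The counts are invented, and the PRR is offered as a common pharmacovigilance measure, not as the analysis used to build this resource.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: reports with the drug and the event
    b: reports with the drug and any other event
    c: reports with other drugs and the event
    d: reports with other drugs and any other event

    PRR = [a / (a + b)] / [c / (c + d)], computed here in the algebraically
    equivalent form a * (c + d) / (c * (a + b)).
    """
    if a + b == 0 or c == 0:
        raise ValueError("PRR is undefined when a + b == 0 or c == 0")
    return (a * (c + d)) / (c * (a + b))


# Hypothetical counts: the event appears in 20% of the drug's reports
# but in only 1% of reports for all other drugs.
prr = proportional_reporting_ratio(a=20, b=80, c=100, d=9900)
print(f"PRR = {prr:.1f}")
```

A PRR well above 1 means the event is reported disproportionately often with the drug. It is a signal for follow-up, not evidence of causation, and the same calculation can be repeated within a patient subgroup to compare risks across groups.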
Subgraph Datasets
Datasets for Subgraph Representation Learning Research
We design novel synthetic and real-world social and biological datasets, consisting of underlying base graphs and many labeled subgraphs. These datasets are ready to be used for benchmarking, systematic model evaluation and comparison.
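The "base graph plus labeled subgraphs" layout can be sketched as follows. The graph, the subgraph node sets, and the labels are toy values, and the `LabeledSubgraph` type is a hypothetical helper, not the format of the released files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSubgraph:
    """A subgraph given by its node set, plus the label to predict."""
    nodes: frozenset
    label: str


# Toy undirected base graph stored as an adjacency mapping.
base_graph = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}

subgraphs = [
    LabeledSubgraph(frozenset({0, 1, 2}), "dense"),
    LabeledSubgraph(frozenset({3, 4}), "sparse"),
]


def induced_edges(graph, nodes):
    """Return the base-graph edges whose endpoints both lie in `nodes`."""
    return {(u, v) for u in nodes for v in graph[u] if v in nodes and u < v}


for sg in subgraphs:
    print(sg.label, sorted(induced_edges(base_graph, sg.nodes)))
```

Storing subgraphs as node sets over a shared base graph keeps each example small while still giving a model access to the surrounding neighborhood.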
Therapeutics Data Commons
Machine Learning Datasets and Tasks for Drug Discovery and Development
TDC is the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics.
At its core, TDC is a collection of AI/ML-ready datasets and learning tasks to serve as a meeting point for domain and ML scientists. TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles. All datasets and learning tasks are integrated and accessible via an open-source library.
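To illustrate what a "meaningful data split" can look like, the sketch below implements a simple cold-start split in plain Python: every record involving a held-out drug goes to the test set, so the model is evaluated on drugs it has never seen. The function name, its parameters, and the records are assumptions for illustration; this is not TDC's implementation, and the library provides its own split functions.

```python
import random


def cold_split(pairs, entity_index=0, test_fraction=0.25, seed=0):
    """Split records so that held-out entities never appear in training.

    `entity_index` selects which field of each record is held out
    (0 for the drug in the toy data below).
    """
    entities = sorted({pair[entity_index] for pair in pairs})
    rng = random.Random(seed)
    rng.shuffle(entities)
    n_test = max(1, round(len(entities) * test_fraction))
    test_entities = set(entities[:n_test])
    train = [p for p in pairs if p[entity_index] not in test_entities]
    test = [p for p in pairs if p[entity_index] in test_entities]
    return train, test


# Hypothetical (drug, target, label) records.
pairs = [
    ("drug_A", "target_1", 1),
    ("drug_A", "target_2", 0),
    ("drug_B", "target_1", 0),
    ("drug_C", "target_3", 1),
    ("drug_C", "target_1", 1),
    ("drug_D", "target_2", 0),
]

train, test = cold_split(pairs)
print("train drugs:", sorted({p[0] for p in train}))
print("test drugs:", sorted({p[0] for p in test}))
```

A random split would let the same drug appear on both sides, which tends to overstate how well a model generalizes to new compounds.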
BioSNAP
Stanford Biomedical Network Dataset Collection
BioSNAP is a collection of diverse biomedical networks, including protein-protein interaction networks, single-cell similarity networks, and drug-drug interaction networks.
BioSNAP datasets contain metadata on graphs and node features, and can be easily linked to external repositories of biological knowledge.
Fair Graph Datasets
Graph datasets comprising high-stakes decisions in criminal justice and financial lending domains
The graph datasets comprise critical decisions in criminal justice (whether a defendant should be released on bail) and financial lending (whether an individual should be given a loan). These attributed graphs contain sensitive/protected attributes, which makes them suitable for studying algorithmic fairness.
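As one example of what studying algorithmic fairness on such data can involve, the sketch below computes statistical parity difference, the gap in positive-prediction rates between two groups defined by a binary sensitive attribute. The predictions and group assignments are invented, and this metric is only one of several in common use.

```python
def statistical_parity_difference(predictions, sensitive):
    """Return P(pred = 1 | s = 1) - P(pred = 1 | s = 0).

    `predictions` and `sensitive` are equal-length sequences of 0/1 values.
    """
    group_1 = [p for p, s in zip(predictions, sensitive) if s == 1]
    group_0 = [p for p, s in zip(predictions, sensitive) if s == 0]
    if not group_1 or not group_0:
        raise ValueError("both sensitive groups must be non-empty")
    return sum(group_1) / len(group_1) - sum(group_0) / len(group_0)


# Hypothetical model outputs (1 = favorable decision) for eight individuals.
predictions = [1, 1, 0, 1, 0, 0, 1, 0]
sensitive = [1, 1, 1, 1, 0, 0, 0, 0]

spd = statistical_parity_difference(predictions, sensitive)
print(f"statistical parity difference = {spd:.2f}")
```

A value of 0 means both groups receive favorable predictions at the same rate; the further the value is from 0, the larger the disparity.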
OGB
The Open Graph Benchmark
OGB is a collection of benchmark datasets, data loaders, and evaluators for graph machine learning. Datasets cover a variety of graph machine learning tasks and real-world applications.
The OGB data loaders are fully compatible with popular graph deep learning frameworks, including PyTorch Geometric and DGL. They provide automatic dataset downloading, standardized dataset splits, and unified performance evaluation.
Multimodal cancer network
Multimodal network centered on genes frequently mutated in cancer patients
The multimodal cancer network integrates information on chemicals, diseases, molecular functions, genes, and proteins.
The dataset has 21 types of biologically meaningful associations (edge types): chemical-chemical, chemical-protein, disease-chemical, disease-disease, disease-function, disease-gene, function-function, gene-gene (split into 11 edge types by interaction type), gene-protein, protein-function, and protein-protein interactions.
The network has 20 K nodes and 3.4 M edges.
Giga-scale biological network
The giga-scale biological network is one of the largest networks ever constructed in biology. The network integrates protein and genetic interaction data from more than two thousand species.
The network has 10 M nodes and 2.3 B edges.
Tree of life
Protein interactomes across the tree of life
The dataset contains protein interactomes from 1,840 species across the tree of life. The dataset contains rich metadata about proteins, including their homology relationships.
The dataset also contains metadata about species, including taxonomy, phylogenetic relationships, and ecological information on the environments and habitats in which the species live.
Polypharmacy network
Network of drugs, proteins, and side effects
The polypharmacy network is a highly multi-relational network, consisting of protein-protein interactions, drug-protein targets, and drug-drug interactions encoded by polypharmacy side effects.
The network has 20 K nodes and 5 M edges, which are split into 1 K distinct edge types.
Tissue-specific protein dataset
The dataset contains protein-protein interaction networks specific to 107 human tissues, a tissue hierarchy of anatomical relationships between tissues, and tissue-specific gene-function annotations.
Human knowledge network
The human knowledge network contains interactions between proteins, diseases, biological processes, side effects, and drugs.
The network has 98 K nodes and 8 M edges, which are split into 42 distinct types of biologically relevant molecular interactions.
Latest News
Nov 2023: Next Generation of Therapeutics Commons
We are building the next generation of Therapeutics Commons! We are seeking outstanding fellows who will lead AI research to advance molecular drug design and clinical drug development.
Oct 2023: Structure-Based Drug Design
Geometric deep learning has emerged as a valuable tool for structure-based drug design, to generate and refine biomolecules by leveraging detailed three-dimensional geometric and molecular interaction information.
Oct 2023: Graph AI in Medicine
Graph AI models in medicine integrate diverse data modalities through pre-training, facilitate interactive feedback loops, and foster human-AI collaboration, paving the way to clinically meaningful predictions.
Sep 2023: New papers accepted at NeurIPS
Congratulations to Owen and Zaixi for having their papers accepted as spotlights at NeurIPS! These papers introduce techniques for explaining time series models through self-supervised learning and co-designing protein pocket sequences & 3D structures.
Sep 2023: Future Directions in Network Biology
Excited to share our perspectives on current and future directions in network biology.
Aug 2023: Scientific Discovery in the Age of AI
New paper on the role of artificial intelligence in scientific discovery is published in Nature.
Jul 2023: PINNACLE - Contextual AI protein model
PINNACLE is a contextual AI model for protein understanding that dynamically adjusts its outputs based on biological contexts in which it operates. Project website.
Jun 2023: Our Group is Joining the Kempner Institute
Excited to join Kempner’s inaugural cohort of associate faculty to advance Kempner’s mission of studying the intersection of natural and artificial intelligence.
Jun 2023: Welcoming a New Postdoctoral Fellow
An enthusiastic welcome to Shanghua Gao who is joining our group as a postdoctoral research fellow.
Jun 2023: On Pretraining in Nature Machine Intelligence
Excited to share our new study on language model pretraining and general-purpose methods for biological sequences. Project website.
May 2023: Congratulations to Ada and Michelle
Congratulations to PhD student Michelle on being selected as the 2023 Albert J. Ryan Fellow and as a participant in the Heidelberg Laureate Forum. Congratulations to PhD student Ada on being selected as a Kempner Institute Graduate Fellow!
Apr 2023: Universal Domain Adaptation at ICML 2023
New paper introducing the first model for closed-set and universal domain adaptation on time series accepted at ICML 2023. Raincoat addresses feature and label shifts and can detect private labels. Project website.
Apr 2023: Celebrating Achievements of Our Undergrads
Undergraduate researchers Ziyuan, Nick, Yepeng, Jiali, Julia, and Marissa are moving on to PhD research in Computer Science, Systems Biology, Neuroscience, and Biological & Medical Sciences at Harvard, MIT, Carnegie Mellon University, and UMass Lowell. We are excited for the bright future they have created for themselves.
Apr 2023: Welcoming a New Postdoctoral Fellow
An enthusiastic welcome to Tianlong Chen, our newly appointed postdoctoral fellow.
Apr 2023: New Study in Nature Machine Intelligence
New paper in Nature Machine Intelligence introducing the blueprint for multimodal learning with graphs.
Mar 2023: Precision Health in Nature Machine Intelligence
New paper with NASA in Nature Machine Intelligence on biomonitoring and precision health in deep space supported by artificial intelligence.
Mar 2023: Self-Driving Labs in Nature Machine Intelligence
New paper with NASA in Nature Machine Intelligence on biological research and self-driving labs in deep space supported by artificial intelligence.
Mar 2023: TxGNN - Zero-shot prediction of therapeutic use
New study on zero-shot prediction of therapeutic use with geometric deep learning and clinician-centered design. Check out our project website and TxGNN Explorer.
Mar 2023: GraphXAI published in Scientific Data
Our approach to evaluating the explainability of geometric deep learning models is published in Scientific Data. Project website.
Feb 2023: Welcoming New Postdoctoral Fellows
A warm welcome to postdoctoral fellows Wanxiang Shen and Ruth Johnson. Congratulations to Ruthie for being named a Berkowitz Fellow.