Course Project

Table of contents

  1. Project Components
    1. Project Proposal (13% of the Total Grade, Week 4)
    2. Mid-Term Project Report & Presentation (15% of the Total Grade, Week 8)
    3. Final Project Report & Presentation (50% of the Total Grade, Week 13)
    4. Weekly GitHub Commits (10% of the Total Grade, Weeks 2-13)
  2. LaTeX Template
  3. Additional Credit Opportunity: Blog Post or Open-Source Contribution
    1. Blog Post
    2. Open-Source Contribution
  4. Google Colab
  5. Biomedical Datasets
    1. Medical Imaging Datasets
    2. Clinical and Health Records Datasets
    3. Molecular and Drug Discovery Datasets
    4. Genomics and Proteomics Datasets
    5. Multimodal Medical Data and Knowledge Graphs
  6. Project Ideas

Project Components

  • BMIF 203 Students:
    • Work in groups of 1-2 students.
  • BMI 702 Students:
    • Work in groups of 2-3 students.

Project Proposal (13% of the Total Grade, Week 4)

By the end of Week 4, your group will submit a 2-page project proposal using the course LaTeX template. The proposal should define a clear research question in medical or biological AI, explain why it matters, and summarize relevant prior work. It should also describe the planned datasets, methods, and evaluation strategy, with concrete milestones for the rest of the semester. We will evaluate proposals on problem formulation, technical soundness, feasibility, and alignment with the course scope.

Mid-Term Project Report & Presentation (15% of the Total Grade, Week 8)

In Week 8, your group will submit a 4-page midterm report and give a project presentation. The goal is to demonstrate clear progress and to get targeted feedback on the direction of the work.

Your midterm report should include: the problem statement and scope, related work, datasets and preprocessing, methods implemented or developed so far, and results to date. Include quantitative evaluation where possible, along with error analysis and discussion of limitations. If results are not yet strong, explain what you have tried, what did not work, and what you will change.

Your presentation should summarize the same elements and state specific next steps for the second half of the semester. We will evaluate the midterm on quality of data analysis, progress on method development and implementation, and the depth and breadth of preliminary results.

Final Project Report & Presentation (50% of the Total Grade, Week 13)

In Week 13, your group will submit a 9-page final report and give a final presentation followed by Q&A. This is the primary outcome for the course and will be evaluated to a high standard.

The final report must use the course LaTeX template and follow the structure of a scientific paper. It should clearly state the research problem and why it matters, position the work in the context of prior literature, and describe the data, methods, and experimental design in enough detail for others to reproduce the work. You must report results with appropriate baselines and metrics, include error analysis and ablations where relevant, and discuss limitations and failure modes. Conclude with a clear summary of what you found and what remains open.

The final presentation should communicate the same story in a focused way: motivation, approach, key results, and takeaways. You should be prepared to answer questions about assumptions, evaluation choices, and how your method behaves under realistic conditions.

Weekly GitHub Commits (10% of the Total Grade, Weeks 2-13)

Each group must maintain an active shared GitHub repository from Week 2 through Week 13. You are expected to make regular commits that reflect ongoing work on the project, including code, experiments, data analysis, documentation, and intermediate results.

Commits should show steady development over time, not last-minute uploads. We will evaluate this component based on clarity of commit history and evidence of engagement with the project throughout the semester.

LaTeX Template

This year, we provide a LaTeX template for all written project deliverables. The template includes separate .tex files for the project proposal, midterm report, and final report. You must use this template for all submissions.

You can copy the template into your own Overleaf account using the following link: Overleaf Project Link.

Additional Credit Opportunity: Blog Post or Open-Source Contribution

Students may earn up to 5% additional credit by completing one optional dissemination activity tied to the course project. This option is intended for teams whose work is mature enough to benefit from broader visibility or reuse beyond the course.

Before proceeding, first discuss your plan with the Faculty Instructor. This includes the choice of platform, scope of the contribution, timing, and suitability of the work for public release.

Blog Post

A blog post should communicate your project to a broad audience interested in AI, medicine, and biology.

Plan for approximately 10 minutes of reading time. The post should include clear visuals, such as figures, diagrams, tables, or simple animations, where they help explain concepts or results. You are encouraged to include external references and pointers to relevant papers, datasets, code repositories, or tools so readers can explore further.

The topic may be introductory or focused on a cutting-edge or emerging area, as long as it is grounded in your course project. The post should explain the research question, why it matters, how you approached the problem, and what you learned, including limitations and open questions where appropriate.

Open-Source Contribution

An open-source contribution is appropriate if your project produces reusable code, models, or tools. This may include releasing a standalone repository or contributing to an existing open-source project. The contribution should be substantive and clearly connected to the technical work of your course project.

All code must be clean, documented, and usable by others. This includes a clear README, instructions for installation and use, and sufficient comments to explain design choices. The target repository or framework must be discussed with the instructors in advance. You will submit a link to the repository or pull request along with a short description of your contribution.

Google Colab

To support your projects, each student will be provided with a subscription to Google Colab, which will serve as the primary computational platform for this course. Colab is user-friendly and widely used for developing and running AI models, making it a great tool for collaboration and exploration.

If you have access to additional resources through your lab or institution, you’re welcome to use them. However, rest assured that your grade will not depend on the complexity of the tools you use—what matters most is your approach, creativity, and ability to share your work with others in the course.

For those using lab-specific tools or software, please ensure that your project is sharable with instructors and classmates. This allows us to evaluate your work and provides opportunities for peer learning. If you’re working with sensitive or restricted data or software, please choose aspects of your project that can be shared openly.

Biomedical Datasets

Below is a suggested list of well-documented datasets commonly used in medical and biological research. These datasets span multiple modalities, including imaging, clinical records, and genomic or molecular data, and can be used for the course project. The list is not exhaustive. Students may use other datasets as appropriate for their project.

Medical Imaging Datasets

  • Medical Segmentation Decathlon
    • Type: Multimodal imaging (e.g., brain, liver, prostate MRI, lung CT).
    • Description: A large-scale challenge dataset that includes multiple medical imaging modalities across different organs and diseases.
    • Link: Medical Decathlon
  • CheXpert
    • Type: Chest X-rays.
    • Description: A large dataset of chest X-rays labeled for the presence of 14 common chest radiographic findings.
    • Link: CheXpert
  • LUNA16 (LUng Nodule Analysis)
    • Type: Lung CT scans.
    • Description: This dataset is used for lung nodule detection and was derived from the LIDC/IDRI dataset.
    • Link: LUNA16
  • BraTS (Brain Tumor Segmentation)
    • Type: Brain MRI.
    • Description: A dataset for the segmentation of gliomas in MRI scans.
    • Link: BraTS
  • HAM10000 (Human Against Machine with 10000 training images)
    • Type: Dermatology.
    • Description: A dataset of skin lesion images for training models to classify various types of skin cancer.
    • Link: HAM10000
  • TCIA (The Cancer Imaging Archive)
    • Type: Medical imaging (various modalities).
    • Description: A large archive of cancer-related medical images, including radiology and pathology scans, organized by disease and imaging modality. Data is freely available for download and is often paired with clinical and genomic data.
    • Link: TCIA

Clinical and Health Records Datasets

  • MIMIC-III (Medical Information Mart for Intensive Care)
    • Type: ICU clinical data (structured and unstructured).
    • Description: A publicly available critical care database containing de-identified health data associated with over 60,000 ICU admissions.
    • Link: MIMIC-III
  • eICU Collaborative Research Database
    • Type: ICU patient records.
    • Description: A multi-center ICU database containing patient demographics, vital signs, medications, laboratory results, and more.
    • Link: eICU
  • PhysioNet Challenge Data
    • Type: Various medical data types (e.g., ECG, EEG, ICU records).
    • Description: A collection of datasets focused on clinical time-series data, including physiological signals and waveforms.
    • Link: PhysioNet
  • NHANES (National Health and Nutrition Examination Survey)
    • Type: Comprehensive health data.
    • Description: Provides demographic, dietary, and health-related data for the U.S. population, useful for epidemiological and public health research.
    • Link: NHANES

Molecular and Drug Discovery Datasets

  • Therapeutics Data Commons (TDC)
    • Type: Protein, drug and chemical structures, drug outcomes, bioactivities profiles, chemical perturbation datasets, adverse effects.
    • Description: Datasets spanning a range of therapeutic modalities, from small molecules to biologics such as antibodies, peptides, miRNAs, and gene editing therapies.
    • Link: TDC
  • Molecular Instruction Tuning Dataset (ProCyon-Instruct)
    • Type: Protein phenotype instruction data.
    • Description: A dataset of 33 million protein phenotype instructions, representing a comprehensive resource for multiscale protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. It underlies ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes.
    • Link: ProCyon

Genomics and Proteomics Datasets

  • TCGA (The Cancer Genome Atlas)
    • Type: Genomics, transcriptomics, and clinical data.
    • Description: A dataset containing multi-dimensional maps of key genomic changes in 33 types of cancer.
    • Link: TCGA
  • GTEx (Genotype-Tissue Expression)
    • Type: Genomics and transcriptomics.
    • Description: Data for studying tissue-specific gene expression and regulation in multiple human tissues.
    • Link: GTEx
  • ClinVar
    • Type: Genomic variant data.
    • Description: ClinVar is a freely accessible, public archive of reports of the relationships among human genetic variants and diseases.
    • Link: ClinVar
  • CellXGene
    • Type: Single-cell gene expression datasets to study the mechanisms of human health and disease.
    • Description: 1,720+ single-cell datasets across 970+ cell types and spanning 100+ million unique cells.
    • Link: CZI CellxGene

Multimodal Medical Data and Knowledge Graphs

  • Unified Clinical Vocabulary Embeddings
    • Type: Clinical vocabulary embeddings.
    • Description: A resource of 67,124 clinical vocabulary embeddings, generated with graph transformer neural networks from a clinical knowledge graph tailored to electronic health record vocabularies and spanning over 1.3 million edges. The embeddings provide a representation of clinical knowledge unified across seven medical vocabularies.
    • Link: Unified Clinical Vocabulary Embeddings
  • PheKG Knowledge Graph
    • Type: Clinical knowledge graph.
    • Description: A knowledge graph providing a new representation of clinical knowledge by unifying seven medical vocabularies.
    • Link: PheKG knowledge graph
  • PrimeKG Precision Medicine Knowledge Graph
    • Type: Medical knowledge graph.
    • Description: PrimeKG describes 17,080 diseases with 5,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scale, and the entire range of approved and experimental drugs with their therapeutic action.
    • Link: PrimeKG and TxGNN
  • BioCypher Biological Knowledge Graphs
    • Type: Biological knowledge graphs.
    • Description: The BioCypher framework supports users in creating custom knowledge graphs.
    • Link: BioCypher
  • BioSNAP Network Dataset Collection
    • Type: Networks, multi-relational networks.
    • Description: BioSNAP is a collection of diverse biomedical networks, including protein-protein interaction networks, single-cell similarity networks, and drug-drug interaction networks.
    • Link: BioSNAP
  • UK Biobank
    • Type: Extensive medical, genetic, and environmental data.
    • Description: A large-scale biomedical database containing in-depth genetic and health information from half a million UK participants.
    • Link: UK Biobank
  • OASIS (Open Access Series of Imaging Studies)
    • Type: Neuroimaging.
    • Description: A project aimed at making MRI datasets of the brain freely available to the scientific community.
    • Link: OASIS

Project Ideas

We provide a list of project ideas to help you get started. This list is not exhaustive. While we list an initial point of contact for each idea, you may discuss your project with any member of the course staff. We encourage you to attend office hours to discuss project ideas, get feedback, and troubleshoot technical issues.

  • Benchmarking AI Agents for Scientific Tool Use with ToolUniverse: Use ToolUniverse to benchmark AI agents on real scientific tool use across data analysis, simulation, and modeling tasks. Study success rates, failure modes, tool selection errors, and robustness across domains. (Contact: Marinka Zitnik)

  • Learning Tool-Use Policies for Multi-Step Scientific Reasoning: Develop agents that learn when and how to invoke scientific tools from ToolUniverse (https://aiscientist.tools/) to solve multi-step biomedical problems. Analyze trade-offs between autonomous planning, tool chaining, and human-in-the-loop control. (Contact: Marinka Zitnik)

  • Protein Function AI Scientist: Build an AI agent that predicts protein function by combining sequence, structure, interaction networks, and literature. Evaluate on held-out proteins using precision@k across functional categories, and compare against biological foundation models. Evaluate conditional generation of phenotypic descriptions: given a protein and a predicted function, generate a concise phenotype-level summary and score it against reference descriptions using automated text metrics and manual checks. (Contact: Marinka Zitnik)

  • Neurology AI Scientist: Develop an AI agent for neurology that reasons over single-cell genomics data. Evaluate on public single-cell datasets by measuring reproducibility across studies (train on one dataset, test on another), recovery of known disease-associated cell types, genes, and pathways, and enrichment against external references such as GWAS loci or curated gene sets. Background: PROTON. (Contact: Marinka Zitnik)

  • Cancer AI Scientist: Create an AI agent that integrates molecular, cellular, and clinical cancer data to stratify patients and generate testable hypotheses about mechanisms and therapeutic sensitivity. Evaluate on cancer cohorts with outcomes and treatment information (for example TCGA or cBioPortal studies) using endpoints such as overall survival, progression-free survival, response, or toxicity. Report stratification performance with time-to-event metrics (C-index, log-rank between predicted groups). For therapy hypotheses, evaluate whether predicted sensitive groups show higher response rates or improved outcomes under the matched treatments (with appropriate confounding controls). Background: Medea. (Contact: Marinka Zitnik)

  • Rare Disease AI Scientist: Develop an AI agent for rare disease drug repurposing to propose candidate therapies and their mechanistic rationale. The agent should produce ranked drug hypotheses with supporting evidence, including target pathways, gene-disease links, and safety or contraindication signals. Evaluate the system on retrospective rare-disease cases with known or clinician-supported treatments, using metrics such as top-k hit rate. Include a clinician review of a blinded subset to score clinical plausibility, and run tests for robustness under incomplete patient information. (Contact: Marinka Zitnik)

  • Integrating EHRs with LLMs for precision medicine: Combine electronic health records with large language models for personalized healthcare prediction. (Contact: Marinka Zitnik)

  • Geometric Deep Learning for Molecular Design and Optimization: Apply geometric deep learning methods to design and optimize molecular structures for drug design. (Contact: Marinka Zitnik)

  • Molecular Search Engines: Develop a molecular search engine for efficient retrieval of molecules, proteins, or compounds from large datasets. The system should support chemical structure and natural language queries by learning unified multimodal representations from sequence, structure, chemical graphs, and annotations. Evaluate performance using benchmark query–target pairs and targeted test cases. (Contact: Marinka Zitnik)

  • Longitudinal Clinical Reasoning with LLMs: Explore fine-tuning methods that help large language models reason over longitudinal clinical data for patient-level tasks such as treatment prediction and disease progression modeling. (Contact: Shvat Messica)

  • Uncertainty-aware LLMs for Medical Tasks: Advance methods to model uncertainty of medical AI models, such as abstaining, hedging, or requesting additional information when appropriate. (Contact: Shvat Messica)

  • Multimodal Foundation Models for Electronic Health Records: Build a model that integrates multiple EHR modalities, including clinical notes, structured data, and medical images, to support clinical prediction and decision support. (Contact: Shvat Messica)

  • What Pre-training Method is Optimal for Medical Image Data?: Self-supervised and contrastive pre-training work well in computer vision, but most methods are developed and ranked on natural images. Prior medical SSL results are hard to compare because studies vary in datasets, architectures, model size, compute, and training time. In this project, students will run a controlled benchmark of modern medical image pre-training methods using matched architectures, datasets, training duration, and compute budgets. They will compare contrastive SSL (SimCLR-style), distillation-based SSL (DINO-style), masked reconstruction or generative methods (MAE-style), joint embedding and predictive approaches, and vision-language pre-training on paired image–report data, then evaluate learned representations on downstream tasks such as disease classification, segmentation, radiology report generation, and visual question answering. (Contact: Mohammed Baharoon)

  • Can an Autonomous Agent Win a Medical Imaging Competition?: Students will build an end-to-end agent designed to place first in one selected medical imaging competition by specializing in that competition’s data formats, preprocessing, training recipe, evaluation, and submission protocol. The agent must run the full workflow autonomously: data preparation, model training, validation, debugging, and generating submission-ready prediction files. Target competitions may involve classification, segmentation, detection, or generation across modalities such as X-ray, CT, MRI, and pathology. This project requires extensive iteration, since agents may run for long periods to train and evaluate models. Students are expected to iterate weekly and should not plan to complete the work in a short or last-minute window. (Contact: Mohammed Baharoon)

  • Soft-CLIP for Medical Imaging Using Report-Similarity Targets: CLIP-style vision-language pre-training treats each image-report pair as a single hard positive, even though many radiology reports share findings and could serve as similar supervision. This project tests a two-stage “soft” approach. Students will first train (or adopt) a baseline CLIP model on paired images and reports, then refine it using pseudo-positives derived from report similarity based on shared positive findings: for each image, the model scores a pool of candidate reports, selects the top K, and learns to align the image with these semantically similar reports, not just the original pair. Students will compare retrieval and transfer to standard CLIP on one or two downstream tasks (e.g., classification or report generation) and study design choices such as how to compute report similarity and how to limit confirmation bias (e.g., mixing in the original report or using confidence thresholds). (Contact: Mohammed Baharoon)

  • Negation-Aware and Uncertainty-Aware Radiology Report Generation: Train a vision-text model that generates the Impression section from a chest X-ray using MIMIC-CXR (and CXR-PRO if you use it). Add an objective that penalizes negation errors, such as turning “no pneumothorax” into a positive finding. Add selective generation so the model can abstain or say it lacks confidence when uncertainty is high. Evaluate by extracting clinical findings from generated reports and comparing to the reference, measuring negation accuracy on edited or hard cases, and reporting calibration and coverage–error curves for abstention. (Contact: Rishabh Goel)

  • Shift-Robust ICU Early Warning System: Identify an ICU prediction task such as deterioration within 6–12 hours, mortality, or a sepsis proxy. Train strong baselines on MIMIC-IV, then test on eICU-CRD to measure the generalization drop across hospitals. Add one or two methods to improve reliability under shift, such as site-level recalibration, importance reweighting, conformal prediction, or drift detection with a “do not deploy” rule. (Contact: Rishabh Goel)

  • Target-Conditioned Molecule Generation with Toxicity Constraints: Train an activity predictor on ChEMBL and a toxicity predictor on a dataset such as Tox21. Build a generator (VAE, diffusion, or SMILES LLM) that conditions on high predicted activity for a target and low predicted toxicity. Add a simple explanation method to inspect what substructures drive the predictors and whether they match known motifs. Evaluate generation quality (validity, uniqueness, novelty, diversity) and constraint satisfaction (fraction that meet activity and toxicity thresholds). (Contact: Rishabh Goel)
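
Several of the ideas above evaluate patient stratification with time-to-event metrics such as the concordance index (C-index). As an illustrative sketch only (in practice you would likely use a tested library implementation, such as the one in lifelines or scikit-survival), a minimal Harrell's C-index for right-censored data can be computed by counting correctly ordered comparable pairs:

```python
import itertools


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times: observed follow-up time per patient
    events: 1 if the event (e.g., death) was observed, 0 if censored
    risk_scores: model output; higher score = predicted higher risk
    """
    concordant, permissible = 0.0, 0
    for i, j in itertools.combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simple version
        earlier, later = (i, j) if times[i] < times[j] else (j, i)
        # The pair is comparable only if the earlier time is an observed
        # event; a censored earlier time gives no ordering information.
        if not events[earlier]:
            continue
        permissible += 1
        if risk_scores[earlier] > risk_scores[later]:
            concordant += 1.0      # higher risk failed first: concordant
        elif risk_scores[earlier] == risk_scores[later]:
            concordant += 0.5      # tied scores count half
    return concordant / permissible if permissible else float("nan")
```

A C-index of 1.0 means the model's risk ordering perfectly matches observed event times, 0.5 is chance level, and values below 0.5 indicate an inverted ordering; full implementations also handle tied event times and ties under censoring more carefully.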