Course Project

Table of contents

  1. Project Components
    1. Project Proposal (13% of the Total Grade, Week 4)
    2. Mid-Term Project Report & Presentation (15% of the Total Grade, Week 8)
    3. Final Project Report & Presentation (50% of the Total Grade, Week 13)
    4. Weekly GitHub Commits (10% of the Total Grade, Weeks 2-13)
  2. LaTeX Template
  3. Additional Credit Opportunity: Blog Post or Open-Source Contribution
    1. Blog Post
    2. Open-Source Contribution
  4. Google Colab
  5. Biomedical Datasets
    1. Medical Imaging Datasets
    2. Clinical and Health Records Datasets
    3. Molecular and Drug Discovery Datasets
    4. Genomics and Proteomics Datasets
    5. Multimodal Medical Data and Knowledge Graphs
  6. Project Ideas

Project Components

  • BMIF 203 Students:
    • Work in groups of 1-2 students.
  • BMI 702 Students:
    • Work in groups of 2-3 students.

Project Proposal (13% of the Total Grade, Week 4)

By the end of Week 4, your group will submit a 2-page project proposal using the course LaTeX template. The proposal should define a clear research question in medical or biological AI, explain why it matters, and summarize relevant prior work. It should also describe the planned datasets, methods, and evaluation strategy, with concrete milestones for the rest of the semester. We will evaluate proposals on problem formulation, technical soundness, feasibility, and alignment with the course scope.

Mid-Term Project Report & Presentation (15% of the Total Grade, Week 8)

In Week 8, your group will submit a 4-page midterm report and give a project presentation. The goal is to demonstrate clear progress and to get targeted feedback on the direction of the work.

Your midterm report should include: the problem statement and scope, related work, datasets and preprocessing, methods implemented or developed so far, and results to date. Include quantitative evaluation where possible, along with error analysis and discussion of limitations. If results are not yet strong, explain what you have tried, what did not work, and what you will change.

Your presentation should summarize the same elements and state specific next steps for the second half of the semester. We will evaluate the midterm on quality of data analysis, progress on method development and implementation, and the depth and breadth of preliminary results.

Final Project Report & Presentation (50% of the Total Grade, Week 13)

In Week 13, your group will submit a 9-page final report and give a final presentation followed by Q&A. This is the primary outcome for the course and will be evaluated to a high standard.

The final report must use the course LaTeX template and follow the structure of a scientific paper. It should clearly state the research problem and why it matters, position the work in the context of prior literature, and describe the data, methods, and experimental design in enough detail for others to reproduce the work. You must report results with appropriate baselines and metrics, include error analysis and ablations where relevant, and discuss limitations and failure modes. Conclude with a clear summary of what you found and what remains open.

The final presentation should communicate the same story in a focused way: motivation, approach, key results, and takeaways. You should be prepared to answer questions about assumptions, evaluation choices, and how your method behaves under realistic conditions.

Weekly GitHub Commits (10% of the Total Grade, Weeks 2-13)

Each group must maintain an active shared GitHub repository from Week 2 through Week 13. You are expected to make regular commits that reflect ongoing work on the project, including code, experiments, data analysis, documentation, and intermediate results.

Commits should show steady development over time, not last-minute uploads. We will evaluate this component based on clarity of commit history and evidence of engagement with the project throughout the semester.

LaTeX Template

This year, we provide a LaTeX template for all written project deliverables. The template includes separate .tex files for the project proposal, midterm report, and final report. You must use this template for all submissions.

You can copy the template into your own Overleaf account using the following link: Overleaf Project Link.

Additional Credit Opportunity: Blog Post or Open-Source Contribution

Students may earn up to 5% additional credit by completing one optional dissemination activity tied to the course project. This option is intended for teams whose work is mature enough to benefit from broader visibility or reuse beyond the course.

Before proceeding, first discuss your plan with the Faculty Instructor. This includes the choice of platform, scope of the contribution, timing, and suitability of the work for public release.

Blog Post

A blog post should communicate your project to a broad audience interested in AI, medicine, and biology.

Plan for approximately 10 minutes of reading time. The post should include clear visuals, such as figures, diagrams, tables, or simple animations, where they help explain concepts or results. You are encouraged to include external references and pointers to relevant papers, datasets, code repositories, or tools so readers can explore further.

The topic may be introductory or focused on a cutting-edge or emerging area, as long as it is grounded in your course project. The post should explain the research question, why it matters, how you approached the problem, and what you learned, including limitations and open questions where appropriate.

Open-Source Contribution

An open-source contribution is appropriate if your project produces reusable code, models, or tools. This may include releasing a standalone repository or contributing to an existing open-source project. The contribution should be substantive and clearly connected to the technical work of your course project.

All code must be clean, documented, and usable by others. This includes a clear README, instructions for installation and use, and sufficient comments to explain design choices. The target repository or framework must be discussed with the instructors in advance. You will submit a link to the repository or pull request along with a short description of your contribution.

Google Colab

To support your projects, each student will be provided with a subscription to Google Colab, which will serve as the primary computational platform for this course. Colab is user-friendly and widely used for developing and running AI models, making it a great tool for collaboration and exploration.

If you have access to additional resources through your lab or institution, you’re welcome to use them. However, rest assured that your grade will not depend on the complexity of the tools you use—what matters most is your approach, creativity, and ability to share your work with others in the course.

For those using lab-specific tools or software, please ensure that your project is sharable with instructors and classmates. This allows us to evaluate your work and provides opportunities for peer learning. If you’re working with sensitive or restricted data or software, please choose aspects of your project that can be shared openly.

Biomedical Datasets

Below is a suggested list of well-documented datasets commonly used in medical and biological research. These datasets span multiple modalities, including imaging, clinical records, and genomic or molecular data, and can be used for the course project. The list is not exhaustive. Students may use other datasets as appropriate for their project.

Medical Imaging Datasets

  • Medical Segmentation Decathlon
    • Type: Multimodal imaging (e.g., brain, liver, prostate MRI, lung CT).
    • Description: A large-scale challenge dataset that includes multiple medical imaging modalities across different organs and diseases.
    • Link: Medical Decathlon
  • CheXpert
    • Type: Chest X-rays.
    • Description: A large dataset of chest X-rays labeled for the presence of 14 common chest radiographic findings.
    • Link: CheXpert
  • LUNA16 (LUng Nodule Analysis)
    • Type: Lung CT scans.
    • Description: This dataset is used for lung nodule detection and was derived from the LIDC/IDRI dataset.
    • Link: LUNA16
  • BraTS (Brain Tumor Segmentation)
    • Type: Brain MRI.
    • Description: A dataset for the segmentation of gliomas in MRI scans.
    • Link: BraTS
  • HAM10000 (Human Against Machine with 10000 training images)
    • Type: Dermatology.
    • Description: A dataset of skin lesion images for training models to classify various types of skin cancer.
    • Link: HAM10000
  • TCIA (The Cancer Imaging Archive)
    • Type: Medical imaging (various modalities).
    • Description: A large archive of cancer-related medical images, including radiology and pathology scans, organized by disease and imaging modality. Data is freely available for download and is often paired with clinical and genomic data.
    • Link: TCIA

Clinical and Health Records Datasets

  • MIMIC-III (Medical Information Mart for Intensive Care)
    • Type: ICU clinical data (structured and unstructured).
    • Description: A publicly available critical care database containing de-identified health data associated with over 60,000 ICU admissions.
    • Link: MIMIC-III
  • eICU Collaborative Research Database
    • Type: ICU patient records.
    • Description: A multi-center ICU database containing patient demographics, vital signs, medications, laboratory results, and more.
    • Link: eICU
  • PhysioNet Challenge Data
    • Type: Various medical data types (e.g., ECG, EEG, ICU records).
    • Description: A collection of datasets focused on clinical time-series data, including physiological signals and waveforms.
    • Link: PhysioNet
  • NHANES (National Health and Nutrition Examination Survey)
    • Type: Comprehensive health data.
    • Description: Provides demographic, dietary, and health-related data for the U.S. population, useful for epidemiological and public health research.
    • Link: NHANES

Molecular and Drug Discovery Datasets

  • Therapeutics Data Commons (TDC)
    • Type: Protein, drug and chemical structures, drug outcomes, bioactivities profiles, chemical perturbation datasets, adverse effects.
    • Description: Datasets spanning a range of therapeutic modalities, from small molecules to biologics such as antibodies, peptides, miRNAs, and gene editing therapies.
    • Link: TDC
  • Molecular Instruction Tuning Dataset (ProCyon-Instruct)
    • Type: Protein phenotype instruction data.
    • Description: A dataset of 33 million protein phenotype instructions, representing a comprehensive resource for multiscale protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. It underlies ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes.
    • Link: ProCyon

Genomics and Proteomics Datasets

  • TCGA (The Cancer Genome Atlas)
    • Type: Genomics, transcriptomics, and clinical data.
    • Description: A dataset containing multi-dimensional maps of key genomic changes in 33 types of cancer.
    • Link: TCGA
  • GTEx (Genotype-Tissue Expression)
    • Type: Genomics and transcriptomics.
    • Description: Data for studying tissue-specific gene expression and regulation in multiple human tissues.
    • Link: GTEx
  • ClinVar
    • Type: Genomic variant data.
    • Description: ClinVar is a freely accessible, public archive of reports of the relationships among human genetic variants and diseases.
    • Link: ClinVar
  • CellXGene
    • Type: Single-cell gene expression datasets to study the mechanisms of human health and disease.
    • Description: 1,720+ single-cell datasets across 970+ cell types and spanning 100+ million unique cells.
    • Link: CZI CellxGene

Multimodal Medical Data and Knowledge Graphs

  • Unified Clinical Vocabulary Embeddings
    • Type: Clinical vocabulary embeddings.
    • Description: A resource of 67,124 clinical vocabulary embeddings, generated with graph transformer neural networks from a clinical knowledge graph tailored to electronic health record vocabularies and spanning over 1.3 million edges. The embeddings provide a representation of clinical knowledge unified across seven medical vocabularies.
    • Link: Unified Clinical Vocabulary Embeddings
  • PheKG Knowledge Graph
    • Type: Clinical knowledge graph.
    • Description: A knowledge graph providing a new representation of clinical knowledge by unifying seven medical vocabularies.
    • Link: PheKG knowledge graph
  • PrimeKG Precision Medicine Knowledge Graph
    • Type: Medical knowledge graph.
    • Description: PrimeKG describes 17,080 diseases with 5,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scale, and the entire range of approved and experimental drugs with their therapeutic action.
    • Link: PrimeKG and TxGNN
  • BioCypher Biological Knowledge Graphs
    • Type: Biological knowledge graphs.
    • Description: The BioCypher framework supports users in creating custom knowledge graphs.
    • Link: BioCypher
  • BioSNAP Network Dataset Collection
    • Type: Networks, multi-relational networks.
    • Description: BioSNAP is a collection of diverse biomedical networks, including protein-protein interaction networks, single-cell similarity networks, and drug-drug interaction networks.
    • Link: BioSNAP
  • UK Biobank
    • Type: Extensive medical, genetic, and environmental data.
    • Description: A large-scale biomedical database containing in-depth genetic and health information from half a million UK participants.
    • Link: UK Biobank
  • OASIS (Open Access Series of Imaging Studies)
    • Type: Neuroimaging.
    • Description: A project aimed at making MRI datasets of the brain freely available to the scientific community.
    • Link: OASIS

Project Ideas

We provide a list of project ideas to help you get started. This list is not exhaustive. While we list an initial point of contact for each idea, you may discuss your project with any member of the course staff. We encourage you to attend office hours to discuss project ideas, get feedback, and troubleshoot technical issues.

  • Benchmarking AI Agents for Scientific Tool Use with ToolUniverse: Use ToolUniverse to benchmark AI agents on real scientific tool use across data analysis, simulation, and modeling tasks. Study success rates, failure modes, tool selection errors, and robustness across domains. (Contact: Marinka Zitnik)

  • Learning Tool-Use Policies for Multi-Step Scientific Reasoning: Develop agents that learn when and how to invoke scientific tools from ToolUniverse (https://aiscientist.tools/) to solve multi-step biomedical problems. Analyze trade-offs between autonomous planning, tool chaining, and human-in-the-loop control. (Contact: Marinka Zitnik)

  • Protein Function AI Scientist: Build an AI agent that predicts protein function by combining sequence, structure, interaction networks, and literature. Evaluate on held-out proteins using precision@k across functional categories, and compare against biological foundation models. Evaluate conditional generation of phenotypic descriptions: given a protein and a predicted function, generate a concise phenotype-level summary and score it against reference descriptions using automated text metrics and manual checks. (Contact: Marinka Zitnik)

  • Neurology AI Scientist: Develop an AI agent for neurology that reasons over single-cell genomics data. Evaluate on public single-cell datasets by measuring reproducibility across studies (train on one dataset, test on another), recovery of known disease-associated cell types, genes, and pathways, and enrichment against external references such as GWAS loci or curated gene sets. Background: PROTON. (Contact: Marinka Zitnik)

  • Cancer AI Scientist: Create an AI agent that integrates molecular, cellular, and clinical cancer data to stratify patients and generate testable hypotheses about mechanisms and therapeutic sensitivity. Evaluate on cancer cohorts with outcomes and treatment information (for example TCGA or cBioPortal studies) using endpoints such as overall survival, progression-free survival, response, or toxicity. Report stratification performance with time-to-event metrics (C-index, log-rank between predicted groups). For therapy hypotheses, evaluate whether predicted sensitive groups show higher response rates or improved outcomes under the matched treatments (with appropriate confounding controls). Background: Medea. (Contact: Marinka Zitnik)

  • Rare Disease AI Scientist: Develop an AI agent for rare disease drug repurposing to propose candidate therapies and their mechanistic rationale. The agent should produce ranked drug hypotheses with supporting evidence, including target pathways, gene-disease links, and safety or contraindication signals. Evaluate the system on retrospective rare-disease cases with known or clinician-supported treatments, using metrics such as top-k hit rate. Include a clinician review of a blinded subset to score clinical plausibility, and run tests for robustness under incomplete patient information. (Contact: Marinka Zitnik)

  • Integrating EHRs with LLMs for precision medicine: Combine electronic health records with large language models for personalized healthcare prediction. (Contact: Marinka Zitnik)

  • Geometric Deep Learning for Molecular Design and Optimization: Apply geometric deep learning methods to design and optimize molecular structures for drug design. (Contact: Marinka Zitnik)

  • Molecular Search Engines: Develop a molecular search engine for efficient retrieval of molecules, proteins, or compounds from large datasets. The system should support chemical structure and natural language queries by learning unified multimodal representations from sequence, structure, chemical graphs, and annotations. Evaluate performance using benchmark query–target pairs and targeted test cases. (Contact: Marinka Zitnik)

  • Longitudinal Clinical Reasoning with LLMs: Explore fine-tuning methods that help large language models reason over longitudinal clinical data for patient-level tasks such as treatment prediction and disease progression modeling. (Contact: Shvat Messica)

  • Uncertainty-aware LLMs for Medical Tasks: Advance methods to model uncertainty of medical AI models, such as abstaining, hedging, or requesting additional information when appropriate. (Contact: Shvat Messica)

  • Multimodal Foundation Models for Electronic Health Records: Build a model that integrates multiple EHR modalities, including clinical notes, structured data, and medical images, to support clinical prediction and decision support. (Contact: Shvat Messica)

  • What Pre-training Method is Optimal for Medical Image Data?: Self-supervised and contrastive pre-training work well in computer vision, but most methods are developed and ranked on natural images. Prior medical SSL results are hard to compare because studies vary in datasets, architectures, model size, compute, and training time. In this project, students will run a controlled benchmark of modern medical image pre-training methods using matched architectures, datasets, training duration, and compute budgets. They will compare contrastive SSL (SimCLR-style), distillation-based SSL (DINO-style), masked reconstruction or generative methods (MAE-style), joint embedding and predictive approaches, and vision-language pre-training on paired image–report data, then evaluate learned representations on downstream tasks such as disease classification, segmentation, radiology report generation, and visual question answering. (Contact: Mohammed Baharoon)

  • Can an Autonomous Agent Win a Medical Imaging Competition?: Students will build an end-to-end agent designed to place first in one selected medical imaging competition by specializing in that competition’s data formats, preprocessing, training recipe, evaluation, and submission protocol. The agent must run the full workflow autonomously: data preparation, model training, validation, debugging, and generating submission-ready prediction files. Target competitions may involve classification, segmentation, detection, or generation across modalities such as X-ray, CT, MRI, and pathology. This project requires extensive iteration, since agents may run for long periods to train and evaluate models. Students are expected to iterate weekly and should not plan to complete the work in a short or last-minute window. (Contact: Mohammed Baharoon)

  • Soft-CLIP for Medical Imaging Using Report-Similarity Targets: CLIP-style vision-language pre-training treats each image-report pair as a single hard positive, even though many radiology reports share findings and could serve as similar supervision. This project tests a two-stage “soft” approach. Students will first train (or adopt) a baseline CLIP model on paired images and reports, then refine it using pseudo-positives derived from report similarity based on shared positive findings: for each image, the model scores a pool of candidate reports, selects the top K, and learns to align the image with these semantically similar reports, not just the original pair. Students will compare retrieval and transfer to standard CLIP on one or two downstream tasks (e.g., classification or report generation) and study design choices such as how to compute report similarity and how to limit confirmation bias (e.g., mixing in the original report or using confidence thresholds). (Contact: Mohammed Baharoon)

  • Negation-Aware and Uncertainty-Aware Radiology Report Generation: Train a vision-text model that generates the Impression section from a chest X-ray using MIMIC-CXR (and CXR-PRO if you use it). Add an objective that penalizes negation errors, such as turning “no pneumothorax” into a positive finding. Add selective generation so the model can abstain or say it lacks confidence when uncertainty is high. Evaluate by extracting clinical findings from generated reports and comparing to the reference, measuring negation accuracy on edited or hard cases, and reporting calibration and coverage–error curves for abstention. (Contact: Rishabh Goel)

  • Shift-Robust ICU Early Warning System: Identify an ICU prediction task such as deterioration within 6–12 hours, mortality, or a sepsis proxy. Train strong baselines on MIMIC-IV, then test on eICU-CRD to measure the generalization drop across hospitals. Add one or two methods to improve reliability under shift, such as site-level recalibration, importance reweighting, conformal prediction, or drift detection with a “do not deploy” rule. (Contact: Rishabh Goel)

  • Target-Conditioned Molecule Generation with Toxicity Constraints: Train an activity predictor on ChEMBL and a toxicity predictor on a dataset such as Tox21. Build a generator (VAE, diffusion, or SMILES LLM) that conditions on high predicted activity for a target and low predicted toxicity. Add a simple explanation method to inspect what substructures drive the predictors and whether they match known motifs. Evaluate generation quality (validity, uniqueness, novelty, diversity) and constraint satisfaction (fraction that meet activity and toxicity thresholds). (Contact: Rishabh Goel)
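
Several of the ideas above evaluate patient stratification with time-to-event metrics such as the concordance index (C-index). As an illustrative sketch only (in practice you would likely use a tested library implementation, such as the one in lifelines or scikit-survival), a minimal Harrell's C-index for right-censored data can be computed by counting correctly ordered comparable pairs:

```python
import itertools


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times: observed follow-up time per patient
    events: 1 if the event (e.g., death) was observed, 0 if censored
    risk_scores: model output; higher score = predicted higher risk
    """
    concordant, permissible = 0.0, 0
    for i, j in itertools.combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simple version
        earlier, later = (i, j) if times[i] < times[j] else (j, i)
        # The pair is comparable only if the earlier time is an observed
        # event; a censored earlier time gives no ordering information.
        if not events[earlier]:
            continue
        permissible += 1
        if risk_scores[earlier] > risk_scores[later]:
            concordant += 1.0      # higher risk failed first: concordant
        elif risk_scores[earlier] == risk_scores[later]:
            concordant += 0.5      # tied scores count half
    return concordant / permissible if permissible else float("nan")
```

A C-index of 1.0 means the model's risk ordering perfectly matches observed event times, 0.5 is chance level, and values below 0.5 indicate an inverted ordering; full implementations also handle tied event times and ties under censoring more carefully.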