Error Controlled Decisions for Safe Use of Medical Foundation Models

Foundation models show promise in medicine, but clinical use requires outputs that clinicians can act on under pre-specified error budgets, such as a cap on false-positive clinical calls. Without error control, strong average accuracy can still lead to harmful errors among the very cases labeled confident and to inefficient use of follow-up testing.

Here we introduce StratCP, a stratified conformal framework that turns foundation-model predictions into decision-ready outputs by combining selective action with calibrated deferral. StratCP first selects a subset of patients for immediate clinical calls while controlling the false discovery rate among those calls at a user-chosen level. It then returns calibrated prediction sets for deferred patients that meet the target error rate and guide confirmatory evaluation. The procedure is model agnostic and can be applied to pretrained foundation models without retraining.

We validate StratCP in ophthalmology and neuro-oncology across diagnostic classification and time-to-event prognosis. Across tasks, StratCP maintains false discovery rate control on selected patients and produces coherent prediction sets for deferred patients. In neuro-oncology, it supports diagnosis from H&E whole-slide images under a fixed error budget, reducing the need for reflex molecular assays and lowering laboratory cost and turnaround time. StratCP lays the groundwork for safe use of medical foundation models by converting predictions into error-controlled actions when evidence is strong and calibrated uncertainty otherwise.

Overview of StratCP

Foundation models (FMs) perform well across clinical tasks, including retinal imaging, whole-slide histopathology, and clinical question answering. Some FMs have also been tested in prospective or health system settings, including language models evaluated on diagnosis and risk prediction tasks and a pathology FM assessed in silent use for EGFR screening. Routine clinical use, however, shifts the requirement from performing well on benchmarks to producing outputs that support safe action under pre-specified error budgets. The key challenge is not only how accurate a model is on average, but when it is appropriate to act on its output. Clinicians need guidance on when to make a call, when to defer, and what follow-up to order, which in turn requires decision policies built around those error budgets. Without such policies, model-driven decisions can trigger unnecessary or harmful interventions, delay appropriate care, and waste limited diagnostic resources.

For safe clinical use of foundation models, uncertainty quantification should provide two guarantees. First, it should identify cases where a prediction is reliable enough to act on. Second, it should return calibrated candidate diagnoses for the remaining cases to guide confirmatory testing or expert review. The first requires error control within the acted-upon (selected) subset. The second requires coverage within the deferred group. With guarantees aligned to clinical decisions, a clinician can make a call under a specified error budget or defer with calibrated differential diagnosis sets, enabling safe action when evidence is sufficient and follow-up when it is not.

To meet these guarantees, we introduce StratCP, a stratified conformal framework that combines selective action with calibrated deferral and can wrap any trained FM without retraining.

  • In the action arm, StratCP selects a subset of cases for immediate decisions while controlling the false discovery rate (FDR), the expected fraction of incorrect acted-upon predictions, at a pre-specified level for task-specific outputs, such as tumor subtype or long-survivor status.

  • In the deferral arm, StratCP returns conformal prediction sets with selection-conditional coverage. This ensures that, among deferred patients, the true diagnosis falls within the set at the target frequency.
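
In symbols, the two guarantees above can be written roughly as follows. The notation is an assumption introduced here for illustration (it is not taken from the paper's Methods): S is the set of test patients selected for an immediate call, Ŷ_j is the call for patient j, Y_j is the true label, Ĉ(X_j) is the prediction set returned for a deferred patient, and α is the user-chosen error budget.

```latex
% Action arm: FDR control among acted-upon calls (assumed notation).
\mathrm{FDR}
  = \mathbb{E}\left[
      \frac{\sum_{j \in \mathcal{S}} \mathbf{1}\{\hat{Y}_j \neq Y_j\}}
           {\max(1, |\mathcal{S}|)}
    \right] \le \alpha

% Deferral arm: coverage conditional on not being selected (assumed notation).
\mathbb{P}\left( Y_j \in \widehat{C}(X_j) \,\middle|\, j \notin \mathcal{S} \right) \ge 1 - \alpha
```

The first line bounds the expected fraction of wrong calls among selected patients; the second asks that coverage hold within the deferred group specifically, not only on average over all patients.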

When available, StratCP can use clinical diagnostic guidelines to produce clinically coherent prediction sets without sacrificing coverage.

StratCP Provides Guarantees for Safe Medical Decisions

StratCP builds on conformal prediction, a widely used approach for uncertainty quantification. Given a specified error tolerance, such as 5%, and a set of new patients, StratCP proceeds in two steps. First, it selects cases where the model prediction can be used for an immediate call, while keeping the error rate among these acted-upon cases within the pre-specified budget (for example, at most 5% of selected predictions are incorrect in expectation). Second, for cases not selected for an immediate call, StratCP returns a prediction set of plausible labels, such as {normal, mild}, with a guarantee that the true disease status lies in the set for at least 95% of deferred patients (see Methods for details and theoretical guarantees).

Step 1: Select confident predictions under an error budget. StratCP turns the model output into candidate decisions, such as assigning a specific disease status. For each candidate diagnosis, StratCP uses expert-labeled reference data to compute a confidence measure for each new patient and sets a decision threshold by testing against the reference data. Cases whose confidence exceeds this threshold are marked as confident. These confident calls can then be used directly for decision making under the chosen error budget.
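
The selection step can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes the confidence measure is a conformal p-value computed against reference cases where the candidate call was wrong, and that the threshold is set with the Benjamini-Hochberg procedure. The page does not name the exact test, so both choices, and the function names `conformal_pvalues` and `benjamini_hochberg`, are assumptions made here.

```python
def conformal_pvalues(calib_scores, calib_is_correct, test_scores):
    """Conformal p-values for the null 'the candidate call is wrong'.

    calib_scores: model confidence for the candidate label on reference cases.
    calib_is_correct: True where the candidate label matches the expert label.
    A small p-value means the test score is higher than almost every
    reference case on which the call would have been wrong.
    """
    n = len(calib_scores)
    wrong = [s for s, ok in zip(calib_scores, calib_is_correct) if not ok]
    return [(1 + sum(1 for w in wrong if w >= s)) / (n + 1) for s in test_scores]


def benjamini_hochberg(pvalues, alpha):
    """Return sorted indices of test cases selected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= alpha * rank / m:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])


# Toy reference data (illustrative values only): 14 correct calls, 5 wrong calls.
calib_scores = [0.99, 0.98, 0.96, 0.95, 0.93, 0.91, 0.90, 0.88, 0.87, 0.85,
                0.82, 0.80, 0.75, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
calib_is_correct = [True] * 14 + [False] * 5
test_scores = [0.97, 0.92, 0.50, 0.15]

pvals = conformal_pvalues(calib_scores, calib_is_correct, test_scores)
selected = benjamini_hochberg(pvals, alpha=0.2)
deferred = [j for j in range(len(test_scores)) if j not in selected]
```

Cases in `selected` would receive an immediate call; cases in `deferred` move on to Step 2.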

Step 2: Calibrated prediction sets for deferred cases. The second step of StratCP returns prediction sets for patients whose FM-predicted diagnoses are not confident enough for an immediate call. After Step 1, deferred cases differ from the full reference set because they are enriched for ambiguity. StratCP accounts for this shift by calibrating uncertainty using only expert-labeled cases with similar ambiguity, following recent work on post-selection conformal inference. Given a deferred case, StratCP finds a subset of expert-diagnosed patients whose FM outputs are similarly ambiguous, using the same confidence criterion as in Step 1. It then forms a prediction set by including each candidate diagnosis for which the corresponding test statistic falls within the expected range when compared to these similarly ambiguous expert-labeled cases.
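
A simplified sketch of this idea is shown below. It assumes a fixed confidence threshold `tau` as a stand-in for the Step 1 criterion and uses the standard split-conformal quantile with the score `1 - probability of the label`; the actual method builds its reference subset to account for the data-dependent selection, which this sketch does not reproduce. The names `conformal_quantile`, `deferred_prediction_set`, and `tau` are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest score, or +inf if n is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(scores)[k - 1]


def deferred_prediction_set(test_probs, calib_probs, calib_labels, tau, alpha):
    """Prediction set for one deferred case.

    test_probs / calib_probs[i]: dict mapping label -> model probability.
    Only reference cases that are also ambiguous (top probability below tau)
    are used for calibration, mirroring the 'similar ambiguity' idea.
    """
    reference = [(p, y) for p, y in zip(calib_probs, calib_labels)
                 if max(p.values()) < tau]
    scores = [1.0 - p[y] for p, y in reference]  # nonconformity of the true label
    q = conformal_quantile(scores, alpha)
    return {label for label, prob in test_probs.items() if 1.0 - prob <= q}


# Toy reference data (illustrative values only).
calib_probs = [
    {"normal": 0.95, "mild": 0.05},  # confident, excluded from the reference subset
    {"normal": 0.60, "mild": 0.40},
    {"normal": 0.70, "mild": 0.30},
    {"normal": 0.55, "mild": 0.45},
    {"normal": 0.80, "mild": 0.20},
]
calib_labels = ["normal", "normal", "mild", "mild", "normal"]

ambiguous_case = {"normal": 0.65, "mild": 0.35}
pred_set = deferred_prediction_set(ambiguous_case, calib_probs, calib_labels,
                                   tau=0.9, alpha=0.25)
```

Calibrating on the ambiguous subset rather than on all reference cases typically yields larger sets for deferred patients, which is the price of keeping coverage valid within that group.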

Optional utility enhancement using diagnostic-guideline knowledge. When diagnostic-guideline knowledge about relationships among labels is available, StratCP includes an optional module that shapes prediction sets to better match clinical reasoning for deferred cases. StratCP changes the order in which candidate labels enter the prediction set by combining model scores with a CP utility graph derived from diagnostic guidelines that encodes which label combinations are preferred. Starting from the most probable model label, it iteratively adds the label that increases set utility the most, until the set reaches the size required for valid coverage.
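
The greedy ordering described above might look like the sketch below. The way model probability and graph utility are combined (an additive gain with weight `lam`) and the pairwise utility encoding are assumptions for illustration; the page does not specify the combination rule. The function name `greedy_utility_order` is hypothetical.

```python
def greedy_utility_order(probs, utility, lam=1.0):
    """Order labels for entry into a prediction set.

    probs: dict label -> model probability.
    utility: dict frozenset({a, b}) -> pairwise utility from a guideline graph
             (missing pairs count as 0).
    lam: hypothetical weight trading off model score against graph utility.
    """
    labels = list(probs)
    current = [max(labels, key=lambda label: probs[label])]  # most probable label first
    remaining = [label for label in labels if label != current[0]]
    while remaining:
        def gain(label):
            coherence = sum(utility.get(frozenset((s, label)), 0.0) for s in current)
            return probs[label] + lam * coherence
        best = max(remaining, key=gain)
        current.append(best)
        remaining.remove(best)
    return current


# Toy example: diabetic retinopathy stages, adjacent stages preferred together.
probs = {"none": 0.05, "mild": 0.50, "moderate": 0.20, "severe": 0.25}
adjacency = {
    frozenset(("none", "mild")): 1.0,
    frozenset(("mild", "moderate")): 1.0,
    frozenset(("moderate", "severe")): 1.0,
}

order_with_graph = greedy_utility_order(probs, adjacency, lam=0.5)
order_without_graph = greedy_utility_order(probs, adjacency, lam=0.0)
```

If calibration requires a set of size two for this case, the first two entries of the ordering form the set: with the graph this gives neighboring stages (mild, moderate), while pure probability ordering gives the non-adjacent pair (mild, severe).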

Safe Use of Medical Foundation Models With StratCP

We evaluate StratCP in ophthalmology and oncology, pairing it with a vision foundation model on retinal images and a pathology foundation model on H&E whole-slide images for diagnosis, biomarker prediction, and prognosis. StratCP controls error among acted-upon (selected) cases and returns valid coverage for deferred cases, so the outputs map to decision points.

In eye-condition diagnosis, both StratCP and standard conformal prediction (CP) meet 95% coverage, but StratCP yields more actionable calls on average while keeping the selected-set error rate near the target.

In IDH mutation status prediction, StratCP keeps the acted-upon error rate within the 5% budget for IDH-mutant calls (FDR 0.046), whereas a standard conformal prediction approach overspends the budget on its acted-upon subset (FDR 0.110).

In time-to-event prognosis, StratCP reaches the nominal 95% coverage among selected long-survivors (0.952), while a threshold baseline falls short (0.797).
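
For readers reproducing this kind of evaluation, the two headline quantities can be computed on a labeled test split as in the sketch below. The helper names and the toy labels are illustrative assumptions and are unrelated to the reported numbers; FDR is the expectation of the false discovery proportion computed here, so in practice it would be averaged over repeated splits.

```python
def false_discovery_proportion(calls, truth):
    """Fraction of acted-upon calls that are wrong.

    calls: dict patient_id -> called label, for selected patients only.
    truth: dict patient_id -> true label.
    By convention the proportion is 0 when nothing is selected.
    """
    if not calls:
        return 0.0
    wrong = sum(1 for pid, label in calls.items() if truth[pid] != label)
    return wrong / len(calls)


def deferred_coverage(pred_sets, truth):
    """Fraction of deferred patients whose true label is in their prediction set.

    Returns None when no patient is deferred, since coverage is then undefined.
    """
    if not pred_sets:
        return None
    hits = sum(1 for pid, labels in pred_sets.items() if truth[pid] in labels)
    return hits / len(pred_sets)


# Toy labels (illustrative only).
truth = {1: "mutant", 2: "mutant", 3: "wildtype", 4: "mutant", 5: "wildtype"}
calls = {1: "mutant", 2: "mutant", 3: "mutant", 4: "mutant"}
pred_sets = {5: {"wildtype", "mutant"}}

fdp = false_discovery_proportion(calls, truth)
coverage = deferred_coverage(pred_sets, truth)
```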

We also show that utility graphs derived from expert diagnostic guidelines can shape prediction sets without sacrificing conformal coverage. This produces differential diagnosis sets that respect clinical adjacency, such as neighboring diabetic retinopathy stages or related CNS tumor categories, and better match the follow-up actions a clinician would take.

Finally, in a neuro-oncology application, StratCP provides error-controlled diagnoses for adult-type diffuse glioma. By finalizing a subset of cases without reflex molecular testing within the error budget, it can reduce confirmatory assays, lowering laboratory cost and shortening time to diagnosis.

Publication

Error Controlled Decisions for Safe Use of Medical Foundation Models
Ying Jin*, Intae Moon*, Marinka Zitnik
In Review 2026 [arXiv]

@article{jin26error,
  title={Error Controlled Decisions for Safe Use of Medical Foundation Models},
  author={Jin, Ying and Moon, Intae and Zitnik, Marinka},
  journal={In Review},
  url={},
  year={2026}
}

Code and Data Availability

A PyTorch implementation of StratCP is available in the GitHub repository.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics