Act or Defer - Error-Controlled Decision Policies for Medical Foundation Models

Clinical deployment of foundation models requires decision policies that operate under explicit error budgets, such as a cap on false-positive clinical calls. Strong average accuracy alone does not guarantee safety: errors can concentrate among patients selected for action, leading to harm and inefficient use of healthcare resources.

Here we introduce StratCP, a stratified conformal framework that turns foundation model predictions into decision-ready outputs through error-controlled selection and calibrated deferral. StratCP first selects a subset of patients for immediate clinical action while controlling the false discovery rate at a user-specified level. For the remaining patients, it returns prediction sets that achieve target coverage conditional on deferral, supporting confirmatory testing or expert review. When clinical guidelines define relationships among disease states, StratCP incorporates a utility graph to produce clinically coherent prediction sets without sacrificing coverage guarantees.

We evaluate StratCP in ophthalmology and neuro-oncology across diagnosis, biomarker prediction, and time-to-event prognosis. Across tasks, StratCP controls the false discovery rate among selected patients and provides valid, selection-conditional coverage for deferred patients. In neuro-oncology, it enables H&E-based diagnosis under a fixed error budget, reducing reliance on reflex molecular assays and lowering laboratory cost and turnaround time. StratCP establishes error-controlled decision policies for safe deployment of medical foundation models.

Overview of StratCP

Medical foundation models (FMs) achieve strong performance across imaging, pathology, and clinical prediction tasks. Several models have progressed to prospective evaluation and health system integration. Clinical deployment, however, requires more than high average accuracy. It requires decision policies that operate under explicit error budgets.

The central question is not only how accurate a model is, but when it is appropriate to act on its prediction. Clinicians must decide when to make a call, when to defer, and what follow-up to pursue. Without explicit error control, errors can concentrate among acted-upon patients. This can trigger unnecessary interventions, delay appropriate care, and waste limited diagnostic resources.

Safe clinical use of foundation models requires two guarantees:

  • Action guarantee: Identify patients for whom a prediction supports immediate action under a pre-specified error budget.
  • Deferral guarantee: Provide calibrated prediction sets for remaining patients, so the true disease status lies within the set at a target frequency.

The first requires error control within the acted-upon subset. The second requires coverage conditional on deferral. Marginal accuracy or marginal conformal coverage alone does not ensure these properties.

To address this gap, we introduce StratCP, a stratified conformal framework that converts foundation model outputs into decision-ready recommendations. StratCP operates as a post-processing layer and does not require retraining the underlying FM.

  1. Action arm: StratCP selects patients for immediate clinical action while controlling the false discovery rate (FDR), the expected fraction of incorrect selected predictions, at a user-specified level (e.g., 5%). Selection can be defined per disease status (e.g., tumor subtype, IDH mutant, favorable early survival).

  2. Deferral arm: For remaining patients, StratCP returns prediction sets that achieve selection-conditional coverage. Among deferred patients, the true disease status lies within the set at the target rate (e.g., 95%).

When diagnostic guidelines define relationships among disease states, StratCP can incorporate this structure through a utility graph. This produces clinically coherent prediction sets without sacrificing coverage guarantees.

StratCP Provides Guarantees for Safe Medical Decisions

StratCP builds on conformal prediction but extends it from marginal guarantees to action-aligned guarantees. Given a user-specified error tolerance (e.g., 5%) and labeled calibration data from the deployment setting, StratCP proceeds in two stages.

Action Arm: Select Confident Predictions Under an Explicit Error Budget

StratCP converts model outputs into candidate clinical actions, such as assigning a specific diagnosis or declaring favorable early survival (≥18 months).

For each candidate action, StratCP:

  • Computes a confidence score using expert-labeled calibration data.
  • Constructs conformal p-values.
  • Applies false discovery rate control to select patients for whom expected prediction error remains within the pre-specified budget.

Predictions that pass this threshold enter the action arm. Clinicians can act on these decisions with an explicit FDR guarantee (e.g., ≤5% incorrect among selected patients).

Deferral Arm: Calibrated Prediction Sets for Deferred Patients

Patients not selected in the action arm enter the deferral arm. These patients were not chosen for immediate action because their model scores did not meet the error-controlled threshold. As a result, the deferred group is not a random subset of the population. It is enriched for harder cases, those with lower top-class probability, smaller separation between leading labels, or survival scores close to the action cutoff.

This creates a distribution shift. Standard conformal prediction calibrates over the entire population, but here we need guarantees that apply specifically within the deferred subset.

StratCP addresses this by recalibrating only on calibration cases that would also have been deferred under the same action rule. Concretely, it reapplies the action-arm selection procedure to the calibration data and forms a reference group consisting of cases whose model scores fall below the selection threshold. It then constructs prediction sets using conformity scores computed within this deferral-matched reference group.

This ensures that coverage guarantees hold conditional on deferral, not just on average across all patients.

For each deferred patient, StratCP:

  • Identifies patients who would also have been deferred under the same action rule. StratCP re-applies the action-arm selection procedure to the calibration data and forms a reference group consisting only of patients whose model scores fall below the action threshold. These patients reflect situations where the model is not confident enough to support immediate action.

  • Computes conformity scores within this reference group. For each candidate label (or survival time), StratCP evaluates how well it aligns with the model’s prediction using a conformity score (e.g., cumulative probability for classification, model-based CDF values for survival). These scores quantify how typical or atypical each outcome is relative to similar deferred patients.

  • Constructs a prediction set with selection-conditional coverage. StratCP calibrates a cutoff using only the deferred reference group and includes all candidate labels whose conformity scores fall within this calibrated range. This guarantees that, among deferred patients, the true disease status lies in the prediction set at the target frequency (e.g., 95%).

This ensures that among deferred patients, the true disease status lies within the prediction set at the target rate. In time-to-event settings, StratCP returns a one-sided lower prediction bound on survival time, with coverage conditional on deferral.

Utility Enhancement Using Diagnostic Guidelines

Standard conformal prediction constructs prediction sets using model scores alone. Labels enter the set in descending order of predicted probability. While this preserves statistical coverage, it can group clinically unrelated disease states, especially when model probabilities are close.

In clinical settings, however, labels are structured. Diabetic retinopathy follows an ordinal severity scale. CNS tumor subtypes are organized by WHO grade. Many eye conditions share downstream management pathways. Prediction sets should reflect this structure so that deferred outputs align with how clinicians reason about follow-up.

StratCP includes an optional utility module that integrates this clinical knowledge without changing the coverage guarantee.

  • Encodes guideline relationships as a utility graph. Disease statuses are treated as nodes in a graph, and edges quantify how desirable it is for two labels to appear together. Edge weights can reflect ordinal adjacency, shared tumor grade, or overlap in recommended management actions derived from guidelines.

  • Reorders candidate labels by combining predicted probability with utility. StratCP starts from the highest-probability label and iteratively adds high-probability alternatives that maximize clinical coherence with labels already selected.

  • Returns the smallest prefix that satisfies coverage. After reordering, StratCP applies standard conformal calibration and returns the minimal set needed to achieve the target coverage level.

This procedure reshapes prediction sets for clinical coherence while preserving finite-sample guarantees.

Safe Use of Medical Foundation Models With StratCP

We evaluate StratCP in ophthalmology and neuro-oncology, pairing it with RETFound on retinal images and UNI on H&E whole-slide images across diagnosis, biomarker prediction, tumor subtyping, and time-to-event prognosis. StratCP controls false discoveries among selected patients in the action arm and provides selection-conditional coverage for deferred patients, so its outputs map to clinical decision points:

  • In multi-class eye-condition diagnosis (39 classes), StratCP and standard conformal prediction both achieve valid coverage, but StratCP supports action on more patients under the same 5% error budget (119.2 selected with FDR 0.048 vs. 97.5 with FDR 0.040).

  • In IDH mutation status prediction, StratCP stays within the 5% budget for IDH-mutant calls (FDR 0.046), while standard conformal prediction exceeds the budget on its selected subset (FDR 0.096).

  • In time-to-event prognosis for adult-type diffuse glioma, StratCP meets the 95% target among selected favorable early survivors (0.954), whereas a Top-1 threshold baseline falls short (0.777). StratCP also selects substantially more patients than marginal conformal selection (36.6 vs. 2.1), improving efficiency under the same error budget.

  • Utility graphs derived from diagnostic guidelines reshape deferred prediction sets without sacrificing coverage, improving clinical coherence for ordinal stages and WHO-grade relationships.

  • In adult-type diffuse glioma triage, StratCP supports H&E-only diagnosis for a subset of patients under a fixed 5% error budget, reducing reliance on reflex molecular testing and lowering turnaround time and assay cost.

Publication

Act or Defer: Error-Controlled Decision Policies for Medical Foundation Models
Ying Jin*, Intae Moon*, Marinka Zitnik
In Review 2026 [medRxiv]

@article{jin26error,
  title={Act or Defer: Error-Controlled Decision Policies for Medical Foundation Models},
  author={Jin, Ying and Moon, Intae and Zitnik, Marinka},
  journal={In Review},
  url={https://www.medrxiv.org/content/10.64898/2026.02.23.26346927},
  year={2026}
}

Code and Data Availability

Pytorch implementation of StratCP is available in the GitHub repository.

Authors

Latest News

Feb 2026:   Overton Prize

Our research has been recognized with the 2026 Overton Prize.

Feb 2026:   Foundation Models that Can 'Act or Defer'

Feb 2026:   Reasoning Model for Longitudinal Data

Feb 2026:   Context Switching AI in Nature Medicine

Jan 2026:   Zoom-Out and Zoom-In Retrieval for LLMs

Much of the world’s knowledge lies outside public web text accessible to LLMs, including internal ontologies, curated catalogs, drug safety tables, patient health data, and lab knowledge bases. ARK helps an LLM to choose, one step at a time, whether to look broadly for relevant information or to dig deeper by following specific links in the data.

Jan 2026:   AI Scientist for Therapeutic Discovery

Jan 2026:   AI Scientists - LLMs Using Scientific Tools

Excited about this academic collaboration with Anthropic on adding connectors to ToolUniverse to make Claude even more powerful for scientific discovery.

Dec 2025:   AI + Validation in Molecular, Organoid, and Clinical Systems

Dec 2025:   Digital Twinning

A piece in Harvard Gazette on digital twins, cellular chatbots, and building digital twins at a cellular scale.

Dec 2025:   Virtual Cells and Instruments

We are excited to meet hundreds of researchers attending our AI Virtual Cells and Instruments: A New Era in Drug Discovery and Development workshop at NeurIPS 2025.

Dec 2025:   CUREBench

Excited to see 1,622 researchers from around the world entering our CUREBench Challenge with 398 participating teams that made 3,383 submissions to the competition and submitted 8,457,500+ AI reasoning traces for therapeutics. Join us at the Award Ceremony at NeurIPS.

Dec 2025:   AI For Science at NeurIPS

Join us and hundreds of other scientists at the 6th AI for Science workshop at NeurIPS.

Nov 2025:   Protein Structure Tokenization

Nov 2025:   Generative AI Model for Spatial Biology

Nov 2025:   AI Cell Models

A piece in Science explores how AI cell models could transform biomedicine (if they work as promised) and highlights ToolUniverse. ToolUniverse lets AI co-scientists test, analyze, and build on AI cell models.

Oct 2025:   Is AI sycophancy holding science back?

A piece in Nature explores how AI sycophancy, in which models agree too much with users instead of reasoning on its own, could affect the use of AI in medical research.

Oct 2025:   Our research featured by Kempner and Crimson

A news story about PDGrapher in Harvard Crimson. ToolUniverse featured on the Kempner Institute blog.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics