The Effect Vanishes Before It Reaches the Patient

The trial

39,849

Patient visits

Clinics randomized

−16%

Diagnostic errors (reviewed)

≈0

Change in 14-day outcome

AI Consult ran GPT-4o quietly in the background, flagging issues at key moments in the visit. On the measures closest to the model — what a physician-reviewer later scored on the chart — the tool clearly helped. Then follow the same tool one step further, toward the thing the patient actually experiences.

The impact gradient

Same tool, three distances from the model

Bars show the measured effect at each stage. The two left bars are relative reductions in clinician error, scored on chart review. The right bar is the outcome that reached the patient's body: 14-day treatment failure, 2.2% with AI vs 2.0% without — a difference the trial could not distinguish from zero.

The mechanism

Why the signal drains out

Between the model's correct flag and the patient sits a person who has to act on it. In this trial, clinicians fully followed only 19.5% of the appropriate red alerts. An alert nobody acts on is an override by omission — the same failure the newsletter describes at the payer's appeal desk, relocated to the exam room. Four out of five correct, high-priority warnings changed nothing the patient would feel.

The 80/20 that survives this trial: stop reporting "accuracy." Report action rate — the fraction of the model's correct, high-priority flags that actually changed a decision. It's the only number that separates a tool that helps a patient from a tool that improves a note. Build it into the eval on day one.

The critical lens

Hold the generalization loosely. This trial ran in Nairobi primary care, not a US health system, and 14-day treatment failure was already low (~2%) in both arms — a floor that leaves little room to detect improvement, especially for conditions that resolve on their own. "No significant difference" is not "no effect"; it's "the study, at this size and in this setting, couldn't find one at the patient level." That caveat cuts both ways: it's exactly why a documentation win or a benchmark score is not evidence your tool changes outcomes. The burden is to measure the last step, not assume it.

The gain was real where the model writes — the note, the differential a reviewer grades. It faded at every step toward the patient because each step needs a human to act, and the humans acted on one flag in five. "Human in the loop" isn't a safety feature if the human is the resistor the signal dies in.

Read the Nature Medicine trial → Or read today's newsletter — "The Override Problem" →