clinicians.build · July 3, 2026

The Effect Vanishes Before It Reaches the Patient

The most rigorous test yet of embedded LLM decision support moved the needle where a reviewer scores the chart — and lost it, step by step, on the way to the bedside. Watch the gain decay as it travels toward the patient.

Primary source: Nature Medicine — Penda Health "AI Consult" cluster-randomized trial
GPT-4o decision support · 39,849 visits · 15 primary-care clinics · Nairobi, Kenya

39,849 · Patient visits
15 · Clinics randomized
−16% · Diagnostic errors (reviewed)
≈0 · Change in 14-day outcome

AI Consult ran GPT-4o quietly in the background, flagging issues at key moments in the visit. On the measures closest to the model — what a physician-reviewer later scored on the chart — the tool clearly helped. Then follow the same tool one step further, toward the thing the patient actually experiences.

Same tool, three distances from the model

Bars show the measured effect at each stage. The two left bars are relative reductions in clinician error, scored on chart review. The right bar is the outcome that reached the patient's body: 14-day treatment failure, 2.2% with AI vs 2.0% without — a difference the trial could not distinguish from zero.

Why the signal drains out

Between the model's correct flag and the patient sits a person who has to act on it. In this trial, clinicians fully followed only 19.5% of the appropriate red alerts. An alert nobody acts on is an override by omission: the same failure the newsletter describes at the payer's appeal desk, relocated to the exam room. Roughly four out of five correct, high-priority warnings were not fully acted on, so they changed little or nothing the patient would feel.

The 80/20 that survives this trial: stop reporting "accuracy." Report action rate, the fraction of the model's correct, high-priority flags that actually changed a decision. It's the only number that separates a tool that helps a patient from a tool that improves a note. Build it into the eval on day one.
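
Action rate as defined above is simple enough to sketch in a few lines. The version below is illustrative only: the `Flag` record, its three fields, and the `action_rate` function are hypothetical names, not anything from the trial or from AI Consult, and the sample log is synthetic. It assumes each alert has already been labeled after the fact for correctness, priority, and whether the clinician's plan changed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Flag:
    """One model alert, labeled after the fact (all fields are hypothetical)."""
    correct: bool            # a reviewer judged the alert clinically appropriate
    high_priority: bool      # e.g. a "red" alert
    changed_decision: bool   # the clinician's plan changed in response


def action_rate(flags: Iterable[Flag]) -> Optional[float]:
    """Fraction of correct, high-priority flags that changed a decision.

    Returns None when there are no eligible flags, so an empty
    denominator is never reported as 0% or 100%.
    """
    eligible = [f for f in flags if f.correct and f.high_priority]
    if not eligible:
        return None
    acted = sum(f.changed_decision for f in eligible)
    return acted / len(eligible)


if __name__ == "__main__":
    # Synthetic log, not trial data: 5 eligible flags, 1 of them acted on.
    log = [Flag(True, True, True)] + [Flag(True, True, False)] * 4
    # These two are excluded from the denominator (incorrect / low priority).
    log += [Flag(False, True, True), Flag(True, False, True)]
    print(f"action rate: {action_rate(log):.1%}")
```

The design choice that matters is the denominator: only flags that were both correct and high priority count, so a tool cannot raise its action rate by firing fewer or safer alerts. How "changed a decision" gets labeled is the hard part, and this sketch leaves it to whoever builds the eval.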

Hold the generalization loosely. This trial ran in Nairobi primary care, not a US health system, and 14-day treatment failure was already low (~2%) in both arms — a floor that leaves little room to detect improvement, especially for conditions that resolve on their own. "No significant difference" is not "no effect"; it's "the study, at this size and in this setting, couldn't find one at the patient level." That caveat cuts both ways: it's exactly why a documentation win or a benchmark score is not evidence your tool changes outcomes. The burden is to measure the last step, not assume it.
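
The floor effect can be made concrete with a textbook sample-size approximation. The sketch below uses the standard two-proportion z-test formula with the roughly 2% baseline from the text. Everything else is an assumption for illustration: the function name `n_per_arm`, the 5% significance level, the 80% power target, and the candidate effect sizes are not from the trial. It also treats patients as independent, which a cluster-randomized design is not, so real requirements would be larger.

```python
from statistics import NormalDist


def n_per_arm(p_control: float, p_treated: float,
              alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate sample size per arm for a two-sided two-proportion z-test.

    Assumes individually randomized, independent patients. A
    cluster-randomized design needs more (multiply by a design effect).
    """
    if p_control == p_treated:
        raise ValueError("rates must differ")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2


if __name__ == "__main__":
    baseline = 0.02  # ~2% 14-day treatment failure, from the text
    # Hypothetical effect sizes, not trial results.
    for relative_reduction in (0.10, 0.16, 0.30):
        treated = baseline * (1 - relative_reduction)
        n = n_per_arm(baseline, treated)
        print(f"{relative_reduction:.0%} relative reduction: ~{n:,.0f} per arm")
```

Under these assumptions, the required sample grows quickly as the assumed effect shrinks, and a low baseline rate makes the same relative reduction much harder to detect than it would be for a common outcome. That is the arithmetic behind "couldn't find one at the patient level."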

The gain was real where the model writes — the note, the differential a reviewer grades. It faded at every step toward the patient because each step needs a human to act, and the humans acted on one flag in five. "Human in the loop" isn't a safety feature if the human is the resistor the signal dies in.