The most rigorous test yet of embedded LLM decision support showed a clear gain where a reviewer scores the chart, then lost it, step by step, on the way to the bedside. Watch the gain decay as it travels toward the patient.
Primary source: Nature Medicine — Penda Health "AI Consult" cluster-randomized trial
GPT-4o decision support · 39,849 visits · 15 primary-care clinics · Nairobi, Kenya
AI Consult ran GPT-4o quietly in the background, flagging issues at key moments in the visit. On the measures closest to the model — what a physician-reviewer later scored on the chart — the tool clearly helped. Then follow the same tool one step further, toward the thing the patient actually experiences.
Bars show the measured effect at each stage. The two left bars are relative reductions in clinician error, scored on chart review. The right bar is the outcome that reached the patient's body: 14-day treatment failure, 2.2% with AI vs 2.0% without — a difference the trial could not distinguish from zero.
Hold the generalization loosely. This trial ran in Nairobi primary care, not a US health system, and 14-day treatment failure was already low (~2%) in both arms — a floor that leaves little room to detect improvement, especially for conditions that resolve on their own. "No significant difference" is not "no effect"; it's "the study, at this size and in this setting, couldn't find one at the patient level." That caveat cuts both ways: it's exactly why a documentation win or a benchmark score is not evidence your tool changes outcomes. The burden is to measure the last step, not assume it.
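To make the floor effect concrete, here is a rough power calculation. It is a sketch, not the trial's own analysis. It uses the standard normal-approximation formula for comparing two proportions and assumes individually randomized visits, equal arm sizes, a two-sided alpha of 5%, and 80% power. The function name `visits_per_arm` and the 2%-to-1% scenario are illustrative assumptions, not figures from the paper.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visits_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test.

    Assumption: visits are individually randomized and arms are equal in size.
    A cluster-randomized design (as in the trial) would need more visits.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    p_bar = (p_control + p_treated) / 2            # pooled rate under the null
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p_control * (1 - p_control) + p_treated * (1 - p_treated))
    n = ((z_alpha * null_sd + z_beta * alt_sd) / (p_control - p_treated)) ** 2
    return ceil(n)


if __name__ == "__main__":
    # The rates reported above: 2.0% without AI vs 2.2% with AI.
    print("2.0% vs 2.2%:", visits_per_arm(0.020, 0.022))
    # Hypothetical scenario (not from the trial): the tool halves failure.
    print("2.0% vs 1.0%:", visits_per_arm(0.020, 0.010))
```

Under these assumptions, separating 2.0% from 2.2% takes on the order of 80,000 visits per arm, more than the trial's total visit count, while a hypothetical halving of the failure rate would need only a few thousand. Clustering by clinic would push both numbers higher, which is part of why a null result at this event rate says little about whether a small effect exists.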