620 real physician questions, 30 specialties, graded head-to-head by 149 doctors. A condensed look at who won where — and an honest look at what a benchmark of short, pre-digested questions can and can't tell you.
Primary source: Real-POCQi paper (arXiv 2606.28960) · dataset · CC-BY-4.0
Each dot is a specialty: horizontal = how point-of-care its questions are (management/dosing/diagnosis vs. knowledge recall), vertical = OpenEvidence's win rate, size = number of questions. It looks like a downward slope: the more bedside the question, the less OpenEvidence dominates. But drag the sample-size floor and watch that slope evaporate.
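To make the sample-size floor concrete, here is a minimal sketch of how such a filter can change a fitted slope. Everything in it is an assumption for illustration: the numbers are invented toy values, not Real-POCQi data, and `slope_above_floor` is a hypothetical helper rather than the code behind the chart. It drops specialties with fewer than `min_n` questions and fits an ordinary least-squares line to what remains.

```python
from typing import List, Optional, Tuple

# (point_of_care_share, win_rate, n_questions)
# TOY VALUES ONLY: invented for illustration, not taken from the dataset.
Specialty = Tuple[float, float, int]

TOY_SPECIALTIES: List[Specialty] = [
    (0.40, 0.74, 40),  # three larger specialties with the same win rate
    (0.60, 0.74, 35),
    (0.80, 0.74, 30),
    (0.20, 0.95, 5),   # two small specialties at the extremes
    (0.90, 0.50, 4),
]


def slope_above_floor(points: List[Specialty], min_n: int) -> Optional[float]:
    """Least-squares slope of win rate vs. point-of-care share,
    using only specialties with at least `min_n` questions.
    Returns None if fewer than two points remain or x has no spread."""
    kept = [(x, y) for x, y, n in points if n >= min_n]
    if len(kept) < 2:
        return None
    mean_x = sum(x for x, _ in kept) / len(kept)
    mean_y = sum(y for _, y in kept) / len(kept)
    sxx = sum((x - mean_x) ** 2 for x, _ in kept)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in kept)
    return sxy / sxx


slope_all = slope_above_floor(TOY_SPECIALTIES, min_n=0)     # clearly negative
slope_large = slope_above_floor(TOY_SPECIALTIES, min_n=20)  # approximately zero
print(f"no floor: {slope_all:.2f}, floor of 20: {slope_large:.2f}")
```

In this toy set the whole downward trend comes from the two smallest specialties, so raising the floor flattens the line. Whether the real data behaves the same way is what the slider lets you check.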
OpenEvidence wins nearly three of four head-to-heads — a real result, blind-graded by specialty-matched physicians. The question this interactive asks is not whether it wins, but what the contest actually measured.
Every Real-POCQi question is a clean paragraph — a tidy vignette with the salient labs already pulled and the noise already stripped. Real point-of-care doesn't arrive that way. To see the gap, we ran a realistic case against OpenEvidence itself (the benchmark's winner), logged in, on July 2.
A 62-year-old kidney-transplant patient rolls into the ED febrile. The chart has 1,200+ notes and 1,000+ prior labs: an old ESBL E. coli isolate buried in a culture from 8 months ago, a tacrolimus trend across dozens of draws, a shifting creatinine baseline. Nothing is labeled. The clock is running.
"62M, transplant on tacro/MMF/pred, 2 days fever, dysuria, Cr 1.4→2.4, tacro trough 11.8, prior ESBL E. coli, WBC 14.2, lactate 1.9, UA +LE/+nitrites. ED workup, empiric antibiotics, adjust immunosuppression?" — this is what a benchmark question looks like. The synthesis already happened.
It answered well. It called the case a "complicated UTI (likely transplant pyelonephritis)," flagged that the ESBL history points to empiric IV carbapenem therapy, recommended urine and blood cultures plus a renal allograft ultrasound, and caught the supratherapeutic tacrolimus trough (11.8), which needs a dose reduction. Guideline-concordant, fast, well-cited.
Of 620 questions, 370 (60%) are genuine clinical decisions, 236 (38%) are ambiguous (clinical facts phrased as "what is…"), and 14 (2%) are pure literature recall (trial enrollment %, mechanism of action). Even the point-of-care 60% are single-turn text with no attached record — the cleanest possible version of a messy job.
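The percentages above follow directly from the counts. A quick check, using only the three figures quoted in the text (the category labels are shorthand chosen here, not the paper's terms):

```python
# Question counts as quoted in the text; labels are informal shorthand.
counts = {
    "clinical decision": 370,
    "ambiguous": 236,
    "literature recall": 14,
}

total = sum(counts.values())

# Share of each category, rounded to the nearest whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} of {total} ({shares[label]}%)")
```

The three counts sum to 620, and the rounded shares come out to 60%, 38%, and 2%.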