clinicians.build · July 2, 2026

The Real-POCQi Explorer

620 real physician questions, 30 specialties, graded head-to-head by 149 doctors. A condensed look at who won where — and an honest look at what a benchmark of short, pre-digested questions can and can't tell you.

Primary source: Real-POCQi paper (arXiv 2606.28960) · dataset · CC-BY-4.0

620 real questions · 30 specialties · 149 MD graders · 5,780 head-to-head ratings · 74.9% OpenEvidence win rate

Does OpenEvidence win more where the questions are more bedside?

Each dot is a specialty: horizontal = how point-of-care its questions are (management/dosing/diagnosis vs. knowledge recall), vertical = OpenEvidence's win rate, size = number of questions. It looks like a downward slope — the more bedside the question, the less OE dominates. But drag the sample-size floor and watch that slope evaporate.

[Interactive chart: OpenEvidence win rate vs. point-of-care share, one dot per specialty. Slider: minimum questions per specialty. Legend: <15 questions (noisy), ≥15 questions. Readouts: correlation (shown), specialties shown, questions covered.]
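
A rough sketch of what the sample-size slider is doing, in Python: drop specialties below a question-count floor, then recompute a Pearson correlation between point-of-care share and win rate. The rows are made-up toy values, not figures from the Real-POCQi dataset, and the function names are illustrative. The interactive may compute its correlation differently (for example, weighted by question count).

```python
from math import sqrt

# Hypothetical toy rows, NOT values from the Real-POCQi dataset:
# (specialty label, number of questions, point-of-care share, OpenEvidence win rate)
TOY_SPECIALTIES = [
    ("A", 4, 0.90, 0.55),
    ("B", 6, 0.20, 0.95),
    ("C", 8, 0.80, 0.60),
    ("D", 25, 0.55, 0.75),
    ("E", 30, 0.65, 0.74),
    ("F", 40, 0.60, 0.76),
    ("G", 35, 0.50, 0.75),
]


def pearson(xs, ys):
    """Unweighted Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


def correlation_at_floor(rows, min_questions):
    """Keep specialties with at least min_questions, then correlate.

    Returns (correlation, specialties shown, questions covered).
    """
    kept = [row for row in rows if row[1] >= min_questions]
    corr = pearson([row[2] for row in kept], [row[3] for row in kept])
    return corr, len(kept), sum(row[1] for row in kept)


if __name__ == "__main__":
    for floor in (1, 15):
        corr, shown, covered = correlation_at_floor(TOY_SPECIALTIES, floor)
        print(f"floor={floor}: r={corr:.2f}, specialties={shown}, questions={covered}")
```

In this toy set, the three smallest specialties carry almost all of the slope, so raising the floor to 15 leaves four specialties and a much weaker correlation. That is the pattern the slider is meant to expose.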

Overall win rate, all 5,780 ratings

OpenEvidence wins nearly three of four head-to-heads — a real result, blind-graded by specialty-matched physicians. The question this interactive asks is not whether it wins, but what the contest actually measured.

Point-of-care is two halves. The benchmark tests one.

Every Real-POCQi question is a clean paragraph — a tidy vignette with the salient labs already pulled and the noise already stripped. Real point-of-care doesn't arrive that way. To see the gap, we ran a realistic case against OpenEvidence itself (the benchmark's winner), logged in, on July 2.

The reality · what's actually in front of you

A 62-year-old kidney-transplant patient rolls into the ED febrile. The chart has 1,200+ notes and 1,000+ prior labs: an old ESBL E. coli isolate buried in a culture from 8 months ago, a tacrolimus trend across dozens of draws, a shifting creatinine baseline. Nothing is labeled. The clock is running.

↓ a human reads the chart and distills it ↓
The question · after a clinician does the hard part

"62M, transplant on tacro/MMF/pred, 2 days fever, dysuria, Cr 1.4→2.4, tacro trough 11.8, prior ESBL E. coli, WBC 14.2, lactate 1.9, UA +LE/+nitrites. ED workup, empiric antibiotics, adjust immunosuppression?" — this is what a benchmark question looks like. The synthesis already happened.

↓ now the tool answers ↓
OpenEvidence · live, July 2, 2026

It answered well. It called the presentation a "complicated UTI (likely transplant pyelonephritis)," flagged that the ESBL history points to empiric IV carbapenem therapy, recommended urine and blood cultures plus a renal allograft ultrasound, and caught the supratherapeutic tacrolimus trough (11.8), which needs a dose reduction. The answer was guideline-concordant, fast, and well cited.

OpenEvidence earned its win on the half it was handed. But the risky, time-consuming half of point-of-care is building that paragraph out of 1,200 documents: knowing that the 8-month-old ESBL isolate matters, that 1.4 is the real baseline, that 11.8 is high for this patient. Real-POCQi never tests that half, and the tool does not do it either. "Beats physicians 75% of the time" is a claim about answering a question that a clinician already did the hard work of asking.

60% point-of-care — but read the fine print

Of 620 questions, 370 (60%) are genuine clinical decisions, 236 (38%) are ambiguous (clinical facts phrased as "what is…"), and 14 (2%) are pure literature recall (trial enrollment %, mechanism of action). Even the point-of-care 60% are single-turn text with no attached record — the cleanest possible version of a messy job.
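
The percentages above can be checked with a few lines of Python. This is only an arithmetic sanity check on the counts quoted in the text. The category labels are shorthand chosen here, not names taken from the dataset.

```python
# Counts as quoted in the text; the labels are shorthand for this sketch.
counts = {
    "clinical decision": 370,
    "ambiguous": 236,
    "literature recall": 14,
}

# The three categories should account for every question.
total = sum(counts.values())

# Share of each category, rounded to whole percent as in the text.
shares = {label: round(100 * n / total) for label, n in counts.items()}

if __name__ == "__main__":
    print(total)
    for label, pct in shares.items():
        print(f"{label}: {counts[label]} of {total} ({pct}%)")
```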
