The Real-POCQi Explorer — clinicians.build

The benchmark

620 Real questions

Specialties

149 MD graders

5,780 Head-to-head ratings

74.9% OpenEvidence win

One dot per specialty

Does OpenEvidence win more where the questions are more bedside?

Each dot is a specialty. Horizontal position shows how point-of-care its questions are (management, dosing, and diagnosis versus knowledge recall), vertical position shows OpenEvidence's win rate, and size shows the number of questions. At first glance the dots slope downward: the more bedside the questions, the less OpenEvidence (OE) dominates. But drag the sample-size floor upward and watch that slope evaporate.

[Interactive chart: scatter plot of specialties with a slider for minimum questions per specialty (default: specialties with ≥ 1 question). Legend: <15 questions (noisy); ≥15 questions. Live readouts: correlation (shown), specialties shown, questions covered.]
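The effect of the sample-size floor can be sketched in a few lines of Python. This is a minimal illustration, not the explorer's actual code: the `Specialty` fields, the `pearson` and `summarize` helpers, and every number in `demo` are hypothetical, invented only to show how a handful of tiny specialties can create a slope that disappears once they are filtered out.

```python
from math import sqrt
from typing import NamedTuple, Optional


class Specialty(NamedTuple):
    name: str
    n_questions: int
    poc_share: float    # hypothetical: fraction of questions that are point-of-care
    oe_win_rate: float  # hypothetical: OpenEvidence win rate in this specialty


def pearson(xs, ys) -> Optional[float]:
    """Pearson correlation; None when it is undefined (too few points or no variance)."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def summarize(specialties, min_questions):
    """Recompute the three readouts for specialties at or above the floor."""
    shown = [s for s in specialties if s.n_questions >= min_questions]
    return {
        "correlation": pearson([s.poc_share for s in shown],
                               [s.oe_win_rate for s in shown]),
        "specialties_shown": len(shown),
        "questions_covered": sum(s.n_questions for s in shown),
    }


# Invented data, NOT from Real-POCQi: two tiny specialties sit at the extremes.
demo = [
    Specialty("A", 3, 0.9, 0.40),
    Specialty("B", 20, 0.5, 0.76),
    Specialty("C", 40, 0.7, 0.76),
    Specialty("D", 2, 0.2, 1.00),
    Specialty("E", 30, 0.6, 0.74),
]

for floor in (1, 15):
    print(floor, summarize(demo, floor))
```

In this made-up data the correlation is strongly negative with every specialty included and close to zero once specialties with fewer than 15 questions are dropped, which is the pattern the slider is meant to expose.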

The matchup

Overall win rate, all 5,780 ratings

OpenEvidence wins nearly three of four head-to-heads — a real result, blind-graded by specialty-matched physicians. The question this interactive asks is not whether it wins, but what the contest actually measured.
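For a rough sense of how tightly 5,780 ratings pin down a 74.9% win rate, here is a sketch of a 95% Wilson score interval. It assumes every rating is an independent trial, which is almost certainly not true here (the same questions and graders contribute multiple ratings), so the real uncertainty may be wider. The function name `wilson_interval` is ours, not part of the benchmark.

```python
from math import sqrt


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Headline figures from the page: 74.9% win rate over 5,780 ratings.
# Assumption: ratings are treated as independent, which ignores clustering.
low, high = wilson_interval(0.749, 5780)
print(f"{low:.3f} to {high:.3f}")
```

Under that independence assumption the interval spans only a couple of percentage points, consistent with the page's point that the headline win is real and the open question is what it measures.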

The catch

Point-of-care is two halves. The benchmark tests one.

Every Real-POCQi question is a clean paragraph — a tidy vignette with the salient labs already pulled and the noise already stripped. Real point-of-care doesn't arrive that way. To see the gap, we ran a realistic case against OpenEvidence itself (the benchmark's winner), logged in, on July 2.

The reality · what's actually in front of you

A 62-year-old kidney-transplant patient rolls into the ED febrile. The chart has 1,200+ notes and 1,000+ prior labs: an old ESBL E. coli isolate buried in a culture from 8 months ago, a tacrolimus trend across dozens of draws, a shifting creatinine baseline. Nothing is labeled. The clock is running.

↓ a human reads the chart and distills it ↓

The question · after a clinician does the hard part

"62M, transplant on tacro/MMF/pred, 2 days fever, dysuria, Cr 1.4→2.4, tacro trough 11.8, prior ESBL E. coli, WBC 14.2, lactate 1.9, UA +LE/+nitrites. ED workup, empiric antibiotics, adjust immunosuppression?" — this is what a benchmark question looks like. The synthesis already happened.

↓ now the tool answers ↓

OpenEvidence · live, July 2, 2026

It answered well. It called the case a "complicated UTI (likely transplant pyelonephritis)," flagged that the ESBL history points to an IV carbapenem for empiric therapy, recommended urine and blood cultures plus a renal allograft ultrasound, and caught the supratherapeutic tacrolimus trough (11.8) as needing a dose reduction. Guideline-concordant, fast, well-cited.

OpenEvidence earned its win on the half it was handed. But the risky, time-consuming half of point-of-care is building that paragraph out of 1,200 documents: knowing that the 8-month-old ESBL isolate matters, that 1.4 is the real creatinine baseline, that a trough of 11.8 is high for this patient. Real-POCQi never tests that half, and the tool does not do it either. "Beats physicians 75% of the time" is a claim about answering a question that a clinician already did the hard work to ask.

How bedside is it, really?

60% point-of-care — but read the fine print

Of 620 questions, 370 (60%) are genuine clinical decisions, 236 (38%) are ambiguous (clinical facts phrased as "what is…"), and 14 (2%) are pure literature recall (trial enrollment %, mechanism of action). Even the point-of-care 60% are single-turn text with no attached record — the cleanest possible version of a messy job.
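The breakdown above can be checked with simple arithmetic. In this short sketch the counts come from the text, while the dictionary keys are our own shorthand for the three categories.

```python
# Counts as stated in the text; the key names are illustrative labels.
counts = {
    "genuine clinical decision": 370,
    "ambiguous": 236,
    "pure literature recall": 14,
}

total = sum(counts.values())
shares = {label: round(100 * n / total) for label, n in counts.items()}

print(total)
for label, pct in shares.items():
    print(f"{label}: {counts[label]} of {total} ({pct}%)")
```

The three counts add up to the 620 questions, and the rounded shares match the 60%, 38%, and 2% quoted above.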