clinicians.dev · July 2, 2026

The Citation Effect

Same answer, different score. When physicians could see citations, OpenEvidence jumped from 71% to 83%. Gemini 3.1 Pro and Claude Opus 4.8 got worse, and GPT-5.5 barely moved. Explore the finding that upends how we evaluate clinical AI.

Data: jjfenglab/Real-POCQi · Paper: arXiv 2606.28960

Two ways to read the same answer

149 physicians graded head-to-head comparisons of four AI models answering 620 real clinical questions. Each pair was scored in two render modes:

Text only — the answer, nothing else. 3,945 ratings.
With citations — the answer plus source citations. 1,835 ratings.

Same question. Same answer text. The only difference: whether the physician could see where the model got its information. What happened next was not what you'd expect.
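To make the setup concrete, here is a minimal sketch of how a per-model win rate could be computed separately for each render mode. The record fields (`model_a`, `model_b`, `winner`, `mode`) and the toy ratings are hypothetical and are not the Real-POCQi schema. The sketch also assumes a win rate is simply wins divided by comparisons, with a tie counting as a non-win for both sides; the study may define it differently.

```python
from collections import defaultdict


def win_rates(ratings):
    """Return {(model, mode): win rate} from pairwise ratings.

    Assumption: win rate = wins / comparisons the model took part in.
    A rating whose winner is neither model (e.g. a tie) counts as a
    comparison but not as a win for either side.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for r in ratings:
        for model in (r["model_a"], r["model_b"]):
            key = (model, r["mode"])
            games[key] += 1
            if r["winner"] == model:
                wins[key] += 1
    return {key: wins[key] / games[key] for key in games}


# Toy ratings (invented for illustration, not real study data).
toy = [
    {"model_a": "A", "model_b": "B", "winner": "A", "mode": "text"},
    {"model_a": "A", "model_b": "B", "winner": "B", "mode": "text"},
    {"model_a": "A", "model_b": "B", "winner": "A", "mode": "citations"},
]
rates = win_rates(toy)
```

Keying on `(model, mode)` keeps the two render modes from being pooled, which is the whole point of the comparison.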

Win rates with and without citations

How much did citations matter?

+12.0pp
OpenEvidence win rate change with citations

OpenEvidence was the only model with a clear gain from showing citations. Its win rate jumped from 70.7% to 82.6%, a boost of about 12 percentage points. The specialized clinical tool had real sources to show, and physicians appear to have rewarded them.

-12.8pp
Gemini 3.1 Pro win rate change with citations

Gemini 3.1 Pro dropped from 48.1% to 35.3% (-12.8pp). Claude Opus 4.8 dropped from 40.0% to 30.9% (-9.1pp). GPT-5.5 barely moved (32.2% → 34.1%). When Gemini and Claude showed citations, physicians preferred them less often. Two possible explanations: the sources were weaker, or seeing them made the answers easier to question.
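The percentage-point changes follow directly from the win rates quoted above. This short sketch recomputes them; the dictionary layout and the `citation_delta` helper are illustrative names, and only the eight win rates come from the article.

```python
# (text-only win rate %, with-citations win rate %), as quoted in the article.
WIN_RATES = {
    "OpenEvidence": (70.7, 82.6),
    "Gemini 3.1 Pro": (48.1, 35.3),
    "Claude Opus 4.8": (40.0, 30.9),
    "GPT-5.5": (32.2, 34.1),
}


def citation_delta(text_only, with_citations):
    """Change in win rate, in percentage points, when citations are shown."""
    return round(with_citations - text_only, 1)


deltas = {model: citation_delta(*rates) for model, rates in WIN_RATES.items()}

for model, delta in deltas.items():
    print(f"{model}: {delta:+.1f}pp")
```

Because the published rates are already rounded to one decimal place, a delta computed this way can differ from one computed on unrounded data by about 0.1pp.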

Evaluation depends on presentation

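If you evaluate clinical AI by answers alone, you're measuring one thing. If you evaluate by answers with citations, you're measuring something else, and the ranking can shift: GPT-5.5 places last on text-only win rate (32.2%) but passes Claude Opus 4.8 once citations are visible (34.1% vs. 30.9%).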
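A quick way to see the ranking shift is to sort the same four models by each mode's win rate. This is a sketch using only the win rates quoted in this article; the `ranking` helper and the data layout are illustrative assumptions.

```python
# (text-only win rate %, with-citations win rate %), as quoted in the article.
WIN_RATES = {
    "OpenEvidence": (70.7, 82.6),
    "Gemini 3.1 Pro": (48.1, 35.3),
    "Claude Opus 4.8": (40.0, 30.9),
    "GPT-5.5": (32.2, 34.1),
}

TEXT_ONLY, WITH_CITATIONS = 0, 1


def ranking(rates, mode_index):
    """Model names ordered from highest to lowest win rate in one mode."""
    ordered = sorted(rates.items(), key=lambda kv: kv[1][mode_index], reverse=True)
    return [model for model, _ in ordered]


text_rank = ranking(WIN_RATES, TEXT_ONLY)
cite_rank = ranking(WIN_RATES, WITH_CITATIONS)

print("Text only:     ", " > ".join(text_rank))
print("With citations:", " > ".join(cite_rank))
```

OpenEvidence and Gemini 3.1 Pro hold first and second place in both modes; the change is at the bottom, where Claude Opus 4.8 and GPT-5.5 swap third and fourth.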

The gap isn't about answer quality: the answer text is identical in both modes. It's about trust architecture, meaning how well a tool's sources hold up once a physician can inspect them. OpenEvidence was built to cite and verify. The general models were not. When you make citations visible, the tool that was engineered for source quality gets rewarded, and the tools that weren't can get penalized.

An AI that gives the right answer but can't show its work may score fine on a blind test — and lose the moment a clinician can see the receipts.

Win rate by specialty (text + citations combined)