Same answer, different score. When physicians could see citations, OpenEvidence's win rate jumped from 71% to 83%. Gemini and Claude Opus 4.8 got worse, and GPT-5.5 barely moved. Explore the finding that upends how we evaluate clinical AI.
Data: jjfenglab/Real-POCQi · Paper: arXiv 2606.28960
149 physicians graded head-to-head comparisons of four AI models answering 620 real clinical questions. Each pair was scored in two render modes: once with the models' citations hidden and once with them visible.
Same question. Same answer text. The only difference: whether the physician could see where the model got its information. What happened next was not what you'd expect.
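As a rough illustration of how a per-mode win rate could be tallied from pairwise judgments, here is a minimal sketch. The record layout, the model names, and the rows themselves are made-up toy values, not the schema or contents of the Real-POCQi dataset, and ties are assumed not to occur.

```python
from collections import defaultdict

# Hypothetical record layout: (model_a, model_b, render_mode, winner).
# These rows are toy data for illustration only, not real study results.
toy_comparisons = [
    ("model_x", "model_y", "hidden", "model_x"),
    ("model_x", "model_y", "visible", "model_y"),
    ("model_x", "model_z", "hidden", "model_x"),
    ("model_x", "model_z", "visible", "model_x"),
    ("model_y", "model_z", "hidden", "model_z"),
    ("model_y", "model_z", "visible", "model_y"),
]


def win_rates(comparisons):
    """Return {(model, mode): share of that model's comparisons it won}."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for model_a, model_b, mode, winner in comparisons:
        # Both models in the pair played one comparison in this mode.
        for model in (model_a, model_b):
            games[(model, mode)] += 1
        wins[(winner, mode)] += 1
    return {key: wins[key] / games[key] for key in games}


rates = win_rates(toy_comparisons)
for (model, mode), rate in sorted(rates.items()):
    print(f"{model} [{mode}]: {rate:.0%}")
```

Keeping the render mode in the key is what lets the same pair of answers produce two different scores.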
OpenEvidence was the only model that clearly benefited from showing citations. Its win rate jumped from 70.7% to 82.6%, a gain of about 12 percentage points. The specialized clinical tool had real sources to show, and physicians trusted them.
Gemini dropped from 48.1% to 35.3%. Claude Opus 4.8 dropped from 40.0% to 30.9%. GPT-5.5 barely moved (32.2% → 34.1%). When Gemini and Claude showed citations, physicians liked their answers less: either the sources were weaker, or seeing them made the answers easier to question.
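The percentage-point changes can be checked directly from the win rates quoted above. This is a small sketch of that arithmetic; the variable and function names are illustrative, and the only inputs are the figures reported in the text.

```python
# Win rates (%) as quoted in the text: (citations hidden, citations visible).
reported = {
    "OpenEvidence": (70.7, 82.6),
    "Gemini": (48.1, 35.3),
    "Claude Opus 4.8": (40.0, 30.9),
    "GPT-5.5": (32.2, 34.1),
}


def point_change(hidden, visible):
    """Change in win rate, in percentage points, rounded to one decimal."""
    return round(visible - hidden, 1)


changes = {model: point_change(*pair) for model, pair in reported.items()}

for model, delta in changes.items():
    print(f"{model}: {delta:+.1f} points")
```

Only OpenEvidence and GPT-5.5 come out positive, and GPT-5.5's change is under two points.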
If you evaluate clinical AI by answers alone, you're measuring one thing. If you evaluate by answers with citations, you're measuring something else, and the ranking shifts: by win rate, GPT-5.5 moves from last place to third, ahead of Claude Opus 4.8.
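To see how the ordering depends on the render mode, the quoted win rates can be sorted once per mode. This is a minimal sketch using only the figures given in the text; ranking models by raw win rate is an assumption made here for illustration and may not match the paper's own ranking method.

```python
# Win rates (%) as quoted in the text: (citations hidden, citations visible).
reported = {
    "OpenEvidence": (70.7, 82.6),
    "Gemini": (48.1, 35.3),
    "Claude Opus 4.8": (40.0, 30.9),
    "GPT-5.5": (32.2, 34.1),
}


def ranking(mode_index):
    """Models ordered from highest to lowest win rate in one render mode."""
    return sorted(reported, key=lambda m: reported[m][mode_index], reverse=True)


hidden_rank = ranking(0)   # citations hidden
visible_rank = ranking(1)  # citations visible

# Models whose position differs between the two modes.
moved = [m for m in reported if hidden_rank.index(m) != visible_rank.index(m)]

print("hidden: ", hidden_rank)
print("visible:", visible_rank)
print("moved:  ", moved)
```

On these numbers the top two places hold steady while the bottom two swap.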
The gap between the two modes isn't about answer quality, because the answer text is identical in both. It's about trust architecture. OpenEvidence was built to cite and verify. The general models were not. When you make citations visible, the tool that was engineered for source quality gets rewarded, and the tools that weren't tend to get penalized.