clinicians.build · June 30, 2026

The Readiness Gap Simulator

Five frontier models all ace the medical benchmark. Then add real-world noise — a rephrase, an irrelevant detail, a shuffled answer order — and watch the leaderboard fall apart.

Companion to today's newsletter · Patterned after Nature Medicine robustness work

Apply a stress test
Each toggle perturbs the question without changing the medicine. Watch the red marker — the model's stressed score — drop away from its pristine benchmark score.
[Interactive chart: accuracy on USMLE-style questions, benchmark vs. under stress, with callouts for the benchmark leader and the model most robust under stress]
Toggle a stress test above to begin. Right now every model is sitting on its benchmark score — the number that lands it in the press release.
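For readers who want to try the idea outside the simulator, here is a minimal sketch of two of the perturbations named above: shuffling the answer order and appending an irrelevant detail. The `Question` type, the function names, and the sample question are all assumptions made for illustration; this is not the simulator's own code.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one multiple-choice item."""
    stem: str
    options: Tuple[str, ...]
    answer_index: int  # position of the correct option


def shuffle_options(q: Question, seed: int) -> Question:
    """Reorder the options while tracking where the correct one lands."""
    rng = random.Random(seed)
    order = list(range(len(q.options)))
    rng.shuffle(order)
    new_options = tuple(q.options[i] for i in order)
    # The correct option now sits wherever its old index ended up.
    new_answer = order.index(q.answer_index)
    return Question(q.stem, new_options, new_answer)


def add_irrelevant_detail(q: Question, detail: str) -> Question:
    """Append a clinically irrelevant sentence; the answer is unchanged."""
    return Question(f"{q.stem} {detail}", q.options, q.answer_index)


if __name__ == "__main__":
    # Illustrative item only, not drawn from any benchmark.
    q = Question(
        stem="Deficiency of which vitamin causes scurvy?",
        options=("Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"),
        answer_index=1,
    )
    stressed = add_irrelevant_detail(
        shuffle_options(q, seed=7),
        "The patient mentions owning two cats.",
    )
    print(stressed.stem)
    print(stressed.options, "->", stressed.options[stressed.answer_index])
```

The point of the design is that both perturbations leave the correct answer text untouched, so any drop in accuracy comes from the model and not from the question.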

The discovery: the model that tops the benchmark is rarely the one left standing under stress. Rankings reshuffle, and the gap is widest for the models that were "best" on paper — including the med-tuned specialist that overfit the test.
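The reshuffle described above can be sketched as a comparison of two rankings. This is a rough illustration under stated assumptions: the `Scores` type and function names are hypothetical, "most robust" is taken to mean the highest stressed accuracy (smallest gap would be another reasonable definition), and the numbers are placeholders, not measured results.

```python
from typing import Dict, NamedTuple, Tuple


class Scores(NamedTuple):
    """Hypothetical pair of accuracies for one model, each in [0, 1]."""
    benchmark: float
    stressed: float


def readiness_gap(scores: Dict[str, Scores]) -> Dict[str, float]:
    """Benchmark accuracy minus stressed accuracy, per model."""
    return {name: s.benchmark - s.stressed for name, s in scores.items()}


def leaders(scores: Dict[str, Scores]) -> Tuple[str, str]:
    """Return (benchmark leader, model with the highest stressed score)."""
    benchmark_leader = max(scores, key=lambda n: scores[n].benchmark)
    most_robust = max(scores, key=lambda n: scores[n].stressed)
    return benchmark_leader, most_robust


if __name__ == "__main__":
    # Placeholder values for illustration only; not real model results.
    example = {
        "model_a": Scores(benchmark=0.90, stressed=0.60),
        "model_b": Scores(benchmark=0.85, stressed=0.75),
    }
    print(leaders(example))
    print(readiness_gap(example))
```

When the two names returned by `leaders` differ, the leaderboard has reshuffled, which is the situation the simulator is built to show.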

Read today's newsletter →