Five frontier models all ace the medical benchmark. Then add real-world noise — a rephrase, an irrelevant detail, a shuffled answer order — and watch the leaderboard fall apart.
Companion to today's newsletter · Patterned after Nature Medicine robustness work
The discovery: the model that tops the benchmark is rarely the one left standing under stress. Rankings reshuffle, and the gap between clean and stressed scores is widest for the models that were "best" on paper, including the med-tuned specialist that overfit the test.
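For readers who want to try this at home, here is a minimal sketch of two of the perturbations named above (an irrelevant detail and a shuffled answer order), plus a helper for comparing rankings. It is an illustration only, not the method used in the newsletter or in the Nature Medicine work. The function names, the toy question, and the distractor sentence are all assumptions made for this example; the scores you feed to `rank` would come from your own evaluation runs.

```python
import random

LETTERS = "ABCDE"  # supports up to five answer options


def shuffle_options(options, answer_index, rng):
    """Shuffle the answer options and return (shuffled, new_answer_index).

    The correct option is tracked through the shuffle so the item
    can still be graded afterwards.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)


def add_irrelevant_detail(stem, detail):
    """Append a clinically irrelevant sentence to the question stem."""
    return f"{stem} {detail}"


def format_item(stem, options):
    """Render a multiple-choice item as a lettered prompt string."""
    lines = [stem] + [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def rank(scores):
    """Return model names ordered from highest to lowest accuracy."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy item, chosen only to show the mechanics.
    stem = "Which vitamin deficiency causes scurvy?"
    options = ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"]
    answer_index = 1

    rng = random.Random(0)  # fixed seed so the perturbation is reproducible
    noisy_stem = add_irrelevant_detail(
        stem, "The patient mentions that they recently adopted a dog."
    )
    noisy_options, noisy_answer = shuffle_options(options, answer_index, rng)

    print(format_item(noisy_stem, noisy_options))
    print("Correct letter:", LETTERS[noisy_answer])
```

Run each model on both the clean and the perturbed items, then compare `rank(clean_scores)` with `rank(perturbed_scores)` to see whether the order holds.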
Read today's newsletter →