One patient on Camila (progestin-only). Prior notes say she was on Junel Fe (a combined pill that contains estrogen). 120 LLM runs across 3 models. Same question, every time. Watch them disagree.
Experiment companion to clinicians.dev · All responses uncensored, uncurated
glm-5.2 is rock-solid: 39/40 runs correctly identified Camila as progestin-only and ruled PERC negative. The one non-perfect run still got the right answer, just didn't explicitly say "no estrogen."
gemma4 never said PERC positive, but ~60% of runs didn't explicitly address the estrogen/progestin distinction. It gets the right answer but can't always explain why.
minimax is genuinely inconsistent. At temp=0.7, 9/20 runs were confused by the Junel Fe history, and 5/20 said PERC negative but reasoned that estrogen was present, a dangerous "right answer, wrong reason" pattern. Same prompt, same seed, same patient, yet different clinical reasoning every time.
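
The categories used above (right answer with the right reason, right answer with no explanation, right answer for the wrong reason) could be tallied with something like the sketch below. This is only an illustration, not the experiment's actual scoring code: the `Run` record, its field names, and the label strings are all hypothetical, and the sample runs are made up to exercise each branch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Run:
    """One model response, reduced to two hand-labeled fields (hypothetical schema)."""
    perc_negative: bool            # did the run conclude PERC negative?
    estrogen_claim: Optional[bool]  # True: said estrogen present, False: said no estrogen, None: not addressed


def label(run: Run) -> str:
    """Map a run to one of the categories described in the summary above."""
    if not run.perc_negative:
        return "wrong answer"
    if run.estrogen_claim is None:
        return "right answer, no explanation"
    if run.estrogen_claim:
        return "right answer, wrong reason"
    return "right answer, right reason"


def tally(runs: Iterable[Run]) -> Counter:
    """Count how many runs fall into each category."""
    return Counter(label(r) for r in runs)


# Made-up sample, one run per category; not data from the experiment.
sample = [
    Run(perc_negative=True, estrogen_claim=False),
    Run(perc_negative=True, estrogen_claim=None),
    Run(perc_negative=True, estrogen_claim=True),
    Run(perc_negative=False, estrogen_claim=True),
]
counts = tally(sample)
```

Keeping the verdict and the stated reasoning as separate fields is the point of the design: a run can only be flagged as "right answer, wrong reason" if the two are scored independently.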