clinicians.dev · June 30, 2026

The PERC Consistency Test

One patient on Camila (progestin-only). Prior notes say she was on Junel Fe (estrogen-containing). 120 LLM runs across 3 models. Same question, every time. Watch them disagree.

Experiment companion to clinicians.dev · All responses uncensored, uncurated

📋 The Case & Prompt

⚠ The Trap: Camila = norethindrone 0.35 mg = progestin-only (no estrogen). The PERC criterion "no estrogen-containing OCP" is SATISFIED → patient should be PERC negative. But prior visit notes show Junel Fe (which does contain ethinyl estradiol). Models that latch onto the old medication get confused. Background noise: 6 other ER patients in the tracker, prior vitals, labs, all irrelevant to the PERC question.
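
The trap comes down to which medication list feeds the estrogen criterion. The sketch below is a minimal illustration, not the harness behind this experiment: `FORMULARY`, `ESTROGENS`, `contains_estrogen`, and `estrogen_criterion_met` are hypothetical names, and the two-entry formulary covers only the drugs named in this case.

```python
# Hypothetical mini-formulary: brand name -> hormonal ingredients.
# Only the two products from this case are listed.
FORMULARY = {
    "camila": ["norethindrone"],                                 # progestin-only pill
    "junel fe": ["norethindrone acetate", "ethinyl estradiol"],  # combined pill
}

# Illustrative, deliberately incomplete set of estrogen ingredients.
ESTROGENS = {"ethinyl estradiol", "estradiol", "conjugated estrogens"}


def contains_estrogen(medication: str) -> bool:
    """Return True if the named product lists an estrogen ingredient.

    Raises KeyError for products missing from the toy formulary,
    rather than silently treating unknown drugs as estrogen-free.
    """
    ingredients = FORMULARY[medication.strip().lower()]
    return any(ingredient in ESTROGENS for ingredient in ingredients)


def estrogen_criterion_met(current_meds: list[str]) -> bool:
    """The estrogen criterion is met when no CURRENT medication contains estrogen."""
    return not any(contains_estrogen(med) for med in current_meds)


current_meds = ["Camila"]    # active medication list
prior_meds = ["Junel Fe"]    # historical only; included to show the trap

# Correct reading: evaluate the current list only.
print(estrogen_criterion_met(current_meds))
# The failure mode: letting the prior medication leak into the check.
print(estrogen_criterion_met(current_meds + prior_meds))
```

The first call evaluates only the active list and reports the criterion as met; the second mixes in the historical entry and flips the result, which mirrors the estrogen-confused pattern.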
📊 Results by Model × Temperature
Each cell = one run. Colors show classification:
Green = fully correct (PERC neg + identified no estrogen)
Blue = right answer, wrong reason (PERC neg but confused about estrogen)
Yellow = estrogen-confused (latched onto Junel Fe)
Red = wrong answer (PERC positive)
Gray = other / unclear

Key Findings

glm-5.2 @ temp=0: 19/20 fully correct (95%)
glm-5.2 @ temp=0.7: 20/20 fully correct (100%)
gemma4 @ temp=0: 10/20 fully correct (50%)
gemma4 @ temp=0.7: 7/20 fully correct (35%)
minimax @ temp=0: 8/20 correct (40%), 7 confused, 3 right answer/wrong reason
minimax @ temp=0.7: 5/20 correct (25%), 9 confused, 5 right answer/wrong reason
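
The percentages above follow directly from the run counts. As a quick arithmetic check, this sketch recomputes each rate and pools the two temperatures per model; the `RESULTS` layout and the `pooled_by_model` helper are illustrative assumptions, and only the fully-correct counts listed above are used.

```python
# (model, temperature) -> (fully correct runs, total runs), as listed above.
RESULTS = {
    ("glm-5.2", 0.0): (19, 20),
    ("glm-5.2", 0.7): (20, 20),
    ("gemma4", 0.0): (10, 20),
    ("gemma4", 0.7): (7, 20),
    ("minimax", 0.0): (8, 20),
    ("minimax", 0.7): (5, 20),
}


def pooled_by_model(results):
    """Sum correct and total runs across temperatures for each model."""
    pooled = {}
    for (model, _temperature), (correct, total) in results.items():
        prev_correct, prev_total = pooled.get(model, (0, 0))
        pooled[model] = (prev_correct + correct, prev_total + total)
    return pooled


# Per-cell rates.
for (model, temperature), (correct, total) in RESULTS.items():
    print(f"{model} @ temp={temperature}: {correct}/{total} = {correct / total:.0%}")

# Rates pooled over both temperatures.
for model, (correct, total) in pooled_by_model(RESULTS).items():
    print(f"{model} pooled: {correct}/{total} = {correct / total:.1%}")
```

Pooling is a simple sum here because every cell has the same number of runs; with uneven cell sizes a pooled rate would weight the larger cell more heavily.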

glm-5.2 is rock-solid: 39/40 runs correctly identified Camila as progestin-only and ruled PERC negative. The one non-perfect run still got the right answer; it just didn't explicitly say "no estrogen."

gemma4 never said PERC positive, but ~60% of runs didn't explicitly address the estrogen/progestin distinction. It gets the right answer but can't always explain why.

minimax is genuinely inconsistent. At temp=0.7, 9/20 runs were confused by the Junel Fe history, and 5/20 said PERC negative but reasoned that estrogen was present, a dangerous "right answer, wrong reason" pattern. Same prompt, same seed, same patient, yet different clinical reasoning from run to run.
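
A consistency test of this shape is, at its core, a loop of identical calls followed by a tally of labels. The sketch below shows that structure only, under stated assumptions: `ask` and `classify` are hypothetical placeholder callables (no real model API is called), the stub responses are invented for the demo, and the keyword classifier is far cruder than the five-way classification used on this page.

```python
import itertools
from collections import Counter
from typing import Callable


def consistency_tally(
    ask: Callable[[str], str],
    classify: Callable[[str], str],
    prompt: str,
    n_runs: int = 20,
) -> Counter:
    """Send the same prompt n_runs times and count the resulting labels."""
    labels = Counter()
    for _ in range(n_runs):
        response = ask(prompt)
        labels[classify(response)] += 1
    return labels


def modal_share(tally: Counter) -> float:
    """Fraction of runs that landed in the most common label."""
    total = sum(tally.values())
    return tally.most_common(1)[0][1] / total


# Stub model for the demo: alternates between two canned answers.
_canned = itertools.cycle([
    "PERC negative: Camila is progestin-only, no estrogen.",
    "PERC positive: patient takes Junel Fe (estrogen).",
])


def stub_ask(prompt: str) -> str:
    return next(_canned)


def stub_classify(response: str) -> str:
    # Toy rule for illustration only.
    return "correct" if "PERC negative" in response else "wrong"


tally = consistency_tally(stub_ask, stub_classify, "Is this patient PERC negative?")
print(tally, modal_share(tally))
```

A modal share near 1.0 means the model gives the same class of answer on almost every run; lower values indicate the run-to-run disagreement described above.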