GPT is impressive with a consistent 0% false positive rate across models, yet it...

stared · 2026-02-25T21:44:15 1772055855

I reran it with the "high" and "xhigh" effort settings, and GPT-5.2-Codex still gets a 0% false positive rate, while matching the other best models at localizing backdoors: https://quesma.com/benchmarks/binaryaudit/

sdenton4 · 2026-02-22T17:09:11 1771780151

It would be really cool if someone developed some standard language and methodology for measuring the success of binary classification tasks...

Oh, wait, we have had that for a hundred years - somehow it's just entirely forgotten when generative models are involved.
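
For anyone who wants the standard vocabulary spelled out, here is a minimal sketch of the usual confusion-matrix metrics. The class name, helper, and the example counts are all made up for illustration and are not taken from the benchmark; the point is only that a 0% false positive rate says little on its own without recall next to it.

```python
from dataclasses import dataclass


def _safe_div(num: int, den: int) -> float:
    # Returns 0.0 when the denominator is zero (a convention chosen for this sketch).
    return num / den if den else 0.0


@dataclass
class Confusion:
    tp: int  # backdoored binaries correctly flagged
    fp: int  # clean binaries wrongly flagged
    tn: int  # clean binaries correctly passed
    fn: int  # backdoored binaries missed

    def false_positive_rate(self) -> float:
        return _safe_div(self.fp, self.fp + self.tn)

    def recall(self) -> float:
        return _safe_div(self.tp, self.tp + self.fn)

    def precision(self) -> float:
        return _safe_div(self.tp, self.tp + self.fp)


# Hypothetical detector that never flags anything: perfect FPR, useless recall.
silent = Confusion(tp=0, fp=0, tn=50, fn=50)
# Hypothetical detector that flags some backdoors and raises no false alarms.
useful = Confusion(tp=20, fp=0, tn=50, fn=30)

for name, c in [("silent", silent), ("useful", useful)]:
    print(name, c.false_positive_rate(), c.recall(), c.precision())
```

Both hypothetical detectors have a 0% false positive rate, so only recall and precision tell them apart.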