Agora

How to read any AI benchmark claim (before you believe it)

Most viral 'beats GPT-5.5' claims are technically true and still misleading. Here's the one-minute check.

Every week a new model claims to "beat GPT-5.5" or "surpass Claude Opus" on some benchmark. Most of these claims are technically true and still misleading. Here's why, and how to check for yourself in under a minute.

There are three kinds of benchmark number, and they get quoted as if they're the same thing:

1. Independently corroborated: measured by a third party, same conditions for every model. Rare, and the gold standard.
2. Third-party measured, with a caveat: an outside group ran it, but the methodology differs from other entries (different scaffold, different sample count, different tool access).
3. Vendor-stated, no outside replication: the company that made the model tells you how good it is. Common. Often true. Never neutral.

Most viral posts don't say which bucket their number came from. That's the whole trick.
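The three buckets reduce to two questions: who ran the evaluation, and did every model get the same conditions? Here is a minimal sketch of that check in Python. The names `Provenance`, `BenchmarkClaim`, and `classify` are hypothetical, made up for illustration, and the two yes/no fields are a simplification of what you would actually have to dig out of a release post.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    """The three kinds of benchmark number described above."""
    INDEPENDENT = "independently corroborated"
    THIRD_PARTY_CAVEAT = "third-party measured, with a caveat"
    VENDOR_STATED = "vendor-stated, no outside replication"


@dataclass
class BenchmarkClaim:
    # Hypothetical fields: did the model's own maker run the evaluation?
    measured_by_vendor: bool
    # Same scaffold, sample count, and tool access as the other entries?
    same_conditions_as_others: bool


def classify(claim: BenchmarkClaim) -> Provenance:
    """Map a claim to one of the three buckets."""
    if claim.measured_by_vendor:
        # No outside replication, however careful the vendor was.
        return Provenance.VENDOR_STATED
    if claim.same_conditions_as_others:
        return Provenance.INDEPENDENT
    return Provenance.THIRD_PARTY_CAVEAT


# Example: an outside group ran it, but with a different scaffold.
bucket = classify(BenchmarkClaim(measured_by_vendor=False,
                                 same_conditions_as_others=False))
print(bucket.value)
```

If a post doesn't give you enough to answer both questions, that is itself a signal: treat the number as vendor-stated until shown otherwise.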

The mechanism, in one sentence: vendors run their own tuned scaffolding (custom prompts, retry logic, tool access) when reporting their own scores, and a model under a tuned scaffold often scores 15-30 points higher than the same model does on a standardized, identical-conditions leaderboard. Both numbers are "real"; they're just not measuring the same thing.

We checked six recent claims against this framework. Here's what we found:

- A vendor claimed 70%+ on a coding benchmark. The identically-scaffolded independent leaderboard put the same model at 38.7%, a 30+ point gap from harness tuning alone.
- A "beats Claude Opus" headline turned out to compare against a number the vendor's own release didn't even report; the independent leaderboard showed Opus ahead by 7 points.
- A "13 of 16 benchmarks won" claim was true for the benchmarks that got published, but one competitor only had scores published for 2 of the 16, making the sweep look more complete than the data supports.
- A "10 million token context window" is the architectural spec. What you can actually use in production: 128k-328k tokens, with quality dropping well before that for anything beyond simple lookup.
- MMLU, the most-cited general-knowledge benchmark, is estimated ~29% contaminated, meaning some models may have seen the test questions during training. One model dropped 13 points on a decontaminated retest.
- UC Berkeley researchers built an agent that scored near-100% on eight major agent benchmarks (SWE-bench, WebArena, GAIA, and others) without solving a single real task. Just exploits: reading answer keys from config files, faking validation checks.
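Two of the checks above are plain arithmetic, and it can help to see them written out. The sketch below uses the figures from the first and third claims; the function names `scaffold_gap` and `coverage` are hypothetical, and 70.0 stands in for the vendor's "70%+", so the gap it prints is a lower bound.

```python
def scaffold_gap(vendor_score: float, independent_score: float) -> float:
    """Points separating a vendor-reported score from an independent one."""
    return round(vendor_score - independent_score, 1)


def coverage(published: int, total: int) -> float:
    """Fraction of the benchmarks in a comparison that a competitor
    actually has published scores for."""
    return published / total


# Claim 1: vendor says 70%+ (70.0 used as a floor), leaderboard says 38.7%.
gap = scaffold_gap(70.0, 38.7)
print(f"gap: at least {gap} points")

# Claim 3: one competitor had published scores on only 2 of the 16 benchmarks.
share = coverage(2, 16)
print(f"competitor coverage: {share:.1%}")
```

A gap this size says more about the harness than the model, and a coverage figure this low means a "13 of 16" sweep was mostly uncontested against that competitor.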

None of this means AI progress is fake. Real models really are getting better, fast. It means the number in the headline is rarely the number that matters, and checking which bucket it came from takes about the same effort as reading the headline itself.
