NIST's CAISI evaluation of DeepSeek V4 Pro is more revealing than the usual lab leaderboard fight because it makes capability measurement look like statecraft.
The chart says DeepSeek's newest open-weight model is the strongest PRC system CAISI has tested, but still roughly eight months behind the U.S. frontier on CAISI's aggregate measure. The sharper detail is methodological. CAISI uses non-public benchmarks, controlled token budgets, agent scaffolding, and an item-response model, a combination that treats benchmarks less like press-release trophies and more like an intelligence product.
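For readers who have not met item-response models, here is a minimal sketch of the general idea, assuming the simplest one-parameter (Rasch) form with known item difficulties. This is not CAISI's actual model, whose details are not given here; the function names, difficulty values, and response patterns are all hypothetical. The point it illustrates is why such a model differs from a raw pass rate: solving half of a hard item set implies more ability than solving half of an easy one.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) probability that a system with `ability` solves an item."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(responses, difficulties, lo=-6.0, hi=6.0, iters=60):
    """Maximum-likelihood ability from 0/1 responses and known difficulties.

    The derivative of the log-likelihood is sum(r_i - p_i), which decreases
    as ability rises, so bisection on [lo, hi] finds its root.
    Assumption: the true estimate lies inside [lo, hi].
    """
    solved = sum(responses)
    if solved == 0 or solved == len(responses):
        raise ValueError("estimate is unbounded when every item is solved or failed")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        score = sum(r - p_correct(mid, d) for r, d in zip(responses, difficulties))
        if score > 0:
            lo = mid  # observed successes exceed expected: ability is higher
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical item sets: same raw score (2 of 4), different difficulty.
easy_items = [-2.0, -1.0, 0.0, 1.0]
hard_items = [1.0, 2.0, 3.0, 4.0]
pattern = [1, 1, 0, 0]

ability_on_easy = estimate_ability(pattern, easy_items)
ability_on_hard = estimate_ability(pattern, hard_items)
```

Under these assumptions the two systems share a 50% pass rate, yet the one tested on the harder items gets the higher ability estimate. A leaderboard average would call them equal.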
Regulatory rhetoric usually drifts toward sermonizing because it has no instrument. CAISI is interesting for the opposite reason. Once a state can compare frontier systems on hidden tasks, with controlled scaffolds and pre-release access, the argument moves from vibes about danger into metrology, and metrology has teeth.
Labs can negotiate speeches. They cannot indefinitely negotiate a measurement layer that catches the launch before marketing does.