Google Gemini 3.1 Pro posts record benchmark scores
Evaluation suites invite contamination and gaming; leaderboard science remains optional
Google’s latest Gemini Pro release is being marketed the way modern AI is marketed: as a string of benchmark trophies, carefully photographed from flattering angles. TechCrunch reports Google has released Gemini 3.1 Pro in preview and touted “record” scores on independent evaluations, including a suite called Humanity’s Last Exam. Mercor CEO Brendan Foody also claimed in a social media post that Gemini 3.1 Pro tops the APEX-Agents leaderboard, a benchmark meant to approximate performance on professional “knowledge work.”
None of this is necessarily false; it’s just incomplete in the way leaderboard culture demands. Benchmarks are not laws of physics. They are software artifacts—prompts, rubrics, datasets, graders—whose incentives are visible to the model builders and whose weaknesses are routinely exploited, sometimes accidentally, often not.
Three technical issues dominate the gap between “record score” and “robust scientific signal.” First is data contamination: if evaluation items (or close paraphrases) leak into training corpora, a model can appear to “reason” while merely recalling. Second is prompt leakage and eval-gaming: when benchmark formats become standardized, models and post-training pipelines can overfit to the scoring procedure rather than the underlying task. Third is shifting baselines: new model versions are frequently compared to predecessors under subtly different inference settings, tool access, or system prompts, turning an “apples to apples” claim into “apples to a fruit salad.”
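The first of those failure modes, data contamination, is at least partially detectable. A common family of checks looks for verbatim n-gram overlap between evaluation items and training documents: if long spans of an eval question (or its answer) appear word-for-word in the corpus, a high score may reflect recall rather than reasoning. A minimal sketch, with illustrative names and thresholds that are assumptions, not any lab's actual pipeline:

```python
# Hypothetical contamination check: fraction of an eval item's n-grams
# that appear verbatim in a training-corpus document. High overlap is a
# red flag, not proof; paraphrase-level leakage would evade this check.

def ngrams(text: str, n: int = 8) -> set[str]:
    """Set of whitespace-token n-grams in a text (case-folded)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(eval_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the eval item's n-grams found verbatim in the corpus doc."""
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)

# Toy example: the corpus document contains the eval item verbatim.
eval_item = "What is the capital of France? The capital of France is Paris."
corpus_doc = ("Trivia dump: What is the capital of France? "
              "The capital of France is Paris.")
print(overlap_score(eval_item, corpus_doc, n=4))  # near 1.0: likely leaked
```

Real contamination audits scale this idea with suffix arrays or Bloom filters over trillion-token corpora, and add fuzzy matching for paraphrases; the principle, measuring verbatim overlap between test and train, is the same.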
TechCrunch notes that onlookers see Gemini 3.1 Pro as a step up from Gemini 3, and that the model wars are accelerating as OpenAI and Anthropic ship competing systems. That competitive tempo is precisely what makes benchmark rigor more important, not less: when releases are monthly, “independent” evaluations can become an informal co-development loop between labs and benchmark authors.
A stricter interpretation framework is available. Pre-registered evaluations would lock metrics and scoring before model release. Held-out private test sets—audited, access-controlled, and periodically refreshed—would reduce contamination and prompt-tuning. And reproducibility would require disclosure of evaluation conditions (system prompts, tool use, sampling parameters) plus meaningful transparency about training data filtering and provenance.
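The disclosure piece is mostly an engineering problem, not a research one. One hedged way to picture it is an evaluation "run card": a structured record pinning every condition under which a score was produced, published alongside the number. The schema below is purely illustrative; no lab publishes this exact format.

```python
# Illustrative sketch of an evaluation run card: the conditions a reported
# benchmark score depends on, captured as a frozen record. Field names and
# values are assumptions for illustration, not any vendor's real schema.
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalRunCard:
    model: str                  # exact model identifier, including version
    benchmark: str              # benchmark name and revision
    system_prompt_sha256: str   # hash pins the prompt without publishing it
    temperature: float          # sampling parameters change scores
    top_p: float
    max_tokens: int
    tools_enabled: bool         # tool access is part of the eval condition
    dataset_version: str        # which (ideally private, refreshed) test split
    score: float

card = EvalRunCard(
    model="example-model-preview",
    benchmark="example-benchmark-v2",
    system_prompt_sha256="0" * 64,   # placeholder hash
    temperature=0.0,
    top_p=1.0,
    max_tokens=2048,
    tools_enabled=False,
    dataset_version="2024-06-private-heldout",
    score=0.713,
)
print(json.dumps(asdict(card), indent=2))
```

Hashing the system prompt rather than publishing it is one way to square verification with the copying objection: a third party rerunning the eval can confirm they were handed the same prompt without the prompt itself becoming public tuning material.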
Companies will object that disclosure invites copying. True. It also invites verification, which is the point. Until then, “record benchmarks—again” is mostly a statement about who is best at the current game, not who built the most generally capable system. Science is what remains when the leaderboard is gone.