When AI Wins Our Tournaments, Merit Becomes a Protocol

Gemini 2.5 Deep Think reached gold-level performance at the 2025 ICPC World Finals after earlier IMO success. The scoreboard has moved. Real value shifts from solving to specification, orchestration, budgets, and auditable rules.

By Talos
Trends and Analysis

Breaking the scoreboards

On September 17, 2025, Google DeepMind disclosed that Gemini 2.5 Deep Think achieved gold-level performance at the International Collegiate Programming Contest World Finals. It solved 10 of 12 problems under the same five-hour rules as the students, and it even cracked one problem no human solved. If there was a single scoreboard moment that signals a regime change, this is it. The first high-stakes claim deserves a primary source, so read DeepMind’s own note on the ICPC gold-level performance.

This was not a bolt from the blue. Two months earlier, an advanced version of the same lineage hit gold-medal scores on International Mathematical Olympiad problems. Coverage summarized how both Google and OpenAI reached that threshold, while noting differences in process and validation. For external context, see Reuters on the IMO gold milestone.

These two events do more than put another trophy on the shelf. They end a culture that treated benchmark wins as the main currency of progress. Once a capable system can already meet the bar for elite contests, the winning edge moves elsewhere. The center of value shifts from solving to specifying, and from isolated speed to orchestrating a system under rules that everyone can trust.

The end of benchmark culture

Benchmarks rose because they were legible. You could tell who solved more problems, faster and with fewer errors. When systems cross human gold-level lines, the scoreboard is no longer the bottleneck. The questions change:

  • Who wrote the problem, and what constraints made it meaningful?
  • What rules kept the playing field fair for mixed human and AI teams?
  • Which disclosures let others reproduce or audit a performance?

Think of a marathon where jetpacks appear. If the course and rules were designed for shoes, we either ban jetpacks or rewrite the race. Banning leaves ability on the table. Rewriting, carefully, lets us measure what now matters. The goal is not to preserve nostalgia; it is to preserve meaning.

From skill to spec: where value moves

In competitions, value used to accrue to the quickest correct solution. With systems that can mass-generate candidate solutions, the scarce asset becomes the specification that channels that capability. Three skills climb the ladder:

  • Problem finding: discovering questions that are worth answering and resilient to shortcutting.
  • Systems taste: deciding which tools to combine, and how to stage a plan so that errors get exposed early.
  • Orchestration: turning a messy workflow into a protocol that a team or an agent can execute under constraints.

Solving remains essential, but its returns compress. Specifying and orchestrating become the leverage points. This complements an earlier theme we explored in model pluralism wins the platform war, where value accrues to the teams that choose and chain tools with judgment.

Protocolized merit

If we want to keep competitions, credentials, and hiring legible in a mixed era, we should treat merit as a protocol. That means the way we decide who wins is spelled out, enforced, and inspectable. Concretely, organizers should introduce three divisions, each with explicit rules:

  • Human-only: no model access, no code autocompletion, no retrieval outside provided materials. Devices are locked down. This preserves a control group that shows what individuals can do unaided.
  • AI-assisted: participants may use registered tools within declared limits. Think of this as the everyday working mode for students and professionals. We score the quality of the orchestration, not only the final answer.
  • AI-open: anything goes within safety and legality. This division rewards world-class system building and auditable engineering, and it will likely produce the frontier records.

Each division has value. Human-only keeps our baselines and helps educators. AI-assisted reflects normal practice and forces us to measure the skill of directing machines. AI-open shows what is possible when we stop pretending the machines are not in the room.
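
To make "spelled out, enforced, and inspectable" concrete, here is a minimal sketch of how an organizer might publish the three divisions as a machine-readable rulebook. The field names and flags are illustrative assumptions, not an existing standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DivisionRules:
    """Declarative rulebook for one competition division (illustrative)."""
    name: str
    model_access: bool              # may participants call AI models at all?
    registered_tools_only: bool     # tools must be declared before the contest
    retrieval_allowed: bool         # retrieval beyond provided materials
    device_lockdown: bool           # organizer-controlled, sealed devices
    scored_on_orchestration: bool   # judges grade the protocol, not only answers

RULEBOOK = {
    "human-only": DivisionRules(
        name="human-only", model_access=False, registered_tools_only=True,
        retrieval_allowed=False, device_lockdown=True,
        scored_on_orchestration=False,
    ),
    "ai-assisted": DivisionRules(
        name="ai-assisted", model_access=True, registered_tools_only=True,
        retrieval_allowed=True, device_lockdown=False,
        scored_on_orchestration=True,
    ),
    "ai-open": DivisionRules(
        name="ai-open", model_access=True, registered_tools_only=False,
        retrieval_allowed=True, device_lockdown=False,
        scored_on_orchestration=True,
    ),
}

if __name__ == "__main__":
    for division in RULEBOOK.values():
        print(division)
```

Publishing something like this alongside the prose rules gives entrants, judges, and auditors a single source of truth to argue about before the contest rather than after it.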

This structure connects to the policy layer that actually governs capability in practice. If you want the rules to be more than vibes, design them like software. We argued in policy stack is AI’s real power layer that institutions win or lose on the clarity of their rulebooks as much as on their models.

Declared thinking budgets

The next pillar is the thinking budget: an upfront declaration of resources you plan to spend in search of a solution. Today this sounds like a technical detail. It is not. It is the backbone of comparability.

Budgets can be specified in several units:

  • Tokens: the number of input and output tokens, including hidden chain-of-thought tokens if used.
  • Tool calls: how many times the system can call a solver, a code interpreter, or a search API.
  • Wall-clock time: the real time you will let the system think or iterate.
  • Energy or compute: an estimate of joules consumed or accelerator time, with coarse bins where exact metering is hard.
  • Memory footprint: the size of external context windows, caches, or retrieval corpora.

A declared budget lets judges make apples-to-apples comparisons. If two entrants solve the same set of problems, the one that did it within a tighter budget demonstrates better systems taste or better orchestration. Over time, we can create quality-adjusted solve rates that factor in both difficulty and resource discipline.
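
As one illustration of how declared budgets feed comparison, the sketch below pairs a declared budget with a logged usage receipt and computes a quality-adjusted score. The discount formula, unit choices, and difficulty weights are assumptions; organizers would publish their own.

```python
from dataclasses import dataclass

@dataclass
class DeclaredBudget:
    """Upfront resource declaration for one entry (units are illustrative)."""
    max_tokens: int
    max_tool_calls: int
    max_wall_clock_s: float

@dataclass
class UsageReceipt:
    """Usage as metered by the sealed runtime."""
    tokens: int
    tool_calls: int
    wall_clock_s: float

def budget_utilization(budget: DeclaredBudget, used: UsageReceipt) -> float:
    """Worst-case fraction of the declared budget consumed."""
    return max(
        used.tokens / budget.max_tokens,
        used.tool_calls / budget.max_tool_calls,
        used.wall_clock_s / budget.max_wall_clock_s,
    )

def quality_adjusted_score(solved_difficulties: list[float],
                           budget: DeclaredBudget,
                           used: UsageReceipt) -> float:
    """Difficulty-weighted solves, discounted by resource consumption.

    Entries that exceed their declared budget score zero; otherwise solving
    the same problems on a tighter budget yields a higher score.
    """
    utilization = budget_utilization(budget, used)
    if utilization > 1.0:
        return 0.0
    raw = sum(solved_difficulties)
    return raw / (0.5 + 0.5 * utilization)  # discount factor is an assumption

# Example: two entries solve the same problems, one with tighter usage.
budget = DeclaredBudget(max_tokens=50_000, max_tool_calls=3, max_wall_clock_s=3600)
frugal = UsageReceipt(tokens=18_000, tool_calls=2, wall_clock_s=1200)
greedy = UsageReceipt(tokens=49_000, tool_calls=3, wall_clock_s=3500)
print(quality_adjusted_score([1.0, 2.5, 4.0], budget, frugal))  # higher score
print(quality_adjusted_score([1.0, 2.5, 4.0], budget, greedy))  # lower score
```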

Budgets also prevent an invisible arms race. If all that matters is who spent more, results drift toward compute maximalism. That drift is costly and hard to audit. It also privileges those with expensive hardware and deep pockets. By treating the budget as a first-class rule, we reward elegance over brute force and make the leaderboard more accessible.

Auditable enforcement, not vibes

How do we enforce this without turning every contest into a surveillance exercise? Use familiar ingredients with a technical twist:

  • Sandbox containers with metered APIs for model calls and tool invocations.
  • Logged traces with cryptographic signatures that hash every request and response.
  • Randomized audits where judges re-run a subset of traces to verify determinism or bounded stochasticity.

None of this is exotic. Cloud providers, offline judges, and organizers can collaborate on sealed runtimes that meter participation without revealing proprietary weights or prompts. Over time, the enforcement layer becomes part of the game’s design, just as anti-doping moved from an afterthought to a standard in sports.
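
The logged-trace idea needs little more than the standard library. Below is a minimal sketch of a hash-chained trace log in which every request and response is hashed and the chain head is attested with an HMAC; a production sealed runtime would use organizer-held signing keys rather than the shared secret assumed here.

```python
import hashlib
import hmac
import json
import time

class TraceLog:
    """Hash-chained log of model and tool calls for post-hoc audit (sketch).

    Each entry commits to the previous entry's digest, so any later edit
    breaks the chain. The HMAC over the head digest stands in for a proper
    signature held by the contest organizer.
    """

    def __init__(self, signing_key: bytes):
        self._key = signing_key
        self._entries: list[dict] = []
        self._head = hashlib.sha256(b"genesis").hexdigest()

    def record(self, kind: str, request: str, response: str) -> dict:
        entry = {
            "ts": time.time(),
            "kind": kind,  # e.g. "model_call" or "tool_call"
            "request_sha256": hashlib.sha256(request.encode()).hexdigest(),
            "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
            "prev": self._head,
        }
        digest = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        entry["digest"] = digest
        self._head = digest
        self._entries.append(entry)
        return entry

    def attestation(self) -> str:
        """Tag the current head so judges can verify the chain was not rewritten."""
        return hmac.new(self._key, self._head.encode(), hashlib.sha256).hexdigest()

# The sealed runtime records every call, then submits the entries plus attestation.
log = TraceLog(signing_key=b"organizer-held-secret")
log.record("model_call", request="solve problem C", response="candidate solution ...")
log.record("tool_call", request="run tests on candidate", response="12/12 passed")
print(log.attestation())
```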

Disclosure that matters

Publication and hiring will need disclosure standards that separate real craftsmanship from demo theater. A minimal competitive disclosure could include:

  • System description: which models, which versions, how they were combined.
  • Prompts and instructions: redacted if needed, but with structure and intent preserved.
  • Retrieval sources: what documents or datasets were accessible at solve time.
  • Search and tool policy: what external calls were permitted, with counts.
  • Safety and guardrails: what content filters or refusal policies were active.
  • Budget receipts: the logged usage against the declared limits.

This creates a shared grammar. Reviewers can replicate runs, hiring managers can evaluate the skill of the orchestrator, and students can learn from worked systems, not only worked solutions.
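
A disclosure like this is easier to audit as a structured manifest than as a free-form appendix. The sketch below encodes the checklist above as required fields plus a trivial validation pass; the schema and field names are assumptions for illustration, not a published format.

```python
import json

REQUIRED_FIELDS = {
    "system_description",  # models, versions, how they were combined
    "prompts",             # structure and intent, redacted where needed
    "retrieval_sources",   # documents or datasets accessible at solve time
    "tool_policy",         # permitted external calls, with counts
    "safety_guardrails",   # active filters or refusal policies
    "budget_receipts",     # logged usage against the declared limits
}

def validate_disclosure(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest is acceptable."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - manifest.keys())]
    receipts = manifest.get("budget_receipts", {})
    if receipts and receipts.get("tokens_used", 0) > receipts.get("tokens_declared", 0):
        problems.append("budget receipts exceed declared token limit")
    return problems

manifest = {
    "system_description": "two frontier models routed by a small planner",
    "prompts": "see prompts/ directory (keys redacted)",
    "retrieval_sources": ["contest-provided PDF pack"],
    "tool_policy": {"code_interpreter": 3, "search": 0},
    "safety_guardrails": "provider defaults",
    "budget_receipts": {"tokens_declared": 50_000, "tokens_used": 18_400},
}
print(json.dumps(validate_disclosure(manifest)))  # [] means ready to submit
```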

Education: teach problem finding and systems taste

If high school and university classrooms keep grading only solo speed, they will produce graduates mismatched to practice. A course refresh for the mixed era could look like this:

  • Problem-spec labs: students receive messy, underspecified prompts. Their task is to rewrite them into unambiguous specifications with test cases and rejection criteria. Grading focuses on specificity and coverage.
  • Orchestration studios: teams build small pipelines that decompose a task into stages, with sanity checks at each step. Marks are awarded for clear interfaces, graceful failure modes, and budget discipline.
  • Mixed rounds: one week is human-only. The next is AI-assisted with declared budgets and full trace submission. Students learn what to do themselves and what to ask a machine to do, and they learn to document the boundary.
  • Reflection briefs: after each assignment, students write a short note on what went wrong and why. The habit of analyzing system failures matters as much as the final grade.

Admissions can follow suit. Alongside test scores, applicants submit protocol portfolios: a small repository that shows a pipeline, its budget, and a replayable trace. This rewards students who can shape tasks and collaborate with tools without punishing those who still shine in human-only settings.

Hiring: new resumes, new tryouts

Hiring has already drifted from puzzles to take-home tasks. The next step is to ask for working protocols. Instead of a whiteboard sprint, candidates could be asked to do the following:

  • Red-team a prompt for leakage and bias, then propose patches and tests.
  • Build a two-stage solver for a real task, with a fifty-thousand-token cap and three tool calls.
  • Explain trade-offs between model size, retrieval depth, and time to first correct answer.

Recruiters then score not just correctness, but orchestration clarity, budget discipline, and disclosure quality. Titles will shift too. Expect protocol designers, evaluation engineers, and tournament operations specialists to appear alongside software engineers and data scientists.
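
For the budgeted exercise above (a fifty-thousand-token cap and three tool calls), the enforcement logic is small enough to ship with the task statement. Here is a minimal budget-guard sketch a candidate's pipeline could run inside; the class name, counting method, and exception type are illustrative assumptions.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a candidate's pipeline exceeds its declared limits."""

class BudgetGuard:
    """Tracks token and tool-call usage against hard caps (illustrative)."""

    def __init__(self, max_tokens: int = 50_000, max_tool_calls: int = 3):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens_used = 0
        self.tool_calls_used = 0

    def charge_tokens(self, n: int) -> None:
        if self.tokens_used + n > self.max_tokens:
            raise BudgetExceeded(f"token cap {self.max_tokens} exceeded")
        self.tokens_used += n

    def charge_tool_call(self) -> None:
        if self.tool_calls_used + 1 > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call cap {self.max_tool_calls} exceeded")
        self.tool_calls_used += 1

    def receipt(self) -> dict:
        return {"tokens_used": self.tokens_used,
                "tool_calls_used": self.tool_calls_used}

# A two-stage solver would call the guard before every model or tool invocation.
guard = BudgetGuard()
guard.charge_tokens(12_000)   # stage 1: draft candidate solutions
guard.charge_tool_call()      # stage 1: run the test harness
guard.charge_tokens(6_500)    # stage 2: repair the failing cases
guard.charge_tool_call()      # stage 2: re-run the test harness
print(guard.receipt())
```

The receipt the guard emits is exactly what the disclosure standards above call a budget receipt, which keeps the hiring exercise and the competition protocol on the same grammar.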

Research: from proofs to protocols

Academic communities that run shared tasks have a head start. They can evolve their venues with three concrete changes:

  • Mixed-division tracks: keep human-only tracks for science, but add AI-assisted and AI-open tracks with separate leaderboards and awards.
  • Budgeted leaderboards: sort results by quality-adjusted solves under declared budgets. Reward elegant methods that do more with less.
  • Artifact and trace review: extend artifact review to include model traces and budget receipts. Reviewers should be able to replay a subset of runs in a sealed environment.

This shifts the incentive from one-off stunts to robust, reproducible systems. It also reduces the arms race to hoard compute, since legibility under constraints is now a first-class achievement.

Designing games for mixed play

We do not need to give up on difficulty. We need to redesign what difficulty means. Some examples of problem classes that stay meaningful in a mixed era:

  • Specification stress tests: problems where poor spec leads to consistent failure, but clear spec unlocks a straightforward solve. The skill is to write the spec.
  • Adversarial triage: sets where some instances are designed to trap common heuristics. The skill is to design workflows that detect traps before committing resources.
  • Limited tools: enforced small model plus limited retrieval, so decomposition and cache design matter more than brute force sampling.

Judging can evolve too. Instead of only counting problems solved, competitions can award points for protocol clarity, budget adherence, and reproducibility. A team that solves fewer problems but with exceptional discipline could beat a team that solves more by burning through uncapped resources.
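
One way to make that trade-off explicit is to publish the scoring rule itself. The weights below are placeholder assumptions; the point is that protocol clarity, budget adherence, and reproducibility earn real points rather than serving as tiebreakers.

```python
def protocol_score(problems_solved: int,
                   protocol_clarity: float,   # judge rating in [0, 1]
                   budget_adherence: float,   # 1.0 = fully within declared budget
                   reproducibility: float) -> float:  # fraction of replayed runs that match
    """Illustrative mixed-division score; weights are assumptions, not a standard."""
    return (
        10.0 * problems_solved
        + 15.0 * protocol_clarity
        + 15.0 * min(budget_adherence, 1.0)
        + 10.0 * reproducibility
    )

# A disciplined team solving fewer problems can outscore an uncapped one.
disciplined = protocol_score(problems_solved=8, protocol_clarity=0.9,
                             budget_adherence=1.0, reproducibility=1.0)
uncapped = protocol_score(problems_solved=10, protocol_clarity=0.4,
                          budget_adherence=0.0, reproducibility=0.5)
print(disciplined, uncapped)  # 118.5 vs 111.0
```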

Integrity without nostalgia

Every new protocol invites new ways to cheat. Assume that, then plan for it. Practical defenses include:

  • Hardware attestation for devices used in AI-assisted rounds, combined with offline sealed models for human-only rounds.
  • Model provenance tags, so unapproved external services cannot be smuggled in.
  • Differential scoring that punishes out-of-division behavior, such as hidden human help in AI-open or unlogged tool calls in AI-assisted.

The goal is not to re-create a museum exhibit of pre-AI labor. The goal is to create games where mixed human and AI skill can be measured cleanly and repeated anywhere.

A new credential stack

When solving is abundant, credentials must shift to what is scarce. Expect new badges to appear on resumes and transcripts:

  • Problem-finding certificates: assessed by blind panels that rate novelty, clarity, and impact of proposed tasks.
  • Systems taste ratings: based on peer review of pipeline design, including fail-safes and monitoring.
  • Orchestration portfolios: live repositories of protocols, traces, and budgets that others can run.

These credentials complement, not replace, existing degrees. They are legible, replayable, and tied more directly to workplace performance than test scores alone.

Budgets meet bandwidth in practice

As organizers harden budgets, the physical layer begins to matter more. If memory, packaging, and interconnects are the real bottlenecks, then elite orchestration turns into smart placement of compute and data. For readers who want the hardware context behind these trade-offs, revisit how the AI arms race is bandwidth. When the budget includes context-window sizes and retrieval depth, teams with strong systems taste will arrange caches and sharding to minimize waste.

What organizers, schools, and companies can do now

Organizers

  • Announce divisions for the next season and publish draft rules for comment.
  • Provide sealed runtimes and budget meters for AI-assisted and AI-open rounds.
  • Require protocol disclosures and publish anonymized traces for a subset of entries.

Schools

  • Add problem-spec labs and orchestration studios to core curricula.
  • Grade budget discipline and trace clarity alongside correctness.
  • Let students submit protocol portfolios for capstone credit.

Companies

  • Shift interviews to budgeted tasks with trace review.
  • Maintain internal leaderboards for AI-assisted workflows with transparent scoring.
  • Recognize protocol designers and evaluation engineers as first-class roles.

These moves also build organizational muscle for a world where assistants collaborate across tools. The practice of composing multiple agents, tools, and data sources will feel familiar if you have spent time with model pluralism wins the platform war. The same orchestration that wins contests will ship products faster and with fewer surprises.

The uncomfortable relief of clarity

For years we argued about whether benchmarks were good or bad. That debate is over. The scoreboard has moved. The most advanced systems can now meet the gold lines our tournaments set for elite humans. That is not a reason to hide the machines. It is a reason to rewrite the rules so humans and AIs can compete, collaborate, and be measured in ways that matter.

Protocolized merit gives us that map. Divisions that reflect reality. Budgets that make comparisons fair. Disclosures that reward craft over theater. The result is a new kind of legibility, one that values the human skills that rise when solving is abundant: finding the right problems, shaping them cleanly, and orchestrating systems that work under pressure.

We do not need to wait. The next season is already scheduled. The machines are already warming up. Change the games, and we will learn more about both them and us. In a world where answers are cheap, the people and teams who design the rules become the ones who set the pace. That is the future hiding in plain sight on the new scoreboards. Merit, at last, becomes a protocol.
