The Supervisor Era: Humans Edit, Models Write the Code
An AI lab reports most team code now comes from models, while platforms ship toolchains for agent workflows. Supervision, specs, evals, and compute budgets become the levers. Here is how to adapt in 90 days.

Breaking: the center of software just moved
This is not a novelty moment. It is a regime shift. Reports this month suggest that for many teams at a leading AI lab, most code is already produced by models. The claim is striking: public comments attributed to Anthropic’s chief executive describe Claude generating about 90 percent of code for many teams, with human engineers focused on editing, supervision, and the hard final miles. See context in this coverage of the 90 percent of code written by AI.
At the same time, OpenAI’s recent DevDay showcased infrastructure that assumes this future. AgentKit consolidates agent building, guardrails, and evaluations. Codex is presented as generally available with enterprise controls and governance. The consistent message on stage and on the product pages was simple: let agents do the work, measure them interactively, and wire them to production safely. Review the highlights in this coverage of OpenAI DevDay, AgentKit, and Codex.
Taken together, these moves mark a shift from authorship to supervision. The teams that professionalize supervision will outpace those that still treat model output as a novelty or a toy.
From artisanal code to managed production
The right analogy is not a faster typist at a typewriter. It is a factory that moved from hand assembly to highly automated lines. Humans still set standards, configure machinery, test outputs, and choose what to build. Throughput comes from well orchestrated systems.
In the Supervisor Era, models are the line workers. Humans are the line leads and quality engineers. The craft of engineering does not vanish. It moves up a level: from writing lines of code to designing specifications, evaluation suites, safety constraints, and orchestration patterns that transform a model from clever autocomplete into a reliable contributor.
For background on why infrastructure becomes the bottleneck, see our analysis of how compute becomes the new utility.
The spec becomes the unit of work
When models write most of the code, the most valuable artifact is not the code. It is the spec that defines the problem, constraints, interfaces, and acceptance criteria. A good spec is a contract that a model can execute against. A weak spec creates loops, flakiness, and post hoc firefighting.
Here is a spec template that increasingly serves as the default unit of work:
- Problem statement: a paragraph capturing user need, not solution desire
- Inputs and outputs: types, units, maximum sizes, error cases, latency targets
- Interfaces: function signatures or API schemas with concrete examples
- Constraints: security, privacy, resource limits, and compliance requirements
- Evaluation plan: both deterministic unit tests and interactive trials
- Rollout and rollback: feature flags, sampling strategy, and data capture
- Traceability: spec author, reviewers, and approved model versions
This is not documentation garnish. It is the primary production input. Code generation becomes a byproduct of specs, tests, and tools. The supervisor’s job starts by shaping the spec and ends when the model output passes the evaluation plan.
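One way to keep specs executable rather than aspirational is to capture them as structured data that humans and agents read from the same file. Here is a minimal sketch in Python; the field names follow the template above, and the example values, identifiers, and email addresses are hypothetical.

```python
from dataclasses import dataclass, field

# A minimal sketch of the spec template as a structured artifact.
# Field names mirror the template above; the example values are hypothetical.

@dataclass
class EvaluationPlan:
    unit_tests: list[str]          # deterministic checks the output must pass
    interactive_trials: list[str]  # scenarios run in the eval harness

@dataclass
class Spec:
    problem_statement: str                 # user need, not solution desire
    inputs_outputs: dict[str, str]         # types, units, sizes, error cases, latency targets
    interfaces: list[str]                  # signatures or API schemas with concrete examples
    constraints: list[str]                 # security, privacy, resource limits, compliance
    evaluation_plan: EvaluationPlan
    rollout_and_rollback: dict[str, str]   # flags, sampling strategy, data capture
    author: str
    reviewers: list[str] = field(default_factory=list)
    approved_model_versions: list[str] = field(default_factory=list)

# Hypothetical example: a rate limiting spec an implementer agent could execute against.
rate_limit_spec = Spec(
    problem_statement="API consumers need predictable throttling under burst traffic.",
    inputs_outputs={"request": "HTTP request metadata", "decision": "allow | reject, p99 < 5 ms"},
    interfaces=["def check(key: str, cost: int = 1) -> bool"],
    constraints=["no personal data in logs", "memory ceiling 256 MB per shard"],
    evaluation_plan=EvaluationPlan(
        unit_tests=["burst of 1000 requests is capped at the configured rate"],
        interactive_trials=["agent recovers when the rate store times out"],
    ),
    rollout_and_rollback={"flag": "rate_limit_v2", "sampling": "5 percent of tenants"},
    author="spec-owner@example.com",
    reviewers=["staff-eng@example.com"],
    approved_model_versions=["implementer-2025-09"],
)
```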
Meet your new org chart: model roles, not just job titles
Traditional org charts map people to teams. The Supervisor Era adds a second map: a model org chart that clarifies which agents own which responsibilities.
A typical model org chart for a product team:
- Implementer: generates new functions and modules from specs with access to relevant repositories and context packs
- Refactorer: reduces duplication, improves clarity, enforces architectural patterns
- Test writer: synthesizes unit tests, property checks, and scenario packs from specs and diffs
- Reviewer: performs static analysis, checks for secret leaks, flags insecure patterns
- Doc writer: summarizes changes into release notes, migration guides, and API examples
- Performance tuner: rewrites hot paths under defined latency and cost ceilings
Each role carries a tool belt: repository access, sandboxed runtimes, code search, vulnerability scanners, and policy guards. Humans remain accountable for outcomes. Day to day throughput, however, lives in these model roles, which means your team now manages an internal marketplace of agent capabilities.
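In practice this marketplace wants a registry: each role declares its tool belt, its approved model versions, and the human accountable for its output. The sketch below is illustrative only; the role names come from the list above, while the tool names, model identifiers, and owners are assumptions.

```python
from dataclasses import dataclass

# A sketch of the model org chart as a registry. Role names come from the list above;
# tool names, model identifiers, and owners are illustrative assumptions.

@dataclass(frozen=True)
class AgentRole:
    name: str
    tools: tuple[str, ...]           # repository access, sandboxes, scanners, policy guards
    allowed_models: tuple[str, ...]  # quarantine anything not on this list
    accountable_human: str           # humans remain accountable for outcomes

MODEL_ORG_CHART = {
    "implementer": AgentRole("implementer", ("repo_read", "repo_write", "sandbox_runtime"),
                             ("model-a-2025-09",), "tech-lead"),
    "test_writer": AgentRole("test_writer", ("repo_read", "sandbox_runtime", "coverage_report"),
                             ("model-a-2025-09",), "qa-lead"),
    "reviewer": AgentRole("reviewer", ("repo_read", "static_analyzer", "secret_scanner"),
                          ("model-b-2025-08",), "security-champion"),
    "refactorer": AgentRole("refactorer", ("repo_read", "repo_write", "code_search"),
                            ("model-a-2025-09",), "architect"),
}

def tool_belt(role: str) -> tuple[str, ...]:
    """What an orchestrator should grant an agent acting in a given role, and nothing more."""
    return MODEL_ORG_CHART[role].tools
```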
For a look at how automation reshapes expectations between people and software, read our essay on the new software social contract.
Compute budgets replace headcount math
Yesterday’s delivery plans centered on headcount. Today’s plans allocate compute budgets to agent workflows. A compute budget caps tokens, calls, or time that a set of agents can spend to reach a goal. It is similar to allocating hours to contractors, only the contractor is a reproducible model with a detailed trace.
Budgeting forces clarity:
- What steps are costed for this piece of work?
- Which steps are single shot and which require loops with feedback?
- What happens if the budget is exceeded? Cut scope, escalate, or revise the spec
Cost controls move from afterthought to first class control surface. Supervisors monitor cost per green build, cost per merged pull request, and cost per point of test coverage. Finance leaders begin to think in terms of marginal compute per unit of business outcome.
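A compute budget can be enforced as a small accounting object that every agent step debits against, with the ledger feeding the trace. The sketch below assumes budgets denominated in tokens and a hypothetical task; budgets in calls or wall clock time work the same way.

```python
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    """Raised when an agent workflow spends past its allocation."""

@dataclass
class ComputeBudget:
    # Denominated in tokens here; calls or wall clock time work the same way.
    task_id: str
    token_cap: int
    spent: int = 0
    ledger: list[tuple[str, int]] = field(default_factory=list)

    def charge(self, step: str, tokens: int) -> None:
        """Debit a step against the budget and record it for the trace."""
        if self.spent + tokens > self.token_cap:
            # Exceeding the cap forces a decision: cut scope, escalate, or revise the spec.
            raise BudgetExceeded(f"{self.task_id}: {step} would exceed cap of {self.token_cap} tokens")
        self.spent += tokens
        self.ledger.append((step, tokens))

    def cost_per_unit(self, units_delivered: int) -> float:
        """Supervisor metric: tokens spent per unit of outcome, for example per merged change."""
        return self.spent / max(units_delivered, 1)

# Hypothetical usage: a budget for one agent generated change.
budget = ComputeBudget(task_id="billing-events-pipeline", token_cap=400_000)
budget.charge("implementer.draft", 120_000)
budget.charge("test_writer.unit_tests", 60_000)
print(budget.cost_per_unit(units_delivered=1))  # tokens per merged change
```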
Interactive evaluations become the scoreboard
Static benchmarks are weak signals for agentic work. The real questions are practical: Can the agent follow a changing spec? Does it recover from tool errors? Does it stop when a guardrail fires? This is why interactive evaluations have moved to the center.
What interactive evals look like in practice:
- A harness spins up a realistic environment with sandboxes, seeded data, and simulated network partitions.
- The agent attempts a task with a trace recorder on.
- A judge process, possibly a model paired with rules, scores the run on completion, safety, and resource use.
- The harness can perturb the environment mid run to test recovery and adherence to policy.
Engineers then use these evals as both gates and feedback loops. Gates decide whether a workflow can ship. Feedback loops refine prompt packs, tools, and spec clarity.
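A minimal version of that loop fits in a few lines. In the sketch below, `run_agent`, `judge`, and `perturb` are placeholders for whatever harness, judge process, and fault injector a team actually wires in; the passing score is an arbitrary assumption.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalRun:
    task: str
    guardrail_hits: int = 0
    tokens_spent: int = 0
    trace: list[str] = field(default_factory=list)  # every step recorded for later audit

def interactive_eval(
    task: str,
    run_agent: Callable[[str, EvalRun], None],   # attempts the task, appends steps to the trace
    judge: Callable[[EvalRun], float],           # scores completion, safety, and resource use
    perturb: Callable[[EvalRun], None],          # injects a mid run fault, e.g. a tool timeout
    passing_score: float = 0.8,
) -> tuple[bool, EvalRun]:
    """Run one interactive evaluation: attempt, perturb, score, gate."""
    run = EvalRun(task=task)
    run_agent(task, run)     # first attempt in a sandboxed environment
    perturb(run)             # mid run perturbation tests recovery and policy adherence
    run_agent(task, run)     # the agent must recover and finish within policy
    score = judge(run)
    run.trace.append(f"judge_score={score:.2f}")
    return score >= passing_score, run
```

The boolean result is the gate; the recorded run is the feedback loop.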
The mindset echoes safety cultures in other industries. We have argued that AI is entering an aviation style safety era. Interactive evals are the flight simulators that make it real.
Guardrails and governance move left
When models contribute code, safety cannot be bolted on at the end. Inputs, outputs, and traces require continuous screening. The guardrails that matter most are simple and explicit:
- Privacy and secrecy: block secrets, personal data, and regulated identifiers from entering prompts or leaving outputs
- Security: enforce allowlists for network and file access, require code signing for build scripts
- Model policy: define which model families are permitted for which tasks, quarantine unapproved versions
- Content policy: disallow classes of suggestions such as disabling logging, bypassing authentication, or scraping partner endpoints
Place guardrails where risk is highest: at tools and boundaries. Make guardrails observable in traces so supervisors can debug policy as a first class problem rather than chasing ghosts in production.
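As an illustration of guardrails at the tool boundary, the sketch below screens prompts for secret-like strings and enforces a network allowlist before any call leaves the sandbox. The regex patterns and hostnames are assumptions; production teams would plug in dedicated scanners and policy engines.

```python
import re
from typing import Callable

# Illustrative guardrails at the tool boundary. The patterns and hosts are assumptions;
# real deployments would wire in dedicated secret scanners and policy engines.

ALLOWED_HOSTS = {"api.internal.example.com", "billing.internal.example.com"}
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                        # AWS style access key id
    re.compile(r"-----BEGIN (?:RSA|EC) PRIVATE KEY-----"),  # embedded private keys
]

class GuardrailViolation(Exception):
    """Raised and recorded in the trace so supervisors debug policy instead of chasing ghosts."""

def screen_prompt(prompt: str) -> str:
    """Block secret-like content from entering a prompt before an agent ever sees it."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(prompt):
            raise GuardrailViolation(f"secret-like content matched {pattern.pattern}")
    return prompt

def guarded_fetch(host: str, path: str, fetch: Callable[[str, str], bytes]) -> bytes:
    """Enforce the network allowlist at the tool, not inside the model."""
    if host not in ALLOWED_HOSTS:
        raise GuardrailViolation(f"network call to unapproved host: {host}")
    return fetch(host, path)
```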
KPIs for supervisors, not just ICs
If models are first class contributors, performance indicators must reflect their contribution. A practical starter set:
- Time to green: median minutes from spec approval to a passing build for the first agent generated implementation
- Edit ratio: human edited lines divided by lines proposed by agents, trended by module and spec author
- Escaped defects: bugs per thousand agent generated lines that reach staging or production, normalized by test depth
- Spec clarity index: share of agent failures attributable to spec gaps rather than model behavior, based on postmortems
- Cost per change: total compute cost to land a change, including reviews and refactors
- Guardrail hit rate: proportion of runs that trigger safety rules, categorized by tool and spec pattern
These metrics are not for micromanaging silicon contributors. They turn supervision into a measurable discipline, allowing teams to tune process and allocate compute where the return is highest.
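All of these roll up from traces. Assuming each trace records timestamps, line counts, costs, and guardrail hits, a starter dashboard can compute the set above in a few lines; the field names below are assumptions about what a trace stores.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class ChangeTrace:
    # One record per agent generated change; field names are assumed, not prescribed.
    minutes_to_green: float       # spec approval to first passing build
    agent_lines: int
    human_edited_lines: int
    compute_cost_usd: float
    guardrail_hits: int
    escaped_defects: int          # bugs that reached staging or production

def supervisor_kpis(traces: list[ChangeTrace]) -> dict[str, float]:
    """Roll a batch of change traces into the starter KPI set."""
    total_agent_lines = sum(t.agent_lines for t in traces) or 1
    return {
        "time_to_green_median_min": median(t.minutes_to_green for t in traces),
        "edit_ratio": sum(t.human_edited_lines for t in traces) / total_agent_lines,
        "escaped_defects_per_kloc": 1000 * sum(t.escaped_defects for t in traces) / total_agent_lines,
        "cost_per_change_usd": sum(t.compute_cost_usd for t in traces) / len(traces),
        "guardrail_hit_rate": sum(1 for t in traces if t.guardrail_hits > 0) / len(traces),
    }
```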
Accountability in an age of auto authorship
When a model writes the code, who owns the bug? The answer stays the same: the team that shipped the change owns the bug. What changes is the mechanism that makes ownership workable.
- Every agent generated change should have a trace that includes model versions, prompts, tool calls, and test outcomes.
- Every trace should record an accountable human reviewer and spec owner.
- Every postmortem should classify the primary cause: spec failure, supervision failure, or model failure.
This is not a legal shield. It is an engineering practice that improves reliability and makes audits efficient. It also helps teams learn the right lessons from incidents. If half of your outages come from spec gaps, the remedy is better spec training and templates, not only newer models.
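A trace that makes ownership workable can be as plain as the record sketched below. The fields mirror the three requirements above; the failure classes are the three postmortem categories, and everything else is illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureClass(Enum):
    SPEC = "spec failure"
    SUPERVISION = "supervision failure"
    MODEL = "model failure"

@dataclass
class ChangeRecord:
    # Field names are illustrative; the point is that every agent generated change carries them.
    change_id: str
    model_versions: list[str]
    prompts: list[str]                 # or pointers to them, if prompts are large
    tool_calls: list[str]
    test_outcomes: dict[str, bool]
    reviewer: str                      # the accountable human reviewer
    spec_owner: str
    postmortem: Optional[FailureClass] = None  # set only when the change causes an incident
```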
Apprenticeship without the keyboard monopoly
A real worry is how junior developers learn when models write so much of the code. Apprenticeship used to be built on repetition. You gained scars by writing routine code and seeing it break.
Teams leaning into supervision are rebuilding apprenticeship with intent:
- Curriculum: rotations through spec writing, eval harness building, and guardrail design
- Shadowing: juniors pair with supervisors to watch model runs and learn to debug traces and revise specs
- Deliberate practice: simulated tasks in the eval harness where juniors must guide agents to success within a compute budget
- Code reading: weekly sessions that compare agent proposals with human refactors to teach taste and architecture
The outcome is a developer who manages leverage rather than only typing speed. They learn to think in systems instead of in snippets.
Intellectual property and provenance norms
Two norms deserve explicit redesign:
- Code provenance: require automated checks for third party borrowings in agent outputs. Treat unknown origins as a policy violation and tie provenance reports to the trace for every change.
- Training boundaries: document which internal code the organization allows as retrieval context for models. Ban sensitive modules from retrieval and enforce the policy using repository tags and tooling.
Do not rely on folklore about what is fine to copy or cite. Define a policy, teach it, and enforce it with tools that run on every agent proposal and every human commit.
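To keep the policy enforceable rather than folkloric, both checks can run on every proposal. The sketch below assumes a hypothetical `scan_for_matches` helper backed by a license or provenance index, plus repository tags that mark retrieval-banned modules; neither exists out of the box.

```python
from typing import Callable, Iterable

# A sketch of provenance and retrieval-boundary checks. The matcher and tag names are
# assumptions; real teams would wire in a license scanner and repository metadata.

RETRIEVAL_BANNED_TAGS = {"sensitive", "no-retrieval"}

def provenance_violations(
    agent_output: str,
    scan_for_matches: Callable[[str], Iterable[str]],  # hypothetical: returns origins of borrowed snippets
) -> list[str]:
    """Anything not clearly first party is a policy violation tied to the change's trace."""
    return [origin for origin in scan_for_matches(agent_output) if origin != "first-party"]

def retrieval_allowed(module_tags: set[str]) -> bool:
    """Keep tagged modules out of any model's retrieval context."""
    return not (module_tags & RETRIEVAL_BANNED_TAGS)
```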
A day in the life of a supervisor
To make this concrete, imagine a product team adding usage based billing to an existing platform.
- The spec owner assembles a spec that defines event schemas, rate limits, invoice rules, and rollbacks.
- The implementer agent drafts the event pipeline and billing engine. The test writer agent generates property tests and fixture data for replay.
- The reviewer agent spots a float precision risk in currency rounding and proposes a decimal library. The refactorer agent extracts pricing rules into a policy module.
- The doc writer agent produces migration notes and a dashboard runbook. The performance tuner agent rewrites a hot path in the aggregation job to hit the latency target.
- The supervisor runs the interactive eval harness that simulates spikes, refund cascades, and a warehouse network partition. Guardrails flag an attempted call to an unapproved endpoint. The agent is retried with an allowlisted connector.
- The change ships behind a flag to five percent of customers. The supervisor watches cost per change and guardrail hit rate. All green. Full rollout proceeds.
The code matters, but the durable assets are the spec, the eval harness, the guardrails, and the trace. Those can be reused and improved for the next feature.
What to do in the next 90 days
If supervision is becoming the job, use a staged plan.
Days 0 to 30: set the foundation
- Select two repeatable and low risk workflows to automate: documentation updates and refactors on a noncritical service
- Create a spec template and require it for every agent task. Include evaluation plans and compute budgets
- Stand up a minimal eval harness with traces and a judge process. Add two guardrails: secret detection and network allowlists
- Define a model org chart for your team. Start with implementer, reviewer, and test writer
Days 31 to 60: formalize supervision
- Set basic KPIs: time to green, edit ratio, cost per change
- Run weekly reviews of guardrail hits and postmortems. Classify failures by spec, supervision, or model
- Teach a two hour supervision class that covers spec writing, trace reading, and budget tuning. Make it required for code owners
Days 61 to 90: scale prudently
- Introduce performance tuner and doc writer agent roles. Extend guardrails to include content policy and model version allowlists
- Expand interactive evals to include perturbations such as timeouts and partial tool failures
- Start cost forecasting by mapping compute budgets to planned work. Report cost per change jointly to engineering and finance leaders
The slight accelerationist thesis
It is tempting to wait because the tools change quickly. Waiting is the wrong move. The path is not to replace engineers. The path is to professionalize supervision. Treat models as first class contributors with specs, guardrails, evals, and KPIs. Put a compute budget next to every feature. Give supervisors the same respect and training that a great staff engineer receives.
Teams that keep coding in the artisanal style will still ship, but their velocity will stall where humans become the bottleneck for rote work, where guardrails remain ad hoc, and where evals are a unit test suite pretending to be a safety net. Teams that move supervision to the foreground will outlearn and outdeliver them.
A closing picture
In mid October 2025, the conversation changed because two threads converged. An AI lab said aloud that many of its teams already live in this future. A major platform vendor shipped a toolchain that expects you to live there too. If your organization builds software, the real choice is whether you will design your supervision system on purpose or let it emerge by accident. One path yields speed with control. The other yields surprises.
The Supervisor Era rewards teams that make hidden work visible and subjective decisions measurable. It turns specs into the currency of progress and compute budgets into the language of planning. If you want an advantage, start acting like a supervisor and teach your team to do the same. The code will follow.