Claude Sonnet 4.5 pushes agents from demos to dependable work

Flashy demos are over. Claude Sonnet 4.5 pairs accurate computer use with long unattended runs and shipping-grade scaffolding, so teams can move from pilots to production agents that meet real service levels.

The week agents grew up

A year of flashy agent demos taught us two things. First, large models can click, type, and navigate. Second, most agents time out, forget context, or drift after a few hours. This week’s release of Claude Sonnet 4.5 changes the math. Anthropic’s new model pairs stronger computer use with longer unattended runs and a hardened developer stack, so teams can move from pilot experiments to production workflows that meet service level expectations. Anthropic is explicit about the shift, citing state of the art results on computer use benchmarks, an observed ability to stay focused for more than 30 hours, and shipping grade SDKs that mirror the company’s own internal scaffolding for Claude Code. The key details are in the Anthropic release post, which is worth reading closely for what it implies about agent reliability and developer control.

If you have been waiting for a sign that agents are ready to do real work without constant babysitting, this is it. Not because a single benchmark score jumped, but because the surrounding plumbing matured: checkpoints, memory, context editing, subagents, hooks, and background tasks. Those are the tools that make unattended runs observable, reversible, and safe enough to trust with specific slices of work.

What actually got better, and why it matters

Think of an autonomous agent like a junior teammate seated at a computer. To be useful, they need three things: they must see and operate the screen accurately, they must work for long stretches without losing the thread, and they must use the same tools as the rest of the team so their work is reviewable and auditable. Sonnet 4.5 moves each lever.

  • More precise computer use. Anthropic reports a strong jump on OSWorld style tasks that approximate real browser and desktop work. In practical terms, this means fewer off target clicks, more reliable form handling, and better step by step navigation when tasks involve pages that shift layout or require scrolling and filtering.
  • Longer unattended runs. Anthropic says Sonnet 4.5 maintained focus for more than 30 hours on complex multi step work. Long horizon stability is what converts agents from demo toys to a tool you can schedule overnight. It is also the difference between completing a cross repo refactor and stalling at the first failing test.
  • Claude Code becomes the backbone. Checkpoints let you rewind work without fear, memory and context editing prevent context bloat, and subagents plus hooks make parallelism and event driven execution first class. Crucially, the Claude Agent SDK exposes the same scaffolding Anthropic uses for its own product, which is the rare case where the vendor’s production grade plumbing becomes yours to ship.

Taken together, these changes unlock the boring but essential capabilities that enterprises ask for: guardrails that keep actions within policy, observability to answer what happened when, and recovery paths when something goes wrong.

A blueprint you can ship this quarter

Treat an agent program like you would any new service. Define its scope and service levels, instrument it, and keep a human handoff within reach. Here is a pragmatic plan you can implement within a quarter.

  1. Pick a narrow, high leverage workflow
  • Good candidates: spreadsheet compile and clean jobs, weekly data pulls and cross site joins, calendar triage with data entry, low risk code maintenance, and internal knowledge base refreshes.
  • Anti patterns: sensitive purchasing, broad email access without approvals, and any task that spans legal or security boundaries on day one.
  2. Define your service level objective and budget
  • Latency target: set a goal like “90 percent of runs complete in under 45 minutes.”
  • Quality target: pick a measurable output such as “no failed validations in the final CSV” or “all tests pass on the branch.”
  • Cost cap: track tokens and tool calls per job. Set automated early stops when runs exceed budget without making progress.
  3. Build with Claude Code and the Agent SDK
  • Use checkpoints so operators can rewind without losing the conversation. This encourages faster iteration because risk is reversible.
  • Use subagents for parallel search and summarization. Give each subagent a tight context and a single responsibility, then have an orchestrator stitch results.
  • Add hooks for event driven hygiene. Examples: run unit tests after edits, lint before commit, or re-fetch credentials if an API call fails.
  • Store artifacts in a predictable place. Treat the agent’s folder structure as context engineering. Keep logs, inputs, outputs, and intermediate state as separate directories.
  4. Put observability in from the start (a logging sketch follows this list)
  • Logs: record each tool call with parameters, duration, and result snippet. Add a unique run id and step numbers so a task reads like a timeline.
  • Metrics: count steps to completion, tool success ratio, human handoff rate, retry loops, and cost per successful run.
  • Traces: keep screenshots only where needed for privacy, but do preserve before and after code diffs or cell diffs in spreadsheets. Screenshots are invaluable for diagnosing brittle selectors and misclicks.
  5. Design permissions and approvals
  • Default to allowlists. Give the agent explicit permission to edit only the directories, repositories, and web domains it needs.
  • Require approvals for specific actions. Examples: submitting purchase forms, deleting calendar events, or pushing to a protected branch.
  6. Add robust failure handling (a stuck detection sketch follows this list)
  • Implement stuck detection. If the agent repeats a step more than N times or exceeds a time budget for a subtask, escalate to a human with a prefilled summary of the last actions taken.
  • Build a rollback path. Combine checkpoints with version control so a failed run can restore the prior state in one command.
  7. Write a runbook and practice it
  • Spell out the expected responses to common failure modes: if the agent hits a login gate, it unblocks via an approved credential path; if it encounters a layout change, it switches to a semantic selection strategy; if a tool consistently returns 500 errors, it falls back to a cached plan.
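As a concrete starting point for step 4, here is a minimal sketch of the tool call logging shape in Python. The file path, field names, and helper functions are illustrative assumptions, not part of the Claude Agent SDK; adapt them to whatever logger and storage you already run.

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("runs/logs.jsonl")  # hypothetical location: one JSON record per tool call

def new_run_id() -> str:
    """Unique id so every step of a run reads like one timeline."""
    return uuid.uuid4().hex

def log_tool_call(run_id: str, step: int, tool: str, params: dict, result: str, started: float) -> None:
    """Append one structured record per tool call: parameters, duration, and a result snippet."""
    record = {
        "run_id": run_id,
        "step": step,
        "tool": tool,
        "params": params,
        "duration_s": round(time.time() - started, 3),
        "result_snippet": result[:200],  # snippet only, to keep logs small and privacy friendly
    }
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Usage: wrap every tool invocation so a failed run can be replayed as a timeline
run_id = new_run_id()
t0 = time.time()
result = "42 rows written to output.csv"  # stand-in for a real tool result
log_tool_call(run_id, step=1, tool="spreadsheet.write", params={"file": "output.csv"}, result=result, started=t0)
```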
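The stuck detection and cost caps from steps 2 and 6 can be as simple as a counter and a budget object checked on every loop iteration. The sketch below is a hypothetical shape, not an SDK feature; the thresholds are placeholders, and a real deployment would persist this state alongside checkpoints so an escalation carries the last actions taken.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RunBudget:
    max_repeats: int = 3              # N repeats of the same step before escalating
    max_tokens: int = 200_000         # cost cap per job
    max_subtask_seconds: float = 900.0
    tokens_used: int = 0
    step_counts: Counter = field(default_factory=Counter)

    def record_step(self, step_signature: str, tokens: int) -> None:
        self.step_counts[step_signature] += 1
        self.tokens_used += tokens

    def should_escalate(self, step_signature: str, subtask_seconds: float) -> str | None:
        """Return a reason when the run should stop and hand off to a human."""
        if self.step_counts[step_signature] > self.max_repeats:
            return f"repeated step more than {self.max_repeats} times: {step_signature}"
        if self.tokens_used > self.max_tokens:
            return f"token budget exceeded: {self.tokens_used} > {self.max_tokens}"
        if subtask_seconds > self.max_subtask_seconds:
            return "subtask exceeded its time budget"
        return None

# Usage inside the agent loop: escalate with a prefilled summary of recent actions
budget = RunBudget()
budget.record_step("click:submit_form", tokens=1_200)
reason = budget.should_escalate("click:submit_form", subtask_seconds=42.0)
if reason:
    print(f"Escalating to a human: {reason}")
```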

Teams that follow this playbook can move a single agentized workflow from a supervised pilot to a basic service level agreement in roughly 6 to 10 weeks, depending on approvals and privacy reviews. For examples of how this looks when adjacent stacks harden, see how pull requests become runtime in the GitHub PR becomes runtime piece and how cloud stacks codify agent patterns in the Amazon Bedrock AgentCore overview.

Architecture patterns that hold up in production

A dependable agent system is less about a clever prompt and more about a sound architecture. Here are patterns that reduce flakiness and limit blast radius.

  • One orchestrator, many focused subagents. Treat each subagent like a microservice with a single responsibility. Typical roles include “search and collect,” “normalize and dedupe,” “summarize for review,” and “apply edits.” Keep their contexts small and explicit (a minimal sketch follows this list).
  • Checkpoint early and often. Checkpoints are like save points in a game. After each major step, snapshot inputs, plan, and artifacts. If something goes off the rails, roll back to the last good state and continue.
  • Plan critique loops. Before an agent executes a long plan, ask a second lightweight verifier to check the plan for obvious errors. Do the same after key steps. This catches loops and hallucinated tool names before they cost time and money.
  • Semantic selectors, not brittle XPaths. When driving the browser, prefer semantic target selection and content matching over static CSS paths. Pair this with minimal, privacy aware screenshots so you can debug drift when layouts change.
  • Data first design. The data layer is the control plane. Store canonical inputs, outputs, validations, and costs in structured stores so you can audit, replay, and improve. For a deeper look at this philosophy, see the Agent Bricks data control plane.
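To make the orchestrator, subagent, and plan critique patterns concrete, here is a minimal sketch in plain Python. The roles and the critique_plan check are illustrative assumptions rather than Claude Agent SDK APIs; in practice each run callable would wrap a model backed subagent with its own tight context.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subagent:
    """One responsibility, one tight context, like a microservice."""
    name: str
    run: Callable[[str], str]

def critique_plan(plan: list[str], known_tools: set[str]) -> list[str]:
    """Lightweight verifier: flag steps that reference tools that do not exist."""
    return [step for step in plan if step.split(":")[0] not in known_tools]

def orchestrate(task: str, subagents: list[Subagent]) -> str:
    """Run focused subagents in sequence and stitch their outputs."""
    context = task
    for agent in subagents:
        context = agent.run(context)
    return context

# Usage: the lambdas are stand-ins for model backed calls with narrow contexts
pipeline = [
    Subagent("search_and_collect", lambda t: f"collected sources for: {t}"),
    Subagent("normalize_and_dedupe", lambda t: f"normalized: {t}"),
    Subagent("summarize_for_review", lambda t: f"summary: {t}"),
]
plan = ["search:find billing exports", "teleport:impossible step"]
problems = critique_plan(plan, known_tools={"search", "normalize", "summarize"})
print(problems)  # the hallucinated tool is caught before it costs time and money
print(orchestrate("weekly revenue pull", pipeline))
```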

Concrete use cases that earn trust fast

  • Revenue operations spreadsheet automation. Nightly, a Sonnet 4.5 agent pulls data from a billing export, filters out churned accounts, adds updated segments, and regenerates a pivot dashboard. Checkpoints and file creation features save the working artifacts. Quality gates verify each sheet has the expected row counts and key columns populated.
  • Engineering code upkeep. A subagent scans for deprecated function signatures across services. Another subagent edits call sites. A test hook runs unit suites after each batch of edits. The orchestrator opens a draft pull request with a readable change map and a run cost line item.
  • Customer support triage. An agent reads new tickets, checks the knowledge base, runs a targeted search, and drafts responses. Tickets above a risk score threshold get human review. Metrics track average resolution time and the handoff rate per queue.
  • Security hygiene. The agent pulls dependency manifests, checks against a vulnerability feed, and opens issues with suggested patches. Any patch that modifies runtime behavior requires approval. The runbook specifies who approves and what happens if a patch fails tests.

None of these require speculative research. All of them lean on Sonnet 4.5’s more reliable computer use and Claude Code’s production scaffolding.

Compare and contrast: Sonnet 4.5 vs Google’s Mariner and OpenAI’s Operator

Where Sonnet 4.5 pulls ahead today

  • Production scaffolding you can adopt. The Claude Agent SDK and the Claude Code runtime are battle tested inside Anthropic’s own product. Checkpoints, subagents, hooks, and background tasks give you the bones of a real service rather than a bare model endpoint.
  • Long horizon execution. Anthropic reports observed runs over a full day, which matters for cross site research, large refactors, and multi stage operations that do not fit into a short session.
  • Computer use accuracy. On benchmarks focused on real on screen work, Sonnet 4.5 shows strong scores, which reduces the unseen operator time spent redoing clicks and form entries.

Where Google’s Mariner stands out

  • Browser native focus. Mariner is a research prototype built for Chrome first workflows. It can manage parallel tasks inside virtual machines and is positioned to flow into the Gemini platform and Google Cloud developer tools. If your fleet is already Chrome managed and your tasks are web only, Mariner’s integration path is attractive.
  • Teach and repeat. Mariner emphasizes recording and replaying demonstrated tasks. For repetitive web procedures across many users, this creates a clear training and rollout story.

Where OpenAI’s Operator stands out

  • Consumer distribution and brand. Operator rides inside ChatGPT and is designed to perform everyday tasks through an automated browser. For teams already embedded in ChatGPT workflows, Operator can be a low friction way to validate agent value with end users.

Gaps to watch

  • Sonnet 4.5. Desktop automation outside the browser and advanced visual interactions are improving quickly, but specialty workflows may still require custom tooling or wrappers around computer vision and accessibility trees. Teams should validate edge interfaces like custom dashboards, legacy intranet apps, or vendor portals with unusual widgets. Independent reporting notes that longer agent sessions are possible but still benefit from explicit compaction and tool driven self checks to prevent drift over time.
  • Mariner. Access is gated to specific subscriber tiers and geographies, and it is evolving toward Gemini. This limits near term enterprise rollout unless you are already inside Google’s managed environments and comfortable with the prototype label.
  • Operator. OpenAI communicates rate limits and sensitive action restrictions that can interrupt unattended jobs. Operator may also hand control back to the user in certain flows and can get stuck on complex interfaces. For continuous, unattended back office work, these limits matter. That said, Operator’s tight integration with ChatGPT makes it a compelling front door for simpler, supervised tasks.

Bottom line: if your goal is to ship multi hour, audit friendly digital labor in the next quarter, Sonnet 4.5 plus the Claude Code stack offers the most complete, production shaped toolkit right now. If your strategy centers on Chrome only flows or end user tasks inside ChatGPT, Mariner and Operator can be valuable complements or test beds while you harden Sonnet based services for back office work.

How to decide what to build first

Use a simple scoring model:

  • Repeatability. How similar are today’s human steps from run to run? Higher is better.
  • Context locality. How many systems does the task touch? Fewer systems improve reliability.
  • Blast radius. What is the worst case failure? Pick low risk tasks for your first SLA.
  • Feedback signals. Are there crisp validations to verify success? The more objective checks you have, the easier it is to trust and improve the agent.

Rank candidate workflows by this score. Pick one with clear validations and a modest blast radius. Build the smallest end to end service, not a grab bag of tools. Ship it, measure it, and only then broaden scope.
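A minimal sketch of that scoring model follows, with illustrative 1 to 5 scores and equal weights; tune the weights to your own risk appetite.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    repeatability: int     # 1-5, higher is better
    context_locality: int  # 1-5, fewer systems touched scores higher
    blast_radius: int      # 1-5, lower risk scores higher
    feedback_signals: int  # 1-5, more objective checks scores higher

    def score(self) -> int:
        return self.repeatability + self.context_locality + self.blast_radius + self.feedback_signals

# Usage: rank hypothetical candidates and pick the top one for the first SLA
candidates = [
    Candidate("nightly revenue spreadsheet refresh", 5, 4, 4, 5),
    Candidate("vendor purchasing with approvals", 3, 2, 1, 2),
]
for c in sorted(candidates, key=lambda c: c.score(), reverse=True):
    print(f"{c.score():>2}  {c.name}")
```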

What to measure from week one to week six

Good metrics raise signal and lower anxiety. Start with a small but telling set, then expand as the service matures.

  • Run success rate. Percentage of runs that reach the final validation without human help. Segment by task type.
  • Mean cost per successful run. Tokens, tool calls, and any external API spend. Track the 90th percentile to catch outliers.
  • Plan churn. How often the agent revises its plan mid run. A rising trend signals drift or brittle selectors.
  • Step count to completion. Useful for regression checks after changes to prompts, tools, or UI layouts.
  • Human handoff rate. Percent of runs that require approval or intervention. Pair with a short handoff reason code.
  • Time to rollback. From failure detection to restored state, measured in minutes. Check that checkpoints plus version control make this trivial.

Instrument these from the start and publish a weekly dashboard so stakeholders see progress. Nothing builds trust like a graph that trends up and to the right while costs trend down.
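If you keep structured run records like the logging sketch earlier, most of these metrics reduce to a few lines of Python. The record fields below are assumptions; the point is that each metric is a plain aggregate over data you already capture.

```python
from statistics import quantiles

# Hypothetical run records, one dict per completed run
runs = [
    {"succeeded": True, "handoff": False, "cost_usd": 0.84, "steps": 41},
    {"succeeded": True, "handoff": True, "cost_usd": 1.92, "steps": 77},
    {"succeeded": False, "handoff": True, "cost_usd": 2.40, "steps": 120},
]

success = [r for r in runs if r["succeeded"]]
run_success_rate = len(success) / len(runs)
handoff_rate = sum(r["handoff"] for r in runs) / len(runs)
mean_cost_per_success = sum(r["cost_usd"] for r in success) / len(success)
p90_cost = quantiles([r["cost_usd"] for r in runs], n=10)[-1]  # 90th percentile to catch outliers

print(f"run success rate: {run_success_rate:.0%}")
print(f"human handoff rate: {handoff_rate:.0%}")
print(f"mean cost per successful run: ${mean_cost_per_success:.2f}")
print(f"p90 cost per run: ${p90_cost:.2f}")
```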

Risk and safety, handled like engineering

  • Prompt injection and web risks. Treat every web page as untrusted input. Use allowlists, sanitize tool outputs, and prefer fetch then parse over scrape and hope. Add a second model or rule based verifier for high stakes outputs. A minimal allowlist sketch follows this list.
  • Identity and approvals. Use short lived credentials. Give the agent a separate account with role based permissions. Log every action with principal, time, and outcome.
  • Privacy by default. Avoid screenshots unless needed for debugging. When you do capture them, store in a short retention bucket.
  • Cost control. Cap reasoning tokens and enforce early stop conditions. Summarize context aggressively with the SDK’s compaction features when runs exceed a step threshold.
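Here is a minimal sketch of the allowlist and approval gates described above, assuming a hypothetical set of approved domains and high impact actions; in production you would enforce the same checks at the tool layer or an egress proxy rather than inside agent code.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"internal.example.com", "billing.example.com"}  # hypothetical allowlist
APPROVAL_REQUIRED_ACTIONS = {"submit_purchase_form", "delete_calendar_event", "push_protected_branch"}

def is_allowed_url(url: str) -> bool:
    """Treat every page as untrusted; only fetch from explicitly approved domains."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_DOMAINS

def requires_approval(action: str) -> bool:
    """High impact actions always pause for a human decision."""
    return action in APPROVAL_REQUIRED_ACTIONS

# Usage: gate every fetch and every sensitive action before execution
assert is_allowed_url("https://billing.example.com/exports")
assert not is_allowed_url("https://attacker.example.net/login")
assert requires_approval("submit_purchase_form")
```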

None of this is exotic. It is the standard playbook for any new service, applied to an agent that now has the stamina and control to justify the effort.

FAQ for the skeptical buyer

  • What about reliability beyond a day? Long sessions are possible, but do not assume infinite focus. Design for periodic compaction, checkpoint resets, and quick relaunch on failure.
  • Can I mix Sonnet 4.5 with other models? Yes. Use Sonnet for long horizon execution and pair it with smaller verifiers for specific checks. Keep contracts between subagents clean so you can swap components.
  • How do I prove value to finance? Start with a pilot on a contained workflow, publish a clear SLO, and report weekly on cost per successful run. Include a counterfactual that shows prior manual time on task.
  • What if our browser UI changes often? Favor semantic targets and content based actions. Keep a thin map of key selectors in code and version it like any dependency, as in the sketch below.
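A thin, versioned selector map can look like the sketch below. The keys and targets are illustrative; the point is that UI drift becomes a one line diff in code review instead of a silent failure mid run.

```python
# selectors.py: a thin, versioned map of semantic targets, treated like any other dependency.
# Keys and targets are illustrative; prefer role and content based matching over raw CSS paths.
SELECTOR_MAP_VERSION = "2025.10.1"

SELECTORS = {
    "billing_export_button": {"role": "button", "name": "Export CSV"},
    "date_range_filter": {"role": "combobox", "name": "Date range"},
    "invoice_table": {"role": "table", "name": "Invoices"},
}

def target(key: str) -> dict:
    """Look up a semantic target; a missing key fails loudly instead of clicking blind."""
    return SELECTORS[key]
```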

The signal in this launch

You can tell a technology has crossed into usefulness when the support systems get as much attention as the headline scores. That is the story with Sonnet 4.5. Anthropic did not just raise a benchmark. It shipped the rails that make agents debuggable, reversible, and governable. The Verge highlighted that the model ran a multi day project and built a real application, which matters less as a demo and more as proof that the stack can keep an agent coherent from start to finish, as covered in The Verge’s detailed report.

If you run a software, operations, or analytics team, the practical takeaway is simple. Pick one workflow, wire it to Claude Code with Sonnet 4.5, instrument it like a service, and put a basic SLA around it. Do it now while your competitors are still watching demos. The gap you open in the next quarter will be hard to close once agents become table stakes.

Where the ecosystem is heading

The broader market is converging on a few truths. First, agents are becoming part of existing developer and operations surfaces rather than new silos. We are seeing this in how pull requests become live execution control points in the GitHub PR becomes runtime story, and in how cloud platforms expose policy, identity, and observability primitives as first class parts of the agent loop in the Amazon Bedrock AgentCore overview. Second, the data layer is no longer a passive store. It is the governance and control plane, as argued in the Agent Bricks data control plane. Third, the winners will be the teams that operationalize quickly, measure relentlessly, and choose problems with tight feedback loops.

Claude Sonnet 4.5 lands at exactly the right moment. It gives teams the accuracy, stamina, and scaffolding to move beyond proof of concept and into accountable service delivery. If you pick the right slice of work and run the playbook above, you can turn agent hype into durable advantage in a single quarter.

Other articles you might like

GitLab Duo Agent Platform hits beta, DevSecOps orchestrated

GitLab turned agentic development into production reality. Duo Agent Platform enters public beta with IDE and web chat, an orchestrated Software Development Flow, MCP support, and integrations for JetBrains and Visual Studio.

OutSystems launches Agent Workbench, MCP, and Marketplace

OutSystems just moved from proof of concept to production with Agent Workbench, full MCP support, and a curated marketplace. Here is why this matters for CIOs, platform teams, and anyone ready to scale enterprise AI agents with real guardrails.

GitHub Copilot Agent Goes Live: Pull Request Becomes Runtime

GitHub’s Copilot coding agent is now generally available and runs through draft pull requests with Actions sandboxes, branch protections, and audit logs. Learn how to roll it out safely, tune policies, and measure real impact.

Amazon Bedrock AgentCore makes AI agents production ready

Amazon Bedrock AgentCore turns agent ideas into governed services. Isolation, eight hour runs, persistent memory, identity, observability, a secure browser, and an MCP tool gateway give enterprises a reliable base to ship.

Agent Bricks turns the data layer into the agent control plane

Databricks and OpenAI just put frontier models inside Agent Bricks. Here is how the data layer becomes the control plane for enterprise agents, what MLflow 3.0 adds to observability, and how leaders should act now.

ChatGPT Becomes a Storefront: Inside ACP and Instant Checkout

OpenAI and Stripe just turned ChatGPT into a real storefront. We explain how the Agentic Commerce Protocol works, what Instant Checkout changes for merchants, and a week by week playbook with KPIs to launch a pilot now.

WebexOne 2025 turns Webex into a cross platform agent suite

Cisco reframes Webex as an agent driven platform spanning meetings, calling, and devices. See what the new agents do, how they connect to your systems, and a six week CIO playbook to pilot them safely.

Notion 3.0 Agents make your workspace an action graph

Notion 3.0 puts agents inside the workspace you already use, turning pages and databases into a living action graph. See how graph-native memory, long-running tasks, and policy-aware edits unlock real automation.

Excel and Word Just Became Auditable Agent Workbenches

Microsoft just put stepwise agents inside Excel and Word, with an Office Agent in Copilot chat. Auditable steps, refreshable outputs, and built in governance turn everyday files into dependable workflows worth trusting.