Mercury’s debut: diffusion LLMs reset latency and cost
Mercury puts diffusion language models in production, cutting latency and cost for real apps. Learn how this enables instant agent loops, on-device copilots, and streaming UIs with first drafts in under a second.

Breaking: diffusion LLMs hit production scale
On November 6, 2025, Inception announced a refreshed, scaled release of Mercury, a family of diffusion-based language models now available for enterprise use. The headline is simple and significant. The company claims frontier-level quality with a latency and cost profile that lets teams ship real-time agent loops, on-device copilots, and streaming UIs without blowing their budgets or their SLAs. For details on what the company says it is delivering, see the Inception Mercury November update.
Why does this matter now? Because most frustrations with assistants, copilots, and agents trace back to two dials. The first is time to the first useful token, which determines whether an interaction feels immediate or sluggish. The second is cost per action, which sets whether you can afford multi-pass reasoning, robust verification, and richer tool use. When those dials improve by multiples rather than by a few percent, product design changes this quarter, not next year.
With Mercury, many teams can let an agent think more without making users wait. Planning can run three passes instead of one and still fit a one-second budget. Voice sessions can stream continuously without awkward gaps. Privacy-sensitive on-device scenarios become plausible without sacrificing snappiness.
How diffusion makes text fast
Traditional large language models are autoregressive. They generate one token after another, and each token waits for the full stack of computations that produced the prior token. Diffusion language models invert that rhythm. They begin with a rough, partially masked view of the final answer and then refine many positions in parallel with a trained denoiser. The practical result is fewer serial steps on the critical path and better use of modern accelerators.
A useful mental picture is a jeweler polishing a ring. Autoregressive models carve letter by letter. Diffusion models buff the whole ring in short passes, each pass improving the shine across the surface at once. Fewer stops and starts mean less wall-clock time, especially on commodity GPUs that thrive on parallel work.
If you want a compact backgrounder on the approach, the Ars Technica text diffusion primer describes the shift from left-to-right sampling to coarse-to-fine refinement and why it matters for latency.
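To make the coarse-to-fine picture concrete, here is a toy Python sketch of confidence-based parallel unmasking. The vocabulary, scoring, and schedule are stand-ins for illustration only, not Mercury's actual denoiser.

```python
import random

MASK = "<mask>"
VOCAB = ["the", "ring", "shines", "after", "each", "pass", "."]

def toy_denoiser(draft):
    """Stand-in for a trained denoiser: propose a token and a confidence
    score for every masked position in parallel."""
    return {
        i: (random.choice(VOCAB), random.random())
        for i, tok in enumerate(draft) if tok == MASK
    }

def refine(length=8, passes=4):
    draft = [MASK] * length
    for step in range(passes):
        proposals = toy_denoiser(draft)
        if not proposals:
            break
        # Commit the most confident fraction of positions each pass,
        # so many tokens land in parallel instead of one at a time.
        budget = max(1, len(proposals) // (passes - step))
        keep = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)[:budget]
        for i, (token, _conf) in keep:
            draft[i] = token
        print(f"pass {step + 1}: {' '.join(draft)}")
    return draft

refine()
```

Each pass fills several positions at once, which is the source of the latency win: the number of serial steps depends on the pass schedule, not on the output length.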
From a systems view, diffusion LLMs offer three near-term advantages that builders can cash in now:
- Parallel refinement improves GPU occupancy. Work spreads across more cores, which raises useful work per watt.
- Fewer sequential steps lower tail latency. The longer the output, the more the sequential savings compound.
- In-place revision enables structured editing. The model can revise spans directly, which simplifies format-constrained generation and semantic edits that otherwise require complex decoding tricks.
There are tradeoffs, and we cover them later, but for interactive workloads the wins are already compelling.
What changes right now
Here are three product patterns that shift from aspirational to practical because latency and cost moved in your favor.
1) Near-instant agent loops
- What unlocks: Multi-step planning and tool use that completes within a second on common hardware. Think five tool calls, two short reasoning passes, and a final response that returns before the user loses focus.
- Concrete example: A procurement agent pulls three vendor quotes, normalizes messy line items, maps them to internal SKUs, then drafts an approval summary. With diffusion models in the hot path, planning and summarization barely register in the latency budget, so network I/O becomes the dominant bottleneck.
- Implementation notes: Cap step budgets explicitly, cache tool schemas, and reuse scratchpads across passes. With faster thinking, the new failure mode is cheap but unnecessary loops. Track a hard ceiling on tokens per action and emit a trace when it is exceeded. If you are coordinating multiple workers, frameworks like LangSmith Deployment v1.0 can help you ship multi-agent flows with traceability.
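A minimal sketch of that hard ceiling, assuming a hypothetical run_step callable for each planning or tool pass; the ceiling value and trace format are placeholders to adapt to your stack.

```python
import json
import time

MAX_TOKENS_PER_ACTION = 4_000   # hard ceiling; tune per workflow

def run_action(steps, run_step):
    """Execute one agent action while enforcing a token ceiling.
    `run_step` is a hypothetical callable returning (output, tokens_used)."""
    used = 0
    trace = []
    for name, payload in steps:
        output, tokens = run_step(name, payload)
        used += tokens
        trace.append({"step": name, "tokens": tokens, "at": time.time()})
        if used > MAX_TOKENS_PER_ACTION:
            # Emit a trace instead of silently looping; with cheap passes,
            # unnecessary loops are the new failure mode.
            print(json.dumps({"event": "budget_exceeded", "used": used, "trace": trace}))
            raise RuntimeError(f"action exceeded token ceiling ({used} > {MAX_TOKENS_PER_ACTION})")
    return trace
```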
2) On-device copilots
- What unlocks: Local-first experiences on laptops and phones for privacy or offline use. Parallel refinement makes compact models feel responsive, which offsets lower on-device compute.
- Concrete example: A code reviewer runs entirely on a developer laptop, analyzes a diff, and suggests tests before the pre-commit hook. A meeting assistant records locally, summarizes after the call, and never uploads raw audio.
- Implementation notes: Pair a compact diffusion model with platform runtimes like Metal on macOS or DirectML on Windows. Use mixed precision and quantization to fit memory envelopes. Preload grammar constraints for your common formats so the model spends less time revising structure.
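A quick fit check helps before committing to a device tier. The parameter count, bit width, KV-cache size, and overhead factor below are illustrative assumptions, not measured figures.

```python
def fits_memory(params_billion, bits_per_weight, context_tokens,
                kv_bytes_per_token, ram_budget_gb, overhead=1.2):
    """Rough memory-envelope check for an on-device model.
    `overhead` covers activations, runtime buffers, and fragmentation."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    kv_cache_gb = context_tokens * kv_bytes_per_token / 1e9
    total_gb = (weights_gb + kv_cache_gb) * overhead
    return total_gb, total_gb <= ram_budget_gb

# Illustrative: a 3B-parameter model at 4-bit weights with an 8k context
# on a laptop that reserves 8 GB of RAM for the copilot.
total, ok = fits_memory(3, 4, 8_192, 128 * 1024, 8)
print(f"~{total:.1f} GB needed, fits: {ok}")
```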
3) Streaming UIs that feel alive
- What unlocks: Interfaces that render a plausible draft in one sweep, then sharpen details in place. The cursor becomes a progress bar for quality, not just length.
- Concrete example: A support console streams a faint but complete response in under 150 milliseconds, then tightens facts as tool results arrive. A creative writing app reveals the full paragraph, then toggles tone from crisp to playful without retyping.
- Implementation notes: Build for idempotent edits. Maintain a revision map you can diff as polishing proceeds. Separate content delivery from adornment so screen readers handle the first usable draft, then pick up refinements without rereads. For front ends that automate the browser, patterns from Browser-native agents overtake RPA carry over cleanly.
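One way to keep edits idempotent is to key every span by a stable id and diff drafts between polishing passes. The sketch below uses Python's difflib; the span ids and draft contents are hypothetical.

```python
import difflib

def revision_diff(previous, current):
    """Return the span ids whose text changed between two polishing passes.
    Both arguments map a stable span id to its current text."""
    changed = {}
    for span_id, new_text in current.items():
        old_text = previous.get(span_id, "")
        if new_text != old_text:
            ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
            changed[span_id] = {"text": new_text, "similarity": round(ratio, 2)}
    return changed

draft_1 = {"greeting": "Hi there", "answer": "Your order ships soonish"}
draft_2 = {"greeting": "Hi there", "answer": "Your order ships on Friday"}
print(revision_diff(draft_1, draft_2))
# Only the "answer" span re-renders; the greeting never does, so screen
# readers are not forced to reread unchanged content.
```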
Where diffusion LLMs still lag
Diffusion LLMs are not a silver bullet. Teams shipping production systems should plan around a few gaps.
- Logprobs and calibration: Autoregressive models expose token probabilities that power safety filters, scoring, and self-consistency checks. Diffusion models refine spans rather than stepping token by token, so granular scoring is less direct. Practical answer: run a small autoregressive verifier for high-stakes outputs or score with a discriminative classifier tuned to your formats.
- Long-context retrieval: Parallel refinement helps with speed, not with retrieval precision over very long contexts. Models can lose entity threads across hundreds of thousands of tokens. Practical answer: chunk aggressively, resolve entities and citations first, then expand narrative in stages.
- Determinism and traceability: Parallel edits can increase variance between runs. That is fine for creative tasks but problematic for regulated workflows. Practical answer: pin seeds, cap the number of refinement passes, and archive intermediate drafts for audit.
- Tool use maturity: Function calling works, but many existing routers assume autoregressive streaming and token-level stop conditions. Practical answer: poll for function call completion after each refinement pass and accept span-level stop signals, as in the sketch after this list.
- Ecosystem depth: Observability and fine-tuning libraries are richer for autoregressive decoders. Practical answer: budget glue code and prefer vendors who expose denoiser controls such as step counts and edit temperatures.
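For the tool-use gap, the polling pattern might look like the sketch below. The client.refine_step and inject_tool_result calls and the tool_call fields are hypothetical stand-ins for whatever span-level interface your vendor exposes, not a real SDK.

```python
def run_with_tools(client, prompt, tools, max_passes=6):
    """Poll for a complete function call after each refinement pass
    instead of waiting on a token-level stop sequence.
    `client` is a hypothetical vendor interface, not a real SDK."""
    draft = client.start(prompt, tools=tools)
    for _ in range(max_passes):
        draft = client.refine_step(draft)
        call = draft.get("tool_call")           # span-level signal, if any
        if call and call.get("complete"):       # schema fully filled in
            result = tools[call["name"]](**call["arguments"])
            draft = client.inject_tool_result(draft, call["name"], result)
        if draft.get("stop"):                   # span-level stop flag
            break
    return draft
```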
A builder roadmap for the next two quarters
If you scope carefully and watch the right metrics, you can ship with diffusion LLMs in Q4 2025 and Q1 2026. Use this checklist as a starting point and adjust for your domain.
1) Local-first agents
- Architecture: Put the diffusion model in the hot path for planning and summarization. Keep a smaller autoregressive model as a verifier and a fallback for deterministic tasks like invoice totals or compliance disclaimers.
- Data: Cache tool schemas and high-frequency prompts on device. Precompute retrieval embeddings and ship them with the client. Sync deltas opportunistically when connectivity is available.
- Metrics: Track cost per action, not cost per token. Define it as input tokens times input price plus output tokens times output price plus tool costs. Watch P95 action latency and the share of actions completed offline. If P95 exceeds target by more than 20 percent for a week, freeze features and pay down debt on traces and caching.
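A small gate for the metrics above; the thresholds mirror the checklist, and the cost and latency inputs are assumed to come from your own tracing.

```python
def cost_per_action(input_tokens, output_tokens, tool_costs,
                    input_price_per_m, output_price_per_m):
    """Cost per action as defined above: token costs plus tool costs."""
    return (input_tokens * input_price_per_m / 1e6
            + output_tokens * output_price_per_m / 1e6
            + tool_costs)

def should_freeze(daily_p95_ms, target_ms, days=7, tolerance=0.20):
    """Freeze features if P95 exceeded target by more than 20 percent
    for a full week of daily samples."""
    recent = daily_p95_ms[-days:]
    return len(recent) == days and all(p > target_ms * (1 + tolerance) for p in recent)

print(cost_per_action(1_500, 600, 0.0004, 0.25, 1.00))   # dollars per action
print(should_freeze([990, 980, 990, 1_010, 1_020, 1_005, 995], target_ms=800))
```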
2) Voice-native workflows
- Architecture: Maintain a tight loop across automatic speech recognition, intent parsing, model planning, tool calls, and text-to-speech. For natural conversation, aim for under 250 milliseconds to speak back a first draft and under 700 milliseconds to finalize with tool results; a budget-tracking sketch follows this list.
- Conversation craft: Write prompts for interruption handling and quick repairs, such as a "clarify the last noun" behavior. Use short earcons for state changes. Keep the first verbal response brief, then layer richer content once results arrive. If you are tracking the space, the patterns in Voice-native AI beta align well with diffusion-first streaming.
- Guardrails: Use a compact keyword spotter for brand terms or compliance phrases so you are not relying solely on downstream filters that expect token logprobs.
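To keep the loop honest, time each stage against those budgets. The stage names below are illustrative and the work inside them is stubbed out.

```python
import time
from contextlib import contextmanager

FIRST_DRAFT_BUDGET_MS = 250
FINALIZE_BUDGET_MS = 700

class TurnTimer:
    """Accumulate per-stage latency for one voice turn."""
    def __init__(self):
        self.stages = {}
        self.start = time.perf_counter()

    @contextmanager
    def stage(self, name):
        t0 = time.perf_counter()
        yield
        self.stages[name] = (time.perf_counter() - t0) * 1000

    def elapsed_ms(self):
        return (time.perf_counter() - self.start) * 1000

timer = TurnTimer()
with timer.stage("asr"):
    pass    # transcribe the audio chunk (stand-in)
with timer.stage("plan"):
    pass    # diffusion model drafts the first spoken response (stand-in)
if timer.elapsed_ms() > FIRST_DRAFT_BUDGET_MS:
    print("warn: first draft over budget", timer.stages)
```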
3) Edge inference stacks
- Hardware: On Windows laptops, a recent GeForce RTX in the 40 series is a practical floor. On Macs, target Apple Silicon with at least 32 gigabytes of unified memory and a Metal-optimized build. On mobile, reserve diffusion for short, high-value turns and expect multi-second edits for larger tasks.
- Runtime: Choose runtimes that exploit parallel refinement. On servers, plan on TensorRT or comparable graph compilers. On clients, use platform accelerators rather than raw CUDA ports. Memory bandwidth often dominates, so profile for bandwidth, not just raw teraops; a rough bandwidth-bound estimate is sketched after this list.
- Distribution: If you must serve from the cloud, minimize tail latency between your edge and the region. Put an edge cache for prompts and schemas close to users.
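A rough way to see why bandwidth dominates: if every emitted token streams most of the weights through memory, tokens per second is bounded by bandwidth divided by bytes moved per token. The model size, bit width, and bandwidth figures below are illustrative assumptions, not benchmarks.

```python
def bandwidth_bound_tokens_per_sec(params_billion, bits_per_weight,
                                   bandwidth_gb_s, passes_per_token=1.0):
    """Upper bound on tokens/sec when weight traffic dominates.
    passes_per_token < 1 models parallel refinement touching the weights
    fewer times per emitted token than strict left-to-right decoding."""
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8 * passes_per_token
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative: a 3B-parameter model at 4-bit weights on a laptop with
# roughly 100 GB/s of memory bandwidth.
print(round(bandwidth_bound_tokens_per_sec(3, 4, 100)))        # strictly serial
print(round(bandwidth_bound_tokens_per_sec(3, 4, 100, 0.25)))  # ~4 tokens per pass
```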
Evaluation that reflects user reality
Standard leaderboards will not tell you whether your product feels instant or reliable. Reduce evaluation to a few application-grounded tests and automate them.
- Action Latency Test: Replay a suite of 200 representative tasks, including tool calls. Measure time to first draft and time to final. Report P50, P95, and P99.
- Stability Test: Run each task five times with a fixed seed and five times with random seeds. Compute a stability score based on semantic equivalence and structured field match rate. Alert when stability degrades beyond a threshold.
- Tool Success Test: Measure function call correctness, schema adherence, and end-to-end success on tasks requiring external data. Diffusion models can be confident yet slightly misaligned on formats, so this test catches silent failures.
- Safety Test: Pair a discriminative safety classifier with a small autoregressive verifier. If diffusion reduces latency but raises false negatives, insert a verification pass only for flagged intents to preserve speed.
- Cost Test: Compute cost per action across the full workflow. Do not average token prices across input and output. Many price sheets differ by direction, which skews averages.
Automate these as a nightly regression so you can see whether speedups survive real usage and whether changes improve the whole system, not just a micro-benchmark.
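As a starting point for that nightly run, here is a minimal sketch of the Action Latency Test. The run_task callable is a hypothetical hook into your own replay harness.

```python
import statistics

def replay_suite(tasks, run_task):
    """Replay representative tasks and report draft/final latency percentiles.
    `run_task` is a hypothetical callable returning (first_draft_s, final_s)."""
    drafts, finals = [], []
    for task in tasks:
        t_draft, t_final = run_task(task)
        drafts.append(t_draft)
        finals.append(t_final)

    def pct(xs, q):
        return statistics.quantiles(xs, n=100)[q - 1]

    return {
        "draft_p50": statistics.median(drafts),
        "draft_p95": pct(drafts, 95),
        "final_p95": pct(finals, 95),
        "final_p99": pct(finals, 99),
    }
```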
Pricing and capacity planning without guesswork
Sticker pricing on Inception’s site today lists input at 0.25 dollars per million tokens and output at 1.00 dollars per million tokens for Mercury. That lets you budget with simple arithmetic.
- Single chat turn: A turn with 400 input tokens and 200 output tokens costs (0.25 × 400 + 1.00 × 200) / 1,000,000 dollars. That is 0.0003 dollars per turn, or three hundredths of a cent.
- Agent loop: A loop that plans twice, calls two tools, and answers with 300 tokens can still land under a tenth of a cent on model usage. Real costs will shift to tools and infrastructure.
Throughput estimates are also straightforward. If an instance yields roughly 1,000 tokens per second per accelerator for common tasks, you can estimate concurrency by dividing available tokens per second by tokens per action for your workflow. A help desk action that consumes 1,200 tokens end to end needs about 1.2 seconds of pure generation time, so at that rate one accelerator sustains a little under one action per second; batching across sessions raises aggregate token throughput, and concurrency scales with it. Always model tail latency from networks and tool calls, since generation may no longer be the bottleneck.
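The same arithmetic in code, using the published sticker prices and the rough throughput estimate above:

```python
INPUT_PRICE_PER_M = 0.25     # dollars per million input tokens (sticker price)
OUTPUT_PRICE_PER_M = 1.00    # dollars per million output tokens (sticker price)

def turn_cost(input_tokens, output_tokens):
    """Cost of a single chat turn at the published sticker prices."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1e6

def actions_per_second(tokens_per_sec, tokens_per_action):
    """Steady-state actions per second for one accelerator at a given rate."""
    return tokens_per_sec / tokens_per_action

print(turn_cost(400, 200))               # 0.0003 dollars per turn
print(actions_per_second(1_000, 1_200))  # ~0.83 actions per second per accelerator
```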
Capacity tips:
- Keep per-tenant queues so one noisy tenant does not starve others.
- Pre-render first drafts for predictable tasks during idle periods. Diffusion makes this safe because you can polish the draft when the user arrives.
- Negotiate burst capacity in your contract. When generation is cheap, your risk shifts to traffic spikes.
Rollout risks and how to de-risk them
- Integration drift: Many frameworks assume autoregressive decoders. Validate that streaming handlers, stop conditions, and tool routers work with span-level edits. Run a week of parallel canary traffic and diff traces.
- Observability gaps: Without token logprobs, some monitors lose precision. Add span-level quality checks, schema validators, and a verifier pass for critical tasks. Log intermediate drafts for a sample of sessions to investigate regressions.
- UX surprises: Users may see a complete but slightly soft draft that sharpens over time. Set expectations with a subtle shimmer or a polishing label. Train support teams on what users will see.
- Vendor risk: Diffusion LLMs are new in production. Maintain a hot backup path with a small autoregressive model so outages or regressions do not cascade. Use feature flags to switch per route, as in the sketch after this list.
- Compliance posture: If you promise on-device processing, audit that no intermediate drafts are uploaded. For cloud serving, pin regions and publish a clear retention policy.
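A hedged sketch of that per-route switch; the flag store and model clients are hypothetical stand-ins for whatever you already run.

```python
def generate(route, prompt, flags, diffusion_model, fallback_model):
    """Route-level kill switch: serve the diffusion model unless the flag
    for this route is off, then fall back to the small autoregressive model.
    `flags`, `diffusion_model`, and `fallback_model` are hypothetical stand-ins."""
    if flags.get(f"diffusion:{route}", True):
        try:
            return diffusion_model(prompt)
        except Exception:
            # Outage or regression: degrade gracefully instead of cascading.
            pass
    return fallback_model(prompt)
```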
What this means for builders
The story in November 2025 is not speculative. Diffusion-based language models are available with published prices and enterprise distribution. They change optimal architectures now.
- For agent systems: Faster planning means you can afford multi-pass reasoning without long waits. Design agents to think more, in smaller increments, closer to the user. If you are rolling out operations teams or field workflows, the patterns we covered echo the momentum in Browser-native agents overtake RPA as agents take on more interactive work.
- For voice: Conversation becomes continuous rather than strictly turn based. Design for interruption, overlap, and quick repairs. A model that can revise in place lets you correct mid-sentence without sounding robotic. For a survey of voice-first experiences, see Voice-native AI beta.
- For privacy and performance: Local-first is back. Move sensitive steps to the device and reserve the cloud for retrieval and heavy tools.
Hold to a simple principle. When the marginal cost of thinking rounds toward zero, the winning product is the one that thinks more often, in smaller steps, closer to the user.
The bottom line
Inception’s Mercury is the first widely available proof that diffusion can work for language at commercial scale, not just in research. Lower latency and lower cost expand the design space today. Teams that adapt their architectures, evaluation harnesses, and pricing models in the next two quarters will ship experiences that feel immediate and alive while spending less. The rest will still be shaving tokens from prompts while their users wait.
