Agent Bricks Turns AI Agents Into a Production Pipeline
Databricks is turning fragile demos into dependable agents. Agent Bricks assembles, evaluates, and packages task-focused agents on governed data so teams can make explicit tradeoffs across accuracy, latency, and cost, with audit trails in place.

Breaking: agent building grows up
Databricks just took a swing at the messiest part of enterprise AI: turning a promising demo into a dependable agent that both your compliance team and your CFO can trust. In June 2025 the company introduced Agent Bricks, a beta workflow that assembles and improves task-focused agents on your governed data, then lets you pick the balance between quality and cost you are willing to ship. The pitch is simple and bold: describe the task, connect your data, and the system handles evaluation, optimization, and packaging for production. See the official Agent Bricks beta announcement.
This is not a new model. It is a way to stop guessing with prompts and start engineering with measurements.
From prompt tinkering to a real pipeline
Most teams still build agents the same way. A few people refine a prompt in a notebook, paste in example questions, adjust temperatures, try a different provider, and hope the next run is better. That loop is fast but fragile. Success depends on memory, shared context, and a lot of screenshots.
Agent Bricks reframes the work as an explicit pipeline that any engineer or analyst can retrace:
- Task specification. You describe the job in concrete terms. For example, extract parties, effective dates, and renewal terms from vendor agreements, or triage incoming tickets into four queues with explanation notes.
- Data grounding. You attach curated data from your lakehouse. This might be a Delta table of contracts, a cataloged folder of PDFs, or a vectorized knowledge base. The system samples this material to build a development and evaluation set.
- Synthetic data generation. Using techniques from Mosaic AI research, the system creates domain-specific synthetic examples that look and behave like your data. These examples expand coverage across edge cases you probably do not have labeled.
- Task-aware benchmarks. It builds custom evaluation harnesses for your task. These include rule-based checks, reference answers when you have them, and large language model judges that score qualities like factuality, reasoning steps, or tone.
- Search over designs. Rather than nudge one prompt by hand, Agent Bricks explores a space of prompts, tools, retrieval settings, and models. Each configuration is measured against the same test sets and cost metrics.
- Human feedback and iteration. Subject matter experts can review errors, correct labels, and flag unacceptable outcomes. The feedback becomes training signal for the next cycle.
- Production packaging. The winning variants are registered with lineage and metrics, wrapped with guardrails, and deployed with service level objectives and cost alerts.
The result is a Pareto frontier you can see. One variant might cost 40 percent less with a small loss of accuracy, while another squeezes a few more points of correctness at a higher token bill. You choose, explicitly.
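To make that choice concrete, here is a minimal sketch (not Agent Bricks' actual API, just the idea) of picking variants from a sweep once every configuration has been scored on the same test set. The variant names, numbers, and the cost-per-correct-answer metric are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    accuracy: float        # fraction correct on the shared test set
    cost_per_1k: float     # dollars per 1,000 requests

    @property
    def cost_per_correct(self) -> float:
        # Cost of one correct answer: the number finance actually compares.
        return (self.cost_per_1k / 1000) / self.accuracy

def pareto_frontier(variants: list[Variant]) -> list[Variant]:
    """Keep only variants that no other variant beats on both accuracy and cost."""
    frontier = []
    for v in variants:
        dominated = any(
            o.accuracy >= v.accuracy
            and o.cost_per_1k <= v.cost_per_1k
            and (o.accuracy > v.accuracy or o.cost_per_1k < v.cost_per_1k)
            for o in variants
        )
        if not dominated:
            frontier.append(v)
    return sorted(frontier, key=lambda v: v.cost_per_1k)

# Hypothetical sweep results
sweep = [
    Variant("small-model-rag10", accuracy=0.86, cost_per_1k=4.0),
    Variant("large-model-rag20", accuracy=0.91, cost_per_1k=11.0),
    Variant("small-model-rag20", accuracy=0.88, cost_per_1k=6.5),
]
for v in pareto_frontier(sweep):
    print(f"{v.name}: {v.accuracy:.0%} correct, ${v.cost_per_correct:.4f} per correct answer")
```

The point is not the math; it is that every variant sits on one chart with comparable numbers, so choosing a tradeoff becomes a business decision rather than a hunch.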
What early users are seeing
Early design partners report strong throughput on document-heavy tasks. One global pharmaceutical team used Agent Bricks to parse hundreds of thousands of clinical trial files and extract structured fields without writing application code. They stood up a working agent in under an hour, then iterated as domain experts flagged edge cases common to real trials, such as protocol amendments and country-specific consent forms. The cadence changed. Instead of one big push toward a brittle agent, teams now run many small experiments and ship the best tradeoff with audit trails intact.
On conversational use cases, the most immediate wins come from reducing hallucinations through task-aware grading and from faster content routing. For example, a support triage agent improves not only answer quality but also handoff quality by attaching a structured rationale and a confidence score that downstream humans can scan quickly. That alone can cut minutes per ticket.
The economics matter just as much. Because every variant is measured on the same test set, finance leaders can compare cost per correct answer and pick a target. When usage scales, that discipline matters more than any single model choice.
Under the hood: what Agent Bricks really automates
A few pieces make the system feel different from a bag of scripts:
- Domain-specific synthetic data. The generator learns the shape of your domain. If your documents use industry jargon or obscure date formats, the synthetic set captures those quirks. That makes evaluations more representative and makes it easier to catch brittle prompts before they hit production data.
- Task-aware evaluations. Instead of generic metrics, the harness is built around what good means for your task. A knowledge assistant is scored on groundedness and citation coverage. An extractor is scored on field-level precision and recall with normalization rules. A triage agent is scored on routing accuracy and rationale completeness.
- LLM judges with calibration. Large language model based graders are powerful but need guardrails. Agent Bricks blends them with rules, reference checks, and sampling strategies that reduce grader bias. It also gives teams ways to spot and correct grader drift over time. A minimal sketch of the blended-grading idea follows this list.
- Search across the design space. The system treats prompts, retrieval depth, tools, and models as knobs to tune. It runs controlled sweeps, not ad hoc trial and error, and records every run with lineage and metrics so you can reproduce winning recipes.
- Packaging for production. Models and agents are registered with metadata, including the datasets and evaluations that justified their promotion. That makes audits and rollbacks routine, not heroic.
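As a rough illustration of blending judges with rules, here is a minimal sketch: a deterministic citation check runs first and can hard-fail a response, and an LLM judge only scores what passes. The `[doc:<id>]` citation convention and the judge_groundedness call are assumptions for the example, not Agent Bricks internals.

```python
import re

def citation_check(answer: str, retrieved_ids: set[str]) -> bool:
    """Rule check: every [doc:<id>] citation must point at a retrieved chunk."""
    cited = set(re.findall(r"\[doc:([\w-]+)\]", answer))
    return bool(cited) and cited.issubset(retrieved_ids)

def judge_groundedness(answer: str, context: str) -> float:
    """Placeholder for an LLM judge that returns a 0-1 groundedness score."""
    raise NotImplementedError("call your grading model here")

def grade(answer: str, context: str, retrieved_ids: set[str]) -> dict:
    # Rules run first and are non-negotiable; the judge refines, never overrides.
    if not citation_check(answer, retrieved_ids):
        return {"score": 0.0, "hard_fail": True, "reason": "missing or unknown citation"}
    return {"score": judge_groundedness(answer, context), "hard_fail": False, "reason": "judged"}
```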
If you are on Azure Databricks, you likely first saw Agent Bricks surface in the May 2025 Azure Databricks release notes, which flagged the beta and described the initial focus on text tasks. The common thread since then has been steady hardening for production: tighter integration with catalogs, better tracing in MLflow, and serverless compute that makes cost controls predictable.
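Agent Bricks handles run tracking for you, but the underlying pattern is worth seeing. Here is a minimal MLflow tracking sketch that records one swept configuration and its metrics so a winning recipe can be reproduced later; the parameter names, metric values, and tag are illustrative.

```python
import mlflow

config = {"model": "small-model", "retrieval_depth": 10, "prompt": "extract_v3"}
metrics = {"field_f1": 0.89, "cost_per_correct": 0.004, "p95_latency_s": 1.8}

with mlflow.start_run(run_name="contract-extractor-sweep-017"):
    mlflow.log_params(config)      # the knobs this variant used
    mlflow.log_metrics(metrics)    # scores on the shared, frozen test set
    mlflow.set_tag("eval_set_version", "golden_contracts@v3")  # pin what it was judged against
```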
The lakehouse-native agent foundry pattern
Teams that succeed fastest keep everything close to their governed lakehouse. A repeatable pattern is emerging.
- Ingest and curate. Land raw documents and event streams into Delta tables. Use quality rules to mark gold subsets. Keep them in your catalog with frozen versions.
- Ground and index. Build retrieval indexes or embeddings from gold data, with lineage back to source tables. Register them in your catalog so every agent version knows exactly which snapshot it used.
- Generate synthetic neighbors. For each golden example, generate synthetic variants that preserve structure but stress different edge cases. For example, add missing headers, odd date formats, or language switches inside a document. A small sketch of this step follows below.
- Define task-aware benchmarks. Codify what counts as good. For extraction, write field validators and normalization rules. For retrieval-augmented chat, define groundedness and citation coverage. For triage, define allowed routes and rationale length.
- Run the search. Let Agent Bricks sweep across prompts, retrieval depths, function calls, and model choices. Cap the token budget and record cost per correct example.
- Register and deploy. Pick a target on the cost versus quality curve. Register the agent with its datasets, scores, and lineage. Deploy behind a simple API with guardrails and monitoring.
- Close the loop with humans. Randomly sample production outputs, route them to subject matter experts, and fold corrections back into the evaluation sets on a defined cadence.
This is a factory, not a lab. The common ingredients are Unity Catalog for governance, Delta for lineage, and MLflow for traces and comparisons. The novelty is that evaluation and optimization are first-class citizens rather than afterthoughts.
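The synthetic neighbor step mentioned above is easy to picture. This is not the Mosaic AI generator, just a hand-rolled sketch of structure-preserving perturbations that stress edge cases; the field names and perturbations are hypothetical.

```python
import random

def synthetic_neighbors(doc: dict, n: int = 3, seed: int = 7) -> list[dict]:
    """Create variants of a golden document that keep its structure but stress edge cases."""
    rng = random.Random(seed)
    perturbations = [
        lambda d: {**d, "effective_date": "1st of March, 2024"},               # odd date format
        lambda d: {**d, "header": None},                                        # missing header
        lambda d: {**d, "renewal_terms": d["renewal_terms"].upper()},           # shouting clause text
        lambda d: {**d, "counterparty": d["counterparty"] + " S.A. de C.V."},   # locale-specific suffix
    ]
    return [rng.choice(perturbations)(doc) for _ in range(n)]

golden = {
    "header": "MASTER SERVICES AGREEMENT",
    "counterparty": "Acme Corp",
    "effective_date": "2024-03-01",
    "renewal_terms": "Auto-renews annually unless cancelled 60 days prior.",
}
for variant in synthetic_neighbors(golden):
    print(variant)
```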
Governance and evaluation pitfalls to avoid
- Synthetic trap. Synthetic data should expand coverage, not replace reality. Always hold out a real-world golden set from your catalog that synthetic examples never touch. If your scores spike only on the synthetic set, your agent is overfitting to the generator's quirks.
- Leaky tests. If you build benchmarks from production data, freeze them. Retraining on data that also sits in your evaluation set will inflate scores. Use catalog snapshots for both training and testing, and pin versions in your agent metadata.
- Uncalibrated LLM judges. A single model grader can drift or latch onto superficial cues. Use multi-grader ensembles, inject adversarial examples, and combine LLM judging with rule checks where possible. Periodically compare grader decisions to human labels to catch drift.
- Narrow metrics. If you only track accuracy, you will be surprised by costs. Track cost per correct output, latency percentiles, and grounding rates side by side. Make promotion gates depend on all three.
- Governance gaps. Agents often combine retrieval, tools, and model calls. Your approval process should apply to the whole graph. Log every external call, capture the retrieved context, and store traces with access controls. If an answer appears in court or in an audit, you need the chain of evidence.
- Silent failures. Define unacceptable behavior up front. For example, any response that includes personal data without an explicit retrieval citation is a hard fail. Turn those rules into automatic red flags that block promotion, as the promotion gate sketch after this list shows.
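Putting the last two pitfalls together, a promotion gate can be a plain function that checks every metric and every hard-fail rule before a variant is allowed to ship. The thresholds below are illustrative, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    accuracy: float
    cost_per_correct: float   # dollars
    p95_latency_s: float
    grounding_rate: float
    hard_fail_count: int      # e.g. personal data returned without a citation

# Illustrative gates; set your own with finance and compliance in the room.
GATES = {
    "accuracy": lambda r: r.accuracy >= 0.85,
    "cost_per_correct": lambda r: r.cost_per_correct <= 0.01,
    "p95_latency_s": lambda r: r.p95_latency_s <= 2.0,
    "grounding_rate": lambda r: r.grounding_rate >= 0.98,
    "hard_fails": lambda r: r.hard_fail_count == 0,   # any red flag blocks promotion
}

def promotion_decision(report: EvalReport) -> tuple[bool, list[str]]:
    failures = [name for name, check in GATES.items() if not check(report)]
    return (len(failures) == 0, failures)

ok, failures = promotion_decision(
    EvalReport(accuracy=0.91, cost_per_correct=0.004, p95_latency_s=1.6,
               grounding_rate=0.995, hard_fail_count=0)
)
print("promote" if ok else f"blocked on: {failures}")
```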
Concrete examples: three agent types, three blueprints
- Document extraction for regulated content
  - Data. Vendor contracts and amendments stored as PDFs in a cataloged volume. Gold table with 2,000 hand-labeled clauses.
  - Objective. Extract counterparty, effective date, and renewal terms. Errors that expose revenue risk are worse than missing a field.
  - Evaluation. Field-level precision and recall plus a risk-weighted penalty for certain mistakes. A response without a provenance snippet is an automatic fail. A scoring sketch follows this blueprint.
  - Optimization. Sweep across two base models, retrieval depths of 5, 10, and 20, and three prompt patterns. Include a post-processor that normalizes dates.
  - Outcome. A middle-cost variant wins by reducing revenue risk errors by 60 percent while increasing cost by 12 percent relative to baseline. This is a good trade.
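Here is a minimal sketch of what that extraction scorer might look like: field-level comparison after normalization, heavier penalties for revenue-risk fields, and a hard fail when provenance is missing. The field names, weights, and normalization rules are all hypothetical.

```python
from datetime import datetime

# Revenue-risk fields count more when wrong; illustrative weights.
FIELD_WEIGHTS = {"counterparty": 1.0, "effective_date": 2.0, "renewal_terms": 3.0}

def normalize(field: str, value: str) -> str:
    if field == "effective_date":
        for fmt in ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y"):
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
    return value.strip().lower()

def score_extraction(pred: dict, gold: dict) -> dict:
    if not pred.get("provenance"):            # no snippet pointing back to the source document
        return {"score": 0.0, "hard_fail": True}
    penalty, max_penalty = 0.0, sum(FIELD_WEIGHTS.values())
    for field, weight in FIELD_WEIGHTS.items():
        if normalize(field, pred.get(field, "")) != normalize(field, gold[field]):
            penalty += weight
    return {"score": 1.0 - penalty / max_penalty, "hard_fail": False}
```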
- Knowledge assistant for internal policies
  - Data. Policies and handbooks in a Delta table with versioned snapshots. Embedding index registered in the catalog.
  - Objective. Answer employee questions with grounded citations. Penalize confident but ungrounded claims.
  - Evaluation. Groundedness, citation coverage, and satisfaction ratings from weekly human review of a random sample. A citation coverage sketch follows this blueprint.
  - Optimization. Compare retrieval strategies and chunk sizes. Add a step that refuses to answer out-of-scope questions and routes them to HR.
  - Outcome. The best variant lowers ungrounded answers below 1 percent and cuts time to first answer from minutes to seconds. A small bump in token use is accepted because the analyst time saved is substantial.
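Citation coverage can be approximated with a rule: split the answer into sentences and count how many carry at least one citation that resolves to a retrieved chunk. This rough sketch assumes a hypothetical `[doc:<id>]` citation convention, not how Agent Bricks formats citations.

```python
import re

def citation_coverage(answer: str, retrieved_ids: set[str]) -> float:
    """Fraction of sentences whose citations all resolve to retrieved chunks."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    covered = 0
    for sentence in sentences:
        cited = set(re.findall(r"\[doc:([\w-]+)\]", sentence))
        if cited and cited.issubset(retrieved_ids):
            covered += 1
    return covered / len(sentences)

answer = ("Parental leave is 16 weeks [doc:hr-policy-12]. "
          "It can be split across the first year [doc:hr-policy-12]. "
          "Managers approve the schedule.")
print(citation_coverage(answer, retrieved_ids={"hr-policy-12"}))  # 2 of 3 sentences carry a resolvable citation
```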
- Triage for customer support
  - Data. Recent tickets with human labels, plus a stream of call transcripts landing in a Delta Live Tables pipeline.
  - Objective. Route to four queues with short rationales. Latency and consistency matter more than long narratives. A sketch of the structured output contract follows this blueprint.
  - Evaluation. Routing accuracy, rationale completeness, and 95th percentile latency.
  - Optimization. Prefer smaller models with tool calls to pattern match recurring issues. Enforce a hard latency budget.
  - Outcome. A compact model with a simple function calling pattern meets the latency budget and improves accuracy with little cost increase.
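A structured output contract is what makes the handoff scannable. Here is a minimal sketch of one way to pin it down with a dataclass and validation; the queue names and field limits are hypothetical.

```python
from dataclasses import dataclass

ALLOWED_QUEUES = {"billing", "technical", "account", "escalation"}

@dataclass
class TriageResult:
    queue: str          # must be one of the four allowed routes
    rationale: str      # short, scannable explanation for the human who picks it up
    confidence: float   # 0.0 to 1.0, surfaced to downstream reviewers

    def validate(self) -> None:
        if self.queue not in ALLOWED_QUEUES:
            raise ValueError(f"unknown queue: {self.queue}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not (10 <= len(self.rationale) <= 280):
            raise ValueError("rationale should be short but not empty")

result = TriageResult(queue="billing",
                      rationale="Customer disputes a duplicate charge on the March invoice.",
                      confidence=0.83)
result.validate()
```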
A pragmatic 30, 60, 90 day rollout plan
30 days
- Pick two use cases that directly move a business metric. One document extraction and one knowledge assistant is a common pairing.
- Create small, stable golden sets. For extraction, 500 documents with field labels. For chat, 200 representative questions with reference answers. Keep them in your catalog with frozen versions, as sketched after this list.
- Define success. Choose a small set of metrics that matter: for example groundedness rate, field-level F1, cost per correct answer, and a latency target.
- Wire telemetry. Turn on tracing for prompts, retrieved context, costs, and outcomes. Store traces with access controls.
- Run the first Agent Bricks pipeline. Sweep a small search space and publish a dashboard that shows the cost versus quality frontier. Share it with stakeholders so everyone understands the tradeoffs.
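Freezing a golden set is mostly a matter of reading it by an explicit version and recording that version wherever the agent is registered. A minimal sketch in a Databricks notebook might look like this; the table name and version number are hypothetical, and Delta time travel syntax is the only real dependency.

```python
# Pin the evaluation data to an explicit Delta table version so every sweep,
# and every audit, sees exactly the same rows.
GOLDEN_TABLE = "main.agents.golden_contracts"   # hypothetical Unity Catalog table
GOLDEN_VERSION = 3                              # frozen when the set was signed off

golden_df = spark.sql(
    f"SELECT * FROM {GOLDEN_TABLE} VERSION AS OF {GOLDEN_VERSION}"
)

# Record the pin alongside the agent so the benchmark stays reproducible.
eval_manifest = {"table": GOLDEN_TABLE, "version": GOLDEN_VERSION, "rows": golden_df.count()}
print(eval_manifest)
```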
60 days
- Put one agent into limited production. Define a budget and a rollback path. Wrap the agent with a usage policy and content filters. Set daily promotion gates based on your metrics.
- Close the loop with humans. Schedule weekly expert reviews of random samples. Route corrections back into evaluation sets. Track how often human corrections change the model's judgment.
- Harden governance. Pin dataset versions, lock indexes, and record all external calls. Document who can promote agents and under what conditions.
- Negotiate cost controls. Work with finance to pre approve spend ranges and set alerts. Decide what to do when you hit a budget wall mid month.
90 days
- Expand to a second department. Reuse the same foundry pattern. Keep the evaluation and approval process identical to speed audits.
- Automate regression checks. When data or dependencies change, rerun evaluations and block promotion on metric regressions beyond a set threshold, as sketched after this plan.
- Build playbooks for incidents. Define how to pause an agent, triage a failure, and communicate with affected users. Keep a one page runbook next to every deployment.
- Prove ROI. Tie agent outputs to measurable outcomes. For extraction, fewer revenue risk errors and faster close. For support, lower handle time and higher satisfaction. Use the same dashboards that report quality and cost to report impact.
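A regression check can be as simple as comparing the new run's metrics against the last promoted baseline with a per-metric tolerance. The metric names and tolerances below are placeholders.

```python
# Allowed drop per metric before promotion is blocked (placeholders).
TOLERANCES = {"field_f1": 0.01, "grounding_rate": 0.005, "routing_accuracy": 0.02}

def regressions(baseline: dict, candidate: dict) -> dict:
    """Return metrics where the candidate fell below baseline by more than the tolerance."""
    return {
        metric: (baseline[metric], candidate[metric])
        for metric, tolerance in TOLERANCES.items()
        if candidate[metric] < baseline[metric] - tolerance
    }

baseline = {"field_f1": 0.89, "grounding_rate": 0.99, "routing_accuracy": 0.93}
candidate = {"field_f1": 0.90, "grounding_rate": 0.97, "routing_accuracy": 0.93}
failed = regressions(baseline, candidate)
if failed:
    print(f"block promotion, regressed metrics: {failed}")   # grounding_rate regressed here
```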
Where Agent Bricks fits in the ecosystem
Enterprise teams now have a menu of agent platforms that reflect different philosophies. If you want to compare the lakehouse-native approach to other stacks, see our coverage of warehouse-centric and cloud-platform offerings:
- A warehouse-led view is visible in our review of Snowflake Cortex Agents go GA, which explores how warehouses double as runtimes for production agents.
- A cloud-platform view appears in Vertex AI Agent Engine, where code execution and agent-to-agent orchestration are front and center.
- Team-level control is evolving quickly too. For a look at multi-vendor oversight, check our GitHub Agent HQ overview.
Across these approaches, three ideas are converging: ground on governed data, evaluate with task-aware metrics, and package with lineage. Agent Bricks puts those ideas in one opinionated, lakehouse-native workflow.
Practical tips when you start
- Write the task like a ticket. If a new hire could read your task spec and build the same test set, you wrote it clearly enough.
- Start with a tiny search. One prompt family, two retrieval depths, two models, one post processor. Publish the cost versus quality chart on day one to anchor the conversation.
- Label the ugly cases first. Nothing moves quality faster than 50 examples that capture what used to break your process.
- Treat graders as models. Calibrate them, check them, and version them. Your graders are part of the product; a calibration sketch follows this list.
- Use audit friendly defaults. Log retrieved context, tool inputs and outputs, and final responses. Set retention policies that match your data classification.
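One simple calibration check is to compare the grader's pass/fail decisions against a small set of human labels and track agreement over time; if agreement drops, the grader has drifted. A minimal sketch, with hypothetical labels:

```python
def grader_agreement(grader_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of sampled items where the LLM grader agrees with the human label."""
    assert len(grader_labels) == len(human_labels)
    matches = sum(g == h for g, h in zip(grader_labels, human_labels))
    return matches / len(human_labels)

# Weekly sample: human experts re-grade a handful of items the LLM judge already scored.
grader = [True, True, False, True, True, False, True, True, True, False]
human  = [True, True, False, True, False, False, True, True, True, True]
agreement = grader_agreement(grader, human)
print(f"grader-human agreement: {agreement:.0%}")  # alert or recalibrate below a chosen threshold, say 90%
```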
What this means for teams today
- You can ship faster with fewer surprises. The pipeline forces you to define good, measure it, and pick a concrete tradeoff. That slashes debate time and makes handoffs smoother.
- You can be honest about costs. Cost per correct answer is a number everyone understands. Put it on a chart next to groundedness and latency. Decide together what you are willing to pay for quality.
- You can scale with confidence. Once a foundry pattern is in place, adding a new agent looks like filling out a template rather than inventing a new process. Governance and evaluation ride along by default.
The bottom line
Agent Bricks is not magic. It is a disciplined way to turn messy prompt craft into measurable engineering. The automation matters, but the real breakthrough is cultural. It encourages teams to talk about accuracy, cost, and risk in the same breath, to promote on evidence rather than optimism, and to treat agents as products with owners and service levels. Run this playbook for 90 days and the question shifts from whether agents work to which ones are worth expanding. That is how trustworthy agents finally move from slides to systems.