When to Build Agents vs. Prompt Engineering: DeepMind's Decision Framework
Select talks from MLOps World & TMLS Summit | December 2025
👋 Hello there, we’re Dave, Josh, and Anu. Welcome to AI in Production Field Notes. Each week, we provide a detailed overview of real-world lessons from teams deploying enterprise-grade AI systems.
Every issue distills a real-world case study talk, unpacking the frameworks and implementation patterns used in production. No fluff.
Building LLM agents can be tempting. However, many teams tend to overengineer. They might jump straight into complex agent frameworks when prompt engineering would be enough, or they might hide behind abstractions that complicate debugging.
Sushant Mehta, a Senior Research Engineer at Google DeepMind, spent years developing post-training systems to enhance Gemini’s coding and tool-use abilities. At MLOps World 2025, he shared DeepMind’s approach for determining when agents genuinely solve your problem and when they don’t.
For our first highlighted session, we will focus on this fundamental question. We hope you find it valuable!
📊 By The Numbers
3 stages: Standard post-training pipeline: SFT → reward modeling → RL optimization
90%+ success rate: Minimum threshold for production agents (50-80% isn’t enough)
Dynamic steps: Agents don’t know upfront how many steps they’ll need (2? 7? 15?)
Binary rewards work better: Final outcome (pass/fail) scales better than scoring every intermediate step
2-5 pathways: Possible paths agents might take on the same task
Hours not weeks: Cursor ships a new version of its agent every few hours after collecting live production data
10x higher latency: Multi-step agentic loops vs. single-shot inference
Unknown cost multiplier: Each agent step = unpredictable total expense
Source: Sushant Mehta, Senior Research Engineer, Google DeepMind | MLOps World 2025, Austin TX
🔥 The Problem
This is something we’ve seen a lot across teams. You’re tired of writing prompts. Your current model can access tools, make decisions, and iterate. So you decide: “Let’s build an agent.”
Three weeks later, you’ve got a framework that feels smart but produces wildly inconsistent results. Sometimes it takes 2 steps. Sometimes 7. Your costs are unpredictable. Your latency varies by 10x. And you still don’t know if it’s working better than the simpler prompt you started with.
Here’s what you probably missed: agents and workflows are completely different things.
Workflows are deterministic. Fixed sequence. Predefined steps. You know step 1 → step 2 → step 3 no matter what input you get. Think form filling: always need name, email, phone in that order.
Agents have agency. They dynamically decide which tools to use and how many steps to take. The number of steps isn’t predetermined. Each outcome shapes what comes next. The model chooses the path.
“Agents is an overloaded term. We need to spend time defining what an agent actually is, how it differentiates from simpler concepts like workflows.”
Remember: If you can solve your problem with a workflow, building an agent is just expensive overengineering.
🎯 The DeepMind Decision Framework
The core insight: you don’t start with agents. You begin with prompts. Then you escalate only if necessary.
Mehta’s framework has three escalation layers. Each layer adds complexity and cost, and each only makes sense for specific problem types. Most teams skip straight to layer three, then wonder why their costs exploded.
Step 1: Try Simple Prompt Engineering
Write a detailed prompt with few-shot examples. Include all necessary context. Test on real examples.
If this works: Stop. You’re done. Ship it.
What it does: Give the model context and examples. It completes the task in one pass. No tools. No loops. No routing.
“A lot of simple use cases with today’s models—prompt engineering alone is actually enough.”
Example: Customer support triage. Prompt: “Given this message, classify it as billing, technical, or account. Here are three examples.” Done. No agent needed.
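As a minimal sketch of this first layer, here is what a single few-shot classification prompt might look like. The categories and example messages are illustrative, not from the talk, and the helper name is hypothetical:

```python
# Step 1 sketch: one prompt, a few examples, one pass.
# No tools, no loops, no routing.

FEW_SHOT = [
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I upload a file.", "technical"),
    ("How do I change my email address?", "account"),
]

def build_triage_prompt(message: str) -> str:
    """Build a single-shot classification prompt with few-shot examples."""
    lines = ["Classify the customer message as billing, technical, or account.", ""]
    for text, label in FEW_SHOT:
        lines.append(f"Message: {text}\nCategory: {label}\n")
    lines.append(f"Message: {message}\nCategory:")
    return "\n".join(lines)

prompt = build_triage_prompt("Why did my invoice go up?")
```

Send `prompt` to whichever model API you use; if the classifications hold up on real traffic, you are done at this layer.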
Why this matters: Models are getting better and better at understanding intent. A detailed prompt with a few examples often suffices. Try this first. Always.
Step 2: Add Sequential Prompts with Validation
If prompt engineering isn’t enough, break the task into explicit steps. Add validation gates between steps. The flow is still deterministic but adaptive.
If this works: Stop. Don’t overcomplicate.
What it does: Define steps upfront, then validate after each step. Complete step 1 → validate → if pass, proceed to step 2 → validate → continue.
This gives you adaptive behavior without full agency. The flow is predictable, but you catch failures early before they compound.
Example: Document creation workflow. Step 1: Generate outline. Validation: “Are all required sections present?” Step 2: Fill each section. Validation: “Is the content accurate?” Step 3: Final review.
Why this works: You know roughly what will happen. You’re not asking the model to decide whether to generate an outline—you need it to do it well. Validation prevents cascading errors.
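A rough sketch of this layer, with stand-in functions where real model calls would go (the function names and the validation rules are assumptions for illustration):

```python
# Step 2 sketch: deterministic sequence with validation gates between steps.
# generate_outline / fill_sections stand in for real model calls.

def generate_outline(topic):
    return ["Introduction", "Body", "Conclusion"]   # stand-in for a model call

def fill_sections(outline):
    return {s: f"Content for {s}" for s in outline}  # stand-in for a model call

REQUIRED = {"Introduction", "Conclusion"}

def run_document_workflow(topic):
    outline = generate_outline(topic)                # step 1: generate outline
    if not REQUIRED.issubset(outline):               # gate: required sections?
        raise ValueError("Outline is missing required sections")
    doc = fill_sections(outline)                     # step 2: fill sections
    if not all(doc.values()):                        # gate: any empty section?
        raise ValueError("A section came back empty")
    return doc                                       # step 3: final review
```

The flow is fixed, but each gate catches a failure before it can compound into the next step.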
Step 3: Build Agentic Loops (Only If Necessary)
If the task is genuinely open-ended and unpredictable, now you need true agents. The model decides which tools to use and how many steps to take.
What it does: The model decides which tools to call, in what order, and when to stop. The number of steps is not predetermined. True agency.
Common patterns:
Generator-evaluator loops: Agent writes code → sandbox runs it → evaluator reads error → suggests fix → generator tries again → loop until tests pass or max iterations
Multi-turn tool orchestration: Search → synthesize → verify → iterate based on intermediate results
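The generator-evaluator pattern above can be sketched as a loop. `generate` and `run_tests` are assumed callables standing in for a model call and a sandboxed test run:

```python
# Generator–evaluator loop sketch: propose code, run it, feed the error back,
# retry until the tests pass or a max-iteration guardrail trips.

def generator_evaluator_loop(task, generate, run_tests, max_iters=5):
    feedback = None
    for attempt in range(1, max_iters + 1):
        code = generate(task, feedback)      # generator proposes (or fixes) code
        passed, error = run_tests(code)      # sandbox executes the tests
        if passed:
            return code, attempt             # verifiable success: stop here
        feedback = error                     # evaluator output guides the retry
    raise RuntimeError(f"No passing solution after {max_iters} attempts")
```

The number of iterations is not predetermined, which is exactly what makes this an agent rather than a workflow.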
Why this requires post-training: Basic models can’t plan or reliably recover from errors. You need RL with verifiable rewards. The model must learn: if code passes tests, reward 1. Otherwise, reward 0. It figures out which actions maximize success.
Requirements before building:
The model is capable enough to plan (post-trained with RL)
Clear success criteria defined
Feedback loop established to collect production data
Target: 90%+ success rate before production deployment
Step 4: Consider Model Routing (the hidden layer)
If you have multiple distinct use cases, build a router. Classify requests, route to specialized models. Cheaper, faster, and more accurate than one huge model.
What it does: A router model classifies each request and sends it to the right specialist. Specialists can be smaller, cheaper, and domain-optimized.
Why this exists: Frontier providers use this behind the scenes. Simple query? Route to a cheap model. Complex query? Use flagship. Same logic internally: one router, five specialists per domain.
Critical (know the risk): Continuously monitor router quality. If routing is wrong 30% of the time, everything downstream fails. Build dashboards. Track routing decisions. Validate against human judgment. Update continuously.
“There’s no easy way around this (Constant Monitoring). You do not know what the distribution of real world queries could look like.”
Example: The customer support router sends billing questions to a specialized billing model (cheap, trained on 10K tickets) and technical issues to a technical model. But audit constantly: is the router actually correct on production traffic?
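A minimal sketch of the routing layer, with an audit log so routing decisions can be validated against human judgment. `classify` stands in for a small router model, and the specialist handlers are placeholders:

```python
# Step 4 sketch: router dispatch with an audit trail.
# Every decision is logged so routing quality can be monitored continuously.

SPECIALISTS = {
    "billing":   lambda q: f"[billing model] {q}",
    "technical": lambda q: f"[technical model] {q}",
}

routing_log = []  # audit trail: sample these and check against human labels

def route(query, classify):
    label = classify(query)              # router model picks a specialist
    routing_log.append((query, label))   # log the decision for later audit
    handler = SPECIALISTS.get(label)
    if handler is None:                  # unknown label: escalate, don't guess
        return f"[fallback flagship model] {query}"
    return handler(query)
```

The fallback path matters: when the router emits a label no specialist owns, escalating to the flagship model is safer than silently misrouting.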
✅ When Agents Actually Work
If these four conditions aren’t met, don’t build an agent.
Agents are worthwhile only when specific conditions hold:
1. Your problem is open-ended and unpredictable
No predetermined sequence. Multiple valid paths to completion. Different inputs require genuinely different approaches.
Example: Customer support. Issue type determines the resolution path: billing dispute ≠ technical troubleshooting ≠ account security.
Counter-example: Form filling. Always need a name, email, and phone in that order. Use a workflow.
2. Your model is capable enough to plan
If the model makes poor decisions at step 3, those errors compound through steps 4, 5, and 6. You end up with 10-step tasks that fail 90% of the time.
“Your model should actually be capable enough to do planning reasonably well. Otherwise you’ll end up with agents that have 50%, 70%, even 80% success rates. You typically need 90%+ for production.”
3. You have clear success criteria and feedback loops
If you can’t measure whether the agent succeeded, you can’t improve it. You’re stuck.
Good examples: Code passes tests. Customer issue resolved. Document filed correctly. Clear pass/fail.
This enables the RL feedback loop. Collect production failures → retrain with verifiable rewards → improve to 90%+ success → deploy confidently.
4. The business case justifies the complexity
Agents cost more. Multiple steps = multiple API calls. Latency is unpredictable. Engineering overhead is real (monitoring, fallback logic, human-in-loop gates). Only build if the use case is worth it.
❌ When Agents Are a Mistake
If these conditions are true, you don’t need an agent; you need a simpler system. Building an agent will add cost, latency, and debugging pain with no real upside.
1. Your task is deterministic
Fixed sequence of steps. Predictable paths. Use a workflow. Specify the five steps in your prompt. Done.
2. You haven’t tried prompt engineering
“Try it first. Seriously. Spend one sprint on prompt engineering before moving to agents. Models are getting absurdly good. A detailed prompt often works.”
3. You’re using frameworks to hide complexity
“The ease of abstraction can actually hide what underlying prompts they’re using and make it very hard for you to debug.”
Start with the frontier model APIs directly. Know exactly what prompt the model sees. Frameworks make it easy to get started, but you lose transparency in the process. When debugging breaks, you’re blind.
The test: If you can’t see the exact prompt your agent receives, you don’t have enough control. Fix that first.
4. You don’t have clear success metrics
If you can’t measure success, you can’t improve. You can’t debug. You don’t know when it’s failing. Don’t build an agent without this.
5. Your model isn’t post-trained for the task
Raw models and basic fine-tuning aren’t enough for complex agentic behavior. You need RL with explicit reward signals on the specific task.
🧠 Post-Training and Why it Actually Matters for Agents
See deeper coverage in the video below:
Agents don’t succeed because models are innately clever. Agents succeed because post-training teaches models how to behave intelligently. Raw pre-trained models excel at one thing: predicting the next token. That’s it. They complete sentences. They don’t follow instructions, understand human preferences, or recover from errors.
Post-training alignment adjusts these models to match the tasks we actually want. It teaches them to follow instructions, understand preferences, avoid harmful behaviours, and enhance their reasoning, coding, and tool use.
The standard pipeline:
1. Supervised Fine-Tuning (SFT): Train on high-quality human instructions to teach instruction-following and basic task completion.
2. Reward Modelling: Gather preference data (“completion A is better than completion B”) and train a reward model to score outputs.
3. Reinforcement Learning: Apply algorithms like PPO or DPO to optimize the model based on the reward signal.
The key insight for agents: RL with verifiable rewards. Instead of scoring every intermediate step (expensive, subjective), score only the final outcome. Binary: did it work or not? Reward 1 or 0. The model learns which actions maximize success probability.
This scales well. For complex 10-step tasks, you don’t need to evaluate steps 3, 4, or 5. You only care: did the agent reach the correct final state? If yes, reward 1. If no, reward 0. The RL algorithm determines the rest.
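The whole idea fits in a few lines. This is a sketch of the reward signal only, not a full RL training loop; `run_tests` stands in for a sandboxed test runner:

```python
# Verifiable binary reward sketch: score only the final state of an episode.
# Intermediate steps are never graded individually.

def binary_reward(final_code, run_tests) -> int:
    """Reward 1 if the episode's final output passes the tests, else 0."""
    return 1 if run_tests(final_code) else 0

# A 10-step trajectory contributes a single scalar to the RL update:
reward = binary_reward("final program", run_tests=lambda code: True)
```

However many steps the agent took, the RL algorithm receives one number per episode, which is what makes this cheap to scale.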
Mehta emphasized two design principles that determine whether agents actually work in practice:
Use binary rewards whenever your task has a clear pass/fail outcome—they scale better and make agents far more reliable.
Design every agent workflow from the model’s point of view, not your own. Most failures come from context and interface mismatches.
When to use verifiable binary rewards:
This is essential for RL at scale. Instead of assessing every intermediate step (which is costly, subjective, and difficult to standardize), you evaluate only the final result. The model learns which sequence of actions increases the likelihood of success.
Translation:
If your task has a clear “done” condition—such as code passing tests, a form being correctly submitted, or a customer issue being resolved—use a binary reward signal instead of a graded score.
It’s simpler, more scalable, and easier to improve.
Why building from the model’s perspective prevents most failures
Mehta’s second principle: the model’s interface determines its performance. Not your intentions.
“Make sure you have a really good agent–computer interface. Think from the agent’s perspective: what context does it have access to? What tools can it use? What do those APIs look like?”
Translation:
Stop building from your perspective. Build from the model’s.
The model only sees:
the prompt
the tool schemas
the conversation history
That’s it.
So ask:
What does the model actually see?
Is my context sufficient?
Are the tool descriptions unambiguous?
If this is a computer-use task, can the agent actually parse what’s on screen?
If your task requires 50 lines of context but the model only gets 10, it will fail.
If icons in screenshots aren’t labelled, it can’t act.
If the tool definitions are ambiguous, it will call them wrong.
This mental shift—designing from the agent’s actual perceptual world—is what prevents most agentic failures.
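To make the "model's perceptual world" concrete, here is a sketch of a tool schema and the full set of inputs the model conditions on. The schema follows a common JSON-tools convention; the tool name, fields, and description are illustrative assumptions:

```python
# Agent–computer interface sketch: the model only sees the prompt, the tool
# schemas, and the history. If the description is ambiguous, calls will be too.

SEARCH_TOOL = {
    "name": "search_tickets",
    "description": ("Search support tickets. Returns up to `limit` matches, "
                    "newest first. Call this before answering billing questions."),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keywords to match"},
            "limit": {"type": "integer", "description": "Max results (1-20)"},
        },
        "required": ["query"],
    },
}

def model_view(prompt, tools, history):
    """Everything the model can condition on -- nothing else exists for it."""
    return {"prompt": prompt, "tools": tools, "history": history}
```

When debugging an agent, printing exactly this view is often more useful than reading your own orchestration code.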
⚙️ Best Practices for Production
These practices reflect what teams rely on in real production environments to keep systems reliable, scalable, and maintainable.
DO: Prioritize Simplicity
Start with the most straightforward approach that could work. Models are getting better at understanding intent. A detailed prompt often suffices.
Don’t overengineer. Complexity kills debuggability.
DO: Maintain Transparency
See the exact prompt the model receives. Track reasoning traces. Log tool calls. When something breaks, you need to know why.
If you can’t debug it, you can’t fix it.
DO: Close the Feedback Loop
Collect production failures. Identify where agents fail. Retrain with RL on those specific cases. Ship improvements quickly.
Fast iteration cycles (hours to days, not weeks) compound into massive improvements.
DO: Build Comprehensive Evals
Measure performance across all dimensions. If you’re improving on one metric, make sure you’re not regressing on others.
Test in sandbox environments before production. Especially for autonomous agents that might use tools in unexpected ways.
DO: Add Guardrails
Define stopping conditions. Max iterations. Time limits. Cost caps.
Human-in-loop for high-stakes decisions. Approval gates before irreversible actions.
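A minimal sketch of these guardrails as a budget object checked once per agent step; the thresholds are illustrative defaults, not recommendations from the talk:

```python
# Guardrails sketch: max iterations, wall-clock limit, and cost cap.
# Call check() once per agent step; it raises when any budget is exhausted.
import time

class Guardrails:
    def __init__(self, max_steps=15, max_seconds=120.0, max_cost_usd=2.00):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.max_cost = max_cost_usd
        self.steps = 0
        self.cost = 0.0
        self.start = time.monotonic()

    def check(self, step_cost_usd):
        """Record one agent step and enforce every budget."""
        self.steps += 1
        self.cost += step_cost_usd
        if self.steps > self.max_steps:
            raise RuntimeError("max iterations exceeded")
        if time.monotonic() - self.start > self.max_seconds:
            raise RuntimeError("time limit exceeded")
        if self.cost > self.max_cost:
            raise RuntimeError("cost cap exceeded")
```

High-stakes actions would additionally pause here for human approval before anything irreversible runs.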
DON’T: Hide Behind Abstractions
Frameworks make it easy to start, but hard to debug. If you don’t know what’s happening under the hood, you lose control.
Start with direct API calls. Add abstractions only after you understand the system deeply.
DON’T: Deploy Without Success Metrics
If you can’t measure whether the agent succeeded, you can’t improve it. You’re flying blind.
Define clear pass/fail criteria before building.
🛠️ Implementation Patterns
Stack components from Mehta’s discussion:
Foundation models: Use frontier models’ APIs directly to get started. Know exactly what prompt the model receives. Avoid frameworks that obscure the underlying prompt until you are confident in your architecture.
Reward models: Train on preference data (human-annotated pairs: “completion A is better than B”). Used in RL to evaluate agent outputs.
RL algorithms: DPO (Direct Preference Optimization) and PPO (Proximal Policy Optimization) are both referenced. They are standard in modern post-training workflows.
Sandbox environments: Run generated code in isolated containers for coding agents. This allows verifiable rewards (did the code run? did the tests pass?).
Validation checks: Performed between steps in sequential workflows. Can be explicit rules, model-based scoring, or ground-truth verification.
📚 Key Resources
Post-training references:
Tulu family of models (Allen Institute for AI)
LLaMA 3 post-training report (Meta) — highly recommended
Standard RL algorithms: PPO, DPO
Core framework: Escalation model → prompt engineering → sequential prompts → agentic loops → routing
📹 Watch the full talk
Sushant Mehta has extensive real-world experience. He covers production agent design, post-training fundamentals, real-world patterns (customer support, coding agents), and the complete decision framework for scaling agents safely.
👉 MLOps World | GenAI Summit 2025 – “Building Effective Agents.”
Level: Intermediate | Best for: ML engineers, product teams deciding on agentic architecture
✉️ Next Week
How Vimeo built near-real-time multilingual captions and dubbing—combining Gemini Flash 2.0, smart chunking, and focused evals to keep subtitles and dubs tightly in sync.
📤 Worth Sharing?
Know a team building agents? Know someone debugging why their 10-step agentic loop achieves 65% success rates?
Forward this → they’ll thank you Monday.
Anupama Garani, Josh Goldstein and David Scharbach
AI in Production Field Notes | TMLS