Use Ctrl+P (or Cmd+P) to save as PDF. Back to paper
The demo works. The agent calls a tool, gets a result, calls another, returns an answer. It feels like magic.
Then you put it in production. Within days: infinite loops burning $47,000 per week. Agents returning the same response 58 times consecutively. Confident wrong answers with green dashboards. Silent failures nobody catches until a user reports them.
The while loop is not the engineering problem. The while loop is three lines of code. The engineering problem is everything around it: termination conditions, context budget management, error classification, tool safety rails, and observability infrastructure. Engineers who treat these as afterthoughts build systems that cannot be trusted. Engineers who treat them as the actual product ship agents that handle hundreds of thousands of requests a day.
This paper documents what production-grade agentic engineering actually requires.
---
An LLM agent is a loop:
``
while not done:
response = llm(context)
if response.has_tool_call:
result = execute_tool(response.tool_call)
context.append(result)
else:
return response
``Every major agent runtime — LangGraph, the Vercel AI SDK, AutoGen, CrewAI — converges on this structure. The model does not execute tools. It requests them. The loop is the agent.
This distinction matters. When an agent fails, the failure is almost never inside the model weights. It is in the loop: a missing termination condition, an unhandled tool error, a context that grew past the budget, a retry strategy that amplifies rather than recovers.
The model is a component. The loop is the system.
---
The most common and most expensive failure. An agent hits a tool error, the model reasons "I should retry," and without a max_steps constraint the system retries indefinitely. Real documented cases include agents sending the same API request thousands of times, consuming unbounded tokens and budget.
Fix: Hard max_steps limit. Step counting inside the loop, not as a hope. Exponential backoff with jitter on retries. Explicit termination states, not just "the model decides when it's done."
Every tool result appended to context costs tokens. Long-running agents accumulate context until they exceed the model's window — or, worse, fill it with irrelevant intermediate steps that crowd out the original instruction.
Researchers studying long-context LLMs have documented that attention quality degrades as context length grows. The model does not attend equally to all tokens. Instructions at the beginning of a long context are regularly ignored.
Fix: Context budget as a first-class resource. Summarize intermediate steps. Prune tool results that are no longer relevant. Track remaining context budget as a metric.
Not all tool errors are the same. A network timeout is transient — retry it. An authorization error is permanent — retrying wastes tokens and time. A malformed input error means the model's output was wrong — feed it back as a correction signal, not just an error string.
Most agent implementations treat all errors identically: append the error to context, let the model decide what to do. The model has no reliable way to distinguish error categories from a raw error string.
Fix: Classify errors at the tool wrapper layer before they enter the context. Expose error_type (transient, permanent, input_error, rate_limit) alongside the error message. Write routing logic based on type, not on the model's interpretation of a stack trace.
Agents are stochastic. The same task, run twice, may produce different tool call sequences, different intermediate results, and different final outputs. This is not a bug — it is a property of the model. But it means you cannot test an agent the way you test a deterministic function.
A 2025 survey of production deployments found that 75% of teams forgo formal benchmarking entirely, relying on human review or A/B testing. Reliability remains the top development challenge, ahead of evaluation and security.
Fix: Shadow-mode validation before production promotion. Run the new agent version against a sample of production traffic with no live effects. Compare outputs. Promote only when behavior is stable.
Tool results are model inputs. If a tool returns content from an external source — a web page, a document, an API response — that content can contain adversarial instructions that redirect the agent's behavior.
This is not theoretical. Prompt injection via tool results is a documented attack vector. An agent with write access to email, code, or databases that can be injected through a read operation is a high-severity vulnerability.
Fix: Treat tool results as untrusted data. Never pass raw external content directly into the context without sanitization. Separate the instruction-carrying context from the data-carrying context where possible.
---
A 2024 taxonomy of AgentOps infrastructure identified observability as the foundational requirement — more important than the orchestration framework, more important than the model choice. Without observability, you cannot measure reliability, detect loops, diagnose failures, or improve systems over time.
The observability stack for agents has three surfaces:
Cognitive: What was the model reasoning? What tool did it decide to call and why? Trace every decision point.
Operational: How long did each step take? Which tools failed and with what error? What was the token count at each step?
Contextual: What did the agent know when it made each decision? What was in context at the moment of each tool call?
Current standards — OpenTelemetry extensions (OpenLLMetry, OpenInference), LangFuse, purpose-built frameworks like TRAIL — are all converging on span-based tracing adapted from distributed systems observability. Each agent step is a span. Tool calls are child spans. The trace is the audit log.
The 2025 AgentTrace framework demonstrated that causal graph-based root cause analysis on production traces can localize the source of agent failures with sub-second latency — practical for interactive debugging in production environments.
---
Raw accuracy is not reliability. A 2026 paper studying 15 models across benchmarks proposed four dimensions of reliability that are independent of accuracy:
The paper found that reliability gains lag noticeably behind capability progress. Models get more capable; they do not automatically get more reliable.
This matters for production deployments. The most capable model and the most reliable model are not the same model. For production agents, choose reliability over capability.
---
A single-threaded agent runs linearly — every action sees the full continuous context and all prior decisions. This avoids conflicting assumptions between parallel subagents, produces coherent decisions, and dramatically simplifies failure analysis. Parallelism increases throughput; it also multiplies failure surface.
Use multi-agent parallelism for research and exploration tasks. Use single-threaded agents for tasks where consistency and correctness matter more than speed.
Free-form while-done loops leave termination entirely to the model. Directed-graph orchestration (LangGraph, PydanticAI state machines) makes state transitions explicit. Every state has defined exit conditions. Infinite loops become a graph traversal problem with hard bounds.
Anthropic's internal multi-agent research system used this architecture. Key lesson from their engineering team: small changes to prompt structure caused large behavioral changes. Explicit state graphs reduced this sensitivity.
74% of production agent deployments include human-in-the-loop evaluation as the primary correctness signal. This is not a sign of immaturity — it is sound engineering. Define checkpoints where a human must approve before the agent takes irreversible action. Make those checkpoints explicit in the architecture, not as a fallback when things go wrong.
---
A checklist, not a philosophy:
1. Hard step limit. max_steps is not optional. Set it. Log when it fires.
2. Error taxonomy. Classify errors before they enter context. Route on type, not text.
3. Context budget tracking. Know your remaining token budget at every step. Summarize when needed.
4. Termination conditions. Define done. Not "the model decides" — define it in code.
5. Trace every step. Span-based tracing from the first day. Cognitive + operational + contextual.
6. Shadow-mode validation. Never promote a new agent version to production without shadow testing.
7. Input sanitization on tool results. Treat external content as untrusted.
8. Human checkpoints. Define the irreversible actions. Require approval before them.
The agents that fail in production do not fail because the model is wrong. They fail because the loop has no guard rails, the errors have no classification, the context has no budget, and the observability produces no signal.
Build the loop. Then build everything around the loop. That is agentic engineering.
---