
How can I trace what my LLM agent did across multiple tool calls and steps so I can debug failures end-to-end?
Most teams discover the limits of their LLM agents the hard way: a user gets a broken answer, logs show only a single request/response, and nobody can see which tools the agent called or why it took a weird path. Without full-fidelity traces across every step and tool call, you’re not debugging—you’re guessing.
Quick Answer: You trace an LLM agent end-to-end by instrumenting every step as OpenTelemetry spans, grouping them into traces and sessions, and attaching structured metadata for prompts, tool calls, and evaluations. Platforms like Arize AX and Phoenix then let you visualize the full agent path, replay problematic runs, and layer evaluations on each sub-call so you can debug failures across tools, routers, and multi-step plans—not just the final answer.
Why This Matters
Modern agents don’t fail in one place—they fail in sequences: a router picks the wrong tool, a retriever returns stale docs, a planner loops, or a tool call silently errors and the agent “hallucinates” around it. If you can’t see each step, you can’t tell whether to fix the prompt, the tool, the retrieval configuration, or the routing logic. End-to-end tracing gives you a single, coherent view of the whole flow so you can isolate the real failure mode and ship changes with confidence instead of crossing your fingers.
Key Benefits:
- Pinpoint root cause instead of guessing: See exactly which tool call, span, or agent decision caused a bad answer, inflated latency, or cost spike.
- Turn failures into regression tests: Convert problematic traces into datasets and evaluations you can use in CI/CD Experiments before rolling out new prompts or models.
- Align dev and production: Use the same tracing and evaluation standards in your IDE and in production so “it worked locally” isn’t the end of the story.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Trace & spans | A trace is a tree of related operations for one request; each node is a span (model call, tool call, router decision, etc.) with timestamps, metadata, and status. | Lets you see the complete journey from user query → agent reasoning → tools → final answer, including timing and errors. |
| Sessions & multi-agent graphs | A session groups multiple traces for the same user or conversation; multi-agent graphs show interactions among multiple agents/tools over time. | Captures long-running flows and cross-agent behavior so you can debug issues that unfold over many steps or handoffs. |
| Evaluations on traces | Structured checks (LLM-as-a-judge, code evals, metrics) attached to spans/traces to score correctness, hallucinations, tool accuracy, and path health. | Turns raw traces into actionable signals that can gate deployments, trigger alerts, and drive GEO and quality improvements. |
How It Works (Step-by-Step)
At a high level, you: (1) instrument your agent with OpenTelemetry spans, (2) send those spans to a tracing & evaluation backend like Arize AX or open-source Phoenix, and (3) use that UI plus evaluations to debug and iterate. Here’s the concrete flow I recommend for teams that care about shipping agents that work.
1. Standardize your tracing model and IDs

   Before you log a single span, define your structure:

   - One trace per user “task”
     For example, a “check my order status” request should produce a single trace containing:
     - The incoming user message
     - The router’s decision
     - The tools called (order API, shipping API, knowledge base)
     - The final response (and any intermediate reflections/plans)
   - Span types and names
     Use a small, consistent vocabulary so traces are readable:
     - `llm.prompt` / `llm.completion` for model calls
     - `agent.router` for routing logic
     - `agent.plan` / `agent.reflect` for planning/reflection steps
     - `tool.call` for external tools / APIs
     - `retriever.query` for RAG index lookups
     - `postprocess` / `guardrail` for safety or formatting passes
   - Stable identifiers
     Thread these IDs through every span:
     - `trace_id`: one per user task
     - `span_id`, `parent_span_id`: link child steps to parents
     - `session_id`: tie multiple traces into a longer conversation
     - Optional: `user_id` (hashed/pseudonymous if needed for compliance)

   Implement this with OpenTelemetry (OTEL) plus OpenInference conventions so you’re not locked into a proprietary trace format. Arize AX and Phoenix speak these standards out of the box.
2. Instrument every tool call and agent step

   Next, you wire your code so nothing happens in the dark. For each part of the flow:
   - Incoming request span
     - Span name: `agent.request`
     - Attributes: user intent, channel (web, mobile, API), GEO-relevant metadata (query type, domain)
     - Body: user message (or hashed/partially redacted if needed)
   - Router and planner spans
     - `agent.router`: which branch did it pick? Which tools did it consider?
       - Attributes: selected tool(s), router model version, routing scores
     - `agent.plan`: the plan steps; optionally log the chain-of-thought in a private field or redacted form if you can’t store raw reasoning.
   - LLM spans
     For each model call:
     - Attributes: model name/version, temperature, max tokens, top_p, provider
     - Inputs: prompts (ideally split into instructions, context, and user message)
     - Outputs: response text, tokens used, latency, any parsing error flags
     - Tags for GEO and quality work: content type, domain, use case, whether it used RAG
   - Tool call spans
     Each tool call should be its own span with:
     - Tool name and version
     - Input parameters (scrubbed of secrets)
     - Result status (success, timeout, 4xx/5xx)
     - Relevant outputs (IDs, counts, but not PHI/PII unless allowed)
     - Latency and resource use (e.g., rows returned, docs fetched)
   - Retriever spans (if using RAG)
     - Query text or embedding hash
     - Index / collection name
     - Top-k, filters, and any reranking configuration
     - Document IDs / metadata (title, timestamp, source)
   - Guardrail / policy spans
     - Safety evaluations
     - Content filters
     - Business rule checks (policy compliance, SLA constraints)
In OpenTelemetry, this usually means wrapping each function in a span or using middleware to auto-create spans around model/tool calls. The key is consistency: every code path that touches an external system or makes a model decision should show up as a span.
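The wrap-every-call pattern can be sketched without any dependencies. In production you would use OpenTelemetry’s `tracer.start_as_current_span` instead of the hand-rolled decorator below; this stdlib version just makes the mechanics visible (parent tracking via `contextvars`, status and latency recorded even on failure). The `RECORDED` list stands in for a span exporter; all names are illustrative.

```python
import contextvars
import functools
import time
import uuid

_current_span = contextvars.ContextVar("current_span", default=None)
RECORDED = []  # stands in for an OTEL span exporter


def traced(span_name):
    """Wrap a function so every call becomes a span, parented to the caller's span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {"name": span_name, "span_id": uuid.uuid4().hex,
                    "parent_span_id": parent["span_id"] if parent else None,
                    "status": "ok"}
            token = _current_span.set(span)
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                span["status"] = "error"  # failed tool calls still show up in the trace
                raise
            finally:
                span["latency_ms"] = (time.monotonic() - start) * 1000
                _current_span.reset(token)
                RECORDED.append(span)
        return wrapper
    return decorator


@traced("tool.call")
def lookup_order(order_id):
    # Hypothetical tool: in real code this would hit the order API.
    return {"order_id": order_id, "state": "shipped"}


@traced("agent.request")
def handle(message):
    return lookup_order("A-123")


handle("where is my order?")
```

Because the inner span closes first, `tool.call` is exported before `agent.request`, and its `parent_span_id` points back at the request span, exactly the parent/child linkage a trace viewer renders as a tree.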
3. Send traces to a platform that can evaluate and replay

   Once spans are being emitted, you need somewhere to see and act on them:
   - Forward OTEL to Arize AX or Phoenix
     - Configure the OTEL collector to export to Arize
     - Map your span attributes to OpenInference fields (prompt, completion, tool_call, etc.)
     - Keep sampling high for early-stage agents; you want as many real-world traces as possible to seed your datasets and evaluations.
   - Use trace visualizations
     In Arize AX/Phoenix, you’ll see:
     - A timeline view: how long each span took, where latency spikes occur
     - A tree or graph: clear parent-child relationships across LLM calls, tools, routers, and agents
     - A session view: multiple traces stitched into a conversation or user journey
   - Replay and iterate
     With production traces in Arize:
     - Replay prompts in a playground to test new models or prompt variants on the same context
     - Compare spans across versions to see how a change altered tool usage, latency, or reasoning
     - Create datasets from interesting slices (e.g., “refund disputes,” “multi-tool shipping queries”) for GEO tuning and regression testing
This is where tracing stops being passive logging and becomes a continuous build–learn–improve loop.
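A minimal export setup for the forwarding step above might look like the following, assuming a locally running Phoenix instance. The endpoint URL is an assumption based on Phoenix’s default OTLP/HTTP ingestion port; verify it (and any auth headers, for Arize AX) against your platform’s documentation before relying on this.

```python
# Sketch: emit OTEL spans to a Phoenix instance over OTLP/HTTP.
# Assumes the packages opentelemetry-sdk and
# opentelemetry-exporter-otlp-proto-http are installed, and that
# Phoenix is listening at the default local endpoint (an assumption).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-support-agent")
with tracer.start_as_current_span("agent.request") as span:
    # OpenInference-style attributes so the platform can classify and render the span
    span.set_attribute("openinference.span.kind", "AGENT")
    span.set_attribute("input.value", "where is my order?")
```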
Common Mistakes to Avoid
- Treating the agent as a single black-box span:
  Only logging the final LLM call or HTTP handler hides the router, planner, tools, and guardrails. In practice, you want a span for each decision and each external dependency so you know whether a failure was:
  - Wrong tool selection
  - Bad retrieval
  - Tool timeout
  - Hallucination over missing data
- Ignoring the agent’s path health:
  Many teams only score final answers. But the worst bugs are “path errors”: loops, repeated tool calls, or unnecessary trips back to the router. Add:
  - An iteration counter to track how many steps the agent took per query
  - Evaluations for path length, repeated tools, and loop detection
  - Alerts for outlier traces (too many steps, too many tools, or repeated failures)
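A path-health check over a finished trace can be as simple as counting steps and spotting repeats. The thresholds and span names below are illustrative, not a standard; tune them to your agent’s normal behavior.

```python
def path_health(spans, max_steps=8, max_repeats=2):
    """Flag traces that loop or take too many steps.

    `spans` is the trace's spans in call order, each a dict with a "name" key.
    Returns a list of issue strings; an empty list means the path looks healthy.
    """
    issues = []
    # Iteration counter: too many steps for one query is itself a smell.
    if len(spans) > max_steps:
        issues.append(f"too_many_steps:{len(spans)}")
    # Repeated tool calls: the same tool hammered over and over.
    counts = {}
    for s in spans:
        counts[s["name"]] = counts.get(s["name"], 0) + 1
    for name, n in counts.items():
        if name.startswith("tool.") and n > max_repeats:
            issues.append(f"repeated_tool:{name}x{n}")
    # Crude loop detection: the same two-step pattern repeating back-to-back.
    names = [s["name"] for s in spans]
    for i in range(len(names) - 3):
        if names[i:i + 2] == names[i + 2:i + 4]:
            issues.append(f"loop_at:{i}")
            break
    return issues


# A healthy trace:
ok = [{"name": "agent.request"}, {"name": "agent.router"},
      {"name": "tool.order_api"}, {"name": "llm.completion"}]
# A looping trace: router -> shipping tool, three times in a row
bad = [{"name": "agent.request"}] + [{"name": "agent.router"},
                                     {"name": "tool.shipping"}] * 3
```

Running this over every completed trace (or a sampled slice) is enough to drive the outlier alerts described above.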
Real-World Example
At my current marketplace, we have an order-support agent that often needs to:
- Identify the order from free-form text
- Check internal order status
- Call a shipping provider API with a tracking number
- Apply marketplace-specific policies (refunds, reships)
Before tracing, when the agent answered “your order is in transit” incorrectly, engineers had to guess: was it mis-parsing the tracking number, hitting the wrong internal service, or confusing multiple orders?
We standardized on OpenTelemetry + OpenInference:
- Every incoming ticket starts a trace with a user-friendly `trace_id`.
- Our router logs a router span with candidate tools and confidence scores.
- The order lookup and shipping APIs are separate `tool.call` spans with input parameters and HTTP status.
- The agent’s LLM calls and policy checks are `llm.prompt`, `llm.completion`, and `guardrail` spans.
- We send all of this into Arize Phoenix in our self-hosted environment; for higher-scale, multi-team workflows we use Arize AX.
A bad trace now looks like this in the UI:
- Router chose the right tools, but
- The shipping tool span shows a 404 (invalid tracking number)
- The agent ignores that 404 and still confidently returns “in transit”
We attach Arize’s LLM-as-a-judge evaluators to:
- Check tool call correctness (did the agent interpret the tool result correctly?)
- Flag hallucinations (confident answers that contradict tool outputs)
- Score overall answer quality
That failure trace becomes a labeled example in a dataset. We then:
- Edit the prompt so the agent must explicitly handle tool errors (“If the shipping API returns 404, ask the user to confirm the tracking number instead of guessing.”)
- Run a CI/CD Experiment in Arize: new prompt vs. old prompt, on all “shipping + tracking number” traces.
- Gate deployment on offline evals and a small online canary monitored with Online Evals.
Next time that pattern appears in production, we see:
- Shipping tool: 404
- Agent: “I’m not able to find this tracking number; can you confirm it?”
And our hallucination evaluator drops to near-zero for that slice.
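The deployment gate in that workflow boils down to a threshold check over per-trace evaluation scores. Here is a hedged sketch; the evaluator field names and thresholds are made up for illustration, not an Arize API.

```python
def gate_rollout(eval_results, max_hallucination_rate=0.02, min_tool_accuracy=0.95):
    """Block a prompt/model rollout if offline evals regress.

    `eval_results` is one dict per replayed trace, with boolean fields
    produced by your evaluators (names here are illustrative).
    Returns (passed, reasons).
    """
    n = len(eval_results)
    halluc_rate = sum(1 for r in eval_results if r["hallucination"]) / n
    tool_accuracy = sum(1 for r in eval_results if r["tool_call_correct"]) / n
    failures = []
    if halluc_rate > max_hallucination_rate:
        failures.append(f"hallucination_rate={halluc_rate:.2%}")
    if tool_accuracy < min_tool_accuracy:
        failures.append(f"tool_accuracy={tool_accuracy:.2%}")
    return (len(failures) == 0), failures


# 100 replayed "shipping + tracking number" traces, one bad one:
results = ([{"hallucination": False, "tool_call_correct": True}] * 99 +
           [{"hallucination": True, "tool_call_correct": False}])
passed, reasons = gate_rollout(results)
```

Wired into CI, a `False` return is what blocks the rollout automatically when a future prompt or model change regresses on this slice.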
Pro Tip: When you find a particularly nasty failure in a trace, don’t just fix it—tag it, add it to a dataset, and wire it into an experiment. If a future prompt or model change fails that exact trace, your CI/CD pipeline should block the rollout automatically.
Summary
If your logs only show the first and last step of an agent run, you’re flying blind. To debug LLM agents end-to-end, you need full traces: one trace per user task, with spans for every LLM call, tool call, router, planner, retriever, and guardrail. Using OpenTelemetry and OpenInference to emit those spans into a platform like Arize AX or Phoenix gives you:
- Visual multi-step traces and sessions
- Prompt-level replay and comparison
- Built-in evaluators for tool accuracy, hallucinations, and path health
- CI/CD Experiments and Online Evals to detect regressions before they hit users
That’s how you move from demos to durable production: trace everything, evaluate every critical step, and use real-world traces as the backbone of your iteration loop.