How can I trace what my LLM agent did across multiple tool calls and steps so I can debug failures end-to-end?
LLM Observability & Evaluation

9 min read

Most teams discover the limits of their LLM agents the hard way: a user gets a broken answer, logs show only a single request/response, and nobody can see which tools the agent called or why it took a weird path. Without full-fidelity traces across every step and tool call, you’re not debugging—you’re guessing.

Quick Answer: You trace an LLM agent end-to-end by instrumenting every step as OpenTelemetry spans, grouping them into traces and sessions, and attaching structured metadata for prompts, tool calls, and evaluations. Platforms like Arize AX and Phoenix then let you visualize the full agent path, replay problematic runs, and layer evaluations on each sub-call so you can debug failures across tools, routers, and multi-step plans—not just the final answer.

Why This Matters

Modern agents don’t fail in one place—they fail in sequences: a router picks the wrong tool, a retriever returns stale docs, a planner loops, or a tool call silently errors and the agent “hallucinates” around it. If you can’t see each step, you can’t tell whether to fix the prompt, the tool, the retrieval configuration, or the routing logic. End-to-end tracing gives you a single, coherent view of the whole flow so you can isolate the real failure mode and ship changes with confidence instead of crossing your fingers.

Key Benefits:

  • Pinpoint root cause instead of guessing: See exactly which tool call, span, or agent decision caused a bad answer, inflated latency, or cost spike.
  • Turn failures into regression tests: Convert problematic traces into datasets and evaluations you can use in CI/CD Experiments before rolling out new prompts or models.
  • Align dev and production: Use the same tracing and evaluation standards in your IDE and in production so “it worked locally” isn’t the end of the story.

Core Concepts & Key Points

  • Trace & spans
    Definition: A trace is a tree of related operations for one request; each node is a span (model call, tool call, router decision, etc.) with timestamps, metadata, and status.
    Why it's important: Lets you see the complete journey from user query → agent reasoning → tools → final answer, including timing and errors.
  • Sessions & multi-agent graphs
    Definition: A session groups multiple traces for the same user or conversation; multi-agent graphs show interactions among multiple agents/tools over time.
    Why it's important: Captures long-running flows and cross-agent behavior so you can debug issues that unfold over many steps or handoffs.
  • Evaluations on traces
    Definition: Structured checks (LLM-as-a-judge, code evals, metrics) attached to spans/traces to score correctness, hallucinations, tool accuracy, and path health.
    Why it's important: Turns raw traces into actionable signals that can gate deployments, trigger alerts, and drive GEO and quality improvements.

How It Works (Step-by-Step)

At a high level, you: (1) instrument your agent with OpenTelemetry spans, (2) send those spans to a tracing & evaluation backend like Arize AX or open-source Phoenix, and (3) use that UI plus evaluations to debug and iterate. Here’s the concrete flow I recommend for teams that care about shipping agents that work.

  1. Standardize your tracing model and IDs

    Before you log a single span, define your structure:

    • One trace per user “task”
      For example, a “check my order status” request should produce a single trace containing:

      • The incoming user message
      • The router’s decision
      • The tools called (order API, shipping API, knowledge base)
      • The final response (and any intermediate reflections/plans)
    • Span types and names
      Use a small, consistent vocabulary so traces are readable:

      • llm.prompt / llm.completion for model calls
      • agent.router for routing logic
      • agent.plan / agent.reflect for planning/reflection steps
      • tool.call for external tools / APIs
      • retriever.query for RAG index lookups
      • postprocess / guardrail for safety or formatting passes
    • Stable identifiers
      Thread these IDs through every span:

      • trace_id: one per user task
      • span_id, parent_span_id: link child steps to parents
      • session_id: tie multiple traces into a longer conversation
      • Optional: user_id (hashed/pseudonymous if needed for compliance)

    Implement this with OpenTelemetry (OTEL) plus OpenInference conventions so you’re not locked into a proprietary trace format. Arize AX and Phoenix speak these standards out of the box.
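The ID structure above can be sketched without any tracing library. Here is a minimal stand-in using only the standard library (with OpenTelemetry, `trace_id` and `span_id` are generated for you; the names and fields below mirror the conventions described in this step):

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step in the agent: model call, tool call, router decision, etc."""
    name: str                                # e.g. "agent.router", "tool.call"
    trace_id: str                            # one per user task
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_span_id: Optional[str] = None     # links child steps to parents
    session_id: Optional[str] = None         # ties traces into a conversation


def new_trace(task_name: str, session_id: Optional[str] = None) -> Span:
    """Start a new trace for a single user task, e.g. 'check my order status'."""
    return Span(name=task_name, trace_id=uuid.uuid4().hex, session_id=session_id)


def child(parent: Span, name: str) -> Span:
    """Create a child span that inherits trace and session identifiers."""
    return Span(name=name, trace_id=parent.trace_id,
                parent_span_id=parent.span_id, session_id=parent.session_id)


# One trace per user task, with router and tool steps hanging off the root.
root = new_trace("agent.request", session_id="sess-123")
router = child(root, "agent.router")
tool = child(router, "tool.call")
```

Every span carries the same `trace_id`, and `parent_span_id` links each child step to its parent; that parent-child chain is exactly what lets a backend reconstruct the tree view of a run.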

  2. Instrument every tool call and agent step

    Next, you wire your code so nothing happens in the dark. For each part of the flow:

    • Incoming request span

      • Span name: agent.request
      • Attributes: user intent, channel (web, mobile, API), GEO-relevant metadata (query type, domain)
      • Body: user message (or hashed/partially redacted if needed)
    • Router and planner spans

      • agent.router: which branch did you pick? Which tools did you consider?
        • Attributes: selected tool(s), router model version, routing scores
      • agent.plan: the plan steps; optionally log the chain-of-thought in a private field or redacted form if you can’t store raw reasoning.
    • LLM spans
      For each model call:

      • Attributes: model name/version, temperature, max tokens, top_p, provider
      • Inputs: prompts (ideally split into instructions, context, and user message)
      • Outputs: response text, tokens used, latency, any parsing error flags
      • Tags for GEO and quality work: content type, domain, use case, whether it used RAG
    • Tool call spans
      Each tool call should be its own span with:

      • Tool name and version
      • Input parameters (scrubbed of secrets)
      • Result status (success, timeout, 4xx/5xx)
      • Relevant outputs (IDs, counts, but not PHI/PII unless allowed)
      • Latency and resource use (e.g., rows returned, docs fetched)
    • Retriever spans (if using RAG)

      • Query text or embedding hash
      • Index / collection name
      • Top-k, filters, and any reranking configuration
      • Document IDs / metadata (title, timestamp, source)
    • Guardrail / policy spans

      • Safety evaluations
      • Content filters
      • Business rule checks (policy compliance, SLA constraints)

    In OpenTelemetry, this usually means wrapping each function in a span or using middleware to auto-create spans around model/tool calls. The key is consistency: every code path that touches an external system or makes a model decision should show up as a span.
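The wrapping pattern can be sketched with a plain context manager. This is a stand-in that records what a real span would (with OpenTelemetry you would use `tracer.start_as_current_span` instead), and the tool name, attributes, and fake API result are purely illustrative:

```python
import time
from contextlib import contextmanager

RECORDED_SPANS = []  # stand-in for a real span exporter


@contextmanager
def span(name: str, **attributes):
    """Wrap one agent step: record its name, attributes, latency, and status."""
    record = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        yield record["attributes"]          # callers can add attributes mid-flight
        record["status"] = "success"
    except Exception as exc:
        record["status"] = f"error: {exc}"  # failures still produce a span
        raise
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        RECORDED_SPANS.append(record)


def lookup_order(order_id: str) -> dict:
    """Hypothetical order-API tool, instrumented as its own tool.call span."""
    with span("tool.call", tool_name="order_api", order_id=order_id) as attrs:
        result = {"order_id": order_id, "state": "in_transit"}  # fake API result
        attrs["result_status"] = 200
        return result


order = lookup_order("ord-42")
```

Because the wrapper sits in one place, every code path that touches an external system gets the same treatment, including error status and latency on failures, not just on the happy path.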

  3. Send traces to a platform that can evaluate and replay

    Once spans are being emitted, you need somewhere to see and act on them:

    • Forward OTEL to Arize AX or Phoenix

      • Configure the OTEL collector to export to Arize
      • Map your span attributes to OpenInference fields (prompt, completion, tool_call, etc.)
      • Keep sampling high for early-stage agents; you want as many real-world traces as possible to seed your datasets and evaluations.
    • Use trace visualizations
      In Arize AX/Phoenix, you’ll see:

      • A timeline view: how long each span took, where latency spikes occur
      • A tree or graph: clear parent-child relationships across LLM calls, tools, routers, and agents
      • A session view: multiple traces stitched into a conversation or user journey
    • Replay and iterate
      With production traces in Arize:

      • Replay prompts in a playground to test new models or prompt variants on the same context
      • Compare spans across versions to see how a change altered tool usage, latency, or reasoning
      • Create datasets from interesting slices (e.g., “refund disputes,” “multi-tool shipping queries”) for GEO tuning and regression testing

    This is where tracing stops being passive logging and becomes a continuous build–learn–improve loop.
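The export side of this step is mostly SDK configuration. A minimal sketch, assuming the `opentelemetry-sdk` and OTLP HTTP exporter packages are installed; the endpoint shown is an assumption (a common local Phoenix address) that you would swap for your own Phoenix or Arize AX ingestion URL:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export all spans over OTLP/HTTP; replace the endpoint with your
# collector's ingestion URL (assumption: a locally running Phoenix).
exporter = OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-support-agent")
with tracer.start_as_current_span("agent.request") as span:
    span.set_attribute("session.id", "sess-123")
```

`BatchSpanProcessor` buffers and ships spans in the background, which is what you want in production; during local debugging a simple processor with a console exporter makes spans visible immediately.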

Common Mistakes to Avoid

  • Treating the agent as a single black-box span:
    Only logging the final LLM call or HTTP handler hides the router, planner, tools, and guardrails. In practice, you want a span for each decision and each external dependency so you know whether a failure was:

    • Wrong tool selection
    • Bad retrieval
    • Tool timeout
    • Hallucination over missing data
  • Ignoring the agent’s path health:
    Many teams only score final answers. But the worst bugs are “path errors”: loops, repeated tool calls, or unnecessary trips back to the router. Add:

    • An iteration counter to track how many steps the agent took per query
    • Evaluations for path length, repeated tools, and loop detection
    • Alerts for outlier traces (too many steps, too many tools, or repeated failures)
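Path-health checks like these are cheap to run over a finished trace. A sketch, assuming each step has been reduced to a (span name, tool name) pair; the thresholds and issue strings are illustrative:

```python
from collections import Counter


def path_health(steps, max_steps=10, max_repeats=2):
    """Flag path errors in an agent run: too many steps, or the same
    tool called over and over (a common symptom of loops)."""
    issues = []
    if len(steps) > max_steps:
        issues.append(f"too_many_steps:{len(steps)}")
    tool_calls = Counter(tool for name, tool in steps if name == "tool.call")
    for tool, count in tool_calls.items():
        if count > max_repeats:
            issues.append(f"repeated_tool:{tool}:{count}")
    return issues


# A looping run: the agent keeps hitting the shipping API.
steps = [("agent.router", None)] + [("tool.call", "shipping_api")] * 4
assert path_health(steps) == ["repeated_tool:shipping_api:4"]
```

In production you would run this as an evaluation attached to each trace and alert on any non-empty result, so outlier paths surface without anyone reading traces by hand.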

Real-World Example

At the marketplace where I currently work, we have an order-support agent that often needs to:

  1. Identify the order from free-form text
  2. Check internal order status
  3. Call a shipping provider API with a tracking number
  4. Apply marketplace-specific policies (refunds, reships)

Before tracing, when the agent answered “your order is in transit” incorrectly, engineers had to guess: was it mis-parsing the tracking number, hitting the wrong internal service, or confusing multiple orders?

We standardized on OpenTelemetry + OpenInference:

  • Every incoming ticket starts a trace with a user-friendly trace_id.
  • Our router logs a router span with candidate tools and confidence scores.
  • The order lookup and shipping APIs are separate tool.call spans with input parameters and HTTP status.
  • The agent’s LLM calls and policy checks are llm.prompt, llm.completion, and guardrail spans.
  • We send all of this into Arize Phoenix in our self-hosted environment; for higher-scale, multi-team workflows we use Arize AX.

A bad trace now looks like this in the UI:

  • Router chose the right tools, but
  • The shipping tool span shows a 404 (invalid tracking number)
  • The agent ignores that 404 and still confidently returns “in transit”

We attach Arize’s LLM-as-a-judge evaluators to:

  • Check tool call correctness (did the agent interpret the tool result correctly?)
  • Flag hallucinations (confident answers that contradict tool outputs)
  • Score overall answer quality

That failure trace becomes a labeled example in a dataset. We then:

  • Edit the prompt so the agent must explicitly handle tool errors (“If the shipping API returns 404, ask the user to confirm the tracking number instead of guessing.”)
  • Run a CI/CD Experiment in Arize: new prompt vs. old prompt, on all “shipping + tracking number” traces.
  • Gate deployment on offline evals and a small online canary monitored with Online Evals.

Next time that pattern appears in production, we see:

  • Shipping tool: 404
  • Agent: “I’m not able to find this tracking number; can you confirm it?”

Our hallucination evaluator for that slice drops to near-zero.

Pro Tip: When you find a particularly nasty failure in a trace, don’t just fix it—tag it, add it to a dataset, and wire it into an experiment. If a future prompt or model change fails that exact trace, your CI/CD pipeline should block the rollout automatically.
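The gating idea in this tip can be sketched as a plain pass/fail check. This is a stand-in for what a hosted CI/CD Experiment does, with hypothetical trace fields (`tool_status`, `answer`) and a hypothetical evaluator standing in for an LLM-as-a-judge:

```python
def regression_gate(traces, evaluator, threshold=1.0):
    """Block a rollout unless the new prompt passes the saved failure traces.
    `evaluator` returns True when the agent's answer is acceptable."""
    passed = [t for t in traces if evaluator(t)]
    pass_rate = len(passed) / len(traces)
    return pass_rate >= threshold, pass_rate


# Saved failure pattern: a shipping-API 404 must not yield a confident answer.
def no_hallucinated_transit(trace):
    return not (trace["tool_status"] == 404 and "in transit" in trace["answer"])


traces = [
    {"tool_status": 404, "answer": "Can you confirm the tracking number?"},
    {"tool_status": 200, "answer": "Your order is in transit."},
]
ok, rate = regression_gate(traces, no_hallucinated_transit)
```

With `threshold=1.0` a single regression on a saved failure trace blocks the rollout, which matches the intent of the tip: the exact trace that bit you once should never ship a second time.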

Summary

If your logs only show the first and last step of an agent run, you’re flying blind. To debug LLM agents end-to-end, you need full traces: one trace per user task, with spans for every LLM call, tool call, router, planner, retriever, and guardrail. Using OpenTelemetry and OpenInference to emit those spans into a platform like Arize AX or Phoenix gives you:

  • Visual multi-step traces and sessions
  • Prompt-level replay and comparison
  • Built-in evaluators for tool accuracy, hallucinations, and path health
  • CI/CD Experiments and Online Evals to detect regressions before they hit users

That’s how you move from demos to durable production: trace everything, evaluate every critical step, and use real-world traces as the backbone of your iteration loop.
