How can I replay past LLM requests to debug a bad response and compare prompt versions side by side?

Quick Answer: The fastest way to debug a bad LLM response is to replay the exact same request—same input, context, tools, and metadata—inside a trace-aware prompt playground, then iteratively tweak prompts and parameters side by side. With Arize, you can jump from a production trace into a replay, attach evaluations, and A/B test prompt versions so you know which change actually fixes the issue without introducing regressions.

Why This Matters

In production, “that one bad answer” is rarely a one-off—it’s usually a symptom of a deeper issue in your prompts, tools, or retrieval. If you can’t reliably replay a broken request and see every step an agent took, you’re guessing. When you add side-by-side prompt comparison and automated evaluations, debugging turns into a repeatable workflow: reproduce, inspect, fix, and verify before you push changes back to production.

Key Benefits:

Reproduce issues exactly: Trace and replay the full flow—user input, tools, retrieval, and intermediate steps—without trying to manually reconstruct the request.
Improve prompts with evidence: Compare prompt versions, models, and parameters on the same past requests and datasets to see which variant actually improves quality.
Prevent regressions: Turn bad responses into test cases, attach evals, and gate future prompt or agent releases through CI/CD experiments.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| Tracing & Replay | Capturing every span in an LLM/agent request (prompts, tool calls, retrieval, outputs) and replaying that trace in a controlled environment. | Lets you debug the exact failure path instead of a rough reproduction, so fixes are grounded in real production behavior. |
| Prompt Playground & Comparison | An environment where you can load a past request, edit prompts/parameters/models, and see outputs from multiple variants side by side. | Makes it easy to see how small prompt changes affect outcomes and choose the best version based on real examples. |
| Evaluations & Experiments | Scoring LLM outputs using LLM-as-a-Judge, code checks, or human labels, then running A/B tests across variants. | Turns debugging from “looks better” into measurable improvement and avoids shipping changes that quietly break other paths. |

How It Works (Step-by-Step)

One platform. From production trace to side-by-side prompt comparison and back to production, all with evals wired in.

1. Instrument and Capture the Full Flow

Before you can replay anything, you need good traces.

Standardize tracing with OTEL + OpenInference
- Use OpenTelemetry (OTEL) for spans and traces across your stack.
- Adopt OpenInference conventions so prompts, tool calls, and LLM metadata are structured consistently:
  - Spans for: user request, router decisions, RAG retrieval, each tool call, each LLM call.
  - Attributes for: model name, temperature, system/user prompts, tool arguments, retrieved docs, latency, token and cost metrics.
- Arize AX and Arize Phoenix ingest these spans directly—no proprietary tracing SDKs needed.
Log every agent decision as a span
- For complex agents, log:
  - Tool selection decisions (which tool, why).
  - Parameter extraction (structured arguments for tools).
  - Multi-step paths across sub-agents or planners.
- This is how teams like Booking get “the full flow” of multi-agent tool interactions and why replay is accurate later.
Send traces to Arize
- Configure your gateway/agent framework to export traces to Arize AX or Phoenix.
- Keep retention and rate limits aligned with your traffic and SLOs so you can always pull up the relevant past request.
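
As a rough illustration of the span structure described above, the sketch below records one request as nested spans using only the Python standard library. It is not the OpenTelemetry SDK: the `Span` and `Tracer` classes and the `record_span` helper are hypothetical stand-ins, and the attribute keys are only modeled on OpenInference-style naming, so check the OpenInference conventions for the exact names before relying on them.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not the OTEL SDK class)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: Dict[str, Any] = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class Tracer:
    """Collects finished spans in memory; a real setup would export them."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def record_span(
        self, name: str, attributes: Optional[Dict[str, Any]] = None
    ) -> Iterator[Span]:
        # The span currently on top of the stack becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        span = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent_id,
                    dict(attributes or {}), start=time.time())
        self._stack.append(span)
        try:
            yield span
        finally:
            span.end = time.time()
            self._stack.pop()
            self.spans.append(span)


tracer = Tracer()
with tracer.record_span("user_request",
                        {"input.value": "What is the return window?"}) as root:
    with tracer.record_span("retrieval",
                            {"openinference.span.kind": "RETRIEVER"}) as retrieval:
        retrieval.attributes["retrieved_docs"] = ["policy_v7.md"]  # illustrative key
    with tracer.record_span("llm_call", {
        "openinference.span.kind": "LLM",
        "llm.model_name": "example-model",  # placeholder model name
        "llm.invocation_parameters": '{"temperature": 0.0}',
    }) as llm:
        llm.attributes["output.value"] = "Returns are accepted within 30 days."

for span in tracer.spans:
    print(span.name, span.parent_id, sorted(span.attributes))
```

The point of the nesting is that every child span carries its parent's ID and the shared trace ID, which is what lets a trace viewer rebuild the full flow later.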

2. Jump From a Bad Response to a Replay

Once a user hits a bad response in production, you don’t want to guess what happened.

Locate the failing trace
- Use Arize dashboards or search filters to find the problematic request:
  - Filter by error labels, low evaluation scores, specific users/tenants, or time windows.
  - Drill into a trace to see the full graph: session, spans, and multi-agent flow.
Inspect the trace details
- Open the trace view to see:
  - The exact user input.
  - The system and developer prompts.
  - Tools and retrieval spans (what was called, with what arguments, and what came back).
  - The final LLM output and any eval scores if you already had online evals running.
- Identify the failure pattern: hallucinated fact, missed constraint, wrong tool, or incorrect tool parameters.
Replay from trace into the prompt playground
- From the trace, open the Prompt Playground / Replay view.
- Arize automatically loads the context: prompts, model, parameters, retrieved docs, and tool outputs as they were.
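- This gives you a reproducible starting point for debugging: the same inputs and context, re-run through updated prompts or models. Outputs can still vary between runs unless sampling parameters such as temperature are held fixed.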
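
To make the replay step concrete, here is a minimal sketch of rebuilding a request from a stored trace. The trace layout, the `build_replay_request` helper, and the example values are assumptions for illustration only; they are not Arize's export format or API.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stored trace: one dict per span, in execution order.
stored_trace: List[Dict[str, Any]] = [
    {"kind": "RETRIEVER",
     "attributes": {"documents": ["Returns accepted within 30 days."]}},
    {"kind": "TOOL",
     "attributes": {"tool.name": "lookup_policy",
                    "arguments": {"region": "EU"},
                    "output": "policy_v7"}},
    {"kind": "LLM",
     "attributes": {"model": "example-model",
                    "temperature": 0.0,
                    "system_prompt": "Answer using the provided context only.",
                    "user_input": "What is the return window?",
                    "output": "Returns are accepted within 14 days."}},
]


def build_replay_request(
    trace: List[Dict[str, Any]], system_prompt: Optional[str] = None
) -> Dict[str, Any]:
    """Freeze the recorded context and optionally swap in a new system prompt."""
    llm = next(s for s in trace if s["kind"] == "LLM")["attributes"]
    docs = [d for s in trace if s["kind"] == "RETRIEVER"
            for d in s["attributes"]["documents"]]
    tool_outputs = {s["attributes"]["tool.name"]: s["attributes"]["output"]
                    for s in trace if s["kind"] == "TOOL"}
    return {
        "model": llm["model"],
        "temperature": llm["temperature"],
        "system_prompt": system_prompt or llm["system_prompt"],
        "user_input": llm["user_input"],
        "context": docs,               # retrieval results as originally returned
        "tool_outputs": tool_outputs,  # tool results as originally returned
        "original_output": llm["output"],
    }


original = build_replay_request(stored_trace)
candidate = build_replay_request(
    stored_trace,
    system_prompt="Answer using the provided context only and cite the source.",
)
print(candidate["system_prompt"])
print(candidate["context"])
```

Only the system prompt differs between `original` and `candidate`; the user input, retrieved documents, and tool outputs stay frozen, so any change in the answer can be attributed to the prompt edit.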

3. Compare Prompt Versions Side by Side

Now you have the “broken” request reconstructed. Time to iteratively improve.

Clone the original prompt into variants
- In the playground, create versions like:
  - prompt_v1 (original, for control).
  - prompt_v2 (better instructions, guardrails).
  - prompt_v3 (different decomposition or asking the model to outline its plan first).
- Keep temperature and other generation parameters controlled so you can attribute differences to the prompt, not randomness.
Run side-by-side outputs on the same request
- Execute all prompt versions on the same replayed input and context.
- View outputs side by side to compare:
  - Factual accuracy.
  - Structure and formatting.
  - Adherence to constraints (e.g., no PII, no speculation).
- For multi-step agents, check how each variant changes tool selection and path convergence (e.g., fewer unnecessary tool calls).
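
The side-by-side run can be sketched as follows. `fake_model` is a hypothetical offline stand-in for a real model call, and `compare_variants` is an illustrative helper, not a playground API; the important part is that every variant sees the same input and the same temperature.

```python
from typing import Callable, Dict


def fake_model(system_prompt: str, user_input: str, temperature: float) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    answer = "Returns are accepted within 30 days."
    if "cite" in system_prompt.lower():
        answer += " [source: policy_v7]"
    return answer


def compare_variants(
    variants: Dict[str, str],
    user_input: str,
    call_model: Callable[[str, str, float], str],
    temperature: float = 0.0,
) -> Dict[str, str]:
    """Run every prompt variant on the same input with the same parameters."""
    return {name: call_model(prompt, user_input, temperature)
            for name, prompt in variants.items()}


variants = {
    "prompt_v1": "Answer the seller's policy question.",
    "prompt_v2": "Answer the seller's policy question and cite the source.",
}
results = compare_variants(variants, "What is the return window?", fake_model)
for name, output in results.items():
    print(f"{name:<10} | {output}")
```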
Add LLM feedback into the loop (Level 3 prompt optimization)
- Use LLM feedback to refine your prompts further. For example:
```
I am writing a prompt for this task. Is there any part of the prompt that I can improve? 
[ BEGIN PROMPT ]
{prompt_version}
[ END PROMPT ]
```
- Let the LLM critique and suggest refinements, then test those refinements as additional variants.
- This is especially useful after you’ve already A/B tested prompt structures and have a “good but not perfect” candidate.
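
A small sketch of filling that critique template for a given prompt version follows. The `build_critique_request` name is hypothetical; sending the resulting text to a model is left out.

```python
CRITIQUE_TEMPLATE = (
    "I am writing a prompt for this task. "
    "Is there any part of the prompt that I can improve?\n"
    "[ BEGIN PROMPT ]\n"
    "{prompt_version}\n"
    "[ END PROMPT ]"
)


def build_critique_request(prompt_version: str) -> str:
    """Insert the prompt under review between the BEGIN/END markers."""
    return CRITIQUE_TEMPLATE.format(prompt_version=prompt_version)


critique_request = build_critique_request(
    "You are a policy assistant. Answer only from the provided documents."
)
print(critique_request)
```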
Evaluate variants, not just eyeball them
- Attach evals in the playground to score each prompt version:
  - LLM-as-a-Judge templates for:
    - Answer correctness / hallucination detection.
    - Tool-call correctness (right tool, right arguments).
    - Instruction adherence (tone, formatting, safety).
    - Path convergence (“Did the agent follow a reasonable sequence of steps?”).
  - Code evals for structured outputs (JSON validity, schema adherence, deterministic transforms).
- This turns “feels better” into “v2 improves tool-call correctness by 15% on this trace and related examples.”
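
As an example of a code eval for structured outputs, the sketch below checks JSON validity and a simple field schema. The required fields and the sample outputs are assumptions chosen for illustration.

```python
import json
from typing import Dict

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "policy_version": str}


def eval_structured_output(raw: str) -> Dict[str, bool]:
    """Score one model output for JSON validity and schema adherence."""
    result = {"valid_json": False, "schema_ok": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result
    result["valid_json"] = True
    result["schema_ok"] = isinstance(data, dict) and all(
        isinstance(data.get(key), expected)
        for key, expected in REQUIRED_FIELDS.items()
    )
    return result


outputs = {
    "prompt_v1": "The return window is 30 days.",
    "prompt_v2": '{"answer": "30 days", "policy_version": "v7"}',
    "prompt_v3": '{"answer": "30 days"}',
}
scores = {name: eval_structured_output(raw) for name, raw in outputs.items()}
for name, score in scores.items():
    print(name, score)
```

Checks like this are cheap and deterministic, so they pair well with LLM judges: the code eval catches malformed output, and the judge handles questions of correctness and tone.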

4. Generalize the Fix With Datasets & Experiments

Debugging one bad response is useful; turning it into a regression test is where it becomes production-grade.

Turn edge cases into datasets
- Save this trace and similar bad cases into a dataset inside Arize.
- Group by failure type: hallucinations, missed business rules, latency blowups, safety violations.
- Over time, these datasets become your golden set for agents and prompts.
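
One way to picture the grouping step is a small in-memory dataset keyed by failure type. The `Example` fields and failure labels below are hypothetical, not a dataset schema from Arize.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Example:
    """One saved bad case; field names are illustrative."""
    trace_id: str
    user_input: str
    bad_output: str
    failure_type: str


def group_by_failure(examples: List[Example]) -> Dict[str, List[Example]]:
    """Bucket saved cases so each failure type can be tested as a slice."""
    grouped: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        grouped[example.failure_type].append(example)
    return dict(grouped)


examples = [
    Example("t1", "What is the return window?", "14 days", "hallucination"),
    Example("t2", "Can I ship to the EU?", "Yes, always.", "missed_business_rule"),
    Example("t3", "What is the refund fee?", "There is none.", "hallucination"),
]
slices = group_by_failure(examples)
for failure_type, cases in slices.items():
    print(failure_type, [c.trace_id for c in cases])
```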
Run experiments across datasets
- Promote your prompt variants (and any model changes) into an Experiment in Arize AX.
- Run them offline across your curated datasets:
  - Baseline: existing prompt/model/router.
  - Candidate: new prompt variant(s) or model versions.
- Use the same evals (LLM judges, code checks, and human annotations) to compare quality, tool correctness, and cost across many examples—not just one trace.
Wire into CI/CD to gate releases
- Add these experiments to your prompt and agent CI/CD pipeline:
  - Git change → experiment run on golden datasets → auto-pass/fail thresholds.
- Block deploys if:
  - Eval scores drop beyond a threshold on critical slices.
  - New variants increase hallucination rates or break tool-call correctness.
- This closes the loop between the production issue you debugged and future changes, so you don’t reintroduce the same bug later.
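
The pass/fail gate can be sketched as a comparison of mean eval scores between baseline and candidate. The metric names, scores, and the `max_drop` threshold are made-up values for illustration; a real pipeline would read them from experiment results.

```python
from statistics import mean
from typing import Dict, List


def gate(
    baseline: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    max_drop: float = 0.02,
) -> List[str]:
    """Return one message per metric whose mean score dropped too far."""
    failures = []
    for metric, base_scores in baseline.items():
        base_mean = mean(base_scores)
        cand_mean = mean(candidate[metric])
        if base_mean - cand_mean > max_drop:
            failures.append(
                f"{metric}: {base_mean:.2f} -> {cand_mean:.2f} "
                f"(allowed drop {max_drop:.2f})"
            )
    return failures


# Per-example eval scores (1 = pass, 0 = fail) on the same golden dataset.
baseline = {"correctness": [1, 1, 0, 1], "tool_call_correctness": [1, 1, 1, 1]}
candidate = {"correctness": [1, 1, 1, 1], "tool_call_correctness": [1, 0, 1, 1]}

failures = gate(baseline, candidate)
for message in failures:
    print("BLOCK:", message)
# In CI, a non-empty `failures` list would translate to a non-zero exit code.
```

Here the candidate improves correctness but regresses tool-call correctness, so the gate blocks it. That is the kind of quiet regression a single-trace check would miss.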

5. Monitor Online and Iterate

Even after you fix and ship, you’re not done.

Enable Online Evals on live traffic
- Use Arize’s Online Evals to have AI evaluate AI in real time:
  - Score accuracy, safety, or task success as requests come in.
  - Combine with custom metrics for latency, cost, and tool-call frequency.
- Set dashboards and alerts for key SLOs (hallucination rate, invalid tool call rate, failure mode counts).
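
To show what an alert on hallucination rate might compute, here is a minimal rolling-window sketch. The `RollingRate` class, window size, and threshold are assumptions; a monitoring platform would normally handle this for you.

```python
from collections import deque
from typing import Deque


class RollingRate:
    """Tracks the share of flagged outputs over the last `window` requests."""

    def __init__(self, window: int, threshold: float) -> None:
        self.flags: Deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def rate(self) -> float:
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

    def observe(self, flagged: bool) -> bool:
        """Record one eval result; return True if an alert should fire."""
        self.flags.append(flagged)
        window_full = len(self.flags) == self.flags.maxlen
        # Wait for a full window so a single early failure does not alert.
        return window_full and self.rate() > self.threshold


monitor = RollingRate(window=5, threshold=0.2)
stream = [False, False, True, False, True, True]  # online eval verdicts
alerts = [monitor.observe(flagged) for flagged in stream]
print(alerts)
```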
Use annotation queues for ambiguous cases
- For high-stakes segments, pipe low-confidence or low-scoring outputs into a Human Annotation Queue.
- Annotators label correctness and tag failure modes; these get added back into datasets automatically.
- Over time, your test sets become richer, and your prompt/agent updates are validated against real edge cases rather than only synthetic examples.
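
The routing described above can be sketched with a plain queue. The record fields, the `min_score` cutoff, and both helper names are hypothetical; they only illustrate the flow from low-scoring output to labeled dataset entry.

```python
from collections import deque
from typing import Any, Deque, Dict, List

Record = Dict[str, Any]


def route_for_review(
    record: Record, queue: Deque[Record], min_score: float = 0.7
) -> bool:
    """Send low-scoring outputs to the annotation queue."""
    if record["eval_score"] < min_score:
        queue.append(record)
        return True
    return False


def apply_label(
    queue: Deque[Record], dataset: List[Record], correct: bool, failure_mode: str
) -> Record:
    """Take the oldest queued record, attach the human label, save it."""
    record = queue.popleft()
    labeled = {**record, "human_correct": correct, "failure_mode": failure_mode}
    dataset.append(labeled)
    return labeled


queue: Deque[Record] = deque()
dataset: List[Record] = []
for rec in [{"trace_id": "t1", "eval_score": 0.95},
            {"trace_id": "t2", "eval_score": 0.40}]:
    route_for_review(rec, queue)

apply_label(queue, dataset, correct=False, failure_mode="stale_policy")
print(dataset)
```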
Repeat the loop
- New issue → trace → replay → side-by-side prompt comparison → evals + experiments → CI/CD → monitoring.
- This is the build–learn–improve loop that keeps agents production-ready instead of demo-only.

Common Mistakes to Avoid

Relying on logs instead of traces:
Plain text logs of “input” and “output” miss all the middle steps. Use OTEL spans to capture tools, retrieval, and intermediate prompts so replay is faithful and debuggable.
Testing prompts only on the one broken example:
A fix that solves one user’s issue can quietly hurt others. Always promote your debugged trace into a dataset and run experiments on a broader set before you deploy.
Using opaque, black-box eval models:
If you can’t inspect or tune your evaluators, you risk cargo-cult scoring. Prefer transparent LLM-as-a-Judge templates and open-source eval models you can adapt for your domain.

Real-World Example

At my marketplace, we had a multi-agent workflow for seller policy questions. It “worked” in staging but occasionally gave contradictory answers in production when policies changed. One incident involved a seller being told an outdated return window, which became a Sev-1 because of compliance risk.

We pulled the exact trace from Arize: the router chose the wrong sub-agent, the RAG retriever pulled stale documents, and the main agent still produced a confident answer. From that trace, we opened the prompt playground and replayed the request. We cloned the existing prompts into three variants: one that enforced timestamp checks on retrieved docs, one that required citing source snippets, and one that asked the agent to outline its reasoning steps before answering.

Running all three side by side on the replayed request—and on a dataset of similar policy queries—showed that the “cite sources + enforce recency” prompt reduced hallucinations significantly. We attached LLM-as-a-Judge evals tuned to “is this answer aligned with the latest policy version?” plus code evals that validated the policy version ID in the response. After the experiment passed thresholds, we wired the new prompts into CI/CD as the gate for any future policy-agent changes. The same failure hasn’t resurfaced since, and any new prompt edit gets automatically tested against that original bad case.

Pro Tip: When you debug a failure, don’t just save the final “fixed” prompt—also snapshot the trace and link it to the experiment that validated the fix. That way, future engineers can see exactly which production incident each prompt version is designed to prevent.

Summary

Replaying past LLM requests is the difference between guessing and knowing. With robust OTEL-based tracing, you can take any bad production response, reconstruct the full flow, and replay it in a prompt playground. From there, side-by-side prompt comparison, LLM feedback, and evaluations turn debugging into a measurable loop: fix the issue on the original trace, prove the improvement on a dataset, and gate all future changes with experiments and CI/CD. That’s how you move from demo-able agents to agents that actually hit your SLOs.
