What metrics actually measure agent reliability (task completion, tool success, exchanges-to-completion) in production?

Most teams don’t discover their agents are unreliable until it hurts—support queues spike, orders get misrouted, or a “helpful” agent quietly bypasses policy. The root cause is almost always the same: they’re shipping agents with demo metrics, not production metrics.

If you want agents that behave reliably under real load, you need to measure the right things in production: task completion, tool success, and how many exchanges it takes to get there. And you need those metrics wired into a feedback loop that turns offline evals into always-on guardrails.

This piece breaks down the metrics that actually reflect agent reliability in production, how to instrument them, and how Galileo operationalizes them with Evaluate, Signals, and Protect.


The Quick Overview

  • What It Is: A practical framework for measuring agent reliability in production using task completion, tool success, and exchanges-to-completion—plus the supporting metrics you need for governance.
  • Who It Is For: Teams running LLM agents and RAG systems in production who are past the “demo works” phase and now need repeatable, measurable reliability.
  • Core Problem Solved: Generic “response quality” or CSAT metrics don’t reveal where agents fail (wrong tools, partial tasks, policy drift). You need agent-specific metrics that track full journeys, tool decisions, and safety posture in real time.

How It Works

Reliable agents don’t come from intuition; they come from measurable behavior. The workflow looks like this:

  1. Define reliability as concrete metrics.
    You translate “the agent worked” into a small set of quantifiable metrics: end-to-end task completion, tool selection quality, tool success rate, exchanges-to-completion, safety violations, and drift.

  2. Instrument sessions → traces → spans.
    Every user session becomes a trace. Every agent step, tool call, and model response becomes a span. You attach evaluators—some generic, some domain-specific—to these spans so they can score behavior continuously.

  3. Turn evals into guardrails.
    Offline, you calibrate evaluators against synthetic and real data. In production, those same evaluators run as low-latency checks that can block, redact, override, or escalate when agents misbehave. That’s the eval-to-guardrail lifecycle Galileo is built around.
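
A minimal sketch of the trace-and-span structure described in step 2, assuming a hypothetical in-memory data model. The `Span`, `SessionTrace`, and `run_evaluators` names are illustrative and are not taken from Galileo's SDK; real evaluators would be model-backed rather than the toy function shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Span:
    """One agent step: a tool call or a model response (hypothetical shape)."""
    kind: str      # e.g. "tool_call" or "model_response"
    payload: str   # raw text or serialized arguments
    scores: Dict[str, float] = field(default_factory=dict)


@dataclass
class SessionTrace:
    """One user session, holding its ordered spans."""
    session_id: str
    spans: List[Span] = field(default_factory=list)


Evaluator = Callable[[Span], float]


def run_evaluators(trace: SessionTrace,
                   evaluators: Dict[str, Dict[str, Evaluator]]) -> None:
    """Score every span with the evaluators registered for its kind."""
    for span in trace.spans:
        for name, evaluate in evaluators.get(span.kind, {}).items():
            span.scores[name] = evaluate(span)


def non_empty(span: Span) -> float:
    """Toy evaluator for illustration: 1.0 if the payload has content."""
    return 1.0 if span.payload.strip() else 0.0


trace = SessionTrace("s-1", [Span("tool_call", '{"query": "refund"}'),
                             Span("model_response", "")])
run_evaluators(trace, {"model_response": {"non_empty": non_empty}})
```

Registering evaluators per span kind keeps generic checks and domain-specific checks separate, which is one way to attach both to the same trace.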

Let’s unpack the core metrics and how they fit into this lifecycle.


Core Reliability Metrics for Agents in Production

1. End-to-End Task Completion

What it measures:
Whether the agent fully accomplishes the user’s overall goal across a full session—not just individual turns.

Why it matters:
A support agent can give “good” responses at each step and still fail to resolve the ticket. Measuring per-turn quality without end-to-end task completion is how you ship agents that sound smart but don’t get the job done.

How to define it:

  • Binary completion:
    • task_completed = 1 if the user’s defined goal is satisfied
    • task_completed = 0 otherwise
  • Task outcome quality (optional scalar):
    • 1–5 or 0–1 score for “how well” the goal was satisfied (e.g., fully resolved vs. a partial workaround).

Instrumentation pattern:

  • Treat each user session as a trace.
  • Mark the user goal at trace start (explicitly from the UX or inferred).
  • At trace end, run an evaluator that asks:

    “Given the initial goal and the full trace, did the agent resolve the user’s request?”

Galileo’s Evaluation Engine supports this directly: you can use out-of-the-box evaluators for task completion or create a custom LLM-as-judge evaluator from a written description, then refine it with SME annotations and live examples.
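
The instrumentation pattern above can be sketched as follows, assuming a hypothetical `Trace` record and a stand-in `Judge` callable in place of an LLM-as-judge evaluator. The keyword-based judge is a toy used only so the example runs on its own; none of these names come from Galileo's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """One user session: the stated goal plus the ordered transcript."""
    goal: str
    turns: List[str] = field(default_factory=list)


# Stand-in for an end-of-trace evaluator: returns task_completed as 1 or 0.
Judge = Callable[[Trace], int]


def keyword_judge(trace: Trace) -> int:
    """Toy judge for illustration only: looks for an explicit marker."""
    return int(any("RESOLVED" in turn for turn in trace.turns))


def task_completion_rate(traces: List[Trace], judge: Judge) -> float:
    """Fraction of traces where the judge returns task_completed = 1."""
    if not traces:
        return 0.0
    return sum(judge(t) for t in traces) / len(traces)


traces = [
    Trace("reset password", ["user: I can't log in",
                             "agent: Reset link sent. RESOLVED"]),
    Trace("refund order", ["user: refund please",
                           "agent: Can you share the order ID?"]),
]
rate = task_completion_rate(traces, keyword_judge)
```

Passing the judge in as a callable means the same aggregation works whether the score comes from a rule, an LLM judge, or an SME label.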


2. Tool Selection Quality

What it measures:
Whether the agent chose the right tool (or tool chain) for the task and whether the parameters were appropriate and well-structured.

Why it matters:
Most production failures aren’t language problems; they’re decision problems:

  • Agent uses a web-search tool instead of an internal policy database.
  • Agent calls the right API with malformed parameters.
  • Agent refuses a task it could perform safely with the right tool.

You need specialized metrics that assess the selection decision and the parameter quality, not just whether the final answer “looks correct.”

Metric components:

  • Tool choice correctness:
    • tool_choice_correct = 1 if the chosen tool(s) match the tool(s) specified by a reference agent or an SME label
    • 0 otherwise
  • Tool parameter quality:
    • Checks like: required fields present, types correct, ranges valid, format aligned with API spec.
    • Scores like parameters_valid = 0/1 plus a reason.

Instrumentation pattern:

  • Treat each tool call as a span in the trace.
  • Attach evaluators that:
    • Compare chosen tool to a labeled “ideal tool” when you have labeled data.
    • Validate parameters against schemas or schema-aware evaluators.
  • Use these scores to find patterns:
    • “Agent chooses SearchDocuments when it should use GetCustomerProfile 26% of the time for billing questions.”

Galileo exposes this as agent-specific metrics (tool selection quality) and lets you drill down to spans where tools were misused.
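
A minimal sketch of the two metric components, assuming a hypothetical schema table that maps each tool to its required fields and types. The tool names are borrowed from the example above; the schemas and field names are invented for illustration.

```python
from typing import Any, Dict, Tuple

# Hypothetical tool schemas: required field name -> expected Python type.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "GetCustomerProfile": {"customer_id": str},
    "SearchDocuments": {"query": str, "top_k": int},
}


def tool_choice_correct(chosen: str, ideal: str) -> int:
    """1 if the chosen tool matches the labeled ideal tool, else 0."""
    return int(chosen == ideal)


def parameters_valid(tool: str, params: Dict[str, Any]) -> Tuple[int, str]:
    """Return (0 or 1, reason) after checking required fields and types."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return 0, f"unknown tool: {tool}"
    for name, expected in schema.items():
        if name not in params:
            return 0, f"missing required field: {name}"
        if not isinstance(params[name], expected):
            return 0, f"wrong type for {name}: expected {expected.__name__}"
    return 1, "ok"


score, reason = parameters_valid("SearchDocuments",
                                 {"query": "refund policy", "top_k": "5"})
```

Returning a reason alongside the 0/1 score makes it easier to group failing spans by cause when looking for patterns.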


3. Tool Success Rate

What it measures:
Whether a given tool call actually helped move the task toward completion, not just whether the tool executed without throwing an error.

Why it matters:
An HTTP 200 doesn’t mean the agent used the tool effectively. You need to know:

  • Did the tool return relevant data?
  • Did the agent interpret and apply that data correctly?
  • Did using the tool improve task completion odds?

Metric components:

  • Tool execution success:
    • tool_exec_success = 1 if call completed without technical error
    • 0 otherwise
  • Tool semantic success:
    • Evaluator examines input, tool response, and subsequent agent action to score:
      • tool_semantic_success = 1 if the call produced useful information and the agent used it appropriately
      • 0 if the tool was unnecessary, returned irrelevant data, or was misinterpreted.

Instrumentation pattern:

  • For each tool span:
    • Log input, output, status, and downstream agent response.
    • Run evaluators that check:
      • Was this tool call necessary? (e.g., redundant calls or “thrashing”).
      • Did the agent’s next action reflect the tool’s result accurately?

Over time, you get a per-tool success profile that feeds directly into:

  • Prompt improvements (“don’t call web search unless X”).
  • Routing logic (“for refunds, prefer BillingAPI over FAQSearch”).
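
One way to build the per-tool success profile, sketched under the assumption that each tool span has already been scored with the two 0/1 values defined above. The record shape and the `per_tool_profile` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (tool_name, tool_exec_success, tool_semantic_success).
ToolSpanScore = Tuple[str, int, int]


def per_tool_profile(scores: Iterable[ToolSpanScore]) -> Dict[str, Dict[str, float]]:
    """Aggregate execution and semantic success rates per tool."""
    totals = defaultdict(lambda: {"calls": 0, "exec_ok": 0, "semantic_ok": 0})
    for tool, exec_ok, semantic_ok in scores:
        totals[tool]["calls"] += 1
        totals[tool]["exec_ok"] += exec_ok
        totals[tool]["semantic_ok"] += semantic_ok
    return {
        tool: {
            "calls": t["calls"],
            "exec_success_rate": t["exec_ok"] / t["calls"],
            "semantic_success_rate": t["semantic_ok"] / t["calls"],
        }
        for tool, t in totals.items()
    }


profile = per_tool_profile([
    ("FAQSearch", 1, 0),   # executed fine, but the result was not useful
    ("FAQSearch", 1, 1),
    ("BillingAPI", 1, 1),
    ("BillingAPI", 0, 0),  # technical failure
])
```

A tool with a high execution rate but a low semantic rate, like `FAQSearch` in this toy data, is a candidate for prompt or routing changes rather than an infrastructure fix.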

4. Exchanges-to-Completion

What it measures:
How many back-and-forth turns it takes to fully complete a task.

Why it matters:

  • User experience: Fewer exchanges usually mean less friction.
  • Cost & latency: More turns = more tokens, more tool calls, more infrastructure.
  • Reliability: High exchange counts often indicate confusion, loops, or partial answers that never quite resolve.

Metric components:

  • Exchanges-to-completion (ETC):
    • Number of user↔agent turns from session start until task completion.
  • Exchanges-after-solved:
    • Extra turns after the agent has technically resolved the issue but continues to talk.

Instrumentation pattern:

  • Mark the step where your task-completion evaluator first returns task_completed = 1.
  • Calculate:
    • ETC = turn_index_of_completion (turns counted from 1)
    • extra_turns = total_turns - turn_index_of_completion
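
The calculation can be sketched as below, assuming the task-completion evaluator emits one 0/1 flag after each turn and that turns are counted from 1. The function name and input shape are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def exchanges_to_completion(
    completed_flags: List[int],
) -> Tuple[Optional[int], Optional[int]]:
    """completed_flags[i] is the evaluator's output after turn i + 1.

    Returns (ETC, extra_turns), or (None, None) if the task never completed.
    """
    total_turns = len(completed_flags)
    for turn_index, flag in enumerate(completed_flags, start=1):
        if flag == 1:
            return turn_index, total_turns - turn_index
    return None, None


sessions = [[0, 0, 1, 1, 1], [1], [0, 1], [0, 0]]
results = [exchanges_to_completion(s) for s in sessions]

# Median ETC over completed sessions only; unfinished ones are excluded here.
median_etc = median(etc for etc, _ in results if etc is not None)
```

Excluding unfinished sessions from the median is a design choice; track them separately through the task completion rate so they are not silently dropped.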

How to use it:

  • Benchmark median ETC per use case.
  • Track drift: if ETC spikes after a prompt change, the change likely introduced confusion or loops.
  • Combine with cost per trace to see the real dollar impact of inefficiency.

Galileo captures trace-level latency and cost, so you can correlate ETC with infra spend.


5. Interaction Quality (Per-Turn Experience)

What it measures:
How clear, helpful, and aligned each response is—beyond whether the agent eventually completed the task.

Why it matters:
Task completion without a usable experience still drives escalations and distrust. For example:

  • Agent resolves an issue but uses hostile or opaque language.
  • Agent over-asks for information it already has in context.

Metric components:

Per-response evaluators for:

  • Clarity
  • Helpfulness
  • Politeness/tone
  • Instruction-following

Instrumentation pattern:

  • Treat each model response as a span.
  • Run evaluators that output 0–1 or 1–5 scores per attribute.
  • Aggregate to:
    • Average interaction quality per session.
    • Distribution by use case or model version.

This is where you move beyond “it worked” to “it worked in a way that people trust.”
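
A short sketch of the session-level aggregation, assuming each response span already carries 0–1 scores for the same set of attributes. The attribute keys and the `session_quality` helper are illustrative.

```python
from statistics import mean
from typing import Dict, List


def session_quality(span_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each attribute across a session's response spans."""
    if not span_scores:
        return {}
    attributes = span_scores[0].keys()
    return {attr: mean(s[attr] for s in span_scores) for attr in attributes}


quality = session_quality([
    {"clarity": 1.0, "helpfulness": 0.5, "tone": 1.0},
    {"clarity": 0.5, "helpfulness": 0.5, "tone": 1.0},
])
```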


6. Safety, Security, and Policy Metrics

What they measure:
Whether the agent respects your boundaries: no PII leaks, no policy violations, and no successful prompt injections or jailbreaks.

Why they matter:
Reliability isn’t just “does it work”; it’s “does it work safely every time.” Safety failures are where the expensive, career-limiting incidents live.

Core safety metrics:

  • PII leak rate: % of outputs that expose sensitive data.
  • Policy violation rate: % of sessions where the agent breaks org policies (e.g., unauthorized refunds, medical advice without disclaimers).
  • Prompt injection / jailbreak success rate:
    • % of adversarial inputs that cause tool misuse or policy bypass.
  • Guardrail interception rate:
    • % of risky behaviors caught and blocked/redacted/overridden before reaching the user or tools.
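
Each of these is a simple ratio once the underlying events are counted. The sketch below assumes hypothetical daily counts; the numbers are made up to show the arithmetic, not measured results.

```python
from typing import Dict


def rate(numerator: int, denominator: int) -> float:
    """Ratio that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for one day of traffic.
counts = {
    "outputs": 20_000, "outputs_with_pii": 4,
    "sessions": 5_000, "sessions_with_policy_violation": 5,
    "adversarial_inputs": 200, "adversarial_successes": 3,
    "risky_behaviors": 50, "risky_behaviors_intercepted": 49,
}

safety_metrics: Dict[str, float] = {
    "pii_leak_rate": rate(counts["outputs_with_pii"], counts["outputs"]),
    "policy_violation_rate": rate(counts["sessions_with_policy_violation"],
                                  counts["sessions"]),
    "jailbreak_success_rate": rate(counts["adversarial_successes"],
                                   counts["adversarial_inputs"]),
    "guardrail_interception_rate": rate(counts["risky_behaviors_intercepted"],
                                        counts["risky_behaviors"]),
}
```

Note the different denominators: outputs for PII leaks, sessions for policy violations, adversarial inputs for jailbreaks, and risky behaviors for interceptions.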

Galileo’s Protect module runs these as real-time guardrails with sub-200ms latency, so the same evaluators you use for offline testing become production firewalls.


7. Drift and Regression Metrics

What they measure:
How agent behavior shifts over time—across model changes, prompt updates, or new tools—and whether those changes improve or harm reliability.

Why they matter:
You don’t just deploy once; you iterate. Without drift metrics, you’re back to “hope” as your release criterion.

Examples:

  • Task completion drift:
    • Δ in task completion rate compared to previous version.
  • Tool usage drift:
    • Changes in tool selection patterns (e.g., SearchAPI usage jumps from 20% → 60% of sessions after a prompt tweak).
  • Safety drift:
    • Changes in PII or policy violation rates over time.
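
A minimal sketch of two of these drift calculations, assuming you can export per-version completion rates and the list of tools called in each session. The helper names are hypothetical, and the toy data reproduces the 20% to 60% SearchAPI example above.

```python
from collections import Counter
from typing import Dict, List


def completion_drift(previous_rate: float, current_rate: float) -> float:
    """Change in task completion rate versus the previous version."""
    return current_rate - previous_rate


def tool_usage_share(sessions: List[List[str]]) -> Dict[str, float]:
    """Share of sessions in which each tool was called at least once."""
    counts = Counter(tool for tools in sessions for tool in set(tools))
    return {tool: n / len(sessions) for tool, n in counts.items()}


def usage_drift(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-tool change in session share between two versions."""
    tools = set(before) | set(after)
    return {t: after.get(t, 0.0) - before.get(t, 0.0) for t in tools}


before = tool_usage_share([["SearchAPI"], ["BillingAPI"], ["BillingAPI"],
                           ["BillingAPI"], ["BillingAPI"]])
after = tool_usage_share([["SearchAPI"], ["SearchAPI", "SearchAPI"],
                          ["SearchAPI", "BillingAPI"], ["BillingAPI"],
                          ["BillingAPI"]])
drift = usage_drift(before, after)
```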

Galileo’s Signals automatically scans traces for these patterns, so you detect “unknown unknowns”—like a new model overusing a risky tool—before users complain.


How to Put These Metrics into a Production Workflow

Step 1: Define your reliability spec

Instead of “the agent should be good,” write something like:

  • ≥ 90% task completion on labeled eval sets and ≥ 85% in production.
  • ≥ 95% tool choice correctness for top-5 intents.
  • ≤ 3 exchanges-to-completion median for account questions.
  • ≤ 0.1% policy violation rate across all sessions.
  • 0 critical safety incidents escaping guardrails.

This becomes your release gate.
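
As a sketch, the example spec above can be encoded as data and checked in CI before a release. The metric keys and the `failed_checks` helper are hypothetical; the thresholds mirror the bullets above, with 0.1% written as 0.001.

```python
import operator
from typing import Callable, Dict, List, Tuple

# (metric name, comparison, threshold), mirroring the example spec.
SPEC: List[Tuple[str, Callable[[float, float], bool], float]] = [
    ("task_completion_eval", operator.ge, 0.90),
    ("task_completion_prod", operator.ge, 0.85),
    ("tool_choice_correctness_top5", operator.ge, 0.95),
    ("median_etc_account", operator.le, 3),
    ("policy_violation_rate", operator.le, 0.001),
    ("critical_incidents_escaped", operator.eq, 0),
]


def failed_checks(measured: Dict[str, float]) -> List[str]:
    """Names of spec entries that are violated; missing metrics also fail."""
    return [
        name
        for name, compare, threshold in SPEC
        if name not in measured or not compare(measured[name], threshold)
    ]


measured = {
    "task_completion_eval": 0.93,
    "task_completion_prod": 0.80,   # below the 0.85 gate
    "tool_choice_correctness_top5": 0.96,
    "median_etc_account": 4,        # above the 3-exchange gate
    "policy_violation_rate": 0.0005,
    "critical_incidents_escaped": 0,
}
failures = failed_checks(measured)
release_allowed = not failures
```

Treating a missing metric as a failure keeps the gate from passing simply because a measurement was never collected.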

Step 2: Build and label an eval dataset

Use:

  • Synthetic data: To cover edge cases and adversarial prompts.
  • Development data: From QA and internal dogfooding.
  • Production data: Sampled traces from real users.

Add SME annotations for:

  • Whether tasks were completed.
  • Which tools should have been used.
  • Whether responses matched policy.

In Galileo, this becomes your golden test set—a living evaluation asset you can re-run on every model/prompt/config change.

Step 3: Turn evaluators into Luna / Luna-2 models

LLM-as-judge evaluators are powerful but typically too slow and expensive to run on 100% of traffic. Galileo’s approach:

  • Take your best evaluators (task completion, tool success, safety).

  • Distill them into compact evaluation models (Luna, Luna-2).

  • Serve them on a dedicated inference stack so they can run with:

    • Sub-200ms latency
    • 97% lower cost vs. heavyweight LLM judges
    • 100% traffic coverage

Now you’re not sampling; you’re scoring everything.

Step 4: Wire evals into real-time guardrails

With Protect, these metrics stop being passive dashboards and start controlling behavior:

  • If task completion is unlikely after N exchanges:
    • Escalate to a human or trigger a simplified fallback flow.
  • If tool selection quality is low for a risky tool:
    • Block that tool call and switch to a safer one.
  • If a safety evaluator flags PII or policy violation:
    • Redact the content, override with a safe response, or send a webhook to your incident system.

Critically, you can version and roll back guardrail policies without redeploying your app—so you can iterate on reliability as fast as you iterate on prompts.
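
The three rules above can be sketched as a small decision function. This is an illustrative stand-in, not Protect's policy format: the `TurnScores` fields, the `IssueRefund` tool name, and every threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical policy settings; tune these against your own eval data.
RISKY_TOOLS = {"IssueRefund"}
MAX_EXCHANGES = 6
MIN_COMPLETION_LIKELIHOOD = 0.3
MIN_TOOL_SELECTION_QUALITY = 0.5


@dataclass
class TurnScores:
    """Evaluator outputs available before the next action executes."""
    exchanges: int
    completion_likelihood: float    # 0-1
    tool_name: str
    tool_selection_quality: float   # 0-1
    safety_flagged: bool


def decide(scores: TurnScores) -> str:
    """Map evaluator scores to an action; safety checks take priority."""
    if scores.safety_flagged:
        return "redact"
    if (scores.tool_name in RISKY_TOOLS
            and scores.tool_selection_quality < MIN_TOOL_SELECTION_QUALITY):
        return "block_tool"
    if (scores.exchanges >= MAX_EXCHANGES
            and scores.completion_likelihood < MIN_COMPLETION_LIKELIHOOD):
        return "escalate"
    return "allow"


action = decide(TurnScores(7, 0.1, "FAQSearch", 0.9, False))
```

Keeping the thresholds as named constants, separate from the decision logic, is what makes it practical to version and roll back a policy independently of application code.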

Step 5: Use Signals to catch new failure modes

Even with good metrics, you’ll still get surprised—new prompts, new tools, new models. Signals analyzes 100% of traces to:

  • Surface clusters of failures (e.g., “multi-step billing refunds failing after step 3”).
  • Highlight tool misuse patterns.
  • Suggest new evaluators to add, which can be auto-generated as LLM judges.

You can then turn a detected pattern into a reusable evaluator and, ultimately, a guardrail policy. That’s the eval-to-guardrail loop closing on itself.


How Galileo Implements These Metrics Out of the Box

Galileo is built around these reliability metrics so you don’t have to reinvent the instrumentation:

  • Evaluate

    • 20+ out-of-the-box evaluators for RAG, agents, safety, and security.
    • Custom evaluators from natural language descriptions (LLM-as-judge) that you can refine with SME feedback and CLHF-style few-shot tuning.
    • Golden test sets, prompt versioning, and trace-level cost/latency breakdowns.
  • Signals

    • Automatically analyzes sessions → traces → spans to surface:
      • Task-completion failures
      • Tool selection issues
      • Safety drift and cascading failures
    • Converts discovered patterns into new evaluators you can reuse.
  • Protect

    • Runs compact evaluation models (Luna / Luna-2) in real time:
      • Sub-200ms guardrailing
      • Monitoring of 100% of traffic at 97% lower cost than heavyweight LLM judges
    • Intercepts risky behavior with explicit actions:
      • Block, redact, override, webhook
    • Guardrail policies with versioning and rollbacks—no glue code.

Enterprise teams deploy Galileo as SaaS, in VPC, or on-prem, backed by SOC 2 Type II and HIPAA-ready infrastructure with BAAs, so reliability metrics and guardrails can live where your data lives.


Summary

If you’re only tracking “did the model respond” or NPS, you don’t have agent reliability—you have a demo.

In production, the metrics that actually measure agent reliability are:

  • End-to-end task completion
  • Tool selection quality and tool success rate
  • Exchanges-to-completion
  • Interaction quality
  • Safety & policy adherence
  • Drift and regression across versions

Measured continuously, on 100% of traffic, and wired into real-time guardrails, these metrics turn your agents from interesting experiments into dependable systems.

That’s what Galileo was built for: taking evaluation seriously enough that it can govern agent behavior—before users feel the failure, and before a bad action executes.

