
LLM eval tooling that can run in CI/CD to gate prompt/model/router changes before deploy
Most teams discover the hard way that the scariest bugs in LLM applications don’t show up in unit tests—they show up in production, in front of real users. If you’re changing prompts, swapping models, or tweaking a router and you can’t run LLM evals in CI/CD, you’re effectively shipping blind.
Quick Answer: You need LLM eval tooling that plugs directly into CI/CD, runs offline evaluations on every change (prompt/model/router), and gates deploys based on clear quality thresholds. With Arize AX and Arize Phoenix, you treat prompts and agents like code: traced with OTEL, evaluated with “LLM as a Judge” and code checks, and promoted only when experiments show no regressions.
Why This Matters
LLM apps and agents are inherently non-deterministic. A change that “looks good” in a playground can quietly degrade accuracy, tool selection, or safety on long-tail traffic. Without evaluation-driven CI/CD, your only feedback loop is production incidents and user complaints.
Putting evals into your CI/CD pipeline gives you a repeatable quality bar: every PR runs the same tests on the same datasets, compares against the current baseline, and fails the build when regressions show up. This is how you move from demo-ware to “Ship Agents that Work” in environments with real SLOs, compliance constraints, and cost limits.
Key Benefits:
- Catch regressions before they reach users: Offline evals flag prompt, model, or router changes that reduce accuracy, safety, or adherence to policies.
- Standardize quality across teams: Shared datasets, eval templates, and CI checks enforce consistent behavior across agents, services, and routes.
- Close the loop between production and development: Production traces and edge cases become curated datasets that drive better prompts, routing logic, and guardrails over time.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Offline LLM evaluation in CI/CD | Running LLM evals (LLM-as-a-Judge, code evals, custom metrics) on a fixed dataset as part of your build/test pipeline before deploy. | Lets you gate prompt/model/router changes using hard quality thresholds instead of eyeballing samples or relying on intuition. |
| Evaluation-driven experiments | Structured A/B or multi-variant tests (e.g., current prompt vs. new prompt, model A vs. model B, router v1 vs. v2) scored on shared evaluators. | Makes it easy to compare variants apples-to-apples and automatically pick the best one—or block a release if nothing clears the bar. |
| Open standard tracing for eval inputs | Using OpenTelemetry + OpenInference to log spans/traces for prompts, tools, and responses, then turning those logs into eval datasets. | Ensures you can reproduce real production flows, create realistic test sets, and avoid proprietary lock-in across frameworks and vendors. |
How It Works (Step-by-Step)
One platform. Offline + online evals wired into your pipeline so every change is tested before production.
At a high level, CI/CD-friendly LLM eval tooling looks like this:
1. Instrument & trace everything with OTEL/OpenInference
- Add OpenTelemetry instrumentation to your LLM calls, tools, and routers (or use Arize Phoenix for open-source tracing and evaluation).
- Adopt OpenInference-style span semantics: prompt spans, model spans, tool-call spans, router decision spans, and session-level traces.
- In development and staging, log enough metadata to reproduce test scenarios: inputs, ground-truth labels (when available), tool outputs, and expected behavior.
Outcome: every agent request is a trace that can be replayed, turned into a dataset, and evaluated consistently across versions.
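To make the span structure concrete, here is a minimal, standard-library-only sketch of one request recorded as a tree of spans. It is not the OpenTelemetry SDK: `Recorder`, `Span`, `handle_request`, and the attribute keys (`kind`, `route`, `tool`) are illustrative names, and the routing rule is a placeholder. In a real setup you would create these spans with an OpenTelemetry tracer and use OpenInference attribute names.

```python
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """Simplified stand-in for an OpenTelemetry span (not the real SDK class)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: Dict[str, Any] = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0


class Recorder:
    """Collects finished spans in memory so a request can be inspected or exported."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            name=name,
            # Child spans inherit the trace ID; a root span starts a new trace.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            attributes=dict(attributes),
            start_ns=time.time_ns(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end_ns = time.time_ns()
            self._stack.pop()
            self.finished.append(current)


recorder = Recorder()


def handle_request(query: str) -> str:
    """Toy agent: one router decision and one tool call under a root span."""
    with recorder.span("agent.request", kind="AGENT", input=query) as root:
        with recorder.span("router.decide", kind="CHAIN", input=query) as router:
            # Placeholder routing rule, only here to produce a decision to record.
            route = "order_status" if "order" in query.lower() else "technical_issue"
            router.attributes["route"] = route
        with recorder.span("tool.call", kind="TOOL", tool=route) as tool:
            tool.attributes["output"] = json.dumps({"status": "shipped"})
        answer = "Your order has shipped."
        root.attributes["output"] = answer
        return answer
```

Because every child span carries the root's `trace_id` and its parent's `span_id`, the router decision and tool call can later be grouped back into one replayable request.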
2. Build evaluation datasets from traces
- Use traces from:
- Historical production interactions (with PII-handling that matches your data constraints).
- Synthetic or curated test cases for edge conditions (compliance, safety, rare tools, long contexts).
- In Arize AX or Phoenix, create datasets that capture:
- Input (user query, context, prior messages).
- Expected behavior (correct answer, correct tool, correct route, policy tags, etc.).
- Metadata slices (tenant, region, product, language) so you can slice metrics by critical cohorts.
Outcome: you now have versioned, reproducible datasets that represent the traffic and edge cases you actually care about.
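One way to picture a versioned dataset row is sketched below. The trace record shape (`trace_id`, `query`, `label_route`, and so on) and the helper names `trace_to_example` and `dataset_version` are assumptions made for illustration, not an Arize AX or Phoenix schema. The version string is a content hash, so the same examples always produce the same version regardless of order.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Example:
    example_id: str
    input: Dict[str, Any]       # user query, context, prior messages
    expected: Dict[str, Any]    # correct answer, tool, or route
    metadata: Dict[str, str] = field(default_factory=dict)  # slice keys


def trace_to_example(trace: Dict[str, Any]) -> Example:
    """Map one exported trace record (hypothetical shape) to a dataset row."""
    return Example(
        example_id=trace["trace_id"],
        input={"query": trace["query"], "history": trace.get("history", [])},
        expected={"route": trace["label_route"], "answer": trace.get("label_answer")},
        metadata={k: trace[k] for k in ("tenant", "region", "language") if k in trace},
    )


def dataset_version(examples: List[Example]) -> str:
    """Content hash: identical examples yield the same version, in any order."""
    rows = sorted((asdict(e) for e in examples), key=lambda r: r["example_id"])
    payload = json.dumps(rows, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

Pinning a CI run to a version string like this makes it obvious when a score moved because the dataset changed rather than the prompt.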
3. Define evaluators: LLM-as-a-Judge + code checks
- LLM-as-a-Judge evaluators:
- Use open models or your own to auto-score behavior:
- Answer correctness / factuality vs. ground truth.
- Tool selection correctness (did it call the right tool or sequence of tools?).
- Parameter extraction accuracy (did it parse structured fields correctly?).
- Policy/safety checks (toxicity, PII leakage, hallucination risk).
- Codify them as reusable templates so CI can call them consistently.
- Code-based evaluators:
- Deterministic checks (JSON schema validation, regex checks, business rules).
- Numeric metrics (latency, token usage, rate-limit compliance).
In Arize AX:
- “LLM as a Judge” lets you define and run these evaluator templates at scale with no black-box scoring.
- You choose models and prompts; Arize orchestrates evaluations and stores scores.
Outcome: every candidate change is scored on the same rulebook: not just “does it look good?” but “does it meet our quantitative bar?”
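The two evaluator families can share one return shape so CI treats them uniformly. The sketch below is a rough illustration, not Arize's evaluator API: `json_fields_eval`, `no_email_eval`, `correctness_judge`, and `JUDGE_TEMPLATE` are hypothetical names, and `call_model` is whatever function you supply to send a prompt to your judge model and return its text.

```python
import json
import re
from typing import Any, Callable, Dict

Score = Dict[str, Any]  # {"name": str, "score": float, "explanation": str}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

JUDGE_TEMPLATE = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: correct or incorrect."""


def json_fields_eval(output: str, required: Dict[str, type]) -> Score:
    """Code evaluator: output must be a JSON object with typed required fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"name": "json_fields", "score": 0.0, "explanation": f"invalid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"name": "json_fields", "score": 0.0, "explanation": "expected a JSON object"}
    bad = [k for k, t in required.items() if not isinstance(data.get(k), t)]
    if bad:
        return {"name": "json_fields", "score": 0.0, "explanation": f"missing or mistyped: {bad}"}
    return {"name": "json_fields", "score": 1.0, "explanation": "ok"}


def no_email_eval(output: str) -> Score:
    """Code evaluator: a crude PII check that flags anything shaped like an email."""
    leaked = EMAIL_RE.search(output) is not None
    return {"name": "no_email", "score": 0.0 if leaked else 1.0,
            "explanation": "email-like string found" if leaked else "ok"}


def correctness_judge(question: str, reference: str, candidate: str,
                      call_model: Callable[[str], str]) -> Score:
    """LLM-as-a-Judge: the prompt is visible and the model is injected by the caller."""
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, candidate=candidate)
    label = call_model(prompt).strip().lower()
    if label not in ("correct", "incorrect"):
        # Treat unparseable judge output as a failure rather than guessing.
        return {"name": "correctness", "score": 0.0,
                "explanation": f"unparseable judge output: {label!r}"}
    return {"name": "correctness", "score": 1.0 if label == "correct" else 0.0,
            "explanation": label}
```

Keeping the judge prompt as a plain template and passing the model in as a function is what makes the scoring inspectable: you can diff the template in a PR and swap the judge model without touching the evaluator.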
4. Set up CI/CD Experiments to compare variants
- For each change (prompt, model, router):
- Treat the current production version as control.
- Treat the new version (e.g., new prompt or new router logic) as treatment.
- In the experiment configuration:
- Attach your evaluation dataset(s).
- Attach your evaluator set: LLM-as-a-Judge metrics, code metrics, and possibly custom metrics.
- Define success criteria:
- Minimum accuracy improvement (e.g., ≥ +2% vs. control).
- No degradation in safety scores.
- Latency and cost must not exceed thresholds.
In Arize AX:
- “CI/CD Experiments” run combinations of prompts/models/routers against your datasets with attached evaluators.
- Each experiment yields a diff vs. baseline with metrics and slice analysis.
Outcome: each PR implicitly becomes an experiment. If the treatment doesn’t beat or match control on critical metrics, you block the deploy.
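The success criteria above reduce to a small comparison between control and treatment aggregates. This sketch makes several assumptions: the "+2%" is read as an absolute gain of 0.02 on a 0 to 1 accuracy score, the latency and cost ceilings are placeholder values, and the metric keys (`accuracy`, `safety`, `p95_latency_ms`, `cost_usd_per_request`) are hypothetical rather than names from any platform.

```python
from dataclasses import dataclass
from typing import Dict, List

EPS = 1e-9  # tolerance so float rounding does not flip a borderline comparison


@dataclass(frozen=True)
class Criteria:
    min_accuracy_gain: float = 0.02  # assumption: absolute gain on a 0-1 score
    max_latency_ms: float = 2000.0   # hypothetical budget
    max_cost_usd: float = 0.01       # hypothetical budget per request


def evaluate_release(control: Dict[str, float], treatment: Dict[str, float],
                     criteria: Criteria = Criteria()) -> List[str]:
    """Return human-readable failures; an empty list means the treatment may ship."""
    failures: List[str] = []
    gain = treatment["accuracy"] - control["accuracy"]
    if gain + EPS < criteria.min_accuracy_gain:
        failures.append(
            f"accuracy gain {gain:+.3f} is below required {criteria.min_accuracy_gain:+.3f}")
    if treatment["safety"] + EPS < control["safety"]:
        failures.append(
            f"safety degraded from {control['safety']:.3f} to {treatment['safety']:.3f}")
    if treatment["p95_latency_ms"] > criteria.max_latency_ms:
        failures.append(
            f"p95 latency {treatment['p95_latency_ms']:.0f} ms exceeds {criteria.max_latency_ms:.0f} ms")
    if treatment["cost_usd_per_request"] > criteria.max_cost_usd:
        failures.append(
            f"cost ${treatment['cost_usd_per_request']:.4f} exceeds ${criteria.max_cost_usd:.4f}")
    return failures
```

Returning a list of reasons instead of a single boolean gives the PR author something actionable in the CI log when the gate blocks a change.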
5. Wire experiments into your CI/CD pipeline
You don’t need proprietary pipeline tooling here, just standard CI calling into your eval stack. Typical pattern:
- Step 1 – Build & test:
- Run unit tests and integration tests.
- Build images / artifacts.
- Step 2 – Launch experiment:
- CI calls Arize AX/Phoenix via API with:
- Dataset reference(s).
- Variant definitions (prompt template, model, router config).
- Evaluator set.
- Start an evaluation job.
- Step 3 – Wait & assert:
- CI polls the evaluation job or waits on a webhook.
- Pull aggregate metrics for control vs. treatment.
- Compare against thresholds; if any metric fails, fail the build.
- Step 4 – Deploy on success:
- If all thresholds pass, merge or promote the config to production.
- Version the prompt/router in a prompt hub or config store so you can roll back quickly.
Outcome: prompt and router changes are treated exactly like code: they must pass automated checks before they’re allowed anywhere near users.
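The "wait and assert" step can be a short script whose exit code is the gate. In this sketch, `fetch_status` is a hypothetical callable that wraps whatever API or SDK call your eval platform exposes for reading a job's state; the `state` and `metrics` keys are likewise assumed, not a documented response format.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(fetch_status: Callable[[], Dict[str, Any]],
                 timeout_s: float = 1800.0, poll_s: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> Dict[str, Any]:
    """Poll until the evaluation job reaches a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status["state"] in ("succeeded", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"evaluation job still {status['state']!r} after {timeout_s:.0f}s")
        sleep(poll_s)


def gate(fetch_status: Callable[[], Dict[str, Any]],
         thresholds: Dict[str, float], **wait_kwargs: Any) -> int:
    """Return a process exit code: 0 allows the deploy, 1 blocks it."""
    try:
        status = wait_for_job(fetch_status, **wait_kwargs)
    except TimeoutError as exc:
        print(f"eval gate: {exc}")
        return 1
    if status["state"] != "succeeded":
        print("eval gate: evaluation job failed")
        return 1
    metrics = status.get("metrics", {})
    blocked = False
    for name, floor in thresholds.items():
        value = metrics.get(name)
        # A missing metric blocks the deploy, same as a metric below its floor.
        if value is None or value < floor:
            print(f"eval gate: {name}={value} is below threshold {floor}")
            blocked = True
    return 1 if blocked else 0
```

In a pipeline you would end the script with `sys.exit(gate(fetch_status, thresholds))`, so any CI system that fails a step on a non-zero exit code enforces the gate. Treating a timeout or a missing metric as a failure is a deliberate choice here: the gate fails closed.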
6. Feed production behavior back into evals (closing the loop)
- In production, continue tracing agents via OTEL/OpenInference into Arize AX or Phoenix: full multi-agent graphs, tool spans, and user sessions.
- Set up Online Evals for real-time scoring on sampled production traffic (LLM-as-a-Judge + code checks):
- Catch regressions soon after a release slips past offline tests.
- Raise alerts when quality drops below SLOs per slice (e.g., a particular region or tenant).
- Use Human Annotation and Queues:
- Route tricky or high-impact samples to human reviewers.
- Build “golden datasets” from those decisions.
- Periodically sync those updated datasets back into your CI/CD experiments.
Outcome: you have a build–learn–improve loop powered by production behavior, not just synthetic examples.
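Sampling and per-slice alerting can be illustrated in a few lines. This is a sketch under stated assumptions: `sampled` and `slices_below_slo` are hypothetical helpers, scores are assumed to be on a 0 to 1 scale, and the `min_count` guard is an added safeguard so a slice with only a handful of scored requests does not trigger an alert.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sampled(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < rate


def slices_below_slo(scored: Iterable[Tuple[str, float]], slo: float,
                     min_count: int = 20) -> Dict[str, float]:
    """Return the mean score of every slice whose mean falls below the SLO."""
    totals: Dict[str, List[float]] = defaultdict(lambda: [0.0, 0.0])
    for slice_name, score in scored:
        totals[slice_name][0] += score
        totals[slice_name][1] += 1
    return {
        name: total / count
        for name, (total, count) in totals.items()
        if count >= min_count and total / count < slo
    }
```

Hashing the trace ID instead of calling a random number generator means every service that sees the same request makes the same sampling decision, so a sampled trace is scored end to end.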
Common Mistakes to Avoid
1. Treating LLM eval as a one-time tuning exercise
- Mistake: Run a one-off benchmark when launching, then stop evaluating changes.
- How to avoid it:
- Make eval a required CI stage for every prompt, model, and router change.
- Version datasets and evaluators; run them consistently over time to track drift and regressions.
2. Using opaque, proprietary “quality scores” with no control
- Mistake: Rely on black-box eval models or vendor-specific scoring you can’t inspect, reproduce, or move.
- How to avoid it:
- Use open models and open-source metrics where possible.
- Stick with OpenTelemetry and OpenInference so your traces and datasets remain portable.
- In Arize, configure your own LLM-as-a-Judge prompts and models so you can audit and iterate on the evaluators themselves.
Real-World Example
Imagine a global marketplace rolling out a multi-agent support assistant. It uses:
- A router agent to choose between “order status,” “refund policy,” and “technical issue” tools.
- A response agent that summarizes tool outputs into user-facing answers.
Every week, the team tweaks:
- The router’s prompt to better handle ambiguous queries.
- The response agent’s prompt for tone and clarity.
- The model choice in certain regions to balance latency and cost.
Here’s how they use CI/CD eval tooling to stay sane:
- Tracing: All agent calls, tool invocations, and routing decisions are traced via OTEL + OpenInference into Arize Phoenix in lower environments and Arize AX in production.
- Datasets: They build datasets from real production sessions—carefully anonymized—including:
- Queries with known “correct” routes (from historical logs and human review).
- Sessions where users had to re-ask or escalate (negative examples).
- Evaluators:
- A router correctness judge: given the query and the tools, the judge scores whether the selected route was correct.
- An answer correctness judge: compares final answer to ground truth and checks if required constraints (e.g., no policy deviations) are met.
- Code evals: JSON structure validity for tool parameters, plus latency, token, and dollar-cost budgets.
- CI/CD Integration:
- Every PR that touches prompt or router config triggers a CI job:
- Launch an Arize CI/CD Experiment with control (current router/prompt) vs. treatment (proposed changes).
- Run all evaluators across a fixed dataset of ~5k real queries.
- CI fails if:
- Router correctness drops by >1%.
- Safety-related scores drop at all.
- Latency or cost per request exceeds budget.
- Online Evals & Alerts:
- After a successful deploy, they run Online Evals on sampled live traffic to watch for unexpected slice regressions (e.g., non-English users).
- When a new failure pattern shows up (e.g., a new kind of ambiguous query), they send those traces to a Human Annotation queue, label the correct behavior, and add them to the CI dataset.
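Folding reviewed traces back into the CI dataset is essentially an upsert keyed on example ID. The sketch below assumes each row is a dictionary with `example_id` and `expected` keys and tags merged rows with a `source` field; these names and the `merge_annotations` helper are illustrative only.

```python
from typing import Any, Dict, Iterable, List


def merge_annotations(dataset: List[Dict[str, Any]],
                      annotated: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Upsert by example_id; a human label replaces any older label for that example."""
    by_id = {row["example_id"]: row for row in dataset}
    for row in annotated:
        if row.get("expected") is None:
            continue  # still waiting in the annotation queue, nothing to merge yet
        previous = by_id.get(row["example_id"], {})
        by_id[row["example_id"]] = {**previous, **row, "source": "human_annotation"}
    # Sort so the merged dataset is stable across runs.
    return sorted(by_id.values(), key=lambda r: r["example_id"])
```

Running this on a schedule, then re-pinning CI to the new dataset version, is one way to keep the offline gate aligned with the failure patterns seen in production.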
Over time, the agent becomes more reliable, not because anyone found the “perfect” prompt, but because the team built a tight loop: trace → evaluate → experiment → deploy → monitor → annotate → enrich datasets.
Pro Tip: Start small by gating only the most critical metrics (e.g., safety and routing correctness) in CI, then gradually expand into accuracy, style, and cost as your test datasets and evaluators mature. That way you don’t block shipping while you’re still stabilizing your eval stack.
Summary
LLM eval tooling in CI/CD is the difference between “it worked in my playground” and “it works, every day, for real users.” By instrumenting with OpenTelemetry, building datasets from real traces, defining transparent LLM-as-a-Judge and code evaluators, and wiring CI/CD Experiments into your pipeline, you can safely iterate on prompts, models, and routers without flying blind.
With Arize AX and Arize Phoenix, you get open-standard tracing, offline + online evals, and human-in-the-loop annotation tied into a single loop—so you can detect prompt and agent regressions early, enforce SLOs, and ship agents that actually work in production.