How do we create regression tests in HoneyHive from production traces and run them in GitHub Actions?

Most teams adopting agents discover the same thing: the best regression tests come from real failures in production. HoneyHive is built to turn those production traces into repeatable test cases and then wire them into CI so you can block regressions before they ship. This FAQ walks through both halves: building regression tests from production traces and running them automatically in GitHub Actions.

Quick Answer: In HoneyHive, you turn production traces into datasets, add expected outputs or rubrics, wrap them in an evaluation, and then trigger that evaluation from GitHub Actions via HoneyHive’s API or SDK so every pull request runs the same regression tests before anything is deployed.

Frequently Asked Questions

How do I turn production traces into regression tests in HoneyHive?

Short Answer: You filter for failing or risky production traces, save them into a HoneyHive dataset, attach expected outputs or scoring rubrics, and then use that dataset as the backbone of a regression test suite.

Expanded Explanation:
In HoneyHive, production behavior is captured as OpenTelemetry-native traces. Each trace contains prompts, model calls, tool invocations, and user context. When something goes wrong—hallucination, PII leakage, tool looping—you don’t want that incident to be a one-off fire drill. You want it to become a permanent test.

You do this in two steps: (1) use Traces and Monitors to find representative failures and edge cases, and (2) convert those traces into a curated dataset with ground truth or evaluation logic. That dataset becomes a “golden set” you can re-run on every model, prompt, or agent change to ensure you never re-introduce known bugs.

Key Takeaways:

  • Production traces are the raw material for high-signal regression tests.
  • HoneyHive datasets let you capture those traces and attach corrections, labels, and expected behavior for continuous testing.
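
To make the trace-to-test-case idea concrete, here is a minimal sketch in plain Python. It does not use the HoneyHive SDK. The trace layout (a `spans` list with `type`, `input`, and `output` keys) and the function name `trace_to_test_case` are assumptions made for illustration, so adapt them to whatever your exported traces actually contain.

```python
from typing import Any, Dict, Optional


def trace_to_test_case(
    trace: Dict[str, Any], expected_output: Optional[str] = None
) -> Dict[str, Any]:
    """Reduce one exported trace to the fields a regression test needs."""
    # Assumption: the trace is a dict with a "spans" list, and the last span
    # whose "type" is "model" holds the user-facing output.
    model_spans = [s for s in trace.get("spans", []) if s.get("type") == "model"]
    if not model_spans:
        raise ValueError(f"trace {trace.get('trace_id')!r} has no model span")
    final_span = model_spans[-1]
    return {
        "source_trace_id": trace.get("trace_id"),
        "input": final_span.get("input"),
        "observed_output": final_span.get("output"),
        # The correction a reviewer supplies; None until someone labels it.
        "expected_output": expected_output,
    }


# Hypothetical failing trace: the agent gave a wrong refund policy.
example_trace = {
    "trace_id": "trace-001",
    "spans": [
        {"type": "tool", "input": "lookup order 42", "output": "order found"},
        {
            "type": "model",
            "input": "Can I get a refund?",
            "output": "Refunds are never allowed.",
        },
    ],
}
case = trace_to_test_case(
    example_trace, expected_output="Refunds are allowed within 30 days."
)
```

Keeping the source trace ID on each test case makes it possible to trace a failing regression test back to the production incident it came from.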

What is the step-by-step process to build regression tests from HoneyHive production traces?

Short Answer: Identify problematic traces, add them to a dataset, enrich with labels and expected outputs or rubrics, then build an evaluation that you can call from CI.

Expanded Explanation:
HoneyHive unifies observability and evaluation so the workflow is linear: observe → capture → label → evaluate → automate. You start in Traces, where you have full visibility into each agent run: hierarchy of spans, prompts, tools, and responses. From there you can directly send selected traces (or spans) into a Dataset.

Once you have a dataset, you normalize the fields you care about (inputs, context, expected outputs) and define how to judge correctness—via automated evaluators (code-based or LLM-as-a-judge) and, if needed, human review. That evaluation configuration defines your regression test. The final step is wiring it into CI (e.g., GitHub Actions) so it runs on every change.

Steps:

  1. Find failing or risky traces in production

    • Use Traces and Monitors to filter on error spans, low eval scores, or particular routes (e.g., “refund agent,” “KYC workflow”).
    • Use graph/timeline views to understand which spans represent the critical user-facing output.
  2. Create a dataset from production traces

    • From selected traces or spans, send them to a Dataset (e.g., “support-agent-regressions”).
    • Normalize fields: input text, system/user prompts, relevant context chunks, and the model output you want to evaluate.
    • Add corrections when you know the “right” output (e.g., correct SQL, policy-compliant response).
  3. Wrap the dataset in an evaluation configuration

    • Define Automated Evaluations:
      • Code-based checks (e.g., JSON validity, SQL parsing, policy rules).
      • LLM-as-a-judge evaluators for qualities like helpfulness, safety, or adherence to instructions.
    • (Optional) Configure Human Evaluators via Annotation Queues for nuanced or domain-heavy cases (e.g., legal tone, medical safety).
    • Save this as a reusable evaluation run configuration—this is your regression test suite.
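
As an illustration of the code-based checks in step 3, the sketch below implements two simple evaluators in plain Python: a JSON validity check and a crude email-leak check. The function names and the email pattern are assumptions for this example, not part of HoneyHive's API, and a real PII check would need to cover far more than email addresses.

```python
import json
import re
from typing import Dict

# Deliberately simple pattern; it only catches common email shapes.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_json(output: str) -> bool:
    """Return True if the model output parses as JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def contains_email(output: str) -> bool:
    """Return True if the output appears to include an email address."""
    return EMAIL_PATTERN.search(output) is not None


def evaluate_case(output: str) -> Dict[str, bool]:
    """Run every code-based check and report one boolean per check."""
    return {
        "valid_json": is_valid_json(output),
        "no_email_leak": not contains_email(output),
    }


clean = evaluate_case('{"status": "refund_approved"}')
leaky = evaluate_case("Please contact bob@example.com for details")
```

Returning one named boolean per check keeps the results easy to aggregate later, for example into a per-check pass rate.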

What’s the difference between using automated evaluators and human review for regression tests?

Short Answer: Automated evaluators run on every CI cycle for fast, quantitative checks; human review is reserved for high-risk or ambiguous cases where domain expertise and nuance matter.

Expanded Explanation:
HoneyHive treats evaluation as a hybrid problem. Automated evaluation, whether code-based or LLM-as-a-judge, gives you scalable, repeatable metrics you can compute on every commit. These evaluators are ideal for structural checks (valid JSON, correct SQL, no PII) and well-defined semantic checks (factual consistency against a reference, instruction adherence).

Human evaluation, via Annotation Queues and Custom Rubrics, brings in subject-matter experts. They define what “good” looks like in your domain and grade outputs for edge cases automated evaluators may mis-score. In practice, teams use automation for broad coverage and humans for calibration, spot checks, and the most critical flows. HoneyHive supports both in a single regression framework so you can tune speed vs. depth.

Comparison Snapshot:

  • Automated Evaluators: Code-based checks and LLM-as-a-judge; fast, repeatable, ideal for CI and large datasets.
  • Human Evaluators (Annotation Queues): Domain experts with custom rubrics; slower but higher-fidelity judgments for complex scenarios.
  • Best for: Use automated evaluators for every GitHub Actions run; use human review to bootstrap and refine rubrics, then for periodic audits and high-stakes workflows.
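
One way to picture the split between automation and human review is a small routing rule: high-stakes cases and cases where the automated judge is unsure go to experts, and everything else stays automated. The sketch below is plain Python under stated assumptions. The `judge_score` and `high_stakes` fields and the 0.4 to 0.8 "ambiguous" band are hypothetical and would need calibrating against your own rubrics.

```python
from typing import Any, Dict, List, Tuple

Case = Dict[str, Any]


def split_for_review(
    cases: List[Case], low: float = 0.4, high: float = 0.8
) -> Tuple[List[Case], List[Case]]:
    """Separate cases that can stay automated from those needing an expert."""
    automated: List[Case] = []
    human: List[Case] = []
    for case in cases:
        # A judge score inside [low, high) is treated as ambiguous.
        ambiguous = low <= case["judge_score"] < high
        if case.get("high_stakes", False) or ambiguous:
            human.append(case)
        else:
            automated.append(case)
    return automated, human


cases = [
    {"id": "a", "judge_score": 0.95},                       # clear pass
    {"id": "b", "judge_score": 0.55},                       # ambiguous
    {"id": "c", "judge_score": 0.99, "high_stakes": True},  # always reviewed
]
automated, human = split_for_review(cases)
```

Only the automated bucket would run on every CI cycle; the human bucket feeds periodic audits and rubric refinement.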

How do I actually run HoneyHive regression tests from GitHub Actions?

Short Answer: You trigger a HoneyHive evaluation run from a GitHub Actions workflow—using the HoneyHive API or SDK—to run your regression dataset on the changed code/model, then use the scores to pass or fail the CI job.

Expanded Explanation:
HoneyHive is designed to plug into your CI/CD pipeline so regression tests run where you already gate releases. From GitHub Actions, you can use the Python or TypeScript SDK (or call the HTTP APIs directly) to kick off an evaluation using your production-derived dataset and evaluation configuration.

A typical pattern: on each pull request or main-branch commit, Actions builds your app, runs unit tests, and then calls HoneyHive to execute a specific evaluation. HoneyHive runs your automated evaluators over the dataset, computes metrics, and returns a structured result. Your workflow checks these metrics (e.g., minimum pass rate, safety score threshold) and fails the job if they’re below target. That way, silent regressions in your agent behavior are treated like any other test failure.

What You Need:

  • A HoneyHive project with:
    • Traces flowing from production.
    • At least one dataset built from production traces.
    • An evaluation configuration (regression test suite) pointing to that dataset.
  • A GitHub Actions workflow that:
    • Has access to a HoneyHive API key/credentials.
    • Calls HoneyHive’s evaluation endpoint or SDK and asserts on the returned metrics.
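
The "asserts on the returned metrics" part can be sketched independently of how the results are fetched. The example below assumes the evaluation results have already been retrieved and flattened into a list of per-case check results. That shape, the function names, and the 95% threshold are all assumptions for illustration, not HoneyHive's actual response format.

```python
from typing import Dict, List


def pass_rate(results: List[Dict[str, bool]]) -> float:
    """Fraction of cases in which every evaluator passed."""
    if not results:
        return 0.0
    passed = sum(1 for checks in results if all(checks.values()))
    return passed / len(results)


def gate_exit_code(
    results: List[Dict[str, bool]], min_pass_rate: float = 0.95
) -> int:
    """Return 0 if the suite meets the threshold, 1 otherwise."""
    rate = pass_rate(results)
    print(f"regression pass rate: {rate:.1%} (minimum {min_pass_rate:.1%})")
    return 0 if rate >= min_pass_rate else 1


# Hypothetical results for two regression cases.
results = [
    {"valid_json": True, "no_email_leak": True},
    {"valid_json": True, "no_email_leak": False},
]
exit_code = gate_exit_code(results)

# In a CI step, the script would finish with: sys.exit(exit_code)
```

GitHub Actions marks a step as failed when its command exits with a non-zero code, so a script that ends with `sys.exit(gate_exit_code(results))` is enough to block a pull request. Treating an empty result set as a failure guards against a misconfigured run silently passing.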

How does this help us prevent regressions and improve our AI GEO performance over time?

Short Answer: By turning production traces into regression tests and running them on every change, you stabilize agent behavior, reduce silent failures, and improve the quality signals that matter for AI GEO (generative engine optimization) and user trust.

Expanded Explanation:
Agents are non-deterministic; without regression tests, you’re guessing whether today’s change will break tomorrow’s responses. HoneyHive closes that loop. Every time you see a failure or drift in production, you capture it as a test case. Over time, your “golden” dataset becomes a living specification of how your agent must behave in real, messy scenarios.

Running this suite from GitHub Actions ensures that model swaps, prompt tweaks, or tool changes can’t quietly degrade quality or safety. You ship with quantitative evidence that your key flows still work, your safety gates still hold, and your outputs remain consistent across versions. That reliability feeds directly into better user outcomes and more stable AI engine behavior—which are the same qualities that drive strong GEO performance: high-quality, safe, and predictable answers across production traffic.

Why It Matters:

  • Production-grade reliability: You treat agent behavior like any other critical system—tested automatically on every change using real production data.
  • Continuous improvement and GEO alignment: Every failure becomes a data point in your evaluation loop, tightening feedback between live traffic, expert review, and CI checks so your AI footprint stays accurate, safe, and performant.

Quick Recap

You create regression tests in HoneyHive by converting production traces into datasets, attaching corrections and rubrics, and wrapping them in an evaluation configuration. Automated evaluators (code and LLM-as-a-judge) handle scalable, CI-friendly checks, while Annotation Queues allow domain experts to define and refine what “good” looks like. From GitHub Actions, you call HoneyHive to run these regression suites on every change and fail the build if quality or safety scores drop—catching regressions before they hit production and steadily hardening your AI agents against real-world failure modes.
