How do I run my first evaluation suite in Future AGI and set pass/fail thresholds for releases?
LLM Observability & Evaluation

How do I run my first evaluation suite in Future AGI and set pass/fail thresholds for releases?

12 min read

LLMs are probabilistic, which means behavior shifts as you change prompts, models, or tools—even when your inputs look the same. In Future AGI, the way you turn those probabilistic behaviors into a reliable product is by wiring everything into an evaluation suite with clear pass/fail thresholds for every release.

This guide walks you through how to run your first evaluation suite in Future AGI and how to set concrete pass/fail gates so you can ship with confidence instead of vibes.

Quick Answer: You create a Dataset of scenarios, wire it into an Experiment, attach deterministic eval metrics, then define pass/fail thresholds at the metric or run level. Once that’s in place, every new prompt/model/workflow change gets evaluated against the same criteria before you release.


The Quick Overview

  • What It Is: A step‑by‑step flow to build your first evaluation suite in Future AGI—covering datasets, experiments, metrics, and release gates.
  • Who It Is For: Teams shipping RAG chatbots, summarizers, or tool-using agents who need repeatable, deterministic checks before pushing to production.
  • Core Problem Solved: Turning ad‑hoc “does this look good?” checks into a consistent, automated eval pipeline with pass/fail thresholds for every release.

How It Works (End‑to‑End Flow)

At a high level, you follow the same lifecycle Future AGI is built around: Datasets → Experiment → Evaluate → Improve → Monitor & Protect.

  1. Datasets: Capture your real and synthetic test scenarios (inputs + expected signals).
  2. Experiment: Run different prompts/models/agent configs against the Dataset.
  3. Evaluate: Attach metrics (including deterministic evals) that score each output and compute aggregate results.
  4. Set Thresholds: Define pass/fail rules per metric or experiment to gate releases.
  5. Ship & Monitor: Promote a configuration only if it passes, then trace it in production.

Below, I’ll go step-by-step through how to run your first evaluation suite and wire in thresholds that your team can trust as release criteria.


Step 1: Create or Import Your First Dataset

Everything starts with a Dataset. If you can’t replay real scenarios, you can’t evaluate reliably.

1.1 Decide on your first target workflow

Pick one workflow that matters for your release:

  • A RAG chatbot answering support questions.
  • A summarizer condensing long documents or calls.
  • A tool-using agent performing multi-step actions (booking, updating records, etc.).
  • A multimodal agent interpreting images or audio.

Your first evaluation suite should mirror exactly how this workflow behaves in production: inputs, instructions, tools, and expected outcomes.

1.2 Create a Dataset in Future AGI

You’ll typically do this via the UI or SDK:

  • From the UI:

    • Go to Datasets.
    • Click New Dataset.
    • Give it a name like support-rag-regression-v1 or call-summary-quality-v1.
    • Choose the type: text, image+text, audio, or full multimodal depending on your workflow.
  • From code (conceptually):

    # pseudo-structure; actual SDK may vary
    from futureagi import Dataset
    
    ds = Dataset.create(
        name="support-rag-regression-v1",
        modality="text",
        metadata={"app": "support-bot", "env": "staging"}
    )
    

1.3 Populate with scenarios (real + synthetic)

To make your evaluation suite meaningful:

  • Seed with real traffic:

    • Export real user queries or tickets from logs.
    • Include tricky cases: ambiguous questions, long context, edge topics.
  • Augment with synthetic datasets:

    • Use Future AGI’s synthetic generation to create edge cases:
      • Adversarial prompts (prompt injection, jailbreak attempts).
      • Domain-specific queries rarely seen in current logs.
      • Long-tail formats (multi-turn, mixed languages, noisy input).

Each Dataset row typically looks like:

  • input: user query, document, or multimodal input.
  • context: retrieval snippets, business rules, or tools available.
  • expected_signal: ground truth answer, rubric, or constraints (e.g., “must not expose PII”).

The goal is coverage, not perfection: you want enough scenarios to expose failures you’d be unhappy to see in production.


Step 2: Wire the Dataset into an Experiment

Now you connect your Dataset to the actual agent configuration you want to evaluate.

2.1 Define the workflow you’re testing

An Experiment in Future AGI represents a specific configuration:

  • Model provider: OpenAI, Anthropic, Bedrock, Gemini, Llama, etc.
  • Framework: LangChain, DSPy, Haystack, CrewAI, LiteLLM, or your own stack.
  • Prompt strategy: system prompt, few-shot examples, tools, retrieval settings.
  • Parameters: temperature, max tokens, etc.

This is how you compare variants (e.g., “old prompt vs. new prompt,” “GPT-4 vs. Claude,” “BM25-only vs. hybrid retrieval”).

2.2 Instrument your code for traces

To evaluate realistically, you want end‑to‑end traces, not just final text outputs.

Instrument your existing app:

# Conceptual example
from traceai_openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument()  # hooks your OpenAI calls

# Your existing code
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_query}],
    # ...
)

Future AGI captures:

  • Inputs and prompts.
  • Tool calls and intermediate steps.
  • Final outputs per scenario.

2.3 Create the Experiment

In the UI:

  • Go to Experiment.
  • Click New Experiment.
  • Select the Dataset you created in Step 1.
  • Choose the agent configuration (model, prompt, tools, retrieval) you want to run.
  • Name it clearly, like:
    • support-bot-gpt4o-prompt-v3
    • call-summarizer-claude-3.5-temp0.2

Then run the Experiment. Future AGI will execute your workflow across all Dataset rows and log the outputs.


Step 3: Attach Evaluation Metrics (Deterministic Where Possible)

Running an Experiment gives you raw outputs; metrics turn them into actionable numbers.

3.1 Pick the right eval dimensions

For your first evaluation suite, focus on 3–5 core dimensions that match your product:

Common ones:

  • Accuracy / Correctness
  • Relevance (especially for RAG)
  • Faithfulness (no hallucinated facts vs. source)
  • Completeness (answers all parts of the question)
  • Format compliance (JSON, schema, tags)
  • Safety (toxicity, harassment, self-harm, etc.)
  • Privacy (PII leakage)
  • Prompt injection resistance

Future AGI provides:

  • Deterministic evals with fixed, predefined criteria.
    These are crucial when you need repeatable pass/fail decisions (e.g., “is this valid JSON?” “did it mention the required field?”).
  • LLM-based heuristics configured as metrics when human-like judgment is needed (e.g., “Is this summary accurate and concise?”), still structured with clear rubrics.

3.2 Add metrics to your Experiment

In the Experiment view:

  1. Go to the Evaluate or Metrics section.
  2. Attach built‑in metrics or configure custom ones:
    • Choose Accuracy with reference answers.
    • Add Safety metrics (e.g., Protect guardrails for toxicity, sexism, privacy, prompt injection).
    • Include Format checks for JSON or structured outputs.
  3. Define scoring ranges:
    • Binary (pass/fail, true/false).
    • Scalar (0–1, 0–100).
    • Categorical (e.g., Good / Acceptable / Bad mapped to numeric thresholds).

Once attached, re-run evaluation to compute metrics per scenario and aggregate scores (mean, median, percent passing, etc.).


Step 4: Define Pass/Fail Thresholds for Releases

This is where your evaluation suite becomes a release gate instead of a dashboard.

4.1 Choose your baseline

Run your current production configuration against the Dataset and metrics you just set up. That gives you:

  • Baseline metric scores (e.g., Accuracy=0.86, Safety pass rate=0.99).
  • A realistic sense of how “good” today looks.

You do not want to set thresholds blind; anchor them against what’s already working (or at least deployed).

4.2 Set metric-level thresholds

For each key metric, decide:

  • Minimum acceptable score for a release.
  • Whether it’s hard blocking or soft warning.

Examples:

  • Accuracy:
    • Threshold: ≥ 0.90 average score.
    • Rule: “Block release if Accuracy < 0.90.”
  • Safety (toxicity / sexism):
    • Threshold: ≥ 0.995 safe rate (≤ 0.5% violations).
    • Rule: “Block release if Safety < 0.995.”
  • Format compliance (JSON):
    • Threshold: ≥ 0.98 valid JSON rate.
    • Rule: “Block release if JSON validity < 0.98.”

In Future AGI, this typically looks like:

  • For each metric, define:
    • Target (e.g., 0.9).
    • Direction (higher is better / lower is better).
    • Status (required vs. optional).

That gives you deterministic yes/no gating per metric.

4.3 Define experiment-level pass criteria

Then you combine metric thresholds into an experiment-level decision:

Example policy for a RAG support bot:

  • Must pass:
    • Accuracy ≥ 0.90.
    • Faithfulness ≥ 0.95.
    • Safety ≥ 0.995.
    • JSON Validity ≥ 0.98.
  • Nice to have:
    • Latency ≤ 1.2x baseline (warning if slower).

You can treat these as:

  • “Release candidates” when all required metrics meet or exceed thresholds.
  • “Rejected” when any required metric fails.

This is what allows you to say, “This prompt update is not shipping; it fails faithfulness,” instead of debating screenshots.

4.4 Make thresholds explicit in your workflow

Operationalize the thresholds:

  • In CI/CD:
    • Add a test step that calls Future AGI’s API to:
      • Trigger the Experiment run (or reuse latest).
      • Fetch metrics and compare to thresholds.
      • Fail the pipeline if any hard metric fails.
  • In release playbooks:
    • Document: “A change can’t go to production unless it passes regression-suite-v1 with all required metrics green.”
  • In team habits:
    • Code reviews and PRs reference Experiment IDs or evaluation reports, not just prompt diffs.

Now every release runs through the same evaluation suite with the same pass/fail logic.


Step 5: Use Results to Improve Before You Ship

Evaluations without feedback loops are just nice charts. Future AGI is built to close the loop.

5.1 Drill into failures with traces

For any metric that fails:

  • Open the failed scenarios.
  • Inspect traces:
    • Input prompt and context.
    • Retrieval results or tool calls.
    • Intermediate steps and final answer.
  • Pin-point root cause:
    • Bad retrieval vs. bad reasoning vs. bad formatting.
    • Safety guardrail missed a specific pattern.
    • Model struggled with a format or domain.

This is where Future AGI’s tracing and Error Localizer‑style insights matter—you’re not guessing; you’re seeing exactly where the agent went off-rail.

5.2 Refine prompts and workflows

Based on those failures:

  • Tighten your system prompt with explicit constraints.
  • Adjust retrieval settings (k, filters, hybrid vs. sparse).
  • Introduce tools for structured operations instead of free-form reasoning.
  • Update the Dataset with new edge cases you discovered.

Then:

  1. Clone your Experiment with the updated config.
  2. Re-run against the same Dataset and metrics.
  3. Check if you now clear all thresholds.

5.3 Let the system help refine prompts

Future AGI can incorporate evaluation feedback into prompt refinement—using metrics and specific error patterns:

  • Automatically suggest prompt changes based on categories of failures.
  • Run side‑by‑side Experiments to compare old vs. new.
  • Select the Winner using your deterministic thresholds.

This gives you a repeatable improve loop instead of ad‑hoc prompt folklore.


Step 6: Promote to Production and Monitor & Protect

Once an Experiment passes your suite, it becomes a release candidate.

6.1 Promote the winning configuration

  • Mark the Experiment as the current production configuration for that app/route.
  • Update your deployment manifests to use:
    • The winning prompt.
    • The chosen model.
    • The tuned parameters and tools.

Now your production system is anchored to an evaluated, threshold-cleared configuration.

6.2 Monitor in production

Evaluation is not one‑and‑done. LLM behavior can drift with:

  • Model updates from providers.
  • New user behavior.
  • New content types (images, audio, languages).

Use Monitor & Protect to:

  • Trace real production usage.
  • Compare sampled live traffic against your evaluation suite scores.
  • Continuously spot regressions and new failure modes.

6.3 Enforce safety with minimal latency

Plug in Future AGI’s Protect guardrails in the live path:

  • Screen inputs for prompt injection, harassment, or disallowed content.
  • Screen outputs for toxicity, sexism, privacy violations, etc.
  • Block or rewrite unsafe responses with minimal latency.

Your pass/fail thresholds from the evaluation suite map directly into production policies (e.g., “block if Safety score < threshold”).


Example: A Minimal First Evaluation Suite

To make this concrete, here’s a simple setup for a first release:

  • Workflow: Text-based RAG support chatbot.
  • Dataset: support-rag-regression-v1 with:
    • 150 real queries from logs.
    • 50 synthetic edge cases (prompt injection, multi-step questions, policy edge cases).
  • Experiment: support-bot-gpt4o-prompt-v3.
  • Metrics:
    • Accuracy (0–1).
    • Faithfulness (0–1).
    • Safety (binary pass/fail -> pass rate).
    • JSON format validity (binary -> pass rate).
  • Thresholds:
    • Accuracy ≥ 0.88.
    • Faithfulness ≥ 0.95.
    • Safety ≥ 0.995 safe rate.
    • JSON Validity ≥ 0.98.
  • Release rule:
    • “Only promote configs where all four metrics meet thresholds and Accuracy is not more than 2 points below the current production baseline.”

Run this suite for each PR that changes prompts, models, or tool wiring. If it passes, you ship. If it fails, you debug through traces and iterate.


Frequently Asked Questions

Do I need ground-truth labels to run my first evaluation suite?

Short Answer: No, but they help. You can start with rubric-based and deterministic checks, then add labeled data over time.

Details:
For many teams, ground truth is incomplete or expensive. In Future AGI you can:

  • Start with deterministic checks: format, policy compliance, safety, injection resistance.
  • Add rubric-based evals where an evaluator model scores quality (e.g., “Is this answer correct and complete?”) using clear criteria.
  • Gradually introduce labeled data for high-value scenarios (e.g., your top 10 support intents) to anchor Accuracy metrics.

The important thing is consistency: use the same Dataset + metrics over time so trends and thresholds are meaningful.

How many scenarios do I need in my Dataset before setting pass/fail thresholds?

Short Answer: Aim for at least 50–100 meaningful scenarios per workflow, with a bias toward critical and edge cases.

Details:
You don’t need thousands of examples to start. What matters more is coverage of important failure modes:

  • Include your top 10–20 user intents.
  • Add known tricky cases where agents previously failed.
  • Use synthetic generation to cover rare but high-risk inputs (e.g., privacy requests, policy-violating queries, prompt injection attempts).

Once you have ~100 scenarios, your aggregate metrics become stable enough to support a first set of thresholds. You can always increase Dataset size and tighten thresholds as your system matures.


Summary

LLMs are probabilistic; they don’t give you release-grade reliability by default. In Future AGI, you turn that behavior into a predictable product by:

  1. Building a Dataset of real and synthetic scenarios (including edge cases).
  2. Running an Experiment that mirrors your real workflow.
  3. Attaching deterministic evals and metrics that capture accuracy, safety, and format.
  4. Defining explicit pass/fail thresholds for each metric and at the experiment level.
  5. Iterating on prompts and workflows using traces and feedback until you clear your thresholds.
  6. Promoting the winning configuration and using Monitor & Protect to enforce safety and catch drift in production.

Once this loop is in place, every change—prompt, model, or agent logic—goes through the same evaluation suite before you ship.


Next Step

Ready to set up your first evaluation suite and wire in real release thresholds?

Get Started