Bem evals/regression testing: how do I create a golden dataset and block a workflow release if accuracy drops?

Most teams discover the hard way that “looks good on a sample doc” is not a deployment strategy. If you’re not treating evals and regression tests as first-class citizens, your unstructured-data workflows will silently degrade as you change prompts, models, or schemas. Bem is built so you can do the opposite: define golden datasets, measure F1 before every promotion, and block any workflow release where accuracy drops.

Quick Answer: In Bem, you create a golden dataset by submitting labeled examples against a function/workflow schema and using them as the reference set for evals. Then you wire your CI/CD to call Bem’s eval and regression endpoints (e.g., /v2/functions/regression) before promoting a new version, and automatically fail the deployment if Precision/Recall/F1 fall below your thresholds.

Why This Matters

If you’re running AP, claims, logistics, or onboarding flows, “pretty good” accuracy isn’t good enough. You need a way to prove that invoice-extractor-v3 is at least as good as v2, and that no schema change or model swap quietly torches your F1.

Bem’s evals and regression testing give you the same safety rails you expect from unit tests and code coverage:

Golden datasets you control.
Versioned functions and workflows.
Automated gates that block bad releases before they hit production.

You stop shipping on vibes and start shipping on stats.

Key Benefits:

Deterministic promotions: Only ship new function/workflow versions when F1 scores meet or exceed your target thresholds.
Drift detection in production: Catch accuracy regressions early by re-running historical payloads and monitoring pass rates over time.
Lower human review load: Use objective evals to set review thresholds, estimate required human-in-the-loop effort, and avoid over- or under-reviewing documents.

Core Concepts & Key Points

Concept	Definition	Why it's important
Golden Dataset	A curated set of inputs plus ground-truth JSON outputs matching your schema (e.g., invoices + correct line items, totals, and vendor IDs).	Becomes the reference standard for Precision/Recall/F1; your “unit test suite” for unstructured workflows.
Automated Evals	Bem’s capability to run statistical analysis (Precision, Recall, F1) on a function/workflow against a golden dataset.	Lets you compare versions and models objectively, not by eyeballing a handful of examples.
Regression Testing	Replaying real historical payloads against a new function/workflow version via Bem’s regression endpoints.	Ensures new versions don’t break existing behavior or degrade on live traffic characteristics.

How It Works (Step-by-Step)

At a high level, you:

Define and store your schema (what “correct” means).
Build a golden dataset of labeled examples.
Wire Bem evals and regression tests into CI/CD.
Block any workflow release where metrics drop below thresholds.

1. Define the schema you want to protect

Bem is schema-first. That’s the foundation of meaningful evals.

Create or import a JSON Schema for the function/workflow:
- Example: invoice_extraction.schema.json
- Fields: invoice_number, invoice_date, vendor_name, net_total, tax_total, line_items[] (with sku, description, qty, unit_price, line_total), etc.
Use strict typing and enums where possible:
- Dates as format: date.
- Currencies as enums or ISO codes.
- Statuses as enums (PENDING, APPROVED, EXCEPTION).

This schema is what Bem enforces in production (“schema-valid output, or explicit exception”) and what it uses when scoring evals.
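To make this step concrete, here is a minimal sketch of what invoice_extraction.schema.json could contain, written as a Python dict so it can be dumped to JSON. The field names come from the list above. The `currency` and `status` properties, the currency enum values, and the choice of JSON Schema draft are assumptions for illustration, not Bem requirements.

```python
import json

# Hypothetical contents of invoice_extraction.schema.json.
# "currency" and "status" are illustrative additions that show enum usage.
LINE_ITEM_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["sku", "description", "qty", "unit_price", "line_total"],
    "properties": {
        "sku": {"type": "string"},
        "description": {"type": "string"},
        "qty": {"type": "number"},
        "unit_price": {"type": "number"},
        "line_total": {"type": "number"},
    },
}

INVOICE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "invoice_extraction",
    "type": "object",
    "additionalProperties": False,
    "required": [
        "invoice_number", "invoice_date", "vendor_name",
        "net_total", "tax_total", "line_items",
    ],
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string", "format": "date"},
        "vendor_name": {"type": "string"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "status": {"type": "string", "enum": ["PENDING", "APPROVED", "EXCEPTION"]},
        "net_total": {"type": "number"},
        "tax_total": {"type": "number"},
        "line_items": {"type": "array", "items": LINE_ITEM_SCHEMA},
    },
}


def schema_as_json() -> str:
    """Serialize the schema so it can be saved or uploaded."""
    return json.dumps(INVOICE_SCHEMA, indent=2)
```

Setting `additionalProperties` to false is a design choice: it turns unexpected fields into validation failures, so they cannot slip into downstream systems unnoticed.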

2. Build your first golden dataset

A golden dataset is a set of {input} → {ground_truth_output} pairs in which every output conforms to your schema.

Typical starting sources:

Documents your team has already keyed in manually (e.g., AP history from your ERP).
Historical emails/tickets where you know the correct extracted fields.
Edge-case packets that have broken prior systems (multi-doc PDFs, mixed languages, blurry scans).

Process:

Collect inputs
- 100–500 representative inputs is a good v1.
- Aim for diversity: different vendors, layouts, amounts, languages, scan quality, page counts.
- Include “nasty” cases: handwritten totals, credits, multi-invoice PDFs.
Define ground truth outputs
- Either:
  - Export correct data from your system of record and map it into your Bem schema; or
  - Use Bem’s Supervise/Surface UI to let humans correct outputs once, then lock those as truth.
- Ensure every record is schema-valid, and resolve inconsistent conventions before labeling (e.g., decide whether totals are tax-exclusive or tax-inclusive and apply that choice to every record).
Store as a Bem Collection or dataset
- Create a dedicated collection or dataset tagged for evals:
  - Example: ap-invoices-golden-v1
- Each item should include:
  - Input reference (file path, blob ID, email ID, etc.).
  - Expected JSON payload aligned to your workflow’s final schema.
  - Optional: tags for scenario (high_value, multi_page, handwritten) to slice metrics later.

This dataset never goes away. You add to it. You don’t overwrite it.
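As a rough sketch of the item layout described above, each golden record can be modeled as an input reference, an expected payload, and optional tags, then serialized one record per line. The `GoldenRecord` class, the JSONL layout, and the required-field check are hypothetical conveniences for preparing data locally. They are not Bem's storage format.

```python
import json
from dataclasses import dataclass, field, asdict

# Top-level fields every expected payload must carry (from the schema step).
REQUIRED_FIELDS = (
    "invoice_number", "invoice_date", "vendor_name",
    "net_total", "tax_total", "line_items",
)


@dataclass
class GoldenRecord:
    input_ref: str                             # file path, blob ID, email ID, ...
    expected: dict                             # ground-truth JSON for the final schema
    tags: list = field(default_factory=list)   # e.g. ["high_value", "multi_page"]


def missing_fields(record: GoldenRecord) -> list:
    """Return the required fields absent from the expected payload."""
    return [name for name in REQUIRED_FIELDS if name not in record.expected]


def to_jsonl(records) -> str:
    """Serialize records one per line, refusing incomplete ground truth."""
    lines = []
    for record in records:
        missing = missing_fields(record)
        if missing:
            raise ValueError(f"{record.input_ref}: missing {missing}")
        lines.append(json.dumps(asdict(record), sort_keys=True))
    return "\n".join(lines)
```

Rejecting incomplete records at write time keeps labeling gaps from showing up later as false regressions.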

3. Run automated evals on a function

Say you have a function invoice-extractor-v1 and you’ve drafted invoice-extractor-v2 with a better prompt or model.

You want to answer:

Is v2 actually better than v1 across my golden dataset?
If not, in which scenarios does it regress?

Flow:

Submit eval job
- In CI or a script, call Bem’s eval endpoint for functions, passing:
  - functionName: invoice-extractor-v2
  - dataset: ap-invoices-golden-v1
- Bem executes v2 across the dataset and compares outputs to ground truth using your schema.
Review metrics
- Bem computes for each field and overall:
  - Precision
  - Recall
  - F1 score
- You get a breakdown like:
  - invoice_number: F1 0.998
  - net_total: F1 0.993
  - tax_total: F1 0.970
  - line_items[]: F1 0.945
  - Overall: F1 0.962
Define thresholds
- Example policy:
  - Overall F1 must be ≥ 0.96.
  - net_total and tax_total F1 must be ≥ 0.99.
  - line_items[] F1 must not drop vs previous version.
- These thresholds are what you will enforce in CI/CD.

This is your “unit test suite” for the extraction function.
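For intuition about the per-field numbers, the sketch below computes Precision, Recall, and F1 for scalar fields using one common convention for extraction tasks: an exact match is a true positive, a wrong or spurious value is a false positive, and a missed or wrong value is a false negative. This is an assumption about scoring, not a description of how Bem computes its metrics. The function names are hypothetical.

```python
from collections import Counter


def score_field(pairs):
    """Score (predicted, expected) pairs; None means the field is absent."""
    counts = Counter()
    for predicted, expected in pairs:
        if predicted is not None and predicted == expected:
            counts["tp"] += 1
            continue
        if predicted is not None:
            counts["fp"] += 1   # emitted a wrong or spurious value
        if expected is not None:
            counts["fn"] += 1   # missed the true value
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def score_dataset(predictions, ground_truth, fields):
    """Return per-field metrics for two aligned lists of output dicts."""
    return {
        name: score_field(
            (pred.get(name), truth.get(name))
            for pred, truth in zip(predictions, ground_truth)
        )
        for name in fields
    }
```

Nested fields such as line_items[] need an alignment rule (for example, matching rows by sku) before they can be scored this way. That rule is left out of this sketch.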

4. Add regression testing with historical payloads

Golden datasets are curated. Regression testing adds realism: “does this new version behave at least as well on real traffic?”

Bem exposes regression endpoints (e.g., /v2/functions/regression) that:

Re-run historical payloads against a new function version.
Compare metrics against baselines.
Identify drift.

Flow:

Select historical window
- Example: last 30 days of invoices that:
  - Successfully processed through invoice-extractor-v1.
  - Have known outcomes (no outstanding exceptions).
Call regression endpoint
- Provide:
  - oldFunction: invoice-extractor-v1
  - newFunction: invoice-extractor-v2
  - Payload IDs / time window.
- Bem replays and reports:
  - Delta in F1 per field.
  - Delta in exception rate.
  - Drift by scenario (vendor, format, etc.), if you’re tagging payloads.
Act on the results
- If v2 improves overall F1 and doesn’t spike exceptions, it’s a candidate for promotion.
- If specific vendors regress, you can:
  - Add those cases to the golden dataset.
  - Patch the function (prompt or logic) and re-run.

This keeps you from shipping a v2 that looks great on the golden dataset but fails on that one ugly logistics vendor you forgot to label.
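The regression call in the flow above could be assembled roughly as follows. Only the /v2/functions/regression path and the oldFunction/newFunction names come from this article. The base URL is a placeholder, and the bearer-token header, the `payloadIds` key, and the overall body shape are assumptions, so check Bem's API reference before relying on any of them.

```python
import json
import urllib.request

# Placeholder host: replace with the real API base URL from Bem's docs.
BASE_URL = "https://bem.example.invalid"


def build_regression_request(old_function, new_function, payload_ids, api_key):
    """Build (but do not send) a POST request for a regression run."""
    body = {
        "oldFunction": old_function,       # baseline version
        "newFunction": new_function,       # candidate version
        "payloadIds": list(payload_ids),   # assumed key for historical payloads
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v2/functions/regression",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )
```

Building the request separately from sending it keeps the payload testable without network access. In CI you would pass the result to `urllib.request.urlopen` and parse the JSON response.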

5. Block workflow releases when accuracy drops

Everything above is only useful if it’s automatic. You don’t want a human saying “looks ok.”

You want CI to fail the release.

Typical pattern:

Version your workflow
- Treat workflows like code:
  - process-invoices-v4 (current production)
  - process-invoices-v5 (candidate)
- Each workflow references specific function versions (invoice-extractor-v2, vendor-router-v1, gl-code-enricher-v3).
In CI/CD, after building a new version:
- Step 1: Run function-level evals:
  - Invoke Bem evals for each critical function (e.g., invoice-extractor).
- Step 2: Run workflow-level regression:
  - Call /v2/functions/regression or the equivalent workflow-level endpoint to replay historical calls across the entire pipeline.
- Step 3: Parse metrics and enforce policies:
  - Example thresholds:
    - Overall workflow F1 must be ≥ 0.97.
    - Any drop > 0.005 in net_total or line_items[] F1 vs previous version fails the build.
    - Exception rate must not increase by more than 0.5 percentage points.
Fail the pipeline when thresholds aren’t met
- Your CI script becomes something like:
  - Run eval.
  - Fetch results JSON.
  - If metrics.overall_f1 < 0.97 or metrics.net_total.delta_f1 < -0.005, exit with non-zero status.
- The PR/MR is blocked; workflow version is not promoted.
Only promote when evals pass
- Once evals and regressions are green, you:
  - Tag the workflow as production: process-invoices-production -> v5.
  - Optionally run a canary: small percentage of production traffic routed to v5 for additional monitoring.

This is “accuracy as code coverage.” You wouldn’t merge a build that fails tests; you shouldn’t ship a workflow that fails evals.
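A gate script along these lines could enforce the example thresholds above. The layout of the metrics dicts (`overall_f1`, `field_f1`, `exception_rate`) is a hypothetical shape chosen for this sketch, not Bem's response format, and the 0.5% exception-rate limit is read here as 0.5 percentage points.

```python
import json
import sys

MIN_OVERALL_F1 = 0.97
MAX_FIELD_DROP = 0.005            # allowed F1 drop per guarded field
MAX_EXCEPTION_INCREASE = 0.005    # 0.5 percentage points (assumed reading)
GUARDED_FIELDS = ("net_total", "line_items")
EPS = 1e-9                        # tolerance for floating-point noise


def gate(baseline, candidate):
    """Return a list of policy violations; an empty list means pass."""
    failures = []
    if candidate["overall_f1"] < MIN_OVERALL_F1:
        failures.append(f"overall F1 {candidate['overall_f1']:.3f} < {MIN_OVERALL_F1}")
    for name in GUARDED_FIELDS:
        drop = baseline["field_f1"][name] - candidate["field_f1"][name]
        if drop > MAX_FIELD_DROP + EPS:
            failures.append(f"{name} F1 dropped by {drop:.4f}")
    increase = candidate["exception_rate"] - baseline["exception_rate"]
    if increase > MAX_EXCEPTION_INCREASE + EPS:
        failures.append(f"exception rate rose by {increase:.4f}")
    return failures


def main(argv):
    """Compare two metrics files and return a process exit code."""
    if len(argv) != 3:
        print("usage: gate.py BASELINE.json CANDIDATE.json", file=sys.stderr)
        return 2
    with open(argv[1]) as handle:
        baseline = json.load(handle)
    with open(argv[2]) as handle:
        candidate = json.load(handle)
    failures = gate(baseline, candidate)
    for message in failures:
        print(f"FAIL: {message}", file=sys.stderr)
    return 1 if failures else 0
```

To use it as a CI step, end the script with `sys.exit(main(sys.argv))`. Any non-zero exit code then fails the job and blocks promotion.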

6. Use human-in-the-loop and review endpoints to tune thresholds

Accuracy isn’t static. As you expand vendors and geos, you’ll see new edge cases. Bem’s human review and eval endpoints help you keep thresholds realistic.

Two key pieces:

Human-in-the-loop queue
- Route low-confidence outputs to a Supervise UI.
- Operators correct fields; Bem converts those corrections into new training/eval data.
- You promote function versions only after they perform well on these newly-labeled edge cases.
Review endpoints (effort estimation)
- Bem’s /v2/functions/review endpoint estimates:
  - Statistical confidence of your pipeline.
  - How much human review is required to hit, say, 99.9% accuracy.
- You can translate that into operational decisions: “At this F1, we need to review 8% of documents to stay inside SLA.”

Over time, your golden dataset grows, your thresholds get tighter, and your human review load drops.
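To illustrate the effort-estimation idea, the sketch below takes labeled results with confidence scores and finds the smallest share of documents to review, lowest confidence first, so that post-review accuracy reaches a target. It assumes a reviewer fixes every document they see. It is a simplified local model of the trade-off, not a reproduction of what the /v2/functions/review endpoint computes, and the function name is hypothetical.

```python
def required_review_fraction(results, target_accuracy):
    """results: list of (confidence, is_correct) pairs from a labeled set.

    Returns the fraction of documents to review, lowest confidence first,
    so that accuracy after review is at least target_accuracy.
    """
    if not results:
        raise ValueError("results must not be empty")
    ordered = sorted(results, key=lambda item: item[0])  # lowest confidence first
    total = len(ordered)
    remaining_errors = sum(1 for _, correct in ordered if not correct)
    for reviewed in range(total + 1):
        if 1 - remaining_errors / total >= target_accuracy:
            return reviewed / total
        # Review the next-lowest-confidence document; a human fixes it if wrong.
        if reviewed < total and not ordered[reviewed][1]:
            remaining_errors -= 1
    return 1.0
```

If confidence correlates well with correctness, errors cluster at the low end and the required review fraction stays small. If it does not, this estimate climbs quickly, which is itself a useful signal.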

Common Mistakes to Avoid

Using only “happy path” examples in your golden dataset:
How to avoid it: Systematically add every exception case you see in production back into your golden dataset. Make “broke once” equal “tested forever.”
Treating evals as a one-off pre-launch task:
How to avoid it: Wire Bem evals into your CI/CD so every new function/workflow version is automatically tested, compared to the previous one, and blocked if metrics regress.

Real-World Example

Imagine a finance team using Bem to process a few million invoices a month.

They start with process-invoices-v1 and a 150-document golden dataset. Their baseline:

Overall F1: 0.94
net_total F1: 0.995
line_items[] F1: 0.91

They build invoice-extractor-v2 with a new model and improved prompt.

In CI:

They run Bem evals against ap-invoices-golden-v1:
- Overall F1: 0.962
- net_total F1: 0.997
- line_items[] F1: 0.94
- All thresholds satisfied.
They run regression on the last 30 days of production traffic via /v2/functions/regression:
- Exception rate flat.
- Slight improvement on high-value invoices.
- One vendor’s PDFs regress; those 40 invoices are labeled and added to the golden dataset, which is saved as ap-invoices-golden-v2.
They update CI thresholds and re-run evals against the expanded golden dataset:
- v2 still passes; delta vs v1 is positive across all critical fields.
CI marks the workflow as safe to promote. They cut over process-invoices-production to v2 and keep a human review queue for any low-confidence outliers.

Two months later, they change GL-code enrichment logic. Evals catch a subtle drop in line-item classification; the release is blocked until they fix the mapping. No surprise CFO emails. No “we found out when the auditors did.”

Pro Tip: Treat every exception your operators fix in Bem’s Supervise UI as a future test case. Add it to a golden dataset, re-run evals, and don’t promote any function/workflow version that fails on a case you’ve already seen once.

Summary

Bem’s evals and regression testing let you treat AI accuracy like software quality, not a demo metric.

You:

Define strict schemas and golden datasets that encode what “correct” means.
Run automated evals and regression tests via Bem’s endpoints on every new function/workflow version.
Enforce hard thresholds in CI/CD so no release is promoted if F1 drops or exception rates spike.
Use human review and golden dataset growth to continuously harden the system, especially on edge cases.

Agents guess. Demos impress. Production needs evidence. Golden datasets plus blocking eval gates are how you get it.
