
Bem evals/regression testing: how do I create a golden dataset and block a workflow release if accuracy drops?
Most teams discover the hard way that “looks good on a sample doc” is not a deployment strategy. If you’re not treating evals and regression tests as first-class citizens, your unstructured-data workflows will silently degrade as you change prompts, models, or schemas. Bem is built so you can do the opposite: define golden datasets, measure F1 before every promotion, and block any workflow release where accuracy drops.
Quick Answer: In Bem, you create a golden dataset by submitting labeled examples against a function/workflow schema and using them as the reference set for evals. Then you wire your CI/CD to call Bem’s eval and regression endpoints (e.g., /v2/functions/regression) before promoting a new version, and automatically fail the deployment if Precision/Recall/F1 fall below your thresholds.
Why This Matters
If you’re running AP, claims, logistics, or onboarding flows, “pretty good” accuracy isn’t good enough. You need a way to prove that invoice-extractor-v3 is at least as good as v2, and that no schema change or model swap quietly torches your F1.
Bem’s evals and regression testing give you the same safety rails you expect from unit tests and code coverage:
- Golden datasets you control.
- Versioned functions and workflows.
- Automated gates that block bad releases before they hit production.
You stop shipping on vibes and start shipping on stats.
Key Benefits:
- Deterministic promotions: Only ship new function/workflow versions when F1 scores meet or exceed your target thresholds.
- Drift detection in production: Catch accuracy regressions early by re-running historical payloads and monitoring pass rates over time.
- Lower human review load: Use objective evals to set review thresholds, estimate required human-in-the-loop effort, and avoid over- or under-reviewing documents.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Golden Dataset | A curated set of inputs plus ground-truth JSON outputs matching your schema (e.g., invoices + correct line items, totals, and vendor IDs). | Becomes the reference standard for Precision/Recall/F1; your “unit test suite” for unstructured workflows. |
| Automated Evals | Bem’s capability to run statistical analysis (Precision, Recall, F1) on a function/workflow against a golden dataset. | Lets you compare versions and models objectively, not by eyeballing a handful of examples. |
| Regression Testing | Replaying real historical payloads against a new function/workflow version via Bem’s regression endpoints. | Ensures new versions don’t break existing behavior or degrade on live traffic characteristics. |
How It Works (Step-by-Step)
At a high level, you:
- Define and store your schema (what “correct” means).
- Build a golden dataset of labeled examples.
- Wire Bem evals and regression tests into CI/CD.
- Block any workflow release where metrics drop below thresholds.
1. Define the schema you want to protect
Bem is schema-first. That’s the foundation of meaningful evals.
- Create or import a JSON Schema for the function/workflow:
  - Example: invoice_extraction.schema.json
  - Fields: invoice_number, invoice_date, vendor_name, net_total, tax_total, line_items[] (with sku, description, qty, unit_price, line_total), etc.
- Use strict typing and enums where possible:
  - Dates as format: date.
  - Currencies as enums or ISO codes.
  - Statuses as enums (PENDING, APPROVED, EXCEPTION).
This schema is what Bem enforces in production (“schema-valid output, or explicit exception”) and what it uses when scoring evals.
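As a rough illustration, a schema along these lines could be written as a Python dict and serialized to invoice_extraction.schema.json. The field names come from the list above; the types, the currency and status fields, the enum values for currency, and the required lists are assumptions made for the example, not Bem's actual schema.

```python
import json

# Hypothetical JSON Schema for the invoice extraction function.
# Field names follow the text; types and constraints are assumptions.
INVOICE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string", "format": "date"},
        "vendor_name": {"type": "string"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "status": {"type": "string", "enum": ["PENDING", "APPROVED", "EXCEPTION"]},
        "net_total": {"type": "number"},
        "tax_total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "sku": {"type": "string"},
                    "description": {"type": "string"},
                    "qty": {"type": "number"},
                    "unit_price": {"type": "number"},
                    "line_total": {"type": "number"},
                },
                "required": ["description", "qty", "unit_price", "line_total"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["invoice_number", "invoice_date", "vendor_name", "net_total", "tax_total"],
    "additionalProperties": False,
}


def schema_as_json() -> str:
    """Serialize the schema, e.g. to write invoice_extraction.schema.json."""
    return json.dumps(INVOICE_SCHEMA, indent=2)
```

Setting additionalProperties to False is one way to make "schema-valid output, or explicit exception" strict: unexpected fields fail validation instead of slipping through.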
2. Build your first golden dataset
A golden dataset is just: {input} → {ground_truth_output} pairs that conform to your schema.
Typical starting sources:
- Documents your team has already keyed in manually (e.g., AP history from your ERP).
- Historical emails/tickets where you know the correct extracted fields.
- Edge-case packets that have broken prior systems (multi-doc PDFs, mixed languages, blurry scans).
Process:
- Collect inputs
  - 100–500 representative inputs is a good v1.
  - Aim for diversity: different vendors, layouts, amounts, languages, scan quality, page counts.
  - Include “nasty” cases: handwritten totals, credits, multi-invoice PDFs.
- Define ground truth outputs
  - Either:
    - Export correct data from your system of record and map it into your Bem schema; or
    - Use Bem’s Supervise/Surface UI to let humans correct outputs once, then lock those as truth.
  - Ensure every record is schema-valid; fix any inconsistencies (e.g., always tax-exclusive vs tax-inclusive totals).
- Store as a Bem Collection or dataset
  - Create a dedicated collection or dataset tagged for evals:
    - Example: ap-invoices-golden-v1
  - Each item should include:
    - Input reference (file path, blob ID, email ID, etc.).
    - Expected JSON payload aligned to your workflow’s final schema.
    - Optional: tags for scenario (high_value, multi_page, handwritten) to slice metrics later.
This dataset never goes away. You add to it. You don’t overwrite it.
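Before uploading anything, it can help to keep the golden set in a simple local format. The sketch below stores each {input} → {ground_truth_output} pair as one JSON line and lets you slice by scenario tag. The GoldenRecord shape and the helper names are hypothetical; how Bem stores collection items is not shown here.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class GoldenRecord:
    """One input -> ground-truth pair. This shape is an assumption."""
    input_ref: str                             # file path, blob ID, email ID, ...
    expected: dict                             # ground-truth JSON matching the schema
    tags: list = field(default_factory=list)   # e.g. ["high_value", "multi_page"]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(text):
    """Parse JSON Lines back into GoldenRecord objects, skipping blank lines."""
    return [GoldenRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


def with_tag(records, tag):
    """Select the records labeled with a given scenario tag."""
    return [r for r in records if tag in r.tags]
```

A line-per-record file diffs cleanly in version control, which makes it easy to review what was added to the golden set in each change.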
3. Run automated evals on a function
Say you have a function invoice-extractor-v1 and you’ve drafted invoice-extractor-v2 with a better prompt or model.
You want to answer:
- Is v2 actually better than v1 across my golden dataset?
- If not, in which scenarios does it regress?
Flow:
- Submit eval job
  - In CI or a script, call Bem’s eval endpoint for functions, passing:
    - functionName: invoice-extractor-v2
    - dataset: ap-invoices-golden-v1
  - Bem executes v2 across the dataset and compares outputs to ground truth using your schema.
- Review metrics
  - Bem computes for each field and overall:
    - Precision
    - Recall
    - F1 score
  - You get a breakdown like:
    - invoice_number: F1 0.998
    - net_total: F1 0.993
    - tax_total: F1 0.970
    - line_items[]: F1 0.945
    - Overall: F1 0.962
- Define thresholds
  - Example policy:
    - Overall F1 must be ≥ 0.96.
    - net_total and tax_total F1 must be ≥ 0.99.
    - line_items[] F1 must not drop vs previous version.
  - These thresholds are what you will enforce in CI/CD.
This is your “unit test suite” for the extraction function.
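To make the metrics concrete, here is a minimal sketch of how per-field Precision, Recall, and F1 could be computed locally with exact-match scoring on scalar fields. This is an illustration of the statistics, not Bem's scoring implementation; nested fields such as line_items[] would need an item-matching step that is left out, and the function name is hypothetical.

```python
def field_scores(predictions, ground_truth, fields):
    """Per-field precision/recall/F1 over paired documents (exact match).

    predictions and ground_truth are equal-length lists of dicts.
    A wrong value counts as both a false positive and a false negative.
    """
    scores = {}
    for name in fields:
        tp = fp = fn = 0
        for pred, truth in zip(predictions, ground_truth):
            p, t = pred.get(name), truth.get(name)
            if p is not None and p == t:
                tp += 1
            else:
                if p is not None:
                    fp += 1  # extracted a wrong or spurious value
                if t is not None:
                    fn += 1  # missed or mis-extracted a true value
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[name] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Counting a wrong value against both precision and recall is a common convention for extraction tasks; it penalizes confident mistakes more than omissions.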
4. Add regression testing with historical payloads
Golden datasets are curated. Regression testing adds realism: “does this new version behave at least as well on real traffic?”
Bem exposes regression endpoints (e.g., /v2/functions/regression) that:
- Re-run historical payloads against a new function version.
- Compare metrics against baselines.
- Identify drift.
Flow:
- Select historical window
  - Example: last 30 days of invoices that:
    - Successfully processed through invoice-extractor-v1.
    - Have known outcomes (no outstanding exceptions).
- Call regression endpoint
  - Provide:
    - oldFunction: invoice-extractor-v1
    - newFunction: invoice-extractor-v2
    - Payload IDs / time window.
  - Bem replays and reports:
    - Delta in F1 per field.
    - Delta in exception rate.
    - Drift by scenario (vendor, format, etc.), if you’re tagging payloads.
- Act on the results
  - If v2 improves overall F1 and doesn’t spike exceptions, it’s a candidate for promotion.
  - If specific vendors regress, you can:
    - Add those cases to the golden dataset.
    - Patch the function (prompt or logic) and re-run.
This keeps you from shipping v2 that looks great on v1 golden examples but fails on that one ugly logistics vendor you forgot to label.
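The regression flow above can be sketched in two parts: preparing the request and interpreting per-field deltas. Only the /v2/functions/regression path and the oldFunction/newFunction names come from the text. The host, the payloadIds field, the bearer-token header, and the shape of the metrics dicts are assumptions; check Bem's API reference for the real contract.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not a real Bem URL


def build_regression_request(old_function, new_function, payload_ids, api_key):
    """Prepare (but do not send) a POST to the regression endpoint."""
    body = {
        "oldFunction": old_function,
        "newFunction": new_function,
        "payloadIds": list(payload_ids),  # assumed field name
    }
    return urllib.request.Request(
        BASE_URL + "/v2/functions/regression",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def f1_deltas(baseline, candidate):
    """Per-field F1 change (candidate minus baseline) for fields in both."""
    return {name: candidate[name] - baseline[name] for name in baseline if name in candidate}


def regressed_fields(deltas, tolerance=0.0):
    """Names of fields whose F1 dropped by more than the tolerance."""
    return sorted(name for name, d in deltas.items() if d < -tolerance)
```

Passing the prepared request to urllib.request.urlopen would send it; keeping construction separate from sending makes the payload easy to inspect and test in CI without network access.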
5. Block workflow releases when accuracy drops
Everything above is only useful if it’s automatic. You don’t want a human saying “looks ok.”
You want CI to fail the release.
Typical pattern:
- Version your workflow
  - Treat workflows like code:
    - process-invoices-v4 (current production)
    - process-invoices-v5 (candidate)
  - Each workflow references specific function versions (invoice-extractor-v2, vendor-router-v1, gl-code-enricher-v3).
- In CI/CD, after building a new version:
  - Step 1: Run function-level evals:
    - Invoke Bem evals for each critical function (e.g., invoice-extractor).
  - Step 2: Run workflow-level regression:
    - Call /v2/functions/regression or the equivalent workflow-level endpoint to replay historical calls across the entire pipeline.
  - Step 3: Parse metrics and enforce policies:
    - Example thresholds:
      - Overall workflow F1 must be ≥ 0.97.
      - Any drop > 0.005 in net_total or line_items[] F1 vs previous version fails the build.
      - Exception rate must not increase by more than 0.5%.
- Fail the pipeline when thresholds aren’t met
  - Your CI script becomes something like:
    - Run eval.
    - Fetch results JSON.
    - If metrics.overall_f1 < 0.97 or metrics.net_total.delta_f1 < 0, exit with non-zero status.
  - The PR/MR is blocked; workflow version is not promoted.
- Only promote when evals pass
  - Once evals and regressions are green, you:
    - Tag the workflow as production: process-invoices-production -> v5.
    - Optionally run a canary: small percentage of production traffic routed to v5 for additional monitoring.
This is “accuracy as code coverage.” You wouldn’t merge a build that fails tests; you shouldn’t ship a workflow that fails evals.
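A gate like the one described can be a few lines of script. The sketch below applies the example thresholds to an already-fetched results document and returns an exit code for CI. The results layout (a metrics object with overall_f1, per-field delta_f1, and an exception_rate_delta key) is an assumption modeled on the names used in the text, and the 0.5% exception margin is read here as an absolute increase.

```python
import sys

# Example thresholds from the text; tune these to your own policy.
MIN_OVERALL_F1 = 0.97
MAX_F1_DROP = 0.005                    # allowed drop per critical field
MAX_EXCEPTION_RATE_INCREASE = 0.005    # 0.5%, interpreted as an absolute increase
CRITICAL_FIELDS = ("net_total", "line_items")


def gate_failures(metrics):
    """Return reasons the release should be blocked (empty list means pass)."""
    failures = []
    if metrics["overall_f1"] < MIN_OVERALL_F1:
        failures.append(f"overall_f1 {metrics['overall_f1']:.3f} < {MIN_OVERALL_F1}")
    for name in CRITICAL_FIELDS:
        delta = metrics[name]["delta_f1"]
        if delta < -MAX_F1_DROP:
            failures.append(f"{name} F1 dropped by {-delta:.3f}")
    if metrics.get("exception_rate_delta", 0.0) > MAX_EXCEPTION_RATE_INCREASE:
        failures.append("exception rate increased beyond the allowed margin")
    return failures


def exit_code(results):
    """Print block reasons to stderr and return 1 (block) or 0 (promote)."""
    failures = gate_failures(results["metrics"])
    for reason in failures:
        print(f"BLOCKED: {reason}", file=sys.stderr)
    return 1 if failures else 0
```

In a CI step you would load the results JSON and finish with sys.exit(exit_code(results)); any non-zero status fails the job and blocks the PR/MR.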
6. Use human-in-the-loop and review endpoints to tune thresholds
Accuracy isn’t static. As you expand vendors and geos, you’ll see new edge cases. Bem’s human review & eval endpoints help you keep thresholds realistic.
Two key pieces:
- Human-in-the-loop queue
  - Route low-confidence outputs to a Supervise UI.
  - Operators correct fields; Bem converts those corrections into new training/eval data.
  - You promote function versions only after they perform well on these newly-labeled edge cases.
- Review endpoints (effort estimation)
  - Bem’s /v2/functions/review endpoint estimates:
    - Statistical confidence of your pipeline.
    - How much human review is required to hit, say, 99.9% accuracy.
  - You can translate that into operational decisions: “At this F1, we need to review 8% of documents to stay inside SLA.”
Over time, your golden dataset grows, your thresholds get tighter, and your human review load drops.
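For intuition about the review-effort trade-off, here is a back-of-the-envelope sketch you could run on labeled samples. It is not what /v2/functions/review does internally; it assumes each document carries a single confidence score, that anything below the threshold goes to human review, and that only auto-accepted documents count toward the accuracy target. Both function names are hypothetical.

```python
def review_rate(confidences, threshold):
    """Fraction of documents whose confidence falls below the review threshold."""
    if not confidences:
        return 0.0
    flagged = sum(1 for c in confidences if c < threshold)
    return flagged / len(confidences)


def lowest_threshold_for_accuracy(samples, target_accuracy):
    """Smallest threshold whose auto-accepted documents meet the target accuracy.

    samples is a list of (confidence, is_correct) pairs from labeled data.
    Returns None if no threshold reaches the target.
    """
    for threshold in sorted({c for c, _ in samples}):
        accepted = [ok for c, ok in samples if c >= threshold]
        if accepted and sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
    return None
```

Scanning thresholds from lowest to highest returns the setting that sends the fewest documents to review while still meeting the target on the sample.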
Common Mistakes to Avoid
- Using only “happy path” examples in your golden dataset:
  How to avoid it: Systematically add every exception case you see in production back into your golden dataset. Make “broke once” equal “tested forever.”
- Treating evals as a one-off pre-launch task:
  How to avoid it: Wire Bem evals into your CI/CD so every new function/workflow version is automatically tested, compared to the previous one, and blocked if metrics regress.
Real-World Example
Imagine a finance team using Bem to process a few million invoices a month.
They start with process-invoices-v1 and a 150-document golden dataset. Their baseline:
- Overall F1: 0.94
- net_total F1: 0.995
- line_items[] F1: 0.91
They build invoice-extractor-v2 with a new model and improved prompt.
In CI:
- They run Bem evals against ap-invoices-golden-v1:
  - Overall F1: 0.962
  - net_total F1: 0.997
  - line_items[] F1: 0.94
  - All thresholds satisfied.
- They run regression on the last 30 days of production traffic via /v2/functions/regression:
  - Exception rate flat.
  - Slight improvement on high-value invoices.
  - One vendor’s PDFs regress; those 40 invoices are added to the golden dataset as ap-invoices-golden-v2.
- They update CI thresholds and re-run evals against the expanded golden dataset:
  - v2 still passes; delta vs v1 is positive across all critical fields.
- CI marks the workflow as safe to promote. They cut over process-invoices-production to v2 and keep a human review queue for any low-confidence outliers.
Two months later, they change GL-code enrichment logic. Evals catch a subtle drop in line-item classification; the release is blocked until they fix the mapping. No surprise CFO emails. No “we found out when the auditors did.”
Pro Tip: Treat every exception your operators fix in Bem’s Supervise UI as a future test case. Add it to a golden dataset, re-run evals, and don’t promote any function/workflow version that fails on a case you’ve already seen once.
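One way to apply this tip is an append-only helper that turns operator corrections into new golden cases. The correction shape (input_ref, corrected_output, tags) and the from_review tag are hypothetical; adapt them to whatever your review export actually contains.

```python
def add_corrections(golden, corrections):
    """Return a new golden list with corrected cases appended.

    Existing entries are never overwritten: a correction whose input_ref
    is already in the golden set is skipped.
    """
    known = {item["input_ref"] for item in golden}
    added = []
    for c in corrections:
        if c["input_ref"] in known:
            continue  # already a test case; keep the original ground truth
        added.append({
            "input_ref": c["input_ref"],
            "expected": c["corrected_output"],
            "tags": sorted(set(c.get("tags", [])) | {"from_review"}),
        })
        known.add(c["input_ref"])
    return golden + added
```

Returning a new list rather than mutating the input keeps the previous dataset version intact, so you can re-run old evals against exactly the data they were scored on.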
Summary
Bem’s evals and regression testing let you treat AI accuracy like software quality, not a demo metric.
You:
- Define strict schemas and golden datasets that encode what “correct” means.
- Run automated evals and regression tests via Bem’s endpoints on every new function/workflow version.
- Enforce hard thresholds in CI/CD so no release is promoted if F1 drops or exception rates spike.
- Use human review and golden dataset growth to continuously harden the system, especially on edge cases.
Agents guess. Demos impress. Production needs evidence. Golden datasets plus blocking eval gates are how you get it.