
Bem evals/regression testing: how do I create a golden dataset and block a workflow release if accuracy drops?
Most teams treat evals as an afterthought—something you run once before a big launch. That’s how you end up shipping regressions, breaking downstream systems, and finding out from your finance team that totals don’t add up anymore. With Bem, evals and regression testing are first-class: you create golden datasets, run automated checks on every new version, and block a workflow release the same way you’d block a failing test suite.
Quick Answer: You create a golden dataset in Bem by submitting labeled “ground truth” outputs for real inputs (documents, emails, packets), then attach that dataset to a function or workflow schema. Before promoting a new version, you hit Bem’s eval/regression endpoints to compute F1/precision/recall, compare against your thresholds, and only roll forward if metrics stay at or above your target—otherwise you block the release and keep the previous version in production.
Why This Matters
If you’re using LLMs in production without evals and regression testing, you’re flying blind. Demo accuracy doesn’t survive real-world packets, vendor drift, and new edge cases. You need a way to say, concretely:
- “This new version is better than the old one.”
- “If accuracy drops below X, nobody can ship it.”
- “When it breaks, I can see exactly where and why.”
Bem treats accuracy like code coverage: golden datasets, F1 scores, regression runs, and pass/fail gates for releases. Not vibes. Not “looks good in staging.”
Key Benefits:
- Deterministic releases: New extraction logic only ships if it beats—or at least matches—your current metrics.
- Auditable accuracy: Every workflow version has eval stats and regression traces you can replay and debug.
- Production resilience: Vendor layout changes or model updates trigger drift detection instead of silent failures in downstream systems.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Golden dataset | A curated set of inputs (PDFs, images, emails, mixed packets) with human-verified expected JSON outputs mapped to your schema. | Gives you a ground truth baseline to measure Precision, Recall, and F1 Scores for each new function or workflow version. |
| Automated evals | Programmatic evaluations Bem runs against your golden dataset using a specific function/workflow version. | Turns “looks right” into measurable metrics, so you can track improvements and detect regressions per field and per schema. |
| Regression testing & release gating | Replaying historical payloads and/or golden datasets against a new version, comparing metrics and error patterns, and blocking promotion if accuracy drops below a threshold. | Protects production from model drift and bad changes, the same way failing unit tests block a code deploy. |
How It Works (Step-by-Step)
At a high level, you:
- Define your schema and critical fields
- Create and maintain a golden dataset
- Wire Bem evals/regression runs into your CI/CD to gate releases
1. Define your schema and critical fields
First, you need something concrete to measure against. In Bem, that’s your JSON Schema: the contract your workflows must satisfy.
Example (simplified) invoice schema:
```json
{
  "$id": "https://api.your-company.com/schemas/invoice-v1",
  "type": "object",
  "required": ["invoice_number", "invoice_date", "vendor_name", "total_amount", "line_items"],
  "properties": {
    "invoice_number": { "type": "string" },
    "invoice_date": { "type": "string", "format": "date" },
    "vendor_name": { "type": "string" },
    "total_amount": { "type": "number" },
    "currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["description", "quantity", "unit_price", "line_total"],
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "line_total": { "type": "number" }
        }
      }
    }
  }
}
```
Mark critical fields (e.g., total_amount, line_items.line_total, invoice_date) as required and use tight types/enums. That’s how Bem can enforce “schema-valid output or explicit exception,” and how evals can score exactly what matters.
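To make that contract concrete, here is a minimal, stdlib-only sketch of "schema-valid output or explicit exception" on the consumer side. The field list mirrors the invoice schema above; a real setup would use a full JSON Schema validator rather than this hand-rolled check.

```python
# Minimal sketch: enforce required fields and basic types, raising
# instead of letting a malformed record flow downstream. A production
# setup would use a full JSON Schema validator; this is illustrative.

REQUIRED = ["invoice_number", "invoice_date", "vendor_name", "total_amount", "line_items"]
TYPES = {
    "invoice_number": str,
    "invoice_date": str,
    "vendor_name": str,
    "total_amount": (int, float),
    "line_items": list,
}

def validate_invoice(output: dict) -> None:
    """Raise ValueError for schema violations instead of passing bad data on."""
    missing = [f for f in REQUIRED if f not in output]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    for field, expected in TYPES.items():
        if field in output and not isinstance(output[field], expected):
            raise ValueError(
                f"{field}: expected {expected}, got {type(output[field]).__name__}"
            )

validate_invoice({
    "invoice_number": "INV-1234",
    "invoice_date": "2024-01-31",
    "vendor_name": "Acme Parts Inc.",
    "total_amount": 1532.45,
    "line_items": [],
})  # passes; dropping total_amount would raise ValueError
```

The same "required + tight types" discipline is what lets evals score exactly the fields that matter, rather than fuzzy-matching a blob.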
2. Create a golden dataset in Bem
A golden dataset is just: {input → expected_output} mapped to your schema.
You want real-world mess, not synthetic happy paths:
- Different vendors/layouts
- Scanned PDFs vs native PDFs
- Multi-doc packets (e.g., invoice + receipt + credit memo)
- Edge cases: discounts, negative lines, taxes split, multi-currency
There are two practical ways to build it in Bem.
2.1 Submit corrections from production to build goldens
If you already have a workflow in production:
1. Enable human review for low-confidence cases.
   Use Bem’s human-in-the-loop queue (Supervise layer) to route low-confidence fields or exceptions to reviewers. They correct the JSON directly in the UI.
2. Treat those corrections as ground truth.
   Each approved correction becomes a labeled example: the input document plus the correct output for your schema.
3. Promote reviewed cases into a golden dataset.
   Use Bem’s “Corrections & golden datasets” pipeline (via API or UI) to tag reviewed examples as goldens for a specific function/workflow.
This gives you a golden dataset that matches your actual traffic and failure modes. No guesswork.
2.2 Manually seed a golden dataset from historical data
If you’re still pre-production or migrating from existing systems:
1. Collect a diverse sample of historical inputs.
   Example: 200–1,000 invoices across your top 50 vendors, plus a long-tail set.
2. Generate candidate outputs.
   Run them through your current Bem function/workflow (e.g., invoice-extractor-v1) or your legacy system.
3. Have humans verify and correct.
   Either in your own tools or via Bem’s review UI. The important part: final outputs must be trusted.
4. Upload to Bem as a golden dataset.
   Through the evals API or upload flow, you store:

   ```json
   {
     "dataset_name": "invoices-golden-v1",
     "schema_id": "https://api.your-company.com/schemas/invoice-v1",
     "examples": [
       {
         "input": { "uri": "s3://your-bucket/invoices/1234.pdf" },
         "expected_output": {
           "invoice_number": "INV-1234",
           "invoice_date": "2024-01-31",
           "vendor_name": "Acme Parts Inc.",
           "total_amount": 1532.45,
           "currency": "USD",
           "line_items": [
             {
               "description": "Brake Pads",
               "quantity": 4,
               "unit_price": 55.00,
               "line_total": 220.00
             }
           ]
         }
       }
     ]
   }
   ```
Bem stores this as a reusable golden dataset tied to your schema. You can now run evals against it any time, for any function/workflow version that outputs that schema.
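If you are assembling that payload programmatically from reviewed corrections, the transformation is mechanical. A sketch (the `reviewed_cases` structure is hypothetical; the `dataset_name`/`schema_id`/`examples` shape mirrors the upload example above):

```python
import json

# Hypothetical shape for human-reviewed corrections coming out of your
# review queue. Each case pairs an input URI with its verified output.
reviewed_cases = [
    {
        "input_uri": "s3://your-bucket/invoices/1234.pdf",
        "verified_output": {
            "invoice_number": "INV-1234",
            "invoice_date": "2024-01-31",
            "vendor_name": "Acme Parts Inc.",
            "total_amount": 1532.45,
            "currency": "USD",
            "line_items": [
                {"description": "Brake Pads", "quantity": 4,
                 "unit_price": 55.00, "line_total": 220.00}
            ],
        },
    },
]

# Reshape into the golden-dataset payload shown above.
payload = {
    "dataset_name": "invoices-golden-v1",
    "schema_id": "https://api.your-company.com/schemas/invoice-v1",
    "examples": [
        {"input": {"uri": c["input_uri"]}, "expected_output": c["verified_output"]}
        for c in reviewed_cases
    ],
}

print(json.dumps(payload, indent=2))  # POST this body to the dataset upload endpoint
```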
3. Run automated evals for a new version
When you ship a new version of a function or workflow—say you go from invoice-extractor-v1 to invoice-extractor-v2—you shouldn’t be eyeballing sample outputs. You should be running evals.
Bem gives you:
- Automated Evals: Precision, Recall, F1 before promoting to production.
- Regression Testing: Re-run historical payloads against new versions to catch drift.
- /v2/functions/review: Estimate statistical confidence and human review load to hit 99.9% accuracy.
A typical eval flow:
1. Deploy the new version in “candidate” mode.
   You create invoice-extractor-v2 as a function version, but don’t make it the production alias yet.
2. Run evals against your golden dataset.
   Call Bem’s eval endpoint, passing the function version and dataset name:

   ```bash
   curl -X POST https://api.bem.ai/v2/functions/eval \
     -H "Authorization: Bearer $BEM_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "function_name": "invoice-extractor",
       "function_version": "v2",
       "dataset_name": "invoices-golden-v1",
       "metrics": ["precision", "recall", "f1"]
     }'
   ```

3. Inspect metrics per field and overall.
   The response will include something like:

   ```json
   {
     "function_name": "invoice-extractor",
     "function_version": "v2",
     "dataset_name": "invoices-golden-v1",
     "summary": { "precision": 0.987, "recall": 0.981, "f1": 0.984 },
     "by_field": {
       "invoice_number": { "precision": 0.998, "recall": 0.998, "f1": 0.998 },
       "total_amount": { "precision": 0.995, "recall": 0.995, "f1": 0.995 },
       "line_items": { "precision": 0.975, "recall": 0.962, "f1": 0.968 }
     }
   }
   ```

4. Optionally estimate human review needs.
   If you’re targeting “99.9% effective accuracy,” use:

   ```bash
   curl -X POST https://api.bem.ai/v2/functions/review \
     -H "Authorization: Bearer $BEM_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "function_name": "invoice-extractor",
       "function_version": "v2",
       "dataset_name": "invoices-golden-v1",
       "target_accuracy": 0.999
     }'
   ```

   Bem will estimate how much of your traffic needs human review to hit that target, based on per-field confidence.
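To make the per-field metrics concrete, here is one standard way to compute exact-match precision/recall/F1 for a single field across paired records, shaped like the by_field numbers above. This is an illustrative sketch; Bem’s actual scoring may use more forgiving matching (e.g., semantic comparison for dates or strings).

```python
def field_prf(expected: list, predicted: list, field: str):
    """Exact-match precision/recall/F1 for one field across paired records.

    tp: prediction present and equal to ground truth
    fp: prediction present but wrong
    fn: ground truth present but prediction missing or wrong
    """
    tp = fp = fn = 0
    for exp, pred in zip(expected, predicted):
        e, p = exp.get(field), pred.get(field)
        if p is not None and p == e:
            tp += 1
        elif p is not None:
            fp += 1
            if e is not None:
                fn += 1
        elif e is not None:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

expected = [{"total_amount": 10.0}, {"total_amount": 20.0}]
predicted = [{"total_amount": 10.0}, {"total_amount": 21.0}]
print(field_prf(expected, predicted, "total_amount"))  # (0.5, 0.5, 0.5)
```

Computing the same numbers locally on a sample of your goldens is a useful sanity check that your dataset labels and the eval report agree.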
4. Run regression testing before promotion
Evals on goldens tell you “vs ground truth, how do we perform?” Regression testing adds “vs my current production version, did I break anything?”
Because Bem treats functions as versioned primitives, you can replay historical payloads against multiple versions.
Typical pattern:
1. Replay historical payloads.
   Use the regression endpoint to run, for example, the last 10,000 production invoices through both v1 and v2:

   ```bash
   curl -X POST https://api.bem.ai/v2/functions/regression \
     -H "Authorization: Bearer $BEM_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "function_name": "invoice-extractor",
       "baseline_version": "v1",
       "candidate_version": "v2",
       "payload_source": {
         "type": "historical",
         "since": "2024-01-01T00:00:00Z"
       },
       "schema_id": "https://api.your-company.com/schemas/invoice-v1"
     }'
   ```

2. Compare metrics and diffs.
   You get metrics like:

   ```json
   {
     "summary": {
       "baseline_f1": 0.972,
       "candidate_f1": 0.984,
       "delta_f1": 0.012
     },
     "by_field": {
       "invoice_number": { "baseline_f1": 0.990, "candidate_f1": 0.993 },
       "total_amount": { "baseline_f1": 0.978, "candidate_f1": 0.997 },
       "line_items": { "baseline_f1": 0.958, "candidate_f1": 0.966 }
     },
     "regressions": [
       {
         "payload_id": "inv-44821",
         "field": "invoice_date",
         "baseline_value": "2024-01-31",
         "candidate_value": "2024-01-30",
         "difference_type": "semantic"
       }
     ]
   }
   ```

3. Investigate regressions before promotion.
   Regression entries link back to the exact inputs and outputs. You can pull them into your observability stack or open them in Bem’s surfaces to see what changed.
This is where Bem stops being “just extraction” and becomes production infrastructure: you have versioned functions, replays, diffs, metrics, and a formal go/no-go gate.
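Conceptually, a per-payload regression entry is just “the candidate got a field wrong that the baseline got right.” A sketch of that core comparison (find_regressions and its record shape are illustrative, not Bem’s API; real scoring would also handle semantic vs exact differences):

```python
def find_regressions(baseline: dict, candidate: dict, expected: dict, payload_id: str):
    """Fields where the candidate disagrees with ground truth but the baseline agreed.

    Returns records shaped like the regression report above (minus the
    difference_type classification, which is a scoring-policy decision).
    """
    out = []
    for field, truth in expected.items():
        b, c = baseline.get(field), candidate.get(field)
        if b == truth and c != truth:
            out.append({
                "payload_id": payload_id,
                "field": field,
                "baseline_value": b,
                "candidate_value": c,
            })
    return out

regs = find_regressions(
    baseline={"invoice_date": "2024-01-31", "total_amount": 1532.45},
    candidate={"invoice_date": "2024-01-30", "total_amount": 1532.45},
    expected={"invoice_date": "2024-01-31", "total_amount": 1532.45},
    payload_id="inv-44821",
)
print(regs)  # one regression, on invoice_date
```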
5. Block workflow releases when accuracy drops (CI/CD gating)
To enforce “no regressions,” you wire eval/regression calls into your deployment pipeline. Think: npm test, but for unstructured data.
Example: GitHub Actions-style pseudo-config:
```yaml
name: Deploy invoice workflow

on:
  push:
    branches: [main]
    paths:
      - "workflows/invoices/**"

jobs:
  eval-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Run Bem evals
        run: |
          RESPONSE=$(curl -s -X POST https://api.bem.ai/v2/functions/eval \
            -H "Authorization: Bearer $BEM_API_KEY" \
            -H "Content-Type: application/json" \
            -d '{
              "function_name": "invoice-extractor",
              "function_version": "v2",
              "dataset_name": "invoices-golden-v1",
              "metrics": ["precision", "recall", "f1"]
            }')
          F1=$(echo "$RESPONSE" | jq '.summary.f1')
          echo "F1 Score: $F1"

          MIN_F1=0.98
          if (( $(echo "$F1 < $MIN_F1" | bc -l) )); then
            echo "F1 below threshold ($MIN_F1). Failing build."
            exit 1
          fi

      - name: Run Bem regression
        run: |
          REG=$(curl -s -X POST https://api.bem.ai/v2/functions/regression \
            -H "Authorization: Bearer $BEM_API_KEY" \
            -H "Content-Type: application/json" \
            -d '{
              "function_name": "invoice-extractor",
              "baseline_version": "v1",
              "candidate_version": "v2",
              "payload_source": {
                "type": "historical",
                "since": "2024-01-01T00:00:00Z"
              }
            }')
          DELTA=$(echo "$REG" | jq '.summary.delta_f1')
          echo "Delta F1 vs baseline: $DELTA"

          MIN_DELTA=-0.002  # don't allow more than a 0.2% drop
          if (( $(echo "$DELTA < $MIN_DELTA" | bc -l) )); then
            echo "Regression exceeds allowed delta. Failing build."
            exit 1
          fi

      - name: Promote Bem function version
        if: success()
        run: |
          curl -X POST https://api.bem.ai/v2/functions/promote \
            -H "Authorization: Bearer $BEM_API_KEY" \
            -H "Content-Type: application/json" \
            -d '{
              "function_name": "invoice-extractor",
              "function_version": "v2"
            }'
```
Mechanically:
- Your CI pipeline runs evals on the golden dataset and regression on historical payloads.
- It parses F1 and delta F1 from the Bem responses.
- If metrics fall below thresholds, the job exits non-zero. The version does not get promoted.
- If metrics pass, you call a promote action (or update your workflow alias) and v2 becomes the production version.
That’s how you “block a workflow release if accuracy drops” in a deterministic, reproducible way.
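If you prefer to keep the go/no-go decision out of shell, the same gate can live in a small script your pipeline calls after fetching the eval and regression responses. The thresholds here are the illustrative ones from the pseudo-config above, not recommended defaults:

```python
def should_promote(summary_f1: float, delta_f1: float,
                   min_f1: float = 0.98, min_delta: float = -0.002) -> bool:
    """Block promotion if absolute F1 is below the floor, or if the drop
    versus the baseline version exceeds the allowed delta."""
    return summary_f1 >= min_f1 and delta_f1 >= min_delta

print(should_promote(0.984, 0.012))   # candidate beats baseline: promote
print(should_promote(0.975, 0.003))   # absolute F1 below floor: block
print(should_promote(0.984, -0.005))  # regressed too far vs baseline: block
```

Your CI step would exit non-zero when this returns False, exactly like the shell version; per-field thresholds (e.g., a zero-tolerance rule on total_amount) slot in as extra conjuncts.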
Common Mistakes to Avoid
- Mistake 1: Tiny, unrepresentative golden datasets.
  If your golden dataset is 20 clean PDFs from one vendor, your evals will lie to you. Include the hard stuff: scans, partial uploads, multi-doc packets, edge-case layouts. Aim for coverage of real production variance, not just “something to pass CI.”
- Mistake 2: Single global metric with no field weighting.
  A global F1 bump that hides a total_amount regression is worse than useless. In Bem, look at by_field metrics and set thresholds per critical field (e.g., don’t accept any drop on totals/dates, allow minor movement on descriptions).
Real-World Example
A fleet management platform needed to ingest invoices from hundreds of vendors into their ERP. Early on, they had “good enough” demo accuracy, but production was brutal: new layouts, handwritten notes, totals that didn’t match line items.
They did three things:
- Defined a strict invoice schema with required totals, line items, and date formats.
- Built a golden dataset from 1,500 historical invoices plus ongoing human-reviewed corrections inside Bem.
- Wired Bem evals/regression into CI so every new extraction tweak had to beat v1 on F1, and could not regress on totals or dates.
Result: in their own words, “Totals including line items, they were 100% accurate” in production. When they tuned for new vendor types or added enrichment against their vendor master list, evals made it obvious whether changes helped or hurt. Releases became boring. Invoices just entered themselves.
Pro Tip: Don’t wait for a “perfect” golden dataset. Start with 50–100 high-impact examples, wire evals into CI with conservative thresholds, and then continuously promote new reviewed cases into your dataset. Your evals will get better as your traffic and edge cases grow.
Summary
Bem evals and regression testing give you a way to treat unstructured-data accuracy like software quality:
- You define a schema and build golden datasets from real inputs plus human-verified outputs.
- You run automated evals (Precision, Recall, F1) on every new function/workflow version.
- You replay historical payloads to catch drift and regressions.
- You wire these checks into CI/CD so a workflow release literally cannot ship if accuracy drops below your thresholds.
No more shipping on vibes. Just versioned functions, measurable accuracy, and release gates that keep production stable while you iterate.