
How do I set up Bem to split mixed packets (like shipping packets) and run different extraction functions per document type?
Quick Answer: You set up Bem to handle mixed packets by building a workflow that (1) ingests the packet, (2) automatically splits and classifies each embedded document, and (3) routes each type (BOL, invoice, cert, etc.) to its own extraction function and schema. One file in, multiple typed JSON outputs out—plus per-field confidence, hallucination detection, and exception handling.
Why This Matters
Mixed packets are where “demo AI” dies. A 40‑page shipping packet with a BOL, three invoices, a packing list, and a certificate of origin will happily pass through a per-page OCR API—but your ERP still can’t reconcile containers, match invoices, or post to the GL. You don’t need text. You need correctly typed, schema-valid JSON per document type, every time, at scale.
Bem is designed for exactly this: one workflow per packet that keeps working as vendors, layouts, and edge cases change. Instead of building a parser for every layout, you compose deterministic steps—Split, Route, Transform, Enrich, Validate—then ship once and run millions of documents through the same pipeline.
Key Benefits:
- Single pipeline for messy packets: Ingest 50-page shipping packets and get clean, separated JSON for each document type with one call.
- Deterministic routing, not guessing: Route docs to the right extraction function and schema using explicit classification and branching logic.
- Production‑grade safety rails: Enforce schema validity, per-field confidence thresholds, hallucination detection, and exception routing from day one.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Mixed packet handling | Treating a multi‑document PDF (e.g., shipping packet) as one input that is automatically split into individual documents. | You keep your integration simple—one file, one workflow—while Bem does the heavy lifting of separating BOLs, invoices, and certs. |
| Route and Split primitives | Workflow steps that first split the packet into component documents, then classify each document into a type (BOL, Invoice, PL, Cert, etc.). | This is how you avoid brittle per-page logic and manual document sorting; routing logic lives in one place and is versioned. |
| Per-type extraction functions | Individual functions (Transform steps) tuned for each document type, each with its own JSON Schema and evals. | You get schema-valid JSON per doc type, per-field confidence, and the ability to iterate and roll back without touching the whole pipeline. |
How It Works (Step-by-Step)
At a high level, you build a Bem workflow that takes a single packet file and returns a structured payload of typed documents. Under the hood, it’s just composable primitives:
- Ingest the packet (one input, one call).
- Split & classify into documents.
- Route each document type to its own extraction function, schema, and post-processing.
Below is how to set this up concretely.
1. Define your target schemas per document type
Before routing anything, decide what “done” looks like. For a typical shipping packet you might have:
- `bill_of_lading.schema.json`
- `commercial_invoice.schema.json`
- `packing_list.schema.json`
- `certificate_of_origin.schema.json`
- (Optionally) `payment_receipt.schema.json`, `release_order.schema.json`, etc.
Each schema defines the fields you actually need downstream, for example:
```json
// commercial_invoice.schema.json
{
  "$id": "CommercialInvoice",
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "invoice_date": { "type": "string", "format": "date" },
    "shipper_name": { "type": "string" },
    "consignee_name": { "type": "string" },
    "total_amount": { "type": "number" },
    "currency": { "type": "string", "enum": ["USD", "EUR", "GBP", "JPY", "CNY"] },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "hs_code": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" }
        },
        "required": ["description", "quantity", "unit_price"]
      }
    }
  },
  "required": ["invoice_number", "invoice_date", "shipper_name", "consignee_name", "total_amount", "currency"]
}
```
This is what Bem enforces. You either get JSON that passes this schema, or an explicit exception that you can route to a human.
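To make the "schema-valid or exception" contract concrete, here is a minimal client-side sketch of the kind of guarantee the schema above expresses: required fields present, correctly typed, and `currency` inside the enum. This is illustrative code for your own downstream checks, not Bem's internal validator.

```python
# Hand-rolled check mirroring the CommercialInvoice schema's "required",
# type, and enum constraints. Illustrative only; Bem enforces the real
# JSON Schema server-side.

REQUIRED = {
    "invoice_number": str,
    "invoice_date": str,
    "shipper_name": str,
    "consignee_name": str,
    "total_amount": (int, float),
    "currency": str,
}
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "JPY", "CNY"}

def check_invoice(doc: dict) -> list[str]:
    """Return a list of violations; an empty list means the payload passes."""
    errors = []
    for field, typ in REQUIRED.items():
        if field not in doc:
            errors.append(f"missing required field: {field}")
        elif not isinstance(doc[field], typ):
            errors.append(f"wrong type for {field}: {type(doc[field]).__name__}")
    if doc.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency not in enum")
    return errors
```

A document that fails this check is exactly the kind of payload you would route to a human rather than post to the GL.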
2. Create extraction functions per document type
Next, create separate functions in Bem for each doc type. Conceptually:
- `extract_bill_of_lading`
- `extract_commercial_invoice`
- `extract_packing_list`
- `extract_certificate_of_origin`
Each function:
- Takes a single “atomic” document as input (already split from the packet).
- Uses LLM/OCR under the hood to extract fields.
- Enforces your JSON Schema.
- Emits:
  - `data`: the schema-valid JSON.
  - `confidence`: per-field confidence scores.
  - `hallucination`: per-field hallucination flags / risk.
  - `metadata`: trace, version, timing.
In the workflow definition, they’ll show up as Transform steps. For example:
```json
{
  "name": "extract_commercial_invoice",
  "type": "transform",
  "schema": "CommercialInvoice",
  "config": {
    "hallucination_detection": true,
    "return_field_confidence": true
  }
}
```
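On the consuming side, the per-field confidence and hallucination outputs are what let you decide which values to trust automatically. The sketch below assumes a result shaped like the `data`/`confidence`/`hallucination` outputs listed above (the exact response shape is an assumption, not Bem's documented API) and splits fields into auto-accept versus human review.

```python
# Triage a transform-style result: accept fields with high confidence and
# no hallucination flag; send everything else to review. The result shape
# (data/confidence/hallucination keyed by field) is assumed for illustration.

def triage_fields(result: dict, min_confidence: float = 0.9):
    data = result["data"]
    confidence = result.get("confidence", {})
    hallucination = result.get("hallucination", {})
    accepted, review = {}, {}
    for field, value in data.items():
        flagged = hallucination.get(field, False)
        score = confidence.get(field, 0.0)  # unknown confidence -> review
        if flagged or score < min_confidence:
            review[field] = value
        else:
            accepted[field] = value
    return accepted, review
```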
3. Build a Split step for the mixed packet
Now define how Bem should break apart the packet.
In your workflow:
- Add an Ingest step that accepts a file (PDF/email/attachment).
- Add a Split step that:
  - Uses the packet's layout and semantics (not just page count) to separate documents.
  - Produces a list of child documents, each with its own `id` and `page_range`.
Conceptually, the output of split_shipping_packet looks like:
```json
{
  "documents": [
    { "id": "doc_1", "type": null, "pages": [1, 2] },
    { "id": "doc_2", "type": null, "pages": [3, 6] },
    { "id": "doc_3", "type": null, "pages": [7, 9] },
    { "id": "doc_4", "type": null, "pages": [10, 12] }
  ]
}
```
At this point, documents are split, but not yet labeled as BOL, invoice, etc. That’s the next step.
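A cheap sanity check worth running on any split result: every page of the packet should be covered exactly once, with no gaps and no overlaps. The sketch below works against the `{"documents": [...]}` shape shown above, treating each `pages` pair as an inclusive range.

```python
# Verify that a split result covers pages 1..total_pages exactly once.
# Ordinary client-side validation, not part of Bem's Split primitive.

def pages_covered(split_result: dict, total_pages: int) -> bool:
    seen = []
    for doc in split_result["documents"]:
        start, end = doc["pages"]          # inclusive [start, end] range
        seen.extend(range(start, end + 1))
    return sorted(seen) == list(range(1, total_pages + 1))
```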
4. Add a classification Route for each document
After splitting, you add a Route step that classifies each document into a logical type. Typical labels:
- `bill_of_lading`
- `commercial_invoice`
- `packing_list`
- `certificate_of_origin`
- `other` (for anything unexpected)
Conceptual routing configuration:
```json
{
  "name": "route_shipping_docs",
  "type": "route",
  "input": "{{ split_shipping_packet.documents }}",
  "config": {
    "routes": [
      { "label": "bill_of_lading", "condition": "doc.class == 'BOL'" },
      { "label": "commercial_invoice", "condition": "doc.class == 'INVOICE'" },
      { "label": "packing_list", "condition": "doc.class == 'PACKING_LIST'" },
      { "label": "certificate_of_origin", "condition": "doc.class == 'CERT_OF_ORIGIN'" },
      { "label": "other", "condition": "true" }
    ]
  }
}
```
Behind the scenes, Bem uses classifiers tuned for your domain. From your perspective, it’s a typed branch: each labeled route fans into a different function.
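The routing semantics are simply first-match-wins over an ordered route list, with a catch-all at the end. The sketch below mirrors the config above in plain Python, using predicate functions where Bem evaluates condition strings.

```python
# First-match routing: walk routes in order, return the first label whose
# condition holds. Mirrors the route_shipping_docs config; conditions here
# are Python predicates standing in for Bem's condition expressions.

ROUTES = [
    ("bill_of_lading",        lambda d: d.get("class") == "BOL"),
    ("commercial_invoice",    lambda d: d.get("class") == "INVOICE"),
    ("packing_list",          lambda d: d.get("class") == "PACKING_LIST"),
    ("certificate_of_origin", lambda d: d.get("class") == "CERT_OF_ORIGIN"),
    ("other",                 lambda d: True),   # catch-all, like "condition": "true"
]

def route(doc: dict) -> str:
    for label, condition in ROUTES:
        if condition(doc):
            return label
    raise ValueError("no route matched")  # unreachable with a catch-all route
```

Because `other` always matches last, an unexpected document type never silently drops out of the pipeline; it lands in a bucket you can handle explicitly.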
5. Connect per-type extraction functions via branching
Now wire the routes to the extraction functions you created earlier.
For each route:
- `bill_of_lading` → `extract_bill_of_lading`
- `commercial_invoice` → `extract_commercial_invoice`
- `packing_list` → `extract_packing_list`
- `certificate_of_origin` → `extract_certificate_of_origin`
- `other` → optional fallback handler or exception
In workflow terms:
```json
{
  "name": "shipping_packet_workflow",
  "steps": [
    { "name": "ingest_packet", "type": "ingest" },
    { "name": "split_shipping_packet", "type": "split", "input": "{{ ingest_packet.file }}" },
    { "name": "route_shipping_docs", "type": "route", "input": "{{ split_shipping_packet.documents }}" },
    {
      "name": "extract_bols",
      "type": "transform",
      "input": "{{ route_shipping_docs.bill_of_lading }}",
      "function": "extract_bill_of_lading"
    },
    {
      "name": "extract_invoices",
      "type": "transform",
      "input": "{{ route_shipping_docs.commercial_invoice }}",
      "function": "extract_commercial_invoice"
    },
    {
      "name": "extract_packing_lists",
      "type": "transform",
      "input": "{{ route_shipping_docs.packing_list }}",
      "function": "extract_packing_list"
    },
    {
      "name": "extract_certs",
      "type": "transform",
      "input": "{{ route_shipping_docs.certificate_of_origin }}",
      "function": "extract_certificate_of_origin"
    }
  ]
}
```
Everything runs in parallel where possible. One packet in, multiple typed JSON documents out.
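The fan-out pattern the workflow performs can be sketched in a few lines: each routed group runs its own extractor, and independent documents extract concurrently. The extractor functions below are stand-ins (in Bem, the Transform steps do this server-side); the point is the shape of the parallelism.

```python
# Fan out routed document groups to per-type extractors in parallel.
# EXTRACTORS and extract_stub are illustrative stand-ins for Bem's
# server-side Transform steps.

from concurrent.futures import ThreadPoolExecutor

def extract_stub(doc_type):
    def extract(doc):
        return {"type": doc_type, "id": doc["id"]}  # placeholder extraction
    return extract

EXTRACTORS = {
    "bill_of_lading": extract_stub("bill_of_lading"),
    "commercial_invoice": extract_stub("commercial_invoice"),
}

def fan_out(groups: dict) -> list:
    """groups maps route label -> list of documents; returns all results."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(EXTRACTORS[label], doc)
            for label, docs in groups.items()
            for doc in docs
        ]
        return [f.result() for f in futures]
```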
6. Enrich & Validate against your master data
Extraction is only half the work. You still need to anchor values against reality.
Use Enrich and Validate steps to:
- Match vendors/consignees to your Collections (e.g., vendor master).
- Map ports, Incoterms, and currencies to normalized codes.
- Validate totals, line-item sums, and cross-document consistency (e.g., container IDs match between BOL and packing list).
Examples:
- `enrich_invoice_vendor` → look up `shipper_name` in your vendor Collection, return `vendor_id` + `match_confidence`.
- `validate_bol_vs_packing_list` → check that container numbers and gross weight align across docs.
Conceptually:
```json
{
  "name": "enrich_invoice_vendor",
  "type": "enrich",
  "input": "{{ extract_invoices.data.shipper_name }}",
  "config": {
    "collection": "vendors",
    "match_on": ["legal_name", "dba_name"],
    "min_match_confidence": 0.85
  }
}
```
If match confidence is low, you can route to a human review Surface rather than silently guessing.
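The cross-document check mentioned above can be sketched directly: container numbers must agree between the BOL and the packing list, and gross weights should match within a tolerance. Field names like `container_numbers` and `gross_weight_kg` are assumptions for illustration; use whatever your schemas define.

```python
# Cross-document consistency check in the spirit of
# validate_bol_vs_packing_list. Field names are hypothetical.

def validate_bol_vs_packing_list(bol: dict, packing_list: dict,
                                 weight_tolerance: float = 0.01) -> list[str]:
    """Return a list of issues; empty means the two documents agree."""
    issues = []
    bol_containers = set(bol.get("container_numbers", []))
    pl_containers = set(packing_list.get("container_numbers", []))
    if bol_containers != pl_containers:
        # symmetric difference = containers present in only one document
        issues.append(f"container mismatch: {sorted(bol_containers ^ pl_containers)}")
    bw = bol.get("gross_weight_kg")
    pw = packing_list.get("gross_weight_kg")
    if bw and pw and abs(bw - pw) / max(bw, pw) > weight_tolerance:
        issues.append(f"gross weight differs: {bw} vs {pw}")
    return issues
```

A non-empty issue list is an exception to route to review, not a reason to guess which document is right.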
7. Shape the final payload for your ERP/TMS
Finally, use Payload Shaping (JMESPath-style) to produce an ERP/TMS-ready object.
For example, for a single packet:
```json
{
  "bol": "{{ extract_bols.data[0] }}",
  "invoices": "{{ extract_invoices.data }}",
  "packing_lists": "{{ extract_packing_lists.data }}",
  "certificates_of_origin": "{{ extract_certs.data }}",
  "meta": {
    "packet_id": "{{ ingest_packet.id }}",
    "source": "{{ ingest_packet.source }}",
    "processed_at": "{{ workflow.executed_at }}"
  }
}
```
You then deliver this via:
- Webhook to your integration layer
- Direct REST pull (polling)
- Event into your message bus (e.g., processed_packet topic)
Common Mistakes to Avoid
- Treating each page as a separate "document".
  How to avoid it: Always split semantically, not by page count. Use Bem's Split primitive on the whole packet first, then route; don't call extraction per page and try to reassemble documents manually.
- One generic extraction function for all doc types.
  How to avoid it: Create separate functions and schemas per type. Use Route to fan out to `extract_bill_of_lading`, `extract_commercial_invoice`, etc., so you can tune, eval, and roll back each independently.
- No confidence thresholds or exception routing.
  How to avoid it: Set per-field and per-document confidence thresholds and wire low-confidence or hallucinated fields into a review Surface. "Schema-valid or exception" should be the rule.
- Burying enrichment in your own glue code.
  How to avoid it: Use Bem's Enrich and Validate steps with Collections to handle vendor matching, GL codes, and port normalization inside the workflow, not scattered throughout your codebase.
Real-World Example
A global logistics team receives mixed shipping packets as single PDFs: 30–60 pages each, containing a Bill of Lading, multiple commercial invoices, a packing list, customs declarations, and certificates. Previously, they had:
- A per-page OCR pipeline.
- Regex-based scripts trying to guess where the invoice starts and ends.
- Human operators copying container numbers and totals into their TMS and ERP.
With Bem, they built a single `shipping_packet_workflow`:
- Upload & Identify: They POST each packet to a single workflow endpoint.
- Split & Route: Bem automatically splits the packet into BOL, invoices, packing list, customs docs, and certs, and routes each to the right extraction function.
- Extract: Each doc type is extracted to its own schema, with per-field confidence and hallucination detection.
- Enrich & Validate: Shippers are matched to vendor records; ports and Incoterms are normalized; totals are validated across BOL and invoices.
- Sync: The final, shaped JSON is pushed into their TMS and ERP via webhooks.
Operationally:
- Manual keying dropped close to zero.
- Exceptions are explicit and auditable.
- They can version and roll back `extract_commercial_invoice` without touching BOL extraction.
Pro Tip: Start with a single “golden” packet and build the full workflow end-to-end—Split, Route, Extract per type, Enrich, Validate, Payload Shaping—before you think about scale. Then run that same workflow against thousands of historical packets and use evals (F1 scores per field, drift detection) to harden it before routing live traffic.
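The per-field evals the Pro Tip mentions boil down to comparing extracted values against a labeled golden set. A minimal sketch, counting a field as a true positive only on an exact match:

```python
# Per-field F1 against a golden set: zip predictions with ground truth,
# count exact matches as true positives, mismatched predictions as false
# positives, and missed/incorrect truths as false negatives.

def field_f1(predictions: list[dict], golden: list[dict], field: str) -> float:
    tp = fp = fn = 0
    for pred, truth in zip(predictions, golden):
        p, t = pred.get(field), truth.get(field)
        if p is not None and p == t:
            tp += 1
        elif p is not None:
            fp += 1
        if t is not None and p != t:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Run this per field across thousands of historical packets and the weak fields (and layout drift) show up as falling F1 long before your ERP notices.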
Summary
To set up Bem to split mixed packets like shipping packets and run different extraction functions per document type, you don’t need bespoke parsers or page-level tricks. You:
- Define strict schemas per document type.
- Build per-type extraction functions.
- Use a Split step to break the packet into atomic documents.
- Use Route to classify each document and fan out to the right function.
- Enrich and Validate against your own master data.
- Shape and sync the final payload into your ERP/TMS.
One workflow. One input per packet. Deterministic, schema-enforced JSON out—with confidence, hallucination detection, and exceptions wired in from day one.