
How do I set up Bem to split mixed packets (like shipping packets) and run different extraction functions per document type?
Quick Answer: Use a workflow that treats a “packet” as a single input, then explicitly Split, Route, and Transform. In Bem, you ingest the shipping packet once, automatically split it into individual documents, classify each one, and run a dedicated extraction function per document type, all under one versioned, idempotent workflow.
Why This Matters
Shipping packets and other “mixed packets” are where demo-grade extraction breaks. One PDF might contain a Bill of Lading, commercial invoice, packing list, and customs forms—each with different schemas, business rules, and downstream systems. If you can’t reliably split and route those pieces, you get bad totals, missing line items, and manual re-keying.
Bem is built for exactly this failure mode. Instead of treating every page as a separate problem, you treat the packet as a single event and define a deterministic pipeline: split, classify, extract into strict schemas, enrich, validate, and sync. One workflow. Many document types. Schema-valid JSON or explicit exceptions.
Key Benefits:
- Single upload, full packet coverage: Ingest one shipping packet and get structured JSON for every document inside—BOL, invoice, packing list, certs—plus clear exceptions where something doesn’t pass validation.
- Deterministic routing per document type: Use Route and Transform functions to send each split document to its own typed schema and extraction logic, with per-field confidence and hallucination detection.
- Production-grade operations: Idempotent execution, versioned workflows, and operator “Surfaces” for review mean you can run millions of packets weekly without glue-code rebuilds every time a vendor layout changes.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Mixed packet workflow | A Bem workflow that takes a multi-document input (e.g., 50-page shipping packet) and orchestrates Split → Route → Extract → Enrich → Validate → Sync. | You stop writing ad-hoc scripts per vendor; you define one pipeline that handles arbitrary packets deterministically. |
| Split + Route primitives | Workflow steps that first segment the packet into individual documents, then classify each segment (BOL, invoice, cert, etc.) to choose the correct downstream function. | This is how you get “one file in, many documents out” without brittle template matching or manual pre-sorting. |
| Per-document extraction functions | Dedicated Transform functions with strict schemas for each type (e.g., bol_extraction, invoice_extraction). | Lets you enforce schema validity, apply business rules, and track accuracy per document type with evals and regression tests. |
How It Works (Step-by-Step)
At a high level, you’re building a workflow that does this:
- Treat the shipping packet as the input.
- Split it into individual documents.
- Classify each document.
- Run the right extraction function for each class.
- Enrich, validate, and sync results to your system of record.
In Bem terms, that’s a workflow composed of functions like Split, Route, Transform, Enrich, and Validate, all versioned and idempotent.
Below is a concrete way to set this up.
1. Define your schemas per document type
Start by deciding what “correct” looks like for each document. For example:
- Bill of Lading (BOL) schema: shipper_name, consignee_name, bol_number, vessel_name, port_of_loading, port_of_discharge, container_numbers[], seal_numbers[], gross_weight, packages_count, incoterms, etc.
- Commercial Invoice schema: invoice_number, invoice_date, seller_name, buyer_name, currency, total_amount, line_items[], po_number, payment_terms, etc.
- Packing List schema: packing_list_number, shipment_reference, cartons_count, gross_weight, net_weight, line_items[], etc.
- Certificate/Customs schema: certificate_type, issuer, issue_date, reference_numbers[], country_of_origin, etc.
Represent these as JSON Schemas in Bem. Schema enforcement is the point:
- If the output matches the schema, it passes.
- If it doesn’t, Bem flags the exception instead of guessing.
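As a rough illustration of that pass-or-flag behavior, here is a minimal sketch in plain Python. It does not call Bem; `BOL_SCHEMA` and `check_against_schema` are hypothetical names, the schema is a trimmed-down stand-in for a real JSON Schema, and the check covers only required fields and basic types.

```python
from typing import Any

# Hypothetical, trimmed-down BOL schema: field name -> expected Python type.
BOL_SCHEMA: dict[str, type] = {
    "bol_number": str,
    "shipper_name": str,
    "consignee_name": str,
    "container_numbers": list,
    "packages_count": int,
}


def check_against_schema(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, expected in schema.items():
        if field not in record or record[field] is None:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors
```

A record with a non-empty error list would go to the exception path rather than downstream.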
2. Build extraction functions per document type
For each schema, create a Transform function that takes an unstructured document (or its extracted text + layout) and outputs schema-valid JSON.
Example function names:
- bol_extraction_v1
- invoice_extraction_v3
- packing_list_extraction_v2
- certificate_extraction_v1
Each function:
- Is versioned (_v1, _v2, etc.) so you can roll forward or back with confidence.
- Emits per-field confidence and hallucination detection signals.
- Can be evaluated against golden datasets (F1 scores, pass rates) to track accuracy over time.
3. Create a Split step for mixed packets
In the workflow editor or via API, add a Split function as the first step after ingest:
- Input: the raw packet (e.g., sample-bol.pdf, 4 pages, “International shipping packet”).
- Output: an array of documents[], where each item has:
  - Page range or region metadata
  - Extracted text/layout
  - A pointer back to the original packet (for auditability)
Bem already knows how to split typical logistics packets—BOL, invoices, PLs, certs—based on structure and patterns, not just page counts. You don’t maintain brittle page heuristics.
High-level behavior:
{
  "packet_id": "pkt_123",
  "documents": [
    { "doc_id": "doc_1", "pages": [1, 2], "content": "...", "type": null },
    { "doc_id": "doc_2", "pages": [3], "content": "...", "type": null },
    { "doc_id": "doc_3", "pages": [4], "content": "...", "type": null }
  ]
}
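If you consume a Split result shaped like the one above, a cheap sanity check is that every page of the packet lands in exactly one document. The sketch below assumes `pages` is an explicit list of 1-based page numbers (the example does not say whether it is a list or a range), and `check_page_coverage` is a hypothetical helper, not part of Bem.

```python
def check_page_coverage(split_output: dict, total_pages: int) -> list[str]:
    """Flag pages that are unassigned or assigned to more than one document."""
    seen: dict[int, str] = {}  # page number -> doc_id that claimed it
    problems = []
    for doc in split_output["documents"]:
        for page in doc["pages"]:
            if page in seen:
                problems.append(f"page {page} in both {seen[page]} and {doc['doc_id']}")
            seen[page] = doc["doc_id"]
    for page in range(1, total_pages + 1):
        if page not in seen:
            problems.append(f"page {page} not assigned")
    return problems
```

An empty result means the split accounts for every page once, which is useful evidence for the audit trail.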
4. Add a Route step to classify each split document
Next, add a Route function that runs per doc emitted by the Split step and decides “what is this?”
The classifier will label each doc as one of:
- bill_of_lading
- commercial_invoice
- packing_list
- certificate
- other/unknown
Example conceptual output:
{
  "doc_id": "doc_2",
  "predicted_type": "commercial_invoice",
  "confidence": 0.98
}
Use confidence thresholds:
- confidence >= 0.9 → route automatically to the matching extraction function.
- 0.7 <= confidence < 0.9 → route to extraction, but mark for review.
- confidence < 0.7 → route to a human review Surface first.
That routing logic is baked into the workflow, not a side script. Deterministic, auditable.
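To make the thresholds concrete, here is the same decision table written as plain Python. This is purely illustrative: in Bem the logic lives in the workflow definition, and the function name and return labels below are hypothetical.

```python
def routing_decision(confidence: float) -> str:
    """Map a classifier confidence to one of three routing outcomes."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= 0.9:
        return "auto_extract"  # route straight to the matching extraction function
    if confidence >= 0.7:
        return "extract_and_review"  # extract, but mark the result for review
    return "human_review_first"  # send to a review Surface before extraction
```

Keeping the boundaries in one place makes it easy to test the edge cases (exactly 0.9, exactly 0.7) before changing a threshold.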
5. Map each route to a specific extraction function
Now wire the classifier outputs to the appropriate Transform functions:
- If predicted_type == "bill_of_lading" → call bol_extraction_v1
- If predicted_type == "commercial_invoice" → call invoice_extraction_v3
- If predicted_type == "packing_list" → call packing_list_extraction_v2
- If predicted_type == "certificate" → call certificate_extraction_v1
- Else → send to an exception path (e.g., “Unclassified Document Surface”)
In pseudo-workflow:
steps:
  - name: split_packet
    type: Split
  - name: classify_docs
    type: Route
    for_each: split_packet.documents
  - name: extract_bol
    type: Transform
    when: classify_docs.predicted_type == "bill_of_lading"
  - name: extract_invoice
    type: Transform
    when: classify_docs.predicted_type == "commercial_invoice"
  - name: extract_packing_list
    type: Transform
    when: classify_docs.predicted_type == "packing_list"
  - name: extract_certificate
    type: Transform
    when: classify_docs.predicted_type == "certificate"
Bem executes these in parallel per document where possible. One packet in, multiple extraction calls out—but all under a single workflow call and trace.
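The route-to-function mapping is essentially a dispatch table with an explicit fallback. A minimal Python sketch of that idea follows; the extractors are stubs that only record which function would run, and `make_stub`, `EXTRACTORS`, and `dispatch` are hypothetical names rather than Bem APIs.

```python
from typing import Callable

Doc = dict
Extractor = Callable[[Doc], dict]


def make_stub(function_name: str) -> Extractor:
    """Build a placeholder extractor that only records which function ran."""
    def run(doc: Doc) -> dict:
        return {"doc_id": doc["doc_id"], "extracted_by": function_name}
    return run


# Route label -> extraction function, mirroring the mapping described above.
EXTRACTORS: dict[str, Extractor] = {
    "bill_of_lading": make_stub("bol_extraction_v1"),
    "commercial_invoice": make_stub("invoice_extraction_v3"),
    "packing_list": make_stub("packing_list_extraction_v2"),
    "certificate": make_stub("certificate_extraction_v1"),
}


def dispatch(doc: Doc, predicted_type: str) -> dict:
    """Run the matching extractor, or return an exception record for unknown types."""
    extractor = EXTRACTORS.get(predicted_type)
    if extractor is None:
        return {"doc_id": doc["doc_id"], "exception": "unclassified_document"}
    return extractor(doc)
```

The important design choice is that unknown labels never fall through silently; they always produce an exception record.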
6. Enrich against your own master data
Once each document is extracted into schema-valid JSON, add Enrich steps to match against your systems:
- Match shipper_name and consignee_name against your vendor/customer Collections.
- Map port_of_loading / port_of_discharge to your internal port codes.
- Map SKU or material_number from invoices/packing lists to your product master.
Each enrichment returns the chosen match along with its confidence:
{
  "shipper_name_raw": "ACME LOGISTICS LLC",
  "shipper_id": "VEND-2398",
  "shipper_match_confidence": 0.97
}
Low-confidence matches can be routed to a review Surface, corrected, and fed back into training.
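To show what "match with confidence, review when low" can look like, here is a small sketch using Python's standard-library `difflib`. It is a stand-in technique, not how Bem performs enrichment; the `VENDORS` table, the second vendor entry, `match_vendor`, and the 0.9 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical vendor master data: internal ID -> canonical name.
VENDORS = {"VEND-2398": "Acme Logistics LLC", "VEND-1001": "Apex Freight Inc"}


def match_vendor(raw_name: str, vendors: dict[str, str], threshold: float = 0.9) -> dict:
    """Pick the most similar vendor name and flag low-similarity matches for review."""
    normalized = raw_name.strip().lower()
    best_id, best_score = None, 0.0
    for vendor_id, name in vendors.items():
        score = SequenceMatcher(None, normalized, name.lower()).ratio()
        if score > best_score:
            best_id, best_score = vendor_id, score
    return {
        "shipper_name_raw": raw_name,
        "shipper_id": best_id if best_score >= threshold else None,
        "shipper_match_confidence": round(best_score, 2),
        "needs_review": best_score < threshold,
    }
```

String similarity is a crude proxy for match confidence; the point is only that the raw value, the chosen ID, and the score travel together so a reviewer can see why a match was made.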
7. Validate against business rules
Add a Validate step that enforces your own invariants:
- BOL total gross weight vs. sum of packing list weights.
- Invoice currency vs. allowed currencies for that customer.
- Duplicate invoice detection against your ERP.
- Required fields: bol_number, invoice_number, container_numbers[] must be non-null.
Example validation outcomes:
{
  "document_type": "commercial_invoice",
  "valid": false,
  "errors": [
    { "field": "invoice_number", "issue": "missing" },
    { "field": "total_amount", "issue": "mismatch_with_packing_list" }
  ]
}
If valid == false, the workflow doesn’t silently push bad data downstream. It flags an exception, sends it to a Surface, and you can track this as part of your operational metrics.
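A few of these rules are simple enough to sketch in plain Python. The function below is a hypothetical illustration, not a Bem Validate step: `validate_packet`, the 1.0 weight tolerance, and the assumption that all weights share one unit are mine, and duplicate detection against the ERP is left out because it needs a live lookup.

```python
def validate_packet(bol: dict, packing_lists: list[dict], invoice: dict,
                    allowed_currencies: set[str], tolerance: float = 1.0) -> dict:
    """Apply required-field, weight cross-check, and currency rules to one packet."""
    errors = []
    # Required fields must be present and non-empty.
    for field in ("bol_number", "container_numbers"):
        if not bol.get(field):
            errors.append({"field": field, "issue": "missing"})
    if not invoice.get("invoice_number"):
        errors.append({"field": "invoice_number", "issue": "missing"})
    # BOL gross weight should agree with the sum of packing list weights.
    packed = sum(pl["gross_weight"] for pl in packing_lists)
    if abs(bol.get("gross_weight", 0.0) - packed) > tolerance:
        errors.append({"field": "gross_weight", "issue": "mismatch_with_packing_list"})
    # Invoice currency must be on the customer's allowed list.
    if invoice.get("currency") not in allowed_currencies:
        errors.append({"field": "currency", "issue": "not_allowed"})
    return {"valid": not errors, "errors": errors}
```

The return shape mirrors the validation outcome shown above, so a failing result can be routed to a review Surface as-is.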
8. Sync into your ERP or TMS
Finally, add a Sync step to push results into your systems via REST, webhooks, or polling:
- AP invoices → SAP / NetSuite / Dynamics
- Shipment data → TMS / WMS
- Container and seal numbers → internal logistics DB
Because everything is schema-validated, you’re not writing “defensive” glue code around messy extraction. You’re consuming a stable JSON contract.
Important production behaviors:
- Idempotent sync: if the same packet is reprocessed, Bem preserves idempotency so you don’t double-book invoices or duplicate shipments.
- Versioned workflows: you can upgrade extraction functions (e.g., invoice_extraction_v3 → v4) and roll back instantly if an eval shows regression.
- 99.99% uptime & auditability: every mixed packet has a trace: when it entered, how it was split, each route decision, each extraction, and the final sync.
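If your own sync target also needs to guard against replays, one common pattern is an idempotency key derived from the packet content. The sketch below shows that pattern on the consumer side; it makes no claim about how Bem implements idempotency internally, and `idempotency_key` and `SyncLedger` are hypothetical names with an in-memory set standing in for a durable store.

```python
import hashlib
from typing import Callable


def idempotency_key(packet_bytes: bytes, workflow_version: str) -> str:
    """Derive a stable key from the packet content and the workflow version."""
    digest = hashlib.sha256()
    digest.update(workflow_version.encode("utf-8"))
    digest.update(b"\x00")  # separator so version and content cannot run together
    digest.update(packet_bytes)
    return digest.hexdigest()


class SyncLedger:
    """Remembers which keys were already pushed (in memory, for illustration)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def sync_once(self, key: str, push: Callable[[], None]) -> bool:
        """Call push() only the first time a key is seen; return True if it ran."""
        if key in self._seen:
            return False
        push()
        self._seen.add(key)
        return True
```

Including the workflow version in the key is a design choice: reprocessing under a new version produces a new key, which you may or may not want depending on whether upgraded extractions should overwrite earlier records.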
Common Mistakes to Avoid
- Treating each page as a separate document: This is how you end up losing cross-document relationships and making brittle assumptions about page order. Instead, treat the packet as the unit of work and let Split + Route handle boundaries and types.
- Skipping schema and validation: Letting “whatever the model returns” flow straight into your ERP is how hallucinations become journal entries. Define strict per-type schemas, use Bem’s schema enforcement, and add business-rule validation steps before sync.
- Hard-coding vendor-specific templates: Template logic per carrier or vendor might work for the first dozen, then collapses under real-world variability. Use classifier-based routing and extraction functions that generalize across layouts, with evals to measure coverage.
Real-World Example
A logistics team receives one 50-page PDF per shipment, a bundled packet containing:
- 1x Bill of Lading
- 2x Commercial Invoices
- 1x Packing List
- 1x Certificate of Origin
- Several pages of “noise” (emails, instructions)
They wire up a Bem workflow:
- Ingest: The packet arrives by email and is forwarded to a Bem ingestion endpoint.
- Split: Bem splits it into eight logical documents.
- Route: The classifier identifies 1 BOL, 2 invoices, 1 packing list, 1 certificate, and 3 “other.”
- Extract:
  - The BOL runs through bol_extraction_v1, yielding bol_number, vessel_name, ports, containers, seals, gross_weight.
  - The invoices run through invoice_extraction_v3, each with line items and totals.
  - The packing list and certificate go through their respective functions.
- Enrich & Validate:
  - Shipper and consignee are matched to internal vendor/customer IDs.
  - Line items are matched to SKUs.
  - Weights and totals are cross-checked between BOL, invoices, and packing list.
- Sync: Valid documents are synced to SAP and the TMS. One invoice with a currency mismatch is flagged and sent to a review Surface.
Net result: millions of documents processed weekly, with deterministic behavior. No one is copy-pasting container numbers at 11 p.m.
Pro Tip: Start with a small golden dataset of 50–100 real packets and wire Bem’s evals to that workflow. Track F1 scores per document type and per field (e.g., bol_number, total_amount, container_numbers[]) before routing anything into production. Treat accuracy like code coverage, not a vibe.
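For intuition about what a per-field F1 score measures, here is a small sketch that scores one field across a golden dataset using exact-match comparison. It is not Bem's eval implementation; `field_f1` is a hypothetical helper, and the counting convention (a wrong non-null value counts as both a false positive and a false negative) is one common choice among several.

```python
def field_f1(predictions: list[dict], golden: list[dict], field: str) -> float:
    """Exact-match F1 for one field over paired prediction and golden records."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, golden):
        p, g = pred.get(field), gold.get(field)
        if p is not None and p == g:
            tp += 1
        else:
            if p is not None:
                fp += 1  # extracted something, but it was wrong or unexpected
            if g is not None:
                fn += 1  # the true value was missed or replaced
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Running this per field and per document type gives the kind of breakdown the tip describes, so a regression in container_numbers[] is not hidden behind a strong overall average.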
Summary
To handle mixed packets like shipping packets in Bem, you don’t stitch together ad-hoc scripts. You define a single, versioned workflow that:
- Ingests the packet as a whole.
- Splits it into individual documents.
- Routes each document by type.
- Runs type-specific extraction functions into strict schemas.
- Enriches and validates against your own systems.
- Syncs only schema-valid, confidence-annotated JSON downstream, with exceptions routed to humans.
Agents guess. Templates crack. A Split → Route → Extract → Enrich → Validate → Sync workflow gives you a deterministic production layer for unstructured logistics data.