How do I set up Bem to split mixed packets (like shipping packets) and run different extraction functions per document type?

Quick Answer: Use a workflow that treats a “packet” as a single input, then explicitly Split, Route, and Transform. In Bem, you ingest the shipping packet once, automatically split it into individual documents, classify each one, and run a dedicated extraction function per document type, all under one versioned, idempotent workflow.

Why This Matters

Shipping packets and other “mixed packets” are where demo-grade extraction breaks. One PDF might contain a Bill of Lading, commercial invoice, packing list, and customs forms—each with different schemas, business rules, and downstream systems. If you can’t reliably split and route those pieces, you get bad totals, missing line items, and manual re-keying.

Bem is built for exactly this failure mode. Instead of treating every page as a separate problem, you treat the packet as a single event and define a deterministic pipeline: split, classify, extract into strict schemas, enrich, validate, and sync. One workflow. Many document types. Schema-valid JSON or explicit exceptions.

Key Benefits:

  • Single upload, full packet coverage: Ingest one shipping packet and get structured JSON for every document inside—BOL, invoice, packing list, certs—plus clear exceptions where something doesn’t pass validation.
  • Deterministic routing per document type: Use Route and Transform functions to send each split document to its own typed schema and extraction logic, with per-field confidence and hallucination detection.
  • Production-grade operations: Idempotent execution, versioned workflows, and operator “Surfaces” for review mean you can run millions of packets weekly without glue-code rebuilds every time a vendor layout changes.

Core Concepts & Key Points

  • Mixed packet workflow
    • Definition: A Bem workflow that takes a multi-document input (e.g., 50-page shipping packet) and orchestrates Split → Route → Extract → Enrich → Validate → Sync.
    • Why it's important: You stop writing ad-hoc scripts per vendor; you define one pipeline that handles arbitrary packets deterministically.
  • Split + Route primitives
    • Definition: Workflow steps that first segment the packet into individual documents, then classify each segment (BOL, invoice, cert, etc.) to choose the correct downstream function.
    • Why it's important: This is how you get “one file in, many documents out” without brittle template matching or manual pre-sorting.
  • Per-document extraction functions
    • Definition: Dedicated Transform functions with strict schemas for each type (e.g., bol_extraction, invoice_extraction).
    • Why it's important: Lets you enforce schema validity, apply business rules, and track accuracy per document type with evals and regression tests.

How It Works (Step-by-Step)

At a high level, you’re building a workflow that does this:

  1. Treat the shipping packet as the input.
  2. Split it into individual documents.
  3. Classify each document.
  4. Run the right extraction function for each class.
  5. Enrich, validate, and sync results to your system of record.

In Bem terms, that’s a workflow composed of functions like Split, Route, Transform, Enrich, and Validate, all versioned and idempotent.

Below is a concrete way to set this up.

1. Define your schemas per document type

Start by deciding what “correct” looks like for each document. For example:

  • Bill of Lading (BOL) schema
    • shipper_name, consignee_name, bol_number, vessel_name, port_of_loading, port_of_discharge, container_numbers[], seal_numbers[], gross_weight, packages_count, incoterms, etc.
  • Commercial Invoice schema
    • invoice_number, invoice_date, seller_name, buyer_name, currency, total_amount, line_items[], po_number, payment_terms, etc.
  • Packing List schema
    • packing_list_number, shipment_reference, cartons_count, gross_weight, net_weight, line_items[], etc.
  • Certificate/Customs schema
    • certificate_type, issuer, issue_date, reference_numbers[], country_of_origin, etc.

Represent these as JSON Schemas in Bem. Schema enforcement is the point:

  • If the output matches the schema, it passes.
  • If it doesn’t, Bem flags the exception instead of guessing.
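
To make the pass/fail behavior concrete, here is a minimal sketch of a schema check for the commercial invoice. It is a hand-rolled check in plain Python, not Bem's validator; the INVOICE_SCHEMA subset, the field types, and the schema_errors helper are assumptions made for illustration.

```python
from typing import Any

# Illustrative subset of the commercial invoice fields listed above.
# The shape follows JSON Schema conventions; the field types are assumptions.
INVOICE_SCHEMA: dict[str, Any] = {
    "type": "object",
    "required": ["invoice_number", "invoice_date", "currency", "total_amount", "line_items"],
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "currency": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {"type": "array"},
        "po_number": {"type": "string"},
    },
}

# Map JSON Schema type names to Python types for this minimal check.
_PY_TYPES = {"string": str, "number": (int, float), "array": list, "object": dict}


def schema_errors(doc: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the document passes."""
    errors = []
    for field in schema["required"]:
        if doc.get(field) is None:
            errors.append(f"{field}: missing")
    for field, rule in schema["properties"].items():
        value = doc.get(field)
        if value is None:
            continue  # absence is handled by the required check above
        expected = _PY_TYPES[rule["type"]]
        # No field here is boolean, and bool is a subclass of int in Python,
        # so booleans are rejected outright rather than passing as numbers.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{field}: expected {rule['type']}")
    return errors
```

An extraction that returns total_amount as the string "100" fails this check with "total_amount: expected number", which is the kind of output that should become an exception rather than a guess.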

2. Build extraction functions per document type

For each schema, create a Transform function that takes an unstructured document (or its extracted text + layout) and outputs schema-valid JSON.

Example function names:

  • bol_extraction_v1
  • invoice_extraction_v3
  • packing_list_extraction_v2
  • certificate_extraction_v1

Each function:

  • Is versioned (_v1, _v2, etc.) so you can roll forward/back with confidence.
  • Emits per-field confidence and hallucination detection signals.
  • Can be evaluated against golden datasets (F1 scores, pass rates) to track accuracy over time.
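
As a rough illustration of what a per-field F1 score measures, the sketch below compares extracted values against a golden dataset for one field. The field_f1 helper and its counting rules (a wrong value counts as both a false positive and a false negative) are assumptions for illustration, not a description of how Bem's evals are computed.

```python
from typing import Any


def field_f1(predictions: list[dict[str, Any]], golden: list[dict[str, Any]], field: str) -> float:
    """F1 for one field across paired (prediction, golden) documents."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, golden):
        p, g = pred.get(field), gold.get(field)
        if p is not None and p == g:
            tp += 1  # extracted value matches the golden value
        else:
            if p is not None:
                fp += 1  # a value was extracted, but it is wrong or unexpected
            if g is not None:
                fn += 1  # the golden value was missed or extracted incorrectly
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Running this per field and per function version gives you a number to compare before and after an upgrade, for example bol_number under bol_extraction_v1 versus a candidate v2.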

3. Create a Split step for mixed packets

In the workflow editor or via API, add a Split function as the first step after ingest:

  • Input: the raw packet (e.g., sample-bol.pdf, 4 pages, “International shipping packet”).
  • Output: an array of documents[], where each item has:
    • Page range or region metadata
    • Extracted text/layout
    • A pointer back to the original packet (for auditability)

Bem already knows how to split typical logistics packets—BOL, invoices, PLs, certs—based on structure and patterns, not just page counts. You don’t maintain brittle page heuristics.

High-level behavior:

{
  "packet_id": "pkt_123",
  "documents": [
    { "doc_id": "doc_1", "pages": [1, 2], "content": "...", "type": null },
    { "doc_id": "doc_2", "pages": [3],   "content": "...", "type": null },
    { "doc_id": "doc_3", "pages": [4],   "content": "...", "type": null }
  ]
}
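
If you want a sanity check on split results in your own tests, one option is to confirm that every page of the packet lands in exactly one document. The sketch below assumes the output shape shown above, with "pages" as a list of 1-based page numbers; the split_issues helper is hypothetical.

```python
from collections import Counter
from typing import Any


def split_issues(split_output: dict[str, Any], page_count: int) -> list[str]:
    """Report pages that are unassigned, assigned twice, or out of range."""
    seen = Counter(
        page for doc in split_output["documents"] for page in doc["pages"]
    )
    issues = []
    for page in range(1, page_count + 1):
        if seen[page] == 0:
            issues.append(f"page {page} is not assigned to any document")
        elif seen[page] > 1:
            issues.append(f"page {page} is assigned to {seen[page]} documents")
    for page in sorted(seen):
        if page < 1 or page > page_count:
            issues.append(f"page {page} is outside the packet")
    return issues
```

For the 4-page example above, the check returns an empty list: pages 1 and 2 belong to doc_1, page 3 to doc_2, and page 4 to doc_3.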

4. Add a Route step to classify each split document

Next, add a Route function that runs per doc emitted by the Split step and decides “what is this?”

The classifier will label each doc as one of:

  • bill_of_lading
  • commercial_invoice
  • packing_list
  • certificate
  • other / unknown

Example conceptual output:

{
  "doc_id": "doc_2",
  "predicted_type": "commercial_invoice",
  "confidence": 0.98
}

Use confidence thresholds:

  • confidence >= 0.9 → route automatically to the matching extraction function.
  • 0.7 <= confidence < 0.9 → route to extraction, but mark for review.
  • confidence < 0.7 → route to a human review Surface first.

That routing logic is baked into the workflow, not a side script. Deterministic, auditable.
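
The three confidence bands can be expressed as a small pure function. This is a sketch of the decision logic only, using the thresholds above; the Action names and the route_action helper are hypothetical and do not reflect Bem's configuration syntax.

```python
from enum import Enum


class Action(Enum):
    AUTO_EXTRACT = "auto_extract"              # confidence >= 0.9
    EXTRACT_AND_REVIEW = "extract_and_review"  # 0.7 <= confidence < 0.9
    REVIEW_FIRST = "review_first"              # confidence < 0.7


AUTO_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.7


def route_action(confidence: float) -> Action:
    """Map a classifier confidence to one of the three routing outcomes."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= AUTO_THRESHOLD:
        return Action.AUTO_EXTRACT
    if confidence >= REVIEW_THRESHOLD:
        return Action.EXTRACT_AND_REVIEW
    return Action.REVIEW_FIRST
```

Keeping the thresholds as named constants makes the boundaries explicit: a document at exactly 0.9 is routed automatically, and one at exactly 0.7 is extracted but marked for review.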

5. Map each route to a specific extraction function

Now wire the classifier outputs to the appropriate Transform functions:

  • If predicted_type == "bill_of_lading" → call bol_extraction_v1
  • If predicted_type == "commercial_invoice" → call invoice_extraction_v3
  • If predicted_type == "packing_list" → call packing_list_extraction_v2
  • If predicted_type == "certificate" → call certificate_extraction_v1
  • Else → send to an exception path (e.g., “Unclassified Document Surface”)

In pseudo-workflow:

# Pseudo-workflow: key names are illustrative, not exact configuration syntax.
steps:
  - name: split_packet
    type: Split

  - name: classify_docs
    type: Route
    for_each: split_packet.documents

  - name: extract_bol
    type: Transform
    function: bol_extraction_v1
    when: classify_docs.predicted_type == "bill_of_lading"

  - name: extract_invoice
    type: Transform
    function: invoice_extraction_v3
    when: classify_docs.predicted_type == "commercial_invoice"

  - name: extract_packing_list
    type: Transform
    function: packing_list_extraction_v2
    when: classify_docs.predicted_type == "packing_list"

  - name: extract_certificate
    type: Transform
    function: certificate_extraction_v1
    when: classify_docs.predicted_type == "certificate"

Bem executes these in parallel per document where possible. One packet in, multiple extraction calls out—but all under a single workflow call and trace.
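
Conceptually, the type-to-function mapping is a dispatch table with a fallback for unknown types. The sketch below models that idea in plain Python with stand-in extractors and a thread pool for the per-document parallelism; it is an illustration of the pattern, not Bem's execution engine, and every callable in it is a stub.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

Doc = dict[str, Any]


def _stub(function_name: str) -> Callable[[Doc], Doc]:
    """Stand-in for a real extraction call; only records which function ran."""
    def run(doc: Doc) -> Doc:
        return {"doc_id": doc["doc_id"], "function": function_name}
    return run


# Function names come from the mapping above; the callables are placeholders.
EXTRACTORS: dict[str, Callable[[Doc], Doc]] = {
    "bill_of_lading": _stub("bol_extraction_v1"),
    "commercial_invoice": _stub("invoice_extraction_v3"),
    "packing_list": _stub("packing_list_extraction_v2"),
    "certificate": _stub("certificate_extraction_v1"),
}


def handle_unclassified(doc: Doc) -> Doc:
    """Exception path for types with no extractor."""
    return {"doc_id": doc["doc_id"], "function": None, "exception": "unclassified"}


def dispatch(classified_docs: list[Doc]) -> list[Doc]:
    """Run the matching extractor for each document, in parallel."""
    def run_one(doc: Doc) -> Doc:
        extractor = EXTRACTORS.get(doc["predicted_type"], handle_unclassified)
        return extractor(doc)

    with ThreadPoolExecutor(max_workers=4) as pool:
        # Executor.map returns results in input order, so results line up with docs.
        return list(pool.map(run_one, classified_docs))
```

The fallback matters as much as the table: a type the classifier has never seen should land on the exception path, not in the nearest extractor.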

6. Enrich against your own master data

Once each document is extracted into schema-valid JSON, add Enrich steps to match against your systems:

  • Match shipper_name and consignee_name against your vendor/customer Collections.
  • Map port_of_loading / port_of_discharge to your internal port codes.
  • Map SKU or material_number from invoices/packing lists to your product master.

Each enrichment returns the chosen match and its confidence:

{
  "shipper_name_raw": "ACME LOGISTICS LLC",
  "shipper_id": "VEND-2398",
  "shipper_match_confidence": 0.97
}

Low-confidence matches can be routed to a review Surface, corrected, and fed back into training.
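
To show the general shape of a name-matching enrichment, here is a minimal sketch using Python's difflib for string similarity. The VENDORS table, the normalize rules, and the match_shipper helper are assumptions; the score is a raw similarity ratio, not a calibrated confidence, and this is not how Bem's Enrich step is implemented.

```python
from difflib import SequenceMatcher
from typing import Any

# Hypothetical vendor master data: internal ID -> canonical name.
VENDORS = {
    "VEND-2398": "Acme Logistics LLC",
    "VEND-1107": "Apex Freight Inc",
}


def normalize(name: str) -> str:
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def match_shipper(raw_name: str, vendors: dict[str, str] = VENDORS) -> dict[str, Any]:
    """Return the best-scoring vendor ID and its similarity score."""
    best_id, best_score = None, 0.0
    for vendor_id, vendor_name in vendors.items():
        score = SequenceMatcher(None, normalize(raw_name), normalize(vendor_name)).ratio()
        if score > best_score:
            best_id, best_score = vendor_id, score
    return {
        "shipper_name_raw": raw_name,
        "shipper_id": best_id,
        "shipper_match_confidence": round(best_score, 2),
    }
```

Whatever matcher you use, the output contract is the useful part: the raw value, the chosen ID, and a score you can threshold to decide what goes to review.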

7. Validate against business rules

Add a Validate step that enforces your own invariants:

  • BOL total gross weight vs. sum of packing list weights.
  • Invoice currency vs. allowed currencies for that customer.
  • Duplicate invoice detection against your ERP.
  • Required fields: bol_number, invoice_number, container_numbers[] must be non-null.

Example validation outcomes:

{
  "document_type": "commercial_invoice",
  "valid": false,
  "errors": [
    { "field": "invoice_number", "issue": "missing" },
    { "field": "total_amount", "issue": "mismatch_with_packing_list" }
  ]
}

If valid == false, the workflow doesn’t silently push bad data downstream. It flags an exception, sends it to a Surface, and you can track this as part of your operational metrics.
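
A cross-document rule such as the BOL weight check can be sketched as a function over the extracted JSON. The validate_bol helper, the one-unit tolerance, and the assumption that all weights share the same unit are illustrative choices, not Bem's Validate implementation.

```python
from typing import Any


def validate_bol(
    bol: dict[str, Any],
    packing_lists: list[dict[str, Any]],
    tolerance: float = 1.0,
) -> dict[str, Any]:
    """Check required BOL fields and compare gross weight to the packing lists."""
    errors = []
    # Required fields must be present and non-empty.
    for field in ("bol_number", "container_numbers"):
        if not bol.get(field):
            errors.append({"field": field, "issue": "missing"})

    bol_weight = bol.get("gross_weight")
    if bol_weight is None:
        errors.append({"field": "gross_weight", "issue": "missing"})
    else:
        # Assumes every weight is expressed in the same unit.
        packing_total = sum((pl.get("gross_weight") or 0.0) for pl in packing_lists)
        if abs(bol_weight - packing_total) > tolerance:
            errors.append({"field": "gross_weight", "issue": "mismatch_with_packing_list"})

    return {"document_type": "bill_of_lading", "valid": not errors, "errors": errors}
```

The result has the same shape as the validation outcome above, so a failed rule and a failed schema check can flow to the same exception path.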

8. Sync into your ERP or TMS

Finally, add a Sync step to push results into your systems via REST, webhooks, or polling:

  • AP invoices → SAP / NetSuite / Dynamics
  • Shipment data → TMS / WMS
  • Container and seal numbers → internal logistics DB

Because everything is schema-validated, you’re not writing “defensive” glue code around messy extraction. You’re consuming a stable JSON contract.

Important production behaviors:

  • Idempotent sync: if the same packet is reprocessed, the sync does not create a second invoice or shipment downstream.
  • Versioned workflows: you can upgrade extraction functions (e.g., invoice_extraction_v3 → v4) and roll back instantly if an eval shows regression.
  • 99.99% uptime & auditability: every mixed packet has a trace showing when it entered, how it was split, each route decision, each extraction, and the final sync.
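
If your receiving system also needs to guard against duplicates, a common pattern is an idempotency key derived from the packet content and the document identity. The sketch below is a consumer-side illustration with an in-memory log; the idempotency_key and SyncLog names are hypothetical, and the source does not describe how Bem implements idempotency internally.

```python
import hashlib
from typing import Callable


def idempotency_key(packet_bytes: bytes, doc_id: str) -> str:
    """Derive a stable key from the packet content and the document ID."""
    digest = hashlib.sha256()
    digest.update(doc_id.encode("utf-8"))
    digest.update(b"\x00")  # separator so the ID and content cannot run together
    digest.update(packet_bytes)
    return digest.hexdigest()


class SyncLog:
    """In-memory record of keys already pushed; a real system would persist this."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def sync_once(self, key: str, push: Callable[[], None]) -> bool:
        """Call push() only the first time a key is seen; return whether it ran."""
        if key in self._seen:
            return False
        push()
        self._seen.add(key)  # recorded only after push() succeeds
        return True
```

Because the key depends only on the packet bytes and the document ID, reprocessing the same packet yields the same key and the second push is skipped.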

Common Mistakes to Avoid

  • Treating each page as a separate document:
    This is how you end up losing cross-document relationships and making brittle assumptions about page order. Instead, treat the packet as the unit of work and let Split + Route handle boundaries and types.

  • Skipping schema and validation:
    Letting “whatever the model returns” flow straight into your ERP is how hallucinations become journal entries. Define strict per-type schemas, use Bem’s schema enforcement, and add business-rule validation steps before sync.

  • Hard-coding vendor-specific templates:
    Template logic per carrier or vendor might work for the first dozen, then collapses under real-world variability. Use classifier-based routing and extraction functions that generalize across layouts, with evals to measure coverage.

Real-World Example

A logistics team receives one 50-page PDF per shipment, a bundled packet containing:

  • 1x Bill of Lading
  • 2x Commercial Invoices
  • 1x Packing List
  • 1x Certificate of Origin
  • Several pages of “noise” (emails, instructions)

They wire up a Bem workflow:

  1. Ingest: The packet arrives by email and is forwarded to a Bem ingestion endpoint.
  2. Split: Bem splits it into eight logical documents.
  3. Route: The classifier identifies 1 BOL, 2 invoices, 1 packing list, 1 certificate, and 3 “other.”
  4. Extract:
    • The BOL runs through bol_extraction_v1, yielding bol_number, vessel_name, ports, containers, seals, gross_weight.
    • The invoices run through invoice_extraction_v3, each with line items and totals.
    • The packing list and certificate go through their respective functions.
  5. Enrich & Validate:
    • Shipper and consignee are matched to internal vendor/customer IDs.
    • Line items are matched to SKUs.
    • Weights and totals are cross-checked between BOL, invoices, and packing list.
  6. Sync: Valid documents are synced to SAP and the TMS. One invoice with a currency mismatch is flagged and sent to a review Surface.

Net result: millions of documents processed weekly, with deterministic behavior. No one is copy-pasting container numbers at 11 p.m.

Pro Tip: Start with a small golden dataset of 50–100 real packets and wire Bem’s evals to that workflow. Track F1 scores per document type and per field (e.g., bol_number, total_amount, container_numbers[]) before routing anything into production. Treat accuracy like code coverage, not a vibe.

Summary

To handle mixed packets like shipping packets in Bem, you don’t stitch together ad-hoc scripts. You define a single, versioned workflow that:

  • Ingests the packet as a whole.
  • Splits it into individual documents.
  • Routes each document by type.
  • Runs type-specific extraction functions into strict schemas.
  • Enriches and validates against your own systems.
  • Syncs only schema-valid, confidence-annotated JSON downstream, with exceptions routed to humans.

Agents guess. Templates crack. A Split → Route → Extract → Enrich → Validate → Sync workflow gives you a deterministic production layer for unstructured logistics data.
