
Bem vs Unstructured for email thread + attachment ingestion: which one preserves thread context and produces a single structured intake record?
Most teams discover the email problem the hard way. The user forwards a messy thread with inline replies, CCs, and a stack of PDFs, images, and spreadsheets. Your “document AI” happily parses each attachment in isolation, but your product logic needs something completely different: a single, structured intake record that preserves the full thread context, links every attachment to that conversation, and exposes all of it as clean JSON your system can trust.
Quick Answer: Unstructured is excellent at normalizing email content and attachments into machine-readable pieces, but it largely stops at that boundary: per-email, per-attachment outputs that you still have to stitch into an intake record. Bem is built to treat an entire thread plus attachments as one event, route and split it deterministically, and return a single schema-enforced JSON object (or an explicit exception) that encodes both thread context and attachment data. If the goal is a unified intake record that your ERP, claims system, or ticketing backend can ingest without glue code, Bem is the more opinionated fit.
Why This Matters
If you’re handling support tickets, claims packets, onboarding flows, or vendor emails, the “unit of work” is not a PDF or a single email—it’s a conversation plus its artifacts. Lose that context and you lose the signal your downstream system actually cares about:
- Who said what, and when?
- Which attachment belongs to which request?
- What’s the latest instruction or correction in the thread?
- Is this a new intake, or an update to an existing case?
Per-email or per-attachment tools force you to re-implement that logic in your own codebase. You’re joining on subjects, guessing on message IDs, and writing brittle heuristics to decide which document “wins.” That’s where systems break in production.
A production layer should do the opposite: take the chaos (email MIME, nested replies, PDFs, images, forwarded threads) and emit a single, typed object that already encodes the conversation, the attachments, and the derived fields your business logic wants.
Key Benefits:
- Preserved thread context: Bem keeps the entire conversation—participants, timestamps, message order, quoted text—linked to the attachments, so you can model “one case, many messages” without reconstruction hacks.
- Single schema-enforced intake record: Instead of dozens of disconnected outputs, Bem returns one JSON object that aligns to your intake schema (case, claim, ticket, order) with strict types, enums, and required fields.
- Deterministic workflows, not best-effort heuristics: Bem workflows let you explicitly define how to route, split, transform, and join email + attachment data, with versioning, idempotency, per-field confidence, and exception routing for edge cases.
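To make "strict types, enums, and required fields" concrete, here is a minimal sketch of what an intake schema could look like if you modeled it in plain Python. The class and field names (`IntakeRecord`, `Priority`, `doc_type`, and so on) are illustrative assumptions, not Bem's schema format.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(str, Enum):
    NORMAL = "normal"
    HIGH = "high"


@dataclass
class Message:
    message_id: str
    sender: str
    sent_at: str  # ISO 8601 timestamp
    text: str


@dataclass
class Attachment:
    message_id: str  # ID of the message this attachment arrived on
    doc_type: str
    parsed: dict
    confidence: dict


@dataclass
class IntakeRecord:
    case_id: str
    thread_id: str
    subject: str
    messages: list
    attachments: list
    priority: Priority = Priority.NORMAL

    def __post_init__(self):
        # Coerce strings into the enum; values outside it raise ValueError.
        self.priority = Priority(self.priority)
        # Every attachment must point at a message that is part of the record.
        known = {m.message_id for m in self.messages}
        orphans = [a.doc_type for a in self.attachments if a.message_id not in known]
        if orphans:
            raise ValueError(f"attachments without a parent message: {orphans}")
```

The point of the sketch is the shape: attachments carry a reference to their parent message, and invalid values fail loudly at construction time instead of leaking downstream.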
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Thread-aware ingestion | Treating a full email thread (all messages + attachments) as one event rather than independent documents. | Lets you build case/claim/ticket records that match reality, not “a bag of PDFs.” You keep intent, corrections, and latest state. |
| Single intake record | A schema-enforced JSON object that represents the entire intake (conversation + artifacts + derived fields) as your backend needs it. | Eliminates glue code and ad-hoc joins. You can sync directly into ERP/CRM/claims systems as a first-class object. |
| Deterministic unstructured pipelines | Explicit workflows that Route, Split, Transform, Join, Enrich, and Validate unstructured inputs with versioning and idempotency. | You can debug, roll back, and evolve behavior safely instead of relying on hidden heuristics or agent “magic.” |
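Idempotency is easier to reason about with an example. One common way to get it (an assumption for illustration, not a description of how Bem implements it) is to derive a key from the thread's stable identifiers plus the workflow version, so reprocessing the same thread with the same workflow maps to the same key. The function name `idempotency_key` is hypothetical.

```python
import hashlib
import json


def idempotency_key(thread_id: str, message_ids: list, workflow_version: str) -> str:
    """Derive a stable key from the thread contents and the workflow version."""
    canonical = json.dumps(
        {
            "thread_id": thread_id,
            # Sort so that delivery order does not change the key.
            "message_ids": sorted(message_ids),
            "workflow_version": workflow_version,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A new reply in the thread or a new workflow version produces a different key, which is usually what you want: the intake record should be recomputed in both cases.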
How It Works (Step-by-Step)
At a high level, the difference between Unstructured and Bem for email + attachment ingestion looks like this:
- Unstructured: “Break the email into parts; you own the pipeline.”
- Bem: “Treat the entire email event as input; we give you a workflow engine to turn it into one intake record.”
Here’s how a typical Bem workflow handles an email thread with attachments as a single, context-preserving pipeline.
1. Ingest the raw email event
You start by sending the full email event into Bem: a MIME payload from your mail handler (SES, SendGrid, Gmail API, O365), a webhook from your existing email infrastructure, or a pre-normalized JSON envelope.
- Input is one object: `thread_id`, full message history, headers, bodies, attachments (raw bytes, content types, filenames).
- Bem treats this as the atomic event for the workflow.
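If you are building the pre-normalized JSON envelope yourself, a rough sketch using only Python's standard `email` package might look like the following. The envelope shape and the `build_envelope` helper are assumptions for illustration; the source does not specify the exact fields Bem expects.

```python
from email import message_from_bytes, policy
from email.utils import getaddresses


def _header(msg, name):
    """Return a header as a plain string, or None if it is absent."""
    value = msg[name]
    return str(value).strip() if value is not None else None


def build_envelope(raw_messages: list, thread_id: str) -> dict:
    """Collapse the raw MIME messages of one thread into a single envelope."""
    messages = []
    for raw in raw_messages:
        msg = message_from_bytes(raw, policy=policy.default)
        body = msg.get_body(preferencelist=("plain", "html"))
        attachments = [
            {
                "filename": part.get_filename(),
                "content_type": part.get_content_type(),
                "size_bytes": len(part.get_payload(decode=True) or b""),
            }
            for part in msg.iter_attachments()
        ]
        messages.append(
            {
                "message_id": _header(msg, "Message-ID"),
                "in_reply_to": _header(msg, "In-Reply-To"),
                "from": _header(msg, "From"),
                "to": [addr for _, addr in getaddresses(msg.get_all("To", []))],
                "date": _header(msg, "Date"),
                "subject": _header(msg, "Subject"),
                "body": body.get_content() if body is not None else "",
                "attachments": attachments,
            }
        )
    subject = messages[0]["subject"] if messages else None
    return {"thread_id": thread_id, "subject": subject, "messages": messages}
```

In a real pipeline you would also carry the attachment bytes (or a storage reference to them) alongside the metadata; they are left out here to keep the sketch short.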
2. Route, split, and normalize the thread
A Bem workflow decomposes the event without losing the linkage:
- Route: Decide which workflow to run based on heuristics or model outputs (e.g., mail sent to “claims@yourcompany.com” goes to the Claims Intake workflow).
- Split: Extract individual messages and attachments, but retain parent IDs, message IDs, and thread IDs.
- Transform: Normalize each piece:
  - Parse body text (HTML/plain).
  - Strip or detect quoted history versus the latest message.
  - Normalize sender/recipient structure.
  - Map attachments (PDF, DOCX, images, XLSX) into typed “document objects” ready for downstream extraction.
Every child object carries references back to the thread and the parent message, so no context is thrown away.
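The split and transform steps above can be sketched in a few lines of Python. This is not Bem's implementation: `Child`, `split_thread`, and `latest_text` are hypothetical names, and the quote detection is a deliberately simple heuristic (lines starting with `>` or an "On ... wrote:" header).

```python
import re
from dataclasses import dataclass

QUOTE_HEADER = re.compile(r"^On .+ wrote:$")


@dataclass
class Child:
    kind: str        # "message" or "attachment"
    thread_id: str   # back-reference to the thread
    message_id: str  # the message itself, or the attachment's parent message
    payload: dict


def latest_text(body: str) -> str:
    """Keep only the text above the first quoted line."""
    kept = []
    for line in body.splitlines():
        if line.startswith(">") or QUOTE_HEADER.match(line.strip()):
            break
        kept.append(line)
    return "\n".join(kept).strip()


def split_thread(envelope: dict) -> list:
    """Split a thread into child objects that keep their parent references."""
    children = []
    for msg in envelope["messages"]:
        children.append(
            Child(
                "message",
                envelope["thread_id"],
                msg["message_id"],
                {"from": msg["from"], "text": latest_text(msg["body"])},
            )
        )
        for att in msg.get("attachments", []):
            children.append(
                Child("attachment", envelope["thread_id"], msg["message_id"], att)
            )
    return children
```

Because every `Child` carries both `thread_id` and `message_id`, a later join step can reassemble the thread without guessing which attachment belongs to which message.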
3. Extract, enrich, and join into a single intake record
Now the workflow assembles exactly the object your backend expects:
- Transform/Enrich per attachment. For example:
  - Invoices → line items, totals, vendor IDs, PO numbers.
  - IDs or forms → extracted fields, document type classification.
  - Screenshots/images → OCR + classification.
- Join: Combine:
  - Thread-level metadata (subject, participants, timestamps).
  - Message-level text (latest user ask, corrections, approvals).
  - Attachment-level structured data (invoice JSON, claim form JSON, KYC docs).
- Payload Shaping: Use a shape step (JMESPath-style) to map all of that into your intake schema, e.g.:

      {
        "case_id": "derived_or_existing_id",
        "thread": {
          "thread_id": "abc123",
          "subject": "Claim #7782: Damaged shipment",
          "participants": [...],
          "messages": [...]
        },
        "attachments": [
          { "type": "invoice", "parsed": {...}, "confidence": {...} },
          { "type": "photo", "parsed": {...}, "confidence": {...} }
        ],
        "derived_fields": {
          "claim_type": "damage",
          "priority": "high",
          "requires_human_review": true
        }
      }

- Validate: JSON Schema-based validation enforces that the final output is either:
  - Schema-valid (all required fields present, correct types), or
  - Explicitly marked as an exception with reasons (missing fields, low confidence, conflicting data).
Result: one structured intake record that encodes the entire thread and all attachments with typed, queryable fields.
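The "valid or explicit exception" contract in the Validate step is worth seeing in code. The sketch below is a hand-rolled stand-in for JSON Schema validation, using only the standard library; the `validate_intake` name, the required-field table, and the 0.8 confidence threshold are all assumptions for illustration.

```python
REQUIRED = {
    "case_id": str,
    "thread": dict,
    "attachments": list,
    "derived_fields": dict,
}
MIN_CONFIDENCE = 0.8  # assumed threshold; tune per field in practice


def validate_intake(record: dict) -> dict:
    """Return the record marked valid, or an exception with explicit reasons."""
    reasons = []
    for name, expected in REQUIRED.items():
        if name not in record:
            reasons.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            reasons.append(f"wrong type for {name}: expected {expected.__name__}")

    attachments = record.get("attachments")
    if isinstance(attachments, list):
        for i, att in enumerate(attachments):
            for field_name, score in att.get("confidence", {}).items():
                if score < MIN_CONFIDENCE:
                    reasons.append(
                        f"low confidence: attachments[{i}].{field_name}={score}"
                    )

    if reasons:
        return {"status": "exception", "reasons": reasons, "record": record}
    return {"status": "valid", "record": record}
```

The important property is that there is no third outcome: a record never passes through silently with a missing field or a low-confidence value.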
Common Mistakes to Avoid
- Treating emails and attachments as separate “documents”: If you point any extraction tool (Unstructured, an LLM, or classic OCR) directly at attachments without modeling the email event, you’re forced to reconstruct context later. Use a workflow that treats the thread as the root object and attachments as children.
- Relying on subject-line heuristics to group messages: “Same subject → same case” will fail the moment a user forwards a thread elsewhere, changes the subject, or uses a template. In Bem, use stable identifiers (message IDs, thread IDs, your own case IDs) as first-class fields and define routing/merging logic declaratively in the workflow.
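To see why stable identifiers beat subject lines, here is a small sketch that groups messages using the standard `In-Reply-To` and `References` headers rather than the subject. The input dict shape and the `group_threads` name are assumptions; this is generic threading logic, not a Bem API.

```python
def group_threads(messages: list) -> dict:
    """Map each root Message-ID to the IDs of all messages in its thread."""
    parent = {}
    for m in messages:
        refs = m.get("references") or []
        # The last entry in References is the direct parent; fall back to In-Reply-To.
        parent[m["message_id"]] = refs[-1] if refs else m.get("in_reply_to")

    def root(message_id):
        seen = set()
        # Walk up parent pointers; the seen-set guards against malformed cycles.
        while parent.get(message_id) and message_id not in seen:
            seen.add(message_id)
            message_id = parent[message_id]
        return message_id

    threads = {}
    for m in messages:
        threads.setdefault(root(m["message_id"]), []).append(m["message_id"])
    return threads
```

A reply whose subject was rewritten still lands in the right thread, and an unrelated message that happens to reuse a subject does not.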
Real-World Example
Imagine a logistics company handling damage claims via email. A typical thread looks like:
- User sends: “Shipment 99421 arrived damaged” with photos attached.
- Support replies asking for the invoice and packing list.
- User forwards an older thread from the vendor with the invoice PDF, plus attaches a handwritten dock receipt.
- A manager later replies in-thread: “Please expedite this claim, VIP customer.”
With a per-document approach (typical of Unstructured-style pipelines as implemented by most teams):
- Each email body, PDF, image, and dock receipt gets parsed independently.
- You write custom code to decide:
  - Which invoice belongs to this claim.
  - Which message text is the “main description.”
  - Whether the manager’s “expedite” reply should update priority.
- You glue it together using subject lines, bare email addresses, and best-effort parsing of quoted text.
In a Bem workflow:
- The mail handler pushes the full thread (messages + attachments) into Bem as a single event.
- The workflow:
  - Classifies the event as a “Damage Claim Intake.”
  - Splits messages and attachments but maintains explicit references.
  - Extracts structured data from:
    - Invoice PDF (line items, totals, currency, vendor).
    - Dock receipt (handwritten fields).
    - Photos (presence of damage, packaging state, maybe via a classifier).
  - Parses the latest user message body to get free-text description and claim reason.
  - Detects and parses the manager’s “expedite” reply as an internal note that sets `priority = "high"`.
- The Join + Payload Shaping step emits a single `claim` object with:
  - `claim_id`, `shipment_id`, `customer_id`
  - Full `thread` and `messages[]` array with roles (customer, support, internal)
  - `documents[]` with typed entries for invoice, receipt, photos
  - `priority = high`, `status = new`, `requires_approval = true`
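As a rough illustration of that final join, the sketch below assembles a claim from an already-split thread. Everything here is an assumption made for the example: the `build_claim` name, the `role` and `text` keys, the claim ID format, and especially the keyword check for "expedite", which stands in for whatever classification a real workflow would use.

```python
import re

SHIPMENT_PATTERN = re.compile(r"[Ss]hipment\s+(\d+)")


def build_claim(thread: dict, documents: list) -> dict:
    """Join thread messages and parsed documents into one claim object."""
    messages = thread["messages"]
    customer_msgs = [m for m in messages if m["role"] == "customer"]

    # Take the first shipment number mentioned by the customer, if any.
    shipment_id = None
    for m in customer_msgs:
        match = SHIPMENT_PATTERN.search(m["text"])
        if match:
            shipment_id = match.group(1)
            break

    # An internal note asking to expedite raises the priority.
    expedite = any(
        m["role"] == "internal" and "expedite" in m["text"].lower()
        for m in messages
    )

    return {
        "claim_id": f"claim-{thread['thread_id']}",
        "shipment_id": shipment_id,
        "description": customer_msgs[-1]["text"] if customer_msgs else "",
        "thread": thread,
        "documents": documents,
        "priority": "high" if expedite else "normal",
        "status": "new",
    }
```

Note that the manager's reply changes `priority` without ever being treated as the claim description, which is exactly the kind of role-aware logic that is hard to recover once messages have been parsed in isolation.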
That JSON is what your claims system ingests. No manual stitching. No subject-line hacks. One record, complete context.
Pro Tip: When designing your email intake with Bem, start from the schema you want your backend to see (case, claim, ticket) and work backwards. Model `thread`, `messages`, and `attachments` as explicit sub-objects in that schema, then use Bem’s workflow steps (Route, Split, Transform, Enrich, Join, Validate) to fill that schema deterministically. It’s much easier than trying to reconstruct context after you’ve already atomized the data.
Summary
If your use case is “turn raw emails and attachments into normalized pieces I’ll handle later,” Unstructured can do that job. But if your real unit of work is an intake record—a claim, ticket, case, or onboarding packet—then you need infrastructure that:
- Treats the entire email thread plus attachments as one event.
- Preserves and encodes thread context (who said what, when, and why).
- Produces a single, schema-enforced JSON object with clear exceptions instead of silent failures.
That’s the gap Bem is built to fill. It’s not “another parser.” It’s a production layer for unstructured data that gives you deterministic workflows, versioning, idempotency, and review surfaces so email-based operations can run without fragile heuristics or glue code.