Best API to extract fields from PDFs/images into schema-validated JSON (not just OCR text + bounding boxes)

Most teams don’t actually want “OCR for PDFs and images.” They want a single API that turns whatever they throw at it (scans, photos, crumpled receipts, vendor packs) into clean, schema-validated JSON their systems can trust. Not text blobs. Not bounding boxes. Actual fields that pass business rules.

Quick Answer: The best API for extracting fields from PDFs and images into schema-validated JSON is one that acts as a production layer, not just an OCR engine. In practice, that means an API like Bem: you send any unstructured file, specify a workflow, and get back strictly typed JSON (or an explicit exception) with per-field confidence, hallucination checks, and built-in evaluation and versioning.

Why This Matters

If you only solve “OCR,” you still own 80% of the work: field mapping, validation, enrichment, error handling, and downstream integrations. That’s why so many “AI extraction” projects stall between demo and production—LLM wrappers hallucinate, per-page tools don’t understand packets, and your team spends months building glue code to normalize outputs into something an ERP, claims system, or ledger can accept.

A schema-validated JSON API changes that dynamic. Instead of parsing layouts, you define the structure you want, and the API enforces it on every response. It either returns valid JSON or raises a clear exception you can route to a human. No silent failures. No brittle regex forests.

Key Benefits:

  • Outcome-based instead of page-based: You pay per function call, not per page or token, so a 44‑page PDF, a single-page receipt, or a WhatsApp thread all look the same from a billing and integration standpoint.
  • Schema-valid outputs by design: The system enforces your JSON Schema, per-field types, enums, and required fields, catching issues before they hit your core systems.
  • Production-ready from day one: You get workflows, idempotency, versioning, evals, and review surfaces out of the box, so you’re not rebuilding the same operational layer around yet another OCR/LLM API.

Core Concepts & Key Points

Concept | Definition | Why it's important
--- | --- | ---
Schema-validated JSON | Structured output that must conform to a predefined JSON Schema: types, required fields, enums, formats. | Turns stochastic LLM behavior into deterministic contracts: your ERP gets either valid data or a clear exception. No “almost valid” payloads sneaking into production.
Production extraction workflow | A composable sequence of steps (Route, Split, Transform, Enrich, Validate) that runs on any unstructured input and returns a normalized JSON payload. | Moves you from fragile “prompt + parse” scripts to versioned, testable pipelines with branching logic and state. This is what keeps AP, claims, and logistics running when vendors change layouts.
Per-call, multi-modal API | A single endpoint that accepts PDFs, images, audio/video, emails, and chats, and returns one coherent JSON response. | Eliminates per-source pipelines and per-page billing. You can support mixed packets (PDF + JPG + .eml) and omnichannel inputs without re-architecting every time.

How It Works (Step-by-Step)

At a high level, a production-grade API for PDFs/images → schema-validated JSON should work like this:

  1. You define the schema and workflow.
    You start by codifying what “good” looks like: a JSON Schema for your document type (e.g., an invoice, bill of lading, claim packet) and a workflow that describes how to get there—Route, Split, Transform, Join, Enrich, Validate, and Payload Shaping.

    Example (invoice schema sketch):

    {
      "type": "object",
      "required": ["invoiceNumber", "invoiceDate", "currency", "lineItems", "totalAmount"],
      "properties": {
        "invoiceNumber": { "type": "string" },
        "invoiceDate": { "type": "string", "format": "date" },
        "currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] },
        "vendorId": { "type": "string" },
        "lineItems": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["description", "quantity", "unitPrice", "lineTotal"],
            "properties": {
              "description": { "type": "string" },
              "quantity": { "type": "number" },
              "unitPrice": { "type": "number" },
              "lineTotal": { "type": "number" }
            }
          }
        },
        "totalAmount": { "type": "number" }
      }
    }
    

    In Bem, this schema directly drives both validation and the operator UI (“Surface”) used for review.
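To make "the schema drives validation" concrete, here is a minimal Python sketch that checks a candidate invoice against the schema sketch above. The `validate` helper is hypothetical and covers only the keywords that schema uses (`type`, `required`, `properties`, `items`, `enum`); it ignores `format` and says nothing about how Bem validates internally.

```python
# Hypothetical mini-validator for the subset of JSON Schema used in the
# invoice sketch. It ignores "format" and every other keyword.
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoiceNumber", "invoiceDate", "currency", "lineItems", "totalAmount"],
    "properties": {
        "invoiceNumber": {"type": "string"},
        "invoiceDate": {"type": "string", "format": "date"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "vendorId": {"type": "string"},
        "lineItems": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["description", "quantity", "unitPrice", "lineTotal"],
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unitPrice": {"type": "number"},
                    "lineTotal": {"type": "number"},
                },
            },
        },
        "totalAmount": {"type": "number"},
    },
}

TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def validate(value, schema, path="$"):
    """Return a list of violations; an empty list means the value conforms."""
    expected = schema.get("type")
    check = TYPE_CHECKS.get(expected)
    if check and not check(value):
        return [f"{path}: expected {expected}, got {type(value).__name__}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} is not one of {schema['enum']}")
    if expected == "object":
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: required field is missing")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], sub_schema, f"{path}.{name}"))
    if expected == "array" and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors


good_invoice = {
    "invoiceNumber": "INV-10293",
    "invoiceDate": "2025-03-31",
    "currency": "USD",
    "lineItems": [
        {"description": "Brake pads", "quantity": 4, "unitPrice": 75.5, "lineTotal": 302.0}
    ],
    "totalAmount": 302.0,
}
# Same invoice with three problems: bad currency, string quantity, no total.
bad_invoice = {
    "invoiceNumber": "INV-10293",
    "invoiceDate": "2025-03-31",
    "currency": "JPY",
    "lineItems": [
        {"description": "Brake pads", "quantity": "4", "unitPrice": 75.5, "lineTotal": 302.0}
    ],
}

for problem in validate(bad_invoice, INVOICE_SCHEMA):
    print(problem)
```

In a real integration, a full validator such as the `jsonschema` package would normally replace this helper; the sketch only shows the kind of failures a strict schema surfaces before data reaches an ERP.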

  2. You call a single endpoint with any unstructured input.
    With Bem, the main production call looks like this:

    curl https://api.bem.ai/v2/calls \
      --request POST \
      --header "X-Api-Key: YOUR_API_KEY" \
      --form "file=@document.pdf" \
      --form "workflowName=invoice-intake"
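For teams calling from application code rather than a shell, the same request can be assembled with Python's standard library. This is a sketch that mirrors only what the curl command shows (the URL, the `X-Api-Key` header, and the `file` and `workflowName` form fields). The function name `build_call_request` is hypothetical, the content type of the file part is an assumption, and the response format is not modeled here.

```python
import urllib.request
import uuid


def build_call_request(api_key, filename, file_bytes, workflow_name):
    """Build (but do not send) a multipart/form-data POST mirroring the curl call."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    crlf = b"\r\n"
    body = b"".join([
        # Text field: workflowName
        delimiter + crlf,
        b'Content-Disposition: form-data; name="workflowName"' + crlf + crlf,
        workflow_name.encode() + crlf,
        # File field: file (content type is an assumption for this sketch)
        delimiter + crlf,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode() + crlf,
        b"Content-Type: application/octet-stream" + crlf + crlf,
        file_bytes + crlf,
        # Closing delimiter
        delimiter + b"--" + crlf,
    ])
    return urllib.request.Request(
        "https://api.bem.ai/v2/calls",
        data=body,
        method="POST",
        headers={
            "X-Api-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes stand in for the contents of document.pdf.
request = build_call_request("YOUR_API_KEY", "document.pdf", b"%PDF-1.7 ...", "invoice-intake")
print(request.get_method(), request.full_url, len(request.data), "bytes")
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the multipart encoding easy to unit-test without network access.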
    

    Or you can forward an email directly to your workflow (e.g., ap-intake@workflow.bem.ai) and let it handle the attachments, body, and thread history in one go.

    Under the hood, Bem:

    • Detects document type and structure (Route).
    • Splits mixed packets (e.g., cover + invoice + proof-of-delivery) (Split).
    • Runs deterministic extraction, not “best guess” agents (Transform).
    • Joins related pieces (e.g., header → line items → totals) (Join).
    • Enriches against your own Collections (vendors, GL codes, product SKUs) with match confidence (Enrich).
    • Validates against your JSON Schema with per-field confidence and hallucination checks (Validate).
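The idea of a workflow as an ordered chain of small steps can be sketched in a few lines of Python. Everything here is a toy built on assumptions: the stage functions, the keyword-based routing, and the regex are illustrative stand-ins, not Bem's implementation, and the Join and Enrich steps are left out for brevity.

```python
import re


def route(state):
    # Toy classifier: label each page by a keyword in its text.
    labels = ["invoice" if "invoice" in page.lower() else "other" for page in state["pages"]]
    return {**state, "labels": labels}


def split(state):
    # Keep only the pages that were routed as invoices.
    invoice_pages = [
        page for page, label in zip(state["pages"], state["labels"]) if label == "invoice"
    ]
    return {**state, "invoice_pages": invoice_pages}


def transform(state):
    # Toy extraction: pull an invoice number with a fixed pattern.
    match = re.search(r"INV-\d+", " ".join(state["invoice_pages"]))
    return {**state, "fields": {"invoiceNumber": match.group(0) if match else None}}


def validate(state):
    # Raise an explicit exception marker instead of passing on partial data.
    missing = [name for name, value in state["fields"].items() if value is None]
    return {**state, "exception": f"missing fields: {missing}" if missing else None}


def run_workflow(state, stages):
    """Run stages in order, stopping as soon as one records an exception."""
    for stage in stages:
        state = stage(state)
        if state.get("exception"):
            break
    return state


STAGES = [route, split, transform, validate]
packet = {"pages": ["Cover sheet", "Invoice INV-10293 total 302.00", "Proof of delivery"]}
result = run_workflow(packet, STAGES)
print(result["fields"], result["exception"])
```

Each stage takes and returns plain state, so individual steps can be tested and versioned independently of the rest of the chain.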
  3. You receive schema-valid JSON or a flagged exception.
    The response is either:

    • Schema-conforming JSON with per-field confidence scores, hallucination detection flags, and enrichments; or
    • An explicit exception if it cannot meet your schema or falls below your configured confidence thresholds.

    A successful payload might look like this, with each schema field wrapped in an object that carries its value and confidence:

    {
      "invoiceNumber": {
        "value": "INV-10293",
        "confidence": 0.99,
        "hallucinationRisk": 0.01
      },
      "invoiceDate": {
        "value": "2025-03-31",
        "confidence": 0.97
      },
      "currency": {
        "value": "USD",
        "confidence": 0.99
      },
      "vendorId": {
        "value": "V-2837",
        "confidence": 0.96,
        "matchedFromCollection": "vendors"
      },
      "lineItems": [
        {
          "description": { "value": "Brake pads", "confidence": 0.98 },
          "quantity": { "value": 4, "confidence": 0.99 },
          "unitPrice": { "value": 75.5, "confidence": 0.97 },
          "lineTotal": { "value": 302.0, "confidence": 0.97 }
        }
      ],
      "totalAmount": {
        "value": 302.0,
        "confidence": 0.98
      },
      "_bemMeta": {
        "workflowVersion": "invoice-intake@v12",
        "evalRunId": "eval_abc123",
        "schemaValid": true
      }
    }
    

    If something’s off (e.g., totals don’t add up, or confidence drops below your threshold), Bem routes the case into a review Surface. An operator fixes the fields directly in the schema-shaped UI, and those corrections feed back into training and regression tests.
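On the consuming side, the confidence-threshold rule can be expressed as a short walk over the payload. The sketch below assumes the response shape from the example payload above, where each field is an object with `value` and `confidence` keys and metadata keys start with an underscore. The helper names and the "accept"/"review" labels are hypothetical.

```python
def low_confidence_fields(node, threshold, path=""):
    """Yield (path, confidence) for every wrapped field below the threshold."""
    if isinstance(node, dict):
        if "value" in node and "confidence" in node:
            if node["confidence"] < threshold:
                yield path, node["confidence"]
            return
        for key, child in node.items():
            if key.startswith("_"):  # skip metadata such as _bemMeta
                continue
            child_path = f"{path}.{key}" if path else key
            yield from low_confidence_fields(child, threshold, child_path)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from low_confidence_fields(child, threshold, f"{path}[{index}]")


def decide(payload, threshold):
    """Return ("review", flagged_fields) or ("accept", [])."""
    flagged = list(low_confidence_fields(payload, threshold))
    return ("review", flagged) if flagged else ("accept", [])


# Trimmed version of the example payload.
payload = {
    "invoiceNumber": {"value": "INV-10293", "confidence": 0.99},
    "vendorId": {"value": "V-2837", "confidence": 0.96, "matchedFromCollection": "vendors"},
    "lineItems": [
        {
            "quantity": {"value": 4, "confidence": 0.99},
            "unitPrice": {"value": 75.5, "confidence": 0.97},
        }
    ],
    "_bemMeta": {"schemaValid": True},
}

print(decide(payload, threshold=0.9))
print(decide(payload, threshold=0.97))
```

With a threshold of 0.9 nothing is flagged; raising it to 0.97 flags only `vendorId`, because the comparison is strictly "below threshold".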

Common Mistakes to Avoid

  • Treating “OCR + LLM” as the final product:
    Many teams glue together an OCR engine, a general LLM API, and a regex layer and call it done. It demos well but fails under real vendor variability, mixed packets, and volume.

    How to avoid it: Anchor your system around JSON Schema and explicit workflows. Use OCR/LLMs as primitives, not as your “architecture.” Look for an API that enforces schema-valid output and gives you per-field confidence and hallucination detection.

  • Pricing and architecture tied to pages, not outcomes:
    If your extraction stack charges per page/token and returns only text and boxes, you’ll constantly re-justify costs and rebuild pipelines as the use case grows.

    How to avoid it: Prefer per-call, source-agnostic pricing where PDFs, images, audio, and email all go through the same workflow. You should be paying for “one call in, one schema-valid JSON out,” not for how many pages the PDF happened to have.

Real-World Example

A fleet management platform needed to extract line items and totals from vendor invoices, maintenance reports, and driver-submitted photos. They tried the standard path:

  • Per-page OCR to get text.
  • A hosted LLM to “parse” invoices into a JSON-like blob.
  • A regex/post-processing layer to map outputs into their ERP schema.

It looked fine on 20 test documents. It collapsed at 20,000:

  • New vendors broke the prompt-based parser.
  • Handwritten notes on crumpled receipts weren’t captured reliably.
  • The system occasionally “fixed” totals incorrectly—no way to know until accounting caught the discrepancy.
  • Every workflow tweak meant redeploying brittle scripts across services.

They switched to Bem:

  • Defined strict schemas for invoices, maintenance reports, and multipage packets.
  • Built a single workflow (fleet-docs-intake) that Routes between types, Splits mixed packets, Transforms and Enriches against their “vendors” and “assets” Collections, and Validates totals (header vs line items).
  • Configured confidence thresholds: any field under 0.9, or any mismatch between summed line items and the total, is auto-routed to a review Surface.
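The totals rule described above (summed line items must match the header total) is simple arithmetic, and a Python sketch makes the check explicit. It assumes plain, unwrapped values and a one-cent tolerance, and it adds a per-line check (quantity times unit price against the line total) that the source does not mention. `reconcile` is a hypothetical helper, not part of any API.

```python
from decimal import Decimal


def to_decimal(number):
    # Go through str() so floats like 75.5 do not bring binary rounding noise.
    return Decimal(str(number))


def reconcile(invoice, tolerance=Decimal("0.01")):
    """Return a list of arithmetic mismatches; an empty list means totals agree."""
    issues = []
    running_sum = Decimal("0")
    for index, item in enumerate(invoice["lineItems"]):
        line_total = to_decimal(item["lineTotal"])
        expected = to_decimal(item["quantity"]) * to_decimal(item["unitPrice"])
        if abs(expected - line_total) > tolerance:
            issues.append(
                f"lineItems[{index}]: quantity x unitPrice = {expected}, lineTotal = {line_total}"
            )
        running_sum += line_total
    total = to_decimal(invoice["totalAmount"])
    if abs(running_sum - total) > tolerance:
        issues.append(f"totalAmount = {total}, sum of line totals = {running_sum}")
    return issues


invoice = {
    "lineItems": [{"quantity": 4, "unitPrice": 75.5, "lineTotal": 302.0}],
    "totalAmount": 302.0,
}
print(reconcile(invoice))                              # no mismatches
print(reconcile({**invoice, "totalAmount": 320.0}))    # header total disagrees
```

Any non-empty result is the kind of mismatch that would send the document to review instead of into the ERP.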

Outcome:

  • They moved from “we think this is 90%+ accurate” to audited F1 scores and 98.4% passing evals on their golden datasets.
  • Totals—including line items—hit 100% accuracy on their production sample, with the remaining edge cases clearly flagged for review.
  • The ops team stopped opening PDFs. Documents now “enter themselves” into their ERP.

Pro Tip: Before you commit to any PDF/image extraction API, assemble a real golden dataset—mixed vendors, ugly scans, handwritten notes, and edge cases—and require evals with F1 scores and regression tests. If the provider can’t version workflows and show you eval runs over time, you’re buying a demo, not a system.
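As a rough illustration of what a field-level eval over a golden dataset computes, here is a small Python sketch. The scoring convention is an assumption (a wrong value counts as both a false positive and a false negative, a missing value as a false negative only), and the document IDs, function names, and baseline are made up for the example.

```python
def field_scores(golden, predicted):
    """Compute field-level precision, recall, and F1 against a golden dataset."""
    tp = fp = fn = 0
    for doc_id, expected in golden.items():
        got = predicted.get(doc_id, {})
        for field, truth in expected.items():
            value = got.get(field)
            if value is None:
                fn += 1          # field was not extracted
            elif value == truth:
                tp += 1          # extracted and correct
            else:
                fp += 1          # extracted but wrong...
                fn += 1          # ...and the true value was missed
        # Extracted fields that the golden record does not contain.
        fp += sum(1 for f, v in got.items() if f not in expected and v is not None)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def passes_regression(scores, baseline_f1):
    """A new workflow version should not score below the recorded baseline."""
    return scores["f1"] >= baseline_f1


golden = {
    "doc-1": {"invoiceNumber": "INV-10293", "totalAmount": 302.0},
    "doc-2": {"invoiceNumber": "INV-10294", "totalAmount": 120.0},
}
predicted = {
    "doc-1": {"invoiceNumber": "INV-10293", "totalAmount": 302.0},
    "doc-2": {"invoiceNumber": "INV-10294", "totalAmount": 210.0},  # digits transposed
}

scores = field_scores(golden, predicted)
print(scores, passes_regression(scores, baseline_f1=0.9))
```

Here three of four fields are correct and one is wrong, so precision, recall, and F1 all come out at 0.75, which fails a 0.9 baseline. Running the same comparison on every workflow version is what turns a golden dataset into a regression test.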

Summary

If your goal is to extract fields from PDFs and images into schema-validated JSON, the “best API” isn’t the one with the prettiest OCR demo. It’s the one that:

  • Accepts any unstructured input (PDFs, images, audio, email threads) in a single call.
  • Enforces your JSON Schema on every response, with strict typing, enums, and per-field confidence and hallucination checks.
  • Ships with workflows, evals, versioning, idempotent execution, and human review surfaces so you can run real production workloads—not just prototypes.

Bem was built specifically for this pattern: a production layer that turns messy inputs into schema-valid, enriched, and validated JSON, with explicit exceptions when it can’t. No agents. No vibes. Just deterministic pipelines that keep your operation running.
