ZeroEntropy Search API: how do I ingest PDFs and turn on OCR for scanned documents, and what are the OCR page limits?

Most teams hit the same wall with document-heavy RAG: you have piles of PDFs—contracts, clinical reports, manuals—and half of them are scanned images. If your retrieval stack doesn’t handle parsing and OCR out of the box, your “search” silently skips the most important pages. With the ZeroEntropy Search API, PDF ingestion and OCR are built in, so you can ship human-level retrieval over real-world documents without bolting on yet another service.

Quick Answer: You ingest PDFs into the ZeroEntropy Search API by sending them directly to the ingestion endpoint (via file upload or URL), and you enable OCR with a flag in your request. OCR runs only on scanned or image-only pages, and each of those pages is deducted from your plan’s monthly OCR page allowance.

Frequently Asked Questions

How does PDF ingestion work with the ZeroEntropy Search API?

Short Answer: You send PDFs directly to the Search API’s ingestion endpoint, and ZeroEntropy handles parsing, chunking, and indexing for hybrid retrieval and reranking.

Expanded Explanation:
Ingestion is designed to be a one-step operation: you provide the PDF, ZeroEntropy does the rest. The ingestion pipeline parses the document, runs it through our chunking logic, and stores the resulting text and metadata in a way that’s immediately queryable via dense + sparse retrieval and zerank-2 reranking.

You don’t need a separate PDF parsing microservice, a custom chunking script, or a pre-index “ETL step.” The ingestion API normalizes inputs into the same representation used by the Search API, so your RAG or agent stack talks to a single system: query in, ranked results out, with calibrated relevance scores you can trust.

Key Takeaways:

  • Send PDFs directly to the ingestion endpoint; no pre-processing or manual chunking required.
  • Parsed content is stored for hybrid retrieval (dense + sparse) and reranked by zerank-2 with calibrated scores.
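
As a rough illustration of the "file upload or URL" choice, here is a minimal sketch that builds a request body for an ingestion call. Every field name in it (collection_name, path, content, base64_data) is an assumption made for this example, not the documented schema, so check the API reference before relying on any of them.

```python
import base64
from typing import Optional


def build_ingest_payload(
    collection: str,
    doc_path: str,
    pdf_bytes: Optional[bytes] = None,
    pdf_url: Optional[str] = None,
) -> dict:
    """Build a request body for a hypothetical PDF ingestion endpoint.

    All field names are illustrative assumptions, not the real schema.
    """
    # Exactly one source must be given: raw bytes or a URL.
    if (pdf_bytes is None) == (pdf_url is None):
        raise ValueError("provide exactly one of pdf_bytes or pdf_url")

    payload = {"collection_name": collection, "path": doc_path}
    if pdf_bytes is not None:
        # Binary files are commonly carried as base64 text inside a JSON body.
        payload["content"] = {
            "type": "file",
            "base64_data": base64.b64encode(pdf_bytes).decode("ascii"),
        }
    else:
        payload["content"] = {"type": "url", "url": pdf_url}
    return payload


# The same helper covers both ingestion paths.
upload = build_ingest_payload("contracts", "msa.pdf", pdf_bytes=b"%PDF-1.7")
by_url = build_ingest_payload(
    "contracts", "msa.pdf", pdf_url="https://example.com/msa.pdf"
)
```

Nothing in the payload describes chunking or indexing. That matches the one-step model above, where those stages happen on the server side.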

How do I enable OCR for scanned PDF documents?

Short Answer: You turn on OCR by setting an OCR flag in your ingestion request; the Search API then runs OCR only on scanned pages and charges them against your OCR page quota.

Expanded Explanation:
Scanned PDFs are basically bundles of images—traditional parsers see nothing. When you enable OCR on ingestion, ZeroEntropy automatically detects image-heavy pages and runs them through our OCR pipeline before chunking and indexing. Text-rich digital PDFs flow through the “normal” parser; only pages that need OCR consume OCR page credits.

From your side, this remains a single API call: you pass the file (or URL), set ocr: true (or the equivalent flag in the SDK), and the pipeline decides what needs OCR. That’s it—no separate OCR vendor, no post-processing, no second indexing pass.

Steps:

  1. Obtain your ZeroEntropy API key and configure the SDK or HTTP client.
  2. Call the ingestion endpoint with your PDF file or URL and enable the OCR flag in the request body.
  3. Confirm ingestion status (and optionally page counts) via the response or dashboard to ensure OCR has run on the expected pages.
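
The three steps above can be sketched as a single function. To avoid guessing at the real endpoint or SDK call, the transport is passed in as a send function. The header format, the body fields (url, ocr), and the response fields (status, ocr_pages) are all assumptions for illustration.

```python
from typing import Callable


def ingest_with_ocr(
    send: Callable[[dict, dict], dict],
    api_key: str,
    pdf_url: str,
    ocr: bool = True,
) -> dict:
    """Run the three steps with a caller-supplied transport function.

    `send(headers, body)` stands in for your HTTP client or SDK call;
    the header, body, and response field names are hypothetical.
    """
    # Step 1: authenticate with the API key.
    headers = {"Authorization": f"Bearer {api_key}"}
    # Step 2: pass the PDF and enable the OCR flag in the request body.
    body = {"url": pdf_url, "ocr": ocr}
    response = send(headers, body)
    # Step 3: confirm ingestion status before relying on the document.
    if response.get("status") != "ok":
        raise RuntimeError(f"ingestion failed: {response}")
    return response


# A fake transport makes the flow easy to exercise without network access.
def fake_send(headers: dict, body: dict) -> dict:
    return {"status": "ok", "ocr_pages": 3 if body["ocr"] else 0}


result = ingest_with_ocr(fake_send, "test-key", "https://example.com/scan.pdf")
```

Injecting the transport keeps the OCR flag and the status check in one place, whichever client library ends up making the request.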

How is OCR different from standard PDF parsing in the Search API?

Short Answer: Standard parsing extracts text from digital PDFs; OCR is an additional step that converts image-based pages in scanned PDFs into text before they ever reach the retrieval layer.

Expanded Explanation:
Digital PDFs (e.g., exported from Word, Google Docs, or most modern tools) already contain structured text. For those, ZeroEntropy simply parses the text, chunks it, and indexes it. OCR is only invoked when the PDF page doesn’t expose text—such as scanned contracts, faxed clinical notes, or old manuals.

This difference matters for both performance and cost. OCR is heavier per page than straightforward parsing, so we treat it as an optional, metered capability. You get fast, low-cost ingestion for digital PDFs, and targeted OCR for the pages that actually need it—while your RAG system sees a single, unified search surface.

Comparison Snapshot:

  • Standard parsing: Fast, low-overhead; works on digital PDFs with embedded text; does not consume OCR page credits.
  • OCR processing: Converts image-only pages to text; used for scanned/faxed/photographed documents; consumes OCR page credits.
  • Best for: Teams with a mix of digital and scanned PDFs who want one Search API that “just works” across both.
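
To make the parsing-versus-OCR split concrete, here is a simple heuristic sketch: a page whose embedded text layer is nearly empty gets routed to OCR, and everything else is parsed directly. This is not ZeroEntropy's actual detection logic. The min_chars threshold and function names are assumptions chosen for the example.

```python
from typing import List, Tuple


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when a page exposes too little embedded text to index as-is."""
    return len(extracted_text.strip()) < min_chars


def split_pages(
    page_texts: List[str], min_chars: int = 20
) -> Tuple[List[int], List[int]]:
    """Partition 1-based page numbers into (parsed_directly, sent_to_ocr)."""
    parsed: List[int] = []
    ocr: List[int] = []
    for number, text in enumerate(page_texts, start=1):
        if needs_ocr(text, min_chars):
            ocr.append(number)
        else:
            parsed.append(number)
    return parsed, ocr


# Page 1 has a text layer; pages 2 and 3 are effectively blank to a parser.
pages = [
    "This agreement is entered into by the parties listed below.",
    "",
    "   \n",
]
parsed_pages, ocr_pages = split_pages(pages)
```

Only the pages in the second list would count against an OCR allowance under the per-page model described in this section.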

What are the OCR page limits and how do they affect my ingestion?

Short Answer: Each Search API plan includes a fixed number of OCR pages per month; every page that actually goes through OCR counts as one OCR page, and once you hit the limit, new OCR requests either fail or fall back, depending on your plan configuration.

Expanded Explanation:
OCR is priced and limited at the page level, not by file size or tokens. When you ingest a PDF with OCR enabled, the pipeline counts how many pages require OCR and subtracts that from your monthly quota. For example, a 100-page PDF where 40 pages are scanned will consume 40 OCR pages.

This model gives you predictable cost behavior and avoids surprises in your token bill. You keep sending documents through the same ingestion endpoint; the only thing that changes as you scale is your OCR page allowance. If you’re regularly ingesting large, scanned corpora (e.g., legal archives, clinical scans, or old compliance files), you’ll typically step up to a plan with higher OCR page limits or work with us on an enterprise tier.

What You Need:

  • A Search API plan with an OCR page allowance that matches your monthly scanned-document volume.
  • Monitoring (via dashboard or usage endpoint) to track OCR page consumption as you ingest large or mixed-format PDF sets.
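
For capacity planning, the page-level accounting can be modeled in a few lines. The sketch below tracks usage against a monthly limit and reproduces the 100-page, 40-scanned example. The 1,000-page limit is a made-up number, and refusing a whole document when it would exceed the limit is just one possible policy, not necessarily how any given plan behaves.

```python
from dataclasses import dataclass


@dataclass
class OcrQuota:
    """Client-side model of a monthly OCR page allowance (illustrative only)."""

    monthly_limit: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return max(self.monthly_limit - self.used, 0)

    def charge(self, scanned_pages: int) -> bool:
        """Record OCR usage; return False and charge nothing if it would not fit."""
        if scanned_pages < 0:
            raise ValueError("scanned_pages must be non-negative")
        if scanned_pages > self.remaining:
            return False
        self.used += scanned_pages
        return True


# Hypothetical plan with 1,000 OCR pages per month.
quota = OcrQuota(monthly_limit=1000)

# A 100-page PDF where 40 pages are scanned consumes 40 OCR pages, not 100.
total_pages, scanned_pages = 100, 40
accepted = quota.charge(scanned_pages)
```

After this call, quota.used is 40 and quota.remaining is 960. The 60 digitally parsed pages never touch the allowance.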

How should I think about OCR and PDF ingestion strategically for RAG and agents?

Short Answer: Treat OCR and PDF ingestion as core retrieval infrastructure: turn it on once, size your OCR page limits correctly, and let the Search API feed fewer, higher-quality chunks into your LLM to cut cost and reduce hallucinations.

Expanded Explanation:
Naive RAG breaks down when retrieval can’t see half your corpus. If scanned documents, signed contracts, faxed clinical notes, or photographed manuals aren’t searchable, your agents will hallucinate or over-index on whatever digital PDFs they can find. By letting ZeroEntropy handle PDF parsing and OCR at ingestion time, you push the complexity down into the retrieval layer and keep your application logic clean.

On the cost side, the combination of OCR + calibrated reranking matters. Once OCR has unlocked the text in scanned PDFs, zerank-2 can reorder your dense + sparse candidates so the LLM only sees the most relevant chunks. That means fewer tokens per answer, better NDCG@10, and more predictable p50–p99 latency under load—especially in document-heavy verticals like legal, healthcare, and compliance.

Why It Matters:

  • Reliability: If OCR is off or misconfigured, your “search” silently ignores critical scanned evidence; turning it on at the ingestion layer closes that gap.
  • Cost + performance: OCR + better retrieval reduces wasted LLM tokens by only sending top-ranked chunks, stabilizing both quality and latency at scale.
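
The token-saving argument comes down to a filter on reranked results. As a minimal sketch, assume each result arrives as a (text, score) pair with a relevance score between 0 and 1. The response shape, the cutoff of 0.5, and the value of k are all assumptions for illustration.

```python
from typing import List, Tuple


def select_context(
    scored_chunks: List[Tuple[str, float]],
    k: int = 3,
    min_score: float = 0.5,
) -> List[str]:
    """Keep at most k chunks whose relevance score clears min_score, best first."""
    kept = [(text, score) for text, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:k]]


# Hypothetical reranked results, including text recovered from a scanned page.
results = [
    ("Termination clause from scanned contract, page 12", 0.91),
    ("Marketing overview", 0.18),
    ("Indemnification clause, page 14", 0.77),
    ("Table of contents", 0.32),
]
context = select_context(results, k=3, min_score=0.5)
```

Here only the two clauses reach the prompt. A fixed score cutoff is only meaningful when scores are calibrated, which is why that property matters for trimming context.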

Quick Recap

ZeroEntropy’s Search API is built so you don’t have to manage a separate stack for PDFs, OCR, and chunking. You ingest PDFs directly, flip on OCR for scanned documents, and the system automatically decides which pages need image-to-text conversion. OCR is metered by page with clear limits per plan, so you can size your allowance to your real-world scanned volume. The result is a single, hybrid retrieval surface—backed by zerank-2—that can actually see your full corpus and feed higher-quality evidence into your RAG and agent workflows.
