ZeroEntropy Search API: how do I ingest PDFs and turn on OCR for scanned documents, and what are the OCR page limits?

Most teams don’t lose retrieval quality on model choice; they lose it at ingestion. If your PDFs aren’t parsed correctly, or scanned docs never go through OCR, your “human-level search” runs on broken text and blank pages. The ZeroEntropy Search API is designed to remove that failure mode: you send PDFs (native or scanned), we run parsing plus optional OCR, chunk intelligently, and store everything behind a unified dense + sparse + rerank stack.

Quick Answer: You ingest PDFs into the ZeroEntropy Search API by calling the ingestion endpoint with your file payload; OCR for scanned documents is enabled via a simple flag/config, and your account’s OCR page limits are governed by your Search API plan’s monthly “OCR pages” allocation (with usage visible in your dashboard and via API).


Frequently Asked Questions

How do I ingest PDFs into the ZeroEntropy Search API?

Short Answer: You ingest PDFs by sending them to the Search API’s ingestion endpoint (or SDK helper) with your API key; ZeroEntropy handles parsing, chunking, and indexing for hybrid retrieval and reranking.

Expanded Explanation:
Instead of wiring up your own PDF parser, tokenizer, vector DB, and reranker, you point your PDFs at ZeroEntropy’s Search API. Under the hood, we:

  • Parse the PDF into structured text (sections, headings, paragraphs when available).
  • Chunk it for retrieval in a way that works well with RAG and agents (not arbitrary 512-token cuts).
  • Run zembed-1 embeddings and sparse indexing over those chunks.
  • Store everything in a unified index that feeds zerank-2 rerankers at query time.

You don’t have to tune BM25 weights, vector thresholds, or rerank configs. You just ingest content once and query it with a single API call. This is the opposite of the infra Frankenstein (parsers + vector DB + LLM + custom reranker) that many teams end up maintaining.

Key Takeaways:

  • Use the Search API ingestion endpoint (or SDK) to upload PDFs directly.
  • ZeroEntropy automatically parses, chunks, embeds, and indexes for hybrid retrieval + reranking.
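The ingestion call above can be sketched in Python. This is a minimal, hypothetical illustration: the endpoint URL, field names (`collection_name`, `path`, `content`), and payload shape are assumptions for the sake of the example, so check the ZeroEntropy API reference for the exact contract before using it.

```python
import base64

# Assumed endpoint for illustration only; the real path may differ.
INGEST_URL = "https://api.zeroentropy.dev/v1/documents/add-document"

def build_ingest_payload(pdf_bytes: bytes, collection: str, path: str) -> dict:
    """Package a PDF for ingestion: the file travels base64-encoded in JSON.

    Field names here are assumptions for illustration, not the confirmed
    API schema.
    """
    return {
        "collection_name": collection,
        "path": path,  # unique identifier for this document in the collection
        "content": {
            "type": "auto",  # let the service detect the file type
            "base64_data": base64.b64encode(pdf_bytes).decode("ascii"),
        },
    }

# Sending it is then a single authenticated POST, e.g. with `requests`:
#   requests.post(INGEST_URL, json=payload,
#                 headers={"Authorization": f"Bearer {API_KEY}"})
```

The point of the sketch: there is one request per document, and everything downstream (parsing, chunking, embedding, indexing) happens server-side.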

How do I turn on OCR for scanned PDF documents?

Short Answer: You enable OCR by setting the OCR flag in your ingestion request or configuration so that scanned or image-only PDFs are run through ZeroEntropy’s OCR pipeline before chunking and indexing.

Expanded Explanation:
Native PDFs (with real text layers) parse fine without OCR, but legal scans, faxed contracts, and image-heavy reports will look “empty” to any search engine unless you run OCR. In ZeroEntropy, OCR is not a separate pipeline you have to glue in; it’s a first-class part of the Search API.

When you ingest a PDF and enable OCR, we:

  • Detect whether the document has an extractable text layer.
  • Run OCR only where needed (e.g., scanned pages) to avoid wasting tokens and OCR pages.
  • Feed the recognized text into the same chunking and indexing flow as any other document.

This gives you human-level retrieval over legacy scans and “paper archives” without building your own OCR stack.

Steps:

  1. Get your API key and Search API project set up in the ZeroEntropy dashboard.
  2. Call the ingestion endpoint (or use the SDK) with your PDF file and the ocr (or equivalent) flag enabled.
  3. Monitor OCR page usage in your dashboard and query the Search API as usual—scanned PDFs will now return relevant hits.
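The steps above can be sketched client-side. Two caveats: the `"ocr"` field name is an assumption for illustration (check the API reference for the exact flag your plan exposes), and the text-layer heuristic is a crude local check — ZeroEntropy does its own detection server-side, so this is only useful for deciding whether to request OCR at all.

```python
def looks_scanned(pdf_bytes: bytes) -> bool:
    """Crude heuristic: a PDF that embeds no fonts is usually image-only
    (scanned). Native PDFs with a real text layer reference fonts."""
    return b"/Font" not in pdf_bytes

def with_ocr(payload: dict, enable: bool) -> dict:
    """Return a copy of an ingestion payload with the OCR preference set.

    The "ocr" field name is assumed for illustration; consult the
    ZeroEntropy docs for the actual flag.
    """
    out = dict(payload)
    out["ocr"] = enable
    return out
```

A typical call site would then be `with_ocr(payload, looks_scanned(pdf_bytes))`, so native PDFs skip the flag and scanned ones opt in.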

How are OCR page limits defined, and how do they differ from ingestion tokens?

Short Answer: OCR page limits are counted per processed page (each page of a scanned PDF that goes through OCR), while ingestion tokens measure the text volume you store; they are separate quotas in your Search API plan.

Expanded Explanation:
ZeroEntropy Search API plans expose two distinct resource dimensions for ingestion:

  • Ingestion tokens: the total number of tokens of text you store and index (after parsing/OCR).
  • OCR pages: how many PDF pages we run OCR on each month.

This separation lets you control cost without guessing which documents are scanned. OCR is heavier than plain text parsing, so we meter it at the page level. Parsing a native PDF page with a real text layer consumes ingestion tokens but does not consume an OCR page; running OCR on a scanned page consumes both OCR pages (for the image processing) and ingestion tokens (for the extracted text we index).
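The two-quota rule lends itself to back-of-envelope planning. The sketch below assumes a rough 500 tokens per page (dense legal pages run higher, slide decks lower); only the accounting rule itself comes from the text above.

```python
def estimate_quota_use(native_pages: int, scanned_pages: int,
                       tokens_per_page: int = 500) -> dict:
    """Back-of-envelope quota math under the rules described above:
    every indexed page consumes ingestion tokens, but only scanned
    pages consume OCR pages. tokens_per_page is a rough assumption."""
    return {
        "ocr_pages": scanned_pages,
        "ingestion_tokens": (native_pages + scanned_pages) * tokens_per_page,
    }

# e.g. a 1,000-page archive where 300 pages are scans:
# estimate_quota_use(700, 300)
# -> {"ocr_pages": 300, "ingestion_tokens": 500000}
```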

Comparison Snapshot:

  • Option A: Ingestion tokens – Track stored text volume across PDFs, HTML, JSON, etc.
  • Option B: OCR pages – Track how many scanned/visual pages you convert to text each month.
  • Best for: Teams who want predictable spend on both storage (tokens) and heavy processing (OCR), without surprise overruns.

What does ZeroEntropy actually do with my PDF after ingestion?

Short Answer: We parse, optionally OCR, chunk, embed (dense + sparse), and index your PDF so that queries run through hybrid retrieval and zerank-2 reranking with calibrated scores.

Expanded Explanation:
The ingestion pipeline is built to maximize NDCG@10 and reduce downstream LLM tokens, not just to “store documents.” After you send a PDF:

  • We parse structure and content, then normalize it for indexing.
  • If OCR is turned on and required, we convert scanned pages to text.
  • We chunk content using heuristics tuned for RAG and agent workflows (keeping paragraphs/sections coherent so answers don’t fall “between chunks”).
  • zembed-1 embeddings are generated, sparse features are computed, and we write both to the unified index.
  • At query time, the Search API pulls a candidate set via hybrid retrieval and passes it through zerank-2, which uses calibrated zELO-based scores.

Because the rerankers and embedding models are open-weight (published on Hugging Face), you’re not locked into a black-box stack and can mirror the same behavior on ze-onprem if needed.
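To make the query-time step concrete, here is a generic reciprocal-rank-fusion sketch of how dense and sparse candidate lists can be merged before reranking. This illustrates the hybrid-retrieval technique in general; ZeroEntropy’s actual fusion logic and zerank-2 scoring are internal to the service and may differ.

```python
from collections import defaultdict

def rrf_fuse(dense_ranked: list[str], sparse_ranked: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge two ranked candidate lists into one.

    Each list contributes 1 / (k + rank + 1) per document, so documents
    ranked highly by BOTH retrievers float to the top of the fused list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

In a full pipeline, the fused top-k candidates would then go to the reranker, which produces the final calibrated ordering.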

What You Need:

  • A ZeroEntropy Search API project with ingestion tokens and OCR pages in your plan.
  • An integration (SDK or HTTP) that sends PDFs and OCR preferences to the ingestion endpoint.

How should I plan OCR usage strategically across my corpus?

Short Answer: Use OCR selectively for high-value scanned content (contracts, clinical records, audit PDFs), monitor OCR pages in your dashboard, and rely on native parsing for everything else to maximize retrieval quality per dollar.

Expanded Explanation:
Not every PDF deserves OCR. For many corpora—product docs, software manuals, Confluence exports—native text parsing is enough. OCR is where you unlock value in “dark data”: legacy legal binders, scanned clinical reports, signed agreements, and image-based compliance documents.

From a retrieval and cost perspective:

  • Focus OCR on the domains where missing a clause or a clinical detail is unacceptable.
  • Let ZeroEntropy handle mixed docs: we only OCR pages that need it.
  • Use the dashboard (and API) to watch your OCR page consumption alongside ingestion tokens and query volume. That combination tells you how much RAG surface you’ve realistically lit up.
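A simple guardrail over that consumption data might look like the following. The usage numbers would come from your dashboard or a usage endpoint (endpoint name not shown here since I can’t confirm it; check the API reference); only the thresholding logic is sketched.

```python
def ocr_budget_alert(pages_used: int, monthly_limit: int,
                     warn_at: float = 0.8) -> str:
    """Classify OCR page consumption against the monthly allocation.

    warn_at is an arbitrary 80% threshold chosen for illustration.
    """
    if monthly_limit <= 0:
        raise ValueError("monthly_limit must be positive")
    ratio = pages_used / monthly_limit
    if ratio >= 1.0:
        return "exhausted: queue remaining scans for next cycle or upgrade"
    if ratio >= warn_at:
        return "warning: prioritize high-value scanned documents"
    return "ok"
```

Wiring a check like this into your ingestion job keeps a bulk backfill of scanned archives from silently eating the month’s OCR allocation.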

As you scale, pairing OCR with calibrated reranking (zerank-2) means your LLM only sees the most relevant slices of those scanned docs, which reduces token spend and improves answer completeness.

Why It Matters:

  • Targeted OCR turns previously unusable scans into high-signal retrieval assets for your RAG and agent workflows.
  • Managing OCR pages and ingestion tokens together gives you predictable spend while keeping NDCG@10 high and downstream LLM tokens low.

Quick Recap

To ingest PDFs with the ZeroEntropy Search API, you send them to the ingestion endpoint (or through the SDK) and let the stack handle parsing, chunking, embeddings, and indexing. OCR for scanned documents is controlled with a simple flag/config and metered as OCR pages, separate from your ingestion tokens. By treating OCR pages as a deliberate resource—focused on high-value scanned content—you unlock human-level hybrid retrieval over your entire document history without building and maintaining your own OCR + vector DB + reranker Frankenstein.

Next Step

Get Started