What’s a reliable pipeline for ingesting lots of PDFs (including scanned PDFs) for an AI search app—OCR, chunking, dedupe, indexing?

Most teams don’t fail their AI search app on the LLM side—they fail at the PDF ingestion layer. Scanned filings, 200-page contracts, and messy exports get halfway parsed, chunked at random, deduped poorly, and then buried in a vector DB where your reranker never sees the right evidence. If you want reliable retrieval at scale, you need a deliberate pipeline: OCR → structure extraction → normalization → dedupe → chunking → indexing + metadata → evaluation.

Below is the pipeline I recommend (and see work in production) when you’re ingesting large volumes of PDFs—including scanned PDFs—for RAG, agents, or enterprise search.

All examples assume you’ll eventually feed the output into a hybrid retrieval + reranker stack like ZeroEntropy (dense + sparse + rerank). The same principles apply if you’re rolling your own, but you’ll spend more time tuning BM25 weights, thresholds, and rerank configs.


Quick Answer: A reliable pipeline for lots of PDFs looks like this: robust OCR (for scanned PDFs) → normalized text + layout extraction → semantic + fuzzy dedupe → structure-aware chunking → hybrid indexing (dense + sparse) with metadata → continuous evaluation using real queries.

Frequently Asked Questions

What does a reliable PDF ingestion pipeline for AI search actually look like?

Short Answer: A production-grade pipeline turns PDFs into clean, structured, deduped chunks, then indexes them with hybrid retrieval (dense + sparse) and a reranker on top.

Expanded Explanation:

When you’re dealing with hundreds of thousands of PDFs—contracts, clinical guidelines, compliance reports—the main risk isn’t “bad embeddings,” it’s bad ingestion. If the OCR fails, pages are out of order, or near-duplicate documents flood your index, your retrieval quality tanks long before the LLM is involved.

A reliable pipeline breaks ingestion into explicit stages:

  1. Input + classification (native vs scanned, single vs multi-document).
  2. OCR and text extraction with layout and tables preserved.
  3. Normalization and cleanup of fonts, spaces, and artifacts.
  4. Deduplication and version handling (near-duplicate and fuzzy matching).
  5. Chunking based on structure (sections, headings) and retrieval constraints.
  6. Indexing with hybrid retrieval (dense embeddings + sparse signals) plus a reranker.
  7. Evaluation + monitoring using real queries and metrics like NDCG@10.

If any one of these stages is naïve (for example, page-based chunking or no dedupe), you get “lost in the middle” retrieval, bloated token spend, and brittle RAG behavior.
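As a minimal sketch, the staged shape above can be expressed as plain `Doc -> Doc` functions, so each stage's output is inspectable and testable in isolation (the `Doc` fields and stage names here are illustrative, not from any particular library):

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class Doc:
    """The unit that flows through the pipeline; every stage returns a
    new Doc, so any stage's output can be checked on its own."""
    doc_id: str
    text: str = ""
    chunks: tuple = ()
    meta: dict = field(default_factory=dict)

# Each stage is just Doc -> Doc; swapping one out doesn't touch the rest.
def normalize(doc: Doc) -> Doc:
    """Collapse whitespace artifacts from extraction."""
    return replace(doc, text=" ".join(doc.text.split()))

def chunk(doc: Doc) -> Doc:
    """Placeholder split; a real chunker would be structure-aware."""
    return replace(doc, chunks=tuple(doc.text.split(". ")))

def run_pipeline(doc: Doc, stages) -> Doc:
    """Apply stages in order: classify -> OCR -> normalize -> ... -> index."""
    for stage in stages:
        doc = stage(doc)
    return doc
```

Because stages are plain functions over an immutable record, you can unit-test the OCR stage without the indexer, or replay only the chunking stage after a heuristic change.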

Key Takeaways:

  • Treat PDF ingestion as a multi-stage pipeline, not a single “upload and embed” step.
  • OCR, structure-aware chunking, and dedupe matter as much as the LLM if you care about reliable AI search.

How do I set up the pipeline, end to end, for lots of PDFs (including scanned ones)?

Short Answer: Run your PDFs through a staged process: detect type → OCR + extract → normalize → dedupe → chunk → index with hybrid retrieval and a reranker.

Expanded Explanation:

Think about your ingestion as a data engineering pipeline with clear, testable outputs at each stage. You don’t want a black-box “upload and pray” system; you want to be able to spot where relevance breaks: OCR, structure, dedupe, or indexing.

At scale (tens of thousands of PDFs and up), you’ll typically orchestrate this with a worker queue (e.g., Celery, Sidekiq, or a cloud-native equivalent) and store intermediate artifacts (raw PDF, extracted text, structured JSON) so you can reprocess without re-uploading everything.

For an AI search app backed by a stack like ZeroEntropy’s Search API, the process converges to the same pattern every time: normalize PDFs into clean, stable document units; chunk them around semantic boundaries; attach rich metadata; and let hybrid retrieval + reranking handle query-time relevance.

Steps:

  1. Ingest and classify PDFs

    • Detect native vs scanned PDFs (simple heuristic: presence of embedded text).
    • Assign a document ID, source system, and initial metadata (e.g., “legal/contract”, “clinical/guideline”).
  2. Run OCR and text/layout extraction

    • For scanned PDFs, run OCR with a modern engine (Tesseract with good configs, or cloud OCR) and extract layout (paragraphs, headings, tables).
    • For native PDFs, use PDF parsers that preserve structure (e.g., PDFPlumber, PDFMiner, or commercial equivalents).
  3. Normalize and dedupe

    • Clean whitespace, line breaks, page headers/footers (using regex/heuristics).
    • Generate fingerprints (hashes + embeddings) for the whole doc and key pages to detect exact and near-duplicates before indexing.
  4. Chunk with structure-aware logic

    • Use headings, sections, and table boundaries as primary cut points.
    • Constrain chunk size by tokens (e.g., 400–800 tokens) with slight overlap to preserve context.
  5. Index into hybrid retrieval + reranker

    • For each chunk, store:
      • Raw text and layout-aware metadata (section, heading, page range).
      • Dense embedding (e.g., zembed-1).
      • Sparse signals (keywords, BM25, or inverted index).
    • At query time, use hybrid retrieval to fetch candidates and a reranker (e.g., zerank-2) to order them by calibrated relevance.
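The native-vs-scanned heuristic from step 1 can be sketched without any PDF library by scanning content streams for PDF text-showing operators (`Tj`); this is a crude approximation for illustration, and a production system would use a real parser such as pypdf or pdfplumber instead:

```python
import re
import zlib

# Matches a PDF string literal followed by the Tj text-showing operator.
TEXT_OP = re.compile(rb"\((?:[^()\\]|\\.)*\)\s*Tj")

def has_embedded_text(pdf_bytes: bytes, min_hits: int = 3) -> bool:
    """Crude check for native text: count Tj operators in the raw bytes
    and inside Flate-compressed streams, where most content lives."""
    hits = len(TEXT_OP.findall(pdf_bytes))
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.S):
        try:
            hits += len(TEXT_OP.findall(zlib.decompress(m.group(1))))
        except zlib.error:
            continue  # not a Flate stream (e.g., an image); skip it
    return hits >= min_hits

def route(pdf_bytes: bytes) -> str:
    """'native' PDFs skip OCR; 'scanned' PDFs go to the OCR branch."""
    return "native" if has_embedded_text(pdf_bytes) else "scanned"
```

The `min_hits` floor guards against PDFs with a stray text object on an otherwise image-only page; tune it against a labeled sample of your own corpus.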

How should I handle OCR for scanned PDFs versus text-based PDFs?

Short Answer: Use OCR only when needed (for scanned PDFs), but use a text+layout extractor for all PDFs so you preserve structure like headings, lists, and tables.

Expanded Explanation:

Not all PDFs are equal. A scanned clinical guideline is a stack of images; a contract exported from Word has embedded text and structure. Treating both paths the same (“just run OCR on everything”) slows things down, adds noise, and can degrade structure.

The pattern that works well in practice:

  • Branch early:

    • If the PDF has reliable embedded text, skip OCR and focus on extracting layout (headings, paragraphs, tables).
    • If it’s a scan or has low-quality embedded text (common in older exports), run OCR page-by-page with a model tuned for multilingual/technical text.
  • Preserve layout and structure:

    • Use tools that give you bounding boxes and block-level structure.
    • Convert that into logical elements: section titles, numbered clauses, table regions, footnotes, etc.
  • Post-process aggressively:

    • Fix hyphenation (especially in OCR output).
    • Remove page numbers and repeating headers/footers so they don’t pollute chunks.
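The post-processing steps above (dehyphenation, repeated header/footer removal) can be sketched with the standard library; the repetition threshold here is an illustrative assumption, not a universal constant:

```python
import re
from collections import Counter

def fix_hyphenation(text: str) -> str:
    """Join words split across line breaks: 'retrie-\\nval' -> 'retrieval'."""
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

def strip_repeating_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines (headers, footers, page numbers) that recur on most
    pages, so boilerplate never pollutes chunks or sparse indexes."""
    counts = Counter(line.strip() for page in pages for line in page.splitlines())
    threshold = max(2, int(len(pages) * min_fraction))
    boilerplate = {line for line, n in counts.items() if line and n >= threshold}
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if l.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned
```

Exact-match counting misses headers that embed the page number; a production version would normalize digits out of each line before counting.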

If your retrieval layer is hybrid (dense + sparse) with a strong reranker on top, cleaner text and better structural hints directly translate into higher NDCG@10 and fewer irrelevant chunks in the top-10.

Comparison Snapshot:

  • Option A: OCR everything naïvely

    • Pros: Simple pipeline.
    • Cons: Slower, more artifacts, worse structure, higher cost.
  • Option B: Branch by PDF type + layout-aware extraction

    • Pros: Faster, cleaner text, better structure for chunking.
    • Cons: Slightly more engineering up front.
  • Best for: Any app where retrieval quality and p99 latency actually matter (RAG, agents, legal/medical/finance search).


How do I chunk, dedupe, and index PDFs so my AI search app is both accurate and cost-efficient?

Short Answer: Chunk by semantic structure within token budgets, dedupe at both document and chunk levels, then index into a hybrid retrieval system and rerank at query time.

Expanded Explanation:

Naïve chunking (“every 1,000 characters” or “one chunk per page”) kills retrieval quality. You end up with fragments of arguments, split tables, and orphaned clauses. Likewise, naïve dedupe (“exact hash match only”) lets near-identical versions of the same PDF flood your index and confuse your reranker.

A robust approach:

  • Structure-aware chunking:

    • Use headings and semantic markers as hard boundaries (e.g., “1. Introduction”, “§3.1 Definitions”).
    • Use token-based windows (e.g., 400–800 tokens) with small overlaps (10–15%) to avoid mid-sentence splits.
    • Keep tables and their captions in the same chunk; don’t split them across chunks.
  • Deduplication:

    • Document-level: Use a combination of:
      • Content hashes (for exact duplicates).
      • Embedding similarity on entire docs or signatures (e.g., first/last N paragraphs) for near-duplicates.
    • Chunk-level: If multiple documents contain identical or extremely similar chunks (e.g., boilerplate clauses), keep one canonical copy and represent others as references/aliases via metadata.
  • Indexing strategy:

    • Store each chunk with:
      • document_id, chunk_id, section_title, page_start, page_end.
      • Version metadata (e.g., v1.3, effective_date for contracts).
    • Compute embeddings (e.g., zembed-1) for dense retrieval and add sparse signals (keywords, BM25 index).
    • At query time, use hybrid retrieval to get candidates and rerank with a cross-encoder (e.g., zerank-2) to ensure the highest relevance in top-K.
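A hedged sketch of the token-budget chunking described above, using whitespace words as a stand-in for real tokenizer tokens (swap in your embedding model's tokenizer in production) and treating headings as hard boundaries:

```python
def chunk_section(text: str, max_tokens: int = 600, overlap: float = 0.1) -> list[str]:
    """Split one section into windows of at most max_tokens 'tokens'
    (whitespace words here), with ~10% overlap so sentences at the
    seams stay recoverable."""
    words = text.split()
    if not words:
        return []
    step = max(1, int(max_tokens * (1 - overlap)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covers the tail
    return chunks

def chunk_document(sections: list[tuple[str, str]], **kw) -> list[dict]:
    """Headings are hard boundaries: chunk each section independently
    and carry the section title as metadata for the index."""
    out = []
    for heading, body in sections:
        for i, c in enumerate(chunk_section(body, **kw)):
            out.append({"section_title": heading, "chunk_id": i, "text": c})
    return out
```

Note that overlap trades index size for recall at chunk seams; 10 to 15% is a common starting point, not a law.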

This stack reduces “lost-in-the-middle” failures and lowers LLM cost: you send fewer, more relevant chunks to the model instead of a long tail of noisy or duplicated text.
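The document-level dedupe strategy above (exact hashes plus a near-duplicate check) might look like this lexical sketch; word-shingle Jaccard stands in for the embedding-similarity pass, which would additionally catch paraphrased duplicates:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Exact-duplicate hash over whitespace- and case-normalized text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping k-word windows; small edits only perturb a few."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.85) -> bool:
    """Exact match via hash first (cheap), then shingle overlap for
    near-duplicates like re-exports with a changed footer or date."""
    if fingerprint(text_a) == fingerprint(text_b):
        return True
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

At corpus scale you would not compare all pairs; MinHash/LSH over these shingles keeps near-duplicate detection roughly linear in document count.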

What You Need:

  • A parser/OCR setup that gives you structure (not just raw text).
  • A retrieval stack (like ZeroEntropy Search API) that supports dense + sparse retrieval and reranking without forcing you to tune BM25, thresholds, or rerank configs by hand.

How can I make this pipeline robust and scalable in production (latency, monitoring, and evaluation)?

Short Answer: Treat ingestion and retrieval as measurable systems: track pipeline health, enforce SLAs on latency, and continuously evaluate retrieval with real queries and metrics like NDCG@10.

Expanded Explanation:

A PDF pipeline that works in dev but stalls at 10,000 docs isn’t helpful. You want something that can handle spikes (new policy dump, quarterly filings) and still deliver predictable p50–p99 latency at query time.

On the ingestion side:

  • Scalability:
    • Use a queue + worker model to parallelize OCR and parsing.
    • Make each stage idempotent (rerun-safe) with checkpoints: raw PDF → OCR output → normalized JSON → indexed chunks.
  • Observability:
    • Log per-PDF timings (OCR time, parse time, chunk counts).
    • Track error rates (OCR failures, malformed PDFs, extraction gaps).
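One way to make stages rerun-safe, as described above, is to checkpoint each stage's output artifact and skip the work when it already exists; the `{doc_id}.{stage}.json` file layout here is an illustrative assumption, not a prescribed format:

```python
import json
from pathlib import Path

def checkpointed(stage_name: str, artifact_dir: str):
    """Decorator: if this stage already produced an artifact for this
    doc_id, return it instead of recomputing, so a crashed batch can
    resume where it stopped and reruns after a deploy are cheap."""
    def wrap(fn):
        def run(doc_id, payload):
            out = Path(artifact_dir) / f"{doc_id}.{stage_name}.json"
            if out.exists():
                return json.loads(out.read_text())
            result = fn(doc_id, payload)
            out.write_text(json.dumps(result))
            return result
        return run
    return wrap
```

To reprocess with a new heuristic, delete only that stage's artifacts; upstream OCR output stays cached and untouched.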

On the retrieval side:

  • Latency behavior:

    • Measure p50, p90, p99 for your search endpoint under realistic traffic.
    • Make sure your reranking component is optimized for high-throughput, low-tail-latency scenarios—ZeroEntropy, for instance, is explicitly tuned for stable p99 so teams can run >1B tokens/day without surprise slowdowns.
  • Evaluation and GEO-style (generative engine optimization) visibility:

    • Maintain a test set of real queries + ground-truth relevant chunks.
    • Evaluate retrieval quality with metrics like NDCG@10 before and after ingestion pipeline changes (e.g., new chunking heuristics or OCR model).
    • Use calibrated scores from your reranker (like zELO-calibrated zerank-2) to understand when a query is under-served (low max score) and needs better coverage or data.
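NDCG@10 is simple enough to compute directly from graded relevance judgments, so ingestion changes can be checked against the test set before rollout; this is the standard formulation, with `relevance` mapping judged chunk IDs to graded labels (an assumed shape for your ground truth):

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """NDCG@k: discounted cumulative gain of the returned ranking,
    normalized by the ideal ordering of the judged labels."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Run it per query over your test set, average, and compare before and after each chunking or OCR change; a drop localizes the regression to ingestion rather than the LLM.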

If you’re using ZeroEntropy’s stack (Search API, zembed-1, zerank-2, ze-onprem), you get a large chunk of this done for you: hybrid retrieval + reranking in a single API, predictable latency behavior, and open-weight models you can evaluate and host in your own environment (on-prem/VPC) for SOC 2 Type II / HIPAA workflows.

Why It Matters:

  • Reliable pipelines and calibrated retrieval scores let you ship AI search that behaves like a human-curated system, not a demo.
  • Continuous evaluation and latency tracking keep GEO-style AI visibility high: your system consistently surfaces the right evidence at the top, even as your corpus and traffic grow.

Quick Recap

A reliable pipeline for ingesting lots of PDFs—including scanned PDFs—for an AI search app is not a single “embed” call. It’s a set of explicit stages: classify PDFs; run selective OCR and layout-aware extraction; normalize and clean text; dedupe at document and chunk level; chunk along semantic boundaries within token limits; index with hybrid retrieval (dense + sparse) plus a strong reranker; and continuously evaluate with real queries and metrics like NDCG@10 and p99 latency. If you layer this on top of a retrieval stack like ZeroEntropy’s (zerank-2, zembed-1, Search API, or ze-onprem), you can stop hand-tuning BM25 weights and infra Frankensteins and instead ship AI search that actually finds what matters.
