Can you help me estimate monthly cost on ZeroEntropy if we do ~20k queries/month plus ingestion and OCR?

Most teams asking about cost at ~20k queries/month are really trying to answer one question: “Can we run serious RAG or AI search on ZeroEntropy without getting surprised by the bill?” The short answer is yes, but the exact number depends on three things: your query volume, how many tokens you ingest (including OCR), and whether you’re using only rerank/embeddings or the full Search API.

Quick Answer: For ~20k queries/month plus moderate ingestion and OCR, most teams land in the low-to-mid hundreds of dollars per month on ZeroEntropy, with pricing driven mainly by total tokens (ingestion + queries) rather than just “number of requests.”


Note: ZeroEntropy’s pricing is simple and token-based, but plan specifics can change. Always confirm against the live pricing page: https://zeroentropy.dev/pricing


Frequently Asked Questions

How much will ~20k queries/month on ZeroEntropy cost in practice?

Short Answer: If you’re running around 20k queries/month with a normal-sized knowledge base and some OCR’d documents, you should expect a bill in the low-to-mid hundreds of dollars per month, depending mainly on how many tokens you ingest and how heavy each query is.

Expanded Explanation:
ZeroEntropy charges based on tokens (ingestion + queries) and plan tier, not just “number of API calls.” A workload of ~20k queries/month is well within typical production ranges we see for early RAG deployments, internal search, and agent backends.

The three main drivers are:

  1. Query tokens – How large your prompts and retrieved chunks are for each search or rerank call.
  2. Ingestion tokens – How much text you push into the system (including PDFs, docs, and any OCR’d content).
  3. OCR pages – If you’re using the Search API’s OCR pipeline, the number of pages you process matters, but it’s usually a secondary cost compared to raw ingestion tokens.

If your stack is optimized around hybrid retrieval + reranking (instead of brute-force sending huge context blocks to the LLM), you’ll usually see lower total LLM and retrieval spend at this volume.

Key Takeaways:

  • 20k queries/month is a realistic production load for the Starter/Team tiers; costs are typically in the low-to-mid hundreds of dollars per month.
  • Total tokens (ingestion + query side) and OCR pages are the real levers, not the raw number of requests.

How do I estimate monthly cost for our specific workload on ZeroEntropy?

Short Answer: Break your estimate into three parts (queries, ingestion, and OCR) and apply ZeroEntropy’s token-based pricing from the live pricing page to each component.

Expanded Explanation:
ZeroEntropy’s model is deliberately straightforward: you pay for tokens and features, not a complex soup of hidden surcharges. To estimate your monthly bill, you only need a rough sense of:

  • How many tokens you’ll ingest (including OCR’d docs)
  • How many tokens you’ll process per query
  • Which product surfaces you’ll use (Search API vs standalone reranker/embeddings)

From there, you can map each bucket to the corresponding pricing line item on https://zeroentropy.dev/pricing and sum it up.

Steps:

  1. Estimate query-side tokens.
    • Example: 20k queries/month × ~2–5k tokens processed per query (retrieval + reranking context) gives you roughly 40M–100M query tokens per month.
  2. Estimate ingestion + OCR.
    • Count or approximate your total document size in tokens (1 page of dense text ≈ 500–800 tokens).
    • Add OCR pages if you’re converting scanned PDFs/manuals; map those to the OCR allowance/pricing in your plan.
  3. Map to the pricing page.
    • Go to https://zeroentropy.dev/pricing, locate the nearest tier that covers your token needs, and check any overage rates. Sum query, ingestion, and OCR to get a conservative estimate. If you’re close to a threshold, assume the higher tier for safety.
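As a rough illustration of the three steps above, the Python sketch below computes the monthly token and page volumes you would look up on the pricing page. The names (Workload, monthly_volumes, estimate_cost) are hypothetical, and so are the page counts and OCR share in the usage lines. No ZeroEntropy rates are built in: you pass in whatever rates the live pricing page shows. The sketch also assumes query and ingestion tokens can be priced separately, which may not match your plan.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    queries_per_month: int
    tokens_per_query: int   # retrieval + reranking context per query
    pages_ingested: int     # pages pushed in this month, OCR'd pages included
    tokens_per_page: int    # the article suggests roughly 500-800 for dense text
    ocr_pages: int = 0      # subset of pages that go through OCR


def monthly_volumes(w: Workload) -> dict:
    """Return the token and page volumes to map onto pricing line items."""
    query_tokens = w.queries_per_month * w.tokens_per_query
    ingestion_tokens = w.pages_ingested * w.tokens_per_page
    return {
        "query_tokens": query_tokens,
        "ingestion_tokens": ingestion_tokens,
        "total_tokens": query_tokens + ingestion_tokens,
        "ocr_pages": w.ocr_pages,
    }


def estimate_cost(volumes: dict,
                  usd_per_million_query_tokens: float,
                  usd_per_million_ingestion_tokens: float,
                  usd_per_ocr_page: float) -> float:
    """Apply rates copied from the live pricing page (none are assumed here)."""
    query_cost = volumes["query_tokens"] / 1_000_000 * usd_per_million_query_tokens
    ingest_cost = volumes["ingestion_tokens"] / 1_000_000 * usd_per_million_ingestion_tokens
    return query_cost + ingest_cost + volumes["ocr_pages"] * usd_per_ocr_page


# Hypothetical workload: 10,000 pages ingested, 2,000 of them scanned.
low = monthly_volumes(Workload(20_000, 2_000, 10_000, 500, ocr_pages=2_000))
high = monthly_volumes(Workload(20_000, 5_000, 10_000, 800, ocr_pages=2_000))
print(low["total_tokens"], high["total_tokens"])
```

Running the low and high cases side by side gives a range rather than a single number, which makes it easier to see whether you sit near a tier threshold.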

Is it cheaper to just use rerankers/embeddings vs the full Search API?

Short Answer: Using only rerankers/embeddings usually means a smaller ZeroEntropy bill but requires you to run your own search stack; the full Search API costs more per token but saves you from hosting a “Frankenstein” of vector DBs, OCR, and pipelines.

Expanded Explanation:
ZeroEntropy exposes three main surfaces:

  1. Reranker API (zerank-2) – You send candidate documents or passages; the model returns calibrated relevance scores.
  2. Embeddings API (zembed-1) – You generate dense vectors and pair them with your own vector / hybrid search infra.
  3. Search API – ZeroEntropy hosts the full hybrid retrieval stack (dense + sparse + rerank), plus ingestion, OCR, and latency tuning.

Running only rerankers/embeddings usually means lower ZeroEntropy spend but higher operational cost on your side—vector DB hosting, ingestion pipelines, OCR infra, and manual hybrid tuning. The Search API shifts that complexity (and some infra cost) into ZeroEntropy’s bill so you can ship production search in a few lines of code.

Comparison Snapshot:

  • Option A: Reranker + Embeddings only
    • You pay just for tokens used by zerank-2 and zembed-1.
    • You manage your own storage, indexing, OCR, and hybrid retrieval.
  • Option B: Full Search API (hybrid + rerank + OCR)
    • Higher ZeroEntropy line item, but you eliminate vector DB hosting, custom scoring logic, and OCR infrastructure.
    • You get dense + sparse + calibrated reranking as a single API with predictable p50/p99 latency.
  • Best for:
    • A: Teams with an existing, mature infra stack that just needs better relevance.
    • B: Teams that want to avoid maintaining search infra and care about getting “lawyer-level,” “doctor-level,” or compliance-grade retrieval online fast.

What do I need to implement a cost-efficient 20k-queries/month setup?

Short Answer: You need a right-sized plan based on your token usage, plus a retrieval design that uses ZeroEntropy’s hybrid retrieval and reranking to keep context small and LLM cost low.

Expanded Explanation:
At 20k queries/month, you’re past “toy project” scale but still early enough that design choices have a big impact on cost. If you simply dump huge unfiltered contexts into the LLM, your LLM bill will dominate. The better pattern is:

  • Use ZeroEntropy’s hybrid retrieval (dense + sparse) to assemble a precise candidate set.
  • Rerank those candidates with zerank-2 for calibrated relevance.
  • Send only the top-k (e.g., top 5–10 passages) to the LLM.

This combination improves NDCG@10 (more of the right content appears in the top 10) and typically cuts LLM tokens per query, which matters a lot when you start multiplying by 20k.
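To make the last step of that pattern concrete, here is a minimal Python sketch of trimming reranked candidates down to a compact LLM context. It does not call any ZeroEntropy API: it assumes you already have (text, score) pairs from a reranker such as zerank-2. The function names, the default k, the token budget, and the word-count token estimate are all illustrative assumptions.

```python
from typing import List, Tuple

Passage = Tuple[str, float]  # (text, relevance score from a reranker)


def rough_token_count(text: str) -> int:
    # Crude stand-in (whitespace-separated words); swap in a real tokenizer.
    return len(text.split())


def select_context(scored: List[Passage], k: int = 5, max_tokens: int = 2_000) -> List[str]:
    """Keep up to k highest-scoring passages, stopping once the next
    passage would push the context past max_tokens."""
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)
    chosen: List[str] = []
    used = 0
    for text, _score in ranked[:k]:
        cost = rough_token_count(text)
        if used + cost > max_tokens:
            break
        chosen.append(text)
        used += cost
    return chosen


candidates = [("alpha beta", 0.2), ("gamma delta epsilon", 0.9), ("zeta", 0.5)]
context = select_context(candidates, k=2)
```

Capping both the passage count and the token budget keeps per-query LLM cost bounded even when individual chunks vary a lot in length.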

What You Need:

  • A plan sized to your tokens.
    • Use the free Starter trial (2 weeks, 1,000 queries, 1M ingestion tokens) to observe your real token usage and then choose a paid tier.
  • A minimal, efficient retrieval pipeline.
    • Decide whether you’ll use the full Search API or plug zerank-2/zembed-1 into your own stack.
    • Tune k (candidate size) and chunk sizes so that top-k precision is high, but LLM context stays compact.

How should this cost fit into our broader RAG / agentic AI strategy?

Short Answer: Treat ZeroEntropy as the reliability and cost-control layer for your RAG stack; the incremental cost at 20k queries/month is small compared to the LLM, but it materially improves answer quality and reduces hallucination and wasted tokens.

Expanded Explanation:
Most teams overspend on LLMs and underspend on retrieval. You see this in naive RAG systems that blast long contexts into a model and hope for the best. The result: high token bills, unstable answers, and debugging sessions where the “right” document is sitting at rank 67 and never gets read.

ZeroEntropy is designed to invert that: better retrieval and reranking upfront so you can safely send fewer, higher-quality chunks to the LLM. That’s why we obsess over NDCG@10 and p99 latency—if your top-k is strong and predictable, you don’t need to overpay on generation.

For a workload like 20k queries/month, you’re usually talking about:

  • A modest line item for ZeroEntropy (driven by token usage and OCR),
  • A much larger line item for your LLM, and
  • A measurable reduction in LLM spend once you start reranking and trimming context intelligently.
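The effect of trimming context on the LLM side is simple arithmetic, sketched below in Python. The passage counts and the 300-token passage size are hypothetical values chosen only to show the calculation; measure your own before drawing conclusions.

```python
def monthly_context_tokens(queries: int, passages_per_query: int, tokens_per_passage: int) -> int:
    """LLM input tokens spent on retrieved context in one month."""
    return queries * passages_per_query * tokens_per_passage


# Hypothetical comparison: 50 unfiltered passages vs. 8 reranked passages per query.
naive = monthly_context_tokens(20_000, 50, 300)
trimmed = monthly_context_tokens(20_000, 8, 300)
reduction = 1 - trimmed / naive
print(f"{naive:,} -> {trimmed:,} context tokens ({reduction:.0%} fewer)")
```

Multiply each figure by your LLM provider's input-token rate to turn the difference into dollars.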

Why It Matters:

  • Impact on quality: Higher top-k precision and calibrated scores reduce hallucinations and “partial answers,” especially in legal, medical, and compliance workloads.
  • Impact on cost: Better retrieval means fewer tokens per LLM call and fewer retries—so your total stack cost at 20k queries/month is lower than with naive RAG, even after adding ZeroEntropy.

Quick Recap

For ~20k queries/month plus ingestion and OCR, your ZeroEntropy bill will be driven mainly by token usage: query-side tokens, ingestion tokens (including OCR’d content), and any Search API features you use. Using hybrid retrieval and reranking to control context size usually keeps total cost in the low-to-mid hundreds of dollars per month and reduces your LLM spend versus naive RAG. The fastest way to move from “rough guess” to “real number” is to run your traffic through the free Starter trial, measure actual tokens and latency, and then size your plan accordingly.
