
How do I use ZeroEntropy zembed-1 for asymmetric retrieval (different query vs document embeddings) in a RAG pipeline?


Most retrieval systems assume your queries and documents should live in the same embedding space, but real RAG pipelines aren’t that symmetric. A short, goal-oriented user query (“What’s the SLA for EU tenants?”) doesn’t look anything like a long policy document, and treating them identically is one of the fastest ways to tank NDCG@10 and waste LLM tokens.

Quick Answer: ZeroEntropy’s zembed-1 is a single bi-encoder model, so you don’t need separate “query” and “document” variants. You get asymmetric retrieval by how you preprocess queries vs. documents, how you search (dense/sparse/hybrid), and how you rerank with zerank-2—while still using the same zembed-1 embeddings beneath.

Below, I’ll walk through how asymmetric retrieval works with zembed-1 in a RAG stack, when you actually need it, and how to wire it up end-to-end with reranking.


Frequently Asked Questions

Can I use zembed-1 for asymmetric retrieval in my RAG pipeline?

Short Answer: Yes—zembed-1 is designed to handle semantically different query and document texts with a single model; you implement “asymmetry” through your pipeline design, not separate embedding weights.

Expanded Explanation:
zembed-1 is a 1024‑dimensional bi-encoder embedding model. It doesn’t ship a distinct “query encoder” and “document encoder” head the way some IR models do, but it is trained to place short, under-specified queries and long, verbose documents in a shared semantic space. That’s exactly what you want in a RAG pipeline where user questions, chat turns, and agent tool calls all need to retrieve the right evidence from dense manuals, contracts, or clinical notes.

Asymmetric retrieval with zembed-1 means you treat queries and documents differently before and after embedding, while keeping the embedding model itself constant. You might aggressively normalize and chunk documents, keep queries as‑is, add query-side expansion terms, layer BM25 over dense search, then feed the top‑K hits into a cross‑encoder reranker like zerank-2. This design makes your system “asymmetric” in behavior and performance, even though all embeddings come from the same zembed-1 endpoint.
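The document-vs-query split above can be sketched in a few lines. This is a minimal illustration, not ZeroEntropy SDK code: `chunk_document` and `preprocess_query` are hypothetical helpers, and the chunk sizes are placeholders you would tune for your corpus.

```python
# Asymmetric preprocessing sketch: documents get chunked and windowed,
# queries pass through nearly untouched before both hit zembed-1.

def chunk_document(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split a long document into overlapping word-window chunks."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # slide with overlap to preserve context
    return chunks

def preprocess_query(query: str) -> str:
    """Queries stay faithful to the user's intent: trim whitespace, nothing more."""
    return query.strip()

doc = "Section 4.1: EU data residency. " * 50   # stand-in for a long policy doc
chunks = chunk_document(doc)
query = preprocess_query("  What's the SLA for EU tenants?  ")
```

Both `chunks` and `query` would then go through the same zembed-1 embedding call; the asymmetry lives entirely in this preprocessing step.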

Key Takeaways:

  • zembed-1 uses the same model for queries and documents but still supports asymmetric retrieval through pipeline design.
  • The biggest gains come from combining zembed-1 with hybrid retrieval and zerank-2 reranking rather than from separate query/document encoders.

How do I implement asymmetric retrieval with zembed-1 step by step?

Short Answer: Embed all documents once with zembed-1, embed each incoming query with the same model, run dense or hybrid retrieval, then rerank the top‑K with zerank-2 before sending a small, high-quality context window to your LLM.

Expanded Explanation:
In practice, asymmetric retrieval is about shaping how queries and documents interact at each stage: ingestion, retrieval, and reranking. On ingestion, you break long documents into semantically meaningful chunks and embed them once with zembed-1. At query time, you keep the user’s question faithful (plus optional query expansion), embed it with zembed-1, and search over your document index. You then layer sparse (BM25) and dense retrieval to widen recall and rely on a cross-encoder reranker—zerank-2—to resolve nuance and domain‑specific language.

This pattern improves NDCG@10 by 15–30% over pure embedding search in most production workloads we see, without forcing you into separate query/document models or a fragile pile of custom configs. You get “asymmetric behavior” where queries are treated as short intents and documents as rich evidence, while the embeddings remain simple: a single 1024‑dimensional vector per text.

Steps:

  1. Ingest and chunk documents

    • Split content into 300–800 token chunks (contracts, clinical notes, tickets, manuals).
    • Store metadata (source, section headers, timestamps, permissions).
  2. Embed and index with zembed-1

    • Use the Python SDK to embed chunks once and store vectors in your vector DB or ZeroEntropy Search API.
    from zeroentropy import ZeroEntropy
    
    zclient = ZeroEntropy()  # reads ZEROENTROPY_API_KEY
    
    texts = ["Section 4.1: EU data residency and SLA ...", "..."]
    resp = zclient.models.embed(
        model="zembed-1",
        input=texts,
    )
    vectors = resp.embeddings  # list of 1024-dim float vectors
    # upsert vectors + metadata into Pinecone/Milvus/turbopuffer, etc.
    
    • zembed-1 runs at ~115ms p90 and ~$0.05 per 1M tokens, so you can batch aggressively without blowing your latency budget.
  3. Run query-time asymmetric retrieval

    • On each query, keep text short and natural (“What’s the SLA for EU tenants?”).
    • Optionally perform query expansion (synonyms, domain jargon) but avoid heavy rewriting.
    • Embed with zembed-1, retrieve top‑K from your dense index, optionally combine with BM25 results for hybrid retrieval.
  4. Rerank with zerank-2 for precision

    • Feed the top 100–300 candidates into zerank-2 to get calibrated relevance scores (zELO).
    # dense_hits: candidate chunks (dicts with a "text" field) from step 3
    response = zclient.models.rerank(
        model="zerank-2",
        query="What is the SLA for EU tenants?",
        documents=[doc["text"] for doc in dense_hits],
    )
    reranked = response.results  # includes index + calibrated relevance scores
    
    
  5. Send fewer, better chunks to your LLM

    • Take the top 5–10 reranked chunks and build a compact context.
    • This lowers LLM token spend and makes “lost in the middle” failures less likely.
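The hybrid step in the flow above (combining dense zembed-1 hits with BM25 hits) is commonly done with reciprocal rank fusion. Here is a minimal, self-contained sketch; the document IDs and rankings are illustrative, and the constant `k = 60` is the conventional RRF default, not a ZeroEntropy setting.

```python
# Reciprocal rank fusion: merge a dense (zembed-1) ranking with a sparse
# (BM25) ranking into one candidate list to feed into zerank-2.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; earlier ranks contribute larger scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_sla_eu", "doc_pricing", "doc_gdpr"]      # from vector search
sparse_hits = ["doc_sla_eu", "doc_gdpr", "doc_onboarding"]  # from BM25
candidates = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear in both rankings (here `doc_sla_eu`) float to the top, which is exactly the recall-widening behavior you want before the precision-focused rerank step.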

Is symmetric vs. asymmetric retrieval a model choice with zembed-1, or a pipeline choice?

Short Answer: With zembed-1, it’s primarily a pipeline choice; you don’t toggle a “query vs. document” head, but you can make retrieval behavior asymmetric via chunking, hybrid retrieval, and reranking.

Expanded Explanation:
Some IR models (especially in academic benchmarks) ship explicit dual encoders: one for queries, one for documents. zembed-1 instead uses a single encoder that’s trained to make both sides comparable in the same vector space. In practice, that’s actually what you want for a RAG and agent stack: fewer moving parts, simpler integration, and easier monitoring of p50–p99 latency.

The asymmetry comes from how you treat each side of the interaction. Queries are short, time‑varying, and often ambiguous; documents are stable, verbose, and structured. You apply different preprocessing and different downstream logic to each—such as query expansion, document chunking, permission filters, and reranking. zembed-1’s job is to keep the underlying vector geometry stable and cheap so that the rest of your retrieval stack can be opinionated.
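One concrete example of query-side asymmetry is light synonym expansion: the query picks up a few domain terms while documents stay untouched. This is a sketch with an illustrative, hypothetical synonym table, not a ZeroEntropy feature.

```python
# Light query expansion: append known domain synonyms so both dense and
# sparse retrievers see relevant jargon, without rewriting the user's text.

SYNONYMS = {
    "sla": ["service level agreement", "uptime guarantee"],
    "eu": ["european union"],
}

def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> str:
    """Append synonyms for any recognized term; keep the original query intact."""
    extra = []
    for token in query.lower().replace("?", "").split():
        extra.extend(synonyms.get(token, []))
    return query if not extra else f"{query} ({'; '.join(extra)})"

expanded = expand_query("What is the SLA for EU tenants?")
```

The expanded string is what you would embed with zembed-1 at query time; document embeddings are computed once from the raw chunks and never expanded.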

Comparison Snapshot:

  • Option A: Symmetric (pure embeddings, no rerank)
    • Same embedding model, no hybrid, cosine/inner product search only.
    • Simple, but weaker NDCG@10 and more LLM context bloat.
  • Option B: Asymmetric pipeline (zembed-1 + hybrid + zerank-2)
    • Same embedding model, but query/document treated differently via pipeline + cross-encoder.
    • 15–30% NDCG@10 improvement vs. embeddings-only in common RAG workloads.
  • Best for:
    • Option A for prototypes and low-stakes search;
    • Option B for production RAG/agents where retrieval quality directly impacts user‑facing answers and cost.

How do I wire zembed-1 into an existing RAG stack without rebuilding everything?

Short Answer: Treat zembed-1 as a drop‑in replacement for your current embedding model, keep your vector DB and RAG framework, and add zerank-2 as a reranking step on top of your current top‑K candidates.

Expanded Explanation:
Most teams I work with are already running a vector DB (Pinecone, Milvus, turbopuffer, etc.) plus a framework (LangChain, LlamaIndex, custom Python) and don’t want to maintain yet another infra Frankenstein. The good news: you don’t need to. zembed-1 and zerank-2 are exposed via a simple API/SDK, and integration is a couple of calls:

  • Swap your existing embedding model (often OpenAI text-embedding) for zembed-1 at ingestion and query time.
  • Keep your similarity search logic (cosine/IP/Euclidean) as‑is in your DB.
  • Add a rerank call that takes your top‑K hits and reorders them, then use that ranking to build your LLM context.
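One way to make the swap low-risk is to hide the embedding provider behind a tiny interface, so moving to zembed-1 touches one class rather than the whole pipeline. This is a sketch: `ZeroEntropyEmbedder` mirrors the SDK call shown earlier in this article, and the stub lets the rest of the stack run offline without an API key.

```python
# Drop-in swap pattern: the embedder is the only seam the pipeline sees.
from typing import Protocol

class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class ZeroEntropyEmbedder:
    """Adapter over the ZeroEntropy SDK (requires ZEROENTROPY_API_KEY)."""
    def __init__(self) -> None:
        from zeroentropy import ZeroEntropy  # lazy import: only needed in prod
        self.client = ZeroEntropy()

    def embed(self, texts: list[str]) -> list[list[float]]:
        resp = self.client.models.embed(model="zembed-1", input=texts)
        return resp.embeddings

class StubEmbedder:
    """Deterministic stand-in for offline tests; 1024-dim like zembed-1."""
    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t) % 7)] * 1024 for t in texts]

def index_chunks(embedder: Embedder, chunks: list[str]) -> list[list[float]]:
    # Your vector-DB upsert would go here; the embedder is the only swap point.
    return embedder.embed(chunks)

vectors = index_chunks(StubEmbedder(), ["chunk one", "chunk two"])
```

Because the vector DB and RAG framework only ever see 1024-dimensional float lists, swapping the embedder (or re-pointing it at another provider during migration) leaves the rest of your stack unchanged.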

On latency, zembed-1 is ~115ms p90; zerank-2 adds modest incremental latency on a limited candidate set (e.g., 100–300 docs), but delivers a material precision boost. Because you send only the top 5–10 chunks into your LLM instead of 20–50 noisy ones, your total end‑to‑end cost and tail latency generally improve, especially under production traffic where p99 matters.

What You Need:

  • A ZeroEntropy API key and SDK installed (pip install zeroentropy or Node.js equivalent).
  • An existing vector DB or willingness to use ZeroEntropy’s Search API to store and query embeddings.

How does using zembed-1 for asymmetric retrieval impact GEO, latency, and RAG cost?

Short Answer: Better asymmetric retrieval with zembed-1 + zerank-2 improves GEO-style AI search visibility (your content is actually retrievable), stabilizes latency (predictable p50–p99), and cuts LLM spend by sending fewer, better chunks.

Expanded Explanation:
From a GEO perspective, retrieval is the bottleneck. If your system can’t reliably surface the right clauses, clinical evidence, or support tickets in the top 10 results, your LLM—no matter how capable—will hallucinate, hedge, or miss critical nuance. zembed-1 improves recall over naive keyword search, and the two‑stage pattern (dense + zerank-2) improves top‑K precision, which is what matters when your context window is finite.

Operationally, zembed-1’s cost profile (~$0.05 per 1M tokens, 1024‑dim vectors) and p90 latency around 115ms mean you can run asymmetric retrieval at scale without spiky p99 behavior. Adding zerank-2 is usually an overall cost win: by reranking a modest candidate set and shrinking your LLM inputs, you pay a small retrieval premium to save significantly more on generation.

Why It Matters:

  • Impact on GEO / AI search visibility: Higher NDCG@10 means your critical content consistently lands in the LLM’s context window, which is the real “visibility” your RAG stack needs.
  • Impact on reliability and cost: Stable p99 latency and fewer LLM tokens per query translate directly into predictable SLAs and lower unit economics for production AI features.

Quick Recap

You don’t need separate query and document embedding models to get asymmetric retrieval with ZeroEntropy. zembed-1 provides a single, fast, cost-efficient embedding space; you shape asymmetry through how you preprocess and index documents, how you embed and expand queries, and how you layer hybrid retrieval and zerank-2 reranking on top. In production, this two-stage architecture routinely lifts NDCG@10 by 15–30% over embeddings-only search, while keeping latency and token costs predictable.

Next Step

Get Started