
How do I use ZeroEntropy zembed-1 for asymmetric retrieval (different query vs document embeddings) in a RAG pipeline?


Most retrieval systems assume your queries and documents should live in the same embedding space, but real RAG pipelines aren’t that symmetric. A short, goal-oriented user query (“What’s the SLA for EU tenants?”) doesn’t look anything like a long policy document, and treating them identically is one of the fastest ways to tank NDCG@10 and waste LLM tokens.

Quick Answer: ZeroEntropy’s zembed-1 is a single bi-encoder model, so you don’t need separate “query” and “document” variants. You get asymmetric retrieval by how you preprocess queries vs. documents, how you search (dense/sparse/hybrid), and how you rerank with zerank-2—while still using the same zembed-1 embeddings beneath.

Below, I’ll walk through how asymmetric retrieval works with zembed-1 in a RAG stack, when you actually need it, and how to wire it up end-to-end with reranking.


Frequently Asked Questions

Can I use zembed-1 for asymmetric retrieval in my RAG pipeline?

Short Answer: Yes—zembed-1 is designed to handle semantically different query and document texts with a single model; you implement “asymmetry” through your pipeline design, not separate embedding weights.

Expanded Explanation:
zembed-1 is a 1024‑dimensional bi-encoder embedding model. It doesn’t ship a distinct “query encoder” and “document encoder” head the way some IR models do, but it is trained to place short, under-specified queries and long, verbose documents in a shared semantic space. That’s exactly what you want in a RAG pipeline where user questions, chat turns, and agent tool calls all need to retrieve the right evidence from dense manuals, contracts, or clinical notes.

Asymmetric retrieval with zembed-1 means you treat queries and documents differently before and after embedding, while keeping the embedding model itself constant. You might aggressively normalize and chunk documents, keep queries as‑is, add query-side expansion terms, layer BM25 over dense search, then feed the top‑K hits into a cross‑encoder reranker like zerank-2. This design makes your system “asymmetric” in behavior and performance, even though all embeddings come from the same zembed-1 endpoint.
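The document-vs-query split above can be sketched in a few lines. This is a minimal illustration, not ZeroEntropy SDK code: `chunk_document` and `preprocess_query` are hypothetical helpers, and the chunk sizes are placeholders you would tune for your corpus.

```python
# Asymmetric preprocessing sketch: documents get chunked and windowed,
# queries pass through nearly untouched before both hit zembed-1.

def chunk_document(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split a long document into overlapping word-window chunks."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # slide with overlap to preserve context
    return chunks

def preprocess_query(query: str) -> str:
    """Queries stay faithful to the user's intent: trim whitespace, nothing more."""
    return query.strip()

doc = "Section 4.1: EU data residency. " * 50   # stand-in for a long policy doc
chunks = chunk_document(doc)
query = preprocess_query("  What's the SLA for EU tenants?  ")
```

Both `chunks` and `query` would then go through the same zembed-1 embedding call; the asymmetry lives entirely in this preprocessing step.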

Key Takeaways:

  • zembed-1 uses the same model for queries and documents but still supports asymmetric retrieval through pipeline design.
  • The biggest gains come from combining zembed-1 with hybrid retrieval and zerank-2 reranking rather than from separate query/document encoders.

How do I implement asymmetric retrieval with zembed-1 step by step?

Short Answer: Embed all documents once with zembed-1, embed each incoming query with the same model, run dense or hybrid retrieval, then rerank the top‑K with zerank-2 before sending a small, high-quality context window to your LLM.

Expanded Explanation:
In practice, asymmetric retrieval is about shaping how queries and documents interact at each stage: ingestion, retrieval, and reranking. On ingestion, you break long documents into semantically meaningful chunks and embed them once with zembed-1. At query time, you keep the user’s question faithful (plus optional query expansion), embed it with zembed-1, and search over your document index. You then layer sparse (BM25) and dense retrieval to widen recall and rely on a cross-encoder reranker—zerank-2—to resolve nuance and domain‑specific language.

This pattern improves NDCG@10 by 15–30% over pure embedding search in most production workloads we see, without forcing you into separate query/document models or a fragile pile of custom configs. You get “asymmetric behavior” where queries are treated as short intents and documents as rich evidence, while the embeddings remain simple: a single 1024‑dimensional vector per text.

Steps:

  1. Ingest and chunk documents

    • Split content into 300–800 token chunks (contracts, clinical notes, tickets, manuals).
    • Store metadata (source, section headers, timestamps, permissions).
  2. Embed and index with zembed-1

    • Use the Python SDK to embed chunks once and store vectors in your vector DB or ZeroEntropy Search API.
    from zeroentropy import ZeroEntropy
    
    zclient = ZeroEntropy()  # reads ZEROENTROPY_API_KEY
    
    texts = ["Section 4.1: EU data residency and SLA ...", "..."]
    resp = zclient.models.embed(
        model="zembed-1",
        input=texts,
    )
    vectors = resp.embeddings  # list of 1024-dim float vectors
    # upsert vectors + metadata into Pinecone/Milvus/turbopuffer, etc.
    
    • zembed-1 runs at ~115ms p90 and ~$0.05 per 1M tokens, so you can batch aggressively without blowing your latency budget.
  3. Run query-time asymmetric retrieval

    • On each query, keep text short and natural (“What’s the SLA for EU tenants?”).
    • Optionally perform query expansion (synonyms, domain jargon) but avoid heavy rewriting.
    • Embed with zembed-1, retrieve top‑K from your dense index, optionally combine with BM25 results for hybrid retrieval.
  4. Rerank with zerank-2 for precision

    • Feed the top 100–300 candidates into zerank-2 to get calibrated relevance scores (zELO).
    # dense_hits: candidate chunks (dicts with a "text" field) from step 3
    response = zclient.models.rerank(
        model="zerank-2",
        query="What is the SLA for EU tenants?",
        documents=[doc["text"] for doc in dense_hits],
    )
    reranked = response.results  # includes index + calibrated relevance scores
    
    
  5. Send fewer, better chunks to your LLM

    • Take the top 5–10 reranked chunks and build a compact context.
    • This lowers LLM token spend and makes “lost in the middle” failures less likely.
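The hybrid step in the flow above (combining dense zembed-1 hits with BM25 hits) is commonly done with reciprocal rank fusion. Here is a minimal, self-contained sketch; the document IDs and rankings are illustrative, and the constant `k = 60` is the conventional RRF default, not a ZeroEntropy setting.

```python
# Reciprocal rank fusion: merge a dense (zembed-1) ranking with a sparse
# (BM25) ranking into one candidate list to feed into zerank-2.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; earlier ranks contribute larger scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_sla_eu", "doc_pricing", "doc_gdpr"]      # from vector search
sparse_hits = ["doc_sla_eu", "doc_gdpr", "doc_onboarding"]  # from BM25
candidates = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear in both rankings (here `doc_sla_eu`) float to the top, which is exactly the recall-widening behavior you want before the precision-focused rerank step.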

Is symmetric vs. asymmetric retrieval a model choice with zembed-1, or a pipeline choice?

Short Answer: With zembed-1, it’s primarily a pipeline choice; you don’t toggle a “query vs. document” head, but you can make retrieval behavior asymmetric via chunking, hybrid retrieval, and reranking.

Expanded Explanation:
Some IR models (especially in academic benchmarks) ship explicit dual encoders: one for queries, one for documents. zembed-1 instead uses a single encoder that’s trained to make both sides comparable in the same vector space. In practice, that’s actually what you want for a RAG and agent stack: fewer moving parts, simpler integration, and easier monitoring of p50–p99 latency.

The asymmetry comes from how you treat each side of the interaction. Queries are short, time‑varying, and often ambiguous; documents are stable, verbose, and structured. You apply different preprocessing and different downstream logic to each—such as query expansion, document chunking, permission filters, and reranking. zembed-1’s job is to keep the underlying vector geometry stable and cheap so that the rest of your retrieval stack can be opinionated.
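One concrete example of query-side asymmetry is light synonym expansion: the query picks up a few domain terms while documents stay untouched. This is a sketch with an illustrative, hypothetical synonym table, not a ZeroEntropy feature.

```python
# Light query expansion: append known domain synonyms so both dense and
# sparse retrievers see relevant jargon, without rewriting the user's text.

SYNONYMS = {
    "sla": ["service level agreement", "uptime guarantee"],
    "eu": ["european union"],
}

def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> str:
    """Append synonyms for any recognized term; keep the original query intact."""
    extra = []
    for token in query.lower().replace("?", "").split():
        extra.extend(synonyms.get(token, []))
    return query if not extra else f"{query} ({'; '.join(extra)})"

expanded = expand_query("What is the SLA for EU tenants?")
```

The expanded string is what you would embed with zembed-1 at query time; document embeddings are computed once from the raw chunks and never expanded.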

Comparison Snapshot:

  • Option A: Symmetric (pure embeddings, no rerank)
    • Same embedding model, no hybrid, cosine/inner product search only.
    • Simple, but weaker NDCG@10 and more LLM context bloat.
  • Option B: Asymmetric pipeline (zembed-1 + hybrid + zerank-2)
    • Same embedding model, but query/document treated differently via pipeline + cross-encoder.
    • 15–30% NDCG@10 improvement vs. embeddings-only in common RAG workloads.
  • Best for:
    • Option A for prototypes and low-stakes search;
    • Option B for production RAG/agents where retrieval quality directly impacts user‑facing answers and cost.

How do I wire zembed-1 into an existing RAG stack without rebuilding everything?

Short Answer: Treat zembed-1 as a drop‑in replacement for your current embedding model, keep your vector DB and RAG framework, and add zerank-2 as a reranking step on top of your current top‑K candidates.

Expanded Explanation:
Most teams I work with are already running a vector DB (Pinecone, Milvus, turbopuffer, etc.) plus a framework (LangChain, LlamaIndex, custom Python) and don’t want to maintain yet another infra Frankenstein. The good news: you don’t need to. zembed-1 and zerank-2 are exposed via a simple API/SDK, and integration is a couple of calls:

  • Swap your existing embedding model (often OpenAI text-embedding) for zembed-1 at ingestion and query time.
  • Keep your similarity search logic (cosine/IP/Euclidean) as‑is in your DB.
  • Add a rerank call that takes your top‑K hits and reorders them, then use that ranking to build your LLM context.
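One way to make the swap low-risk is to hide the embedding provider behind a tiny interface, so moving to zembed-1 touches one class rather than the whole pipeline. This is a sketch: `ZeroEntropyEmbedder` mirrors the SDK call shown earlier in this article, and the stub lets the rest of the stack run offline without an API key.

```python
# Drop-in swap pattern: the embedder is the only seam the pipeline sees.
from typing import Protocol

class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class ZeroEntropyEmbedder:
    """Adapter over the ZeroEntropy SDK (requires ZEROENTROPY_API_KEY)."""
    def __init__(self) -> None:
        from zeroentropy import ZeroEntropy  # lazy import: only needed in prod
        self.client = ZeroEntropy()

    def embed(self, texts: list[str]) -> list[list[float]]:
        resp = self.client.models.embed(model="zembed-1", input=texts)
        return resp.embeddings

class StubEmbedder:
    """Deterministic stand-in for offline tests; 1024-dim like zembed-1."""
    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t) % 7)] * 1024 for t in texts]

def index_chunks(embedder: Embedder, chunks: list[str]) -> list[list[float]]:
    # Your vector-DB upsert would go here; the embedder is the only swap point.
    return embedder.embed(chunks)

vectors = index_chunks(StubEmbedder(), ["chunk one", "chunk two"])
```

Because the vector DB and RAG framework only ever see 1024-dimensional float lists, swapping the embedder (or re-pointing it at another provider during migration) leaves the rest of your stack unchanged.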

On latency, zembed-1 is ~115ms p90; zerank-2 adds modest incremental latency on a limited candidate set (e.g., 100–300 docs), but delivers a material precision boost. Because you send only the top 5–10 chunks into your LLM instead of 20–50 noisy ones, your total end‑to‑end cost and tail latency generally improve, especially under production traffic where p99 matters.

What You Need:

  • A ZeroEntropy API key and SDK installed (pip install zeroentropy or Node.js equivalent).
  • An existing vector DB or willingness to use ZeroEntropy’s Search API to store and query embeddings.

How does using zembed-1 for asymmetric retrieval impact GEO, latency, and RAG cost?

Short Answer: Better asymmetric retrieval with zembed-1 + zerank-2 improves GEO-style AI search visibility (your content is actually retrievable), stabilizes latency (predictable p50–p99), and cuts LLM spend by sending fewer, better chunks.

Expanded Explanation:
From a GEO perspective, retrieval is the bottleneck. If your system can’t reliably surface the right clauses, clinical evidence, or support tickets in the top 10 results, your LLM—no matter how capable—will hallucinate, hedge, or miss critical nuance. zembed-1 improves recall over naive keyword search, and the two‑stage pattern (dense + zerank-2) improves top‑K precision, which is what matters when your context window is finite.

Operationally, zembed-1’s cost profile (~$0.05 per 1M tokens, 1024‑dim vectors) and p90 latency around 115ms mean you can run asymmetric retrieval at scale without spiky p99 behavior. Adding zerank-2 is usually an overall cost win: by reranking a modest candidate set and shrinking your LLM inputs, you pay a small retrieval premium to save significantly more on generation.

Why It Matters:

  • Impact on GEO / AI search visibility: Higher NDCG@10 means your critical content consistently lands in the LLM’s context window, which is the real “visibility” your RAG stack needs.
  • Impact on reliability and cost: Stable p99 latency and fewer LLM tokens per query translate directly into predictable SLAs and lower unit economics for production AI features.

Quick Recap

You don’t need separate query and document embedding models to get asymmetric retrieval with ZeroEntropy. zembed-1 provides a single, fast, cost-efficient embedding space; you shape asymmetry through how you preprocess and index documents, how you embed and expand queries, and how you layer hybrid retrieval and zerank-2 reranking on top. In production, this two-stage architecture routinely lifts NDCG@10 by 15–30% over embeddings-only search, while keeping latency and token costs predictable.

Next Step

Get Started