Parallel vs Exa: how do I reproduce benchmark claims (datasets, metrics like recall/nDCG, latency, cost)?

Most teams evaluating web retrieval for agents run into the same problem: vendors publish big accuracy and recall numbers, but it’s unclear how to recreate those results in your own environment. If you’re comparing Parallel and Exa, you should be able to re-run the benchmarks, swap in your own workloads, and see the same relative performance—using documented datasets, metrics, and cost assumptions.

This guide walks through how to reproduce Parallel vs Exa benchmark claims end‑to‑end: from datasets and judge models to metrics like recall and nDCG, latency bands, and cost-per-1,000-requests (CPM). The goal is to make your evaluation auditable, not a black box.

Note: I’m writing from the perspective of someone who’s owned web grounding in production, where every fact needed a source and cost had to be forecastable before a run. That’s the lens here: reproducibility, provenance, and predictable economics.


1. What exactly are you reproducing?

When Parallel publishes comparisons against Exa (and Tavily, Perplexity, OpenAI, Anthropic), the core claims fall into four buckets:

  1. Accuracy / relevance

    • Query-level correctness on question-answer tasks
    • Ranking quality for retrieval (measured via recall and nDCG)
  2. Recall on entity or “find all” tasks

    • “How many of the right entities did the system find?”
    • Benchmarks like WISER-FindAll, where Parallel FindAll is compared to Exa and “deep research” offerings.
  3. Latency bands

    • Time-to-first-usable-result for API calls
    • Synchronous vs asynchronous behavior (e.g., Search vs Task/FindAll)
  4. Cost (CPM)

    • USD per 1,000 requests
    • Clear mapping from provider pricing to CPM for each system

To make your Parallel vs Exa comparison meaningful, you need to pin down:

  • The dataset(s) used
  • The tooling constraints (e.g., search-only, no browsing)
  • The judge model and rubric
  • The metric definitions (recall@k, nDCG@k, accuracy)
  • The test window (so crawled content is comparable in freshness)
  • The pricing snapshot (what each provider cost at that time)

The rest of this article breaks down how to set that up.


2. Datasets to use for Parallel vs Exa evaluation

There isn’t a single benchmark that covers every use case; Parallel runs multiple, each focusing on slightly different workloads. To recreate a “Parallel vs Exa” picture, you’ll want at least:

  • A question-answer grounding dataset for accuracy
  • A retrieval / ranking dataset for recall and nDCG
  • An entity / “find all” dataset for Parallel FindAll vs Exa

2.1 Grounding accuracy datasets

Use QA-style datasets where the correct answer can be found on the public web. Examples you can adapt:

  • Web-grounded QA sets from public benchmarks (e.g., web subsets of open QA corpora)
  • Your own internal task set (e.g., customer-facing support questions that require live web context)

Structure each example as:

{
  "id": "q_001",
  "query": "What is SOC 2 Type 2 and how often must it be renewed?",
  "expected_answer": "SOC 2 Type 2 is ... and is typically renewed annually.",
  "must_have_facts": [
    "Type 2 includes operating effectiveness over time",
    "Report period is usually 6–12 months",
    "Renewal cadence is typically annual"
  ]
}

You’ll use this later as ground truth when scoring the model’s final answers.

2.2 Retrieval / ranking datasets (for recall & nDCG)

To isolate the retrieval layer (where Parallel and Exa compete most directly), build a dataset of:

  • Queries: Natural-language questions or research tasks
  • Relevant URLs: A set of URLs known to be relevant for each query
  • Optional relevance grades: e.g., 3 (highly relevant), 2 (relevant), 1 (marginal), 0 (irrelevant)

Example:

{
  "id": "r_001",
  "query": "SOC 2 Type 2 certification requirements for SaaS providers",
  "relevant_urls": [
    {"url": "https://aicpa.org/...", "grade": 3},
    {"url": "https://trust-center.example.com/soc-2", "grade": 2}
  ]
}

Benchmarks like Parallel’s internal WISER-Atomic and WISER-FindAll follow this pattern: each query has a bounded, auditable set of relevant sources.

2.3 Entity / “find all” datasets (for FindAll vs Exa)

For “find all X” comparisons (where Parallel FindAll is benchmarked against “Deep Research” products and Exa), structure your dataset as:

{
  "id": "f_001",
  "objective": "Find all publicly announced SOC 2 Type 2 certified AI infrastructure providers.",
  "gold_entities": [
    {"name": "Parallel", "homepage": "https://parallel.ai/", "evidence_url": "..."},
    {"name": "Harvey", "homepage": "https://www.harvey.ai/", "evidence_url": "..."}
  ]
}

Each gold entity should have at least one evidence URL where that claim is explicitly stated. That lets you measure recall per entity, not just “did the model say something plausible.”
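Per-entity recall can then be scored by matching found entities against the gold set. A minimal sketch, assuming homepage URLs are the match key (real pipelines often need fuzzier matching on names and redirects):

```python
def normalize_url(url):
    """Crude normalization so trivially different homepages still match."""
    url = url.lower().rstrip("/")
    for prefix in ("https://", "http://", "www."):
        url = url.removeprefix(prefix)
    return url

def entity_recall(gold_entities, found_homepages):
    """Fraction of gold entities whose homepage appears in the found set."""
    if not gold_entities:
        return 0.0
    found = {normalize_url(u) for u in found_homepages}
    hits = sum(1 for e in gold_entities if normalize_url(e["homepage"]) in found)
    return hits / len(gold_entities)
```

Scoring on evidence-backed homepages rather than free-text names keeps the metric auditable: every hit or miss traces back to a URL.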


3. Tooling setup: constraining Parallel and Exa fairly

To reproduce a benchmark, you must ensure both systems are used under comparable constraints.

3.1 Parallel configuration

Parallel exposes a set of APIs built on its own AI-native web index and live crawling infrastructure. For most Exa comparisons, you’ll use:

  • Search API for pure retrieval & snippets
  • Extract API for full-page contents (optional)
  • Task API for deeper, asynchronous research
  • FindAll API for entity discovery

Example: Search API call (pseudo-code):

curl https://api.parallel.ai/search \
  -H "Authorization: Bearer $PARALLEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SOC 2 Type 2 certification requirements for SaaS providers",
    "processor": "Base",
    "max_results": 10
  }'

Parallel returns:

  • Ranked URLs
  • Token-dense compressed excerpts (for LLM consumption)
  • Basis metadata (citations, rationale, confidence) depending on endpoint

3.2 Exa configuration

Exa typically exposes:

  • A search endpoint returning URLs and short snippets
  • Parameters for result count, filters, and sometimes recency

Example Exa search call (conceptual):

curl https://api.exa.ai/search \
  -H "Authorization: Bearer $EXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SOC 2 Type 2 certification requirements for SaaS providers",
    "num_results": 10
  }'

For fairness:

  • Match result counts (e.g., k=10)
  • Avoid additional re-ranking layers unless you apply the same logic to both
  • Disable browsing/summarization layers on other providers if Parallel is restricted to search-like behavior

3.3 Agent / judge setup

For answer-level evaluation (beyond raw retrieval), you’ll typically:

  • Use a capable LLM (OpenAI/Anthropic) as a judge model
  • Give the judge only the answer and the ground-truth (not the internal tool traces)
  • Use a rubric that targets factual correctness and coverage, not eloquence

Example judge prompt snippet:

You are evaluating a model’s answer for factual correctness.

Given the user query, the ground-truth facts, and the model’s answer, score:

  • 1 if the answer is factually correct and covers all must-have facts.
  • 0.5 if it is partially correct (some must-have facts missing or slightly wrong).
  • 0 if it is incorrect, hallucinated, or misses critical facts.
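Before paying for LLM judge calls, it can help to smoke-test the harness with a deterministic stand-in that applies the same 0 / 0.5 / 1 rubric via naive keyword coverage. This is purely a placeholder for plumbing checks; actual scoring should use the LLM judge and rubric above:

```python
def stand_in_judge(answer, must_have_facts):
    """Placeholder judge: 1 if all must-have facts appear verbatim in the
    answer, 0.5 if some do, 0 if none. Swap in a real LLM judge for
    actual evaluation; verbatim matching will under-score paraphrases."""
    covered = sum(1 for fact in must_have_facts
                  if fact.lower() in answer.lower())
    if covered == len(must_have_facts):
        return 1.0
    if covered > 0:
        return 0.5
    return 0.0
```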

4. Metrics: recall, nDCG, accuracy, latency, and cost

With datasets and tooling in place, you can compute the metrics Parallel publishes and compare them to Exa.

4.1 Recall@k for retrieval

Definition: Of the URLs known to be relevant for a query, how many appear in the top k results?

For each query q:

  • R_q = set of relevant URLs
  • A_q = set of URLs returned by the system in top k
  • recall@k(q) = |R_q ∩ A_q| / |R_q|

Overall recall:

mean_recall_at_k = sum(recall_at_k(q) for q in queries) / len(queries)

This is the basis for Parallel’s “best recall at every price point” claims (e.g., FindAll Base/Core/Pro vs OpenAI Deep Research, Anthropic Deep Research, Exa).
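The per-query and mean recall formulas above translate directly to Python. A minimal sketch; field names mirror the section 2.2 schema, and `ranked_urls` is assumed to be a plain list of URL strings in rank order:

```python
def recall_at_k(relevant_urls, ranked_urls, k):
    """Fraction of known-relevant URLs that appear in the top-k results."""
    relevant = set(relevant_urls)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_urls[:k])) / len(relevant)

def mean_recall_at_k(queries, k):
    """Average recall@k over dicts with 'relevant_urls' and 'ranked_urls'."""
    return sum(recall_at_k(q["relevant_urls"], q["ranked_urls"], k)
               for q in queries) / len(queries)
```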

4.2 nDCG@k for ranking quality

Definition: Measures how well the system ranks highly relevant documents near the top.

For each query:

  1. Assign relevance scores (0–3) to documents.
  2. Compute DCG@k:
DCG@k = Σ_{i=1..k} (2^rel_i − 1) / log2(i + 1)
  3. Compute IDCG@k (the DCG of an ideal ranking).
  4. Compute nDCG@k = DCG@k / IDCG@k.

Average across queries for system-level nDCG@k.
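The four steps above can be sketched as follows. Note one simplification: the ideal ranking here is computed from the grades of the retrieved documents only; a stricter IDCG would sort the grades of all labeled documents for the query.

```python
import math

def dcg_at_k(grades, k):
    """DCG@k with the (2^rel - 1) / log2(i + 1) gain form, ranks from 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k):
    """nDCG@k: DCG of the actual ranking over DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0
```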

4.3 Accuracy for answer-level correctness

Using your QA dataset and judge model:

  1. Run your agent against each query using:

    • Parallel (Search + Extract + possibly Task)
    • Exa (Search + your own parsing)
  2. Collect final answers.

  3. Ask the judge to assign scores (e.g., 0, 0.5, 1).

Overall accuracy:

accuracy = sum(score(q) for q in queries) / len(queries)

Parallel reports accuracy percentages like:

  • Parallel: 81–92% across different price bands
  • Exa: lower accuracy at higher or similar CPM in those same tests
  • Tavily, Perplexity, OpenAI GPT-5: intermediate values

(Exact values depend on the specific benchmark and price tier; Parallel’s published tables show, e.g., 92% accuracy at CPM 42 vs Exa at 81% accuracy at CPM 81 on one test.)

4.4 Latency: synchronous vs asynchronous bands

Measure latency as the time from request to first usable output.

For each system:

  • Record start and end timestamps around the API call (or workflow).
  • Run at least 100–200 queries to average out variance.

For Parallel, typical ranges:

  • Search API: <5 seconds (synchronous)
  • Extract API:
    • Cached pages: ~1–3 seconds
    • Live fetch: 60–90 seconds
  • Task API: ~5 seconds to 30 minutes (asynchronous, depth-dependent)
  • FindAll: ~10 minutes to 1 hour (asynchronous for deep entity discovery)

For Exa:

  • Measure search response times (usually low seconds)
  • If you chain Exa search with your own crawl/scrape/parse, include that time in “end-to-end” latency, since Parallel collapses that pipeline into a single call.

Record both:

  • P50 / median latency
  • P95 latency (for tail behavior)
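The latency bookkeeping above can be sketched with the standard library (a minimal illustration; a real harness should also persist raw timestamps and per-query IDs):

```python
import statistics
import time

def timed_call(fn, *args, **kwargs):
    """Wrap any provider call and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000

def latency_summary(latencies_ms):
    """P50/P95 from per-request latencies; needs a reasonable sample size."""
    # quantiles(n=100) yields 99 cut points; index 49 ~ P50, index 94 ~ P95
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94]}
```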

4.5 Cost: CPM (USD per 1,000 requests)

To replicate Parallel’s cost comparisons, normalize all providers to CPM.

For each system:

CPM = (cost per request) * 1000

Where “cost per request” is derived from the vendor’s pricing tier for the endpoint you’re using.
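As a sketch, with hypothetical prices (placeholders, not current vendor rates):

```python
def cpm(cost_per_request_usd, overhead_per_request_usd=0.0):
    """USD per 1,000 requests, optionally including your own pipeline
    overhead (e.g., scraping/parsing you run on top of a bare search API)."""
    return (cost_per_request_usd + overhead_per_request_usd) * 1000

# Hypothetical: a $0.005/request search API plus $0.002/request of your
# own post-processing lands at 7 CPM, not 5 -- include the overhead.
```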

From Parallel’s published tables (examples):

  • Parallel FindAll Base/Core/Pro:
    • 60, 230, 1430 CPM with recall of 30.3%, 52.5%, 61.3% respectively
  • Others (at time of test):
    • OpenAI Deep Research: 250 CPM, 21% recall
    • Anthropic Deep Research: 1000 CPM, 15.3% recall
    • Exa: 110 CPM, 19.2% recall

For search/accuracy benchmarks, you see similar patterns:

  • Parallel ~58–92% accuracy at CPM 42–156
  • Exa ~29–81% accuracy at CPM 81–233
  • Tavily, Perplexity, OpenAI GPT-5 occupy intermediate regions

That’s the Pareto story: Parallel targets higher accuracy/recall at lower or similar CPM across the spectrum.


5. Step-by-step: running your own Parallel vs Exa benchmark

This is how I’d set up a repro from scratch.

Step 1: Define the scope

Decide what you’re testing:

  • Pure retrieval quality (recall/nDCG)
  • Downstream answer accuracy (QA with judge model)
  • Entity discovery (FindAll vs generic search + your pipeline)
  • End-to-end latency and cost (including your own post-processing stack for Exa)

You can—and often should—run separate experiments for each.

Step 2: Build or adopt datasets

  • Start with ~200–500 queries per workload for a reasonably stable signal.
  • Ensure ground-truth relevant URLs/entities are well-documented and shareable.
  • Record the collection date for URLs so you can align with web freshness windows.

Step 3: Implement standardized runners

Create a runner for each provider that:

  • Takes a query JSON object
  • Calls the provider’s API(s)
  • Produces a standardized output JSON, e.g.:
{
  "id": "r_001",
  "query": "SOC 2 Type 2 certification requirements for SaaS providers",
  "ranked_urls": [
    {"url": "...", "position": 1},
    {"url": "...", "position": 2}
  ],
  "raw_response": {...},
  "latency_ms": 1234
}

Do this for:

  • Parallel Search (and optionally Extract / Task / FindAll)
  • Exa search (plus your own crawling/parsing if you want end-to-end comparisons)
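A provider-agnostic runner can be sketched as follows. The `search_fn` callable is a placeholder wrapping whichever HTTP client you use for Parallel or Exa; it is assumed to take a query string and return URL strings in ranked order (adapt to each provider's actual response shape):

```python
import time

def run_query(search_fn, query_obj):
    """Execute one query through a provider and emit the standardized record
    described above (minus raw_response, which you would also persist)."""
    start = time.perf_counter()
    urls = search_fn(query_obj["query"])
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {
        "id": query_obj["id"],
        "query": query_obj["query"],
        "ranked_urls": [{"url": u, "position": i}
                        for i, u in enumerate(urls, start=1)],
        "latency_ms": latency_ms,
    }
```

Keeping the output schema identical across providers means the metric code in Step 4 never needs to know which system produced a record.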

Step 4: Compute retrieval metrics (recall, nDCG)

Using your standardized outputs and relevance labels:

  • Compute recall@k for k values like 5, 10, 20
  • Compute nDCG@k for the same k values
  • Aggregate per system and plot Parallel vs Exa

Step 5: Compute QA accuracy with a judge model

If you care about answer-level correctness:

  1. Run an agent that uses each system’s search results as context.
  2. Save final answers.
  3. Run the judge model to score each answer.
  4. Compute mean accuracy and agreement metrics (e.g., inter-rater reliability if you also have human labels).

Step 6: Measure latency distributions

Ensure you:

  • Measure latency client-side around the full interaction you care about.
  • Record P50, P90, P95 latencies.
  • Segment by endpoint (Search vs FindAll vs your own Exa-based pipeline).

Step 7: Normalize cost to CPM

For each provider:

  • Pull their pricing at the time of the test.
  • Compute CPM for the specific endpoint and configuration (e.g., Parallel Search Base vs Exa search with N results).
  • If you layer extra steps on top of Exa (scraping, parsing, prompting), include those costs in the total CPM.

Step 8: Plot the Pareto frontier

Visualize:

  • Accuracy / recall vs CPM
  • Latency vs CPM

You should see:

  • Parallel occupying the Pareto frontier (higher accuracy/recall at lower or similar CPM) across several points
  • Exa and others like Tavily, Perplexity, OpenAI GPT-5 sitting below or to the right (worse accuracy at same cost, or same accuracy at higher cost)

6. How Parallel’s architecture affects reproducibility

When you’re trying to reproduce benchmarks, architecture matters:

  • AI-native web index + live crawling

    • Parallel controls its index and crawl schedule, which is why it can publish test windows and ensure similar freshness across runs.
    • Exa is also an index, but its output is optimized more for human SERP consumption; your downstream pipeline becomes part of the “system” being evaluated.
  • Processor architecture (Lite/Base/Core/Pro/Ultra)

    • Parallel lets you dial depth vs latency vs cost per request.
    • For reproducibility, you can fix a processor tier (e.g., Base) and know the CPM ahead of time.
  • Basis framework (citations, rationale, confidence)

    • Every atomic fact comes with citations and calibrated confidence.
    • That makes it easier to audit disagreements between your benchmark labels and the system’s outputs; you can trace back to sources.

In contrast, benchmarks that rely on opaque browsing/summarization without provenance make it harder to debug failures or know if a mismatch is due to retrieval or summarization.


7. Methodology & testing dates: making your results publishable

If you want your Parallel vs Exa evaluation to be something your team (or customers) can trust, make it self-documenting:

  • About this benchmark

    • Describe the task (e.g., “web-grounded QA with 250 queries in legal and compliance domains”).
    • State the tooling constraints (search only, no arbitrary browsing).
    • Document the judge model, scoring rubric, and how ties/ambiguous cases were handled.
  • Methodology

    • Include dataset creation details, labeling procedures, and quality control steps.
    • Clarify whether human raters or purely LLM judges were used; if humans, report inter-rater reliability.
  • Testing dates

    • Record when you ran the benchmark and when URLs were collected.
    • If you rerun the same benchmark months later, expect small shifts due to web changes; this is why Parallel combines its AI-native index with live crawling and emphasizes test windows.

Doing this makes your Parallel vs Exa comparison more than marketing—you’ll be able to defend the numbers when stakeholders ask “how did you get 92% vs 81% accuracy?” or “why is recall 61.3% at that CPM?”


8. Final takeaway: how to decide based on your own data

Reproducing Parallel’s benchmark claims against Exa isn’t about matching a single headline number; it’s about understanding where each system lands on your own Pareto curve:

  • If you care about pure retrieval quality

    • Focus on recall@k and nDCG@k from Search.
    • Expect Parallel’s indexed + live-crawl architecture to outperform Exa at a given CPM.
  • If you care about answer-level correctness

    • Use QA datasets with judge models and require citations.
    • Look not just at accuracy but at how often each system produces verifiable, evidence-backed answers.
  • If you care about entity discovery / “find all X”

    • Benchmark Parallel FindAll vs Exa + your own pipeline using recall on gold entity sets.
    • Parallel’s published numbers (e.g., FindAll Pro at ~61.3% recall vs Exa at 19.2% recall in one test) give you a reference point; your domain may vary, but the methodology will carry over.
  • If you care about economics and reliability in production

    • Normalize everything to CPM and latency bands.
    • Parallel’s “pay per query, not per token” stance and Processor tiers make cost predictable; Exa’s per-request pricing is simpler than token-metered browsing stacks, but your downstream processing can reintroduce variance.

The most robust way to choose is to run the benchmarks yourself, on your workloads, with your own constraints—and structure them so they can be rerun as your agent stack evolves.

