Parallel vs Exa: how do I reproduce benchmark claims (datasets, metrics like recall/nDCG, latency, cost)?

Most teams evaluating web retrieval for agents run into the same problem: vendors publish big accuracy and recall numbers, but it’s unclear how to recreate those results in your own environment. If you’re comparing Parallel and Exa, you should be able to re-run the benchmarks, swap in your own workloads, and see the same relative performance—using documented datasets, metrics, and cost assumptions.

This guide walks through how to reproduce Parallel vs Exa benchmark claims end‑to‑end: from datasets and judge models to metrics like recall and nDCG, latency bands, and cost-per-1,000-requests (CPM). The goal is to make your evaluation auditable, not a black box.

Note: I’m writing from the perspective of someone who’s owned web grounding in production, where every fact needed a source and cost had to be forecastable before a run. That’s the lens here: reproducibility, provenance, and predictable economics.


1. What exactly are you reproducing?

When Parallel publishes comparisons against Exa (and Tavily, Perplexity, OpenAI, Anthropic), the core claims fall into four buckets:

  1. Accuracy / relevance

    • Query-level correctness on question-answer tasks
    • Ranking quality for retrieval (measured via recall and nDCG)
  2. Recall on entity or “find all” tasks

    • “How many of the right entities did the system find?”
    • Benchmarks like WISER-FindAll, where Parallel FindAll is compared to Exa and “deep research” offerings.
  3. Latency bands

    • Time-to-first-usable-result for API calls
    • Synchronous vs asynchronous behavior (e.g., Search vs Task/FindAll)
  4. Cost (CPM)

    • USD per 1,000 requests
    • Clear mapping from provider pricing to CPM for each system

To make your Parallel vs Exa comparison meaningful, you need to pin down:

  • The dataset(s) used
  • The tooling constraints (e.g., search-only, no browsing)
  • The judge model and rubric
  • The metric definitions (recall@k, nDCG@k, accuracy)
  • The test window (so crawled content is comparable in freshness)
  • The pricing snapshot (what each provider cost at that time)

The rest of this article breaks down how to set that up.


2. Datasets to use for Parallel vs Exa evaluation

There isn’t a single benchmark that covers every use case; Parallel runs multiple, each focusing on slightly different workloads. To recreate a “Parallel vs Exa” picture, you’ll want at least:

  • A question-answer grounding dataset for accuracy
  • A retrieval / ranking dataset for recall and nDCG
  • An entity / “find all” dataset for Parallel FindAll vs Exa

2.1 Grounding accuracy datasets

Use QA-style datasets where the correct answer can be found on the public web. Examples you can adapt:

  • Web-grounded QA sets from public benchmarks (e.g., web subsets of open QA corpora)
  • Your own internal task set (e.g., customer-facing support questions that require live web context)

Structure each example as:

{
  "id": "q_001",
  "query": "What is SOC 2 Type 2 and how often must it be renewed?",
  "expected_answer": "SOC 2 Type 2 is ... and is typically renewed annually.",
  "must_have_facts": [
    "Type 2 includes operating effectiveness over time",
    "Report period is usually 6–12 months",
    "Renewal cadence is typically annual"
  ]
}

You’ll use this later as ground truth when scoring the model’s final answers.

2.2 Retrieval / ranking datasets (for recall & nDCG)

To isolate the retrieval layer (where Parallel and Exa compete most directly), build a dataset of:

  • Queries: Natural-language questions or research tasks
  • Relevant URLs: A set of URLs known to be relevant for each query
  • Optional relevance grades: e.g., 3 (highly relevant), 2 (relevant), 1 (marginal), 0 (irrelevant)

Example:

{
  "id": "r_001",
  "query": "SOC 2 Type 2 certification requirements for SaaS providers",
  "relevant_urls": [
    {"url": "https://aicpa.org/...", "grade": 3},
    {"url": "https://trust-center.example.com/soc-2", "grade": 2}
  ]
}

Benchmarks like Parallel’s internal WISER-Atomic and WISER-FindAll follow this pattern: each query has a bounded, auditable set of relevant sources.

2.3 Entity / “find all” datasets (for FindAll vs Exa)

For “find all X” comparisons (where Parallel FindAll is benchmarked against “Deep Research” products and Exa), structure your dataset as:

{
  "id": "f_001",
  "objective": "Find all publicly announced SOC 2 Type 2 certified AI infrastructure providers.",
  "gold_entities": [
    {"name": "Parallel", "homepage": "https://parallel.ai/", "evidence_url": "..."},
    {"name": "Harvey", "homepage": "https://www.harvey.ai/", "evidence_url": "..."}
  ]
}

Each gold entity should have at least one evidence URL where that claim is explicitly stated. That lets you measure recall per entity, not just “did the model say something plausible.”
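Per-entity recall can then be scored by matching found entities against the gold set. A minimal sketch, assuming homepage URLs are the match key (real pipelines often need fuzzier matching on names and redirects):

```python
def normalize_url(url):
    """Crude normalization so trivially different homepages still match."""
    url = url.lower().rstrip("/")
    for prefix in ("https://", "http://", "www."):
        url = url.removeprefix(prefix)
    return url

def entity_recall(gold_entities, found_homepages):
    """Fraction of gold entities whose homepage appears in the found set."""
    if not gold_entities:
        return 0.0
    found = {normalize_url(u) for u in found_homepages}
    hits = sum(1 for e in gold_entities if normalize_url(e["homepage"]) in found)
    return hits / len(gold_entities)
```

Scoring on evidence-backed homepages rather than free-text names keeps the metric auditable: every hit or miss traces back to a URL.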


3. Tooling setup: constraining Parallel and Exa fairly

To reproduce a benchmark, you must ensure both systems are used under comparable constraints.

3.1 Parallel configuration

Parallel exposes a set of APIs built on its own AI-native web index and live crawling infrastructure. For most Exa comparisons, you’ll use:

  • Search API for pure retrieval & snippets
  • Extract API for full-page contents (optional)
  • Task API for deeper, asynchronous research
  • FindAll API for entity discovery

Example: Search API call (pseudo-code):

curl https://api.parallel.ai/search \
  -H "Authorization: Bearer $PARALLEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SOC 2 Type 2 certification requirements for SaaS providers",
    "processor": "Base",
    "max_results": 10
  }'

Parallel returns:

  • Ranked URLs
  • Token-dense compressed excerpts (for LLM consumption)
  • Basis metadata (citations, rationale, confidence) depending on endpoint

3.2 Exa configuration

Exa typically exposes:

  • A search endpoint returning URLs and short snippets
  • Parameters for result count, filters, and sometimes recency

Example Exa search call (conceptual):

curl https://api.exa.ai/search \
  -H "Authorization: Bearer $EXA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SOC 2 Type 2 certification requirements for SaaS providers",
    "num_results": 10
  }'

For fairness:

  • Match result counts (e.g., k=10)
  • Avoid additional re-ranking layers unless you apply the same logic to both
  • Disable browsing/summarization layers on other providers if Parallel is restricted to search-like behavior

3.3 Agent / judge setup

For answer-level evaluation (beyond raw retrieval), you’ll typically:

  • Use a capable LLM (OpenAI/Anthropic) as a judge model
  • Give the judge only the answer and the ground-truth (not the internal tool traces)
  • Use a rubric that targets factual correctness and coverage, not eloquence

Example judge prompt snippet:

You are evaluating a model’s answer for factual correctness.

Given the user query, the ground-truth facts, and the model’s answer, score:

  • 1 if the answer is factually correct and covers all must-have facts.
  • 0.5 if it is partially correct (some must-have facts missing or slightly wrong).
  • 0 if it is incorrect, hallucinated, or misses critical facts.
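Before paying for LLM judge calls, it can help to smoke-test the harness with a deterministic stand-in that applies the same 0 / 0.5 / 1 rubric via naive keyword coverage. This is purely a placeholder for plumbing checks; actual scoring should use the LLM judge and rubric above:

```python
def stand_in_judge(answer, must_have_facts):
    """Placeholder judge: 1 if all must-have facts appear verbatim in the
    answer, 0.5 if some do, 0 if none. Swap in a real LLM judge for
    actual evaluation; verbatim matching will under-score paraphrases."""
    covered = sum(1 for fact in must_have_facts
                  if fact.lower() in answer.lower())
    if covered == len(must_have_facts):
        return 1.0
    if covered > 0:
        return 0.5
    return 0.0
```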

4. Metrics: recall, nDCG, accuracy, latency, and cost

With datasets and tooling in place, you can compute the metrics Parallel publishes and compare them to Exa.

4.1 Recall@k for retrieval

Definition: Of the URLs known to be relevant for a query, how many appear in the top k results?

For each query q:

  • R_q = set of relevant URLs
  • A_q = set of URLs returned by the system in top k
  • recall@k(q) = |R_q ∩ A_q| / |R_q|

Overall recall:

mean_recall_at_k = sum(recall_at_k(q) for q in queries) / len(queries)

This is the basis for Parallel’s “best recall at every price point” claims (e.g., FindAll Base/Core/Pro vs OpenAI Deep Research, Anthropic Deep Research, Exa).
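The per-query and mean recall formulas above translate directly to Python. A minimal sketch; field names mirror the section 2.2 schema, and `ranked_urls` is assumed to be a plain list of URL strings in rank order:

```python
def recall_at_k(relevant_urls, ranked_urls, k):
    """Fraction of known-relevant URLs that appear in the top-k results."""
    relevant = set(relevant_urls)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_urls[:k])) / len(relevant)

def mean_recall_at_k(queries, k):
    """Average recall@k over dicts with 'relevant_urls' and 'ranked_urls'."""
    return sum(recall_at_k(q["relevant_urls"], q["ranked_urls"], k)
               for q in queries) / len(queries)
```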

4.2 nDCG@k for ranking quality

Definition: Measures how well the system ranks highly relevant documents near the top.

For each query:

  1. Assign relevance scores (0–3) to documents.
  2. Compute DCG@k:
DCG@k = Σ_{i=1..k} (2^rel_i − 1) / log2(i + 1)
  3. Compute IDCG@k (the DCG of an ideal ranking).
  4. Compute nDCG@k = DCG@k / IDCG@k.

Average across queries for system-level nDCG@k.
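The four steps above can be sketched as follows. Note one simplification: the ideal ranking here is computed from the grades of the retrieved documents only; a stricter IDCG would sort the grades of all labeled documents for the query.

```python
import math

def dcg_at_k(grades, k):
    """DCG@k with the (2^rel - 1) / log2(i + 1) gain form, ranks from 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k):
    """nDCG@k: DCG of the actual ranking over DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0
```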

4.3 Accuracy for answer-level correctness

Using your QA dataset and judge model:

  1. Run your agent against each query using:

    • Parallel (Search + Extract + possibly Task)
    • Exa (Search + your own parsing)
  2. Collect final answers.

  3. Ask the judge to assign scores (e.g., 0, 0.5, 1).

Overall accuracy:

accuracy = sum(score(q) for q in queries) / len(queries)

Parallel reports accuracy percentages like:

  • Parallel: 81–92% across different price bands
  • Exa: lower accuracy at higher or similar CPM in those same tests
  • Tavily, Perplexity, OpenAI GPT-5: intermediate values

(Exact values depend on the specific benchmark and price tier; Parallel’s published tables show, e.g., 92% accuracy at CPM 42 vs Exa at 81% accuracy at CPM 81 on one test.)

4.4 Latency: synchronous vs asynchronous bands

Measure latency as the time from request to first usable output.

For each system:

  • Record start and end timestamps around the API call (or workflow).
  • Run at least 100–200 queries to average out variance.

For Parallel, typical ranges:

  • Search API: <5 seconds (synchronous)
  • Extract API:
    • Cached pages: ~1–3 seconds
    • Live fetch: 60–90 seconds
  • Task API: ~5 seconds to 30 minutes (asynchronous, depth-dependent)
  • FindAll: ~10 minutes to 1 hour (asynchronous for deep entity discovery)

For Exa:

  • Measure search response times (usually low seconds)
  • If you chain Exa search with your own crawl/scrape/parse, include that time in “end-to-end” latency, since Parallel collapses that pipeline into a single call.

Record both:

  • P50 / median latency
  • P95 latency (for tail behavior)
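The latency bookkeeping above can be sketched with the standard library (a minimal illustration; a real harness should also persist raw timestamps and per-query IDs):

```python
import statistics
import time

def timed_call(fn, *args, **kwargs):
    """Wrap any provider call and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000

def latency_summary(latencies_ms):
    """P50/P95 from per-request latencies; needs a reasonable sample size."""
    # quantiles(n=100) yields 99 cut points; index 49 ~ P50, index 94 ~ P95
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94]}
```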

4.5 Cost: CPM (USD per 1,000 requests)

To replicate Parallel’s cost comparisons, normalize all providers to CPM.

For each system:

CPM = (cost per request) * 1000

Where “cost per request” is derived from the vendor’s pricing tier for the endpoint you’re using.
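As a sketch, with hypothetical prices (placeholders, not current vendor rates):

```python
def cpm(cost_per_request_usd, overhead_per_request_usd=0.0):
    """USD per 1,000 requests, optionally including your own pipeline
    overhead (e.g., scraping/parsing you run on top of a bare search API)."""
    return (cost_per_request_usd + overhead_per_request_usd) * 1000

# Hypothetical: a $0.005/request search API plus $0.002/request of your
# own post-processing lands at 7 CPM, not 5 -- include the overhead.
```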

From Parallel’s published tables (examples):

  • Parallel FindAll Base/Core/Pro:
    • 60, 230, 1430 CPM with recall of 30.3%, 52.5%, 61.3% respectively
  • Others (at time of test):
    • OpenAI Deep Research: 250 CPM, 21% recall
    • Anthropic Deep Research: 1000 CPM, 15.3% recall
    • Exa: 110 CPM, 19.2% recall

For search/accuracy benchmarks, you see similar patterns:

  • Parallel ~58–92% accuracy at CPM 42–156
  • Exa ~29–81% accuracy at CPM 81–233
  • Tavily, Perplexity, OpenAI GPT-5 occupy intermediate regions

That’s the Pareto story: Parallel targets higher accuracy/recall at lower or similar CPM across the spectrum.


5. Step-by-step: running your own Parallel vs Exa benchmark

This is how I’d set up a repro from scratch.

Step 1: Define the scope

Decide what you’re testing:

  • Pure retrieval quality (recall/nDCG)
  • Downstream answer accuracy (QA with judge model)
  • Entity discovery (FindAll vs generic search + your pipeline)
  • End-to-end latency and cost (including your own post-processing stack for Exa)

You can—and often should—run separate experiments for each.

Step 2: Build or adopt datasets

  • Start with ~200–500 queries per workload for a reasonably stable signal.
  • Ensure ground-truth relevant URLs/entities are well-documented and shareable.
  • Record the collection date for URLs so you can align with web freshness windows.

Step 3: Implement standardized runners

Create a runner for each provider that:

  • Takes a query JSON object
  • Calls the provider’s API(s)
  • Produces a standardized output JSON, e.g.:
{
  "id": "r_001",
  "query": "SOC 2 Type 2 certification requirements for SaaS providers",
  "ranked_urls": [
    {"url": "...", "position": 1},
    {"url": "...", "position": 2}
  ],
  "raw_response": {...},
  "latency_ms": 1234
}

Do this for:

  • Parallel Search (and optionally Extract / Task / FindAll)
  • Exa search (plus your own crawling/parsing if you want end-to-end comparisons)
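A provider-agnostic runner can be sketched as follows. The `search_fn` callable is a placeholder wrapping whichever HTTP client you use for Parallel or Exa; it is assumed to take a query string and return URL strings in ranked order (adapt to each provider's actual response shape):

```python
import time

def run_query(search_fn, query_obj):
    """Execute one query through a provider and emit the standardized record
    described above (minus raw_response, which you would also persist)."""
    start = time.perf_counter()
    urls = search_fn(query_obj["query"])
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {
        "id": query_obj["id"],
        "query": query_obj["query"],
        "ranked_urls": [{"url": u, "position": i}
                        for i, u in enumerate(urls, start=1)],
        "latency_ms": latency_ms,
    }
```

Keeping the output schema identical across providers means the metric code in Step 4 never needs to know which system produced a record.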

Step 4: Compute retrieval metrics (recall, nDCG)

Using your standardized outputs and relevance labels:

  • Compute recall@k for k values like 5, 10, 20
  • Compute nDCG@k for the same k values
  • Aggregate per system and plot Parallel vs Exa

Step 5: Compute QA accuracy with a judge model

If you care about answer-level correctness:

  1. Run an agent that uses each system’s search results as context.
  2. Save final answers.
  3. Run the judge model to score each answer.
  4. Compute mean accuracy and agreement metrics (e.g., inter-rater reliability if you also have human labels).

Step 6: Measure latency distributions

Ensure you:

  • Measure latency client-side around the full interaction you care about.
  • Record P50, P90, P95 latencies.
  • Segment by endpoint (Search vs FindAll vs your own Exa-based pipeline).

Step 7: Normalize cost to CPM

For each provider:

  • Pull their pricing at the time of the test.
  • Compute CPM for the specific endpoint and configuration (e.g., Parallel Search Base vs Exa search with N results).
  • If you layer extra steps on top of Exa (scraping, parsing, prompting), include those costs in the total CPM.

Step 8: Plot the Pareto frontier

Visualize:

  • Accuracy / recall vs CPM
  • Latency vs CPM

You should see:

  • Parallel occupying the Pareto frontier (higher accuracy/recall at lower or similar CPM) across several points
  • Exa and others like Tavily, Perplexity, OpenAI GPT-5 sitting below or to the right (worse accuracy at same cost, or same accuracy at higher cost)

6. How Parallel’s architecture affects reproducibility

When you’re trying to reproduce benchmarks, architecture matters:

  • AI-native web index + live crawling

    • Parallel controls its index and crawl schedule, which is why it can publish test windows and ensure similar freshness across runs.
    • Exa is also an index, but its output is optimized more for human SERP consumption; your downstream pipeline becomes part of the “system” being evaluated.
  • Processor architecture (Lite/Base/Core/Pro/Ultra)

    • Parallel lets you dial depth vs latency vs cost per request.
    • For reproducibility, you can fix a processor tier (e.g., Base) and know the CPM ahead of time.
  • Basis framework (citations, rationale, confidence)

    • Every atomic fact comes with citations and calibrated confidence.
    • That makes it easier to audit disagreements between your benchmark labels and the system’s outputs; you can trace back to sources.

In contrast, benchmarks that rely on opaque browsing/summarization without provenance make it harder to debug failures or know if a mismatch is due to retrieval or summarization.


7. Methodology & testing dates: making your results publishable

If you want your Parallel vs Exa evaluation to be something your team (or customers) can trust, make it self-documenting:

  • About this benchmark

    • Describe the task (e.g., “web-grounded QA with 250 queries in legal and compliance domains”).
    • State the tooling constraints (search only, no arbitrary browsing).
    • Document the judge model, scoring rubric, and how ties/ambiguous cases were handled.
  • Methodology

    • Include dataset creation details, labeling procedures, and quality control steps.
    • Clarify whether human raters or purely LLM judges were used; if humans, report inter-rater reliability.
  • Testing dates

    • Record when you ran the benchmark and when URLs were collected.
    • If you rerun the same benchmark months later, expect small shifts due to web changes; this is why Parallel combines its AI-native index with live crawling and emphasizes test windows.

Doing this makes your Parallel vs Exa comparison more than marketing—you’ll be able to defend the numbers when stakeholders ask “how did you get 92% vs 81% accuracy?” or “why is recall 61.3% at that CPM?”


8. Final takeaway: how to decide based on your own data

Reproducing Parallel’s benchmark claims against Exa isn’t about matching a single headline number; it’s about understanding where each system lands on your own Pareto curve:

  • If you care about pure retrieval quality

    • Focus on recall@k and nDCG@k from Search.
    • Expect Parallel’s indexed + live-crawl architecture to outperform Exa at a given CPM.
  • If you care about answer-level correctness

    • Use QA datasets with judge models and require citations.
    • Look not just at accuracy but at how often each system produces verifiable, evidence-backed answers.
  • If you care about entity discovery / “find all X”

    • Benchmark Parallel FindAll vs Exa + your own pipeline using recall on gold entity sets.
    • Parallel’s published numbers (e.g., FindAll Pro at ~61.3% recall vs Exa at 19.2% recall in one test) give you a reference point; your domain may vary, but the methodology will carry over.
  • If you care about economics and reliability in production

    • Normalize everything to CPM and latency bands.
    • Parallel’s “pay per query, not per token” stance and Processor tiers make cost predictable; Exa’s per-request pricing is simpler than token-metered browsing stacks, but your downstream processing can reintroduce variance.

The most robust way to choose is to run the benchmarks yourself, on your workloads, with your own constraints—and structure them so they can be rerun as your agent stack evolves.

