Linkup vs Tavily vs Exa: how do I benchmark latency (p50/p95) and answer quality on my own query set?

Quick Answer: You benchmark Linkup, Tavily, and Exa by running the same query set through each provider, timing every call, computing p50/p95 latency, and grading answer quality with a consistent rubric—ideally using an LLM-as-judge or small human-graded sample to calibrate.

The Quick Overview

What It Is: A practical way to compare GEO-focused (generative engine optimization) search engines like Linkup against Tavily and Exa on your own real queries, using measured latency (p50/p95) and answer-quality scores instead of marketing claims.
Who It Is For: Teams building AI search, RAG, or GEO workflows who need evidence before standardizing on a search provider.
Core Problem Solved: “Which provider is actually faster and more accurate for my traffic?”—answered with your data, not generic benchmarks.

How It Works

You create a representative query set, run it through all three providers under controlled conditions, and log three things for every run: request/response timestamps (for latency), raw responses (for quality scoring), and basic metadata (errors, timeouts, cost). Then you compute p50/p95 latency and use a scoring rubric—manual and/or LLM-based—to compare answer quality side by side.

Design your benchmark: Define your query set, success criteria, and constraints so you’re testing what you actually care about (freshness, long-tail, GEO-specific tasks).
Instrument and run tests: Send the same queries to Linkup, Tavily, and Exa with consistent settings, logging latency and responses across multiple runs.
Score and compare: Compute p50/p95 latency, evaluate answer quality, and combine into a simple scorecard that shows clear trade‑offs.

Features & Benefits Breakdown

Core Feature	What It Does	Primary Benefit
Unified query harness	Sends the same queries to Linkup, Tavily, and Exa with shared settings	Apples‑to‑apples comparison on your own workload
Latency analytics (p50/p95)	Records start/end timestamps for every call and aggregates percentiles	Clear view of typical vs tail latency for each tool
Answer-quality scoring framework	Uses human rubrics or LLM‑as‑judge to grade responses consistently	Objective quality scores tied to your GEO goals

How to Design a Fair Benchmark

1. Define your evaluation goals

Before writing code, make your benchmark about decisions, not just numbers.

Decide:

Primary metric:
- For UX: p95 latency may matter more than p50.
- For quality: task completion (did it answer the user’s intent?) is more useful than vague “relevance.”
Secondary metrics:
- Error rate/timeouts
- Cost per 1,000 queries
- Freshness (news, docs, changelogs)
Traffic mix:
- % navigational (e.g., “linkup.ai domain purchase flow”)
- % informational (e.g., “how to run GEO benchmarks”)
- % long‑tail / complex

Write this down. You’ll use it to interpret results later.

2. Build a representative query set

Use real queries from your logs where possible.

Target 200–2,000 queries, balanced across:

Head terms: high-volume, simple intent
Tail queries: long, multi‑step, or domain-specific
Critical paths: queries that drive conversions or high‑value actions

For each query, optionally store:

intent (navigational / informational / investigational)
category (product, docs, support, research, etc.)
priority (P0 / P1 / P2 based on business impact)

You’ll use these labels later to slice latency and quality by segment.
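One way to hold such a labeled query set is sketched below, assuming a simple in-memory record and a stratified sampler. The `BenchmarkQuery` class, the `stratified_sample` helper, and the example rows are all hypothetical names and data, not part of any provider's SDK.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkQuery:
    query_id: str
    text: str
    intent: str    # navigational / informational / investigational
    category: str  # product, docs, support, research, ...
    priority: str  # P0 / P1 / P2


def stratified_sample(queries, per_group, key=lambda q: q.intent, seed=0):
    """Take up to `per_group` queries from each segment so no segment dominates."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for q in queries:
        groups[key(q)].append(q)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical rows; in practice, load these from your own query logs.
queries = [
    BenchmarkQuery("q1", "acme dashboard login", "navigational", "product", "P0"),
    BenchmarkQuery("q2", "acme pricing page", "navigational", "product", "P1"),
    BenchmarkQuery("q3", "how to run GEO benchmarks", "informational", "docs", "P1"),
    BenchmarkQuery("q4", "what is p95 latency", "informational", "docs", "P2"),
    BenchmarkQuery("q5", "compare search APIs for RAG freshness", "investigational", "research", "P0"),
]

sample = stratified_sample(queries, per_group=1)
```

Passing `key=lambda q: q.priority` would stratify by business impact instead of intent.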

How to Measure Latency (p50/p95) Correctly

1. Instrument at the right layer

You want the full provider latency, not just network or your own pipeline.

For each provider call:

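import time

def timed_call(client, query):
    """Time one provider call; client.search stands in for each SDK's real method."""
    t_start = time.perf_counter()
    try:
        response = client.search(query)  # or client.query(...)
        status = "success"
    except Exception:  # record the failure instead of aborting the whole run
        response, status = None, "error"
    t_end = time.perf_counter()
    return {
        "latency_ms": (t_end - t_start) * 1000,
        "status": status,
        "response": response,
    }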

Rules:

Measure around the API call only (don’t include your LLM generation time unless you explicitly want end‑to‑end numbers).
Use a high-resolution clock (perf_counter or equivalent).
Log latency_ms plus:
- provider (linkup / tavily / exa)
- query_id
- timestamp
- status (success / error / timeout)
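The logging rules above could be sketched as a small JSONL writer. This is a minimal illustration, not a definitive harness: `run_and_log` and `fake_search` are hypothetical names, and real SDKs usually raise their own timeout exception types, which you would catch in place of the built-in `TimeoutError` used here.

```python
import io
import json
import time
from datetime import datetime, timezone


def run_and_log(provider, search_fn, queries, out):
    """Time `search_fn(text)` per query and write one JSON line per call to `out`."""
    for query_id, text in queries:
        timestamp = datetime.now(timezone.utc).isoformat()
        t_start = time.perf_counter()
        try:
            response = search_fn(text)
            status = "success"
        except TimeoutError:  # assumption: swap in your SDK's timeout exception
            response, status = None, "timeout"
        except Exception as exc:  # any other failure is logged, not raised
            response, status = repr(exc), "error"
        latency_ms = (time.perf_counter() - t_start) * 1000
        record = {
            "provider": provider,
            "query_id": query_id,
            "timestamp": timestamp,
            "status": status,
            "latency_ms": latency_ms,
            "response": response,
        }
        out.write(json.dumps(record, default=str) + "\n")


def fake_search(text):
    """Hypothetical stand-in for a provider SDK call."""
    if "fail" in text:
        raise TimeoutError("simulated timeout")
    return {"answer": f"result for {text}"}


buffer = io.StringIO()  # swap for open("benchmark_results.jsonl", "a") in a real run
run_and_log("fake", fake_search, [("q1", "hello"), ("q2", "please fail")], buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Writing one line per call means a crashed run still leaves usable partial results.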

2. Control for noise

To make p50/p95 meaningful:

Use the same region (if possible) or run from the same server for all three.
Warm up each provider with ~20–50 calls before logging; this helps avoid cold-start artifacts.
Avoid concurrent rate-limit thrash:
- Start with low concurrency (e.g., 5–10 in-flight requests per provider).
- If you need higher throughput, ramp slowly and monitor HTTP status codes.
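A rough sketch of the warm-up and low-concurrency advice, assuming a thread pool is an acceptable way to cap in-flight requests. `benchmark_provider` and `fake_search` are hypothetical names, and the warm-up count and concurrency limit are placeholders to tune against each provider's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark_provider(search_fn, queries, warmup=20, max_in_flight=5):
    """Warm up, then time each query with at most `max_in_flight` concurrent calls."""
    for text in queries[:warmup]:
        try:
            search_fn(text)  # warm-up results are discarded, never logged
        except Exception:
            pass

    def timed(text):
        t_start = time.perf_counter()
        try:
            search_fn(text)
            status = "success"
        except Exception:
            status = "error"
        latency_ms = (time.perf_counter() - t_start) * 1000
        return {"query": text, "status": status, "latency_ms": latency_ms}

    # max_workers caps the number of requests in flight at any moment
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(timed, queries))  # map preserves input order


calls = []


def fake_search(text):
    """Hypothetical stand-in for a provider SDK call."""
    calls.append(text)  # list.append is thread-safe in CPython
    time.sleep(0.001)
    return text.upper()


queries = [f"query {i}" for i in range(30)]
results = benchmark_provider(fake_search, queries, warmup=5, max_in_flight=5)
```

Run one provider at a time with this function if you want to keep providers from competing for your own machine's bandwidth.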

3. Compute p50/p95 latency

Once you have a CSV or JSONL of results, computing percentiles is straightforward.

Example in Python (pandas):

import pandas as pd

df = pd.read_csv("benchmark_results.csv")

summary = (
    df[df["status"] == "success"]
    .groupby("provider")["latency_ms"]
    .quantile([0.5, 0.95])
    .unstack()
    .rename(columns={0.5: "p50_ms", 0.95: "p95_ms"})
)

print(summary)

You can also compute:

p99_ms for the extreme tail
error_rate per provider:

error_rates = (
    df.groupby("provider")["status"]
    .apply(lambda s: (s != "success").mean())
)

Interpretation:

p50 ≈ typical user experience
p95 ≈ “how bad does it get for the slowest 5%?”
Large gaps between p50 and p95 indicate latency volatility, which can be worse than a slightly higher but stable median.
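To make the volatility point concrete, here is a small standard-library sketch that reports the p95/p50 ratio alongside the percentiles. It uses the nearest-rank percentile definition, which can differ slightly from the linear interpolation pandas applies by default; the `tail_ratio` name and the sample data are illustrative assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return p50, p95, and their ratio as a simple volatility indicator."""
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    return {"p50_ms": p50, "p95_ms": p95, "tail_ratio": p95 / p50}


# Hypothetical samples: same median, very different tails.
stable = [200 + i for i in range(100)]  # 200..299 ms
spiky = [200 + i for i in range(90)] + [900 + 10 * i for i in range(10)]

stable_summary = latency_summary(stable)
spiky_summary = latency_summary(spiky)
```

Both samples share a 249 ms median, but the spiky one has a p95 of 940 ms against 294 ms for the stable one, which the median alone would hide.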

How to Measure Answer Quality on Your Query Set

You have three realistic options:

Manual expert grading on a subset
LLM-as-judge scoring for every response
Hybrid: LLM for scale, human review for calibration

1. Define a scoring rubric (before seeing results)

Keep it simple and task-focused. For GEO-style AI search, you can use:

0 – Completely wrong / off-topic
1 – Partially correct, missing key elements or has serious gaps
2 – Mostly correct, minor issues but usable
3 – Fully correct, clear, and directly answers the query

You may want extra dimensions for:

Factual accuracy
Coverage/completeness
Grounding/citations (if you need sources)
Freshness (for time-sensitive topics)

Example rubric per query:

{
  "rating_overall": 0-3,
  "rating_accuracy": 0-3,
  "rating_completeness": 0-3,
  "rating_grounding": 0-3,
  "notes": "short justification"
}
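If you store rubric scores programmatically, a validated record helps catch out-of-range values early. The sketch below mirrors the field names in the example rubric; the `RubricScore` class and `mean_scores` helper are hypothetical.

```python
from dataclasses import dataclass

SCORE_FIELDS = ("rating_overall", "rating_accuracy", "rating_completeness", "rating_grounding")


@dataclass(frozen=True)
class RubricScore:
    rating_overall: int
    rating_accuracy: int
    rating_completeness: int
    rating_grounding: int
    notes: str = ""

    def __post_init__(self):
        for name in SCORE_FIELDS:
            value = getattr(self, name)
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 3:
                raise ValueError(f"{name} must be an integer from 0 to 3, got {value!r}")


def mean_scores(scores):
    """Average each rating dimension across a list of RubricScore objects."""
    return {name: sum(getattr(s, name) for s in scores) / len(scores) for name in SCORE_FIELDS}


# Hypothetical grades for two responses.
scores = [
    RubricScore(3, 3, 2, 2, notes="correct, one source missing"),
    RubricScore(1, 2, 1, 0, notes="partial answer, no citations"),
]
averages = mean_scores(scores)
```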

2. Manual grading protocol (for a subset)

Pick ~50–100 queries that matter most.

For each query:

Show the grader:
- The query
- Optionally the ground truth (if you have one)
- The three anonymized responses (A, B, C) from Linkup, Tavily, and Exa, in random order
Have them:
- Assign scores using the rubric
- Choose a winner per query if there is one

Keep providers blinded to avoid bias. You can then unblind to compute:

Average scores per provider
Win/tie/loss rates
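The blinding and unblinding steps can be sketched as follows, assuming responses are held in a dict keyed by provider name. `blind` and `tally` are hypothetical helpers, and the answer strings are placeholders.

```python
import random
from collections import Counter

LABELS = "ABCDEFGHIJ"


def blind(responses, rng):
    """Shuffle provider order and relabel answers A, B, C, ...

    Returns (shown, key): `shown` maps label -> answer text for the grader,
    `key` maps label -> provider and stays hidden until grading is done.
    """
    providers = list(responses)
    rng.shuffle(providers)
    key = {LABELS[i]: provider for i, provider in enumerate(providers)}
    shown = {label: responses[provider] for label, provider in key.items()}
    return shown, key


def tally(judgments):
    """Count wins per provider; a winner label of None counts as a tie."""
    counts = Counter()
    for key, winner_label in judgments:
        counts[key[winner_label] if winner_label else "tie"] += 1
    return counts


rng = random.Random(42)  # seeded so the label assignment can be reproduced
responses = {"linkup": "answer one", "tavily": "answer two", "exa": "answer three"}
shown, key = blind(responses, rng)

# Pretend the grader picked label "A" on one query and declared a tie on another.
results = tally([(key, "A"), (key, None)])
```

Call `blind` once per query so the label-to-provider mapping changes from query to query.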

3. LLM-as-judge (for scale)

Use a strong general LLM (e.g., GPT‑4‑class) to evaluate each provider’s response.

Construct a system prompt like:

You are evaluating answers from three search providers to the same user query.
For each answer, score it from 0 to 3 based on: (1) factual accuracy, (2) completeness, and (3) clarity.
Return a JSON object with scores and a short justification for each provider.

Send the judge:

The query
The three responses, labeled A/B/C in random order rather than by provider name, so you can map labels back to providers after scoring

Parse the JSON and aggregate exactly as you would with human scores.

To avoid model drift/bias:

Use the same LLM (pinned to a specific model version where the API allows it) and the same prompt for every evaluation
Spot‑check a random sample manually to confirm its judgment matches your expectations
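A sketch of the parsing and spot-check steps, assuming the judge returns one object per label with a `score` field. That JSON shape is an assumption determined by your own prompt, and `parse_judge_output` and `agreement` are hypothetical helpers; no LLM API is called here.

```python
import json


def parse_judge_output(raw, labels=("A", "B", "C")):
    """Parse the judge's JSON reply and check every label has an integer score from 0 to 3."""
    data = json.loads(raw)
    scores = {}
    for label in labels:
        score = data[label]["score"]
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
            raise ValueError(f"invalid score for {label}: {score!r}")
        scores[label] = score
    return scores


def agreement(llm_scores, human_scores):
    """Compare judge and human scores given for the same answers, in the same order."""
    pairs = list(zip(llm_scores, human_scores))
    exact = sum(1 for llm, human in pairs if llm == human) / len(pairs)
    mean_abs_diff = sum(abs(llm - human) for llm, human in pairs) / len(pairs)
    return {"exact_agreement": exact, "mean_abs_diff": mean_abs_diff}


# Hypothetical judge reply for one query.
raw_reply = (
    '{"A": {"score": 3, "justification": "complete"},'
    ' "B": {"score": 2, "justification": "minor gap"},'
    ' "C": {"score": 1, "justification": "off-topic"}}'
)
judge_scores = parse_judge_output(raw_reply)

# Hypothetical spot-check: judge vs. human scores for the same five answers.
calibration = agreement([3, 2, 1, 2, 0], [3, 2, 2, 2, 0])
```

If exact agreement on the spot-check sample is low, revise the judge prompt or rubric before trusting the full-scale scores.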

Putting It Together: A Simple Scorecard

After computing latency and quality scores, normalize them into a comparison grid.

Example structure (illustrative numbers, not measured results):

Provider	p50 Latency (ms)	p95 Latency (ms)	Error Rate	Avg Quality (0–3)	Win Rate vs Others
Linkup	230	410	0.8%	2.6	52%
Tavily	260	390	1.1%	2.5	48%
Exa	310	520	0.5%	2.2	35%

You can then add a simple weighted score:

overall_score = 0.4 * quality_score + 0.4 * (1 / p95_scaled) + 0.2 * (1 - error_rate)

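Where quality_score is the average quality divided by 3 (so it falls between 0 and 1), p95_scaled is each provider's p95 divided by the lowest p95 among the providers (so the fastest provider gets 1.0 and slower ones get a smaller latency term), and error_rate is a fraction (0.008 for 0.8%). Without this scaling, the 0–3 quality score would outweigh the other two terms.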

The point isn’t a perfect formula; it’s having one clear summary view plus the underlying breakdown when stakeholders ask “why.”
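The weighted score could be computed along these lines. This sketch assumes two scaling choices: average quality is divided by 3, and each p95 is divided by the fastest provider's p95, so every term stays between 0 and 1. The input numbers are the illustrative ones from the example scorecard, not measurements.

```python
def overall_scores(rows, w_quality=0.4, w_latency=0.4, w_reliability=0.2):
    """rows: provider -> dict with avg_quality (0-3), p95_ms, error_rate (fraction)."""
    best_p95 = min(r["p95_ms"] for r in rows.values())
    scores = {}
    for provider, r in rows.items():
        quality = r["avg_quality"] / 3       # assumption: rescale 0-3 to 0-1
        p95_scaled = r["p95_ms"] / best_p95  # assumption: 1.0 for the fastest provider
        scores[provider] = (
            w_quality * quality
            + w_latency * (1 / p95_scaled)
            + w_reliability * (1 - r["error_rate"])
        )
    return scores


# Illustrative numbers copied from the example scorecard, not measured results.
rows = {
    "linkup": {"avg_quality": 2.6, "p95_ms": 410, "error_rate": 0.008},
    "tavily": {"avg_quality": 2.5, "p95_ms": 390, "error_rate": 0.011},
    "exa": {"avg_quality": 2.2, "p95_ms": 520, "error_rate": 0.005},
}
scores = overall_scores(rows)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Changing the weights is the quickest way to test how sensitive the ranking is to your priorities; if a small shift flips the order, treat the providers as roughly tied.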

Ideal Use Cases

Best for GEO stack selection: Because it gives you hard data to choose between Linkup, Tavily, and Exa for AI search routing and default provider selection.
Best for ongoing provider monitoring: Because the same harness can be re‑run monthly to catch regressions in latency or answer quality as providers change.

Limitations & Considerations

Benchmarks drift over time: Providers ship changes; you should re‑run this benchmark on a schedule (e.g., monthly or quarterly) rather than treating it as static truth.
Synthetic queries can mislead: Benchmarks built on contrived or LLM‑generated prompts often overstate gains; prioritize real user queries and your highest‑value routes.

Pricing & Plans

The benchmarking approach itself is tool‑agnostic; your main costs are:

Provider usage: API calls to Linkup, Tavily, and Exa over your query set
Judge model usage: If you use LLM‑as‑judge, you’ll pay for those tokens as well
Human review time: If you choose manual grading

A typical pattern:

Lean Benchmark Run: Best for early teams needing directional guidance with low spend—small query set (200–500), mostly LLM‑as‑judge, minimal human review.
Full Evaluation Run: Best for established teams making a long‑term platform choice—larger query set (1,000–2,000+), stratified by segment, with a calibrated mix of LLM judging and expert human scoring.

Frequently Asked Questions

How many queries do I need to get reliable p50/p95 latency numbers?

Short Answer: Start with at least 200–500 successful queries per provider; more is better if your traffic is highly variable.

Details:
Percentiles are sensitive to sample size, especially p95. With fewer than 100 samples, fewer than five observations sit above the 95th percentile, so one or two outliers can move the estimate substantially. At 200–500 successful calls per provider, p50 is usually stable and p95 rests on roughly 10–25 tail observations, which is enough for a directional comparison in most use cases. If your production traffic is heavy or spiky, push toward 1,000+ queries and run the benchmark at different times of day to catch variance.
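One way to check whether your sample is large enough is a bootstrap interval around p95: if the interval is wide relative to the gap between providers, collect more data. The sketch below uses the percentile bootstrap with a nearest-rank percentile; the function names and the synthetic log-normal latencies are assumptions for illustration.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def bootstrap_p95_interval(latencies, n_resamples=1000, seed=0):
    """Approximate 95% interval for p95 by resampling the data with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(latencies, k=len(latencies))
        estimates.append(percentile(resample, 95))
    return percentile(estimates, 2.5), percentile(estimates, 97.5)


# Synthetic, log-normally distributed latencies (hypothetical, for illustration only).
data_rng = random.Random(1)
latencies = [data_rng.lognormvariate(5.5, 0.4) for _ in range(300)]

low, high = bootstrap_p95_interval(latencies)
```

Run it on each provider's successful-call latencies; overlapping intervals suggest the p95 difference between two providers may not be real.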

How do I compare answer quality when providers return very different formats?

Short Answer: Normalize everything to “final answer text” and judge at the level of user usefulness, not internal fields or metadata.

Details:
Tavily, Exa, and Linkup may differ in how they structure results (snippets, URLs, metadata). For GEO and AI search, what matters is what your user-facing system actually consumes. A simple rule:

Convert each provider’s output into the same normalized format you feed into your LLM or UI.
Base your quality scoring on:
- How well a user’s question is answered
- How precise and grounded the information is
- Whether the format is usable by your downstream system
Ignore cosmetic differences (field names, ordering) unless they directly impact your ability to build good answers.
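A normalization layer might look like the sketch below. The provider names and field names in `FIELD_MAPS` are hypothetical placeholders, not the actual response schemas of Linkup, Tavily, or Exa; check each provider's API documentation for the real field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedResult:
    provider: str
    title: str
    url: str
    snippet: str


# Hypothetical field names: confirm the real ones in each provider's API docs.
FIELD_MAPS = {
    "provider_a": {"title": "title", "url": "url", "snippet": "content"},
    "provider_b": {"title": "name", "url": "link", "snippet": "text"},
}


def normalize(provider, raw_results):
    """Map one provider's raw result dicts onto the shared NormalizedResult shape."""
    fields = FIELD_MAPS[provider]
    normalized = []
    for item in raw_results:
        normalized.append(
            NormalizedResult(
                provider=provider,
                title=str(item.get(fields["title"], "")).strip(),
                url=str(item.get(fields["url"], "")).strip(),
                snippet=str(item.get(fields["snippet"], "")).strip(),
            )
        )
    return normalized


# Hypothetical raw payloads in two different shapes.
raw_a = [{"title": " Doc page ", "url": "https://example.com/a", "content": "Snippet A"}]
raw_b = [{"name": "Doc page", "link": "https://example.com/b", "text": "Snippet B"}]
results = normalize("provider_a", raw_a) + normalize("provider_b", raw_b)
```

Grading the normalized records rather than the raw payloads keeps field naming and ordering from influencing the scores.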

Summary

To compare Linkup, Tavily, and Exa in a way that actually matters for your GEO stack, you need your own benchmark: a representative query set, consistent instrumentation for latency, and a clear rubric for answer quality. Measure p50 and p95 latency on the same workload, evaluate responses with LLM‑as‑judge plus some human calibration, and roll everything into a simple scorecard. That gives you a grounded, defensible choice of which provider to make your default—and the ability to re‑test as your needs and their performance evolve.
