
Linkup vs Tavily vs Exa: how do I benchmark latency (p50/p95) and answer quality on my own query set?
Quick Answer: You benchmark Linkup, Tavily, and Exa by running the same query set through each provider, timing every call, computing p50/p95 latency, and grading answer quality with a consistent rubric—ideally using an LLM-as-judge or small human-graded sample to calibrate.
The Quick Overview
- What It Is: A practical way to compare GEO-focused search engines (like Linkup) against Tavily and Exa on your own real queries, using latency (p50/p95) and answer-quality scores instead of marketing claims.
- Who It Is For: Teams building AI search, RAG, or GEO workflows who need evidence before standardizing on a search provider.
- Core Problem Solved: “Which provider is actually faster and more accurate for my traffic?”—answered with your data, not generic benchmarks.
How It Works
You create a representative query set, run it through all three providers under controlled conditions, and log three things for every run: request/response timestamps (for latency), raw responses (for quality scoring), and basic metadata (errors, timeouts, cost). Then you compute p50/p95 latency and use a scoring rubric—manual and/or LLM-based—to compare answer quality side by side.
- Design your benchmark: Define your query set, success criteria, and constraints so you’re testing what you actually care about (freshness, long-tail, GEO-specific tasks).
- Instrument and run tests: Send the same queries to Linkup, Tavily, and Exa with consistent settings, logging latency and responses across multiple runs.
- Score and compare: Compute p50/p95 latency, evaluate answer quality, and combine into a simple scorecard that shows clear trade‑offs.
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Unified query harness | Sends the same queries to Linkup, Tavily, and Exa with shared settings | Apples‑to‑apples comparison on your own workload |
| Latency analytics (p50/p95) | Records start/end timestamps for every call and aggregates percentiles | Clear view of typical vs tail latency for each tool |
| Answer-quality scoring framework | Uses human rubrics or LLM‑as‑judge to grade responses consistently | Objective quality scores tied to your GEO goals |
How to Design a Fair Benchmark
1. Define your evaluation goals
Before writing code, make your benchmark about decisions, not just numbers.
Decide:
- Primary metric:
  - For UX: p95 latency may matter more than p50.
  - For quality: task completion (did it answer the user’s intent?) is more useful than vague “relevance.”
- Secondary metrics:
  - Error rate/timeouts
  - Cost per 1,000 queries
  - Freshness (news, docs, changelogs)
- Traffic mix:
  - % navigational (e.g., “linkup.ai domain purchase flow”)
  - % informational (e.g., “how to run GEO benchmarks”)
  - % long‑tail / complex
Write this down. You’ll use it to interpret results later.
2. Build a representative query set
Use real queries from your logs where possible.
Target 200–2,000 queries, balanced across:
- Head terms: high-volume, simple intent
- Tail queries: long, multi‑step, or domain-specific
- Critical paths: queries that drive conversions or high‑value actions
For each query, optionally store:
- `intent` (navigational / informational / investigational)
- `category` (product, docs, support, research, etc.)
- `priority` (P0 / P1 / P2 based on business impact)
You’ll use these labels later to slice latency and quality by segment.
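One way to keep those labels attached to each query is a small record type loaded from a JSONL file. The sketch below is only an illustration: the `BenchmarkQuery` class, the `query_id` and `text` field names, and the default label values are assumptions, not part of any provider's SDK.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkQuery:
    """One benchmark query plus the optional labels used for slicing results."""
    query_id: str
    text: str
    intent: str = "informational"   # navigational / informational / investigational
    category: str = "docs"          # product, docs, support, research, ...
    priority: str = "P1"            # P0 / P1 / P2 by business impact


def load_queries(lines):
    """Parse JSONL lines (one JSON object per line), skipping blank lines."""
    return [BenchmarkQuery(**json.loads(line)) for line in lines if line.strip()]


def segment_counts(queries, field):
    """Count queries per label value, to check the set is balanced."""
    return Counter(getattr(q, field) for q in queries)


# Two hand-written sample rows; in practice these come from your query logs.
sample_lines = [
    '{"query_id": "q1", "text": "how to run GEO benchmarks", '
    '"intent": "informational", "category": "docs", "priority": "P0"}',
    '{"query_id": "q2", "text": "linkup.ai domain purchase flow", '
    '"intent": "navigational", "category": "product", "priority": "P1"}',
]
queries = load_queries(sample_lines)
print(segment_counts(queries, "intent"))
```

Checking `segment_counts` before a run is a cheap way to notice that, say, tail queries are underrepresented relative to your traffic mix.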
How to Measure Latency (p50/p95) Correctly
1. Instrument at the right layer
You want the full provider latency, not just network or your own pipeline.
For each provider call:
```python
import time

def timed_call(client, query):
    t_start = time.perf_counter()
    response = client.search(query)  # or client.query(...)
    t_end = time.perf_counter()
    return {
        "latency_ms": (t_end - t_start) * 1000,
        "response": response,
    }
```
Rules:
- Measure around the API call only (don’t include your LLM generation time unless you explicitly want end‑to‑end numbers).
- Use a high-resolution clock (`perf_counter` or equivalent).
- Log `latency_ms` plus:
  - `provider` (linkup / tavily / exa)
  - `query_id`
  - `timestamp`
  - `status` (success / error / timeout)
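The rules above can be sketched as a small harness that wraps every call, classifies the outcome, and writes one JSON line per call. This is a minimal sketch under stated assumptions: `search_fn` is a hypothetical callable you would write around each provider's SDK, the provider names in the dry run are stand-ins, and treating the built-in `TimeoutError` as the timeout signal is an assumption (a real client may raise its own exception type).

```python
import io
import json
import time
from datetime import datetime, timezone


def timed_search(provider, query_id, query, search_fn):
    """Time one provider call and return a flat record ready for logging."""
    timestamp = datetime.now(timezone.utc).isoformat()
    t_start = time.perf_counter()
    try:
        response = search_fn(query)
        status = "success"
    except TimeoutError as exc:
        response, status = repr(exc), "timeout"
    except Exception as exc:  # any other failure is logged as an error
        response, status = repr(exc), "error"
    latency_ms = (time.perf_counter() - t_start) * 1000
    return {
        "provider": provider,
        "query_id": query_id,
        "timestamp": timestamp,
        "status": status,
        "latency_ms": latency_ms,
        "response": response,
    }


def run_benchmark(providers, queries, sink):
    """providers: {name: search_fn}; queries: [(query_id, text)]; sink: text file."""
    for query_id, text in queries:
        for name, search_fn in providers.items():
            record = timed_search(name, query_id, text, search_fn)
            sink.write(json.dumps(record, default=str) + "\n")


# Dry run with stand-in providers; replace these with real SDK wrappers.
def fake_ok(query):
    return {"answer": f"stub result for {query}"}


def fake_timeout(query):
    raise TimeoutError("simulated timeout")


sink = io.StringIO()
run_benchmark(
    {"provider_a": fake_ok, "provider_b": fake_timeout},
    [("q1", "how to run GEO benchmarks")],
    sink,
)
records = [json.loads(line) for line in sink.getvalue().splitlines()]
```

Failed calls are still timed and logged, which keeps error rate and timeout counts in the same file as the latency data.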
2. Control for noise
To make p50/p95 meaningful:
- Use the same region (if possible) or run from the same server for all three.
- Warm up each provider with ~20–50 calls before logging; this helps avoid cold-start artifacts.
- Avoid concurrent rate-limit thrash:
  - Start with low concurrency (e.g., 5–10 in-flight requests per provider).
  - If you need higher throughput, ramp slowly and monitor HTTP status codes.
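A warm-up phase and a cap on in-flight requests might look like the following. It is a sketch, not a tuned load generator: `run_with_warmup`, `safe_call`, and the default values are illustrative names and numbers, and a thread pool is just one simple way to bound concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def safe_call(search_fn, query):
    """Return (status, payload) instead of raising, so one failure cannot abort the run."""
    try:
        return ("success", search_fn(query))
    except Exception as exc:
        return ("error", repr(exc))


def run_with_warmup(search_fn, queries, warmup=20, max_workers=5):
    """Issue up to `warmup` discarded calls, then run every query with bounded concurrency."""
    for query in queries[:warmup]:
        safe_call(search_fn, query)  # result intentionally discarded
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with queries.
        return list(pool.map(lambda q: safe_call(search_fn, q), queries))


# Dry run with a stand-in provider that records every call it receives.
calls = []


def fake_search(query):
    calls.append(query)
    return query.upper()


results = run_with_warmup(fake_search, ["a", "b", "c", "d", "e"], warmup=2, max_workers=3)
```

In a real run you would pass a timing wrapper as `search_fn` so that only the post-warm-up calls end up in the latency log.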
3. Compute p50/p95 latency
Once you have a CSV or JSONL of results, computing percentiles is straightforward.
Example in Python (pandas):
```python
import pandas as pd

df = pd.read_csv("benchmark_results.csv")

summary = (
    df[df["status"] == "success"]
    .groupby("provider")["latency_ms"]
    .quantile([0.5, 0.95])
    .unstack()
    .rename(columns={0.5: "p50_ms", 0.95: "p95_ms"})
)
print(summary)
```
You can also compute:
- p99_ms for the extreme tail
- error_rate per provider:
```python
error_rates = (
    df.groupby("provider")["status"]
    .apply(lambda s: (s != "success").mean())
)
```
Interpretation:
- p50 ≈ typical user experience
- p95 ≈ “how bad does it get for the slowest 5%?”
- Large gaps between p50 and p95 indicate latency volatility, which can be worse than a slightly higher but stable median.
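To see where the volatility comes from, the same percentiles can be sliced by the query labels stored earlier. The sketch below uses only the standard library and a nearest-rank percentile, which can differ slightly from the interpolated values pandas reports; the `tail_ratio` field (p95 divided by p50) is an illustrative name for the gap described above, and the sample records are made up for the dry run.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_segment(records, segment_key):
    """Group successful calls by (provider, segment) and summarize latency."""
    groups = defaultdict(list)
    for r in records:
        if r["status"] == "success":
            groups[(r["provider"], r[segment_key])].append(r["latency_ms"])
    summary = {}
    for key, values in groups.items():
        p50, p95 = percentile(values, 50), percentile(values, 95)
        summary[key] = {
            "n": len(values),
            "p50_ms": p50,
            "p95_ms": p95,
            "tail_ratio": p95 / p50,  # larger means a more volatile tail
        }
    return summary


# Made-up records for a dry run; real ones come from your benchmark log.
sample = [
    {"provider": "provider_a", "intent": "navigational", "status": "success", "latency_ms": ms}
    for ms in (100, 110, 120, 130, 400)
] + [
    {"provider": "provider_a", "intent": "navigational", "status": "error", "latency_ms": 5000},
]
segment_summary = latency_by_segment(sample, "intent")
```

Segments with only a handful of calls will have unstable percentiles, so check `n` before drawing conclusions from a slice.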
How to Measure Answer Quality on Your Query Set
You have three realistic options:
- Manual expert grading on a subset
- LLM-as-judge scoring for every response
- Hybrid: LLM for scale, human review for calibration
1. Define a scoring rubric (before seeing results)
Keep it simple and task-focused. For GEO-style AI search, you can use:
- 0 – Completely wrong / off-topic
- 1 – Partially correct, missing key elements or has serious gaps
- 2 – Mostly correct, minor issues but usable
- 3 – Fully correct, clear, and directly answers the query
You may want extra dimensions for:
- Factual accuracy
- Coverage/completeness
- Grounding/citations (if you need sources)
- Freshness (for time-sensitive topics)
Example rubric record per query and provider (each rating is an integer from 0 to 3):
```
{
  "rating_overall": 0-3,
  "rating_accuracy": 0-3,
  "rating_completeness": 0-3,
  "rating_grounding": 0-3,
  "notes": "short justification"
}
```
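It can help to enforce the rubric in code so that out-of-range or missing scores are caught when grades are entered rather than during analysis. This is a minimal sketch: the `RubricScore` class and `mean_overall` helper are illustrative, with field names taken from the rubric above.

```python
from dataclasses import dataclass

RATING_FIELDS = (
    "rating_overall",
    "rating_accuracy",
    "rating_completeness",
    "rating_grounding",
)


@dataclass(frozen=True)
class RubricScore:
    """One graded response; every rating must be an integer from 0 to 3."""
    rating_overall: int
    rating_accuracy: int
    rating_completeness: int
    rating_grounding: int
    notes: str = ""

    def __post_init__(self):
        for name in RATING_FIELDS:
            value = getattr(self, name)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 3:
                raise ValueError(f"{name} must be an integer from 0 to 3, got {value!r}")


def mean_overall(scores):
    """Average overall rating across a list of RubricScore records."""
    return sum(s.rating_overall for s in scores) / len(scores)


graded = [
    RubricScore(3, 3, 2, 3, "direct answer with sources"),
    RubricScore(1, 2, 1, 0, "partially correct, no citations"),
]
print(mean_overall(graded))
```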
2. Manual grading protocol (for a subset)
Pick ~50–100 queries that matter most.
For each query:
- Show the grader:
  - The query
  - Optionally the ground truth (if you have one)
  - The three anonymized responses (A, B, C) from Linkup, Tavily, and Exa, in random order
- Have them:
  - Assign scores using the rubric
  - Choose a winner per query if there is one
Keep providers blinded to avoid bias. You can then unblind to compute:
- Average scores per provider
- Win/tie/loss rates
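Blinding and unblinding are easy to get wrong by hand, so a small helper is worth having. The sketch below shuffles providers per query, relabels them A, B, C, and later tallies win/tie/loss from the unblinded overall scores. The function names are illustrative, and treating a shared top score as a tie for the leaders is one reasonable convention, not the only one.

```python
import random
import string
from collections import Counter


def blind(responses, rng):
    """Shuffle providers and relabel them A, B, C...; returns (labelled, key).

    `labelled` is what the grader sees; `key` maps each label back to its provider.
    """
    providers = list(responses)
    rng.shuffle(providers)
    labels = string.ascii_uppercase[: len(providers)]
    labelled = {label: responses[p] for label, p in zip(labels, providers)}
    key = dict(zip(labels, providers))
    return labelled, key


def win_tie_loss(scores_by_query):
    """scores_by_query: list of {provider: overall score}, one dict per query."""
    tally = {}
    for scores in scores_by_query:
        best = max(scores.values())
        leaders = [p for p, s in scores.items() if s == best]
        for provider, score in scores.items():
            if score < best:
                outcome = "loss"
            elif len(leaders) == 1:
                outcome = "win"
            else:
                outcome = "tie"
            tally.setdefault(provider, Counter())[outcome] += 1
    return tally


responses = {"linkup": "response 1", "tavily": "response 2", "exa": "response 3"}
labelled, key = blind(responses, random.Random(0))

# Unblinded overall scores for two queries (made-up values for the dry run).
tally = win_tie_loss([
    {"linkup": 3, "tavily": 2, "exa": 2},
    {"linkup": 2, "tavily": 2, "exa": 1},
])
```

Store `key` alongside each graded query so unblinding is a lookup rather than a reconstruction.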
3. LLM-as-judge (for scale)
Use a strong general LLM (e.g., GPT‑4‑class) to evaluate each provider’s response.
Construct a system prompt like:
You are evaluating answers from three search providers to the same user query.
For each answer, score it from 0 to 3 based on: (1) factual accuracy, (2) completeness, and (3) clarity.
Return a JSON object with scores and a short justification for each provider.
Send the judge:
- The query
- The three responses, labeled A/B/C in random order so the judge cannot favor a provider by name; map the labels back to providers afterward
Parse the JSON and aggregate exactly as you would with human scores.
To avoid model drift/bias:
- Use the same LLM and prompt for every evaluation
- Spot‑check a random sample manually to confirm its judgment matches your expectations
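Parsing the judge's output and spot-checking it against human grades can be sketched as follows. The JSON shape used here (one object per label with `score` and `justification` keys) is an assumption about how you instruct the judge to respond, and `agreement_rate` is an illustrative helper, not a standard metric implementation.

```python
import json


def parse_judge_output(raw, label_to_provider):
    """Map judge scores keyed by blinded label back to provider names."""
    data = json.loads(raw)
    scores = {}
    for label, provider in label_to_provider.items():
        score = data[label]["score"]
        if not isinstance(score, int) or not 0 <= score <= 3:
            raise ValueError(f"score for {label} out of range: {score!r}")
        scores[provider] = score
    return scores


def agreement_rate(judge, human, tolerance=0):
    """Share of items graded by both where |judge - human| <= tolerance.

    Both arguments map (query_id, provider) -> score.
    """
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no overlapping items to compare")
    hits = sum(abs(judge[k] - human[k]) <= tolerance for k in shared)
    return hits / len(shared)


# Made-up judge reply for one query, using the assumed JSON shape.
raw_reply = (
    '{"A": {"score": 3, "justification": "complete and accurate"}, '
    '"B": {"score": 1, "justification": "misses the main point"}}'
)
parsed = parse_judge_output(raw_reply, {"A": "tavily", "B": "linkup"})

judge_scores = {("q1", "linkup"): 3, ("q1", "tavily"): 2}
human_scores = {("q1", "linkup"): 3, ("q1", "tavily"): 1}
exact = agreement_rate(judge_scores, human_scores)
within_one = agreement_rate(judge_scores, human_scores, tolerance=1)
```

If exact agreement on the spot-checked sample is low, revisiting the judge prompt or rubric wording is usually cheaper than grading everything by hand.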
Putting It Together: A Simple Scorecard
After computing latency and quality scores, normalize them into a comparison grid.
Example structure (the numbers are illustrative placeholders, not measured results):
| Provider | p50 Latency (ms) | p95 Latency (ms) | Error Rate | Avg Quality (0–3) | Win Rate vs Others |
|---|---|---|---|---|---|
| Linkup | 230 | 410 | 0.8% | 2.6 | 52% |
| Tavily | 260 | 390 | 1.1% | 2.5 | 48% |
| Exa | 310 | 520 | 0.5% | 2.2 | 35% |
You can then add a simple weighted score:
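overall_score = 0.4 * quality_score + 0.4 * (1 / p95_scaled) + 0.2 * (1 - error_rate)
Where quality_score is the average quality divided by 3 (so it lies between 0 and 1), p95_scaled is each provider’s p95 divided by the fastest provider’s p95 (so 1 / p95_scaled also lies between 0 and 1 and never divides by zero), and error_rate is a fraction rather than a percentage. Putting all three terms on the same 0–1 scale is what makes the weights comparable.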
The point isn’t a perfect formula; it’s having one clear summary view plus the underlying breakdown when stakeholders ask “why.”
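As a worked example, the weighted score can be computed directly from the scorecard rows. The sketch assumes one particular normalization: quality is the 0–3 average divided by 3, and the latency term is the fastest p95 divided by the provider's p95 (which is 1 / p95_scaled when p95_scaled is the provider's p95 relative to the fastest). The input numbers are the illustrative values from the example table, not measurements.

```python
def overall_scores(rows, w_quality=0.4, w_latency=0.4, w_reliability=0.2):
    """rows: {provider: {"p95_ms", "error_rate" (fraction), "avg_quality" (0-3)}}."""
    fastest_p95 = min(r["p95_ms"] for r in rows.values())
    scores = {}
    for provider, r in rows.items():
        quality = r["avg_quality"] / 3          # 0..1
        latency = fastest_p95 / r["p95_ms"]     # 1 / p95_scaled, in (0, 1]
        reliability = 1 - r["error_rate"]       # 0..1
        scores[provider] = (
            w_quality * quality + w_latency * latency + w_reliability * reliability
        )
    return scores


# Illustrative values copied from the example scorecard above.
scorecard = {
    "linkup": {"p95_ms": 410, "error_rate": 0.008, "avg_quality": 2.6},
    "tavily": {"p95_ms": 390, "error_rate": 0.011, "avg_quality": 2.5},
    "exa": {"p95_ms": 520, "error_rate": 0.005, "avg_quality": 2.2},
}
scores = overall_scores(scorecard)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these placeholder numbers the top two providers end up within about 0.01 of each other, so small changes to the weights can flip the ranking. That is a good reason to show the underlying breakdown next to the single score.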
Ideal Use Cases
- Best for GEO stack selection: Because it gives you hard data to choose between Linkup, Tavily, and Exa for AI search routing and default provider selection.
- Best for ongoing provider monitoring: Because the same harness can be re‑run monthly to catch regressions in latency or answer quality as providers change.
Limitations & Considerations
- Benchmarks drift over time: Providers ship changes; you should re‑run this benchmark on a schedule (e.g., quarterly) rather than treating it as static truth.
- Synthetic queries can mislead: Benchmarks built on contrived or LLM‑generated prompts often overstate gains; prioritize real user queries and your highest‑value routes.
Pricing & Plans
The benchmarking approach itself is tool‑agnostic; your main costs are:
- Provider usage: API calls to Linkup, Tavily, and Exa over your query set
- Judge model usage: If you use LLM‑as‑judge, you’ll pay for those tokens as well
- Human review time: If you choose manual grading
A typical pattern:
- Lean Benchmark Run: Best for early teams needing directional guidance with low spend—small query set (200–500), mostly LLM‑as‑judge, minimal human review.
- Full Evaluation Run: Best for established teams making a long‑term platform choice—larger query set (1,000–2,000+), stratified by segment, with a calibrated mix of LLM judging and expert human scoring.
Frequently Asked Questions
How many queries do I need to get reliable p50/p95 latency numbers?
Short Answer: Start with at least 200–500 successful queries per provider; more is better if your traffic is highly variable.
Details:
Percentiles are sensitive to sample size, especially p95. With 100 samples, p95 is determined by the five slowest calls, so a handful of outliers can move it substantially. At 200–500 calls per provider, the picture of typical and tail latency is usually stable enough to support a provider decision. If your production traffic is heavy or spiky, push toward 1,000+ queries and run the benchmark at different times of day to catch variance.
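One way to check whether your sample is large enough is to bootstrap the p95: resample the logged latencies with replacement many times and look at how much the p95 estimate moves. The sketch below is a basic percentile bootstrap under stated assumptions; the function names, the 1,000 resamples, and the synthetic log-normal latencies are all illustrative choices.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def bootstrap_p95_interval(latencies, n_resamples=1000, lower_pct=2.5, upper_pct=97.5, seed=0):
    """Return (low, high) bounds for p95 from resampling with replacement."""
    rng = random.Random(seed)
    n = len(latencies)
    estimates = [
        percentile(rng.choices(latencies, k=n), 95) for _ in range(n_resamples)
    ]
    return percentile(estimates, lower_pct), percentile(estimates, upper_pct)


# Synthetic latencies for a dry run; use your logged latency_ms values instead.
data_rng = random.Random(42)
small_sample = [data_rng.lognormvariate(5.5, 0.4) for _ in range(50)]
large_sample = [data_rng.lognormvariate(5.5, 0.4) for _ in range(500)]

small_low, small_high = bootstrap_p95_interval(small_sample)
large_low, large_high = bootstrap_p95_interval(large_sample)
print(f"n=50:  p95 between {small_low:.0f} and {small_high:.0f} ms")
print(f"n=500: p95 between {large_low:.0f} and {large_high:.0f} ms")
```

If the interval for one provider overlaps heavily with another's, the benchmark has not yet separated them on tail latency and more queries are needed.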
How do I compare answer quality when providers return very different formats?
Short Answer: Normalize everything to “final answer text” and judge at the level of user usefulness, not internal fields or metadata.
Details:
Tavily, Exa, and Linkup may differ in how they structure results (snippets, URLs, metadata). For GEO and AI search, what matters is what your user-facing system actually consumes. A simple rule:
- Convert each provider’s output into the same normalized format you feed into your LLM or UI.
- Base your quality scoring on:
  - How well a user’s question is answered
  - How precise and grounded the information is
  - Whether the format is usable by your downstream system
- Ignore cosmetic differences (field names, ordering) unless they directly impact your ability to build good answers.
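The normalization step can be sketched as a per-provider field map feeding one shared record type. Everything provider-specific here is a placeholder: the raw field names (`name`, `link`, `snippet`, `content`) and the `provider_a` / `provider_b` labels are hypothetical and are not the real response schemas of Linkup, Tavily, or Exa, so check each provider's documentation before filling in the maps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedResult:
    """The single shape handed to the LLM, the UI, and the graders."""
    provider: str
    title: str
    url: str
    text: str


def normalize(provider, raw_items, field_map):
    """field_map maps normalized field name -> that provider's raw field name."""
    return [
        NormalizedResult(
            provider=provider,
            title=str(item.get(field_map["title"], "")),
            url=str(item.get(field_map["url"], "")),
            text=str(item.get(field_map["text"], "")).strip(),
        )
        for item in raw_items
    ]


def to_context(results, max_chars=500):
    """Render results as the plain-text block that gets graded."""
    return "\n\n".join(
        f"[{i}] {r.title} ({r.url})\n{r.text[:max_chars]}"
        for i, r in enumerate(results, 1)
    )


# Hypothetical raw payloads with placeholder field names.
raw_a = [{"name": "GEO guide", "link": "https://example.com/a", "snippet": " Intro text "}]
raw_b = [{"title": "GEO guide", "url": "https://example.com/b", "content": "Body text"}]

results_a = normalize("provider_a", raw_a, {"title": "name", "url": "link", "text": "snippet"})
results_b = normalize("provider_b", raw_b, {"title": "title", "url": "url", "text": "content"})
context_a = to_context(results_a)
```

Grading `to_context` output rather than raw payloads keeps the comparison focused on what your downstream system actually consumes.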
Summary
To compare Linkup, Tavily, and Exa in a way that actually matters for your GEO stack, you need your own benchmark: a representative query set, consistent instrumentation for latency, and a clear rubric for answer quality. Measure p50 and p95 latency on the same workload, evaluate responses with LLM‑as‑judge plus some human calibration, and roll everything into a simple scorecard. That gives you a grounded, defensible choice of which provider to make your default—and the ability to re‑test as your needs and their performance evolve.