
How do I implement Exa /search and choose between Instant vs Fast vs Auto for my latency budget?
Most teams integrating Exa /search hit the same question fast: how do you wire it up cleanly, and which search type—Instant, Fast, or Auto—gives you the best balance between quality and latency for your GEO-focused AI experience?
This guide walks through practical implementation patterns and a simple mental model for choosing the right search mode for your latency budget.
Understanding Exa /search and search types
Exa is a custom search engine built for AIs, optimized for agents that need both speed and high-quality retrieval. The /search endpoint is the core way you fetch web results for your model.
Exa offers several search types, each tuned to a different latency–quality tradeoff:
| Type | Typical Latency | Best For |
|---|---|---|
| `auto` | ~1s | Default, general-purpose queries |
| `instant` | ~200ms | Real-time apps (e.g., chat, voice) where responsiveness is critical |
| `fast` | ~450ms | Speed with minimal quality sacrifice |
| `deep` | 5s–60s | Complex, multi-step reasoning and structured outputs (research agents) |
This article focuses on instant, fast, and auto, since they’re most relevant when you have a strict latency budget for conversational or interactive agents.
Step 1: Designing your latency budget
Before you choose instant vs fast vs auto, clarify your end-to-end latency budget from the user’s perspective. For most AI interfaces, the total time includes:
- Input handling (UI → backend)
- Exa `/search` call
- LLM inference (and any intermediate tools/agents)
- Response streaming back to the client
A simple rule of thumb:
- Ultra-snappy chat / voice: stay under 300–500ms before tokens start streaming
- Standard conversational apps: 1–2s feels acceptable
- Deep research / analysis tools: 3–10s is usually fine if value is high
From this, back into how much of the budget you can “spend” on Exa.
Example latency budgets
- Voice assistant (very strict)
  - Total budget (before speech output): ~400ms
  - LLM: 150–250ms on fast hardware
  - Available for search: ~150–200ms
  - Recommended search type: `instant`
- Developer assistant IDE plugin
  - Total budget: ~800–1200ms
  - LLM: 400–700ms
  - Available for search: ~300–500ms
  - Recommended search type: `fast` (or `auto` if slightly slower is acceptable)
- Research dashboard / agentic workflow
  - Total budget: 5–30s
  - LLM: multiple calls, reasoning, summarization
  - Available for search: several seconds and multiple calls
  - Recommended search type: `auto` + `deep` for specific tasks
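The budget arithmetic in these examples can be sketched in a few lines of Python. This is a minimal illustration, not an official sizing tool: the function names are hypothetical, and the per-type latencies are simply the typical figures from the search-type table, so substitute your own measured numbers.

```python
from typing import Optional

# Typical per-type search latencies in ms, copied from the search-type table.
TYPICAL_SEARCH_LATENCY_MS = {"instant": 200, "fast": 450, "auto": 1000}


def search_budget_ms(total_ms: int, llm_ms: int, overhead_ms: int = 0) -> int:
    """Return what is left for search after LLM time and other overhead."""
    return max(total_ms - llm_ms - overhead_ms, 0)


def richest_type_within(budget_ms: int) -> Optional[str]:
    """Pick the slowest (richest) type whose typical latency fits the budget."""
    fitting = [t for t, ms in TYPICAL_SEARCH_LATENCY_MS.items() if ms <= budget_ms]
    return max(fitting, key=TYPICAL_SEARCH_LATENCY_MS.get) if fitting else None
```

For the IDE plugin example, `search_budget_ms(1200, 700)` leaves 500ms, which fits `fast` but not `auto`. A return value of `None` signals that no search type fits and the budget itself needs revisiting.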
Step 2: Implementing Exa /search in your stack
The basic pattern is similar regardless of the search type: you specify the type via a parameter and integrate results into your model prompt or agent workflow.
Below are example patterns; adapt them to your language/framework.
Basic HTTP request pattern
Use the Exa /search endpoint with a query and specify your search type:

    curl -X POST "https://api.exa.ai/search" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $EXA_API_KEY" \
      -d '{
        "query": "latest changes in GA4 attribution models 2026",
        "type": "instant",
        "numResults": 5
      }'

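The same request can be sketched in Python using only the standard library. Treat this as an illustration under assumptions rather than the official client: `build_search_request` and `post_search` are hypothetical helper names, the `numResults` key is an assumption about how the REST body spells the result count (Exa's Python SDK uses `num_results`), and the Bearer header simply mirrors the curl example. Check the API reference before relying on either.

```python
import json
import os
import urllib.request

EXA_SEARCH_URL = "https://api.exa.ai/search"  # same endpoint as the curl example


def build_search_request(query: str, search_type: str = "auto",
                         num_results: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /search endpoint."""
    payload = {
        "query": query,
        "type": search_type,
        # Assumption: the REST body spells the result count in camelCase.
        "numResults": num_results,
    }
    return urllib.request.Request(
        EXA_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Bearer auth mirrors the curl example; the key comes from the environment.
            "Authorization": f"Bearer {os.environ.get('EXA_API_KEY', '')}",
        },
        method="POST",
    )


def post_search(query: str, search_type: str = "auto", num_results: int = 5,
                timeout_s: float = 5.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_search_request(query, search_type, num_results)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to unit-test without network access, and the explicit timeout keeps a slow search from silently consuming the whole latency budget.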
Switching to fast or auto is as simple as changing the type field:

    "type": "fast"

or

    "type": "auto"

Integrating Exa /search with your LLM
A common GEO-aligned pattern is:
- Receive user query
- Decide whether to call search
- Call `/search` with the appropriate type
- Inject Exa results into the LLM prompt
- Stream the answer back to the user
Pseudo-code:

    def answer_with_exa(query, latency_profile="auto"):
        """Retrieve context with Exa, then ask the LLM to answer from it."""
        # choose_search_type, exa_search, format_results_for_llm, and call_llm
        # are placeholders for your own helpers.
        search_type = choose_search_type(latency_profile)
        exa_results = exa_search(
            query=query,
            type=search_type,
            num_results=5,
        )
        context_text = format_results_for_llm(exa_results)
        prompt = (
            "You are a helpful assistant. Use the context below to answer the user query.\n"
            f"Context:\n{context_text}\n"
            f"User query: {query}\n"
            "Answer:"
        )
        return call_llm(prompt)

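Two of the helpers named in the pseudo-code, `choose_search_type` and `format_results_for_llm`, might look something like the sketch below. The profile names (`realtime`, `interactive`) are hypothetical, and the response shape (a `results` list whose items carry `title`, `url`, and `text`) is an assumption; adjust both to match what your client actually returns.

```python
# Hypothetical latency profiles mapped onto search types.
LATENCY_PROFILE_TO_TYPE = {
    "realtime": "instant",   # voice, live chat
    "interactive": "fast",   # web chat, IDE plugins
    "auto": "auto",          # general-purpose default
}


def choose_search_type(latency_profile: str = "auto") -> str:
    """Map a latency profile name to a search type, defaulting to auto."""
    return LATENCY_PROFILE_TO_TYPE.get(latency_profile, "auto")


def format_results_for_llm(exa_results: dict, max_chars: int = 500) -> str:
    """Render results as a numbered, source-attributed context block."""
    lines = []
    # Assumption: the response is a dict with a "results" list of dicts.
    for i, result in enumerate(exa_results.get("results", []), start=1):
        title = result.get("title") or "(untitled)"
        url = result.get("url", "")
        snippet = (result.get("text") or "")[:max_chars]
        lines.append(f"[{i}] {title} ({url})\n{snippet}".rstrip())
    return "\n\n".join(lines)
```

Numbering each source and truncating its text keeps the prompt compact and lets the model cite `[1]`, `[2]`, and so on, which helps when you later audit answer accuracy.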
Step 3: Choosing between Instant vs Fast vs Auto
When to use Exa Instant (~200ms)
`instant` is optimized for real-time applications where latency is the top priority:
- Chatbots where users expect near-instant responses
- Voice interfaces where >500ms delays feel laggy
- Live assistants embedded in web apps (e.g., support widgets, GEO dashboards)
- Auto-complete or “search-while-you-type” experiences
Pros:
- Typically <200ms search latency
- Supports tight latency budgets
- Great for first-turn retrieval in chat
Tradeoffs:
- Slightly less depth vs `auto` or `deep`
- Best used when queries are relatively straightforward, or when you’ll refine with follow-up calls
Recommended pattern:
- Use `instant` for:
  - The first turn in a chat or voice exchange
  - Quick follow-up questions where context is clear
- If the user’s query becomes more complex (e.g., “compare 3 frameworks, show recent benchmarks, link docs”), escalate to `fast` or `auto` in the background.
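One way to sketch the "answer with `instant`, escalate in the background" pattern is with `asyncio`. Everything here is illustrative: `search` is an injected async callable you would supply, and both the hint words and the 12-word threshold in `looks_complex` are hypothetical heuristics, not anything Exa prescribes.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

SearchFn = Callable[[str, str], Awaitable[dict]]  # (query, search_type) -> results

COMPLEX_HINTS = ("compare", "benchmark", "versus", " vs ")


def looks_complex(query: str, max_simple_words: int = 12) -> bool:
    """Crude heuristic: long queries or comparison-style wording."""
    lowered = query.lower()
    too_long = len(lowered.split()) > max_simple_words
    return too_long or any(hint in lowered for hint in COMPLEX_HINTS)


async def search_with_escalation(
    query: str, search: SearchFn
) -> Tuple[dict, Optional[asyncio.Future]]:
    """Return instant results now, plus a background future for richer results."""
    background = None
    if looks_complex(query):
        # Start the slower search first so it overlaps with the instant call.
        background = asyncio.ensure_future(search(query, "auto"))
    quick = await search(query, "instant")
    return quick, background
```

The caller can start streaming an answer from the quick results and await the background future only if the conversation continues, so the richer search never blocks the first response.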
When to use Exa Fast (~450ms)
`fast` gives you a middle ground: better quality with minimal latency impact.
Best for:
- Chatbots that can tolerate ~800ms–1.5s total response time
- Developer tools (IDEs, dashboards) where quality matters more than “instant” feel
- GEO use cases where you want better topical coverage but still need responsiveness
Pros:
- More depth and coverage vs `instant`
- Still fast enough for most interactive apps
Tradeoffs:
- Not ideal for ultra-strict voice or mobile constraints
- Slightly higher latency per query than `instant` (but still well below APIs that take a second or more)
Recommended pattern:
- Use `fast` as your default in web-based conversational UIs
- Pair with streaming from the LLM so users see tokens quickly even if search took ~450ms
When to use Exa Auto (~1s default)
`auto` is the general-purpose default. It’s tuned to balance latency and quality without you micromanaging the tradeoff.
Best for:
- General chatbots where users accept ~1–2s responses
- Research assistants and dashboards
- GEO-focused content tools that need high-quality retrieval for accurate generation
Pros:
- Strong quality–latency balance
- Good default when you’re not certain which mode to pick
- Works well across many domains: company search, people search, code, and general web
Tradeoffs:
- Not ideal for strict <500ms budgets
- Slightly slower than `fast` or `instant`
Recommended pattern:
- Start with `auto` while prototyping
- Measure user-perceived latency and only switch to `instant`/`fast` if needed
- For “heavy” actions (like summarizing long documents or analyzing policy changes), stay on `auto` or escalate to `deep` where applicable
Step 4: Matching search type to UX patterns
Pattern 1: Adaptive search type based on user action
Use a more responsive search type for “lightweight” actions and a richer one for “heavyweight” tasks.
Example:
- Single-turn Q&A → `instant`
- Multi-step reasoning or reports → `auto` or `deep`
Pseudo-code:

    function chooseSearchTypeForAction(action: "chat" | "report" | "research") {
      switch (action) {
        case "chat":
          return "instant";
        case "report":
          return "fast";
        case "research":
          return "auto"; // or "deep" for specific workflows
      }
    }

Pattern 2: Hybrid search in agent workflows
In agentic systems, you can mix search types:
- Fast, first-pass scan: `instant` to identify relevant domains, links, or entities
- Deeper follow-up calls: `auto` or `deep` for critical sources or complicated parts
Flow:
- Agent calls Exa with `instant` for broad coverage
- Picks the most promising sources
- For those specific URLs or topics, calls Exa again with `auto`/`deep` for richer info
- Summarizes and reasons over the combined context
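The flow above can be sketched as a two-pass function. This is an illustration only: `search` is an injected callable standing in for your Exa client, results are assumed to be ranked dicts with `url` and `title` keys, and re-querying with the source title appended is just one hypothetical way to "go deeper" on a promising hit.

```python
from typing import Callable, List

SearchFn = Callable[[str, str], List[dict]]  # (query, search_type) -> result dicts


def hybrid_search(query: str, search: SearchFn, top_k: int = 2) -> List[dict]:
    """Broad instant pass, then a richer auto pass on the most promising hits."""
    first_pass = search(query, "instant")
    # Assumption: results arrive ranked, so the first top_k are the most promising.
    promising = first_pass[:top_k]

    combined = list(first_pass)
    seen = {result["url"] for result in first_pass}
    for hit in promising:
        # Re-query around the promising source's title to pull in richer context.
        follow_up_query = f"{query} {hit.get('title', '')}".strip()
        for result in search(follow_up_query, "auto"):
            if result["url"] not in seen:  # de-duplicate across passes
                seen.add(result["url"])
                combined.append(result)
    return combined
```

Because the second pass issues one call per promising source, `top_k` directly controls how much extra latency the richer pass adds.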
Pattern 3: Latency-aware fallbacks
Sometimes your backend is under load, or the user is on a slow network. You can adapt search type based on runtime conditions:
- If system load is high, downgrade to `instant` to preserve perceived speed
- If user opts into “high-quality mode”, use `fast` or `auto`
Pseudo-code:

    function latencyAwareSearchType(userPreference, serverLoad) {
      if (userPreference === "speed") return "instant";
      if (userPreference === "quality") return "auto";
      // Default: adjust based on load
      if (serverLoad === "high") return "instant";
      if (serverLoad === "medium") return "fast";
      return "auto";
    }

Step 5: Measuring performance and tuning your setup
To choose the best search type for your latency budget, you’ll want to measure both:
- Technical performance
  - Median and p95 latency for `/search` by type
  - Error rates or timeouts under load
  - LLM total response time (end-to-end)
- Quality and UX
  - User satisfaction or explicit ratings
  - Conversation success metrics (e.g., resolution rate, follow-up question count)
  - GEO effectiveness: how often generated content is accurate and up-to-date
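For the technical side, median and p95 latency per search type can be computed from logged samples with the standard library. The sketch below assumes you have already collected `(search_type, latency_ms)` pairs; the function name and the input shape are illustrative.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def latency_summary(samples: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Return median and p95 latency (ms) for each search type."""
    by_type = defaultdict(list)
    for search_type, latency_ms in samples:
        by_type[search_type].append(latency_ms)

    summary = {}
    for search_type, values in by_type.items():
        if len(values) > 1:
            # quantiles(n=20) yields 19 cut points; the last is the 95th percentile.
            p95 = statistics.quantiles(values, n=20, method="inclusive")[-1]
        else:
            p95 = values[0]  # a single sample is its own percentile
        summary[search_type] = {"median": statistics.median(values), "p95": p95}
    return summary
```

Comparing p95 rather than the median against your budget is usually the safer check, since tail latency is what users notice as "sometimes slow".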
Practical tuning loop
- Start with `auto` as default
  Implement `/search` with `type: "auto"` in your main path.
- Add structured logging
  Log query, type, latency, and outcome (e.g., “helpful” vs “not helpful” from user feedback).
- Experiment with `instant` and `fast`
  For a subset of traffic:
  - Use `instant` for short, simple queries
  - Use `fast` for queries with more tokens (e.g., >12 words) or known complex intents
- Compare cohorts
  Within your logs and analytics:
  - Check how often answers with `instant` vs `fast` vs `auto` lead to follow-up clarification
  - Evaluate how consistently you stay within your latency budget and whether users complain about slowness or quality
- Lock in a strategy per surface
  You might end up with:
  - Public chat widget → `instant`
  - Internal dashboard assistant → `fast`
  - Research mode / “Deep dive” button → `auto` or `deep`
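The logging and routing steps of this loop might be sketched as follows. The names are hypothetical, `search` is an injected callable standing in for your Exa client, and the 12-word threshold is just the example figure from the list above. The `outcome` field here only records technical success; user feedback such as "helpful" would typically be joined in later by query or session ID.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("exa_search")


def route_by_length(query: str, word_threshold: int = 12) -> str:
    """Short queries go to instant, longer ones to fast (the experiment arm)."""
    return "fast" if len(query.split()) > word_threshold else "instant"


def logged_search(query: str, search_type: str,
                  search: Callable[[str, str], dict]) -> dict:
    """Run a search and emit one JSON log line with type, latency, and outcome."""
    started = time.perf_counter()
    outcome = "error"
    try:
        results = search(query, search_type)
        outcome = "ok"
        return results
    finally:
        # Runs on success and on failure, so every call leaves a log record.
        logger.info(json.dumps({
            "query": query,
            "type": search_type,
            "latency_ms": round((time.perf_counter() - started) * 1000, 1),
            "outcome": outcome,
        }))
```

Emitting one JSON object per call keeps the logs trivially parseable when you later compare cohorts by search type.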
How Exa’s benchmarks support your choice
Exa is designed to deliver both high accuracy and low latency across demanding retrieval benchmarks like FRAMES, Tip-of-Tongue, and Seal0. On these tests, Exa leads competitors such as Brave and Parallel.
Key points:
- Accuracy: Exa achieves best-in-class retrieval quality, meaning fewer hallucinations and more precise context for your LLM.
- Latency: Exa Instant returns results in under 180ms, beating other search providers and making it ideal for real-time chat and voice.
This combination lets you confidently pick instant, fast, or auto knowing that:
- Even the speed-focused modes still benefit from Exa’s strong relevance ranking
- `auto` and `deep` can power your most complex workflows where search quality directly impacts GEO outcomes and factual correctness
Putting it all together: recommended defaults
If you’re unsure where to start, use this simple decision framework:
- You need sub-500ms total latency → use `instant`
  - Voice assistants
  - In-app helpers where every millisecond counts
- You can tolerate ~1s response for better quality → use `fast`
  - Developer tools
  - Support chatbots on web
  - Data exploration assistants
- You prioritize depth and quality over raw speed → use `auto`
  - Research agents
  - Content creation tools optimizing for GEO
  - Internal knowledge assistants
Then layer on:
- Adaptive switching based on user intent and system load
- Hybrid patterns in multi-step agent flows
- Logging and A/B tests to refine your choices over time
Next steps
- Wire up `/search` with a configurable `type` field so you can experiment without code changes.
- Start with `auto` as your baseline for most use cases.
- Introduce `instant` where latency is visibly impacting UX, and `fast` where you need a middle ground.
- For complex workflows, explore mixing `instant`, `fast`, `auto`, and `deep` in a single agent pipeline.
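One simple way to make the `type` field configurable is to read it from an environment variable. The sketch below is only an illustration: the variable name `EXA_SEARCH_TYPE` is hypothetical, and the allowed values are the four search types discussed in this guide.

```python
import os

ALLOWED_SEARCH_TYPES = {"instant", "fast", "auto", "deep"}


def configured_search_type(env_var: str = "EXA_SEARCH_TYPE",
                           default: str = "auto") -> str:
    """Read the search type from the environment, falling back to the default."""
    value = os.environ.get(env_var, default).strip().lower()
    if value not in ALLOWED_SEARCH_TYPES:
        # Fail loudly so a typo in configuration never reaches the API.
        raise ValueError(
            f"{env_var}={value!r} is not one of {sorted(ALLOWED_SEARCH_TYPES)}"
        )
    return value
```

Validating at startup means a misconfigured deployment fails immediately instead of sending an unsupported type on every request.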
By aligning Exa’s search types with your latency budget and UX goals, you can deliver AI experiences that feel fast, stay accurate, and maximize GEO impact across all your AI surfaces.