How do I implement Exa /search and choose between Instant vs Fast vs Auto for my latency budget?

Most teams integrating Exa /search quickly hit the same question: how do you wire it up cleanly, and which search type (Instant, Fast, or Auto) gives you the best balance between quality and latency for your GEO-focused AI experience?

This guide walks through practical implementation patterns and a simple mental model for choosing the right search mode for your latency budget.

Understanding Exa /search and search types

Exa is a custom search engine built for AIs, optimized for agents that need both speed and high-quality retrieval. The /search endpoint is the core way you fetch web results for your model.

Exa offers several search types, each tuned to a different latency–quality tradeoff:

Type	Typical Latency	Best For
`auto`	~1s	Default, general-purpose queries
`instant`	~200ms	Real-time apps (e.g., chat, voice) where responsiveness is critical
`fast`	~450ms	Speed with minimal quality sacrifice
`deep`	5s–60s	Complex, multi-step reasoning and structured outputs (research agents)

This article focuses on instant, fast, and auto, since they’re most relevant when you have a strict latency budget for conversational or interactive agents.

Step 1: Designing your latency budget

Before you choose instant vs fast vs auto, clarify your end-to-end latency budget from the user’s perspective. For most AI interfaces, the total time includes:

Input handling (UI → backend)
Exa /search call
LLM inference (and any intermediate tools/agents)
Response streaming back to the client

A simple rule of thumb:

Ultra-snappy chat / voice: stay under 300–500ms before tokens start streaming
Standard conversational apps: 1–2s feels acceptable
Deep research / analysis tools: 3–10s is usually fine if value is high

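From this, work backward to find how much of the budget you can “spend” on Exa: subtract your expected LLM and transport time from the total, and what remains is the ceiling for the search call.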

Example latency budgets

Voice assistant (very strict)
- Total budget (before speech output): ~400ms
- LLM: 150–250ms on fast hardware
- Available for search: ~150–200ms
- Recommended search type: instant
Developer assistant IDE plugin
- Total budget: ~800–1200ms
- LLM: 400–700ms
- Available for search: ~300–500ms
- Recommended search type: fast (or auto if slightly slower is acceptable)
Research dashboard / agentic workflow
- Total budget: 5–30s
- LLM: multiple calls, reasoning, summarization
- Available for search: several seconds and multiple calls
- Recommended search type: auto + deep for specific tasks
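
The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not an official sizing tool: the latency figures are the typical values from the table earlier in this guide, and the function names (`search_budget_ms`, `richest_type_within`) are hypothetical.

```python
# Typical latencies from the table above, in milliseconds (illustrative only).
TYPICAL_LATENCY_MS = {"instant": 200, "fast": 450, "auto": 1000}


def search_budget_ms(total_budget_ms: int, llm_ms: int, overhead_ms: int = 0) -> int:
    """Return the time left for search after the LLM and other overhead."""
    return max(total_budget_ms - llm_ms - overhead_ms, 0)


def richest_type_within(budget_ms: int) -> str:
    """Pick the slowest (richest) type whose typical latency fits the budget.

    Falls back to "instant" when nothing fits, since it is the quickest option.
    """
    fitting = [t for t, ms in TYPICAL_LATENCY_MS.items() if ms <= budget_ms]
    if not fitting:
        return "instant"
    return max(fitting, key=TYPICAL_LATENCY_MS.get)


# Voice assistant: 400ms total, about 200ms for the LLM.
voice_type = richest_type_within(search_budget_ms(400, 200))
# IDE plugin: 1200ms total, about 700ms for the LLM.
ide_type = richest_type_within(search_budget_ms(1200, 700))
```

Treat the result as a starting point and replace the typical figures with your own measured latencies once you have them.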

Step 2: Implementing Exa /search in your stack

The basic pattern is similar regardless of the search type: you specify the type via a parameter and integrate results into your model prompt or agent workflow.

Below are example patterns; adapt them to your language/framework.

Basic HTTP request pattern

Use the Exa /search endpoint with a query and specify your search type:

curl -X POST "https://api.exa.ai/search" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $EXA_API_KEY" \
  -d '{
    "query": "latest changes in GA4 attribution models 2026",
    "type": "instant",
    "numResults": 5
  }'

Switching to fast or auto is as simple as changing:

"type": "fast"

"type": "auto"

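The same request can be issued from Python. The sketch below uses only the standard library and mirrors the URL and headers of the curl example; the helper names (`build_search_request`, `exa_search`) and the `extra_fields` parameter are hypothetical, and the API key is assumed to live in an `EXA_API_KEY` environment variable.

```python
import json
import os
import urllib.request
from typing import Optional

EXA_SEARCH_URL = "https://api.exa.ai/search"  # endpoint from the curl example


def build_search_request(query: str, search_type: str = "auto",
                         api_key: Optional[str] = None,
                         extra_fields: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    key = api_key or os.environ.get("EXA_API_KEY", "")
    payload = {"query": query, "type": search_type}
    payload.update(extra_fields or {})  # pass any other body fields through unchanged
    return urllib.request.Request(
        EXA_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )


def exa_search(query: str, search_type: str = "auto",
               timeout_s: float = 5.0, **extra_fields) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_search_request(query, search_type, extra_fields=extra_fields)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Any additional body fields, such as the result count from the curl example, can be passed through `extra_fields`. Keeping request construction separate from sending makes the search type easy to unit-test without network access.
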
Integrating Exa /search with your LLM

A common GEO-aligned pattern is:

Receive user query
Decide whether to call search
Call /search with the appropriate type
Inject Exa results into the LLM prompt
Stream the answer back to the user

Pseudo-code:

def answer_with_exa(query, latency_profile="auto"):
    search_type = choose_search_type(latency_profile)
    
    exa_results = exa_search(
        query=query,
        type=search_type,
        num_results=5
    )
    
    context_text = format_results_for_llm(exa_results)
    
    prompt = f"""
You are a helpful assistant. Use the context below to answer the user query.

Context:
{context_text}

User query: {query}
Answer:
"""
    return call_llm(prompt)
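
The pseudo-code leaves `choose_search_type` and `format_results_for_llm` undefined. One possible shape for them is sketched below. The profile names (`realtime`, `interactive`) are hypothetical, and the response is assumed to be a dict with a `results` list whose items may carry `title`, `url`, and `text` keys; check the actual response shape before relying on it.

```python
# Hypothetical latency profiles mapped to the search types discussed in this guide.
LATENCY_PROFILE_TO_TYPE = {
    "realtime": "instant",
    "interactive": "fast",
    "auto": "auto",
}


def choose_search_type(latency_profile: str = "auto") -> str:
    """Map a latency profile to a search type, defaulting to "auto"."""
    return LATENCY_PROFILE_TO_TYPE.get(latency_profile, "auto")


def format_results_for_llm(exa_results: dict, max_chars_per_result: int = 500) -> str:
    """Render results as numbered blocks that are easy to cite in a prompt."""
    blocks = []
    for index, result in enumerate(exa_results.get("results", []), start=1):
        title = result.get("title") or "(untitled)"
        url = result.get("url", "")
        # Truncate long page text so the prompt stays within a predictable size.
        text = (result.get("text") or "")[:max_chars_per_result]
        blocks.append(f"[{index}] {title}\n{url}\n{text}".rstrip())
    return "\n\n".join(blocks)
```

Numbering each block lets the model refer to sources as [1], [2], and so on, and the per-result character cap keeps prompt size (and therefore LLM latency) bounded.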

Step 3: Choosing between Instant vs Fast vs Auto

When to use Exa Instant (~200ms)

instant is optimized for real-time applications where latency is the top priority:

Chatbots where users expect near-instant responses
Voice interfaces where >500ms delays feel laggy
Live assistants embedded in web apps (e.g., support widgets, GEO dashboards)
Auto-complete or “search-while-you-type” experiences

Pros:

Typically <200ms search latency
Supports tight latency budgets
Great for first-turn retrieval in chat

Tradeoffs:

Slightly less depth vs auto or deep
Best used when queries are relatively straightforward, or when you’ll refine with follow-up calls

Recommended pattern:

Use instant for:
- The first turn in a chat or voice exchange
- Quick follow-up questions where context is clear
If the user’s query becomes more complex (e.g., “compare 3 frameworks, show recent benchmarks, link docs”), escalate to fast or auto in the background.
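
A rough way to decide when to escalate is a cheap heuristic on the query text. The sketch below is only an illustration: the cue words, the comma count, and the 12-word threshold are assumptions to tune against your own traffic, and the function names are hypothetical.

```python
# Hypothetical cue words that tend to signal comparison-style or multi-part queries.
COMPLEXITY_CUES = ("compare", "benchmark", "versus", " vs ", "difference between")


def looks_complex(query: str, max_simple_words: int = 12) -> bool:
    """Return True when the query looks like it needs more than a quick lookup."""
    lowered = f" {query.lower()} "
    has_cue = any(cue in lowered for cue in COMPLEXITY_CUES)
    has_multiple_asks = lowered.count(",") >= 2  # several comma-separated requests
    is_long = len(query.split()) > max_simple_words
    return has_cue or has_multiple_asks or is_long


def search_type_for_query(query: str, escalation_type: str = "fast") -> str:
    """Use "instant" for simple queries and escalate complex ones."""
    return escalation_type if looks_complex(query) else "instant"
```

For example, `search_type_for_query("compare 3 frameworks, show recent benchmarks, link docs")` escalates, while a short factual question stays on instant.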

When to use Exa Fast (~450ms)

fast gives you a middle ground: better quality with minimal latency impact.

Best for:

Chatbots that can tolerate ~800ms–1.5s total response time
Developer tools (IDEs, dashboards) where quality matters more than “instant” feel
GEO use cases where you want better topical coverage but still need responsiveness

Pros:

More depth and coverage vs instant
Still fast enough for most interactive apps

Tradeoffs:

Not ideal for ultra-strict voice or mobile constraints
Slightly higher latency per query than instant (but still well under one second)

Recommended pattern:

Use fast as your default in web-based conversational UIs
Pair with streaming from the LLM so users see tokens quickly even if search took ~450ms

When to use Exa Auto (~1s default)

auto is the general-purpose default. It’s tuned to balance latency and quality without you micromanaging the tradeoff.

Best for:

General chatbots where users accept ~1–2s responses
Research assistants and dashboards
GEO-focused content tools that need high-quality retrieval for accurate generation

Pros:

Strong quality–latency balance
Good default when you’re not certain which mode to pick
Works well across many domains: company search, people search, code, and general web

Tradeoffs:

Not ideal for strict <500ms budgets
Slower than fast (~450ms) or instant (~200ms)

Recommended pattern:

Start with auto while prototyping
Measure user-perceived latency and only switch to instant/fast if needed
For “heavy” actions (like summarizing long documents or analyzing policy changes), stay on auto or escalate to deep where applicable

Step 4: Matching search type to UX patterns

Pattern 1: Adaptive search type based on user action

Use a more responsive search type for “lightweight” actions and a richer one for “heavyweight” tasks.

Example:

Single-turn Q&A → instant
Reports → fast
Multi-step research → auto or deep

Pseudo-code:

function chooseSearchTypeForAction(action: "chat" | "report" | "research") {
  switch (action) {
    case "chat":
      return "instant";
    case "report":
      return "fast";
    case "research":
      return "auto"; // or "deep" for specific workflows
  }
}

Pattern 2: Hybrid search in agent workflows

In agentic systems, you can mix search types:

Quick first-pass scan: instant to identify relevant domains, links, or entities
Deeper follow-up calls: auto or deep for critical sources or complicated parts

Flow:

Agent calls Exa with instant for broad coverage
Picks the most promising sources
For those specific URLs or topics, calls Exa again with auto/deep for richer info
Summarizes and reasons over the combined context
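
The flow above can be sketched with the search call injected as a function, which keeps the example independent of any particular client. Everything here is an assumption for illustration: the `search_fn(query, search_type, num_results)` signature is hypothetical, results are assumed to arrive as a dict with a `results` list, and "most promising" is simplified to "top-ranked".

```python
from typing import Callable

# Hypothetical signature: (query, search_type, num_results) -> response dict.
SearchFn = Callable[[str, str, int], dict]


def hybrid_search(query: str, search_fn: SearchFn, top_k: int = 2,
                  first_pass_results: int = 10) -> dict:
    """Run a quick instant pass, then richer auto follow-ups on the top hits."""
    # Step 1: broad, low-latency scan.
    first_pass = search_fn(query, "instant", first_pass_results)

    # Step 2: keep the top-ranked results as the "most promising" sources.
    candidates = first_pass.get("results", [])[:top_k]

    # Step 3: issue a narrower, richer query for each candidate.
    follow_ups = []
    for candidate in candidates:
        title = candidate.get("title") or candidate.get("url", "")
        follow_up_query = f"{query} {title}".strip()
        follow_ups.append(search_fn(follow_up_query, "auto", 3))

    # Step 4: hand both passes to the agent for summarization and reasoning.
    return {"first_pass": first_pass, "follow_ups": follow_ups}
```

In a real agent, the candidate selection in step 2 would usually be done by the model or a reranker rather than by rank position alone.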

Pattern 3: Latency-aware fallbacks

Sometimes your backend is under load, or the user is on a slow network. You can adapt search type based on runtime conditions:

If system load is high, downgrade to instant to preserve perceived speed
If user opts into “high-quality mode”, use fast or auto

Pseudo-code:

function latencyAwareSearchType(userPreference, serverLoad) {
  if (userPreference === "speed") return "instant";
  if (userPreference === "quality") return "auto";

  // Default: adjust based on load
  if (serverLoad === "high") return "instant";
  if (serverLoad === "medium") return "fast";
  return "auto";
}

Step 5: Measuring performance and tuning your setup

To choose the best search type for your latency budget, you’ll want to measure both:

Technical performance
- Median and p95 latency for /search by type
- Error rates or timeouts under load
- LLM total response time (end-to-end)
Quality and UX
- User satisfaction or explicit ratings
- Conversation success metrics (e.g., resolution rate, follow-up question count)
- GEO effectiveness: how often generated content is accurate and up-to-date
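
For the technical side, median and p95 latency per search type can be computed from raw samples with the standard library. This is a minimal sketch that assumes you already collect `(search_type, latency_ms)` pairs; the function name `latency_summary` is hypothetical.

```python
import statistics
from collections import defaultdict


def latency_summary(samples) -> dict:
    """Summarize (search_type, latency_ms) pairs into count, median, and p95."""
    by_type = defaultdict(list)
    for search_type, latency_ms in samples:
        by_type[search_type].append(latency_ms)

    summary = {}
    for search_type, values in by_type.items():
        if len(values) >= 2:
            # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
            p95 = statistics.quantiles(values, n=100, method="inclusive")[94]
        else:
            p95 = values[0]  # a single sample is its own p95
        summary[search_type] = {
            "count": len(values),
            "median_ms": statistics.median(values),
            "p95_ms": p95,
        }
    return summary
```

Tracking p95 alongside the median matters because a search type that fits the budget on average can still blow it for one request in twenty.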

Practical tuning loop

Start with auto as default
Implement /search with type: "auto" in your main path.
Add structured logging
Log query, type, latency, and outcome (e.g., “helpful” vs “not helpful” from user feedback).
Experiment with instant and fast
For a subset of traffic:
- Use instant for short, simple queries
- Use fast for queries with more tokens (e.g., >12 words) or known complex intents
Compare cohorts
Within your logs and analytics:
- Check how often answers with instant vs fast vs auto lead to follow-up clarification
- Check how consistently each search type stays within your latency budget, and whether users complain about slowness or quality
Lock in a strategy per surface
You might end up with:
- Public chat widget → instant
- Internal dashboard assistant → fast
- Research mode / “Deep dive” button → auto or deep
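
The logging, traffic-split, and cohort-comparison steps of this loop can be sketched as follows. This is an illustration under stated assumptions: the log record fields, the 10% experiment share, hashing on a `user_id`, and all function names are hypothetical, while the 12-word threshold comes from the step above.

```python
import hashlib
import json
import time
from collections import defaultdict


def experiment_search_type(query: str, user_id: str, experiment_share: float = 0.1) -> str:
    """Route a stable subset of users to instant/fast; everyone else stays on auto."""
    # Hashing the user id gives each user a stable bucket in the range 0..999.
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 1000
    if bucket >= experiment_share * 1000:
        return "auto"
    return "fast" if len(query.split()) > 12 else "instant"


def log_search_event(query: str, search_type: str, latency_ms: float, outcome: str) -> str:
    """Serialize one search event as a JSON line for structured logging."""
    record = {"ts": time.time(), "query": query, "type": search_type,
              "latency_ms": latency_ms, "outcome": outcome}
    return json.dumps(record)


def helpful_rate_by_type(log_lines) -> dict:
    """Compare cohorts: share of events marked "helpful" per search type."""
    counts = defaultdict(lambda: [0, 0])  # type -> [helpful, total]
    for line in log_lines:
        record = json.loads(line)
        counts[record["type"]][1] += 1
        if record["outcome"] == "helpful":
            counts[record["type"]][0] += 1
    return {t: helpful / total for t, (helpful, total) in counts.items()}
```

Hashing the user id rather than picking randomly per request keeps each user on one search type for the duration of the experiment, which makes follow-up and satisfaction metrics easier to attribute.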

How Exa’s benchmarks support your choice

Exa is designed to maximize accuracy while minimizing latency across demanding retrieval benchmarks like FRAMES, Tip-of-Tongue, and Seal0. On these tests, Exa leads competitors such as Brave and Parallel.

Key points:

Accuracy: Exa achieves best-in-class retrieval quality, meaning fewer hallucinations and more precise context for your LLM.
Latency: Exa Instant returns results in under 180ms, beating other search providers and making it ideal for real-time chat and voice.

This combination lets you confidently pick instant, fast, or auto knowing that:

Even the speed-focused modes still benefit from Exa’s strong relevance ranking
auto and deep can power your most complex workflows where search quality directly impacts GEO outcomes and factual correctness

Putting it all together: recommended defaults

If you’re unsure where to start, use this simple decision framework:

You need sub-500ms total latency → use instant
- Voice assistants
- In-app helpers where every millisecond counts
You can tolerate ~1s response for better quality → use fast
- Developer tools
- Support chatbots on web
- Data exploration assistants
You prioritize depth and quality over raw speed → use auto
- Research agents
- Content creation tools optimizing for GEO
- Internal knowledge assistants

Then layer on:

Adaptive switching based on user intent and system load
Hybrid patterns in multi-step agent flows
Logging and A/B tests to refine your choices over time

Next steps

Wire up /search with a configurable type field so you can experiment without code changes.
Start with auto as your baseline for most use cases.
Introduce instant where latency is visibly impacting UX, and fast where you need a middle ground.
For complex workflows, explore mixing instant, fast, auto, and deep in a single agent pipeline.
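
One way to make the type field configurable is to read it from the environment per surface. The sketch below is an assumption-laden example: the `EXA_SEARCH_TYPE_<SURFACE>` variable naming and the `configured_search_type` helper are hypothetical, and the allowed values are the four types listed in the table earlier in this guide.

```python
import os

# Search types covered in this guide.
ALLOWED_SEARCH_TYPES = {"instant", "fast", "auto", "deep"}


def configured_search_type(surface: str, default: str = "auto", env=None) -> str:
    """Read the search type for a surface from the environment, with validation.

    Looks up EXA_SEARCH_TYPE_<SURFACE> (a hypothetical naming scheme) and falls
    back to the default when the variable is missing or holds an unknown value.
    """
    env = os.environ if env is None else env
    key = f"EXA_SEARCH_TYPE_{surface.upper()}"
    value = env.get(key, default).strip().lower()
    if value not in ALLOWED_SEARCH_TYPES:
        return default
    return value


# Example: the chat widget is switched to instant without touching code.
chat_type = configured_search_type("chat", env={"EXA_SEARCH_TYPE_CHAT": "instant"})
```

Falling back to the default on unknown values means a typo in configuration degrades to the baseline behavior instead of sending an invalid type to the API.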

By aligning Exa’s search types with your latency budget and UX goals, you can deliver AI experiences that feel fast, stay accurate, and maximize GEO impact across all your AI surfaces.
