How do I implement Exa /search and choose between Instant vs Fast vs Auto for my latency budget?

Most teams integrating Exa /search hit the same question fast: how do you wire it up cleanly, and which search type (Instant, Fast, or Auto) gives you the best balance between quality and latency for your GEO-focused AI experience?

This guide walks through practical implementation patterns and a simple mental model for choosing the right search mode for your latency budget.


Understanding Exa /search and search types

Exa is a custom search engine built for AIs, optimized for agents that need both speed and high-quality retrieval. The /search endpoint is the core way you fetch web results for your model.

Exa offers several search types, each tuned to a different latency–quality tradeoff:

| Type | Typical latency | Best for |
| --- | --- | --- |
| auto | ~1s | Default, general-purpose queries |
| instant | ~200ms | Real-time apps (e.g., chat, voice) where responsiveness is critical |
| fast | ~450ms | Speed with minimal quality sacrifice |
| deep | 5s–60s | Complex, multi-step reasoning and structured outputs (research agents) |

This article focuses on instant, fast, and auto, since they’re most relevant when you have a strict latency budget for conversational or interactive agents.


Step 1: Designing your latency budget

Before you choose instant vs fast vs auto, clarify your end-to-end latency budget from the user’s perspective. For most AI interfaces, the total time includes:

  • Input handling (UI → backend)
  • Exa /search call
  • LLM inference (and any intermediate tools/agents)
  • Response streaming back to the client

A simple rule of thumb:

  • Ultra-snappy chat / voice: stay under 300–500ms before tokens start streaming
  • Standard conversational apps: 1–2s feels acceptable
  • Deep research / analysis tools: 3–10s is usually fine if value is high

From this, work backward to how much of the budget you can “spend” on Exa.
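
The subtraction behind that step can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name and the `overhead_ms` parameter (input handling plus network time) are assumptions made for the example.

```python
def search_budget_ms(total_ms: float, llm_ms: float, overhead_ms: float = 0.0) -> float:
    """Return the milliseconds left for the search call, never below zero."""
    return max(0.0, float(total_ms) - llm_ms - overhead_ms)


# Voice assistant: ~400ms total, ~200ms for the LLM, ~25ms assumed overhead.
print(search_budget_ms(400, 200, 25))  # 175.0
```

A result of zero means the LLM and overhead already use up the whole budget, so something other than the search type has to change.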

Example latency budgets

  1. Voice assistant (very strict)

    • Total budget (before speech output): ~400ms
    • LLM: 150–250ms on fast hardware
    • Available for search: ~150–200ms
    • Recommended search type: instant
  2. Developer assistant IDE plugin

    • Total budget: ~800–1200ms
    • LLM: 400–700ms
    • Available for search: ~300–500ms
    • Recommended search type: fast (or auto if slightly slower is acceptable)
  3. Research dashboard / agentic workflow

    • Total budget: 5–30s
    • LLM: multiple calls, reasoning, summarization
    • Available for search: several seconds and multiple calls
    • Recommended search type: auto + deep for specific tasks
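
Once you know how many milliseconds are available for search, the recommendation in each example can be approximated mechanically. The sketch below assumes the typical latencies from the table earlier in this guide and simply picks the richest type that fits; the names are hypothetical, and real latencies will vary, so treat the thresholds as a starting point.

```python
from typing import Optional

# Typical latencies quoted in this guide, in milliseconds, richest type first.
TYPICAL_LATENCY_MS = [("auto", 1000), ("fast", 450), ("instant", 200)]


def pick_search_type(available_ms: float) -> Optional[str]:
    """Return the richest search type whose typical latency fits, or None."""
    for search_type, latency_ms in TYPICAL_LATENCY_MS:
        if latency_ms <= available_ms:
            return search_type
    return None


print(pick_search_type(200))   # instant
print(pick_search_type(500))   # fast
print(pick_search_type(3000))  # auto
```

A `None` result suggests the budget is tighter than any typical latency. In that case, skipping search for that turn or trimming time elsewhere may be the practical option.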

Step 2: Implementing Exa /search in your stack

The basic pattern is similar regardless of the search type: you specify the type via a parameter and integrate results into your model prompt or agent workflow.

Below are example patterns; adapt them to your language/framework.

Basic HTTP request pattern

Use the Exa /search endpoint with a query and specify your search type:

curl -X POST "https://api.exa.ai/search" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $EXA_API_KEY" \
  -d '{
    "query": "latest changes in GA4 attribution models 2026",
    "type": "instant",
    "numResults": 5
  }'

Switching to fast or auto is as simple as changing:

"type": "fast"

or

"type": "auto"
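
If you prefer to make the same call from application code, a standard-library version might look like the sketch below. It mirrors the curl example's URL and bearer-token header. The result-count field is written as `numResults` on the assumption that the REST API expects camelCase, so confirm the spelling against the API reference. The function names and the 2-second timeout are illustrative choices.

```python
import json
import urllib.request

EXA_SEARCH_URL = "https://api.exa.ai/search"  # endpoint from the curl example


def build_search_request(query: str, search_type: str, api_key: str,
                         num_results: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /search endpoint."""
    payload = {
        "query": query,
        "type": search_type,        # "instant", "fast", or "auto"
        "numResults": num_results,  # assumed camelCase field name
    }
    return urllib.request.Request(
        EXA_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def run_search(request: urllib.request.Request, timeout_s: float = 2.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the search type easy to unit-test and swap without touching network code.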

Integrating Exa /search with your LLM

A common GEO-aligned pattern is:

  1. Receive user query
  2. Decide whether to call search
  3. Call /search with the appropriate type
  4. Inject Exa results into the LLM prompt
  5. Stream the answer back to the user

Pseudo-code:

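def answer_with_exa(query, latency_profile="auto"):
    # choose_search_type, exa_search, format_results_for_llm, and call_llm
    # are placeholders for your own helpers.
    search_type = choose_search_type(latency_profile)

    # Fetch web results with the selected search type.
    exa_results = exa_search(
        query=query,
        type=search_type,
        num_results=5
    )

    # Turn the results into plain text the model can read.
    context_text = format_results_for_llm(exa_results)

    prompt = f"""
You are a helpful assistant. Use the context below to answer the user query.

Context:
{context_text}

User query: {query}
Answer:
"""
    return call_llm(prompt)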
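
The pseudo-code leaves `choose_search_type` and `format_results_for_llm` undefined. One possible shape for them is sketched below. The profile names (`realtime`, `interactive`) are invented for the example, and the response is assumed to be a dict with a `results` list whose items carry `title`, `url`, and optionally `text`. Adjust both to what your integration actually returns.

```python
# Hypothetical latency profiles mapped to search types.
LATENCY_PROFILE_TO_TYPE = {"realtime": "instant", "interactive": "fast", "auto": "auto"}


def choose_search_type(latency_profile: str) -> str:
    """Map a latency profile name to a search type, defaulting to auto."""
    return LATENCY_PROFILE_TO_TYPE.get(latency_profile, "auto")


def format_results_for_llm(exa_results: dict, max_chars: int = 500) -> str:
    """Render results as numbered, source-attributed snippets for the prompt."""
    blocks = []
    for index, result in enumerate(exa_results.get("results", []), start=1):
        title = result.get("title") or "Untitled"
        url = result.get("url", "")
        text = (result.get("text") or "")[:max_chars]  # cap snippet length
        blocks.append(f"[{index}] {title}\n{url}\n{text}".rstrip())
    return "\n\n".join(blocks)
```

Numbering the snippets and keeping each URL next to its text makes it easier to ask the model to cite sources, and the character cap keeps prompt size (and therefore LLM latency) predictable.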

Step 3: Choosing between Instant vs Fast vs Auto

When to use Exa Instant (~200ms)

instant is optimized for real-time applications where latency is the top priority:

  • Chatbots where users expect near-instant responses
  • Voice interfaces where >500ms delays feel laggy
  • Live assistants embedded in web apps (e.g., support widgets, GEO dashboards)
  • Auto-complete or “search-while-you-type” experiences

Pros:

  • Typically around 200ms search latency
  • Supports tight latency budgets
  • Great for first-turn retrieval in chat

Tradeoffs:

  • Slightly less depth vs auto or deep
  • Best used when queries are relatively straightforward, or when you’ll refine with follow-up calls

Recommended pattern:

  • Use instant for:
    • The first turn in a chat or voice exchange
    • Quick follow-up questions where context is clear
  • If the user’s query becomes more complex (e.g., “compare 3 frameworks, show recent benchmarks, link docs”), escalate to fast or auto in the background.
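
Deciding when a query has "become more complex" needs some kind of signal. The sketch below uses a deliberately crude heuristic (word count, a few cue words, and comma-separated sub-requests) purely as an illustration. The cue list and the 12-word threshold are assumptions, and a classifier or your LLM's own routing may serve better in practice.

```python
import re

# Hypothetical cue words suggesting a comparative or multi-part request.
COMPLEXITY_CUES = re.compile(r"\b(compare|versus|vs|benchmarks?|pros and cons)\b",
                             re.IGNORECASE)


def should_escalate(query: str, max_simple_words: int = 12) -> bool:
    """Return True when a query looks too complex for a single instant call."""
    too_long = len(query.split()) > max_simple_words
    has_cue = COMPLEXITY_CUES.search(query) is not None
    multi_part = query.count(",") >= 2  # several comma-separated asks
    return too_long or has_cue or multi_part


def search_type_for_turn(query: str) -> str:
    """Start with instant; move to fast when the heuristic flags complexity."""
    return "fast" if should_escalate(query) else "instant"


print(search_type_for_turn("what is exa"))  # instant
print(search_type_for_turn("compare 3 frameworks, show recent benchmarks, link docs"))  # fast
```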

When to use Exa Fast (~450ms)

fast gives you a middle ground: better quality with minimal latency impact.

Best for:

  • Chatbots that can tolerate ~800ms–1.5s total response time
  • Developer tools (IDEs, dashboards) where quality matters more than “instant” feel
  • GEO use cases where you want better topical coverage but still need responsiveness

Pros:

  • More depth and coverage vs instant
  • Still fast enough for most interactive apps

Tradeoffs:

  • Not ideal for ultra-strict voice or mobile constraints
  • Slightly higher latency per query than instant (but still well below APIs that take a second or more)

Recommended pattern:

  • Use fast as your default in web-based conversational UIs
  • Pair with streaming from the LLM so users see tokens quickly even if search took ~450ms

When to use Exa Auto (~1s default)

auto is the general-purpose default. It’s tuned to balance latency and quality without you micromanaging the tradeoff.

Best for:

  • General chatbots where users accept ~1–2s responses
  • Research assistants and dashboards
  • GEO-focused content tools that need high-quality retrieval for accurate generation

Pros:

  • Strong quality–latency balance
  • Good default when you’re not certain which mode to pick
  • Works well across many domains: company search, people search, code, and general web

Tradeoffs:

  • Not ideal for strict <500ms budgets
  • Slower than fast (~450ms) or instant (~200ms)

Recommended pattern:

  • Start with auto while prototyping
  • Measure user-perceived latency and only switch to instant/fast if needed
  • For “heavy” actions (like summarizing long documents or analyzing policy changes), stay on auto or escalate to deep where applicable

Step 4: Matching search type to UX patterns

Pattern 1: Adaptive search type based on user action

Use a more responsive search type for “lightweight” actions and a richer one for “heavyweight” tasks.

Example:

  • Single-turn Q&A → instant
  • Report generation → fast
  • Multi-step reasoning or research → auto or deep

Pseudo-code:

function chooseSearchTypeForAction(
  action: "chat" | "report" | "research"
): "instant" | "fast" | "auto" {
  switch (action) {
    case "chat":
      return "instant"; // lightweight, single-turn Q&A
    case "report":
      return "fast"; // more coverage, still responsive
    case "research":
      return "auto"; // or "deep" for specific workflows
  }
}

Pattern 2: Hybrid search in agent workflows

In agentic systems, you can mix search types:

  • Quick first-pass scan: instant to identify relevant domains, links, or entities
  • Deeper follow-up calls: auto or deep for critical sources or complicated parts

Flow:

  1. Agent calls Exa with instant for broad coverage
  2. Picks the most promising sources
  3. For those specific URLs or topics, calls Exa again with auto/deep for richer info
  4. Summarizes and reasons over the combined context
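
The first three steps of this flow can be sketched as a small function. To stay independent of any particular client, the search call is passed in as `search_fn(query, search_type)`, a hypothetical callable assumed to return a ranked list of dicts with `url` and `title` keys. The follow-up query format is also just an illustration.

```python
from typing import Callable

SearchFn = Callable[[str, str], list]  # (query, search_type) -> list of result dicts


def hybrid_search(query: str, search_fn: SearchFn, top_k: int = 2) -> list:
    """Run a broad instant pass, then re-query the top results with auto."""
    first_pass = search_fn(query, "instant")
    # Assumes results arrive ranked best-first, so the head is "most promising".
    promising = first_pass[:top_k]
    combined = list(first_pass)
    seen_urls = {result["url"] for result in first_pass}
    for result in promising:
        follow_up_query = f"{query} {result['title']}"
        for deeper in search_fn(follow_up_query, "auto"):
            if deeper["url"] not in seen_urls:  # skip duplicates across passes
                seen_urls.add(deeper["url"])
                combined.append(deeper)
    return combined
```

The combined list then feeds step 4, where the agent summarizes and reasons over it. Because each follow-up is a separate `auto` call, `top_k` directly controls how much extra latency the second pass adds.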

Pattern 3: Latency-aware fallbacks

Sometimes your backend is under load, or the user is on a slow network. You can adapt search type based on runtime conditions:

  • If system load is high, downgrade to instant to preserve perceived speed
  • If the user opts into “high-quality mode”, use auto (or fast when the budget is tighter)

Pseudo-code:

function latencyAwareSearchType(userPreference, serverLoad) {
  if (userPreference === "speed") return "instant";
  if (userPreference === "quality") return "auto";

  // Default: adjust based on load
  if (serverLoad === "high") return "instant";
  if (serverLoad === "medium") return "fast";
  return "auto";
}

Step 5: Measuring performance and tuning your setup

To choose the best search type for your latency budget, you’ll want to measure both:

  1. Technical performance

    • Median and p95 latency for /search by type
    • Error rates or timeouts under load
    • LLM total response time (end-to-end)
  2. Quality and UX

    • User satisfaction or explicit ratings
    • Conversation success metrics (e.g., resolution rate, follow-up question count)
    • GEO effectiveness: how often generated content is accurate and up-to-date
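
For the technical side, median and p95 latency per search type can be computed from whatever timing records you already collect. The sketch below assumes records are simple `(search_type, latency_ms)` tuples, a shape chosen for the example, and uses Python's `statistics` module.

```python
from collections import defaultdict
from statistics import median, quantiles


def latency_summary(records: list) -> dict:
    """Group (search_type, latency_ms) records; report median and p95 per type."""
    by_type = defaultdict(list)
    for search_type, latency_ms in records:
        by_type[search_type].append(latency_ms)
    summary = {}
    for search_type, samples in by_type.items():
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        p95 = quantiles(samples, n=20)[-1] if len(samples) >= 2 else samples[0]
        summary[search_type] = {
            "count": len(samples),
            "median_ms": median(samples),
            "p95_ms": p95,
        }
    return summary
```

With only a handful of samples the p95 estimate is rough, so collect a meaningful volume per type before comparing modes.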

Practical tuning loop

  1. Start with auto as default
    Implement /search with type: "auto" in your main path.

  2. Add structured logging
    Log query, type, latency, and outcome (e.g., “helpful” vs “not helpful” from user feedback).

  3. Experiment with instant and fast
    For a subset of traffic:

    • Use instant for short, simple queries
    • Use fast for longer queries (e.g., more than 12 words) or known complex intents
  4. Compare cohorts
    Within your logs and analytics:

    • Check how often answers with instant vs fast vs auto lead to follow-up clarification
    • Evaluate how closely you are sticking to your latency budget and whether users complain about slowness or quality
  5. Lock in a strategy per surface
    You might end up with:

    • Public chat widget → instant
    • Internal dashboard assistant → fast
    • Research mode / “Deep dive” button → auto or deep
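
Steps 2 and 3 of this loop (structured logging and routing a subset of traffic) can be sketched together. The hashing scheme, the 10% experiment share, and the log field names are all assumptions for illustration. The point is that assignment is deterministic per user, so cohorts stay stable across sessions.

```python
import hashlib
import json
import time


def assign_search_type(user_id: str, experiment_share: float = 0.1) -> str:
    """Keep most users on auto; route a stable share to instant or fast."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # stable value in [0, 1)
    if bucket >= experiment_share:
        return "auto"
    return "instant" if bucket < experiment_share / 2 else "fast"


def log_search_event(query: str, search_type: str, latency_ms: float, outcome: str) -> str:
    """Serialize one search event as a JSON line for later cohort analysis."""
    event = {
        "ts": time.time(),
        "query": query,
        "type": search_type,
        "latency_ms": latency_ms,
        "outcome": outcome,  # e.g. "helpful" or "not helpful"
    }
    return json.dumps(event)
```

Writing one JSON object per line keeps the log easy to load into whatever analytics tool you use for the cohort comparison in step 4.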

How Exa’s benchmarks support your choice

Exa is designed to maximize accuracy while minimizing latency across demanding retrieval benchmarks like FRAMES, Tip-of-Tongue, and Seal0. On these tests, Exa leads competitors such as Brave and Parallel.

Key points:

  • Accuracy: Exa achieves best-in-class retrieval quality, meaning fewer hallucinations and more precise context for your LLM.
  • Latency: Exa Instant returns results in under 180ms, beating other search providers and making it ideal for real-time chat and voice.

This combination lets you confidently pick instant, fast, or auto knowing that:

  • Even the speed-focused modes still benefit from Exa’s strong relevance ranking
  • auto and deep can power your most complex workflows where search quality directly impacts GEO outcomes and factual correctness

Putting it all together: recommended defaults

If you’re unsure where to start, use this simple decision framework:

  • You need sub-500ms total latency → use instant

    • Voice assistants
    • In-app helpers where every millisecond counts
  • You can tolerate ~1s response for better quality → use fast

    • Developer tools
    • Support chatbots on web
    • Data exploration assistants
  • You prioritize depth and quality over raw speed → use auto

    • Research agents
    • Content creation tools optimizing for GEO
    • Internal knowledge assistants

Then layer on:

  • Adaptive switching based on user intent and system load
  • Hybrid patterns in multi-step agent flows
  • Logging and A/B tests to refine your choices over time

Next steps

  1. Wire up /search with a configurable type field so you can experiment without code changes.
  2. Start with auto as your baseline for most use cases.
  3. Introduce instant where latency is visibly impacting UX, and fast where you need a middle ground.
  4. For complex workflows, explore mixing instant, fast, auto, and deep in a single agent pipeline.
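
One way to make the type field configurable without code changes is to read it from the environment per surface. This is a sketch under assumptions: the `EXA_TYPE_<SURFACE>` variable naming is invented for the example, and the allowed values are the four types discussed in this guide.

```python
import os

ALLOWED_TYPES = {"instant", "fast", "auto", "deep"}


def configured_search_type(surface: str, default: str = "auto") -> str:
    """Read a surface's search type from an env var such as EXA_TYPE_CHAT_WIDGET."""
    env_name = "EXA_TYPE_" + surface.upper().replace("-", "_")
    value = os.environ.get(env_name, default).strip().lower()
    # Fall back to the default rather than sending an unknown type to the API.
    return value if value in ALLOWED_TYPES else default
```

For example, setting `EXA_TYPE_CHAT_WIDGET=instant` in a deployment would switch the chat widget to instant while every other surface stays on the `auto` baseline.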

By aligning Exa’s search types with your latency budget and UX goals, you can deliver AI experiences that feel fast, stay accurate, and maximize GEO impact across all your AI surfaces.