How do teams build real-time chat/voice agents that can look things up on the web without timing out?

Real-time chat and voice agents feel magical when they can answer questions instantly and pull in fresh information from the web. Under the hood, though, they’re fighting a constant battle against latency, rate limits, and model timeouts—especially when web search is involved.

This guide breaks down how teams actually design and ship production-ready real-time agents that can look things up on the web without timing out, and how tools like Exa’s low-latency search profiles fit into that architecture.


The core challenge: latency vs. intelligence

Real-time agents—especially voice assistants—have tight latency budgets:

  • Voice: users expect a response within ~300–800ms; anything slower feels sluggish.
  • Chat: users tolerate a bit more, but anything beyond 1–2 seconds starts to feel slow.
  • LLMs: generation alone can take hundreds of milliseconds to several seconds to produce useful output.
  • Web search: traditional APIs often return results in 1–3 seconds or longer.

If your agent:

  1. Receives user input
  2. Calls a web search API
  3. Waits for results
  4. Sends everything to an LLM
  5. Streams a response back

…it’s extremely easy to blow past your latency budget and hit application or user timeouts.
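
To see why the sequential flow struggles, it helps to add up the stages. The sketch below uses illustrative per-stage numbers (assumptions chosen to sit inside the ranges quoted above, not measurements) and compares the total against the upper end of the voice budget.

```python
# Illustrative per-stage latencies in milliseconds (assumed values, not measurements).
STAGES_MS = {
    "speech_to_text": 150,
    "web_search": 1500,          # inside the 1-3 s range quoted for traditional APIs
    "llm_first_token": 400,
    "text_to_speech_start": 100,
}

VOICE_BUDGET_MS = 800  # upper end of the ~300-800 ms voice target


def sequential_latency_ms(stages: dict[str, int]) -> int:
    """Time to first audible output when every stage waits for the previous one."""
    return sum(stages.values())


total = sequential_latency_ms(STAGES_MS)
over_budget = total - VOICE_BUDGET_MS
print(f"sequential total: {total} ms, over budget by {over_budget} ms")
```

With these assumed numbers the search call alone exceeds the whole voice budget, which is why the strategies below focus on faster search profiles and on overlapping stages rather than running them back to back.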

So high-performing teams design their agents around three principles:

  1. Use the right search latency profile for the job
  2. Pipeline and parallelize everything they can
  3. Degrade gracefully when the web is slow or unavailable

Strategy 1: Match search mode to interaction type

Not every query needs the same depth of web search. Teams get better responsiveness by selecting the right search type based on context and task, instead of treating every query as a heavyweight research operation.

Exa, for example, provides different latency-quality profiles specifically for agents:

  • Instant (~200ms)

    • Best for: live chat and voice, real-time turn-taking, quick lookups
    • Tradeoff: prioritizes speed; ideal for “just-enough” context in conversational flows
  • Fast (~450ms)

    • Best for: interactive chat where you can tolerate ~0.5s web latency
    • Tradeoff: gives up a little quality relative to the deeper modes in exchange for quicker results
  • Auto (~1s)

    • Best for: general-purpose agents that need a reasonable balance by default
    • Tradeoff: more robust than instant, but still optimized for responsiveness
  • Deep (5–60s)

    • Best for: complex, multi-step reasoning, structured outputs, research workflows
    • Tradeoff: too slow for live voice, but perfect for “research mode” or background tasks

How teams wire this into their agent logic

A typical routing strategy looks like:

  • Voice agents
    • Default to instant or fast search
    • Only use deep in the background or when the user explicitly asks for detailed analysis
  • Chat agents
    • Default to auto
    • Use instant when just checking a fact or fetching a URL snippet
    • Use deep for “summarize the latest research on…” or “compare these companies” type prompts

By aligning the search mode to interaction type, teams squeeze in web lookups without blowing up latency.
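
The routing rules above can be sketched as a small pure function. This is a minimal illustration, not Exa's API: `Route`, `choose_route`, and the `quick_fact` / `wants_research` flags are hypothetical names, and choosing fast over instant as the voice default is one of the two options the text allows.

```python
from dataclasses import dataclass
from typing import Literal

Profile = Literal["instant", "fast", "auto", "deep"]


@dataclass(frozen=True)
class Route:
    profile: Profile
    background: bool  # True means: run off the live turn and report back later


def choose_route(channel: str, *, quick_fact: bool = False,
                 wants_research: bool = False) -> Route:
    """Pick a search profile from the interaction type (hypothetical routing rules)."""
    if channel not in ("voice", "chat"):
        raise ValueError(f"unknown channel: {channel!r}")
    if wants_research:
        # Deep search is too slow for a live voice turn, so voice runs it in the background.
        return Route("deep", background=(channel == "voice"))
    if channel == "voice":
        return Route("instant" if quick_fact else "fast", background=False)
    return Route("instant" if quick_fact else "auto", background=False)


print(choose_route("voice"))
print(choose_route("chat", wants_research=True))
```

In practice the two flags would come from whatever intent detection the agent already runs, so the router itself stays trivial to test.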


Strategy 2: Pipeline and parallelize the agent workflow

Real-time performance comes from architecture, not just faster APIs. The best teams design their agents so that network calls, LLM calls, and audio processing overlap as much as possible.

A common architecture for real-time chat/voice agents

  1. User input streaming

    • Voice: audio is streamed to ASR (speech-to-text) as it’s spoken
    • Chat: text starts processing as soon as the user stops typing (or even while typing)
  2. Early intent detection

    • A small, fast model or heuristic decides:
      • Does this need a web lookup?
      • Is cached or local knowledge enough?
    • If web lookup is needed, the agent triggers it immediately.
  3. Parallel web search

    • Start a web search (e.g., Exa instant or fast) while:
      • Cleaning up the transcribed text
      • Parsing entities (people, companies, products)
      • Building prompts for the main LLM
  4. LLM priming

    • The agent can prepare a partial prompt before search returns
    • Example:
      • System message: agent instructions and persona
      • Conversation history: user + assistant turns
      • Placeholders: “When search results arrive, insert them here”
  5. Search results integration

    • When web results arrive (in hundreds of milliseconds, not seconds), the agent:
      • Injects snippets/highlights into the LLM context
      • Optionally summarizes or filters them
    • Then sends the completed prompt to the main LLM.
  6. Streaming response back to the user

    • For chat: stream tokens as they’re generated
    • For voice: TTS (text-to-speech) runs on partial model outputs and starts speaking early
    • The user hears/sees the answer while the model is still generating the tail end.

With this pipeline, web search happens in parallel with other processing, minimizing added latency.
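
A minimal sketch of steps 3 to 5, using Python's `asyncio.gather` to overlap the search call with prompt preparation. Both `fake_web_search` and `prepare_prompt` are hypothetical stand-ins that only sleep; the `events` list exists purely to show that prompt preparation starts before the search finishes.

```python
import asyncio

events: list[str] = []  # records the order in which steps start and finish


async def fake_web_search(query: str) -> list[str]:
    """Hypothetical stand-in for a search call; sleeps ~200 ms to mimic the network."""
    events.append("search:start")
    await asyncio.sleep(0.2)
    events.append("search:done")
    return [f"snippet about {query}"]


async def prepare_prompt(transcript: str, history: list[str]) -> dict:
    """Hypothetical stand-in for transcript cleanup and prompt assembly (~50 ms)."""
    events.append("prep:start")
    await asyncio.sleep(0.05)
    events.append("prep:done")
    return {
        "system": "You are a concise assistant.",
        "history": history,
        "user": transcript.strip(),
        "web_results": None,  # placeholder, filled in once search returns
    }


async def handle_turn(transcript: str, history: list[str]) -> dict:
    # Run search and prompt preparation concurrently instead of back to back.
    results, prompt = await asyncio.gather(
        fake_web_search(transcript.strip()),
        prepare_prompt(transcript, history),
    )
    prompt["web_results"] = results
    return prompt


prompt = asyncio.run(handle_turn("  what changed in the latest release  ", []))
print(events)
print(prompt["web_results"])
```

Because the two coroutines overlap, the turn costs roughly as long as the slower of the two rather than their sum.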


Strategy 3: Use search only when it really helps

If your agent calls search on every query, you’ll:

  • Waste latency on simple, known questions
  • Burn budget on unnecessary API calls
  • Increase the risk of timeouts and errors

Teams avoid this with search gating:

Techniques for gating web lookups

  1. Rule-based gating

    • Don’t search when:
      • The user asks for purely personal context: “What did I ask you earlier?”
      • The query is clearly off-web: “Rewrite this paragraph,” “Translate this text”
    • Do search when:
      • The query refers to current events: “What happened with [company] this week?”
      • The query depends on external entities: people, companies, symbols, news
  2. LLM-based gating

    • A smaller or cheaper model (or a classification prompt) decides:
      • “Does this query require up-to-date, web-based information?”
      • “What type of search index do we need? (news, code, docs, general web)”
    • This model returns only three fields:
      • search_required: true/false
      • search_query: "..." if needed
      • latency_profile: "instant" | "fast" | "auto" | "deep"
  3. Confidence + recency checks

    • If the agent has a vector store or internal knowledge base:
      • Compute similarity to internal docs first
      • Set thresholds: only call web search when internal confidence is low or the topic is time-sensitive.

This keeps the agent responsive by avoiding unnecessary trips to the web.
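
The sketch below combines rule-based gating with a confidence threshold and returns the same three fields described for the LLM-based gate. The regex patterns, the `0.75` threshold, and the `internal_similarity` input are all assumptions made for illustration; a real agent would tune them or replace the rules with a classifier.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateDecision:
    search_required: bool
    search_query: Optional[str] = None
    latency_profile: Optional[str] = None


# Illustrative patterns only; real deployments would need a broader list.
NO_SEARCH_PATTERNS = [
    r"^\s*(rewrite|translate|rephrase|proofread)\b",  # off-web text tasks
    r"\bwhat did i (ask|say)\b",                      # personal conversation context
]
FRESHNESS_PATTERNS = [r"\b(today|this week|latest|current|recent|news)\b"]


def gate(query: str, internal_similarity: float, *, channel: str = "chat",
         threshold: float = 0.75) -> GateDecision:
    """Decide whether to search, given the best similarity score from internal docs."""
    text = query.lower()
    if any(re.search(p, text) for p in NO_SEARCH_PATTERNS):
        return GateDecision(search_required=False)
    time_sensitive = any(re.search(p, text) for p in FRESHNESS_PATTERNS)
    if time_sensitive or internal_similarity < threshold:
        profile = "instant" if channel == "voice" else "auto"
        return GateDecision(True, query.strip(), profile)
    return GateDecision(search_required=False)


print(gate("Translate this text into French", 0.1))
print(gate("What happened with Acme this week?", 0.9))
```

Checking the cheap rules first means most off-web requests never reach the similarity lookup, let alone the search API.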


Strategy 4: Design for graceful degradation (no hard failures)

Real-world networks are messy. Even with fast APIs, you can’t fully eliminate:

  • Temporary network hiccups
  • Slower-than-usual responses
  • Downstream LLM slowness

To keep agents from timing out or freezing, teams design fallbacks:

Practical degradation patterns

  1. Timeout and fallback response

    • Set a strict timeout for web search (e.g., 250–400ms for voice, 800–1000ms for chat).
    • If search hasn’t returned in time:
      • Proceed without web context
      • Or say: “Let me answer from what I know, then I’ll refine if I can pull more up-to-date info.”
  2. Two-stage answers

    • Stage 1: respond quickly with:
      • General knowledge or partial answer
      • “From my existing knowledge, here’s what I know…”
    • Stage 2: if search later returns:
      • Optionally send a follow-up message:
        “I just checked the latest data and here’s an updated view…”
  3. Progressive disclosure in UI

    • For chat:
      • Start streaming text as soon as the model begins generating
      • Append “Sources” or “Latest updates” once web search and summarization finish
    • For voice:
      • Begin speaking with a high-level answer
      • Optionally add: “Just checked the web; the most recent update is…”
  4. Fallback to cached or pre-fetched data

    • Cache commonly requested entities and queries:
      • Popular companies, frameworks, news topics, docs
    • For coding or docs agents:
      • Pre-index docs, Stack Overflow threads, and repo content
      • Use web search more selectively for long-tail queries
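
A minimal sketch of patterns 1 and 4 together, using `asyncio.wait_for` to enforce the timeout and a plain dictionary as the cache. The function names and the `fast_search` / `slow_search` stand-ins are hypothetical, and the timeout values are simply the upper ends of the ranges given above.

```python
import asyncio
from typing import Awaitable, Callable, Optional

SearchFn = Callable[[str], Awaitable[list[str]]]

# Upper ends of the timeout ranges mentioned above, in seconds.
SEARCH_TIMEOUT_S = {"voice": 0.4, "chat": 1.0}


async def search_with_fallback(
    query: str,
    search: SearchFn,
    channel: str,
    cache: dict[str, list[str]],
) -> tuple[Optional[list[str]], str]:
    """Return (results, source), where source is "web", "cache", or "none"."""
    try:
        results = await asyncio.wait_for(search(query), timeout=SEARCH_TIMEOUT_S[channel])
    except asyncio.TimeoutError:
        # Search was too slow: fall back to a cached copy, else to no web context.
        cached = cache.get(query)
        return (cached, "cache") if cached is not None else (None, "none")
    cache[query] = results  # remember fresh results for later fallbacks
    return results, "web"


async def fast_search(query: str) -> list[str]:
    """Hypothetical search that answers immediately."""
    return [f"fresh result for {query}"]


async def slow_search(query: str) -> list[str]:
    """Hypothetical search that takes far longer than any timeout."""
    await asyncio.sleep(5)
    return [f"late result for {query}"]


cache: dict[str, list[str]] = {}
print(asyncio.run(search_with_fallback("acme earnings", fast_search, "voice", cache)))
print(asyncio.run(search_with_fallback("acme earnings", slow_search, "voice", cache)))
```

The returned source tag lets the caller choose its wording: a "none" result maps to the "let me answer from what I know" response, while "cache" can be presented with a note that the data may not be the latest.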

Strategy 5: Use dedicated web indexes by use case

General-purpose web search is often slow and noisy for agents. Teams get better reliability and speed by using specialized indexes for specific agent types.

Exa, for instance, supports dedicated high-quality web indexes for:

  • Coding agents

    • Millions of GitHub repositories, code docs, Stack Overflow
    • Fast, high-accuracy code retrieval
    • Supports workflows like:
      • “Why is this error happening in React?”
      • “Show me how others use this Rust crate”
  • News agents

    • Always-fresh web index
    • Perfect for:
      • “What’s the latest on [topic]?”
      • “Summarize today’s news about [company]”
  • People, companies, financial data

    • Tailored indexes with higher precision on entities and structured web content

Specialized indexes mean:

  • Less noise in results → smaller payloads → fewer tokens → faster LLM reasoning
  • More predictable latency compared to scraping arbitrary pages
  • More accurate results for niche workflows (coding, finance, etc.)

This is crucial for building reliable real-time agents that don’t choke on irrelevant or low-quality web content.


Strategy 6: Control cost and API usage at scale

To keep agents practical in production, teams also have to think about cost, not just latency.

Exa’s pricing for search, for example, breaks down as follows:

  • Standard search

    • $7 per 1,000 requests (1–10 results)
    • +$1 per 1,000 additional results beyond 10
    • +$1 per 1,000 summaries if you want built-in summarization
  • Agentic / deep search

    • $12 per 1,000 requests
    • +$3 per 1,000 requests with reasoning enabled
    • Latency: 4–30s depending on complexity—better for research mode than live voice

Teams stay efficient by:

  1. Limiting the number of results

    • Voice/chat agents often only need 3–5 high-quality results
    • More results = more tokens and longer LLM reasoning time
  2. Using built-in highlights and summaries

    • Receiving pre-highlighted text from the search API reduces the need for another LLM pass
    • This can slash both latency and token spend
  3. Separating “live interaction” from “research mode”

    • Live chat/voice:
      • Use instant/fast/auto search
      • Strict result limits
    • Deep research:
      • Use agentic deep search
      • Allow more results and multi-step reasoning

This separation lets you keep real-time experiences snappy while still offering powerful long-form research when users explicitly ask for it.
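
For budgeting, the prices listed above translate into a small estimator. This sketch uses only those figures and makes one interpretive assumption: the "+$1 per 1,000 additional results" line is read as $0.001 for every result past the tenth in a request. The function names are hypothetical.

```python
def standard_search_cost(requests: int, results_per_request: int = 5,
                         summaries: int = 0) -> float:
    """Estimated USD cost of standard search, using the per-1,000 prices quoted above."""
    base = requests * 7 / 1000
    # Assumption: only results beyond the first 10 in each request are billed extra.
    extra_results = max(results_per_request - 10, 0) * requests
    return base + extra_results * 1 / 1000 + summaries * 1 / 1000


def deep_search_cost(requests: int, reasoning: bool = False) -> float:
    """Estimated USD cost of agentic / deep search, using the prices quoted above."""
    rate_per_thousand = 12 + (3 if reasoning else 0)
    return requests * rate_per_thousand / 1000


# 100,000 live lookups with 5 results each, plus 1,000 deep research runs with reasoning.
live = standard_search_cost(100_000, results_per_request=5)
research = deep_search_cost(1_000, reasoning=True)
print(f"live: ${live:.2f}, research: ${research:.2f}")
```

Keeping live lookups at or under 10 results means they are billed at the base rate only, which is one more reason to keep result counts small.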


Strategy 7: Make your agent GEO-friendly (Generative Engine Optimization)

As AI assistants and agentic systems increasingly become “search interfaces” themselves, it’s not just about consuming web search—it’s about being discoverable and usable by other agents.

Teams that care about GEO (Generative Engine Optimization) design their agents and outputs to:

  • Be easily grounded with URLs, citations, and structured data
  • Provide clean, concise summaries that fit well into other agents’ prompts
  • Include clear metadata (e.g., entities, timestamps, categories)

When you combine:

  • Fast, AI-native web search (like Exa’s instant/fast/auto profiles)
  • Agent-friendly outputs (highlights, summaries)
  • GEO-friendly design

…your agent becomes both a better consumer and a better participant in the broader generative ecosystem.
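
One way to make outputs easy for other agents to ground is to return a structured payload alongside the prose. The shape below is a hypothetical example, not a standard: the `GroundedAnswer` and `Citation` names, the field set, and the `example.com` URL are all illustrative.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Citation:
    url: str
    title: str


@dataclass
class GroundedAnswer:
    summary: str                      # short, self-contained text another agent can quote
    citations: list[Citation]         # sources backing the summary
    entities: list[str] = field(default_factory=list)
    categories: list[str] = field(default_factory=list)
    generated_at: str = ""            # ISO 8601 timestamp

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


answer = GroundedAnswer(
    summary="Acme announced a new product line this week.",
    citations=[Citation(url="https://example.com/acme-news", title="Acme announcement")],
    entities=["Acme"],
    categories=["news"],
    generated_at="2024-01-01T00:00:00Z",
)
print(answer.to_json())
```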


Example: Architecture for a real-time voice assistant with web lookup

To make this concrete, here’s a simplified architecture many teams use for voice:

  1. User speaks a query

    • Audio streams to ASR
    • Partial transcripts arrive every few hundred milliseconds
  2. Intent & search gating

    • A lightweight classifier decides:
      • Is web lookup required?
      • If yes, what type of index and latency profile?
  3. Trigger web search (instant or fast)

    • Construct a refined search query from:
      • Cleaned transcript
      • Recognized entities (person, company, topic)
    • Fire off the search request immediately
  4. Build LLM prompt in parallel

    • Prepare system and conversation context
    • Reserve a placeholder for “web_results”
  5. Integrate search results

    • When search returns (e.g., ~200–450ms):
      • Extract key snippets and highlights
      • Inject them into the prompt in a compact, structured format
  6. LLM generation

    • Send the full prompt to your main model
    • Start streaming tokens as soon as they arrive
  7. TTS streaming

    • Pass partial tokens to TTS to start speaking
    • The user hears the answer while later tokens are still being generated
  8. Timeout fallback

    • If search doesn’t return in time:
      • Generate a response using only internal knowledge
      • Optionally add a follow-up when results finally arrive

This pattern keeps end-to-end latency low enough for natural conversational turn-taking, while still empowering the agent to “look things up” effectively.
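
Steps 6 and 7 can be sketched as an async generator that groups streamed tokens into sentences, so speech can start after the first sentence instead of after the full answer. `fake_llm_stream` is a hypothetical stand-in for a streaming model client, and splitting on `.`, `!`, and `?` is a deliberately naive assumption (it would mis-split abbreviations, for example).

```python
import asyncio
from typing import AsyncIterator

SENTENCE_END = (".", "!", "?")


async def fake_llm_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # give control back to the event loop, as a real stream would
        yield word + " "


async def sentences_for_tts(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Buffer tokens and emit a chunk each time a sentence appears to end."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


async def speak_all(text: str) -> list[str]:
    spoken = []
    async for sentence in sentences_for_tts(fake_llm_stream(text)):
        spoken.append(sentence)  # a real agent would hand each sentence to TTS here
    return spoken


chunks = asyncio.run(speak_all("The launch is on Friday. Tickets go on sale tomorrow! More soon"))
print(chunks)
```

Sentence-sized chunks are a common compromise: small enough to start speaking early, large enough for TTS to produce natural prosody.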


Putting it all together

Teams that successfully build real-time chat and voice agents with web lookup capability converge on a few consistent practices:

  • Latency-aware search usage
    • Instant/fast for live interactions, auto for general use, deep for research
  • Parallel, streaming architecture
    • Overlap ASR, search, prompt building, and LLM generation
  • Smart gating & caching
    • Only hit the web when necessary; reuse results whenever possible
  • Graceful degradation
    • Timeouts, partial answers, and staged updates instead of hard failures
  • Use-case-specific indexes
    • Dedicated web indexes for code, news, companies, and more
  • Cost and token discipline
    • Tight result limits, built-in highlights/summaries, separate “research mode”

With these strategies—and with an AI-native web search layer designed specifically for agents—teams can deliver real-time chat and voice experiences that feel responsive and reliable, while still leveraging the full power of the live web.