
How do teams build real-time chat/voice agents that can look things up on the web without timing out?
Real-time chat and voice agents feel magical when they can answer questions instantly and pull in fresh information from the web. Under the hood, though, they’re fighting a constant battle against latency, rate limits, and model timeouts—especially when web search is involved.
This guide breaks down how teams actually design and ship production-ready real-time agents that can look things up on the web without timing out, and how tools like Exa’s low-latency search profiles fit into that architecture.
The core challenge: latency vs. intelligence
Real-time agents—especially voice assistants—have tight latency budgets:
- Voice: users expect a response within ~300–800ms; beyond that, the experience feels sluggish.
- Chat: users tolerate a bit more, but anything beyond 1–2 seconds starts to feel slow.
- LLMs: generation alone can take hundreds of milliseconds to several seconds to produce useful output.
- Web search: traditional APIs often return results in 1–3 seconds or longer.
If your agent:
- Receives user input
- Calls a web search API
- Waits for results
- Sends everything to an LLM
- Streams a response back
…it’s extremely easy to blow past your latency budget and hit application or user timeouts.
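To make the arithmetic concrete, here is a minimal sketch that sums per-stage latencies for a fully sequential voice pipeline. The stage names and timings are illustrative assumptions picked from the ranges above, not measurements of any real system.

```python
# Illustrative per-stage latencies in milliseconds (assumed values, not measurements).
SEQUENTIAL_STAGES_MS = {
    "speech_to_text": 150,
    "web_search": 1000,      # low end of the 1-3 s range for traditional search APIs
    "llm_first_token": 400,
    "text_to_speech": 100,
}

VOICE_BUDGET_MS = 800  # upper end of the ~300-800 ms voice budget


def total_latency_ms(stages: dict[str, int]) -> int:
    """Time to first response when every stage waits for the previous one."""
    return sum(stages.values())


def over_budget_ms(stages: dict[str, int], budget_ms: int) -> int:
    """How far the pipeline overshoots the budget (0 if it fits)."""
    return max(0, total_latency_ms(stages) - budget_ms)


if __name__ == "__main__":
    print(total_latency_ms(SEQUENTIAL_STAGES_MS))                 # 1650
    print(over_budget_ms(SEQUENTIAL_STAGES_MS, VOICE_BUDGET_MS))  # 850
```

Even with optimistic numbers, the sequential version lands at roughly twice the voice budget, and the search stage accounts for most of the overshoot.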
So high-performing teams design their agents around three principles:
- Use the right search latency profile for the job
- Pipeline and parallelize everything they can
- Degrade gracefully when the web is slow or unavailable
Strategy 1: Match search mode to interaction type
Not every query needs the same depth of web search. Teams get better responsiveness by selecting the right search type based on context and task, instead of treating every query as a heavyweight research operation.
Exa, for example, provides different latency-quality profiles specifically for agents:
- Instant (~200ms)
  - Best for: live chat and voice, real-time turn-taking, quick lookups
  - Tradeoff: prioritizes speed; ideal for “just-enough” context in conversational flows
- Fast (~450ms)
  - Best for: interactive chat where you can tolerate ~0.5s web latency
  - Tradeoff: small quality sacrifice for quicker results versus deeper modes
- Auto (~1s)
  - Best for: general-purpose agents that need a reasonable balance by default
  - Tradeoff: more robust than instant, but still optimized for responsiveness
- Deep (5–60s)
  - Best for: complex, multi-step reasoning, structured outputs, research workflows
  - Tradeoff: too slow for live voice, but perfect for “research mode” or background tasks
How teams wire this into their agent logic
A typical routing strategy looks like:
- Voice agents
  - Default to instant or fast search
  - Only use deep in the background or when the user explicitly asks for detailed analysis
- Chat agents
  - Default to auto
  - Use instant when just checking a fact or fetching a URL snippet
  - Use deep for “summarize the latest research on…” or “compare these companies” type prompts
By aligning the search mode to interaction type, teams squeeze in web lookups without blowing up latency.
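The routing rules above can be sketched as a small pure function. The `Profile` enum, the `choose_profile` name, and the `task` labels are hypothetical names introduced for illustration; only the four profile names come from the text, and how they map to an actual API parameter is left as an assumption.

```python
from enum import Enum


class Profile(str, Enum):
    INSTANT = "instant"
    FAST = "fast"
    AUTO = "auto"
    DEEP = "deep"


def choose_profile(channel: str, task: str = "general") -> Profile:
    """Pick a search profile from the channel ("voice" or "chat") and a coarse
    task label ("quick_lookup", "general", or "research")."""
    if channel not in ("voice", "chat"):
        raise ValueError(f"unknown channel: {channel!r}")
    if task == "research":
        # On voice, the caller is expected to run this in the background.
        return Profile.DEEP
    if channel == "voice":
        return Profile.INSTANT
    # Chat: instant for quick fact checks, auto as the general default.
    return Profile.INSTANT if task == "quick_lookup" else Profile.AUTO


if __name__ == "__main__":
    print(choose_profile("voice").value)             # instant
    print(choose_profile("chat").value)              # auto
    print(choose_profile("chat", "research").value)  # deep
```

Keeping the decision in one function makes it easy to tune later, for example switching the voice default from instant to fast if answer quality turns out to matter more than the extra latency.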
Strategy 2: Pipeline and parallelize the agent workflow
Real-time performance comes from architecture, not just faster APIs. The best teams design their agents so that network calls, LLM calls, and audio processing overlap as much as possible.
A common architecture for real-time chat/voice agents
- User input streaming
  - Voice: audio is streamed to ASR (speech-to-text) as it’s spoken
  - Chat: text starts processing as soon as the user stops typing (or even while typing)
- Early intent detection
  - A small, fast model or heuristic decides:
    - Does this need a web lookup?
    - Is cached or local knowledge enough?
  - If web lookup is needed, the agent triggers it immediately.
- Parallel web search
  - Start a web search (e.g., Exa instant or fast) while:
    - Cleaning up the transcribed text
    - Parsing entities (people, companies, products)
    - Building prompts for the main LLM
- LLM priming
  - The agent can prepare a partial prompt before search returns
  - Example:
    - System message: agent instructions and persona
    - Conversation history: user + assistant turns
    - Placeholders: “When search results arrive, insert them here”
- Search results integration
  - When web results arrive (in hundreds of milliseconds, not seconds), the agent:
    - Injects snippets/highlights into the LLM context
    - Optionally summarizes or filters them
  - Then sends the completed prompt to the main LLM.
- Streaming response back to the user
  - For chat: stream tokens as they’re generated
  - For voice: TTS (text-to-speech) runs on partial model outputs and starts speaking early
  - The user hears/sees the answer while the model is still generating the tail end.
With this pipeline, web search happens in parallel with other processing, minimizing added latency.
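The overlap between search and prompt preparation can be sketched with `asyncio`. Both `fake_search` and `build_prompt_skeleton` are hypothetical stand-ins with simulated delays; a real agent would call its search client and its own preprocessing here.

```python
import asyncio
import time


async def fake_search(query: str) -> list[str]:
    """Stand-in for a low-latency search call (hypothetical, simulated delay)."""
    await asyncio.sleep(0.2)
    return [f"snippet about {query}"]


async def build_prompt_skeleton(history: list[str]) -> str:
    """Prepare everything that does not depend on search results."""
    await asyncio.sleep(0.1)  # simulated transcript cleanup and entity parsing
    return "SYSTEM: be concise\n" + "\n".join(history) + "\nWEB_RESULTS:\n{web_results}"


async def prepare_prompt(query: str, history: list[str]) -> str:
    # Start the search first so it runs while the skeleton is being built.
    search_task = asyncio.create_task(fake_search(query))
    skeleton = await build_prompt_skeleton(history)
    snippets = await search_task
    # Fill the placeholder once results arrive.
    return skeleton.replace("{web_results}", "\n".join(f"- {s}" for s in snippets))


if __name__ == "__main__":
    start = time.perf_counter()
    prompt = asyncio.run(prepare_prompt("exa latency", ["USER: how fast is it?"]))
    print(prompt)
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Because the two coroutines overlap, the elapsed time should be close to the slower of the two (the simulated 0.2 s search) rather than their sum.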
Strategy 3: Use search only when it really helps
If your agent calls search on every query, you’ll:
- Waste latency on simple, known questions
- Burn budget on unnecessary API calls
- Increase the risk of timeouts and errors
Teams avoid this with search gating:
Techniques for gating web lookups
- Rule-based gating
  - Don’t search when:
    - The user asks for purely personal context: “What did I ask you earlier?”
    - The query is clearly off-web: “Rewrite this paragraph,” “Translate this text”
  - Do search when:
    - Query refers to current events: “What happened with [company] this week?”
    - Requires external entities: people, companies, symbols, news
- LLM-based gating
  - A smaller or cheaper model (or a classification prompt) decides:
    - “Does this query require up-to-date, web-based information?”
    - “What type of search index do we need? (news, code, docs, general web)”
  - This model only returns:
    - search_required: true/false
    - search_query: "..." if needed
    - latency_profile: "instant" | "fast" | "auto" | "deep"
- Confidence + recency checks
  - If the agent has a vector store or internal knowledge base:
    - Compute similarity to internal docs first
    - Set thresholds: only call web search when internal confidence is low or the topic is time-sensitive.
This keeps the agent responsive by avoiding unnecessary trips to the web.
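A minimal sketch of the first two gating techniques follows. The regex patterns, the `GateDecision` dataclass, and the function names are assumptions made for illustration; the three JSON fields are the ones listed above. The parser fails closed, meaning malformed gate output results in no search rather than an error.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword rules; a real deployment would tune these.
NO_SEARCH_PATTERNS = [r"\brewrite\b", r"\btranslate\b", r"\bwhat did i (ask|say)\b"]
SEARCH_PATTERNS = [r"\blatest\b", r"\btoday\b", r"\bthis week\b", r"\bnews\b"]
VALID_PROFILES = ("instant", "fast", "auto", "deep")


@dataclass
class GateDecision:
    search_required: bool
    search_query: Optional[str] = None
    latency_profile: str = "instant"


def rule_gate(text: str) -> Optional[bool]:
    """Return True/False when a rule matches, or None to defer to the LLM gate."""
    lowered = text.lower()
    if any(re.search(p, lowered) for p in NO_SEARCH_PATTERNS):
        return False
    if any(re.search(p, lowered) for p in SEARCH_PATTERNS):
        return True
    return None


def parse_llm_gate(raw_json: str) -> GateDecision:
    """Validate the small JSON object the gating model is asked to return."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return GateDecision(search_required=False)
    if not isinstance(data, dict) or data.get("search_required") is not True:
        return GateDecision(search_required=False)
    query = data.get("search_query")
    if not isinstance(query, str) or not query.strip():
        return GateDecision(search_required=False)
    profile = data.get("latency_profile")
    if profile not in VALID_PROFILES:
        profile = "instant"  # unknown profile: fall back to the fastest one
    return GateDecision(True, query.strip(), profile)
```

A typical call order would be `rule_gate` first, since it costs nothing, with the gating model consulted only when the rules return `None`.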
Strategy 4: Design for graceful degradation (no hard failures)
Real-world networks are messy. Even with fast APIs, you can’t fully eliminate:
- Temporary network hiccups
- Slower-than-usual responses
- Downstream LLM slowness
To keep agents from timing out or freezing, teams design fallbacks:
Practical degradation patterns
- Timeout and fallback response
  - Set a strict timeout for web search (e.g., 250–400ms for voice, 800–1000ms for chat).
  - If search hasn’t returned in time:
    - Proceed without web context
    - Or say: “Let me answer from what I know, then I’ll refine if I can pull more up-to-date info.”
- Two-stage answers
  - Stage 1: respond quickly with:
    - General knowledge or partial answer
    - “From my existing knowledge, here’s what I know…”
  - Stage 2: if search later returns:
    - Optionally send a follow-up message: “I just checked the latest data and here’s an updated view…”
- Progressive disclosure in UI
  - For chat:
    - Start streaming text as soon as the model begins generating
    - Append “Sources” or “Latest updates” once web search and summarization finish
  - For voice:
    - Begin speaking with high-level answer
    - Optionally add: “Just checked the web; the most recent update is…”
- Fallback to cached or pre-fetched data
  - Cache commonly requested entities and queries:
    - Popular companies, frameworks, news topics, docs
  - For coding or docs agents:
    - Pre-index docs, Stack Overflow threads, and repo content
    - Use web search more selectively for long tail queries
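The timeout-and-fallback pattern can be sketched with `asyncio.wait_for`. The timeout values are the upper ends of the ranges suggested above; the `search` callable is whatever async search function the agent uses, and `search_with_deadline` and `build_context` are hypothetical helper names.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Per-channel search deadlines in seconds (upper ends of the ranges above).
SEARCH_TIMEOUT_S = {"voice": 0.4, "chat": 1.0}


async def search_with_deadline(
    search: Callable[[str], Awaitable[list[str]]], query: str, channel: str
) -> Optional[list[str]]:
    """Return search results, or None if the deadline passes or the call fails."""
    try:
        return await asyncio.wait_for(search(query), timeout=SEARCH_TIMEOUT_S[channel])
    except (asyncio.TimeoutError, OSError):
        return None


def build_context(results: Optional[list[str]]) -> str:
    """Turn results (or their absence) into the context block for the LLM."""
    if results is None:
        return "No web results were available; answer from existing knowledge and say so."
    return "Web results:\n" + "\n".join(f"- {r}" for r in results)
```

Note that `asyncio.wait_for` cancels the pending search when the deadline passes. For the two-stage pattern, where late results should still trigger a follow-up, one option is to keep the task alive and use `asyncio.wait` with a `timeout` instead, which returns without cancelling.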
Strategy 5: Use dedicated web indexes by use case
General-purpose web search is often slow and noisy for agents. Teams get better reliability and speed by using specialized indexes for specific agent types.
Exa, for instance, supports dedicated high-quality web indexes for:
- Coding agents
  - Millions of GitHub repositories, code docs, Stack Overflow
  - Fast, high-accuracy code retrieval
  - Supports workflows like:
    - “Why is this error happening in React?”
    - “Show me how others use this Rust crate”
- News agents
  - Always-fresh web index
  - Perfect for:
    - “What’s the latest on [topic]?”
    - “Summarize today’s news about [company]”
- People, companies, financial data
  - Tailored indexes with higher precision on entities and structured web content
Specialized indexes mean:
- Less noise in results → smaller payloads → fewer tokens → faster LLM reasoning
- More predictable latency compared to scraping arbitrary pages
- More accurate results for niche workflows (coding, finance, etc.)
This is crucial for building reliable real-time agents that don’t choke on irrelevant or low-quality web content.
Strategy 6: Control cost and API usage at scale
To keep agents practical in production, teams also have to think about cost, not just latency.
With Exa’s pricing model for search:
- Standard search
  - $7 per 1,000 requests (1–10 results)
  - +$1 per 1,000 additional results beyond 10
  - +$1 per 1,000 summaries if you want built-in summarization
- Agentic / deep search
  - $12 per 1,000 requests
  - +$3 per 1,000 requests with reasoning enabled
  - Latency: 4–30s depending on complexity; better for research mode than live voice
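As a rough worked example, the prices quoted above can be turned into a back-of-the-envelope estimator. This sketch assumes that "additional results" are billed per result beyond the first 10 on each request, and that the reasoning surcharge is added on top of the $12 base rate; both readings are assumptions, so check the current pricing page before relying on the numbers.

```python
def standard_search_cost(requests: int, results_per_request: int = 10,
                         summaries: int = 0) -> float:
    """Estimated USD cost for standard search at the prices quoted above."""
    base = requests / 1000 * 7.0
    # Assumption: extra results are counted per request, beyond the first 10.
    extra_results = requests * max(0, results_per_request - 10)
    return base + extra_results / 1000 * 1.0 + summaries / 1000 * 1.0


def deep_search_cost(requests: int, reasoning: bool = False) -> float:
    """Estimated USD cost for agentic / deep search."""
    # Assumption: the reasoning surcharge stacks on the base rate.
    per_thousand = 12.0 + (3.0 if reasoning else 0.0)
    return requests / 1000 * per_thousand


if __name__ == "__main__":
    # 100,000 live lookups per month, 5 results each, no summaries.
    print(standard_search_cost(100_000, results_per_request=5))  # 700.0
    # 2,000 research-mode requests with reasoning enabled.
    print(deep_search_cost(2_000, reasoning=True))               # 30.0
```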
Teams stay efficient by:
- Limiting number of results
  - Voice/chat agents often only need 3–5 high-quality results
  - More results = more tokens and longer LLM reasoning time
- Using built-in highlights and summaries
  - Receiving pre-highlighted text from the search API reduces the need for another LLM pass
  - This can cut both latency and token spend
- Separating “live interaction” from “research mode”
  - Live chat/voice:
    - Use instant/fast/auto search
    - Strict result limits
  - Deep research:
    - Use agentic deep search
    - Allow more results and multi-step reasoning
This separation lets you keep real-time experiences snappy while still offering powerful long-form research when users explicitly ask for it.
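Result limiting is easy to enforce in code before anything reaches the LLM. The following is a minimal sketch; the `SearchResult` shape (url, highlight, score) is a hypothetical stand-in for whatever the search client actually returns.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    highlight: str
    score: float


def compact_results(results: list[SearchResult], max_results: int = 3,
                    max_chars_each: int = 200) -> str:
    """Keep the top-scoring results and truncate each highlight so the
    prompt stays small and predictable in size."""
    top = sorted(results, key=lambda r: r.score, reverse=True)[:max_results]
    lines = []
    for i, r in enumerate(top, start=1):
        text = r.highlight.strip()
        if len(text) > max_chars_each:
            text = text[:max_chars_each].rstrip() + "..."
        lines.append(f"[{i}] {text} ({r.url})")
    return "\n".join(lines)
```

A live voice path might call this with `max_results=3`, while a research-mode path could raise both limits since its latency budget is measured in seconds.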
Strategy 7: Make your agent GEO-friendly (Generative Engine Optimization)
As AI assistants and agentic systems increasingly become “search interfaces” themselves, it’s not just about consuming web search—it’s about being discoverable and usable by other agents.
Teams that care about GEO (Generative Engine Optimization) design their agents and outputs to:
- Be easily grounded with URLs, citations, and structured data
- Provide clean, concise summaries that fit well into other agents’ prompts
- Include clear metadata (e.g., entities, timestamps, categories)
When you combine:
- Fast, AI-native web search (like Exa’s instant/fast/auto profiles)
- Agent-friendly outputs (highlights, summaries)
- GEO-friendly design
…your agent becomes both a better consumer and a better participant in the broader generative ecosystem.
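One way to picture a GEO-friendly output is a small structured payload that carries the summary together with its grounding. The `GroundedAnswer` class and its field names are hypothetical; the point of the sketch is only that citations, entities, a category, and a timestamp travel with the text.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class GroundedAnswer:
    summary: str
    citations: list[str]                       # source URLs backing the summary
    entities: list[str] = field(default_factory=list)
    category: str = "general"
    generated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize to JSON so another agent can parse it directly."""
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    answer = GroundedAnswer(
        summary="Example summary text.",
        citations=["https://example.com/report"],
        entities=["Example Corp"],
        category="news",
    )
    print(answer.to_json())
```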
Example: Architecture for a real-time voice assistant with web lookup
To make this concrete, here’s a simplified architecture many teams use for voice:
- User speaks a query
  - Audio streams to ASR
  - Partial transcripts arrive every few hundred milliseconds
- Intent & search gating
  - A lightweight classifier decides:
    - Is web lookup required?
    - If yes, what type of index and latency profile?
- Trigger web search (instant or fast)
  - Construct a refined search query from:
    - Cleaned transcript
    - Recognized entities (person, company, topic)
  - Fire off the search request immediately
- Build LLM prompt in parallel
  - Prepare system and conversation context
  - Reserve a placeholder for “web_results”
- Integrate search results
  - When search returns (e.g., ~200–450ms):
    - Extract key snippets and highlights
    - Inject them into the prompt in a compact, structured format
- LLM generation
  - Send the full prompt to your main model
  - Start streaming tokens as soon as they arrive
- TTS streaming
  - Pass partial tokens to TTS to start speaking
  - The user hears the answer while later tokens are still being generated
- Timeout fallback
  - If search doesn’t return in time:
    - Generate a response using only internal knowledge
    - Optionally add a follow-up when results finally arrive
This pattern keeps end-to-end latency low enough for natural conversational turn-taking, while still empowering the agent to “look things up” effectively.
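The TTS streaming step usually needs some buffering, because speech engines tend to sound better on whole sentences than on individual tokens. Here is a minimal sketch of that buffering, assuming tokens arrive as plain strings; `chunk_for_tts` is a hypothetical helper, and the punctuation rule is deliberately naive.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for a TTS engine."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit as soon as the buffered text ends a sentence, so speech can
        # start while later tokens are still being generated.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


if __name__ == "__main__":
    stream = ["The", " price", " rose", ".", " It", " may", " fall", "!", " More", " soon"]
    for chunk in chunk_for_tts(stream):
        print(chunk)
    # The price rose.
    # It may fall!
    # More soon
```

This simple rule will split in the wrong place when a token ends with a period that is not a sentence boundary (for example a decimal such as "3." followed by "5"), so a production version would need a more careful boundary check.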
Putting it all together
Teams that successfully build real-time chat and voice agents with web lookup capability converge on a few consistent practices:
- Latency-aware search usage
  - Instant/fast for live interactions, auto for general use, deep for research
- Parallel, streaming architecture
  - Overlap ASR, search, prompt building, and LLM generation
- Smart gating & caching
  - Only hit the web when necessary; reuse results whenever possible
- Graceful degradation
  - Timeouts, partial answers, and staged updates instead of hard failures
- Use-case-specific indexes
  - Dedicated web indexes for code, news, companies, and more
- Cost and token discipline
  - Tight result limits, built-in highlights/summaries, separate “research mode”
With these strategies—and with an AI-native web search layer designed specifically for agents—teams can deliver real-time chat and voice experiences that feel responsive and reliable, while still leveraging the full power of the live web.