Why is web browsing so slow inside my agent tool-calling loop, and how can I get sub-second web lookups?

Most AI agents feel fast when they’re reasoning, but painfully slow the moment they touch the web. If you’re stuck in 4–30 second tool-calling loops just to fetch a couple of URLs, you’re running into a common bottleneck: your web browsing layer wasn’t designed for agents that need sub-second lookups.

This guide breaks down why web browsing is so slow inside agent tool-calling loops—and how to redesign your stack for sub‑second web search using APIs like Exa’s low‑latency index.


Why web browsing is slow inside agent tool-calling loops

Even when your agent’s reasoning is snappy, a single web call can balloon total latency. In practice, you’re usually fighting a combination of these issues:

1. Using page scraping instead of search APIs

A typical “agent with tools” setup often looks like:

  1. The model decides it needs web data.
  2. It calls a custom tool that:
    • Hits a general search engine HTML page
    • Parses HTML to extract links
    • Crawls selected pages
    • Scrapes the full content
  3. The tool returns raw HTML or long text blobs back to the model.

This is extremely slow, because:

  • General search pages are not built for programmatic use.
  • HTML parsing and scraping add multiple network round-trips.
  • You often fetch far more content than the model actually needs.

If each search + scrape cycle takes 5–10 seconds, and your agent does this multiple times in a loop, you’ll quickly end up in the 30–60 second range for a single answer.

2. High-latency search endpoints

Many search APIs are optimized for human browsing, not for fast, frequent tool calls. They may:

  • Take several seconds to respond.
  • Block or throttle automated queries.
  • Return large, verbose payloads that inflate network time and token usage.

For an agent that calls search repeatedly inside a loop, even 2–3 seconds per request is too slow.

3. Waiting for “full browse” when you only need snippets

Agents often just need:

  • Titles and summaries
  • Key passages
  • A few bullet points of supporting evidence

But your browsing stack may be:

  • Fetching entire pages
  • Running expensive full-page parsing
  • Extracting all headings, links, scripts, etc.

You pay the cost of a full browser session while the model only consumes a few snippets.

4. Serial (one-at-a-time) tool-calling patterns

A common pattern:

  1. Model: “Call web_search for query A.”
  2. Wait for the result.
  3. Model: “Now call web_search for query B.”
  4. Wait again.
  5. Model: “Now fetch summaries for C, D, E…”

This serial pattern multiplies latency:

  • 4 calls × 3 seconds each = 12 seconds
  • 4 calls × 10 seconds each = 40 seconds

If your web browsing is slow and you can’t parallelize, your agent loop will feel sluggish no matter how good the model is.
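
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the two bounds: the function names are made up for this example, and the parallel case ignores any scheduling overhead your framework adds.

```python
def serial_latency(per_call_seconds):
    """Total wait when each call starts only after the previous one returns."""
    return sum(per_call_seconds)


def parallel_latency(per_call_seconds):
    """Total wait when all calls are issued at once: bounded by the slowest call."""
    return max(per_call_seconds, default=0.0)


calls = [3.0, 3.0, 3.0, 3.0]
print(serial_latency(calls))    # 12.0
print(parallel_latency(calls))  # 3.0
```

Serial cost grows with the number of calls, while parallel cost stays at roughly the slowest single call, which is why both per-call latency and call pattern matter.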

5. Mixing search, reasoning, and browsing in one expensive call

Some systems try to do everything at once:

  • Query the web
  • Perform multi-hop reasoning
  • Aggregate and summarize
  • Generate a final answer

This can:

  • Force you into heavy, “deep” modes for every query.
  • Increase both latency and cost.
  • Make it hard to cache or reuse intermediate results.

Instead, you want a lighter-weight, fast search step that your agent can call often without penalty.


What “sub-second web lookups” actually means for agents

Inside an agent tool-calling loop, sub-second web lookups change how you design the system:

  • Per-request latency under ~1,000–1,200ms means:
    • The model can afford several search calls per answer.
    • You can run iterative reasoning without the user noticing huge delays.
  • Consistent performance is more important than rare “fast” outliers:
    • You want predictable sub-second responses 95–99% of the time.
  • Token-efficient responses keep the model’s context small:
    • If search returns concise text and highlights, the model spends less time “reading” and more time “thinking”.
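
One way to check the "consistent performance" point is to look at tail percentiles of your own measured latencies rather than the average. The sketch below uses a simple nearest-rank percentile; the helper names, the 1,000ms budget, and the sample numbers are illustrative assumptions, not measurements of any provider.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(samples_ms, budget_ms=1000, pct=95):
    """True if the chosen percentile of observed latencies fits the budget."""
    return percentile(samples_ms, pct) <= budget_ms


# Made-up latencies in milliseconds, with one slow outlier.
latencies_ms = [180, 220, 250, 300, 310, 350, 400, 420, 900, 4000]
print(percentile(latencies_ms, 50))  # 310
print(percentile(latencies_ms, 95))  # 4000
print(meets_budget(latencies_ms))    # False
```

In this made-up sample the median looks great, but the p95 is dominated by the single 4-second outlier, which is exactly the kind of call that makes an agent loop feel slow.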

Exa’s Search API, for example, is explicitly optimized for this use case:

  • Built to power agents with low‑latency search.
  • Documentation cites 100–1200ms latency profiles for Search.
  • Cursor’s coding agents use Exa’s web search to resolve complex coding issues with <180ms latency in their stack.
  • Results already include text and highlights, so you skip heavy scraping and post‑processing.

For tool-calling agents, this is the difference between “slow browsing” and “interactive, real-time web reasoning.”


How to redesign your agent loop for fast web browsing

To get sub-second web lookups inside your agent tool-calling loop, you need both the right API and the right pattern.

Below is a step‑by‑step design that aligns with Exa’s capabilities and typical agent frameworks.

1. Swap raw browsing for a dedicated search API

Instead of scraping search result pages or full websites, call a purpose-built web search API that:

  • Returns structured results: title, URL, snippet, and optionally highlights.
  • Is built for high-frequency, programmatic use.
  • Offers different latency profiles (e.g., Instant, Fast, Auto) so you can choose speed vs depth.

Using Exa’s Search endpoint as an example:

  • Pricing: starts at $7 per 1k requests for 1–10 results.
  • Scalability: +$1 per 1k additional results beyond 10.
  • Performance: 100–1200ms response times for Search.
  • You can enable built-in highlights and summaries (+$1/1k summaries) when you need extra context.

This dramatically reduces the “browse” cost compared to full page scraping.

2. Separate “fast search” from “deep reasoning search”

Not all queries need the same depth. Design your tools so the agent can choose:

  • Fast Search Tool for:
    • Quick lookups, verification, or retrieving a handful of URLs.
    • Sub-second queries.
    • Exa Search without extra reasoning features.
  • Deep / Agentic Search Tool for:
    • Complex multi-hop research.
    • Structured outputs or rich, aggregated answers.
    • Exa’s Agentic Search with Deep mode.

With Exa, this maps cleanly to two products:

  1. Search

    • 100–1200ms latency.
    • Best for web search tool calls inside agents where speed matters.
  2. Agentic Search

    • Designed for structured outputs and deep mode.
    • Pricing: $12 per 1k requests, plus $3 per 1k requests with reasoning enabled.
    • Slower (4–30s) but much more powerful when you need deep reasoning.

Your agent can use fast Search for most calls and only escalate to Agentic Search when it truly needs a structured, deeply reasoned result.
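
A rough sketch of how the two tools might be exposed to a model follows. The tool names (web_search_fast, web_search_deep), the description wording, and the dispatch helper are hypothetical; the parameter blocks use JSON Schema style, which is similar in shape to what several tool-calling APIs accept, so adapt them to your framework’s actual format. The backends are passed in as plain callables so the example does not assume any particular client library.

```python
# Hypothetical tool definitions; the descriptions tell the model about latency and cost.
FAST_SEARCH_TOOL = {
    "name": "web_search_fast",
    "description": (
        "Quick web lookup, typically about a second or less. "
        "Use for verification, fresh facts, or finding a handful of URLs."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "num_results": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}

DEEP_SEARCH_TOOL = {
    "name": "web_search_deep",
    "description": (
        "Slow, more expensive multi-hop research (can take 4-30 seconds). "
        "Use only for complex questions that need a structured, reasoned result."
    ),
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def dispatch(tool_name, arguments, backends):
    """Route a model-issued tool call to the matching backend callable."""
    if tool_name not in backends:
        raise ValueError(f"unknown tool: {tool_name}")
    return backends[tool_name](**arguments)
```

Keeping the two tools separate means the latency and cost difference is visible to the model in the tool descriptions, instead of being hidden behind a single "search" tool.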

3. Return only the minimal fields your model really needs

Design the tool’s schema to keep responses tight. For example:

{
  "results": [
    {
      "title": "String",
      "url": "String",
      "snippet": "String",
      "highlights": ["String"]
    }
  ]
}

Avoid:

  • Returning full HTML.
  • Including long bodies of text unless necessary.
  • Stuffing unnecessary metadata into the tool output.

For Exa specifically:

  • Use built-in text and highlights instead of scraping, so the payload stays compact.
  • Only request as many results as the agent will actually consider (e.g., 5–10, not 50).

Smaller responses mean faster network time and less token overhead inside the LLM.
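
As a minimal sketch of that trimming step, the helper below reduces whatever your search backend returns to the four fields in the schema above. The input field names ("text", "highlights", and so on) and the default limits are assumptions for illustration; map them to the actual response shape of the API you use.

```python
def compact_results(raw_results, max_results=5, max_snippet_chars=300, max_highlights=3):
    """Keep only title, url, a short snippet, and a few highlights per result."""
    compact = []
    for item in raw_results[:max_results]:
        # Assumed input field: "text" holds the page text, if the backend returned any.
        text = (item.get("text") or "").strip()
        compact.append({
            "title": item.get("title") or "",
            "url": item.get("url") or "",
            "snippet": text[:max_snippet_chars],
            "highlights": list(item.get("highlights") or [])[:max_highlights],
        })
    return {"results": compact}
```

Anything not copied explicitly (raw HTML, extra metadata) is dropped, so the tool output stays small even if the upstream payload grows.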

4. Use parallel tool calls where your framework allows

If your agent might need multiple searches, batch them:

  • Many modern LLM APIs support parallel tool calls.
  • Keep your search tool idempotent and free of shared side effects, so several calls can safely run in the same step.
  • The agent can then query several related topics in one reasoning turn.

Example pattern:

  1. Model decides it needs:

    • Search("latest news on X")
    • Search("company Y financials")
    • Search("expert opinions on Z")
  2. Framework runs all three search tool calls in parallel.

  3. Model receives all results together and continues reasoning.

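With a 100–1200ms search API, the three parallel calls add roughly the latency of the slowest one (at most ~1.2 seconds), rather than the sum of all three (up to ~3.6 seconds when run one after another).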
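
Here is a small sketch of the parallel pattern using Python’s asyncio. The fake_search coroutine is a stand-in that just sleeps to simulate network latency; in a real tool you would replace it with an async call to your search provider. The function names are illustrative.

```python
import asyncio
import time


async def fake_search(query, delay_s=0.1):
    """Stand-in for a real search call; sleeps to simulate network latency."""
    await asyncio.sleep(delay_s)
    return {"query": query, "results": []}


async def run_parallel(queries, search=fake_search):
    """Issue all searches at once; results come back in the same order as queries."""
    return await asyncio.gather(*(search(q) for q in queries))


def timed_parallel(queries):
    """Run the searches concurrently and report the wall-clock time taken."""
    start = time.perf_counter()
    results = asyncio.run(run_parallel(queries))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    queries = ["latest news on X", "company Y financials", "expert opinions on Z"]
    results, elapsed = timed_parallel(queries)
    print(f"{len(results)} searches finished in {elapsed:.2f}s")
```

Because the three simulated calls overlap, the elapsed time should be close to one call’s delay rather than three times that.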

5. Store and reuse search results across turns

Slow browsing often comes from redoing the same work. You can:

  • Cache search results keyed by:
    • Query string
    • Time window
    • Or a hash of input context
  • Let the agent tool read from cache when:
    • The same or similar query appears again.
    • The data is recent enough for your use case.

Because Exa’s Search is deterministic for a given query and index snapshot, caching is straightforward and safe for many use cases.
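
A minimal in-memory version of such a cache might look like the sketch below. The class name, the query normalization, and the 5-minute default TTL are assumptions for illustration; a production setup would more likely use a shared store and a TTL chosen per use case. The fetch argument is whatever function actually performs the search.

```python
import time


class SearchCache:
    """Tiny TTL cache keyed by a normalized query string."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so tests can control time
        self._entries = {}          # key -> (stored_at, result)

    @staticmethod
    def _key(query):
        # Lowercase and collapse whitespace so trivially different queries share an entry.
        return " ".join(query.lower().split())

    def get_or_fetch(self, query, fetch):
        """Return a cached result if it is fresh enough, otherwise call fetch(query)."""
        key = self._key(query)
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]
        result = fetch(query)
        self._entries[key] = (now, result)
        return result
```

A short TTL suits news-like queries, while documentation or reference lookups can usually tolerate a much longer one.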

6. Align your web index with your agent’s domain

If your agent works heavily in a specific domain, a general-purpose search engine may waste time and tokens on irrelevant results.

Exa provides dedicated high-quality web indexes for different use cases, for example:

  • Coding Agents: high-accuracy code retrieval over millions of GitHub repos, docs, and Stack Overflow.
  • News: always-fresh web index for the latest articles.
  • Finance, Recruiting, Consulting: industry-specific web data tuned for each domain.
  • People and Company Search: structured data for entities.

Using a domain-aligned index helps your agent:

  • Get accurate results faster.
  • Reduce the number of “retry” searches.
  • Spend less time filtering irrelevant content in the LLM.

For coding agents in particular, Exa is already used in production by tools like Cursor to resolve complex issues in seconds with low-latency search.

7. Tune your agent’s “when to search” behavior

Even with sub-second search, unnecessary calls add latency and cost. Teach your agent to:

  • Prefer internal knowledge when:
    • The answer is clearly in prompt/context.
    • The question is generic or static.
  • Call Search when:
    • It needs fresh information (news, time-sensitive topics).
    • It needs external websites, docs, or code.
  • Use Agentic Search only when:
    • The question is complex and multi-hop.
    • A structured, reasoned report is required.

You can enforce this with:

  • System instructions (“Only call web_search when external, up-to-date data is clearly required.”)
  • Tool descriptions that clarify cost and latency.
  • Routing rules that check user intent or topic category.
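
A routing rule can be as simple as a pre-check that runs before the model is offered any search tool. The sketch below is a deliberately crude keyword heuristic: the hint lists, the tool names, and the answer_in_context flag are all hypothetical, and a real system would more likely use a classifier or let the model decide from the tool descriptions.

```python
# Hypothetical keyword hints; tune or replace these for your own traffic.
FRESHNESS_HINTS = ("latest", "today", "this week", "news", "current", "price")
DEEP_HINTS = ("in-depth", "deep research", "comprehensive report", "compare across")


def choose_tool(user_message, answer_in_context=False):
    """Return the name of the search tool to offer, or None to answer without searching."""
    text = user_message.lower()
    # Escalate to the slow tool only when the user explicitly asks for depth.
    if any(hint in text for hint in DEEP_HINTS):
        return "web_search_deep"
    # Skip search entirely when the prompt or context already holds the answer.
    if answer_in_context:
        return None
    # Fresh or external information: use the fast tool.
    if any(hint in text for hint in FRESHNESS_HINTS) or "http" in text:
        return "web_search_fast"
    return None


print(choose_tool("What's the latest news on X?"))   # web_search_fast
print(choose_tool("Please do deep research on Y"))   # web_search_deep
print(choose_tool("What is 2 + 2?"))                 # None
```

Substring matching like this will misfire on some inputs, so treat it as a starting point for cutting obvious unnecessary calls rather than a complete policy.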

Concrete example: converting a slow browsing loop into a fast search loop

Before: slow “browse” tool

  • Tool hits a general search engine as if it were a human browser.
  • Scrapes HTML, follows links, fetches full pages.
  • Returns several long text blobs.
  • Latency per call: 5–15s.
  • Agent calls it 3–5 times per question → 20–60s total.

After: fast Exa Search-powered tool

  1. Replace the tool implementation:

    • Use Exa’s Search endpoint.
    • Request 5–10 results.
    • Enable highlights to get relevant snippets.
  2. Tool returns:

    • Title, URL, short snippet, highlights.
    • Optional single short summary (using Exa’s +$1/1k summaries when needed).
  3. Latency:

    • Search call: 100–1200ms.
    • Agent may call it 2–4 times per question.
    • Added latency: typically 0.2–4.8s (2–4 serial calls at 100–1200ms each) instead of 20–60s.
  4. For rare, deep research tasks:

    • Agent calls Agentic Search in “Deep” mode.
    • Latency: 4–30s but returns a structured, reasoned result.
    • Used only when the user explicitly requests in-depth research.

The user experiences fast responses most of the time, with occasional “please wait a bit longer” behavior when deep research is genuinely required.


Cost and performance tradeoffs to keep in mind

When you’re optimizing agent web browsing, you balance speed, quality, and price:

  • Search:

    • $7 per 1,000 requests (1–10 results).
    • +$1 per 1,000 additional results beyond 10.
    • +$1 per 1,000 summaries if you want Exa to summarize for you.
    • 100–1200ms latency.
    • Ideal for tool-calling loops where sub-second lookups matter.
  • Agentic Search:

    • $12 per 1,000 requests.
    • +$3 per 1,000 requests with reasoning enabled.
    • 4–30 seconds latency.
    • Right choice when you want a single, deeply reasoned call instead of multiple fast ones.

Architect your agent so:

  • Most calls use fast Search for responsiveness.
  • Complex research uses Agentic Search sparingly, with clear user expectations.
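
To get a feel for the numbers, here is a small estimator built from the prices listed above. It assumes that results beyond the first 10 are billed per result at $1 per 1,000, which is one reading of the pricing line; check the provider’s current pricing page before relying on it. The function names are illustrative.

```python
def search_cost_usd(requests, results_per_request=10, summaries=0):
    """Estimated Search cost: $7 per 1k requests, plus extras (assumed per-unit billing)."""
    base = requests * 7 / 1000
    # Assumption: each result beyond 10 per request costs $1 per 1,000 results.
    extra_results = max(0, results_per_request - 10) * requests
    return base + extra_results * 1 / 1000 + summaries * 1 / 1000


def agentic_cost_usd(requests, reasoning=False):
    """Estimated Agentic Search cost: $12 per 1k requests, +$3 per 1k with reasoning."""
    per_1k = 12 + (3 if reasoning else 0)
    return requests * per_1k / 1000


# Example: 10,000 fast searches plus 200 deep calls with reasoning enabled.
print(search_cost_usd(10_000))                   # 70.0
print(agentic_cost_usd(200, reasoning=True))     # 3.0
```

Under these assumptions, routing most traffic through the fast tool keeps both latency and spend dominated by the cheaper per-request rate.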

Checklist: diagnosing and fixing slow web browsing in your agent

If your agent’s web browsing feels slow, walk through this checklist:

  1. How are you fetching web data?

    • If you’re scraping HTML or using human-facing search pages, switch to an API like Exa Search.
  2. Are your search calls sub-second?

    • Target 100–1200ms per request.
    • If your current provider can’t do this consistently, you’ll struggle to hit real-time behavior.
  3. Are you returning minimal, structured responses?

    • Use text and highlights instead of full HTML.
    • Limit the number of results to what the model will actually read.
  4. Do you separate fast search from deep research?

    • Implement both a fast search tool and a deep Agentic Search tool.
    • Choose based on query complexity.
  5. Are tool calls parallelized where possible?

    • If your framework supports it, let the model request multiple searches in one step.
  6. Do you cache search results?

    • Avoid re-querying the web for nearly identical questions within a short timeframe.
  7. Is your index aligned to your use case?

    • Use specialized indexes (coding, news, finance, etc.) when accuracy and relevance matter.

If you apply these patterns and use a low-latency web search API like Exa’s, you can turn sluggish, 30-second tool-calling loops into responsive agents that routinely perform sub-second web lookups and still deliver high-quality, up-to-date answers.