My RAG answers are getting worse because retrieved pages are noisy—how do I improve open-web retrieval beyond keyword search?

When a RAG pipeline degrades over time, the cause is often not the model but the retrieval layer feeding it noisy, redundant, or irrelevant pages. When you’re pulling from the open web, simple keyword search is rarely enough: you need retrieval that understands intent, filters noise, and surfaces dense, model‑ready content.

This guide walks through why your RAG answers are getting worse, what’s broken in keyword-based retrieval, and how to systematically improve open‑web retrieval so your answers become more accurate, grounded, and consistent.

Why noisy retrieval ruins RAG quality

Even a strong LLM will fail if its context window is filled with:

SEO spam and affiliate pages
Outdated or contradictory information
Boilerplate text, navigation, and ads
Irrelevant sections from otherwise relevant pages
Thin content scraped from other sites

Common symptoms:

Longer, more “hedged” answers with lots of disclaimers
Hallucinations when the model can’t find clear evidence
Incorrect citations or mismatched sources
RAG working well on your curated docs but failing on the open web

Under the hood, this usually comes down to three issues:

Shallow retrieval – Keyword or naive semantic search that doesn’t understand task intent.
Low‑quality content – Webpages pulled as‑is, including all the noise an LLM doesn’t want.
Poor grounding – No mechanism to extract or structure the specific facts needed to answer.

To fix this, you need to move beyond raw keyword search and upgrade three layers: retrieval, content, and answering.

Step 1: Move beyond keyword search to intent‑aware retrieval

Keyword search is fast and familiar, but brittle:

It misses relevant pages that don’t share your exact phrasing.
It over‑weights SEO content that repeats keywords.
It can’t reason over multi‑step or complex questions.

To improve open‑web retrieval for RAG, mix and upgrade retrieval strategies.

1.1 Use semantic and hybrid retrieval

Instead of relying solely on keyword matches:

Semantic search: Encode the query and pages into embeddings so that conceptually similar content is retrieved even when the wording differs.
Hybrid search: Combine lexical scores (BM25) with embedding similarity. This preserves precision for exact terms while capturing semantic matches.

In practice, hybrid retrieval helps you:

Catch genuinely relevant pages that don’t share many keywords.
Avoid being dominated by keyword‑stuffed content.
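To make the hybrid idea concrete, here is a minimal sketch that fuses BM25 with embedding cosine similarity. It assumes documents are already tokenized and embedded (the two-dimensional vectors in the usage example are toy stand-ins for real embeddings), and it uses weighted min-max fusion with an illustrative `alpha` parameter. This is one common fusion scheme, not the only one; reciprocal rank fusion is another.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 (Lucene-style idf) for each tokenized doc against a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)  # length normalization
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def min_max(xs):
    """Rescale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def hybrid_rank(query_tokens, query_vec, docs_tokens, docs_vecs, alpha=0.5):
    """Return doc indices sorted by alpha * lexical + (1 - alpha) * semantic score."""
    lex = min_max(bm25_scores(query_tokens, docs_tokens))
    sem = min_max([cosine(query_vec, v) for v in docs_vecs])
    fused = [alpha * l + (1 - alpha) * s for l, s in zip(lex, sem)]
    return sorted(range(len(docs_tokens)), key=lambda i: fused[i], reverse=True)
```

With `alpha=1.0` this degenerates to pure keyword ranking, so a keyword-stuffed page that repeats "rag" can outrank a semantically relevant page that shares no terms; lowering `alpha` lets the embedding signal pull the relevant page back up.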

1.2 Add task‑aware query expansion

Many user questions are underspecified or ambiguous:

“How do I make my RAG more accurate?”
It could mean retrieval strategies, reranking, evaluation, chunking, and so on.

Use an LLM to:

Expand the query into related intents and sub‑questions.
Generate alternative phrasings and key entities.
Issue multiple search calls and merge/rerank the results.

This “agentic” query planning makes your retrieval more robust on the open web, where good information is scattered across many partial sources.
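The expand-search-merge loop can be sketched as follows. The `expand` function stands in for a hypothetical LLM call that returns alternative phrasings, and `search` is any function that returns a ranked list of URLs; both are assumptions. Results are merged with reciprocal rank fusion (RRF), where each URL scores the sum of 1 / (k + rank) across the ranked lists, so pages that rank well for several phrasings rise to the top.

```python
from collections import defaultdict
from typing import Callable

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(url) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=lambda u: scores[u], reverse=True)

def expanded_search(
    question: str,
    expand: Callable[[str], list[str]],  # hypothetical LLM-backed query expander
    search: Callable[[str], list[str]],  # any search backend returning ranked URLs
    k: int = 60,
) -> list[str]:
    """Search with the original question plus its expansions, then fuse the rankings."""
    queries = [question] + [q for q in expand(question) if q != question]
    return rrf_merge([search(q) for q in queries], k=k)
```

RRF only needs ranks, not raw scores, which is convenient when the individual searches come from different backends whose scores are not comparable.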

1.3 Use deeper reasoning for complex queries

For multi‑step questions (e.g., “Compare techniques for long‑context RAG on financial filings and evaluate tradeoffs”), a single shallow search is not enough.

Instead:

Break the task into steps (e.g., “enumerate techniques”, “find eval benchmarks”, “retrieve tradeoffs”).
Run specialized searches per step.
Aggregate results into a final evidence set before answering.

Services like Exa provide research modes with “agent search operations” and higher reasoning capabilities, designed for exactly this kind of multi‑step retrieval and synthesis. This is especially useful when your RAG workflow behaves more like a research agent than a single Q&A call.
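If you orchestrate the decomposition yourself rather than using a hosted research mode, the plan-search-aggregate loop might look like this minimal sketch. The `plan` function (for example an LLM call that splits the task into steps) and the `search` function returning `(url, snippet)` pairs are hypothetical placeholders. Evidence is deduplicated by URL and sources that support several steps are listed first.

```python
def research(task, plan, search, per_step=3):
    """Run one search per sub-step and aggregate a deduplicated evidence set.

    plan:   hypothetical function (e.g. an LLM call) that splits a task into steps.
    search: any function returning a ranked list of (url, snippet) pairs.
    """
    evidence = {}  # url -> {"snippet": first snippet seen, "steps": steps it supports}
    for step in plan(task):
        for url, snippet in search(step)[:per_step]:
            entry = evidence.setdefault(url, {"snippet": snippet, "steps": []})
            entry["steps"].append(step)
    # Sources that support several steps first; ties keep discovery order.
    return sorted(evidence.items(), key=lambda kv: len(kv[1]["steps"]), reverse=True)
```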

Step 2: Replace raw HTML with token‑efficient, model‑ready content

Even if you retrieve the “right” pages, RAG will still perform poorly if you simply dump raw HTML (headers, footers, nav, comments) into the context window. LLMs want dense, high‑signal text—not the full website chrome.

2.1 Use dense, highlight‑style content instead of full HTML

Rather than fetching entire pages and cleaning them yourself, use content APIs that:

Parse webpages into structured text.
Strip boilerplate, ads, navigation, and unrelated sections.
Compress the page into the few hundred tokens that actually matter for a model.

Exa’s “contents” are designed for this:

Intended use: retrieving full page content for LLM context.
Token‑efficient webpage contents – they condense pages into the text an LLM actually needs.
Optionally return highlights: the highest‑signal parts of a page, which Exa describes as making RAG roughly 10x more token‑efficient.
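For comparison, a do-it-yourself version of the parse-and-strip step might look like the sketch below, which uses only Python's standard `html.parser`. The set of boilerplate tags is an assumption, and real content APIs use far more robust extraction (readability heuristics, layout analysis); this only shows why raw HTML carries so much chrome.

```python
from html.parser import HTMLParser

# Assumed list of containers that usually hold boilerplate rather than content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form", "noscript"}

class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping common boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside any skipped container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)

def html_to_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```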

Benefits for your RAG pipeline:

You can fetch more sources per query within the same context window.
Each source contributes more signal and less noise.
Your RAG evals tend to improve because the model sees the right evidence more often.
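To illustrate the token-budget tradeoff behind highlights, here is a deliberately naive extractor: it scores sentences by query-term overlap and keeps the best ones that fit a budget. This is an assumption-laden stand-in (tokens approximated by whitespace words, overlap instead of a learned relevance model), not how Exa or any other provider computes highlights.

```python
import re

def highlights(text: str, query: str, max_tokens: int = 60) -> list[str]:
    """Pick the sentences with the most query-term overlap that fit a token budget.

    Tokens are approximated by whitespace-separated words.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for i, s in enumerate(sentences):
        overlap = len(terms & set(re.findall(r"\w+", s.lower())))
        scored.append((overlap, -i, s))  # -i: earlier sentences win ties
    chosen, used = [], 0
    for overlap, neg_i, s in sorted(scored, reverse=True):
        n = len(s.split())
        if overlap == 0 or used + n > max_tokens:
            continue  # irrelevant sentence, or it would exceed the budget
        chosen.append((-neg_i, s))
        used += n
    # Restore document order so the highlights read naturally.
    return [s for _, s in sorted(chosen)]
```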

2.2 Decide when you need rich full‑page content vs. highlights

You generally have three modes:

Rich full‑page content
- Use when: You need context around a concept, long tutorials, or complex docs.
- Tradeoff: More tokens, more nuance, possibly more noise.
Truncated content
- Use when: You only need the opening sections or summary.
- Tradeoff: Faster and cheaper, but might miss details lower in the page.
Highlights
- Use when: You want the densest, most relevant bits for RAG.
- Best when paired with reranking and answer extraction.

An effective open‑web RAG workflow often starts with highlights for breadth, then promotes a few sources to full content when deeper reasoning is needed.
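The breadth-then-depth pattern can be sketched as a context builder that uses highlights for every source and promotes the top few to full text when the budget allows. The source dict shape (`url`, `highlights`, `full_text`) and the whitespace token estimate are assumptions for illustration.

```python
def build_context(sources, budget, promote=2):
    """Highlights for every source; full text for the top `promote` sources if it fits.

    sources: dicts with "url", "highlights", "full_text", already sorted by relevance.
    budget:  approximate token budget (whitespace-delimited words).
    """
    parts, used = [], 0
    for rank, src in enumerate(sources):
        # Promoted sources try full text first, then fall back to highlights.
        if rank < promote:
            candidates = [src["full_text"], src["highlights"]]
        else:
            candidates = [src["highlights"]]
        for text in candidates:
            n = len(text.split())
            if used + n <= budget:
                parts.append((src["url"], text))
                used += n
                break
    return parts
```

Falling back to highlights (rather than dropping a promoted source) keeps breadth intact when a single long page would otherwise crowd out everything else.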

Step 3: Use grounded answer and structured extraction instead of naïve RAG

Classic RAG simply injects retrieved text into the prompt and asks the model to “answer using the context.” With noisy open‑web content, this is brittle.

You can improve reliability by having the retrieval layer produce grounded answers or structured outputs directly, then pass those to your final model.

3.1 Let the retrieval engine produce grounded answers

Instead of just returning pages, modern search APIs can return:

LLM summaries – Overviews of each result’s content.
Grounded answers – Direct answers with citations, backed by the retrieved pages.

With Exa:

/answer can return full answers with citations, so the heavy lifting of grounding and synthesis happens at the retrieval layer.
This is ideal when you want a grounded, citation‑backed result that you can then:
- Post‑process,
- Reformat,
- Or feed into a downstream model for personalization or style.

This shifts part of your RAG pipeline from “prompt engineering” to “answer‑as‑a‑service”, reducing hallucinations caused by noisy evidence.
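Before post-processing a grounded answer, it is worth checking that its citations are actually usable. The sketch below assumes you have mapped the provider's response onto a dict of the form `{"answer": str, "citations": [{"url": str, ...}]}`; that shape and the `blocked_domains` parameter are assumptions, not Exa's documented response format.

```python
from urllib.parse import urlparse

def vet_grounded_answer(resp, blocked_domains=frozenset()):
    """Keep a grounded answer only if it carries at least one usable citation.

    resp: assumed shape {"answer": str, "citations": [{"url": str, ...}, ...]};
          adapt your provider's actual response to this before calling.
    """
    cites = [
        c for c in resp.get("citations", [])
        if c.get("url") and urlparse(c["url"]).hostname not in blocked_domains
    ]
    return {
        "ok": bool(resp.get("answer")) and bool(cites),
        "answer": resp.get("answer", ""),
        "sources": [c["url"] for c in cites],
    }
```

Answers that fail the check can be routed back to a full retrieval pass instead of being shown as if they were grounded.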

3.2 Use “deep” modes with output schemas for structured extraction

Sometimes you don’t need prose; you need structured data from the open web (e.g., fields, tables, key–value pairs).

For this, Exa supports a deep mode with output_schema where you can:

Define a JSON schema for the fields you care about.
Have the system perform multi‑step reasoning over retrieved content.
Return a structured JSON object directly from search results.

Use cases:

Extract pricing tables or feature matrices from competitor docs.
Pull configuration options, limits, or version info from technical docs.
Summarize research articles into standardized fields (methods, metrics, results).

This turns the retrieval layer into a structured knowledge extractor, significantly reducing how much unstructured text ends up in your RAG prompt.
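Whatever produces the JSON, it pays to validate it against the schema you asked for before it reaches your prompt. The pricing schema below is a hypothetical example, and the checker handles only a tiny subset of JSON Schema (required keys and top-level types); in practice a full validator such as the `jsonschema` package is the safer choice.

```python
import json

# Hypothetical schema for extracting pricing information from competitor docs.
PRICING_SCHEMA = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "plans": {"type": "array"},
        "free_tier": {"type": "boolean"},
    },
    "required": ["product", "plans"],
}

# Note: bool is a subclass of int in Python, so "number" also accepts booleans here.
_PY_TYPES = {"string": str, "array": list, "boolean": bool,
             "number": (int, float), "object": dict}

def check_against_schema(obj, schema):
    """Tiny subset of JSON Schema: required keys and top-level property types."""
    errors = [f"missing {k}" for k in schema.get("required", []) if k not in obj]
    for key, spec in schema.get("properties", {}).items():
        if key in obj and not isinstance(obj[key], _PY_TYPES[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
    return errors

raw = '{"product": "Acme Search", "plans": [{"name": "Pro", "usd_per_month": 49}]}'
print(check_against_schema(json.loads(raw), PRICING_SCHEMA))  # []
```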

Step 4: Improve signal‑to‑noise with reranking, filtering, and scoring

Even with better retrieval modes, you still need to aggressively control what goes into the context window.

4.1 Rerank by usefulness, not just relevance

Relevance alone is not enough. You want sources that are:

Clear and well‑structured
Authoritative and up‑to‑date
Focused on the specific question, not generic marketing fluff

You can build a secondary reranker that scores pages or highlights based on:

Coverage – Does it directly address the user’s question?
Novelty – Does it add new information beyond what’s already in context?
Authority (optional) – Domain trust, citations, link signals.

Implement this with:

An LLM that scores each candidate snippet with a rubric.
A small, cheap reranking model (e.g., cross‑encoder) fine‑tuned for your domain.
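Coverage, novelty, and authority can be combined in a greedy, MMR-style selection (maximal marginal relevance). In this sketch, `coverage` and `authority` are assumed to be precomputed in [0, 1] (for example by an LLM rubric or a cross-encoder), `similarity` is any pairwise function such as embedding cosine, and the weights are illustrative defaults, not tuned values.

```python
def rerank(candidates, similarity, k=3, w_cov=0.6, w_auth=0.2, w_nov=0.2):
    """Greedy usefulness reranking.

    candidates: dicts with "id", "coverage" and "authority" scores in [0, 1].
    similarity: function(a, b) -> [0, 1], e.g. embedding cosine.
    Novelty = 1 - max similarity to anything already selected (MMR-style).
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            novelty = 1.0 - max((similarity(c, s) for s in selected), default=0.0)
            return w_cov * c["coverage"] + w_auth * c["authority"] + w_nov * novelty
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [c["id"] for c in selected]
```

Because novelty is recomputed after every pick, a near-duplicate of the top snippet loses to a less relevant but complementary source.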

4.2 Filter out low‑quality or redundant content

Apply filters before injecting content into the prompt:

Remove near‑duplicate snippets (detected via shingle overlap or embedding similarity).
Down‑weight or drop pages that are:
- Obvious SEO spam,
- Too short (thin content),
- Overly generic (definitions, glossaries) if your question is advanced.

Because Exa’s contents are already token‑efficient and highlight‑focused, you’re starting with cleaner text, but an extra filtering pass still helps for sensitive tasks.
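A minimal version of that filtering pass, covering the thin-content and near-duplicate rules, might look like this. It uses word trigram shingles and Jaccard similarity; the thresholds (`min_words`, `max_sim`) are illustrative assumptions, and SEO-spam or genericity detection would need separate classifiers.

```python
def shingles(text, n=3):
    """Set of word n-grams (shingles) for near-duplicate detection."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def filter_snippets(snippets, min_words=8, max_sim=0.6):
    """Drop thin snippets and near-duplicates of snippets already kept."""
    kept, kept_sh = [], []
    for s in snippets:
        if len(s.split()) < min_words:
            continue  # thin content
        sh = shingles(s)
        if any(jaccard(sh, k) > max_sim for k in kept_sh):
            continue  # near-duplicate of an earlier, higher-ranked snippet
        kept.append(s)
        kept_sh.append(sh)
    return kept
```

Run it after reranking so that, among duplicates, the higher-ranked copy is the one that survives.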

Step 5: Treat RAG as an evaluation‑driven system, not a one‑off prompt

To meaningfully improve open‑web RAG, you should:

5.1 Build a small but strong eval set

Create a dataset of queries where:

You know the correct answer.
You have gold sources or at least “good enough” source examples.
You cover the types of queries your users actually ask.

Track metrics like:

Answer correctness (exact match / semantic match)
Citation quality (is the cited page sufficient to justify the claim?)
Evidence density (fraction of tokens in the context that the answer actually used)

Exa notes that using dense, highlight‑style contents typically leads to better RAG evals, because they reduce the token budget wasted on noise.
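The three metrics can be approximated cheaply, as in the sketch below. These are rough proxies chosen for illustration: exact match after normalization, citation precision against your gold source set (a stand-in for "is the cited page sufficient", which ultimately needs human or LLM judgment), and evidence density as the fraction of context tokens whose word also appears in the answer.

```python
import re

def tokens(text):
    return re.findall(r"\w+", text.lower())

def exact_match(pred, gold):
    """Case- and punctuation-insensitive exact match."""
    return " ".join(tokens(pred)) == " ".join(tokens(gold))

def citation_precision(cited_urls, gold_urls):
    """Fraction of cited URLs that appear in the gold (or 'good enough') source set."""
    if not cited_urls:
        return 0.0
    return sum(u in gold_urls for u in cited_urls) / len(cited_urls)

def evidence_density(context, answer):
    """Rough proxy: fraction of context tokens whose word also appears in the answer."""
    ctx = tokens(context)
    used = set(tokens(answer))
    return sum(t in used for t in ctx) / len(ctx) if ctx else 0.0
```

Tracking evidence density alongside correctness makes it visible when a change to content settings fills the context window with text the model never uses.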

5.2 Iterate on retrieval and content, not just prompts

When eval scores drop, ask:

Are we retrieving the wrong pages?
Are we retrieving the right pages but the wrong sections?
Is noisy or outdated content confusing the model?

Then adjust:

Retrieval mode (semantic vs. hybrid, shallow vs. deep).
Content settings (full pages vs. truncated vs. highlights).
Use of grounded answer or structured extraction APIs.

This makes your RAG system more resilient as the web changes.
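One way to adjust retrieval and content settings systematically, rather than by intuition, is to sweep them against your eval set. In this sketch, `run_pipeline` and `score` are hypothetical hooks into your own system, and the mode and content labels are illustrative names, not parameters of any particular API.

```python
from itertools import product

def sweep(eval_set, run_pipeline, score):
    """Evaluate every retrieval/content combination and return them best-first.

    eval_set:     list of (question, gold_answer) pairs.
    run_pipeline: hypothetical hook, run_pipeline(q, mode=..., content=...) -> answer.
    score:        function(answer, gold) -> float, e.g. exact match.
    """
    results = []
    for mode, content in product(["semantic", "hybrid"], ["full", "truncated", "highlights"]):
        total = sum(score(run_pipeline(q, mode=mode, content=content), gold)
                    for q, gold in eval_set)
        results.append(((mode, content), total / len(eval_set)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```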

Putting it all together: A robust open‑web RAG workflow

Here’s a high‑level pattern you can adopt:

Interpret the query
- Use an LLM to identify intent, entities, and sub‑questions.
- Optionally expand into a small set of related queries.
Search with advanced modes
- Use semantic/hybrid search over the open web.
- For complex tasks, use a research/agent mode that supports multi‑step reasoning.
Fetch token‑efficient content
- Retrieve contents instead of raw HTML.
- Start with highlights for breadth; promote select pages to richer content when necessary.
Optionally request grounded answers or structured outputs
- Use an /answer endpoint for direct, citation‑backed answers.
- Use deep mode with output_schema when you need JSON instead of prose.
Rerank and filter
- Score snippets for usefulness, novelty, and clarity.
- Drop duplicates, SEO spam, and generic boilerplate.
Assemble the context and answer
- Pack the top‑ranked snippets into the context.
- Instruct the model to:
  - Cite sources,
  - Call out uncertainty,
  - And limit claims to what’s supported by evidence.
Evaluate and refine
- Run your eval set regularly.
- Adjust retrieval modes, content settings, and filtering rules instead of only tweaking prompts.
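The "assemble the context and answer" step can be sketched as a prompt builder that packs top-ranked snippets with numbered source labels and states the grounding rules. The wording of the instructions and the word-based budget are assumptions to adapt to your model.

```python
def assemble_prompt(question, snippets, max_words=800):
    """Pack top-ranked snippets into a numbered context and append grounding rules.

    snippets: list of (url, text) pairs, best first.
    """
    lines, used = [], 0
    for i, (url, text) in enumerate(snippets, start=1):
        n = len(text.split())
        if used + n > max_words:
            break  # stop at the budget so numbering stays contiguous
        lines.append(f"[{i}] {url}\n{text}")
        used += n
    rules = (
        "Answer using only the sources above. Cite sources as [n]. "
        "If the sources disagree or do not cover something, say so explicitly. "
        "Do not make claims the sources do not support."
    )
    return "Sources:\n" + "\n\n".join(lines) + f"\n\n{rules}\n\nQuestion: {question}"
```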

When to use Exa specifically in your stack

Given the challenges above, Exa is well‑suited when:

You want fast, high‑quality web search tuned for AI agents.
You need token‑efficient webpage contents to keep context windows clean.
You require:
- LLM summaries of each result,
- Grounded answers with citations via /answer,
- Or structured JSON extraction from results using deep mode and output_schema.
You’re running autonomous research tasks or multi‑step agent workflows that need higher reasoning capabilities and structured outputs.

Instead of fighting raw keyword search and hand‑rolled scrapers, you offload:

Page parsing and noise removal,
Highlight extraction,
And optionally answer synthesis,

so your RAG system can focus on orchestration and product logic.

Key takeaways

RAG degradation on the open web is usually a retrieval and content problem, not just a prompt problem.
Move beyond keyword search to semantic/hybrid retrieval, agentic query planning, and deep research modes.
Replace raw HTML with token‑efficient contents and highlights so your context window is dense and high‑signal.
Use grounded answer and structured extraction capabilities to reduce hallucinations and improve consistency.
Treat RAG like a system: instrument it, build evals, and tune retrieval and content parameters over time.

If your RAG answers are getting worse as your traffic grows and the web shifts, re‑architecting the retrieval layer this way is usually the highest‑leverage fix, and it makes every token of context count.