How can I ground an LLM on a webpage without stuffing raw HTML into the context window?

Most teams trying to ground LLMs on the live web end up doing the same thing I did for years: pull a page with Playwright, shovel the raw HTML into the prompt, and hope the model can make sense of it. It works for toy demos, but it explodes context windows, increases latency, and leads to brittle, hallucination‑prone agents in production.

Quick Answer: Instead of stuffing raw HTML into the context window, ground your LLM on structured JSON extracted from the page. Use a schema‑first query layer (like AgentQL) to define the shape of the data you need, let AI locate it on the page, and feed the resulting JSON into your LLM as compact, reliable context.

Why This Matters

LLMs don’t “think in HTML.” When you give them reams of markup, they spend tokens on layout noise instead of reasoning about the data you care about. That drives up cost, slows responses, and makes your web agents fragile whenever the DOM changes.

Grounding on structured JSON instead:

  • Keeps prompts small and predictable.
  • Reduces hallucinations by removing layout noise and boilerplate.
  • Makes your web automation behave more like an API contract: query in, JSON out.

Key Benefits:

  • Smaller, cheaper prompts: JSON with only the fields you need is a fraction of the size of full HTML, so you stay well within context limits and cut token costs.
  • More reliable grounding: Schema‑first extraction removes ads, navigation, and layout noise, improving the signal your LLM sees and making answers more trustworthy.
  • Reusable, self‑healing workflows: A good query layer survives DOM tweaks and dynamic content, so you don’t rewrite scrapers every time a website nudges its UI.

Core Concepts & Key Points

Concept | Definition | Why it's important
Schema‑first grounding | Defining the exact shape of the data you want from a page (e.g., products[] { name price url }) before extraction. | Turns the web into predictable JSON for your LLM, just like an API contract. Less noise, more control.
AI‑powered selectors | Using AI to analyze page structure and semantics instead of hand‑rolled XPath/CSS selectors. | Eliminates fragile selectors that break on minor DOM changes and lets you reuse the same query across similar pages.
HTML‑free context | Feeding only structured JSON (plus minimal metadata) into the LLM rather than raw HTML. | Keeps context windows under control, reduces hallucinations, and simplifies your prompt templates.

How It Works (Step‑by‑Step)

At a high level, grounding an LLM on a webpage without raw HTML looks like this:

  1. Fetch the page in a browser (real or headless).
  2. Run a structured query against it to get JSON.
  3. Pass that JSON into your LLM as the grounding context.

1. Define the shape of your data with a query

Start from the job‑to‑be‑done: what does the LLM actually need to know from the page?

For example, say you want to ground an LLM on a product listing page so it can answer comparison questions. Instead of pulling all HTML, you define an AgentQL query like:

{
  products[] {
    product_name
    product_price(include currency symbol)
    product_rating(optional)
    product_url
  }
}

This tells the extraction engine:

  • “Find all products on this page.”
  • “For each, find the name, price, rating if present, and URL.”
  • “Return exactly those fields as structured JSON.”

Because AgentQL uses AI to analyze the page’s structure instead of hard‑coding XPath/DOM/CSS selectors, the same query is designed to be reusable across similar product pages—even when the UI changes.
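
Because the extraction is AI-driven, it can help to confirm that a result really matches the requested shape before it reaches the LLM. The following is a minimal sketch of such a check, not part of any AgentQL SDK: findShapeProblems and REQUIRED_FIELDS are hypothetical names, and the sample data is made up for illustration.

```javascript
// Hypothetical helper: checks that an extraction result matches the shape
// requested by the products[] query above. Not part of any AgentQL SDK.
const REQUIRED_FIELDS = ['product_name', 'product_price', 'product_url'];

function findShapeProblems(result) {
  if (!result || !Array.isArray(result.products)) {
    return ['"products" is missing or not an array'];
  }
  const problems = [];
  result.products.forEach((product, index) => {
    for (const field of REQUIRED_FIELDS) {
      const value = product?.[field];
      if (typeof value !== 'string' || value.trim() === '') {
        problems.push(`products[${index}].${field} is missing or empty`);
      }
    }
    // product_rating was marked optional in the query, so it is not checked.
  });
  return problems;
}

// Made-up sample: the second product has an empty price.
const sample = {
  products: [
    {
      product_name: 'Wireless Earbuds Pro',
      product_price: '$129.00',
      product_url: 'https://example.com/products/earbuds-pro',
    },
    {
      product_name: 'Headphones X200',
      product_price: '',
      product_url: 'https://example.com/products/x200',
    },
  ],
};

console.log(findShapeProblems(sample));
// → [ 'products[1].product_price is missing or empty' ]
```

An empty list means the JSON honors the contract. A non-empty list is a signal to refine the query and rerun rather than to fall back to raw HTML.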

2. Run the query via SDK or REST API

You can connect to the page using:

  • Python or JavaScript SDKs (Playwright‑based) when you need to interact with the page (login flows, clicking tabs, pagination, infinite scroll).
  • A browserless REST API when you just need URL → JSON extraction without managing browsers.

Example with a Playwright + AgentQL–style flow in JavaScript:

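// Illustrative only: the package name, client class, and queryPage() method
// below are placeholders. Check the AgentQL SDK docs for the actual API.
import { chromium } from 'playwright';
import { AgentQLClient } from '@agentql/js';

const client = new AgentQLClient({ apiKey: process.env.AGENTQL_API_KEY });

// Same query as in step 1.
const query = `
  {
    products[] {
      product_name
      product_price(include currency symbol)
      product_rating(optional)
      product_url
    }
  }
`;

(async () => {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com/products');

    // Run the query against the live page and get structured JSON back.
    const data = await client.queryPage({ page, query });
    console.log(JSON.stringify(data, null, 2));
  } finally {
    // Always release the browser, even if navigation or extraction fails.
    await browser.close();
  }
})();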
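
When no page interaction is needed, the URL → JSON path can be sketched as a plain HTTP call. Treat the details as assumptions to verify against AgentQL's REST API reference: the endpoint path, the X-API-Key header name, and the { url, query } body are written from memory, and buildQueryRequest and queryUrl are hypothetical helper names. The sketch relies on the global fetch available in Node 18+.

```javascript
// Assumed endpoint; verify against AgentQL's REST API reference.
const AGENTQL_ENDPOINT = 'https://api.agentql.com/v1/query-data';

// Builds the fetch() options for a URL -> JSON extraction request.
function buildQueryRequest(url, query, apiKey) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': apiKey, // assumed header name
    },
    body: JSON.stringify({ url, query }),
  };
}

// Sends the request and resolves to the parsed JSON response.
async function queryUrl(url, query, apiKey) {
  const response = await fetch(AGENTQL_ENDPOINT, buildQueryRequest(url, query, apiKey));
  if (!response.ok) {
    throw new Error(`Extraction request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

A call would look like queryUrl('https://example.com/products', query, process.env.AGENTQL_API_KEY). Keeping the request builder separate from the network call makes the payload easy to inspect and test without hitting the API.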

Typical JSON you’d get back:

{
  "products": [
    {
      "product_name": "Noise‑Cancelling Headphones X200",
      "product_price": "$249.99",
      "product_rating": "4.6",
      "product_url": "https://example.com/products/x200"
    },
    {
      "product_name": "Wireless Earbuds Pro",
      "product_price": "$129.00",
      "product_rating": "4.3",
      "product_url": "https://example.com/products/earbuds-pro"
    }
  ]
}

That JSON is already “LLM‑ready” and dramatically smaller than the full DOM, yet it preserves the semantics you care about.
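
To get a feel for the size difference, you can compare rough token estimates for one product card as markup and as JSON. This is a sketch under loud assumptions: the HTML snippet is invented for illustration, and estimateTokens uses the common "about 4 characters per token" rule of thumb rather than a real tokenizer, so treat the numbers as ballpark only.

```javascript
// Rough heuristic only: ~4 characters per token. Real tokenizers differ by model.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical markup for a single product card, invented for illustration.
const html = `
<div class="card card--product" data-id="x200">
  <a class="card__link" href="https://example.com/products/x200">
    <img class="card__img" src="/img/x200.jpg" alt="">
    <h3 class="card__title">Noise-Cancelling Headphones X200</h3>
  </a>
  <span class="price price--sale">$249.99</span>
  <span class="rating" aria-label="4.6 out of 5">4.6</span>
  <button class="btn btn--primary" data-action="add-to-cart">Add to cart</button>
</div>`;

// The same product as compact JSON with only the queried fields.
const json = JSON.stringify({
  product_name: 'Noise-Cancelling Headphones X200',
  product_price: '$249.99',
  product_rating: '4.6',
  product_url: 'https://example.com/products/x200',
});

console.log({
  htmlTokens: estimateTokens(html),
  jsonTokens: estimateTokens(json),
});
```

On real pages the gap is usually wider than in this snippet, because the full DOM also carries navigation, scripts, and styling that never appear in the JSON.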

3. Ground your LLM on the JSON, not the HTML

Now you feed the JSON into your LLM prompt as context. The LLM never sees <div> or <span> at all.

Example (pseudo‑prompt):

System: You are an assistant that answers questions using only the product data provided in JSON.

User: Here is the data extracted from the current product listing page:

{{products_json}}

Answer the user's question strictly based on this data.
If the data is missing something, say so explicitly.

User question: "Which product is the best value under $200?"
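
The prompt template above can be assembled in code along these lines. This is a sketch, not a specific vendor's API: buildGroundedMessages is a hypothetical helper, and the { role, content } message shape is the widespread chat convention, so adapt it to whichever LLM client you use.

```javascript
// Builds chat messages that ground the model on extracted JSON only.
// The { role, content } shape is an assumption; adapt it to your LLM client.
function buildGroundedMessages(extracted, question) {
  const system =
    'You are an assistant that answers questions using only the product data provided in JSON.';
  const user = [
    'Here is the data extracted from the current product listing page:',
    '',
    JSON.stringify(extracted), // compact form: no pretty-printing whitespace
    '',
    "Answer the user's question strictly based on this data.",
    'If the data is missing something, say so explicitly.',
    '',
    `User question: ${JSON.stringify(question)}`,
  ].join('\n');

  return [
    { role: 'system', content: system },
    { role: 'user', content: user },
  ];
}

// Example usage with made-up data.
const messages = buildGroundedMessages(
  { products: [{ product_name: 'Wireless Earbuds Pro', product_price: '$129.00' }] },
  'Which product is the best value under $200?'
);
console.log(messages[1].content);
```

Serializing without indentation keeps the context compact, and quoting the question with JSON.stringify keeps it clearly delimited from the instructions.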

The LLM sees a compact, structured view of the page, grounds on it, and returns an answer that’s:

  • Cheaper to compute (fewer tokens).
  • Less likely to hallucinate elements that aren’t actually on the page.
  • Easier to debug (you can inspect the exact JSON that was used).

If you’re building an agent, the agent loop becomes:

  1. Navigate with Playwright or call the REST API.
  2. Run an AgentQL query to get JSON.
  3. Pass that JSON into the LLM as tool output / context.
  4. Repeat as needed for additional pages.
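
The loop above can be sketched as follows. Everything here is an assumption made for illustration: runAgent is a hypothetical function, extract(url, query) stands in for either the SDK or REST path, and askModel(context, question) is assumed to resolve to an object with either an answer or a nextUrl to visit.

```javascript
// Sketch of the agent loop. `extract` and `askModel` are hypothetical
// injected functions: extract(url, query) resolves to JSON, and
// askModel(context, question) resolves to { answer } or { nextUrl }.
async function runAgent({ startUrl, query, question, extract, askModel, maxPages = 5 }) {
  const context = [];
  let url = startUrl;

  for (let visited = 0; url && visited < maxPages; visited += 1) {
    // Steps 1-2: navigate and run the query to get JSON.
    const data = await extract(url, query);
    context.push({ url, data });

    // Step 3: pass the accumulated JSON to the model as context.
    const reply = await askModel(context, question);
    if (!reply.nextUrl) {
      return reply.answer;
    }

    // Step 4: the model asked for another page, so repeat.
    url = reply.nextUrl;
  }

  // Page budget exhausted without a final answer.
  return null;
}
```

Injecting extract and askModel keeps the loop independent of any particular SDK or model client, and the maxPages cap stops a confused agent from crawling indefinitely.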

Common Mistakes to Avoid

  • Stuffing raw HTML “just in case”:
    Teams often still add HTML alongside their structured data “for safety.” This defeats the purpose: it bloats context and reintroduces noise.
    How to avoid it: Treat your query as the source of truth. If you’re missing a field, refine the query and rerun—don’t fall back to dumping HTML.

  • Hard‑coding selectors around the query layer:
    Wrapping AgentQL or similar tools in your own XPath/CSS logic re‑creates the fragility you were trying to escape.
    How to avoid it: Let AI analyze page structure. Use AgentQL queries directly (or natural‑language descriptions in the Playground/IDE) instead of tying fields to specific DOM paths.


Real-World Example

Suppose you’re building a research assistant that summarizes the latest posts from Medium or LinkedIn and answers questions about them. The old way:

  1. Load the article/profile with Playwright.
  2. Pull the entire HTML.
  3. Compress it (maybe via another LLM).
  4. Stuff it into the main model’s context.
  5. Hope the model extracts the correct title, author, and content.

You hit context limits quickly and you’re constantly dealing with markup noise, tracking scripts, and layout cruft.

With a schema‑first approach using AgentQL, the flow becomes:

  1. Navigate to the article/profile URL via Playwright or REST.

  2. Run a query like:

    {
      article {
        title
        author_name
        published_date
        tags[]
        body_text
      }
    }
    
  3. Get back clean JSON:

    {
      "article": {
        "title": "Grounding LLMs on the Web Without HTML Bloat",
        "author_name": "Maya Chen",
        "published_date": "2026-04-10",
        "tags": ["LLM", "web automation", "data extraction"],
        "body_text": "When you ship web agents to production, raw HTML becomes a liability..."
      }
    }
    
  4. Pass that JSON directly to your LLM and ask:
    “Summarize this article in 3 bullet points for a staff engineer evaluating web‑based grounding strategies.”

No HTML in the context window, no brittle selectors, and you can reuse the same query across thousands of similar pages, even when Medium or LinkedIn tweaks their layout.
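
Reusing one query across many pages can be sketched as a small batch runner. This is illustrative only: extractMany is a hypothetical helper, extract(url, query) stands in for whichever SDK or REST call you use, and the default concurrency of 3 is an arbitrary choice, not a documented limit.

```javascript
// Runs one query against many URLs with a small concurrency limit.
// `extract(url, query)` is a hypothetical function that resolves to JSON.
async function extractMany(urls, query, extract, concurrency = 3) {
  const results = new Array(urls.length);
  let next = 0;

  // Each worker pulls the next unprocessed index until none are left.
  async function worker() {
    while (next < urls.length) {
      const index = next++;
      const url = urls[index];
      try {
        results[index] = { url, data: await extract(url, query) };
      } catch (error) {
        // Record the failure and keep going so one bad page does not stop the batch.
        results[index] = { url, error: String(error) };
      }
    }
  }

  const workerCount = Math.min(concurrency, urls.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Results come back in the same order as the input URLs, and failed pages carry an error field instead of data, which makes it straightforward to retry only the pages that need it.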

Pro Tip: Use the AgentQL IDE browser extension or Playground to iterate on your queries live against real pages (Medium, LinkedIn, Creative Commons, etc.). Once you’re happy with the JSON shape, drop that exact query into your SDK or REST flow and wire it into your agent—no more trial‑and‑error XPath debugging.


Summary

If your question is “How can I ground an LLM on a webpage without stuffing raw HTML into the context window?”, the most scalable answer is: stop treating the web like a blob of markup and start treating it like an API.

Define the data shape you need (schema‑first), let AI analyze the page structure to find it, and ground your LLM on the resulting JSON instead of raw HTML. Tools like AgentQL give you that layer: you write a query, they return structured data, and you plug that into your LLM via Playwright or a browserless REST API.

You get smaller prompts, fewer hallucinations, and web agents that keep working even when the DOM shifts.
