
How can I scrape dynamic websites without relying on brittle CSS/XPath selectors?
Most scraping pipelines eventually hit the same wall: as soon as a dynamic site ships a UI tweak, your carefully tuned CSS/XPath selectors stop matching and quietly return nothing. If you’re trying to keep data extraction or web agents stable in production, “chasing the DOM” is not a strategy; it’s an operational tax. The good news: you can scrape dynamic websites without relying on brittle CSS/XPath selectors by switching to schema-first extraction, powered by an engine that understands page structure and returns structured JSON directly.
Quick Answer: You can scrape dynamic websites more reliably by defining the shape of the data you want (schema-first) and letting an AI-powered selector engine like AgentQL analyze the page structure for you, instead of hand-writing CSS/XPath. With AgentQL, you write a query describing your fields (e.g., products[] { name price }), run it via Python/JavaScript SDKs or a REST API, and get clean JSON that stays consistent even as the UI changes.
Why This Matters
Dynamic, JavaScript-heavy sites change constantly—new div wrappers, different class names, reflowed layouts. Traditional scrapers that anchor on exact DOM paths break on these changes, forcing you to:
- Rewrite selectors after every redesign
- Re-deploy crawlers just to fix trivial UI tweaks
- Overfeed LLMs with raw HTML and hope they don’t hallucinate
If your job is to keep data flowing for pricing intelligence, lead-gen, monitoring, or LLM grounding, you need scraping that behaves like an API contract: define the output shape once, and keep getting consistent JSON even when the front-end moves around. That’s exactly where schema-first, AI-assisted querying (like AgentQL) replaces brittle selectors.
Key Benefits:
- Less breakage from UI changes: AI analyzes the page’s structure instead of depending on fixed CSS/XPath, making extraction more self-healing across dynamic websites.
- Schema-first, JSON-ready data: You define the shape of your output once, and get structured JSON that’s ready for pipelines, databases, or LLM grounding.
- Faster iteration and debugging: Use SDKs, a browser extension, and a Playground to refine queries in real time, instead of spelunking through raw HTML and CSS selectors.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Schema‑first extraction | Defining the desired output structure (fields, arrays, nesting) and letting the engine map that schema to the page. | Treats web scraping like an API contract: stable JSON shape, easier downstream integration, and fewer “DOM archaeology” sessions. |
| AI‑powered selectors | An engine (like AgentQL) that uses AI to analyze page layout, semantics, and patterns to locate data, rather than static CSS/XPath. | A robust alternative to fragile selectors that keeps working despite dynamic content and layout changes. |
| Query → JSON workflow | Writing a query (in AgentQL) to describe what you want, running it via SDK/REST, and receiving structured JSON in response. | Eliminates manual parsing and reduces context-window blowups when connecting LLMs and agents to the web. |
How It Works (Step-by-Step)
Instead of binding scrapers to CSS/XPath, you:
- Define the shape of your data with an AgentQL query
- Test and refine it in a browser debugger or Playground
- Run it at scale via SDKs or a browserless REST API
1. Define the shape of your data
Start with the job-to-be-done: what JSON do you actually need?
Let’s say you’re scraping a dynamic e‑commerce category page and want product name, price (with currency symbol), and URL.
With CSS/XPath, you’d do something like:
// Old way: brittle selectors
const name = await page.$eval('.product-card .title span', el => el.textContent.trim());
const price = await page.$eval('.product-card .price .amount', el => el.textContent.trim());
const url = await page.$eval('.product-card a.details-link', el => el.href);
Every .product-card change can break you.
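To make the failure mode concrete, here is a minimal Python sketch using only the standard library. The markup and the class names in `BEFORE` and `AFTER` are invented for illustration; the point is only that an extractor anchored on a class name returns nothing once that class is renamed.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element whose class list contains target_class."""

    def __init__(self, target_class: str) -> None:
        super().__init__()
        self.target_class = target_class
        self._depth = 0  # greater than 0 while inside a matching element
        self.matches: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif self.target_class in classes:
            self._depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.matches[-1] += data


def extract_by_class(html: str, target_class: str) -> list[str]:
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    return [text.strip() for text in parser.matches]


# Hypothetical markup before and after a redesign that renames the classes.
BEFORE = '<div class="product-card"><span class="title">Laptop A</span></div>'
AFTER = '<div class="product-tile"><span class="heading">Laptop A</span></div>'

titles_before = extract_by_class(BEFORE, "title")
titles_after = extract_by_class(AFTER, "title")
print(titles_before)  # ['Laptop A']
print(titles_after)   # []
```

Note that the second call does not raise an error. It simply returns an empty list, which is why this kind of breakage often goes unnoticed until downstream data dries up.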
With AgentQL, you describe the JSON you want:
{
  products[] {
    product_name
    product_price(include currency symbol)
    product_url
  }
}
You’re not tying yourself to class names or DOM depth. You’re declaring a schema (array of products, each with three fields) and letting AgentQL figure out how to extract it from the page.
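If you keep the field list in code, the query text can be generated from it so the schema lives in one place. The sketch below is a hypothetical helper (`build_list_query` is not part of any SDK); it only assembles a string in the query form shown in this article.

```python
from typing import Optional


def build_list_query(collection: str, fields: dict[str, Optional[str]]) -> str:
    """Render a list query from field names and optional natural-language hints.

    Hypothetical helper for illustration: it only builds the query text.
    """
    lines = ["{", f"  {collection}[] {{"]
    for name, hint in fields.items():
        lines.append(f"    {name}({hint})" if hint else f"    {name}")
    lines += ["  }", "}"]
    return "\n".join(lines)


PRODUCT_QUERY = build_list_query(
    "products",
    {
        "product_name": None,
        "product_price": "include currency symbol",
        "product_url": None,
    },
)
print(PRODUCT_QUERY)
```

The same dictionary of field names can then drive downstream checks, so the query and the consuming code are less likely to drift apart.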
2. Test and refine in real time
Use the AgentQL IDE browser extension or Playground:
- Open the dynamic website (e.g., https://example-shop.com/laptops)
- Launch the AgentQL extension
- Paste and run your query
- Inspect the returned JSON; tweak field names or hints if needed
If the site adds badges, wraps content in new <div>s, or changes classes, you typically don’t touch the query—you just re-run it. AgentQL uses AI to analyze the structure and still find product_name, product_price, and product_url.
3. Run via SDKs or REST API
Once the query is stable, you move it into code.
Python + Playwright SDK example:
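# Assumes `pip install agentql playwright` and an AGENTQL_API_KEY environment variable;
# check the AgentQL SDK docs for the current call names.
import agentql
from playwright.sync_api import sync_playwright

QUERY = """
{
    products[] {
        product_name
        product_price(include currency symbol)
        product_url
    }
}
"""

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=True)
    # Wrap the Playwright page so it gains AgentQL's query methods.
    page = agentql.wrap(browser.new_page())
    page.goto("https://example-shop.com/laptops")
    result = page.query_data(QUERY)  # dict shaped like the query
    print(result)
    browser.close()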
Sample JSON output:
{
  "products": [
    {
      "product_name": "Lenovo ThinkPad X1 Carbon",
      "product_price": "$1,499.00",
      "product_url": "https://example-shop.com/product/lenovo-thinkpad-x1-carbon"
    },
    {
      "product_name": "MacBook Air 13\" M3",
      "product_price": "$1,199.00",
      "product_url": "https://example-shop.com/product/macbook-air-13-m3"
    }
  ]
}
Same query, same JSON shape, even if the internal DOM structure shifts. You can also use:
- JavaScript SDK + Playwright for Node.js environments
- Browserless REST API (URL → JSON, no browser management required)
- Headless browsing with concurrency and per-minute call limits appropriate for production workloads
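For the REST route, a rough sketch of the URL → JSON call using only the Python standard library might look like the following. The endpoint path, the `X-API-Key` header name, and the payload keys are assumptions here, not confirmed by this article, so verify them against the current AgentQL REST documentation before relying on them. The example builds the request without sending it.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the AgentQL REST API reference.
AGENTQL_ENDPOINT = "https://api.agentql.com/v1/query-data"


def build_query_request(page_url: str, query: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks the service to run query on page_url."""
    payload = {"url": page_url, "query": query}  # assumed payload keys
    return urllib.request.Request(
        AGENTQL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},  # assumed header
        method="POST",
    )


def run_query(page_url: str, query: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON body."""
    request = build_query_request(page_url, query, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_query_request(
    "https://example-shop.com/laptops",
    "{ products[] { product_name product_price(include currency symbol) product_url } }",
    api_key="YOUR_API_KEY",
)
print(request.get_method(), request.full_url)
```

Separating `build_query_request` from `run_query` keeps the payload easy to inspect and unit-test without network access or a real API key.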
Common Mistakes to Avoid
- Treating AI selectors like magic and skipping schema design: If you don’t define a clear output shape, your downstream systems still get inconsistent data. Always start with a precise AgentQL query that matches the JSON your pipelines expect.
- Mixing raw HTML scraping and AgentQL in the same flow without a plan: Using AgentQL for core fields but falling back to ad‑hoc DOM parsing for everything else brings back fragility. Prefer a single schema‑first query that covers all required fields and keep your code focused on consuming JSON, not re-parsing HTML.
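One way to hold the first point in practice is to check every result against the shape your pipeline expects before it goes downstream. The following is a minimal sketch, assuming the `products[]` schema used in this article; `contract_violations` and `REQUIRED_PRODUCT_FIELDS` are illustrative names, not part of any SDK.

```python
REQUIRED_PRODUCT_FIELDS = ("product_name", "product_price", "product_url")


def contract_violations(result: dict) -> list[str]:
    """Return readable problems; an empty list means the result matches the contract."""
    products = result.get("products")
    if not isinstance(products, list):
        return ["'products' is missing or not a list"]
    problems = []
    for index, item in enumerate(products):
        if not isinstance(item, dict):
            problems.append(f"products[{index}] is not an object")
            continue
        for field in REQUIRED_PRODUCT_FIELDS:
            if not item.get(field):
                problems.append(f"products[{index}].{field} is missing or empty")
    return problems


good = {
    "products": [
        {
            "product_name": "Lenovo ThinkPad X1 Carbon",
            "product_price": "$1,499.00",
            "product_url": "https://example-shop.com/product/lenovo-thinkpad-x1-carbon",
        }
    ]
}
bad = {"products": [{"product_name": 'MacBook Air 13" M3', "product_price": None}]}

print(contract_violations(good))  # []
print(contract_violations(bad))
```

Running a check like this on each response turns a silent extraction gap into an explicit, loggable failure.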
Real-World Example
Say you need to scrape Google search results, Medium articles, Twitter timelines, or a CDN provider like Cloudflare—all highly dynamic properties that change UI frequently.
With CSS/XPath, each site demands its own selector zoo. Every time search results layouts or feed cards change, you’re scanning HTML, adjusting selectors, and re-deploying.
With AgentQL, the pattern is the same for all:
{
  search_results[] {
    title
    snippet
    result_url
  }
}
or
{
  posts[] {
    author_name
    post_title
    post_url
    publish_date
  }
}
You point the matching query at https://google.com, https://medium.com, https://twitter.com, or https://cdnjs.cloudflare.com via the SDKs or REST API. AgentQL’s AI analyzes each page’s structure and returns consistent JSON fields (search_results[] with title, snippet, and result_url, or posts[] with its nested fields), even as each site ships new designs.
Pro Tip: Design your AgentQL queries at the “pattern” level (e.g., search_results[], posts[], products[]) and re-use them across similar sites. It’s often possible to keep a single query working across many domains with only minor tweaks, which is rarely practical with tightly coupled CSS/XPath selectors.
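As a rough sketch of that idea, you could keep one query per pattern and route each URL to its pattern by host name. The routing table and the `query_for` helper below are hypothetical, and the queries are written inline for brevity.

```python
from urllib.parse import urlparse

# Pattern-level queries, in the form shown earlier in this article.
PATTERN_QUERIES = {
    "search_results": "{ search_results[] { title snippet result_url } }",
    "posts": "{ posts[] { author_name post_title post_url publish_date } }",
}

# Hypothetical routing table: which pattern each host uses.
HOST_PATTERNS = {
    "google.com": "search_results",
    "medium.com": "posts",
    "twitter.com": "posts",
}


def query_for(url: str) -> str:
    """Pick the shared pattern query for a URL based on its host name."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    try:
        return PATTERN_QUERIES[HOST_PATTERNS[host]]
    except KeyError:
        raise LookupError(f"no query pattern registered for host {host!r}") from None


print(query_for("https://medium.com/some-article"))
```

Adding a new site then means adding one line to the routing table, and a site-specific tweak becomes a new entry in `PATTERN_QUERIES` instead of a new selector set.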
Summary
To scrape dynamic websites without relying on brittle CSS/XPath selectors, flip the model:
- Stop binding your extraction logic to precise DOM paths.
- Start defining the JSON you want and use an AI-powered selector engine like AgentQL to map that schema to the page.
- Run the same AgentQL query across many dynamic pages, and keep getting consistent, structured results—without re-writing selectors every time the front-end flexes.
This schema-first, query → JSON approach turns web scraping into a stable, testable contract that plays well with Playwright, LLMs, and production data pipelines.