
Is there a way to extract structured JSON from a webpage without writing custom parsing code?
Most teams that depend on web data end up maintaining a tangle of XPath, CSS selectors, and regex just to get clean JSON out of a page, and those pipelines break whenever a frontend redesign ships. You can skip most of that custom parsing by letting an engine analyze the page for you and return structured JSON directly, based on a query or schema you define.
Quick Answer: Yes. Instead of writing custom parsing code, you can use a tool like AgentQL: connect it to a webpage, define the shape of the data you want, and get structured JSON back in one step. You describe the output (via a query or natural language), and AgentQL’s AI analyzes the page’s structure to find the right elements, with no brittle XPath or CSS selectors required.
Why This Matters
If your job is “take this URL and turn it into reliable data,” every hour you spend reverse‑engineering HTML is wasted. Traditional scrapers constantly break on layout changes, raw HTML blows up LLM context windows, and each new site means new parsing logic.
Being able to go directly from “this page” → “this JSON shape” without writing custom parsing code:
- Speeds up shipping extraction and automation workflows.
- Makes your web agents and LLM tools more reliable in production.
- Turns the web into something closer to an API, with a predictable contract.
Key Benefits:
- No fragile selectors: Replace XPath/DOM/CSS selectors with AI that understands the page structure and finds fields by meaning, not by div position.
- Schema‑first JSON output: Define the shape of your data once and reuse it across similar pages, including dynamic content and PDFs.
- Plugs into your stack: Use Python/JavaScript SDKs with Playwright or a browserless REST API (URL → JSON) to slot straight into existing pipelines and LLM workflows.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Schema‑first extraction | You define the fields and structure you want (e.g., { products[] { product_name product_price } }) and let the engine populate them from the page. | Treats the web like an API: clear contracts, predictable JSON, easier downstream processing and validation. |
| Selector‑free querying | Instead of hand‑writing XPath or CSS selectors, AI analyzes the page’s structure to find the best matches for your requested fields. | Reduces breakage when the DOM shifts, removes the need to inspect HTML, and speeds up integration with new sites. |
| Self‑healing, reusable queries | A single query can work across multiple similar pages and remains consistent despite dynamic content or layout changes. | Cuts maintenance costs; you don’t rewrite parsers for every A/B test, redesign, or new product category page. |
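To make the schema-first idea concrete, here is a minimal sketch that renders a nested Python mapping as a query string of the shape shown in the table. The `build_query` helper and the `PRODUCT_SCHEMA` layout are hypothetical illustrations, not part of any AgentQL SDK; the only thing taken from the article is the query syntax itself.

```python
# Hypothetical helper (not an AgentQL API): a schema is a nested dict where
# a value of None marks a leaf field and a dict marks a nested block.
def build_query(schema: dict) -> str:
    """Render a nested mapping as a single-line query string."""
    parts = []
    for name, children in schema.items():
        if children is None:
            parts.append(name)  # leaf field, e.g. product_name
        else:
            parts.append(f"{name} {build_query(children)}")  # nested block
    return "{ " + " ".join(parts) + " }"


# The downstream schema you care about, written once as data.
PRODUCT_SCHEMA = {
    "products[]": {
        "product_name": None,
        "product_price": None,
    }
}

print(build_query(PRODUCT_SCHEMA))
```

Keeping the schema as data means the same definition can drive both the query you send and the validation you run on what comes back.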
How It Works (Step‑by‑Step)
At a high level, extracting structured JSON from a webpage without custom parsing code looks like this:
- Define the JSON you want with a query

  With AgentQL, you describe the output shape using a query language that mirrors the JSON you expect. For example, say you want a list of products from a search results page:

  ```
  {
    results {
      products[] {
        product_name
        product_price(include currency symbol)
        product_image
      }
    }
  }
  ```

  You’re not pointing at specific divs or CSS classes, just naming the fields you care about.

- Let AI analyze the page structure

  Under the hood, AgentQL:
  - Fetches the page (via Playwright in the SDKs or via the REST API).
  - Analyzes the DOM, visible content, and layout.
  - Maps your requested fields (e.g., `product_name`, `product_price`) to the most relevant elements on the page.

  This is where it replaces brittle selectors: it doesn’t rely on `/div[1]/div/div[2]/div[2]` paths that change daily.

- Receive clean structured JSON

  The response is already structured JSON that matches your query:

  ```json
  {
    "results": {
      "products": [
        {
          "product_name": "Noise-Cancelling Headphones",
          "product_price": "$249.99",
          "product_image": "https://example.com/images/headphones.jpg"
        },
        {
          "product_name": "Wireless Earbuds",
          "product_price": "$129.00",
          "product_image": "https://example.com/images/earbuds.jpg"
        }
      ]
    }
  }
  ```

  No extra parsing layer, no regex to clean out HTML. You take this JSON straight into your database, analytics pipeline, or LLM grounding context.
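As a rough illustration of taking that JSON "straight into your database," the sketch below flattens the sample response into rows and turns the price strings into numbers. The `to_rows` function and the row field names are made up for this example, and the price handling assumes a single leading currency symbol followed by a plain decimal, which holds for the sample data but not for every site.

```python
import json
from decimal import Decimal

# Sample response copied from the example above.
RAW = """
{
  "results": {
    "products": [
      {"product_name": "Noise-Cancelling Headphones",
       "product_price": "$249.99",
       "product_image": "https://example.com/images/headphones.jpg"},
      {"product_name": "Wireless Earbuds",
       "product_price": "$129.00",
       "product_image": "https://example.com/images/earbuds.jpg"}
    ]
  }
}
"""


def to_rows(payload: dict) -> list[dict]:
    """Flatten the nested response into one flat dict per product."""
    rows = []
    for product in payload["results"]["products"]:
        price_text = product["product_price"]  # e.g. "$249.99"
        rows.append({
            "name": product["product_name"],
            # Assumption: exactly one leading currency symbol.
            "currency": price_text[0],
            "amount": Decimal(price_text[1:]),
            "image": product["product_image"],
        })
    return rows


rows = to_rows(json.loads(RAW))
for row in rows:
    print(row["name"], row["currency"], row["amount"])
```

`Decimal` is used instead of `float` so that prices keep their exact cents when they are stored or summed.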
Practical surfaces you can use
You can get this “URL → JSON” behavior in a few ways:
- Python/JavaScript SDKs (Playwright‑based)

  Ideal if you already use Playwright or need to interact with pages (click, scroll, authenticate) before extracting.

  Python sketch:

  ```python
  import agentql
  from playwright.sync_api import sync_playwright

  # Assumes the AGENTQL_API_KEY environment variable is set.
  QUERY = """
  {
    results {
      products[] {
        product_name
        product_price(include currency symbol)
        product_image
      }
    }
  }
  """

  with sync_playwright() as p:
      browser = p.chromium.launch()
      # Wrap the Playwright page so it gains AgentQL's query methods.
      page = agentql.wrap(browser.new_page())
      page.goto("https://example.com/search?q=headphones")
      result = page.query_data(QUERY)  # structured data as a dict
      print(result)
      browser.close()
  ```

- Browserless REST API

  When you just want public data from any URL, with no browser infrastructure on your side:

  ```shell
  curl -X POST https://api.agentql.com/v1/query-data \
    -H "X-API-Key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url": "https://example.com/search?q=headphones",
      "query": "{ results { products[] { product_name product_price(include currency symbol) } } }"
    }'
  ```

- AgentQL IDE browser extension & Playground

  For debugging and iteration: open any page, tweak your query in real time, and see the JSON output update instantly. Once it looks right, drop that same query into your code.
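When the same query has to travel from a readable multi-line string in your code into a JSON request body, hand-escaping it is error-prone. The sketch below builds the body with `json.dumps` instead. The `build_payload` helper is hypothetical; the `url` and `query` keys mirror the curl example, and collapsing the query onto one line assumes whitespace inside a query is not significant.

```python
import json

# Multi-line query as you would keep it in source control.
QUERY = """
{
  results {
    products[] {
      product_name
      product_price(include currency symbol)
    }
  }
}
"""


def build_payload(url: str, query: str) -> str:
    """Return a JSON request body with the query collapsed to one line."""
    # Assumption: whitespace between query tokens carries no meaning,
    # so runs of spaces and newlines can be collapsed to single spaces.
    compact_query = " ".join(query.split())
    # json.dumps handles all quoting and escaping for us.
    return json.dumps({"url": url, "query": compact_query})


body = build_payload("https://example.com/search?q=headphones", QUERY)
print(body)
```

The resulting string can be passed as the request body to whichever HTTP client you already use.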
Common Mistakes to Avoid
- Treating AgentQL like XPath with extra steps:

  If you try to re‑encode your existing selectors into the query, you miss the point. Instead of mirroring your DOM, think in terms of the data you want: names, prices, dates, descriptions.

  How to avoid it: Always start from your downstream schema (what your database or LLM needs) and let AgentQL handle the mapping.

- Baking page‑specific assumptions into your schema:

  If you hard‑code fields that only exist on one particular page variant, your query becomes less reusable across similar pages.

  How to avoid it: Keep queries focused on stable concepts (e.g., `job_title`, `company_name`, `location`) and let optional fields be optional. That’s how you get self‑healing behavior when pages change.
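One way to "let optional fields be optional" on the consuming side is to separate required from optional fields when you load each record. This is a sketch under assumptions: the `JobPosting` class and `parse_job` function are illustrative names, and it assumes a field that could not be found arrives either as a missing key or as `null` (`None` in Python).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobPosting:
    job_title: str
    company_name: str
    location: Optional[str] = None  # not every page variant shows this


def parse_job(record: dict) -> JobPosting:
    """Build a JobPosting, failing loudly only on required fields."""
    missing = [f for f in ("job_title", "company_name") if not record.get(f)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return JobPosting(
        job_title=record["job_title"],
        company_name=record["company_name"],
        # .get() tolerates both an absent key and an explicit null.
        location=record.get("location"),
    )


print(parse_job({"job_title": "Data Engineer", "company_name": "Acme"}))
```

With this split, a page variant that drops the location still produces a usable record, while a page that loses the job title surfaces as an error instead of silently writing incomplete rows.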
Real‑World Example
Say you’re building a market‑intelligence pipeline that scrapes product listings every night from a dozen e‑commerce sites. The old setup:
- Playwright scripts with long XPath chains (`/html/body/div[3]/div[2]/div[1]/div[2]/span`).
- Hand‑rolled parsing and post‑processing per domain.
- Weekly breakages when frontends shipped new components or changed class names.
After switching to AgentQL, your job becomes defining one extraction query per “page type,” not per DOM version. For example, a “category page” query:
```
{
  category_page {
    category_name
    products[] {
      product_name
      product_price(include currency symbol)
      availability_status
    }
  }
}
```
You attach this to all category URLs across multiple brands. When one site changes its layout, AgentQL’s AI re‑interprets the DOM and still finds product names and prices by semantic context, not by exact node paths. Your nightly job continues to emit consistent JSON without rewriting selectors.
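The "one query per page type" idea can be sketched as a small registry plus a loop. Everything here is illustrative: `QUERIES_BY_PAGE_TYPE`, `run_nightly`, and the injected `extract` callable are hypothetical names, and `fake_extract` is a stand-in that returns canned data so the sketch runs without network access. In a real job you would pass a function that calls the SDK or REST API.

```python
from typing import Callable

CATEGORY_PAGE_QUERY = """
{
  category_page {
    category_name
    products[] {
      product_name
      product_price(include currency symbol)
      availability_status
    }
  }
}
"""

# One query per page type, shared by every site that has that page type.
QUERIES_BY_PAGE_TYPE = {"category_page": CATEGORY_PAGE_QUERY}

# An extractor takes (url, query) and returns the structured data.
Extractor = Callable[[str, str], dict]


def run_nightly(urls_by_page_type: dict[str, list[str]],
                extract: Extractor) -> list[dict]:
    """Run the registered query for each URL and collect the results."""
    results = []
    for page_type, urls in urls_by_page_type.items():
        query = QUERIES_BY_PAGE_TYPE[page_type]
        for url in urls:
            results.append({
                "url": url,
                "page_type": page_type,
                "data": extract(url, query),
            })
    return results


def fake_extract(url: str, query: str) -> dict:
    """Stand-in extractor returning canned data for demonstration."""
    return {"category_page": {"category_name": "Headphones", "products": []}}


urls = {"category_page": ["https://a.example/c/1", "https://b.example/c/2"]}
for item in run_nightly(urls, fake_extract):
    print(item["url"], item["data"]["category_page"]["category_name"])
```

Injecting the extractor keeps the scheduling logic testable on its own and lets you swap between the SDK and the REST API without touching the loop.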
Pro Tip: Use the browser extension or Playground to tune your queries on live pages first. Once the JSON looks right on a few representative URLs, you can safely drop that query into your Playwright or REST API workflow and reuse it across similar pages.
Summary
You don’t need to write custom parsing code every time you want structured JSON from a webpage. With AgentQL, you define the shape of your data in a query, let AI analyze the page’s structure instead of relying on fragile XPath or DOM/CSS selectors, and receive clean, schema‑first JSON that stays consistent even as pages change.
This approach makes the web more “API‑like” for your agents and data pipelines: fewer broken scrapers, less time spent crunching HTML, and more time feeding reliable JSON into your databases, LLMs, and automations.