Is there a way to extract structured JSON from a webpage without writing custom parsing code?

Most teams that depend on web data end up maintaining a mess of XPath, CSS selectors, and regex just to get clean JSON out of a page—and those pipelines break every time a frontend engineer ships a redesign. You can skip most of that custom parsing work by letting an engine analyze the page for you and return structured JSON directly based on a query or schema you define.

Quick Answer: Yes. Instead of writing custom parsing code, you can use a tool like AgentQL that connects to a webpage, lets you define the shape of the data you want, and returns structured JSON in one step. You describe the output (via a query or natural language), and AgentQL’s AI analyzes the page’s structure to find the right elements, with no brittle XPath or CSS selectors required.

Why This Matters

If your job is “take this URL and turn it into reliable data,” every hour you spend reverse‑engineering HTML is wasted. Traditional scrapers constantly break on layout changes, raw HTML blows up LLM context windows, and each new site means new parsing logic.

Being able to go directly from “this page” → “this JSON shape” without writing custom parsing code:

  • Speeds up shipping extraction and automation workflows.
  • Makes your web agents and LLM tools more reliable in production.
  • Turns the web into something closer to an API, with a predictable contract.

Key Benefits:

  • No fragile selectors: Replace XPath/DOM/CSS selectors with AI that understands the page structure and finds fields by meaning, not by div position.
  • Schema‑first JSON output: Define the shape of your data once and reuse it across similar pages, including dynamic content and PDFs.
  • Plugs into your stack: Use Python/JavaScript SDKs with Playwright or a browserless REST API (URL → JSON) to slot straight into existing pipelines and LLM workflows.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| Schema‑first extraction | You define the fields and structure you want (e.g., { products[] { product_name product_price } }) and let the engine populate them from the page. | Treats the web like an API: clear contracts, predictable JSON, easier downstream processing and validation. |
| Selector‑free querying | Instead of hand‑writing XPath or CSS selectors, AI analyzes the page’s structure to find the best matches for your requested fields. | Reduces breakage when the DOM shifts, removes the need to inspect HTML, and speeds up integration with new sites. |
| Self‑healing, reusable queries | A single query can work across multiple similar pages and remains consistent despite dynamic content or layout changes. | Cuts maintenance costs; you don’t rewrite parsers for every A/B test, redesign, or new product category page. |
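
To make the "predictable contract" idea concrete, here is a minimal Python sketch of checking extracted JSON against the product schema before it goes downstream. The `Product` dataclass and `parse_products` helper are hypothetical names introduced for illustration; they are not part of AgentQL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    product_name: str
    product_price: str


def parse_products(payload: dict) -> list[Product]:
    """Turn {"products": [...]} into typed records, failing loudly on contract drift."""
    items = payload.get("products")
    if not isinstance(items, list):
        raise ValueError("expected 'products' to be a list")
    products = []
    for index, item in enumerate(items):
        # A field counts as missing if it is absent or empty.
        missing = [f for f in ("product_name", "product_price") if not item.get(f)]
        if missing:
            raise ValueError(f"products[{index}] is missing {missing}")
        products.append(Product(item["product_name"], item["product_price"]))
    return products


sample = {"products": [{"product_name": "Wireless Earbuds", "product_price": "$129.00"}]}
print(parse_products(sample))
```

Failing at this boundary, rather than deep inside an analytics job, is what makes the schema behave like an API contract.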

How It Works (Step‑by‑Step)

At a high level, extracting structured JSON from a webpage without custom parsing code looks like this:

  1. Define the JSON you want with a query

    With AgentQL, you describe the output shape using a query language that mirrors the JSON you expect. For example, say you want a list of products from a search results page:

    {
      results {
        products[] {
          product_name
          product_price(include currency symbol)
          product_image
        }
      }
    }
    

    You’re not pointing at specific divs or CSS classes—just naming the fields you care about.

  2. Let AI analyze the page structure

    Under the hood, AgentQL:

    • Fetches the page (via Playwright in the SDKs or via the REST API).
    • Analyzes the DOM, visible content, and layout.
    • Maps your requested fields (e.g., product_name, product_price) to the most relevant elements on the page.

    This is where it replaces brittle selectors: it doesn’t depend on positional paths like /div[1]/div/div[2]/div[2], which break whenever the markup shifts.

  3. Receive clean structured JSON

    The response is already structured JSON that matches your query:

    {
      "results": {
        "products": [
          {
            "product_name": "Noise‑Cancelling Headphones",
            "product_price": "$249.99",
            "product_image": "https://example.com/images/headphones.jpg"
          },
          {
            "product_name": "Wireless Earbuds",
            "product_price": "$129.00",
            "product_image": "https://example.com/images/earbuds.jpg"
          }
        ]
      }
    }
    

    No extra parsing layer, no regex to clean out HTML. You take this JSON straight into your database, analytics pipeline, or LLM grounding context.
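
As a sketch of the "straight into your database" step, the snippet below loads the response shown above into SQLite using only the Python standard library. The table name and column names are illustrative assumptions.

```python
import json
import sqlite3

RESPONSE = """
{
  "results": {
    "products": [
      {"product_name": "Noise-Cancelling Headphones", "product_price": "$249.99",
       "product_image": "https://example.com/images/headphones.jpg"},
      {"product_name": "Wireless Earbuds", "product_price": "$129.00",
       "product_image": "https://example.com/images/earbuds.jpg"}
    ]
  }
}
"""


def load_products(conn: sqlite3.Connection, response_text: str) -> int:
    """Insert every product from the response and return how many rows were written."""
    products = json.loads(response_text)["results"]["products"]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, image TEXT)"
    )
    # Named placeholders let each JSON object be passed to SQLite as-is.
    conn.executemany(
        "INSERT INTO products VALUES (:product_name, :product_price, :product_image)",
        products,
    )
    conn.commit()
    return len(products)


conn = sqlite3.connect(":memory:")
print(load_products(conn, RESPONSE))
print(conn.execute("SELECT name, price FROM products").fetchall())
```

Because the JSON keys already match the query, the insert statement can bind them by name with no intermediate cleanup step.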

Practical surfaces you can use

You can get this “URL → JSON” behavior in a few ways:

  • Python/JavaScript SDKs (Playwright‑based)
    Ideal if you already use Playwright or need to interact with pages (click, scroll, authenticate) before extracting.

    Python sketch:

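    # Sketch based on AgentQL's Playwright SDK; confirm names against the current docs.
    # The SDK reads the API key from the AGENTQL_API_KEY environment variable.
    import agentql
    from playwright.sync_api import sync_playwright

    QUERY = """
    {
      results {
        products[] {
          product_name
          product_price(include currency symbol)
          product_image
        }
      }
    }
    """

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        # Wrap the Playwright page so it gains AgentQL's query methods.
        page = agentql.wrap(browser.new_page())
        page.goto("https://example.com/search?q=headphones")

        # query_data returns a dict shaped like the query.
        result = page.query_data(QUERY)
        print(result)
        browser.close()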
    
  • Browserless REST API
    When you just want public data from any URL, no browser infra on your side:

    # Endpoint and header name follow AgentQL's REST API reference; confirm against the current docs.
    curl -X POST https://api.agentql.com/v1/query-data \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/search?q=headphones",
        "query": "{ results { products[] { product_name product_price(include currency symbol) } } }"
      }'
    
  • AgentQL IDE browser extension & Playground
    For debugging and iteration: open any page, tweak your query in real‑time, and see the JSON output update instantly. Once it looks right, drop that same query into your code.
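
If the pipeline is written in Python, the REST request can also be built with the standard library. This is a sketch only: the `url` and `query` body keys come from the curl example above, while the endpoint and authentication header are deliberately left as parameters to copy from AgentQL's API reference. `build_request` and `extract` are hypothetical helper names, and the values passed below are placeholders.

```python
import json
import urllib.request


def build_request(endpoint: str, auth_headers: dict, page_url: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the page URL and the query."""
    body = json.dumps({"url": page_url, "query": query}).encode("utf-8")
    headers = {"Content-Type": "application/json", **auth_headers}
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def extract(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response; raises on HTTP errors."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(
    endpoint="https://api.example.invalid/extract",  # placeholder, not a real endpoint
    auth_headers={"X-Placeholder-Auth": "YOUR_API_KEY"},  # placeholder header name
    page_url="https://example.com/search?q=headphones",
    query="{ results { products[] { product_name product_price(include currency symbol) } } }",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the payload easy to unit-test without network access.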

Common Mistakes to Avoid

  • Treating AgentQL like XPath with extra steps:
    If you try to re‑encode your existing selectors into the query, you miss the point. Instead of mirroring your DOM, think in terms of the data you want: names, prices, dates, descriptions.
    How to avoid it: Always start from your downstream schema (what your database or LLM needs) and let AgentQL handle the mapping.

  • Baking page‑specific assumptions into your schema:
    If you hard‑code fields that only exist on one particular page variant, your query becomes less reusable across similar pages.
    How to avoid it: Keep queries focused on stable concepts (e.g., job_title, company_name, location) and let optional fields be optional. That’s how you get self‑healing behavior when pages change.
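
One way to keep optional fields optional on the consuming side is to separate required from optional keys when normalizing each record. The sketch below uses the job-listing fields mentioned above; `salary_range` is a made-up optional field and `normalize_job` is a hypothetical helper.

```python
REQUIRED = ("job_title", "company_name", "location")
OPTIONAL = ("salary_range",)  # hypothetical field that only some page variants show


def normalize_job(raw: dict) -> dict:
    """Return a record with every expected key; absent optional fields become None."""
    missing = [key for key in REQUIRED if not raw.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    record = {key: raw[key] for key in REQUIRED}
    record.update({key: raw.get(key) for key in OPTIONAL})
    return record


print(normalize_job({"job_title": "Data Engineer", "company_name": "Acme", "location": "Remote"}))
```

Downstream code then sees the same keys on every record, whether or not a given page variant exposed the optional data.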

Real‑World Example

Say you’re building a market‑intelligence pipeline that scrapes product listings every night from a dozen e‑commerce sites. The old setup:

  • Playwright scripts with long XPath chains (/html/body/div[3]/div[2]/div[1]/div[2]/span).
  • Hand‑rolled parsing and post‑processing per domain.
  • Weekly breakages when frontends shipped new components or changed class names.
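
This failure mode is easy to reproduce. In the small sketch below (standard library only, with made-up markup), a positional path returns the price on the old layout and silently returns the wrong text once a banner is added above it.

```python
import xml.etree.ElementTree as ET

OLD_LAYOUT = (
    "<html><body>"
    "<div><span>Headphones</span></div>"
    "<div><span>$249.99</span></div>"
    "</body></html>"
)
# Same page after a redesign inserts a promo banner as the first div.
NEW_LAYOUT = (
    "<html><body>"
    "<div><span>Free shipping</span></div>"
    "<div><span>Headphones</span></div>"
    "<div><span>$249.99</span></div>"
    "</body></html>"
)

PRICE_PATH = "./body/div[2]/span"  # "the span in the second div": position, not meaning


def price_by_position(markup: str):
    """Look up the price purely by where it sits in the tree."""
    node = ET.fromstring(markup).find(PRICE_PATH)
    return node.text if node is not None else None


print(price_by_position(OLD_LAYOUT))  # $249.99
print(price_by_position(NEW_LAYOUT))  # Headphones
```

Note that the second call does not raise an error. It returns plausible-looking but wrong data, which is why this kind of breakage often goes unnoticed until someone inspects the output.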

Switching to AgentQL, your job becomes: define one extraction query per “page type,” not per DOM version. For example, a “category page” query:

{
  category_page {
    category_name
    products[] {
      product_name
      product_price(include currency symbol)
      availability_status
    }
  }
}

You attach this to all category URLs across multiple brands. When one site changes its layout, AgentQL’s AI re‑interprets the DOM and still finds product names and prices by semantic context, not by exact node paths. Your nightly job continues to emit consistent JSON without rewriting selectors.
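
A nightly job built on this idea might look roughly like the sketch below. The `extract` callable is injected so the loop does not assume any particular SDK or endpoint; `run_nightly`, `fake_extract`, and the example URLs are all hypothetical. A failing site is recorded and skipped rather than aborting the run.

```python
from typing import Callable

CATEGORY_QUERY = """
{
  category_page {
    category_name
    products[] {
      product_name
      product_price(include currency symbol)
      availability_status
    }
  }
}
"""


def run_nightly(urls: list[str], extract: Callable[[str, str], dict]) -> tuple[list[dict], dict]:
    """Apply one query to every category URL; collect rows and per-URL errors."""
    rows, errors = [], {}
    for url in urls:
        try:
            page = extract(url, CATEGORY_QUERY)["category_page"]
        except Exception as exc:  # keep going if one site fails
            errors[url] = repr(exc)
            continue
        for product in page["products"]:
            rows.append({"source_url": url, "category": page["category_name"], **product})
    return rows, errors


def fake_extract(url: str, query: str) -> dict:
    """Stand-in for a real extraction call so the sketch runs offline."""
    if "broken" in url:
        raise RuntimeError("site unreachable")
    return {"category_page": {"category_name": "Audio", "products": [
        {"product_name": "Wireless Earbuds", "product_price": "$129.00",
         "availability_status": "In stock"}
    ]}}


rows, errors = run_nightly(
    ["https://shop-a.example/audio", "https://broken.example/audio"], fake_extract
)
print(len(rows), list(errors))
```

Swapping `fake_extract` for a real SDK or REST call leaves the loop unchanged, since every site is expected to return the same JSON shape.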

Pro Tip: Use the browser extension or Playground to tune your queries on live pages first. Once the JSON looks right on a few representative URLs, you can safely drop that query into your Playwright or REST API workflow and reuse it across similar pages.

Summary

You don’t need to write custom parsing code every time you want structured JSON from a webpage. With AgentQL, you define the shape of your data in a query, let AI analyze the page’s structure instead of relying on fragile XPath or DOM/CSS selectors, and receive clean, schema‑first JSON that stays consistent even as pages change.

This approach makes the web more “API‑like” for your agents and data pipelines: fewer broken scrapers, less time spent crunching HTML, and more time feeding reliable JSON into your databases, LLMs, and automations.
