
AgentQL Python SDK quickstart: integrate with Playwright and extract JSON from a page
Most developers discover the limits of web scraping the hard way: brittle XPath selectors, DOM changes breaking scripts, and LLMs choking on reams of HTML. The AgentQL Python SDK is built to avoid all of that—define the JSON you want, let AgentQL + Playwright find it, and ship extraction that survives layout changes.
Quick Answer: With the AgentQL Python SDK, you connect Playwright to any web page, define the shape of your data in an AgentQL query, and receive clean JSON instead of parsing HTML. Install the SDK, write a short script that uses page.query_data(...), and you’ll have structured JSON extraction running in a few minutes.
Why This Matters
If you’re wiring LLMs or agents to the web, DOM-level scraping becomes a bottleneck fast. Every new site means new XPath/CSS selectors, every redesign means another round of firefighting, and feeding raw HTML into LLMs is both expensive and unreliable.
AgentQL changes the workflow: you define the output contract (JSON schema) once, then let AgentQL’s AI analyze each page’s structure to locate the data—no manual DOM spelunking. The result is extraction that’s:
- more robust to UI changes,
- easier to plug into agents and pipelines,
- and simpler to debug via a browser-based playground instead of logging HTML.
Key Benefits:
- Schema-first extraction: Define the JSON shape you want in an AgentQL query; the SDK returns structured data, not HTML.
- Self-healing selectors: AgentQL uses AI to analyze page structure, acting as a robust alternative to fragile XPath/DOM/CSS selectors.
- Playwright-native workflow: Use Python + Playwright (via the AgentQL SDK) to interact with pages, extract data, and automate flows without rewriting DOM parsing for every site.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| AgentQL query | A schema-like description of the data you want (e.g. { products[] { product_name product_price(include currency symbol) } }). | Lets you define the shape of your JSON once and reuse it across similar pages instead of hand-writing selectors. |
| Python SDK + Playwright | AgentQL’s Python package that wraps Playwright to navigate pages and execute AgentQL queries. | Keeps your familiar Playwright workflow—navigate, click, wait—but replaces brittle scraping code with query_data(...). |
| Structured JSON output | The data AgentQL returns when a query runs against a loaded page or document. | Ready-made for LLM grounding, analytics pipelines, or databases, without HTML cleanup or ad-hoc parsing. |
How It Works (Step-by-Step)
At a high level, using the AgentQL Python SDK with Playwright looks like this:
- Install the SDK and Playwright.
- Write an AgentQL query that defines your desired JSON.
- Use Python + Playwright to open a page and run the query.
- Receive structured JSON that you can feed into your data pipeline or LLM.
1. Install the SDK
Install the Python SDK:

    pip3 install agentql

If you haven’t set up Playwright yet:

    pip3 install playwright
    playwright install

Initialize a sample AgentQL project (optional but recommended for scaffolding):

    agentql init

This sets up basic example scripts you can adapt to your use case.
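Before moving on, it can help to confirm the installation from Python itself. The sketch below is not part of the SDK: `missing_packages` and `check_environment` are hypothetical helpers, and the `AGENTQL_API_KEY` variable name is an assumption about how your setup supplies the key.

```python
import importlib.util
import os


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment(packages=("agentql", "playwright"), key_var="AGENTQL_API_KEY"):
    """Collect human-readable setup problems; an empty list means nothing was found."""
    problems = [f"package not installed: {name}" for name in missing_packages(packages)]
    # Assumption: the API key is supplied through this environment variable.
    if not os.environ.get(key_var):
        problems.append(f"environment variable not set: {key_var}")
    return problems


for problem in check_environment():
    print(problem)
```

If the script prints nothing, both packages are importable and the key variable is set. It does not verify that the Playwright browser binaries were downloaded.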
2. Define the data you want with an AgentQL query
AgentQL is schema-first: you define the structure of your output JSON, not how to find it in the DOM.
Example: imagine an e-commerce product listing page. You might want product names and prices with currency symbols:

    {
        products[] {
            product_name
            product_price(include currency symbol)
        }
    }

AgentQL then uses AI to analyze the page’s structure and map these fields to real elements, without you writing XPath or CSS.
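If you maintain several similar queries, one option is to keep the field lists in Python and render the query text from them. This is only a string-formatting sketch: `build_list_query` is a hypothetical helper, not an SDK function, and it covers only the one-level list shape shown above.

```python
def build_list_query(list_name, fields, indent="    "):
    """Render a one-level AgentQL list query such as `products[] { ... }`."""
    lines = ["{", f"{indent}{list_name}[] {{"]
    lines += [f"{indent * 2}{field}" for field in fields]
    lines += [f"{indent}}}", "}"]
    return "\n".join(lines)


# Field names are taken from the example query above.
PRODUCT_FIELDS = ["product_name", "product_price(include currency symbol)"]
QUERY = build_list_query("products", PRODUCT_FIELDS)
print(QUERY)
```

The printed text matches the hand-written query, so adding a field becomes a one-line change to `PRODUCT_FIELDS`.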
3. Write a Python script with Playwright + AgentQL
Below is a minimal end-to-end script that:
- launches a browser,
- visits a URL,
- evaluates an AgentQL query against the page,
- and prints the JSON.

    import asyncio
    import json

    import agentql
    from playwright.async_api import async_playwright

    # The SDK reads your API key from the AGENTQL_API_KEY environment variable.

    QUERY = """
    {
        products[] {
            product_name
            product_price(include currency symbol)
        }
    }
    """

    URL = "https://example.com/products"


    async def main():
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            # Wrap the Playwright page so it gains AgentQL's query methods
            page = await agentql.wrap_async(browser.new_page())
            await page.goto(URL, wait_until="networkidle")

            # Core: run the AgentQL query against the current page
            result = await page.query_data(QUERY)

            # `result` is a dict that mirrors the query's shape; print it as JSON
            print(json.dumps(result, indent=2))

            await browser.close()


    if __name__ == "__main__":
        asyncio.run(main())

On a typical product list page, the JSON output will look like:

    {
        "products": [
            {
                "product_name": "Stainless Steel Water Bottle",
                "product_price": "$24.99"
            },
            {
                "product_name": "Insulated Travel Mug",
                "product_price": "$18.50"
            }
        ]
    }

You never touch querySelector, XPath, or BeautifulSoup. You declare the shape once; AgentQL handles the rest.
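Because the price comes back as a string that includes the currency symbol, downstream code usually has to split it before doing arithmetic. The sketch below shows one way to do that. `parse_price` and `normalize_products` are hypothetical helpers, and the pattern assumes a leading symbol, a dot as the decimal separator, and commas as thousands separators.

```python
import re
from decimal import Decimal

# Optional leading symbol, then digits with optional thousands commas and decimals.
PRICE_PATTERN = re.compile(
    r"^\s*(?P<symbol>[^\d\s.,-]*)\s*(?P<amount>\d[\d,]*(?:\.\d+)?)\s*$"
)


def parse_price(text):
    """Split a string like "$24.99" into (currency_symbol, Decimal amount)."""
    match = PRICE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized price format: {text!r}")
    amount = Decimal(match.group("amount").replace(",", ""))
    return match.group("symbol"), amount


def normalize_products(result):
    """Return product dicts with the price split into symbol and numeric amount."""
    normalized = []
    for product in result.get("products", []):
        symbol, amount = parse_price(product["product_price"])
        normalized.append(
            {"name": product["product_name"], "currency": symbol, "amount": amount}
        )
    return normalized


# Sample data copied from the example output above.
sample = {
    "products": [
        {"product_name": "Stainless Steel Water Bottle", "product_price": "$24.99"},
        {"product_name": "Insulated Travel Mug", "product_price": "$18.50"},
    ]
}
for row in normalize_products(sample):
    print(row["name"], row["currency"], row["amount"])
```

`Decimal` is used instead of `float` so that later comparisons and sums do not pick up binary rounding errors.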
4. Test and refine in the AgentQL debugger
Before you wire queries into scripts, it’s often faster to refine them visually:
- Install the AgentQL browser extension (IDE / debugger).
- Open the target page in your browser.
- Use the debugger to:
  - write or paste your AgentQL query,
  - test it live,
  - see the JSON result immediately.
Once you like the output, copy the query into your Python script. This tight feedback loop saves you from repeated code edits and manual HTML inspection.
5. Run your script at scale
When you’re satisfied with your query:
- Run your script normally:

      python example.py

- Scale it up by:
  - driving multiple URLs,
  - running multiple concurrent Playwright contexts,
  - or using AgentQL’s browserless REST API for “URL → JSON” flows without managing browsers yourself.
Because AgentQL is built for consistent results despite dynamic content and layout changes, you can reuse the same query across similar pages and across time.
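One way to drive many URLs without opening too many pages at once is a semaphore-bounded `asyncio.gather`. The sketch below is a generic pattern, not an AgentQL feature: `extract_many` is a hypothetical helper that accepts any coroutine function taking a URL, and `fake_extract` is a stand-in that returns canned data so the example runs without a browser.

```python
import asyncio


async def extract_many(urls, extract, max_concurrency=3):
    """Run `extract(url)` for every URL with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await extract(url)
            except Exception as error:  # keep one bad page from failing the batch
                return url, error

    pairs = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(pairs)


async def fake_extract(url):
    """Stand-in for the real Playwright + AgentQL call; returns canned data."""
    await asyncio.sleep(0)
    if url.endswith("/broken"):
        raise RuntimeError("page failed to load")
    return {"products": [{"product_name": f"item from {url}"}]}


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/broken"]
results = asyncio.run(extract_many(urls, fake_extract, max_concurrency=2))
for url, outcome in results.items():
    status = "failed" if isinstance(outcome, Exception) else "ok"
    print(url, status)
```

In a real run you would replace `fake_extract` with a coroutine that opens a page, navigates to the URL, and returns the query result. Failures are returned alongside successes so the caller can decide which URLs to retry.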
Common Mistakes to Avoid
- Treating AgentQL like XPath: AgentQL is not a selector language, so don’t try to encode DOM paths in queries. Instead, describe the logical data entities you want (e.g. job_title, company_name, salary_range), and let AgentQL’s AI analyze the page structure.
- Skipping query refinement in the debugger: Writing queries directly in Python and “hoping it works” slows you down. Use the browser extension or Playground to iterate quickly, then paste the working query into your script.
- Overloading the query with too many fields at once: If you request a very large schema on a complex page, it becomes harder to debug when one field is off. Start with a minimal subset (e.g. name + price), confirm correctness, then expand.
- Feeding raw HTML to LLMs instead of JSON: Once you have structured JSON via AgentQL, resist the temptation to still send full HTML to your LLM. Use the JSON as the primary grounding data: it’s cheaper, more controllable, and reduces hallucinations.
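To illustrate the last point, here is a minimal sketch of building an LLM prompt from the extracted JSON rather than from page HTML. `build_grounded_prompt`, its prompt wording, and the `max_chars` limit are all illustrative assumptions, and no particular LLM client is implied.

```python
import json


def build_grounded_prompt(question, extracted, max_chars=4000):
    """Embed extracted JSON (not HTML) in a prompt, refusing oversized payloads."""
    # Compact separators keep the payload small.
    payload = json.dumps(extracted, ensure_ascii=False, separators=(",", ":"))
    if len(payload) > max_chars:
        raise ValueError(f"payload is {len(payload)} chars; trim fields or paginate")
    return (
        "Answer using only the JSON data below.\n"
        f"DATA: {payload}\n"
        f"QUESTION: {question}"
    )


extracted = {
    "products": [
        {"product_name": "Stainless Steel Water Bottle", "product_price": "$24.99"},
        {"product_name": "Insulated Travel Mug", "product_price": "$18.50"},
    ]
}
print(build_grounded_prompt("Which product is cheapest?", extracted))
```

The size check makes an unexpectedly large extraction fail loudly instead of silently inflating the prompt.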
Real-World Example
Suppose you’re building a price-monitoring system for a marketplace. Previously, you:
- wrote Playwright scripts that used CSS selectors like .product-card .title and .product-card .price,
- converted extracted text to structured objects with custom Python,
- watched everything break the moment the front-end team tweaked the layout or class names.
With the AgentQL Python SDK, the workflow becomes:
- Use the browser extension on a sample product list page.
- Define the query:

      {
          products[] {
              product_name
              product_price(include currency symbol)
              product_url
          }
      }

- Confirm the JSON output in the debugger:

      {
          "products": [
              {
                  "product_name": "Noise-Cancelling Headphones",
                  "product_price": "$129.99",
                  "product_url": "https://example.com/p/noise-cancelling-headphones"
              }
          ]
      }

- Paste that query into your Python + Playwright script and run it across all product category URLs.
Weeks later, when the marketplace tweaks its CSS or rearranges cards, your script still returns the same JSON structure. AgentQL’s AI re-analyzes the page instead of relying on brittle selectors. Your monitoring pipeline keeps running; you update queries only when the logical data model changes, not when the markup does.
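Once the JSON shape is stable, the monitoring logic itself reduces to comparing snapshots. The sketch below assumes each run yields the `products` structure shown above and uses `product_url` as the identity of a product. `index_by_url` and `diff_prices` are hypothetical helpers, and the second snapshot is invented sample data.

```python
def index_by_url(result):
    """Map product_url -> product dict for one extraction result."""
    return {p["product_url"]: p for p in result.get("products", [])}


def diff_prices(previous, current):
    """Return (kind, url, old_price, new_price) tuples between two snapshots."""
    old, new = index_by_url(previous), index_by_url(current)
    changes = []
    for url, product in new.items():
        if url not in old:
            changes.append(("added", url, None, product["product_price"]))
        elif old[url]["product_price"] != product["product_price"]:
            changes.append(
                ("changed", url, old[url]["product_price"], product["product_price"])
            )
    # Sort removed URLs so the report order is deterministic.
    for url in sorted(old.keys() - new.keys()):
        changes.append(("removed", url, old[url]["product_price"], None))
    return changes


yesterday = {"products": [
    {"product_name": "Noise-Cancelling Headphones", "product_price": "$129.99",
     "product_url": "https://example.com/p/noise-cancelling-headphones"},
]}
today = {"products": [
    {"product_name": "Noise-Cancelling Headphones", "product_price": "$119.99",
     "product_url": "https://example.com/p/noise-cancelling-headphones"},
]}
for change in diff_prices(yesterday, today):
    print(change)
```

Prices are compared as strings here. If formatting can vary between runs, convert them to numbers before comparing.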
Pro Tip: Treat your AgentQL queries like API contracts. Version them (e.g. product_list_v1, product_list_v2), write tests that assert the JSON schema, and only change queries when your downstream consumers need a new field or structure.
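A minimal sketch of that idea keeps each query version next to the field set it is expected to return and checks results against it. The registry layout and the `contract_violations` helper are illustrative assumptions, not an AgentQL feature.

```python
# Versioned queries; the names follow the product_list_v1 / v2 convention above.
QUERIES = {
    "product_list_v1": "{ products[] { product_name product_price(include currency symbol) } }",
    "product_list_v2": "{ products[] { product_name product_price(include currency symbol) product_url } }",
}

# Keys each version is expected to return for every product.
EXPECTED_FIELDS = {
    "product_list_v1": {"product_name", "product_price"},
    "product_list_v2": {"product_name", "product_price", "product_url"},
}


def contract_violations(version, result):
    """List the ways `result` deviates from the field set expected for `version`."""
    expected = EXPECTED_FIELDS[version]
    products = result.get("products")
    if not isinstance(products, list):
        return ["'products' is missing or not a list"]
    violations = []
    for index, product in enumerate(products):
        missing = expected - set(product)
        extra = set(product) - expected
        if missing:
            violations.append(f"products[{index}] missing {sorted(missing)}")
        if extra:
            violations.append(f"products[{index}] has unexpected {sorted(extra)}")
    return violations


sample = {"products": [{"product_name": "Noise-Cancelling Headphones",
                        "product_price": "$129.99",
                        "product_url": "https://example.com/p/noise-cancelling-headphones"}]}
print(contract_violations("product_list_v2", sample))
```

A test suite can assert that `contract_violations` returns an empty list for a recorded sample of each version, which catches accidental changes to a query before they reach downstream consumers.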
Summary
Using the AgentQL Python SDK with Playwright lets you skip DOM scraping and go straight from “URL” to “clean JSON” with a schema-first workflow. You:
- install the SDK,
- define your output shape in an AgentQL query,
- run that query against any loaded page via Playwright,
- and get consistent, self-healing JSON extraction without writing XPath/CSS selectors.
This approach plays nicely with LLM agents, analytics pipelines, and any system that expects stable, structured data—even as web pages change.