
How do I scrape authenticated pages (logged-in dashboards) in a reliable way?
Most teams figure out how to scrape public pages long before they get authenticated dashboards working reliably. Logins, expiring sessions, 2FA, and constantly changing DOMs turn “just run Playwright” into a maintenance job. The goal isn’t just to log in once—it’s to keep scraping logged‑in dashboards in a reliable, low‑touch way that doesn’t break every time the UI changes.
Quick Answer: The most reliable way to scrape authenticated pages is to treat login as a first‑class step in your automation (via Playwright or a headless browser), then extract data via a schema‑first layer like AgentQL instead of brittle XPath/DOM selectors. You persist auth state (cookies/tokens), reuse it across runs, and let AgentQL’s AI selectors self‑heal when the dashboard layout changes.
Why This Matters
Logged‑in dashboards are where the important data actually lives: revenue, ops metrics, CRM records, marketplace analytics, admin tools. If your scraping pipeline keeps failing whenever a vendor tweaks their UI, you lose monitoring, break analytics, and waste engineering time debugging selectors and re‑recording flows.
Reliable scraping of authenticated dashboards means:
- Your data workflows stay consistent even as front‑end teams ship new designs.
- You can connect LLMs and AI agents to “the real” data in your internal tools, not just public marketing pages.
- You move away from “screenscraper as a fragile script” toward something closer to an API contract: query → JSON.
Key Benefits:
- Fewer broken scripts: Replace fragile XPath/DOM selectors with an AI‑driven, self‑healing querying layer that understands page structure.
- Production‑ready reliability: Persist sessions, handle logins deliberately, and keep the same queries running across dashboard UI changes.
- LLM‑ready outputs: Get clean JSON from authenticated pages so agents can ground on structured data instead of reams of raw HTML.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Authenticated scraping | Programmatically accessing pages behind login (cookies, tokens, sessions) using a browser automation layer like Playwright. | Dashboards and internal tools usually sit behind auth; you can’t rely on simple requests.get() for these flows. |
| Schema-first extraction | Defining the shape of the data you want (fields, arrays, types) before scraping, then mapping the page into that schema. | Treats scraping like an API contract; ensures stable JSON outputs even when the UI changes. |
| Self-healing selectors (AgentQL) | Letting AI analyze the page’s structure to find your data rather than hard‑coding XPath/CSS selectors. | Provides consistent results despite dynamic content and page changes, cutting down on selector maintenance. |
How It Works (Step-by-Step)
At a high level, a reliable pipeline for scraping authenticated dashboards looks like:
- Automate login once with Playwright.
- Persist and reuse auth state (cookies, local storage).
- Use AgentQL queries to turn dashboards into JSON.
Here’s how each step works in practice.
1. Automate login with a headless browser
Use Playwright (via Python or JavaScript) as your browser automation layer. You’ll:
- Navigate to the login page.
- Fill in credentials from a secure store (env vars, vault, secret manager).
- Handle 2FA if needed (more on that below).
- Save the authenticated browser state for reuse.
Python (Playwright), first login run:
import os
from playwright.sync_api import sync_playwright

# Credentials come from the environment (or a vault), never from the script.
# The variable names DASHBOARD_EMAIL and DASHBOARD_PASSWORD are placeholders.
EMAIL = os.environ["DASHBOARD_EMAIL"]
PASSWORD = os.environ["DASHBOARD_PASSWORD"]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()

    # 1) Log in
    page.goto("https://example-dashboard.com/login")
    page.fill("input[name='email']", EMAIL)
    page.fill("input[name='password']", PASSWORD)
    page.click("button[type='submit']")
    page.wait_for_url("**/dashboard")

    # 2) Persist auth state (cookies and local storage) for later runs
    context.storage_state(path="auth_state.json")
    browser.close()
You only need to rerun this when the saved session stops working (password rotation, cookie expiry), or on a schedule as part of routine operations.
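To decide when a fresh login is due, one option is to inspect the saved file before each scrape. The sketch below assumes auth_state.json follows Playwright's storage-state layout: a cookies list whose entries carry a name and an expires Unix timestamp, with -1 for session cookies. The function name auth_state_needs_refresh, the cookie name you pass in, and the five-minute margin are illustrative assumptions, not part of Playwright or AgentQL.

```python
import json
import time
from pathlib import Path


def auth_state_needs_refresh(path, cookie_name, now=None, margin_seconds=300):
    """Return True if the saved state is missing, lacks the cookie, or the cookie is near expiry."""
    state_file = Path(path)
    if not state_file.exists():
        return True  # no saved session yet

    state = json.loads(state_file.read_text())
    now = time.time() if now is None else now

    for cookie in state.get("cookies", []):
        if cookie.get("name") != cookie_name:
            continue
        expires = cookie.get("expires", -1)
        if expires == -1:
            # Session cookie: the file carries no expiry, so it cannot be judged here.
            return False
        return expires - now < margin_seconds

    return True  # the session cookie is not in the file at all
```

A session cookie without an expiry cannot be checked this way, so a check on the live page (for example, noticing a redirect back to the login form) is still worth keeping alongside it.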
2. Reuse auth state for ongoing scraping
On subsequent runs, start Playwright with the saved auth_state.json so you don’t perform login every time. Skipping the login form removes a common point of failure and makes it less likely that you trip rate limits or suspicious‑activity detection.
import agentql
from playwright.sync_api import sync_playwright

# Assumes the current AgentQL Python SDK, which reads its key from the
# AGENTQL_API_KEY environment variable.

AGENTQL_QUERY = """
{
    dashboard {
        kpis[] {
            name
            value
            trend_direction
        }
        last_updated
    }
}
"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # Reuse the login session saved by the first run
    context = browser.new_context(storage_state="auth_state.json")
    # Wrap the Playwright page so it gains AgentQL's query methods
    page = agentql.wrap(context.new_page())
    page.goto("https://example-dashboard.com/dashboard", wait_until="networkidle")
    # AgentQL: the query defines the shape of the returned data
    result = page.query_data(AGENTQL_QUERY)
    print(result)
    browser.close()
Here, the work shifts from “find this DOM node” to “these are the fields I want in my JSON.”
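Because the query acts as a contract, it can help to verify each result against it before passing data downstream. The following is a minimal sketch that assumes the result arrives as a plain Python dict mirroring the query above. The names check_dashboard_result and REQUIRED_KPI_FIELDS are hypothetical helpers, not part of the AgentQL SDK.

```python
REQUIRED_KPI_FIELDS = ("name", "value", "trend_direction")


def check_dashboard_result(result):
    """Raise ValueError if the result does not match the shape the query asked for."""
    dashboard = result.get("dashboard")
    if not isinstance(dashboard, dict):
        raise ValueError("missing 'dashboard' object")

    kpis = dashboard.get("kpis")
    if not isinstance(kpis, list) or not kpis:
        raise ValueError("'kpis' is missing or empty")

    for index, kpi in enumerate(kpis):
        if not isinstance(kpi, dict):
            raise ValueError(f"kpi {index} is not an object")
        # Treat None and empty strings as missing values.
        missing = [field for field in REQUIRED_KPI_FIELDS if kpi.get(field) in (None, "")]
        if missing:
            raise ValueError(f"kpi {index} is missing fields: {missing}")

    return result
```

Failing loudly at this point is a deliberate choice: an empty or partial result is easier to debug when it stops the run than when it flows into a report unnoticed.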
3. Use AgentQL for schema-first, self-healing extraction
Instead of hand‑wiring XPath like:
//*[@id="root"]/div[2]/div[1]/div[1]/span[2]
you tell AgentQL what JSON you want back and let it analyze the page structure.
AgentQL query:
{
  dashboard {
    kpis[] {
      metric_name
      metric_value
      metric_change(include percentage symbol)
    }
    filters[] {
      filter_label
      selected_value
    }
  }
}
AgentQL runs this query in the browser context (via the SDK or REST API) and returns structured JSON:
{
  "dashboard": {
    "kpis": [
      {
        "metric_name": "Monthly Recurring Revenue",
        "metric_value": "$128,430",
        "metric_change": "+7.2%"
      },
      {
        "metric_name": "Active Users",
        "metric_value": "14,203",
        "metric_change": "+3.9%"
      }
    ],
    "filters": [
      {
        "filter_label": "Time range",
        "selected_value": "Last 30 days"
      },
      {
        "filter_label": "Region",
        "selected_value": "All"
      }
    ]
  }
}
If the dashboard UI shifts from left sidebar to top nav or changes class names, the query stays the same. AgentQL uses AI to analyze the page’s structure to find metric_name, metric_value, etc., making it a robust alternative to fragile XPath and DOM/CSS selectors.
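The values in the sample response are display strings such as "$128,430" and "+7.2%". If downstream code needs numbers, a small typed layer can sit on top of the JSON. This sketch assumes US‑style formatting (comma thousands separators, a dollar sign, a trailing percent sign) and the field names from the query above. Kpi, parse_number, and parse_kpis are illustrative names, not AgentQL features.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Kpi:
    name: str
    value: Decimal
    change_percent: Decimal


def parse_number(text):
    """Strip the dollar sign, thousands separators, and percent sign, then parse."""
    cleaned = text.replace(",", "").replace("$", "").replace("%", "").strip()
    return Decimal(cleaned)


def parse_kpis(result):
    """Convert the extracted KPI rows into typed records."""
    return [
        Kpi(
            name=row["metric_name"],
            value=parse_number(row["metric_value"]),
            change_percent=parse_number(row["metric_change"]),
        )
        for row in result["dashboard"]["kpis"]
    ]
```

Decimal is used instead of float so that currency values keep their exact digits. A dashboard that formats numbers differently (for example "1.2M" or European separators) would need its own parsing rules.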
Common Mistakes to Avoid
- Hard‑coding fragile selectors for dashboards:
  - How to avoid it: Don’t treat dashboard scraping like a one‑off script. Use a schema‑first query layer (AgentQL) instead of baking in CSS/XPath tied to a specific layout. When the UI changes, you ideally leave the query alone or adjust the schema, not dozens of selectors.
- Ignoring session lifecycle and 2FA:
  - How to avoid it: Explicitly design around auth. Persist storage_state, implement refresh logic when it expires, and decide how you’ll handle 2FA (long‑lived session, service account, or human‑in‑the‑loop bootstrap for tokens). Don’t rely on “it works on my laptop” cookies embedded directly in scripts.
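The refresh logic mentioned above can be kept separate from the browser code. The sketch below is one possible shape, under stated assumptions: the dashboard redirects expired sessions to a /login path, and you supply two callables of your own, scrape (which raises SessionExpired when it lands on the login page) and login (which rewrites the saved auth state). SessionExpired, is_login_redirect, and scrape_with_refresh are hypothetical names.

```python
from urllib.parse import urlparse


class SessionExpired(Exception):
    """Raised by the scrape step when the dashboard bounces to the login page."""


def is_login_redirect(final_url, login_path="/login"):
    """Return True if the URL the browser ended up on is the login page."""
    return urlparse(final_url).path.rstrip("/") == login_path


def scrape_with_refresh(scrape, login, max_logins=1):
    """Run scrape(); on SessionExpired, run login() and retry, at most max_logins times."""
    logins = 0
    while True:
        try:
            return scrape()
        except SessionExpired:
            if logins >= max_logins:
                raise  # give up instead of hammering the login form
            login()
            logins += 1
```

Inside your scrape function, a line such as `if is_login_redirect(page.url): raise SessionExpired()` after page.goto would connect this to Playwright. Capping the number of login attempts keeps a broken credential from turning into repeated failed logins.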
Real-World Example
Imagine you’re scraping an internal analytics dashboard only available to logged‑in users. The front‑end team ships redesigns every few weeks: KPIs move, CSS classes change, a new sidebar appears. With a traditional Playwright + XPath combo, every release breaks your extraction, and your team spends time:
- Re‑recording selectors.
- Patch‑testing scripts across multiple dashboard variants.
- Explaining why last week’s data pipeline silently dropped a metric.
With a schema‑first setup using AgentQL:
- You log in once with Playwright, store auth_state.json, and reuse it across 100 concurrent remote browser sessions if needed.
- You define your extraction contract in an AgentQL query (e.g., kpis[] { metric_name metric_value metric_change }).
- When the UI changes, AgentQL still returns your JSON because it’s using AI to read the page rather than relying on a specific DOM path.
If a major redesign truly changes the information architecture (not just styling and positioning), you update the query in the AgentQL Playground or IDE browser extension, test it live against the dashboard, and redeploy. The amount of “selector surgery” is drastically smaller, and your dashboards behave much more like stable APIs.
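One way to catch the "silently dropped a metric" failure described earlier is to compare metric names between consecutive runs. This is a minimal sketch that assumes both results are dicts in the dashboard.kpis shape shown in this article. diff_metric_names is an illustrative helper, and how you store the previous run is left open.

```python
def diff_metric_names(previous, current):
    """Return (dropped, added) metric names between two extraction results."""

    def names(result):
        return {kpi["metric_name"] for kpi in result["dashboard"]["kpis"]}

    before, after = names(previous), names(current)
    return sorted(before - after), sorted(after - before)
```

A non‑empty dropped list after a dashboard release is a useful signal to open the query in the Playground and check whether the information architecture really changed.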
Pro Tip: Use the AgentQL browser extension to open your logged‑in dashboard, iterate on your query in real time until the JSON matches your ideal schema, then copy that query directly into your Playwright + SDK script. This tight feedback loop is the difference between guessing at selectors and operating with a verifiable contract.
Summary
To scrape authenticated pages and logged‑in dashboards in a reliable way, you need to treat three things as first‑class:
- Auth automation: Use Playwright to log in once, then persist and reuse auth state instead of logging in again on every scrape.
- Schema‑first extraction: Define the JSON shape you want up front so dashboards feel like APIs, not brittle HTML.
- Self‑healing selectors: Let AgentQL’s AI analyze page structure instead of maintaining XPath/CSS by hand, keeping your extractions stable despite dashboard UI changes.
This stack turns “scraping a logged‑in dashboard” from a brittle script into a repeatable, production‑grade data workflow that plays nicely with LLMs and other downstream systems.