Best tools for scraping logged-in sites with Playwright without brittle selectors

Most Playwright scraping setups work fine until you add authentication and dynamic UIs—then DOM/CSS/XPath selectors start breaking weekly, and “quick fixes” turn into a maintenance job. If you want to scrape logged‑in sites reliably without babysitting brittle selectors, you need tools that can both handle auth and locate elements in a more resilient, semantic way.

Quick Answer: The best stack for scraping logged‑in sites with Playwright without brittle selectors combines Playwright for session/auth handling and navigation, and an AI‑driven selector/query layer like AgentQL for resilient element targeting and structured JSON extraction. Supplement this with a browserless runner, a good debugger/IDE for your queries, and robust storage for cookies/sessions.

Why This Matters

For authenticated web apps—dashboards, SaaS tools, internal portals—the HTML is noisy, class names are often obfuscated, and layouts change frequently. Hard‑coded selectors (XPath, CSS, DOM traversal) work for a week, then silently fail and corrupt your downstream data.

A more robust approach treats the page like an API: define the output shape (schema), let a smarter selector layer find the data, and reuse the same query across similar pages—even as the UI evolves.

Key Benefits:

  • Far fewer broken scripts: AI‑driven, “self‑healing” selectors survive UI changes better than manual XPath/CSS.
  • Cleaner, ready‑to‑use data: Extract structured JSON directly instead of crunching reams of HTML inside your LLMs or ETL.
  • Faster iteration: Debug your queries live in a browser and roll them out across many similar logged‑in pages.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| Selector robustness | The ability of your element locators to keep working despite DOM, class, or layout changes. | Logged‑in sites ship new UI constantly; fragile selectors multiply maintenance and breakage risk. |
| Schema‑first extraction | Defining the exact JSON shape you want (fields, arrays, nesting) before scraping. | Turns web pages into predictable “APIs,” simplifies downstream ETL, and plays nicely with LLM grounding. |
| AI‑driven querying (AgentQL) | Using an AI layer to analyze page structure and fill your requested JSON schema instead of hand‑writing selectors. | Replaces fragile XPath/DOM/CSS, makes selectors self‑healing, and lets the same query work across similar pages. |

How It Works (Step‑by‑Step)

At a high level, a modern, low‑maintenance stack for logged‑in scraping looks like this:

  1. Use Playwright for auth and navigation
  2. Hand off extraction to a robust selector/query layer (AgentQL)
  3. Return structured JSON, not HTML

Below is how that actually plays out in practice.

1. Use Playwright for authentication and sessions

Playwright is still the right tool to:

  • Log in with username/password, SSO, or multi‑step flows
  • Persist and reuse sessions (cookies, local storage)
  • Navigate through authenticated pages and handle redirects

You can:

  • Use the Playwright Test runner (@playwright/test) in Node
  • Or run the Playwright library from your own Python/Node script

Example (Node.js with TypeScript) for logging in and saving storage state:

import { chromium } from 'playwright';

(async () => {
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext();
    const page = await context.newPage();

    // Log in through the site's form; credentials come from the environment
    await page.goto('https://example-portal.com/login');
    await page.fill('input[name="email"]', process.env.USER_EMAIL!);
    await page.fill('input[name="password"]', process.env.USER_PASSWORD!);
    await page.click('button[type="submit"]');
    await page.waitForURL('**/dashboard');

    // Save session (cookies and local storage) for reuse
    await context.storageState({ path: 'storage-state.json' });
  } finally {
    // Close the browser even if the login fails partway
    await browser.close();
  }
})();

Now you can reuse this storage-state.json in later scraping runs without re‑logging in.
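
Before reusing a saved session, it can help to check whether its cookies have already expired. The following is a minimal sketch, assuming the storage-state file has been read and parsed already; the cookie name session_id and the helper isSessionFresh are hypothetical, and only the cookies array with its expires field (Unix seconds, -1 for session cookies) comes from Playwright's storage-state format.

```typescript
// The parts of Playwright's storage-state file that this check reads.
interface StoredCookie {
  name: string;
  domain: string;
  expires: number; // Unix time in seconds; -1 marks a session cookie
}

interface StorageState {
  cookies: StoredCookie[];
}

/**
 * Returns true when every cookie named in `requiredCookies` is present
 * and not yet expired at `nowSeconds`.
 */
function isSessionFresh(
  state: StorageState,
  requiredCookies: string[],
  nowSeconds: number = Date.now() / 1000,
): boolean {
  return requiredCookies.every((name) => {
    const cookie = state.cookies.find((c) => c.name === name);
    if (!cookie) return false; // cookie missing: the session is unusable
    if (cookie.expires === -1) return true; // session cookie: no expiry recorded
    return cookie.expires > nowSeconds;
  });
}

// Usage sketch: 'session_id' is a hypothetical cookie name.
const exampleState: StorageState = {
  cookies: [
    { name: 'session_id', domain: 'example-portal.com', expires: 1900000000 },
  ],
};
const sessionLooksFresh = isSessionFresh(exampleState, ['session_id'], 1800000000);
```

A passing check only means the cookie has not expired locally. The server can still invalidate the session earlier, so a scraper should also fall back to the login flow when it lands on the login page.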

2. Attach AgentQL for resilient, schema‑first extraction

Instead of writing selectors like:

const price = await page.textContent(
  'div.product-card:nth-of-type(1) span.price'
);

you define the data you want as an AgentQL query. AgentQL uses AI to analyze the page’s structure and find those fields for you—no manual CSS/XPath.

Example AgentQL query for a logged‑in dashboard listing invoices:

{
  invoices[] {
    invoice_id
    invoice_date
    client_name
    total_amount(include currency symbol)
    status
  }
}

Run that via the AgentQL JavaScript or Python SDK (Playwright‑based) and you get clean JSON back:

{
  "invoices": [
    {
      "invoice_id": "INV-3421",
      "invoice_date": "2026-04-01",
      "client_name": "Acme Corp",
      "total_amount": "$4,120.00",
      "status": "Paid"
    },
    {
      "invoice_id": "INV-3422",
      "invoice_date": "2026-04-08",
      "client_name": "Globex Inc.",
      "total_amount": "$1,980.00",
      "status": "Pending"
    }
  ]
}

You don’t care where those elements sit in the DOM or what their classes are. AgentQL is a robust alternative to fragile XPath and DOM/CSS selectors.

Wiring AgentQL into Playwright (Node.js example)

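Once you have your Playwright context/session, you can run an AgentQL query against any authenticated page. The import path, the AgentQLClient class, and the queryPage method below are illustrative placeholders rather than the SDK's confirmed API; check the AgentQL SDK documentation for the actual names: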

import { chromium } from 'playwright';
import { AgentQLClient } from '@agentql/js'; // hypothetical SDK import

const agentql = new AgentQLClient({
  apiKey: process.env.AGENTQL_API_KEY!,
});

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    storageState: 'storage-state.json', // reuse logged-in session
  });

  const page = await context.newPage();
  await page.goto('https://example-portal.com/invoices');

  const query = `
    {
      invoices[] {
        invoice_id
        invoice_date
        client_name
        total_amount(include currency symbol)
        status
      }
    }
  `;

  const result = await agentql.queryPage({
    page,
    query,
  });

  console.log(JSON.stringify(result, null, 2));
  await browser.close();
})();

The same pattern works in Python via the AgentQL Python SDK.

3. Iterate fast with the AgentQL IDE and Playground

To avoid edit‑run‑wait cycles:

  • Install the AgentQL browser extension (IDE) to debug queries directly on any page—including logged‑in pages while you’re inspecting them.
  • Use the Playground to test queries against URLs (public or private, depending on your setup) and see the JSON output immediately.

You can refine queries like:

{
  user_profile {
    full_name
    job_title
    company
    email
  }
}

until you get exactly what you need, then paste that query into your Playwright + AgentQL code.
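
If you keep many such queries in code, one option is to store the field list as data and render the query text from it, so the query and your TypeScript types stay side by side. This is a rough sketch that only reproduces the query layout shown in this article; FieldSpec, renderFields, and renderQuery are hypothetical helpers, not part of any AgentQL SDK.

```typescript
// A field is either a leaf (optionally with a natural-language hint)
// or a nested object / list of objects.
type FieldSpec =
  | { kind: 'leaf'; hint?: string }
  | { kind: 'object' | 'list'; fields: Record<string, FieldSpec> };

// Renders one nesting level, one field per line, at the given indentation.
function renderFields(fields: Record<string, FieldSpec>, indent: string): string {
  return Object.entries(fields)
    .map(([name, spec]) => {
      if (spec.kind === 'leaf') {
        return `${indent}${name}${spec.hint ? `(${spec.hint})` : ''}`;
      }
      const suffix = spec.kind === 'list' ? '[]' : '';
      const body = renderFields(spec.fields, indent + '  ');
      return `${indent}${name}${suffix} {\n${body}\n${indent}}`;
    })
    .join('\n');
}

// Wraps the top-level fields in the outer braces of a query.
function renderQuery(fields: Record<string, FieldSpec>): string {
  return `{\n${renderFields(fields, '  ')}\n}`;
}

// Rebuilds the user_profile query from the example above.
const profileQuery = renderQuery({
  user_profile: {
    kind: 'object',
    fields: {
      full_name: { kind: 'leaf' },
      job_title: { kind: 'leaf' },
      company: { kind: 'leaf' },
      email: { kind: 'leaf' },
    },
  },
});
```

The rendered string can be pasted into the IDE for debugging, then passed to whatever query call your SDK version provides.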

Best Tools for Scraping Logged‑in Sites Without Brittle Selectors

Here’s how I’d evaluate your options for scraping logged‑in sites with Playwright without brittle selectors:

  1. Playwright (core runtime)

    • Handles login flows, cookies, storage state, navigation.
    • Language support: TypeScript/JavaScript, Python, .NET, Java.
    • Strength: robust browser automation for authenticated flows.
    • Weakness: vanilla Playwright locators, whether CSS, XPath, or role‑ and text‑based, are still tied to the page’s markup and visible labels; on third‑party apps you need something on top for robustness.
  2. AgentQL (AI‑driven selectors + JSON extraction)

    • What it does:
      • Connects Playwright to an AI selector layer that analyzes page structure to find data.
      • Lets you define the shape of your data with an AgentQL query.
      • Returns structured JSON instead of HTML.
    • Surfaces:
      • Python and JavaScript SDKs (Playwright‑based)
      • Browserless REST API: send URL (+ cookies/auth) → get JSON
      • Browser extension IDE & Playground
    • Key properties:
      • Works on any page (public or private, any URL, even behind authentication).
      • Self‑healing selectors: consistent results despite dynamic content and page changes.
      • Reusable code: the same query works across multiple similar pages.
    • Where it shines:
      • You’re tired of maintaining XPath/CSS.
      • You’re building LLM‑powered agents that need clean JSON from logged‑in web apps.
      • You want schema‑first scraping: query → JSON → downstream pipeline.
  3. Browserless / hosted Playwright runners

    • Examples: browserless.io, custom remote Chromium grids, etc.
    • Use case:
      • Scale out your Playwright + AgentQL workflows without running your own browser fleet.
      • Run headless, concurrent sessions with limits like calls/minute and concurrency controls.
    • Relevance:
      • For teams scraping many logged‑in accounts in parallel, a hosted runner plus AgentQL is a strong pattern.
  4. Storage state & credential vaults

    • You’ll want:
      • Secure storage for storage-state.json files or similar.
      • A secret manager for credentials (Vault, AWS Secrets Manager, etc.).
    • Not a “selector tool,” but critical to logged‑in scraping at scale.
  5. Traditional selector helpers (Playwright locators, test IDs)

    • Useful when:
      • You control the target application and can add data-testid attributes.
      • You want explicit coupling between UI and scraping.
    • Less useful when:
      • You’re scraping third‑party SaaS where you can’t add stable attributes.
    • In those cases, AgentQL’s AI‑driven analysis is more resilient.
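
For the hosted-runner option above, concurrency limits usually have to be respected on the client side as well. Below is a minimal sketch of a promise pool that caps how many scrapes run at once; mapWithConcurrency and scrapeAccount are hypothetical names, and scrapeAccount is a stub standing in for a function that opens a context with one account's storage state and runs the query.

```typescript
/**
 * Runs `worker` over `items` with at most `limit` calls in flight,
 * returning results in input order. The first rejection propagates
 * to the caller.
 */
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error('limit must be a positive integer');
  }
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims items one at a time until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Hypothetical stub: a real version would launch a browser context
// with this account's storage state and run the extraction query.
async function scrapeAccount(accountId: string): Promise<string> {
  return `scraped:${accountId}`;
}
```

A call such as mapWithConcurrency(accountIds, 3, scrapeAccount) would then keep at most three sessions open at a time, which is easier to align with a provider's concurrency quota than an unbounded Promise.all.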

How It Works (Step‑by‑Step) with AgentQL + Playwright

Here’s a complete end‑to‑end pattern you can adopt.

  1. Install dependencies

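    JavaScript (the @agentql/js package name follows the hypothetical import used in this article; confirm the published package name in the AgentQL docs):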

    npm install playwright @agentql/js
    

    Python:

    pip install playwright agentql
    playwright install chromium
    
  2. Capture a logged‑in Playwright session

    • Run a one‑off script to log in and save storage-state.json.
    • Store it securely; treat it like a credential.
  3. Design your AgentQL query (schema‑first)

    Ask yourself: “If this page were an API, what JSON would I want?”

    Example for a logged‑in analytics dashboard:

    {
      summary {
        date_range
        total_users
        active_users
        churn_rate
      }
      plans[] {
        name
        monthly_price(include currency symbol)
        active_subscriptions
      }
    }
    
  4. Test and refine in the AgentQL IDE / Playground

    • Open the logged‑in page in your browser.
    • Use the AgentQL browser extension to run your query.
    • Adjust field names until the JSON matches your expected shape.
  5. Run your script (Playwright + AgentQL)

    Node.js example:

    import { chromium } from 'playwright';
    import { AgentQLClient } from '@agentql/js'; // hypothetical SDK import
    
    const agentql = new AgentQLClient({ apiKey: process.env.AGENTQL_API_KEY! });
    
    const query = `
      {
        summary {
          date_range
          total_users
          active_users
          churn_rate
        }
        plans[] {
          name
          monthly_price(include currency symbol)
          active_subscriptions
        }
      }
    `;
    
    (async () => {
      const browser = await chromium.launch({ headless: true });
      const context = await browser.newContext({
        storageState: 'storage-state.json'
      });
    
      const page = await context.newPage();
      await page.goto('https://example-saas.com/dashboard');
    
      const result = await agentql.queryPage({ page, query });
      console.log(JSON.stringify(result, null, 2));
    
      await browser.close();
    })();
    
  6. Wire JSON into your pipeline or LLM

    At this point, you have a stable JSON contract. You can:

    • Store it in your data warehouse
    • Feed it into downstream ETL
    • Use it as structured grounding for your LLM agents (far better than dumping raw HTML into the context window)
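
Before the JSON reaches a warehouse or ETL job, string fields such as amounts usually need normalizing. The sketch below shows one way to do that for the invoice output shown earlier. It assumes a leading currency symbol, "," as the thousands separator, and "." as the decimal point; parseMoney and the row shape are hypothetical, not something AgentQL provides.

```typescript
interface Money {
  currency: string; // symbol as it appeared on the page, e.g. "$"
  amount: number; // numeric value, e.g. 4120
}

/**
 * Parses strings such as "$4,120.00" into a symbol and a number.
 * Throws on anything that does not match the assumed format.
 */
function parseMoney(raw: string): Money {
  const match = raw
    .trim()
    .match(/^([^\d\s.,-]+)\s*(\d{1,3}(?:,\d{3})*|\d+)(\.\d+)?$/);
  if (!match) {
    throw new Error(`Unrecognized money format: "${raw}"`);
  }
  const [, currency, whole, fraction = ''] = match;
  return { currency, amount: Number(whole.replace(/,/g, '') + fraction) };
}

// Sample data copied from the invoice JSON shown earlier in the article.
const extracted = {
  invoices: [
    { invoice_id: 'INV-3421', total_amount: '$4,120.00', status: 'Paid' },
    { invoice_id: 'INV-3422', total_amount: '$1,980.00', status: 'Pending' },
  ],
};

// Flatten each invoice into a row with a numeric amount column.
const invoiceRows = extracted.invoices.map((inv) => {
  const { currency, amount } = parseMoney(inv.total_amount);
  return { invoice_id: inv.invoice_id, status: inv.status, currency, amount };
});
```

Throwing on unrecognized input is deliberate: a loud failure at this step is preferable to a silently wrong number downstream. For accounting-grade data, storing integer minor units instead of floating-point amounts may be the safer choice.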

Common Mistakes to Avoid

  • Relying only on fragile XPath/CSS selectors:

    • Problem: They break whenever class names, order, or layout change.
    • How to avoid it: Delegate element identification to AI‑driven selectors via AgentQL; define what you want in JSON terms instead of DOM terms.
  • Scraping raw HTML and parsing it with your LLM:

    • Problem: Blows up context windows, increases hallucinations, and hides failure modes.
    • How to avoid it: Use schema‑first extraction with AgentQL so your Playwright run returns clean JSON by default.
  • Skipping session persistence for logged‑in sites:

    • Problem: Re‑running full login flows on every scrape is slow and brittle, and can trigger security flags.
    • How to avoid it: Use Playwright’s storageState to persist session cookies and re‑use them until they expire.

Real‑World Example

Say you’re scraping an authenticated B2B marketplace dashboard where each user sees private pricing and volume metrics. The layout shifts a few times a quarter, and classes are minified (.css-1a2b3c). Your old Playwright setup used selectors like:

await page.textContent('div.css-1a2b3c span:nth-of-type(2)');

Every minor CSS refactor broke your pipeline, and because the script still “ran,” you didn’t notice corrupted numbers until downstream.

With Playwright + AgentQL:

  1. You log in once, save storage-state.json.

  2. In the AgentQL IDE on the dashboard page, you define:

    {
      metrics {
        total_orders
        total_revenue(include currency symbol)
        average_order_value(include currency symbol)
        last_updated_at
      }
    }
    
  3. The AgentQL query returns:

    {
      "metrics": {
        "total_orders": 9234,
        "total_revenue": "$1,248,900.00",
        "average_order_value": "$135.23",
        "last_updated_at": "2026-04-11T22:15:00Z"
      }
    }
    
  4. A month later, the marketplace ships a redesign and the DOM structure changes. Your old CSS selectors would have failed, but the AgentQL query can keep producing the same JSON because the AI selector layer re‑interprets the updated page structure.

Pro Tip: Treat your AgentQL query as an API contract—version it, add tests that assert the JSON shape and types, and alert when fields go missing. You’ll catch real regressions (e.g., metrics removed) instead of chasing harmless UI changes.
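
One way to act on that tip is a small shape check that runs after every extraction. The sketch below validates the metrics object from the example against an expected contract and reports missing or mistyped fields; metricsContract and checkShape are hypothetical names, and the field types are taken from the sample JSON above rather than from any AgentQL guarantee.

```typescript
type FieldType = 'string' | 'number';

// Expected contract for the `metrics` object from the example above.
const metricsContract: Record<string, FieldType> = {
  total_orders: 'number',
  total_revenue: 'string',
  average_order_value: 'string',
  last_updated_at: 'string',
};

/**
 * Returns a list of human-readable problems;
 * an empty list means the shape matches the contract.
 */
function checkShape(
  value: unknown,
  contract: Record<string, FieldType>,
): string[] {
  if (typeof value !== 'object' || value === null) {
    return ['expected an object'];
  }
  const record = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const [field, expected] of Object.entries(contract)) {
    const actual = record[field];
    if (actual === undefined || actual === null) {
      problems.push(`missing field: ${field}`);
    } else if (typeof actual !== expected) {
      problems.push(
        `wrong type for ${field}: expected ${expected}, got ${typeof actual}`,
      );
    }
  }
  return problems;
}

// Sample result copied from the example above.
const sampleMetrics = {
  total_orders: 9234,
  total_revenue: '$1,248,900.00',
  average_order_value: '$135.23',
  last_updated_at: '2026-04-11T22:15:00Z',
};
const metricsProblems = checkShape(sampleMetrics, metricsContract);
```

A non-empty problem list is the signal to alert on. Wiring it into a scheduled test or a pipeline gate is left to whatever tooling you already use.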

Summary

Scraping logged‑in sites with Playwright without brittle selectors comes down to separating concerns:

  • Let Playwright handle browsers, auth, and navigation.
  • Let AgentQL handle element discovery and structured JSON extraction using AI‑driven, self‑healing selectors.
  • Use schema‑first queries so pages behave like stable APIs, even when the UI changes.

This combination dramatically reduces maintenance, gives you production‑ready JSON for your data and LLM workflows, and makes your Playwright scrapers resilient across logged‑in sites and private dashboards.
