
AgentQL vs Import.io for authenticated scraping (logged-in flows) — what’s the practical setup?
Most logged-in scraping problems aren’t about “can this tool log in?”—they’re about how brittle your setup becomes once you’re inside the session. If you’re comparing AgentQL and Import.io for authenticated scraping, the real question is: how do they differ in day‑2 operations—cookie handling, flows with 2FA, self‑healing selectors, and plugging into your existing Playwright / LLM stack?
Quick Answer: Import.io gives you a hosted, point‑and‑click path into authenticated scraping, but you’ll still end up fighting selectors and layout drift as sites change. AgentQL treats logged‑in flows like any other Playwright‑based session: you handle authentication in code (cookies, login steps, SSO), and AgentQL turns the resulting page into structured JSON via AI‑powered “self‑healing” selectors that survive DOM and layout changes. For teams building web agents and production data pipelines, AgentQL generally wins on robustness, developer control, and LLM‑readiness.
Why This Matters
Once you move from public marketing pages to logged‑in dashboards—analytics tools, marketplaces, SaaS back‑offices—everything gets harder: anti‑bot systems, evolving UIs, dynamic tables, embedded PDFs, and multi‑step flows.
Choosing the wrong approach means:
- Scripts that break whenever a front‑end team tweaks the React tree
- LLM agents that hallucinate off raw HTML or blow their context windows
- Painful migrations when SSO, MFA, or session handling changes
AgentQL is built for this exact world: you connect via Playwright or REST, run your login flow however you want (credentials, cookies, headless, remote browser), then describe the output schema you need. AgentQL uses AI to analyze the page structure and return consistent JSON, even as the page evolves—without you hand‑maintaining XPath or CSS selectors.
Key Benefits:
- Schema‑first extraction: Define the JSON you want (e.g., `{ users[] { name email role } }`), not DOM selectors; AgentQL handles the page.
- Self‑healing logged‑in flows: As dashboards, tables, and UI components change, AgentQL’s AI locator adapts without you recoding selectors.
- LLM‑ready data: Instead of feeding models “reams of HTML,” you ground them on compact, structured JSON from authenticated pages, reducing hallucinations and context blowups.
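
To make "define the JSON you want" concrete, here is a minimal sketch of a helper that renders a plain JavaScript description of the desired shape as query text. `buildQuery`, `renderFields`, and the input format are hypothetical and not part of the AgentQL SDK; only the output syntax (`name[]` for lists, nested braces) is taken from the query examples in this article.

```javascript
// Hypothetical helper (not part of the AgentQL SDK): render a nested shape
// description as AgentQL-style query text.
//   - a string is a scalar field
//   - { name, fields } is a nested object
//   - { name, list: true, fields } is a list of objects, rendered as name[]
function renderFields(fields, indent) {
  const pad = '  '.repeat(indent);
  return fields
    .map((field) => {
      if (typeof field === 'string') return `${pad}${field}`;
      const label = field.list ? `${field.name}[]` : field.name;
      const body = renderFields(field.fields, indent + 1);
      return `${pad}${label} {\n${body}\n${pad}}`;
    })
    .join('\n');
}

function buildQuery(fields) {
  return `{\n${renderFields(fields, 1)}\n}`;
}

// The "users" example from the list above, expressed as data.
const usersQuery = buildQuery([
  { name: 'users', list: true, fields: ['name', 'email', 'role'] },
]);
console.log(usersQuery);
```

Keeping the shape as data is one way to share a single contract between the query and whatever validates the returned JSON; writing the query text by hand works just as well for small cases.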
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Schema‑first queries | In AgentQL, you define the shape of the data you want (fields, lists, nested objects) and send that as the query. | You stop thinking in CSS/XPath and start thinking in JSON contracts; this makes logged‑in scraping behave more like calling an API. |
| AI‑powered, self‑healing selectors | AgentQL analyzes the visual and structural layout of the page to locate elements, instead of relying on fragile DOM paths. | When a SaaS app changes its DOM, you keep using the same query; AgentQL adapts and continues returning consistent results. |
| Playwright & REST surfaces | AgentQL plugs into your existing Playwright flows via Python/JS SDKs, or runs browserless via a REST API that takes URLs and queries. | You can reuse your authentication logic (cookies, sessions, MFA flows) and still get clean JSON out, without rebuilding your stack around a hosted scraper UI. |
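
Reusing authentication logic usually starts with deciding whether a saved session is still usable. The sketch below inspects a saved Playwright storage state for that purpose. The helper name `hasFreshSession`, the cookie name `session_id`, and the five-minute safety margin are assumptions for illustration; the state shape (a `cookies` array whose `expires` is a Unix timestamp in seconds, with `-1` for session cookies) follows what Playwright's `browserContext.storageState()` returns.

```javascript
// Hypothetical helper: decide whether a saved Playwright storage state can be
// reused, or whether the login flow has to run again.
// `state` has the shape returned by browserContext.storageState():
//   { cookies: [{ name, value, domain, path, expires, ... }], origins: [...] }
function hasFreshSession(state, cookieName, nowSeconds, minRemainingSeconds = 300) {
  const cookie = (state.cookies || []).find((c) => c.name === cookieName);
  if (!cookie) return false;               // no session cookie saved at all
  if (cookie.expires === -1) return true;  // no client-side expiry; the server decides
  return cookie.expires - nowSeconds >= minRemainingSeconds;
}

// Example with a made-up saved state; 'session_id' is an assumed cookie name.
const savedState = {
  cookies: [
    { name: 'session_id', value: 'abc123', domain: 'your-saas.com', path: '/', expires: 1700003600 },
  ],
  origins: [],
};
const now = 1700000000;
console.log(hasFreshSession(savedState, 'session_id', now));        // one hour left
console.log(hasFreshSession(savedState, 'session_id', now + 3500)); // under five minutes left
```

A check like this only looks at client-side expiry. The server can still revoke a session early, so the login path needs to stay available as a fallback.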
How It Works (Step‑by‑Step)
1. Practical setup with AgentQL for authenticated flows
From a developer’s perspective, authenticated scraping with AgentQL looks like “Playwright + AgentQL query → JSON”:
- Install the SDK (Python or JS)
  Use the AgentQL SDK alongside Playwright to control a browser and run queries on the pages you reach.

      # JavaScript
      npm install agentql playwright
      # or Python
      pip install agentql playwright

- Log in using Playwright (your way)
  You own the authentication logic: credentials, cookies, SSO, MFA workarounds. For example (JS):

      const { chromium } = require('playwright');
      const { configure, wrap } = require('agentql');

      // Read the API key from the environment; never hard-code credentials.
      configure({ apiKey: process.env.AGENTQL_API_KEY });

      (async () => {
        const browser = await chromium.launch({ headless: true });
        // 1) Wrap the Playwright page so it gains AgentQL query methods
        const page = await wrap(await browser.newPage());

        // 2) Navigate to login
        await page.goto('https://your-saas.com/login');

        // 3) Perform login (these selectors are placeholders for your login form)
        await page.fill('#email', process.env.SAAS_EMAIL);
        await page.fill('#password', process.env.SAAS_PASSWORD);
        await page.click('button[type="submit"]');
        await page.waitForURL('**/dashboard');

        // 4) Run the query on the authenticated page
        const query = `
        {
          usage_summary {
            current_plan
            monthly_quota
            usage_percent
          }
        }
        `;
        const result = await page.queryData(query);
        console.log(JSON.stringify(result, null, 2));

        await browser.close();
      })();

  Returned JSON might look like:

      {
        "usage_summary": {
          "current_plan": "Pro",
          "monthly_quota": "10,000 API calls",
          "usage_percent": 47
        }
      }

  Apart from the login form, you never wrote an XPath or CSS selector for the data itself: AgentQL analyzed the dashboard page and located it.
- Define the shape of your data with AgentQL queries
  For a logged‑in “customers” table:

      {
        customers[] {
          name
          email
          company
          status
          last_login(include time)
        }
      }

  AgentQL returns:

      {
        "customers": [
          {
            "name": "Alex Romero",
            "email": "alex@example.com",
            "company": "Northwind Analytics",
            "status": "Active",
            "last_login": "2026-04-10 13:42 UTC"
          },
          {
            "name": "Priya Desai",
            "email": "priya@example.com",
            "company": "Acme Corp",
            "status": "Trial",
            "last_login": "2026-04-09 09:15 UTC"
          }
        ]
      }

  If the vendor ships a new React table that rearranges DOM nodes or renames classes, the same query is designed to keep working. This is the “self‑healing” effect.
- Refine queries live in the AgentQL IDE & Playground
  - Install the browser extension, open your authenticated page, and craft the query with immediate JSON feedback.
  - Copy the working query into your Playwright script.

  This closes the loop between “what you see in the browser” and “what your code receives as JSON.”
- Scale across pages and accounts
  - Reuse the same query against multiple tenant accounts or similar pages.
  - Use concurrency and remote browsers (limits typically expressed in API calls/minute and remote browser hours) to keep throughput predictable.
  - For enterprise needs, you can deploy on‑premise with 24/7 premium support and a dedicated account manager.
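
For the scaling step, one common pattern is to cap how many tenant sessions run at once so you stay under per-minute limits. The sketch below is a generic bounded-concurrency runner in plain JavaScript; `mapWithConcurrency` and `queryTenant` are hypothetical names, and `queryTenant` is only a stand-in for "log in as this tenant and run the query".

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results come back in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function runLane() {
    while (next < items.length) {
      const index = next++; // claim the next item; JS is single-threaded, so no race here
      results[index] = await worker(items[index], index);
    }
  }

  // Start up to `limit` lanes; each lane pulls items until none are left.
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => runLane());
  await Promise.all(lanes);
  return results;
}

// Stand-in for "log in as this tenant and run the AgentQL query" (hypothetical).
async function queryTenant(tenant) {
  return { tenant, customers: [] };
}

mapWithConcurrency(['acme', 'northwind', 'globex'], 2, queryTenant)
  .then((rows) => console.log(rows.map((row) => row.tenant)));
```

If any worker rejects, the returned promise rejects as well. Wrapping each tenant call in its own try/catch is a reasonable choice when one failing account should not abort the whole batch.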
2. Practical setup with Import.io for authenticated flows
Import.io’s model is more hosted and UI‑driven:
- Configure a connector for your target site
  - Log into Import.io’s UI.
  - Add a new extractor/connector for the target hostname (e.g., your SaaS, marketplace, or analytics tool).
- Handle authentication in the platform
  Depending on your plan and target site, this typically means:
  - Providing credentials in Import.io’s interface.
  - Recording a login sequence or defining form fields.
  - Sometimes storing cookies or session data managed by Import.io.

  The platform then tries to preserve that session for future runs.
- Point‑and‑click element selection
  - Navigate to the logged‑in page in Import.io’s browser.
  - Click elements or columns to define the data you want.
  - Import.io generates a schema and underlying selectors based on that interaction.
- Schedule and fetch data
  - Schedule extractions on a cadence (e.g., hourly / daily).
  - Fetch CSV/JSON via their API or download from the dashboard.
When the target site changes, you often return to the UI, fix broken fields, and redeploy.
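
Selector breakage tends to show up as empty or missing fields rather than a hard error, so it helps to check each scheduled run's output before it reaches downstream systems. The sketch below is tool-agnostic and works on rows from any extractor; `findBrokenFields` and the 50% threshold are illustrative assumptions, not features of Import.io or AgentQL.

```javascript
// Report fields that are empty or missing in more than `maxEmptyRatio` of rows.
function findBrokenFields(rows, expectedFields, maxEmptyRatio = 0.5) {
  if (rows.length === 0) return [...expectedFields]; // no rows at all: every field is suspect
  return expectedFields.filter((field) => {
    const emptyCount = rows.filter((row) => {
      const value = row[field];
      return value === undefined || value === null || String(value).trim() === '';
    }).length;
    return emptyCount / rows.length > maxEmptyRatio;
  });
}

// Example: the "status" column stopped coming through after a site change.
const extractedRows = [
  { name: 'Alex Romero', email: 'alex@example.com', status: '' },
  { name: 'Priya Desai', email: 'priya@example.com' },
];
console.log(findBrokenFields(extractedRows, ['name', 'email', 'status'])); // [ 'status' ]
```

Alerting on a non-empty result gives you a signal to revisit the extractor before a silently empty column shows up in a report.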
3. Side‑by‑side: where AgentQL is different for logged‑in flows
| Area | AgentQL | Import.io |
|---|---|---|
| Auth mechanics | You own login via Playwright or cookies; full control over 2FA, SSO, headers, proxies, etc. | Auth is configured in their platform; less control, more dependence on their browser/session handling. |
| Selector strategy | AI analyzes page structure; schema‑first query → JSON; self‑healing across DOM/layout changes. | Generated selectors from point‑and‑click; more fragile when the app’s DOM or CSS shifts. |
| LLM integration | Designed to feed LLM agents: compact JSON, avoids grounding on raw HTML and hallucinations. | Primarily data extraction; LLM use is indirect—you still handle transformation and grounding yourself. |
| Debugging & iteration | AgentQL IDE browser extension + Playground; work directly on live pages you’re authenticated into. | Fixes go through the Import.io UI; less direct control in your own editor/CI. |
| Deployment model | SDKs (Python/JS) with Playwright; REST API for browserless extraction; on‑premise option for compliance. | Fully hosted SaaS platform; some on‑prem/enterprise offerings depending on plan. |
| Reuse across sites | Same query can often be reused across similar layouts (e.g., different tenants, regions, A/B variants). | Scrapers tend to be more site/variant‑specific due to selector fragility. |
Common Mistakes to Avoid
- Treating logged‑in flows like public scraping:
  Authenticated pages have CSRF tokens, anti‑automation, and stricter rate limits. With AgentQL, keep your Playwright login flows robust: handle redirects, wait for network idle, and persist sessions (e.g., `browserContext.storageState()` in Playwright) so you’re not re‑authing every run.
- Grounding LLMs on raw HTML from dashboards:
  This is where teams “crunch reams of HTML” and hit context window limits. Instead, design AgentQL queries that return only the fields you need, such as `{ metrics[] { label value change_percent } }`, and pass that JSON into your model. Users report fewer hallucinations and more predictable grounding when they replace HTML grounding with AgentQL JSON.
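
As a rough illustration of passing structured JSON into a model, the sketch below turns a `metrics` list into a short, bounded prompt. `buildGroundedPrompt`, the prompt wording, and the sample values are assumptions for illustration; the field names come from the query mentioned above.

```javascript
// Build a compact prompt from structured metrics instead of raw HTML.
// `metrics` is the list a `{ metrics[] { label value change_percent } }` query returns.
function buildGroundedPrompt(question, metrics) {
  const facts = metrics
    .map((m) => {
      const sign = m.change_percent >= 0 ? '+' : '';
      return `- ${m.label}: ${m.value} (${sign}${m.change_percent}%)`;
    })
    .join('\n');
  return [
    'Answer using only the metrics below. If the answer is not in them, say so.',
    '',
    'Metrics:',
    facts,
    '',
    `Question: ${question}`,
  ].join('\n');
}

// Sample values are made up for the example.
const prompt = buildGroundedPrompt('Which metric declined?', [
  { label: 'Active users', value: '1,204', change_percent: 3.2 },
  { label: 'Churn', value: '2.1%', change_percent: -0.4 },
]);
console.log(prompt);
```

The prompt grows with the number of metrics rather than with the size of the page, which is the property that keeps context usage predictable.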
Real‑World Example
Imagine you’re building a weekly “usage and billing” report across 30 vendors that only expose the data in logged‑in dashboards:
- With Import.io, you’d create 30 separate connectors, record 30 login flows, and point‑and‑click 30 sets of selectors. When any vendor redesigns their billing page, one or more connectors break, and someone has to go into the Import.io UI and re‑train the extractor.
- With AgentQL, your team maintains a single Playwright repo:
  - Each vendor gets a small “login + navigate to billing page” function.
  - You define an AgentQL query per vendor that encodes the JSON contract you want:

        {
          billing_summary {
            current_plan
            next_invoice_date(include timezone)
            next_invoice_amount(include currency symbol)
            invoices[] {
              date(include timezone)
              amount(include currency)
              status
            }
          }
        }

  - You test the query in the AgentQL IDE on a real session, copy it into code, and deploy.
  - When Vendor X ships a new UI with different classes and nested divs, your Playwright “go to billing page” still works, and AgentQL keeps returning the same `billing_summary` JSON without you touching a CSS selector.
Pro Tip: For multi‑vendor, multi‑tenant dashboards, define a thin interface layer in your code, such as `fetchBillingSummary(client)`, that always returns the same JSON shape, and implement each vendor with its own AgentQL query. This turns brittle UI scraping into a stable API‑like contract that your downstream systems can rely on.
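
One possible shape for that interface layer is sketched below. Everything vendor-specific here is made up for illustration: the `vendorA` and `vendorB` objects, the second vendor's field names, and `client.query`, which stands in for "run this query on the vendor's authenticated billing page and return the JSON". This variant passes the vendor explicitly rather than binding it into `fetchBillingSummary(client)`.

```javascript
// Parse "$1,200.50"-style text into a number. Assumes "," thousands separators
// and "." decimals; other locales would need their own parser.
function parseAmount(text) {
  const cleaned = String(text).replace(/[^0-9.-]/g, '');
  if (cleaned === '' || Number.isNaN(Number(cleaned))) {
    throw new Error(`Unparseable amount: ${text}`);
  }
  return Number(cleaned);
}

// Each vendor maps its own query result onto one shared shape:
//   { plan: string, nextInvoiceAmount: number }
const vendorA = {
  query: '{ billing_summary { current_plan next_invoice_amount(include currency symbol) } }',
  toSummary: (raw) => ({
    plan: raw.billing_summary.current_plan,
    nextInvoiceAmount: parseAmount(raw.billing_summary.next_invoice_amount),
  }),
};

const vendorB = {
  query: '{ subscription { tier amount_due } }',
  toSummary: (raw) => ({
    plan: raw.subscription.tier,
    nextInvoiceAmount: parseAmount(raw.subscription.amount_due),
  }),
};

// `client.query` is a hypothetical stand-in, not an AgentQL SDK method.
async function fetchBillingSummary(vendor, client) {
  return vendor.toSummary(await client.query(vendor.query));
}

// Usage with a fake client that returns canned data.
const fakeClient = {
  query: async () => ({
    billing_summary: { current_plan: 'Pro', next_invoice_amount: '$1,200.50' },
  }),
};
fetchBillingSummary(vendorA, fakeClient).then((summary) => console.log(summary));
```

Keeping normalization (`toSummary`) separate from fetching means the shared contract can be unit-tested against saved query results without opening a browser.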
Summary
For authenticated scraping and logged‑in flows, the crucial decision isn’t “which tool can log in?” but “which setup stays stable as the UI evolves and my LLM/data pipelines grow?”
- Import.io offers a hosted, point‑and‑click experience that works well for one‑off or smaller scale extraction, but you’re still fundamentally tied to fragile selectors and a UI‑driven maintenance loop.
- AgentQL plugs into your existing Playwright or REST stack, lets you handle authentication however you want, and replaces XPath/CSS with schema‑first, AI‑powered queries that produce self‑healing, LLM‑ready JSON—even as dashboards and DOM structures change.
If you think of your logged‑in scraping as an API contract—query → JSON—AgentQL is closer to the mental model you want for durable, production‑grade workflows.