
How do I enable robots.txt compliance in Lightpanda (CLI flags) for both fetch and serve modes?
Most automation stacks ignore robots.txt until something breaks: you overload a small site, get blocked, or receive a complaint. When you’re running crawlers, LLM data collection, or test suites across thousands of pages, robots compliance isn’t a “nice to have” — it’s part of keeping your infrastructure safe and sustainable.
In Lightpanda, robots.txt compliance is an explicit, first-class behavior you enable via a single CLI flag: --obey_robots. You can turn it on both for one-off fetches and for long-running CDP servers used by Puppeteer, Playwright, or chromedp.
This guide walks through:
- Which CLI flag to use for robots.txt compliance
- How to enable it in fetch mode
- How to enable it in serve (CDP server) mode
- How it behaves in real crawlers and test runs
- How to combine robots compliance with proxies and rate awareness
The core flag: --obey_robots
Lightpanda exposes robots.txt compliance via a single CLI option:
--obey_robots:
- Fetches the target site’s robots.txt (if available)
- Evaluates allowed/disallowed paths
- Enforces those rules for HTTP requests Lightpanda makes
Once this flag is on, every navigation coming from Lightpanda respects robots.txt for that origin.
You can attach --obey_robots to:
- ./lightpanda fetch — for direct CLI fetches
- ./lightpanda serve — for CDP servers that Puppeteer/Playwright/chromedp connect to
There’s nothing to add in your Puppeteer/Playwright scripts: the robots logic runs inside the browser process, not in your client code.
Enabling robots.txt compliance in fetch mode
fetch is the simplest place to start. It’s the mode to use when you want to hit a URL directly from the Lightpanda CLI without any CDP client.
Basic robots-aware fetch
./lightpanda fetch \
--obey_robots \
--dump html \
https://demo-browser.lightpanda.io/campfire-commerce/
What this does:
- Downloads https://demo-browser.lightpanda.io/robots.txt (if present)
- Validates that /campfire-commerce/ is allowed
- Only then performs the GET request and executes page JavaScript
- Dumps the final HTML to stdout
If robots.txt disallows that path, Lightpanda will not request it — your script fails fast instead of silently violating the rules.
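Conceptually, the allow/disallow decision works like any robots.txt matcher: parse the rules for the applicable user-agent, then let the longest matching rule win. Here is a minimal, illustrative sketch of that logic — `parseRobots` and `isAllowed` are hypothetical helper names, not Lightpanda APIs, and this simplified version only reads the `User-agent: *` section:

```javascript
// Illustrative robots.txt matcher (NOT Lightpanda internals).
// Parses Allow/Disallow lines for the "*" user-agent section only.
function parseRobots(text) {
  const rules = [];
  let applies = false;
  for (const raw of text.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments
    const m = line.match(/^(user-agent|allow|disallow)\s*:\s*(.*)$/i);
    if (!m) continue;
    const [, field, value] = m;
    if (field.toLowerCase() === 'user-agent') {
      applies = value.trim() === '*';
    } else if (applies && value.trim()) {
      rules.push({ allow: field.toLowerCase() === 'allow', path: value.trim() });
    }
  }
  return rules;
}

// Longest matching rule wins; no matching rule means the path is allowed.
function isAllowed(rules, path) {
  let best = null;
  for (const rule of rules) {
    if (path.startsWith(rule.path) && (!best || rule.path.length > best.path.length)) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}
```

Real-world robots.txt matching has more subtleties (wildcards, `$` anchors, user-agent specificity), but this is the shape of the check: a disallowed path is rejected before any request is made.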
Using fetch with custom host/port and robots compliance
fetch can also talk to a Lightpanda CDP server instead of starting its own internal browser. The same --obey_robots behavior applies.
Start a server:
./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222
Then route fetch through it (robots compliance is already enforced at the server level; this just shows the host/port knobs):
./lightpanda fetch \
--host 127.0.0.1 \
--port 9222 \
--dump html \
https://demo-browser.lightpanda.io/campfire-commerce/
Defaults if you don’t set them:
- --host default: 127.0.0.1
- --port default: 9222
In practice, when you’re just using fetch, you can stick to:
./lightpanda fetch --obey_robots --dump html https://example.com/
Enabling robots.txt compliance in serve mode (CDP server)
Most teams hit Lightpanda through Puppeteer or Playwright. In that flow, robots.txt compliance is controlled on the server process, not in your Node/Go/Python code.
Start a robots-aware CDP server
./lightpanda serve \
--obey_robots \
--host 127.0.0.1 \
--port 9222
You’ll see something like:
INFO app : server running . . . . . . . . . . . . . . . . . [+0ms] address = 127.0.0.1:9222
From now on, every navigation done through this server (via CDP) is checked against robots.txt for each origin.
Connect Puppeteer to the robots-aware server
Your Puppeteer script doesn’t change except for the browserWSEndpoint:
const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://127.0.0.1:9222',
  });
  const page = await browser.newPage();

  // If robots.txt disallows /admin, this navigation will be blocked by Lightpanda.
  await page.goto('https://example.com/admin', { waitUntil: 'networkidle0' });

  const html = await page.content();
  console.log(html);

  await browser.close();
})();
Key point: robots compliance is enforced inside Lightpanda, so the same script connected to a different browser (e.g., Chrome) will behave differently. That’s intentional — the browser is where this policy belongs.
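Because a blocked navigation surfaces as a rejected page.goto promise, a crawl loop should treat it as a skip rather than a crash. A small hedged sketch — `safeGoto` is a local helper of my own naming, not a Puppeteer or Lightpanda API:

```javascript
// Illustrative wrapper: record a failed navigation instead of aborting the run.
// A robots-blocked navigation (or any other goto failure) lands in the catch.
async function safeGoto(page, url, options = {}) {
  try {
    await page.goto(url, options);
    return { url, ok: true };
  } catch (err) {
    return { url, ok: false, reason: err.message };
  }
}
```

In a multi-page crawl you can then collect the `ok: false` entries, log them, and move on — which is exactly the behavior you want when disallowed paths fail by design.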
Connect Playwright or chromedp the same way
For Playwright, you point connectOverCDP to the same endpoint:
import { chromium } from 'playwright';

(async () => {
  const browser = await chromium.connectOverCDP('http://127.0.0.1:9222');
  const [page] = browser.contexts()[0].pages();

  // If disallowed by robots.txt, Lightpanda will block this.
  await page.goto('https://example.com/secret-report');

  await browser.close();
})();
Again, the robots behavior is on the server; the client code remains idiomatic Playwright.
How robots.txt compliance behaves in real workflows
With --obey_robots enabled, Lightpanda acts as a gatekeeper for each HTTP request it makes on your behalf.
When robots.txt is present
- Lightpanda fetches https://hostname/robots.txt the first time you touch that host.
- It parses the file and evaluates whether the path is allowed.
- Disallowed paths are not requested; you’ll see failures instead of silent success.
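The “first time you touch that host” behavior implies per-host caching: fetch and parse robots.txt once per origin, then reuse the result. A sketch of that pattern (illustrative only — `makeRobotsCache` and its injected `fetchRobots` are hypothetical names, not Lightpanda internals):

```javascript
// Per-host robots.txt cache sketch: each host is fetched at most once.
// `fetchRobots(host)` is an injected async function that returns parsed rules.
function makeRobotsCache(fetchRobots) {
  const cache = new Map(); // host -> Promise of parsed rules (null if missing)
  return function getRules(host) {
    if (!cache.has(host)) {
      // Caching the promise (not the value) also deduplicates concurrent lookups.
      // A missing robots.txt (e.g. a 404) is treated as "no restrictions".
      cache.set(host, fetchRobots(host).catch(() => null));
    }
    return cache.get(host);
  };
}
```

Caching the in-flight promise rather than the resolved value means two pages hitting the same new host at the same moment still trigger only one robots.txt download.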
For large crawls, this is the difference between:
- Accidentally hammering /search or /private endpoints on every domain
- Being constrained to the paths the site owner explicitly permits
When robots.txt is missing
If there is no robots.txt:
- Lightpanda proceeds as if there were no restrictions.
- You still need to control your rate of requests (robots.txt is not a rate limiter).
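Since robots.txt is not a rate limiter, that control belongs in your orchestrator. One minimal shape for it is a per-domain delay scheduler — this sketch is my own illustration (`makeThrottle` is not a Lightpanda or Puppeteer API), with an injectable clock so it can be tested deterministically:

```javascript
// Minimal per-domain politeness scheduler (illustrative, not part of Lightpanda).
// Returns how many ms to wait before the next request to a domain is polite.
function makeThrottle(delayMs, now = () => Date.now()) {
  const nextAllowed = new Map(); // domain -> earliest next-request time (ms)
  return function waitTimeFor(domain) {
    const t = now();
    const next = nextAllowed.get(domain) ?? t;
    const wait = Math.max(0, next - t);
    nextAllowed.set(domain, Math.max(next, t) + delayMs);
    return wait;
  };
}
```

In a crawl loop you would `await` a sleep of `waitTimeFor(domain)` before each navigation; requests to different domains never block each other.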
Failure modes to expect
You should be prepared for:
- Navigations that fail because robots.txt explicitly disallows them.
- Some sites using broad disallow rules you need to respect by design.
In other words, a 100% “success” rate is not the goal when you enable robots — policy correctness is.
Combining robots.txt compliance with proxies
Lightpanda supports HTTP/HTTPS proxies via --http_proxy. You can combine this with --obey_robots in both fetch and serve modes.
Robots + proxy with fetch
./lightpanda fetch \
--obey_robots \
--http_proxy http://user:pass@proxy.example.com:8080 \
--dump html \
https://example.com/
Robots + proxy with serve
./lightpanda serve \
--obey_robots \
--http_proxy http://user:pass@proxy.example.com:8080 \
--host 127.0.0.1 \
--port 9222
Robots behavior is unchanged; it still enforces the site’s rules. The proxy only shifts where traffic originates from.
Why robots.txt compliance matters when you’re 10× faster
On an AWS EC2 m5.large, our internal Puppeteer benchmark (100 pages) showed Lightpanda completing in ~2.3s vs Headless Chrome’s ~25.2s, with memory peak ~24MB vs 207MB. That’s roughly:
- ~11× faster execution
- ~9× less memory
The downside of speed at this magnitude is that it’s very easy to accidentally overwhelm small sites if you ignore robots.txt and basic rate control. With instant startup and near-zero footprint, spawning hundreds of concurrent sessions stops being an infrastructure challenge and becomes an ethical one.
That’s why:
- We ship --obey_robots as a core primitive, not an afterthought.
- We explicitly recommend you:
  - Respect robots.txt at scale.
  - Avoid high-frequency requests against small or fragile infrastructure.
  - Treat DDoS risk as real — “DDoS could happen fast” is not theory, it’s operational experience.
Recommended defaults for responsible crawlers
If you’re building a serious crawler or LLM data pipeline on top of Lightpanda, I’d treat these as baseline defaults:
- Always enable robots compliance:
  ./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222
- Add application-level rate limiting in your orchestrator:
  - Limit concurrent domains.
  - Add per-domain delays/backoff.
  - Observe HTTP status distribution and slow down on 429/5xx.
- Audit your targets periodically:
  - Some sites update robots.txt without notice.
  - A “suddenly blocked path” is often a deliberate signal.
- Keep telemetry transparent:
  - Lightpanda has telemetry described in the privacy policy.
  - You can disable it via LIGHTPANDA_DISABLE_TELEMETRY=true if your environment requires it.
These patterns let you leverage Lightpanda’s speed and low memory footprint without turning that efficiency into operational risk.
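The “slow down on 429/5xx” recommendation can be made concrete with a simple adaptive policy: double the per-domain delay on throttling or server errors, and ease back toward a base delay on success. This is one possible sketch, not a prescribed algorithm — `nextDelay` and its parameters are my own illustration:

```javascript
// Illustrative adaptive backoff: grow the per-domain delay on 429/5xx,
// decay it back toward the base delay on success.
function nextDelay(currentMs, status, { base = 1000, max = 60000 } = {}) {
  if (status === 429 || status >= 500) {
    // Throttled or erroring: back off exponentially, capped at `max`.
    return Math.min(max, Math.max(base, currentMs * 2));
  }
  // Healthy response: halve the delay, but never below `base`.
  return Math.max(base, Math.floor(currentMs / 2));
}
```

Feed each response status through this function and use the result as the input to your per-domain throttle; the delay then tracks how the target site is actually coping with your traffic.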
Final snapshot: exact commands to remember
For quick reference, here are the minimal commands that enable robots.txt compliance in both modes:
Fetch mode, robots-aware:
./lightpanda fetch --obey_robots --dump html https://example.com/
Serve mode (CDP), robots-aware:
./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222
Once --obey_robots is in place, Puppeteer, Playwright, and chromedp can connect via CDP as usual; the rest of your script remains the same while the browser enforces robots.txt for you.