
How do I enable robots.txt compliance in Lightpanda (CLI flags) for both fetch and serve modes?
Most teams only think about robots.txt once they’re already in trouble—IPs blocked, 403s everywhere, and a pager going off because “the scraper is down.” In practice, enabling robots.txt compliance at the browser layer is the simplest way to keep large-scale automation safe and predictable.
Lightpanda bakes this directly into the CLI via a single flag: --obey_robots. You can enable it both when you run one-shot fetches and when you expose Lightpanda as a CDP server for Puppeteer, Playwright, or chromedp.
This guide shows exactly how to enable robots.txt compliance in Lightpanda for:
- fetch mode (CLI-only crawling/scraping)
- serve mode (CDP server controlled by your existing tooling)
- Common patterns and troubleshooting for large crawls
Why robots.txt compliance matters at cloud scale
When you’re hitting millions of pages per day, you don’t want every script to reinvent robots.txt logic. You want the browser itself to:
- Fetch and interpret robots.txt for you
- Block disallowed paths before the HTTP request goes out
- Make behavior consistent across CLI usage, agents, and test suites
In Lightpanda, this is done with one primitive:
--obey_robots
Once enabled, any request Lightpanda makes to a host will:
- Fetch https://<host>/robots.txt (if available).
- Apply the rules to subsequent requests to that host.
- Refuse requests to disallowed paths.
This works the same way in both fetch and serve modes.
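To make those three steps concrete, here is a minimal sketch of what evaluating robots.txt rules involves. It is not Lightpanda’s implementation: parseRobots and isAllowed are hypothetical helpers, user-agent matching is reduced to an exact token comparison, and paths are matched by plain prefix (no * or $ wildcards). The "longest match wins, Allow beats Disallow on a tie" rule follows the Robots Exclusion Protocol.

```javascript
// Hypothetical helpers for illustration; not Lightpanda's internal code.
// Simplified: exact user-agent tokens, prefix matching only (no * or $).

// Parse robots.txt text into the Allow/Disallow rules that apply to `agent`,
// falling back to the "*" group when no group names the agent.
function parseRobots(text, agent = "*") {
  const groups = new Map(); // user-agent token -> [{ allow, path }]
  let current = [];         // user-agents of the group being read
  let inRules = false;      // true once the current group has rule lines

  for (const raw of text.split(/\r?\n/)) {
    const line = raw.split("#")[0].trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (key === "user-agent") {
      if (inRules) { current = []; inRules = false; } // a new group starts
      const ua = value.toLowerCase();
      current.push(ua);
      if (!groups.has(ua)) groups.set(ua, []);
    } else if (key === "allow" || key === "disallow") {
      inRules = true;
      if (value === "") continue; // an empty rule restricts nothing
      for (const ua of current) {
        groups.get(ua).push({ allow: key === "allow", path: value });
      }
    }
  }
  return groups.get(agent.toLowerCase()) ?? groups.get("*") ?? [];
}

// Longest matching prefix wins; on a tie, Allow beats Disallow.
function isAllowed(rules, path) {
  let best = { allow: true, path: "" }; // default: allowed
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    if (rule.path.length > best.path.length ||
        (rule.path.length === best.path.length && rule.allow)) {
      best = rule;
    }
  }
  return best.allow;
}

const rules = parseRobots("User-agent: *\nDisallow: /private/\n");
console.log(isAllowed(rules, "/private/report")); // false
console.log(isAllowed(rules, "/products/1"));     // true
```

With --obey_robots, this kind of check happens inside the browser, so none of it has to live in your scripts.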
Enabling robots.txt compliance for fetch mode
fetch is the fastest way to turn Lightpanda into a simple, compliant crawler from the CLI. It runs the browser, executes JavaScript, and returns the final HTML (or other formats) without needing CDP clients.
Basic example: robots.txt-compliant fetch
./lightpanda fetch \
--obey_robots \
--dump html \
https://demo-browser.lightpanda.io/campfire-commerce/
What happens here:
- --obey_robots tells Lightpanda to fetch and obey robots.txt for demo-browser.lightpanda.io.
- --dump html returns the fully rendered HTML after JS execution.
- If robots.txt disallows this path, Lightpanda will block the request instead of hitting the page.
Recommended pattern for CLI crawling
When you’re looping through URLs from a file or orchestrator, keep the flag as part of your standard command:
cat urls.txt | xargs -n 1 -P 8 ./lightpanda fetch \
--obey_robots \
--dump html
- The -P 8 option runs up to 8 concurrent fetches.
- Each Lightpanda process respects robots.txt for each target host.
This is the simplest way to enforce responsible crawling from day one.
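If you would rather drive the same pattern from Node (for example, to keep each URL’s output separate, since parallel xargs processes share one stdout and their output can interleave), a small worker pool is enough. The sketch below is one possible shape, not an official Lightpanda client: runWithLimit, fetchArgs, and fetchHtml are hypothetical names, and it assumes the lightpanda binary sits in the working directory.

```javascript
// Build the argv for one robots-aware fetch, using the flags from this guide.
function fetchArgs(url) {
  return ["fetch", "--obey_robots", "--dump", "html", url];
}

// Run `worker` over `items` with at most `limit` calls in flight.
// Each result is { ok: true, value } or { ok: false, error }, in input order.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await worker(items[i]) };
      } catch (error) {
        results[i] = { ok: false, error };
      }
    }
  }
  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}

// Spawn one Lightpanda process for `url` and resolve with its stdout.
// ASSUMPTION: the binary lives at ./lightpanda; adjust `bin` as needed.
async function fetchHtml(url, bin = "./lightpanda") {
  const { execFile } = await import("node:child_process");
  return new Promise((resolve, reject) => {
    execFile(bin, fetchArgs(url), { maxBuffer: 64 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(stdout)));
  });
}

// Usage, mirroring `xargs -P 8`:
// const pages = await runWithLimit(urls, 8, (url) => fetchHtml(url));
```

Failures are collected per URL rather than aborting the batch, so one unreachable or blocked URL doesn’t hide the results of the others.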
Enabling robots.txt compliance for serve mode (CDP)
Most teams will use Lightpanda as a CDP server and keep their existing automation stack: Puppeteer, Playwright, or chromedp. In that setup, the robots.txt decision still lives in the browser, not in the CDP client.
Start a CDP server with robots.txt enabled
Run Lightpanda in serve mode:
./lightpanda serve \
--obey_robots \
--host 127.0.0.1 \
--port 9222
Defaults:
- --host default: 127.0.0.1
- --port default: 9222
Adding --obey_robots means:
- All navigation and requests triggered via CDP (Puppeteer, Playwright, chromedp, custom agents) will respect robots.txt.
- You don’t have to re-implement robots logic in each framework; it’s enforced once at the browser boundary.
Connect from Puppeteer
import puppeteer from "puppeteer-core";
const browser = await puppeteer.connect({
browserWSEndpoint: "ws://127.0.0.1:9222",
});
const page = await browser.newPage();
await page.goto("https://demo-browser.lightpanda.io/campfire-commerce/");
// If robots.txt disallows this URL, Lightpanda will block at the browser layer.
await browser.close();
You don’t need any code changes for robots.txt in Puppeteer. The same script will behave differently depending on whether the Lightpanda server was started with --obey_robots.
Connect from Playwright
import { chromium } from "playwright";
const browser = await chromium.connectOverCDP("ws://127.0.0.1:9222");
const page = await browser.newPage();
await page.goto("https://demo-browser.lightpanda.io/campfire-commerce/");
// Robots rules enforced by the Lightpanda server.
await browser.close();
Again, Playwright doesn’t have to know about robots.txt. The enforcement is done by Lightpanda.
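Because enforcement lives in the browser, your script mainly needs to cope with a navigation that doesn’t succeed. How a robots.txt block surfaces to the CDP client isn’t described here, so the sketch below assumes it shows up as a rejected page.goto() promise; verify that against your Lightpanda version. gotoOrSkip is a hypothetical helper that works with any object exposing an async goto(url), such as a Puppeteer or Playwright page.

```javascript
// Hypothetical helper: navigate, and report failure instead of throwing.
async function gotoOrSkip(page, url) {
  try {
    const response = await page.goto(url);
    // goto() can resolve to null (e.g. same-document navigation), so guard it.
    return { url, ok: true, status: response ? response.status() : null };
  } catch (err) {
    // ASSUMPTION: a robots.txt block rejects goto(). Other failures
    // (DNS errors, timeouts) land here too, so keep the message for triage.
    return { url, ok: false, reason: err.message };
  }
}

// Usage with the page object from either example above:
// const result = await gotoOrSkip(page, "https://demo-browser.lightpanda.io/campfire-commerce/");
// if (!result.ok) console.warn("skipped", result.url, result.reason);
```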
How Lightpanda’s robots.txt compliance behaves
When --obey_robots is set (in either fetch or serve):
- Robots fetched automatically: For each new host, Lightpanda will fetch /robots.txt once and cache the rules.
- Disallowed paths blocked: If your script tries to access a disallowed path, the browser will prevent that request rather than hitting the page.
- Applies to all request types: Navigations, XHR/fetch calls, and other HTTP calls triggered by the page or your CDP script go through the same policy.
If a site doesn’t expose robots.txt, Lightpanda behaves like a normal browser and proceeds with requests. You should still avoid aggressive concurrency to prevent accidental overload.
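The "fetch once and cache" behavior can be pictured as a small per-host cache. The sketch below is an illustration under assumptions, not Lightpanda’s code: createRobotsCache is a hypothetical name, loadRules stands in for whatever downloads and parses a robots.txt URL, and the cache is keyed by origin (scheme, host, and port) rather than by bare hostname.

```javascript
// Hypothetical per-origin cache. `loadRules` is any async function that
// takes a robots.txt URL and resolves to parsed rules.
function createRobotsCache(loadRules) {
  const byOrigin = new Map(); // origin -> Promise of rules
  return function rulesFor(targetUrl) {
    const origin = new URL(targetUrl).origin;
    if (!byOrigin.has(origin)) {
      // Cache the promise itself so concurrent requests share one download.
      // A failed load stays cached in this sketch; a real cache would
      // likely retry or expire entries.
      byOrigin.set(origin, loadRules(new URL("/robots.txt", origin).href));
    }
    return byOrigin.get(origin);
  };
}

// Usage: a loader that treats every site as "no rules" (allow everything),
// which is the behavior described above for a missing robots.txt.
// const rulesFor = createRobotsCache(async (robotsUrl) => []);
// const rules = await rulesFor("https://demo-browser.lightpanda.io/campfire-commerce/");
```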
Combining robots.txt compliance with proxies
In real-world scraping and testing, you’ll often layer proxies on top. Lightpanda supports HTTP/HTTPS proxies via the --http_proxy option; you can combine this with --obey_robots.
Example: fetch mode with proxy + robots.txt
./lightpanda fetch \
--obey_robots \
--http_proxy http://user:pass@my-proxy:8080 \
--dump html \
https://target-site.example/
Example: serve mode with proxy + robots.txt
./lightpanda serve \
--obey_robots \
--http_proxy http://user:pass@my-proxy:8080 \
--host 127.0.0.1 \
--port 9222
Robots.txt is still evaluated based on the target host; the proxy is just the transport layer.
Recommended practices for large crawls and agents
Even with robots.txt compliance enabled, you’re still responsible for not behaving like a DDoS.
A few pragmatic guidelines:
- Keep --obey_robots on by default for any scraping, training, or testing that touches the public web.
- Control concurrency at your orchestrator level (queue, worker count, batch size). Lightpanda is fast enough that “too many” concurrent sessions can hurt small sites quickly.
- Respect crawl delays if the target’s robots.txt or docs ask for them; add sleeps/backoff in your job runner.
- Segment environments: Run different Lightpanda instances (with --obey_robots) for staging vs production vs experimentation, so misconfigurations don’t leak.
When you’re operating at “millions of pages a day,” small details like robots compliance and concurrency caps are the difference between a smooth run and an emergency firewall rule.
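For the crawl-delay guideline, one way to add spacing in a job runner is a per-host scheduler that tells each job how long to sleep before handing its URL to Lightpanda. This is a sketch under assumptions: crawlDelayMs and createHostThrottle are hypothetical helpers, Crawl-delay is a non-standard robots.txt directive that not every site uses, and the 1000 ms fallback is an arbitrary placeholder.

```javascript
// Read a Crawl-delay value (in seconds) from robots.txt text and return
// milliseconds. Simplified: takes the first directive found and ignores
// which User-agent group it belongs to.
function crawlDelayMs(robotsText, fallbackMs = 1000) {
  const match = /^\s*crawl-delay\s*:\s*(\d+(?:\.\d+)?)/im.exec(robotsText);
  return match ? Number(match[1]) * 1000 : fallbackMs;
}

// Per-host scheduler: reserve(host, delayMs) returns how long the caller
// should sleep so requests to the same host start at least delayMs apart.
function createHostThrottle(now = () => Date.now()) {
  const nextSlot = new Map(); // host -> earliest start time for the next request
  return function reserve(host, delayMs) {
    const t = now();
    const start = Math.max(t, nextSlot.get(host) ?? 0);
    nextSlot.set(host, start + delayMs);
    return start - t;
  };
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside a job runner:
// const reserve = createHostThrottle();
// await sleep(reserve(new URL(url).hostname, delayMs));
// ...then pass `url` to Lightpanda.
```

Different hosts never wait on each other here; only repeated requests to the same host are spaced out.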
Troubleshooting robots.txt behavior in Lightpanda
If you’re unsure whether robots.txt compliance is active:
- Check the startup command.
  - No --obey_robots on fetch? It’s not enabled.
  - No --obey_robots on serve? Your CDP clients are running without robots enforcement.
- Run a simple fetch test:
  ./lightpanda fetch --obey_robots --dump html https://example.com/disallowed-path
  Compare behavior with and without --obey_robots to confirm blocking.
- Verify from the CDP side. Start Lightpanda with --obey_robots, connect with Puppeteer/Playwright, and try hitting known-disallowed paths. If navigation suddenly starts failing where it used to succeed, robots.txt is doing its job.
- Check your proxies and DNS. If you use proxies, confirm that robots.txt is reachable through them. A misconfigured proxy that can’t reach /robots.txt may affect behavior depending on how the network fails.
Summary: one flag, both modes
To enable robots.txt compliance in Lightpanda:
- For fetch mode:
  ./lightpanda fetch --obey_robots --dump html https://example.com/
- For serve mode (CDP server):
  ./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222
Once you standardize on --obey_robots in these entrypoints, all your CLI jobs, crawlers, and CDP-driven agents inherit the same robots.txt policy automatically.