How do I enable robots.txt compliance in Lightpanda (CLI flags) for both fetch and serve modes?


Most automation stacks ignore robots.txt until something breaks: you overload a small site, get blocked, or receive a complaint. When you’re running crawlers, LLM data collection, or test suites across thousands of pages, robots compliance isn’t a “nice to have” — it’s part of keeping your infrastructure safe and sustainable.

In Lightpanda, robots.txt compliance is an explicit, first-class behavior you enable via a single CLI flag: --obey_robots. You can turn it on both for one-off fetches and for long-running CDP servers used by Puppeteer, Playwright, or chromedp.

This guide walks through:

  • Which CLI flag to use for robots.txt compliance
  • How to enable it in fetch mode
  • How to enable it in serve (CDP server) mode
  • How it behaves in real crawlers and test runs
  • How to combine robots compliance with proxies and rate awareness

The core flag: --obey_robots

Lightpanda exposes robots.txt compliance via a single CLI option:

  • --obey_robots:
    • Fetches the target site’s robots.txt (if available)
    • Evaluates allowed/disallowed paths
    • Enforces those rules for HTTP requests Lightpanda makes

Once this flag is on, every navigation coming from Lightpanda respects robots.txt for that origin.

You can attach --obey_robots to:

  • ./lightpanda fetch — for direct CLI fetches
  • ./lightpanda serve — for CDP servers that Puppeteer/Playwright/chromedp connect to

There’s nothing to add in your Puppeteer/Playwright scripts: the robots logic runs inside the browser process, not in your client code.
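The actual robots.txt evaluation lives inside Lightpanda, so you never implement it yourself. Still, it helps to know roughly what "evaluates allowed/disallowed paths" means. Here is a minimal, illustrative sketch of the matching semantics (longest matching rule wins, per RFC 9309); the function names are ours, not Lightpanda's API, and this simplified version only considers the `User-agent: *` group:

```javascript
// Illustrative sketch of robots.txt path matching (NOT Lightpanda's actual code).
// Collects Allow/Disallow rules for the "*" user-agent group, then applies
// longest-rule-wins semantics as described in RFC 9309.
function parseRobots(txt) {
  const rules = [];
  let applies = false;
  for (const raw of txt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === 'user-agent') applies = value === '*'; // simplification: "*" group only
    else if (applies && (key === 'allow' || key === 'disallow') && value)
      rules.push({ allow: key === 'allow', path: value });
  }
  return rules;
}

function isAllowed(rules, path) {
  let best = null;
  for (const r of rules) {
    if (path.startsWith(r.path) && (!best || r.path.length > best.path.length))
      best = r; // the most specific (longest) matching rule wins
  }
  return best ? best.allow : true; // no matching rule means allowed
}
```

For example, with `Disallow: /admin` plus `Allow: /admin/public`, a request to `/admin/settings` is refused while `/admin/public/report` goes through, because the longer `Allow` rule wins for the latter.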


Enabling robots.txt compliance in fetch mode

fetch is the simplest place to start. It’s the subcommand you use when you want to hit a URL directly from Lightpanda, with no CDP clients involved.

Basic robots-aware fetch

./lightpanda fetch \
  --obey_robots \
  --dump html \
  https://demo-browser.lightpanda.io/campfire-commerce/

What this does:

  • Downloads https://demo-browser.lightpanda.io/robots.txt (if present)
  • Validates that /campfire-commerce/ is allowed
  • Only then performs the GET request and executes page JavaScript
  • Dumps the final HTML to stdout

If robots.txt disallows that path, Lightpanda will not request it — your script fails fast instead of silently violating the rules.

Using fetch with custom host/port and robots compliance

fetch can also talk to a Lightpanda CDP server instead of starting its own internal browser. The same --obey_robots behavior applies.

Start a server:

./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222

Then route fetch through it (robots.txt is already enforced at the server level; this example just shows the host/port knobs):

./lightpanda fetch \
  --host 127.0.0.1 \
  --port 9222 \
  --dump html \
  https://demo-browser.lightpanda.io/campfire-commerce/

Defaults if you don’t set them:

  • --host default: 127.0.0.1
  • --port default: 9222

In practice, when you’re just using fetch, you can stick to:

./lightpanda fetch --obey_robots --dump html https://example.com/

Enabling robots.txt compliance in serve mode (CDP server)

Most teams hit Lightpanda through Puppeteer or Playwright. In that flow, robots.txt compliance is controlled on the server process, not in your Node/Go/Python code.

Start a robots-aware CDP server

./lightpanda serve \
  --obey_robots \
  --host 127.0.0.1 \
  --port 9222

You’ll see something like:

INFO  app : server running . . . . . . . . . . . . . . . . .  [+0ms] address = 127.0.0.1:9222

From now on, every navigation done through this server (via CDP) is checked against robots.txt for each origin.

Connect Puppeteer to the robots-aware server

Your Puppeteer script doesn’t change except for the browserWSEndpoint:

const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://127.0.0.1:9222',
  });

  const page = await browser.newPage();

  // If robots.txt disallows /admin, this navigation will be blocked by Lightpanda.
  await page.goto('https://example.com/admin', { waitUntil: 'networkidle0' });

  const html = await page.content();
  console.log(html);

  await browser.close();
})();

Key point: robots compliance is enforced inside Lightpanda, so the same script connected to a different browser (e.g., Chrome) will behave differently. That’s intentional — the browser is where this policy belongs.
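Because a disallowed path surfaces as a failed page.goto, it pays to separate policy blocks from genuine crashes in your own code rather than let the rejection bubble up. A sketch of one way to do this; the { ok, url, error } shape is our own convention, not a Puppeteer API, and Lightpanda’s exact error text may vary:

```javascript
// Sketch: wrap page.goto so a blocked or failed navigation becomes a
// structured result instead of an unhandled promise rejection.
async function tryGoto(page, url, options = { waitUntil: 'networkidle0' }) {
  try {
    await page.goto(url, options);
    return { ok: true, url };
  } catch (err) {
    // Robots-blocked navigations land here alongside network failures.
    return { ok: false, url, error: String(err.message || err) };
  }
}
```

In a crawler loop, log and skip the `!ok` results rather than retrying them: a robots block is a deliberate signal from the site owner, not a transient error.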

Connect Playwright or chromedp the same way

For Playwright, you point connectOverCDP to the same endpoint:

import { chromium } from 'playwright';

(async () => {
  const browser = await chromium.connectOverCDP('http://127.0.0.1:9222');
  const context = browser.contexts()[0];
  const page = context.pages()[0] ?? (await context.newPage());

  await page.goto('https://example.com/secret-report');
  // If disallowed by robots.txt, Lightpanda will block this.

  await browser.close();
})();

Again, the robots behavior is on the server; the client code remains idiomatic Playwright.


How robots.txt compliance behaves in real workflows

With --obey_robots enabled, Lightpanda acts as a gatekeeper for each HTTP request it makes on your behalf.

When robots.txt is present

  • Lightpanda fetches https://hostname/robots.txt the first time you touch that host.
  • It parses the file and evaluates if the path is allowed.
  • Disallowed paths are not requested; you’ll see failures instead of silent success.

For large crawls, this is the difference between:

  • Accidentally hammering /search or /private endpoints on every domain
  • Being constrained to the paths the site owner explicitly permits

When robots.txt is missing

If there is no robots.txt:

  • Lightpanda proceeds as if there were no restrictions.
  • You still need to control your rate of requests (robots.txt is not a rate limiter).
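Since robots.txt says nothing about request frequency, pairing --obey_robots with a per-domain delay in your orchestrator is a good habit. A minimal sketch; the interval value is an assumption to tune per target, and none of this is a Lightpanda feature:

```javascript
// Sketch: enforce a minimum delay between requests to the same host.
// Lives in your orchestrator code, in front of whatever drives the browser.
class PerDomainThrottle {
  constructor(minIntervalMs = 1000) {
    this.minIntervalMs = minIntervalMs;
    this.lastRequest = new Map(); // hostname -> timestamp of the last slot handed out
  }

  // Resolves once it is polite to hit `url`'s host again.
  async wait(url) {
    const host = new URL(url).hostname;
    const now = Date.now();
    const last = this.lastRequest.get(host) ?? 0;
    const delay = Math.max(0, last + this.minIntervalMs - now);
    this.lastRequest.set(host, now + delay); // reserve the next slot
    if (delay > 0) await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

Call `await throttle.wait(url)` before each navigation; requests to different hosts proceed immediately, while same-host requests queue up at the configured interval.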

Failure modes to expect

You should be prepared for:

  • Navigations that fail because robots.txt explicitly disallows them.
  • Some sites using broad disallow rules you need to respect by design.

In other words, a 100% “success” rate is not the goal when you enable robots — policy correctness is.


Combining robots.txt compliance with proxies

Lightpanda supports HTTP/HTTPS proxies via --http_proxy. You can combine this with --obey_robots in both fetch and serve modes.

Robots + proxy with fetch

./lightpanda fetch \
  --obey_robots \
  --http_proxy http://user:pass@proxy.example.com:8080 \
  --dump html \
  https://example.com/

Robots + proxy with serve

./lightpanda serve \
  --obey_robots \
  --http_proxy http://user:pass@proxy.example.com:8080 \
  --host 127.0.0.1 \
  --port 9222

Robots behavior is unchanged; it still enforces the site’s rules. The proxy only shifts where traffic originates from.


Why robots.txt compliance matters when you’re 10× faster

On an AWS EC2 m5.large, our internal Puppeteer benchmark (100 pages) showed Lightpanda completing in ~2.3s vs Headless Chrome’s ~25.2s, with memory peak ~24MB vs 207MB. That’s roughly:

  • ~11× faster execution
  • ~9× less memory

The downside of speed at this magnitude is that it’s very easy to accidentally overwhelm small sites if you ignore robots.txt and basic rate control. With instant startup and near-zero footprint, spawning hundreds of concurrent sessions stops being an infrastructure challenge and becomes an ethical one.

That’s why:

  • We ship --obey_robots as a core primitive, not an afterthought.
  • We explicitly recommend you:
    • Respect robots.txt at scale.
    • Avoid high-frequency requesting against small or fragile infrastructures.
    • Treat DDoS risk as real: at these speeds, accidentally overwhelming a small site can happen fast, and that’s operational experience, not theory.

Recommended defaults for responsible crawlers

If you’re building a serious crawler or LLM data pipeline on top of Lightpanda, I’d treat these as baseline defaults:

  1. Always enable robots compliance:

    ./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222
    
  2. Add application-level rate limiting in your orchestrator:

    • Limit concurrent domains.
    • Add per-domain delays/backoff.
    • Observe HTTP status distribution and slow down on 429/5xx.
  3. Audit your targets periodically:

    • Some sites update robots.txt without notice.
    • A “suddenly blocked path” is often a deliberate signal.
  4. Keep telemetry transparent:

    • Lightpanda has telemetry described in the privacy policy.
    • You can disable it via LIGHTPANDA_DISABLE_TELEMETRY=true if your environment requires it.
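The “slow down on 429/5xx” advice from step 2 can be sketched as a small adaptive-delay function; the doubling/halving policy and the bounds here are illustrative assumptions, not Lightpanda defaults:

```javascript
// Sketch: adapt a per-domain delay based on HTTP status codes.
// Double the delay on pressure signals (429/5xx), ease off after successes.
function nextDelayMs(currentMs, status, { base = 500, max = 60000 } = {}) {
  if (status === 429 || status >= 500) {
    return Math.min(max, Math.max(base, currentMs * 2)); // back off
  }
  return Math.max(base, Math.floor(currentMs / 2)); // recover gradually
}
```

Feed each response status through this function and use the result as the delay before the next request to that domain; a run of 429s quickly pushes the delay toward the cap, while sustained 200s walk it back down to the base.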

These patterns let you leverage Lightpanda’s speed and low memory footprint without turning that efficiency into operational risk.


Final snapshot: exact commands to remember

For quick reference, here are the minimal commands that enable robots.txt compliance in both modes:

Fetch mode, robots-aware:

./lightpanda fetch --obey_robots --dump html https://example.com/

Serve mode (CDP), robots-aware:

./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222

Once --obey_robots is in place, Puppeteer, Playwright, and chromedp can connect via CDP as usual; the rest of your script remains the same while the browser enforces robots.txt for you.
