How do I enable robots.txt compliance in Lightpanda (CLI flags) for both fetch and serve modes?

Most teams only think about robots.txt once they’re already in trouble: IPs blocked, 403s everywhere, and a pager going off because “the scraper is down.” Enabling robots.txt compliance at the browser layer, rather than in each script, keeps large-scale automation safe and predictable.

Lightpanda bakes this directly into the CLI via a single flag: --obey_robots. You can enable it both when you run one-shot fetches and when you expose Lightpanda as a CDP server for Puppeteer, Playwright, or chromedp.

This guide covers:

  • fetch mode (CLI-only crawling/scraping)
  • serve mode (CDP server controlled by your existing tooling)
  • common patterns and troubleshooting for large crawls

Why robots.txt compliance matters at cloud scale

When you’re hitting millions of pages per day, you don’t want every script to reinvent robots.txt logic. You want the browser itself to:

  • Fetch and interpret robots.txt for you
  • Block disallowed paths before the HTTP request goes out
  • Make behavior consistent across CLI usage, agents, and test suites

In Lightpanda, this is done with one primitive:

--obey_robots

Once enabled, Lightpanda does the following for each host it contacts:

  1. Fetch https://<host>/robots.txt (if available).
  2. Apply the rules to subsequent requests to that host.
  3. Refuse requests to disallowed paths.

This works the same way in both fetch and serve modes.
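
To make step 2 concrete, here is a minimal sketch of how Allow/Disallow rules can be evaluated against a request path. It is not Lightpanda’s implementation: parseRobots and isAllowed are hypothetical names, only the "User-agent: *" group is read, and wildcard patterns (* and $) are ignored. It follows the longest-match convention from RFC 9309, where the most specific matching rule wins and Allow wins a tie.

```javascript
// Conceptual sketch only; not Lightpanda's parser.
// Collects Allow/Disallow rules from the "User-agent: *" group.
function parseRobots(text) {
  const rules = [];
  let inStarGroup = false;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines belong to the same group.
      inStarGroup = (lastWasAgent && inStarGroup) || value === "*";
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if (!inStarGroup || value === "") continue; // empty value matches nothing
    if (field === "allow" || field === "disallow") {
      rules.push({ allow: field === "allow", prefix: value });
    }
  }
  return rules;
}

// Longest matching prefix wins; on equal length, Allow wins.
// A path that matches no rule is allowed.
function isAllowed(rules, path) {
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.prefix)) continue;
    const longer = best === null || rule.prefix.length > best.prefix.length;
    const tieAllow =
      best !== null && rule.prefix.length === best.prefix.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best === null ? true : best.allow;
}

const demoRules = parseRobots(
  "User-agent: *\nDisallow: /private/\nAllow: /private/press/\n",
);
console.log(isAllowed(demoRules, "/private/data")); // false
console.log(isAllowed(demoRules, "/private/press/2024")); // true
```

With --obey_robots, this decision is made inside the browser, so none of this logic needs to live in your scripts.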

Enabling robots.txt compliance for fetch mode

fetch is the fastest way to turn Lightpanda into a simple, compliant crawler from the CLI. It runs the browser, executes JavaScript, and returns the final HTML (or other formats) without needing CDP clients.

Basic example: robots.txt-compliant fetch

./lightpanda fetch \
  --obey_robots \
  --dump html \
  https://demo-browser.lightpanda.io/campfire-commerce/

What happens here:

  • --obey_robots tells Lightpanda to fetch and obey robots.txt for demo-browser.lightpanda.io.
  • --dump html returns the fully rendered HTML after JS execution.
  • If robots.txt disallows this path, Lightpanda will block the request instead of hitting the page.

Recommended pattern for CLI crawling

When you’re looping through URLs from a file or orchestrator, keep the flag as part of your standard command:

cat urls.txt | xargs -n 1 -P 8 ./lightpanda fetch \
  --obey_robots \
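  --dump html

  • -n 1 passes one URL to each invocation; -P 8 runs up to 8 invocations concurrently.
  • Each Lightpanda process fetches and applies robots.txt for its own target host.
  • With -P, output from concurrent processes can interleave on stdout; redirect each run to its own file if you need clean per-page HTML.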

This is the simplest way to enforce responsible crawling from day one.
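
If your orchestrator is a Node.js script rather than xargs, the same pattern can be sketched as a small worker pool. The names runPool and lightpandaFetch are hypothetical; lightpandaFetch simply wraps the fetch command shown above, and the pool takes it as a parameter so the concurrency logic stays independent of how the process is launched.

```javascript
// Run fetchOne(url) for every URL with at most `limit` calls in flight,
// mirroring `xargs -P 8`. Results keep the order of `urls`.
async function runPool(urls, fetchOne, limit = 8) {
  const results = new Array(urls.length);
  let next = 0;
  async function worker() {
    while (next < urls.length) {
      const i = next++;
      try {
        const output = await fetchOne(urls[i]);
        results[i] = { url: urls[i], ok: true, output };
      } catch (error) {
        // A non-zero exit (for example a blocked URL) ends up here.
        results[i] = { url: urls[i], ok: false, error };
      }
    }
  }
  const workers = Array.from({ length: Math.min(limit, urls.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Hypothetical launcher: one Lightpanda process per URL, stdout captured.
async function lightpandaFetch(url) {
  const { execFile } = await import("node:child_process");
  const { promisify } = await import("node:util");
  const { stdout } = await promisify(execFile)(
    "./lightpanda",
    ["fetch", "--obey_robots", "--dump", "html", url],
    { maxBuffer: 64 * 1024 * 1024 }, // rendered pages can be large
  );
  return stdout;
}
```

Calling runPool(urls, lightpandaFetch, 8) returns one result object per URL, so the HTML of each page stays separate instead of sharing a single stdout stream.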

Enabling robots.txt compliance for serve mode (CDP)

Most teams will use Lightpanda as a CDP server and keep their existing automation stack: Puppeteer, Playwright, or chromedp. In that setup, the robots.txt decision still lives in the browser, not in the CDP client.

Start a CDP server with robots.txt enabled

Run Lightpanda in serve mode:

./lightpanda serve \
  --obey_robots \
  --host 127.0.0.1 \
  --port 9222

Defaults:

  • --host default: 127.0.0.1
  • --port default: 9222

Adding --obey_robots means:

  • All navigation and requests triggered via CDP (Puppeteer, Playwright, chromedp, custom agents) will respect robots.txt.
  • You don’t have to re-implement robots logic in each framework; it’s enforced once at the browser boundary.

Connect from Puppeteer

import puppeteer from "puppeteer-core";

const browser = await puppeteer.connect({
  browserWSEndpoint: "ws://127.0.0.1:9222",
});

const page = await browser.newPage();
await page.goto("https://demo-browser.lightpanda.io/campfire-commerce/");
// If robots.txt disallows this URL, Lightpanda will block at the browser layer.

await browser.close();

You don’t need any code changes for robots.txt in Puppeteer. The same script will behave differently depending on whether the Lightpanda server was started with --obey_robots.

Connect from Playwright

import { chromium } from "playwright";

const browser = await chromium.connectOverCDP("ws://127.0.0.1:9222");
const page = await browser.newPage();

await page.goto("https://demo-browser.lightpanda.io/campfire-commerce/");
// Robots rules enforced by the Lightpanda server.

await browser.close();

Again, Playwright doesn’t have to know about robots.txt. The enforcement is done by Lightpanda.
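
Because enforcement happens in the browser, the client only sees the outcome of the navigation. The sketch below shows one way to handle that outcome without aborting a whole job. It assumes a blocked navigation surfaces as a rejected page.goto() promise; how Lightpanda reports a robots.txt block over CDP (error text, status code) is not covered here, so the hypothetical helper tryGoto treats any rejection as "skip this URL" instead of matching a specific error string.

```javascript
// Works with any object exposing goto(url), such as a Puppeteer or
// Playwright page connected to the Lightpanda CDP server.
// Assumption: a navigation blocked by robots.txt rejects the goto() promise.
async function tryGoto(page, url) {
  try {
    const response = await page.goto(url);
    // goto() may resolve to null (for example, same-document navigations).
    return { url, ok: true, status: response ? response.status() : null };
  } catch (error) {
    const reason = error && error.message ? error.message : String(error);
    return { url, ok: false, reason };
  }
}
```

A crawler loop can then log the reason for each skipped URL and continue, which makes it easy to see afterwards which URLs were refused.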

How Lightpanda’s robots.txt compliance behaves

When --obey_robots is set (in either fetch or serve):

  • robots.txt fetched automatically: For each new host, Lightpanda will fetch /robots.txt once and cache the rules.
  • Disallowed paths blocked: If your script tries to access a disallowed path, the browser will prevent that request rather than hitting the page.
  • Applies to all request types: Navigations, XHR/fetch calls, and other HTTP calls triggered by the page or your CDP script go through the same policy.

If a site doesn’t expose robots.txt, Lightpanda behaves like a normal browser and proceeds with requests. You should still avoid aggressive concurrency to prevent accidental overload.
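
The "fetch once per host and cache" behavior can be pictured with the following sketch. It is a conceptual model, not Lightpanda’s code: RobotsCache and loadText are hypothetical names, the cache key is assumed to be the URL origin (scheme, host, and port), and a missing robots.txt is modeled as null, meaning "no rules", in line with the fallback described above.

```javascript
class RobotsCache {
  // loadText(robotsUrl) resolves to the file body, or null if unavailable.
  constructor(loadText) {
    this.loadText = loadText;
    this.byOrigin = new Map(); // origin -> Promise<string | null>
  }

  // Returns the cached robots.txt body for the origin of targetUrl,
  // loading it on first use. Storing the promise (not the result) means
  // concurrent requests to the same origin share a single download.
  rulesFor(targetUrl) {
    const origin = new URL(targetUrl).origin;
    if (!this.byOrigin.has(origin)) {
      const robotsUrl = new URL("/robots.txt", origin).href;
      this.byOrigin.set(origin, this.loadText(robotsUrl));
    }
    return this.byOrigin.get(origin);
  }
}
```

In this model the robots.txt URL is derived from the page you are requesting, never from the network path used to reach it.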

Combining robots.txt compliance with proxies

In real-world scraping and testing, you’ll often layer proxies on top. Lightpanda supports HTTP/HTTPS proxies via the --http_proxy option; you can combine this with --obey_robots.

Example: fetch mode with proxy + robots.txt

./lightpanda fetch \
  --obey_robots \
  --http_proxy http://user:pass@my-proxy:8080 \
  --dump html \
  https://target-site.example/

Example: serve mode with proxy + robots.txt

./lightpanda serve \
  --obey_robots \
  --http_proxy http://user:pass@my-proxy:8080 \
  --host 127.0.0.1 \
  --port 9222

Robots.txt is still evaluated based on the target host; the proxy is just the transport layer.

Recommended practices for large crawls and agents

Even with robots.txt compliance enabled, you’re still responsible for not behaving like a DDoS.

A few pragmatic guidelines:

  • Keep --obey_robots on by default for any scraping, training, or testing that touches the public web.
  • Control concurrency at your orchestrator level (queue, worker count, batch size). Lightpanda is fast enough that “too many” concurrent sessions can hurt small sites quickly.
  • Respect crawl delays if the target’s robots.txt or docs ask for them; add sleeps/backoff in your job runner.
  • Segment environments: Run different Lightpanda instances (with --obey_robots) for staging vs production vs experimentation, so misconfigurations don’t leak.

When you’re operating at “millions of pages a day,” small details like robots compliance and concurrency caps are the difference between a smooth run and an emergency firewall rule.
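
The concurrency and crawl-delay guidelines above can be sketched as a per-host queue in your job runner. This is one possible approach under stated assumptions: crawlPolitely, visit, and the delayMs default of 1000 are all hypothetical, and the right delay for a given site should come from its robots.txt or documentation. URLs for the same host are visited one at a time with a pause in between, while different hosts proceed in parallel.

```javascript
// Visit URLs sequentially per host, pausing delayMs between requests to the
// same host. `visit(url)` is whatever performs the request, for example a
// function that drives a page connected to the Lightpanda CDP server.
async function crawlPolitely(urls, visit, options = {}) {
  const delayMs = options.delayMs ?? 1000;
  const sleep =
    options.sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));

  // Group URLs by host, preserving their original order within each host.
  const byHost = new Map();
  for (const url of urls) {
    const host = new URL(url).host;
    if (!byHost.has(host)) byHost.set(host, []);
    byHost.get(host).push(url);
  }

  await Promise.all(
    [...byHost.values()].map(async (queue) => {
      for (let i = 0; i < queue.length; i++) {
        if (i > 0) await sleep(delayMs); // no pause before the first request
        await visit(queue[i]);
      }
    }),
  );
}
```

If visit rejects, the returned promise rejects as well, so wrap visit in your own error handling when one failed page should not stop the run.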

Troubleshooting robots.txt behavior in Lightpanda

If you’re unsure whether robots.txt compliance is active:

  1. Check the startup command.

    • No --obey_robots on fetch? It’s not enabled.
    • No --obey_robots on serve? Your CDP clients are running without robots enforcement.

  2. Run a simple fetch test:

    ./lightpanda fetch --obey_robots --dump html https://example.com/disallowed-path

    Replace the URL with a path that the target site’s robots.txt actually disallows, then compare behavior with and without --obey_robots to confirm blocking.

  3. Verify from the CDP side.
    Start Lightpanda with --obey_robots, connect with Puppeteer/Playwright, and try hitting known-disallowed paths. If navigation suddenly starts failing where it used to succeed, robots.txt is doing its job.

  4. Check your proxies and DNS.
    If you use proxies, confirm that robots.txt is reachable through them. A misconfigured proxy that can’t reach /robots.txt may affect behavior depending on how the network fails.
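
To separate a proxy problem from a site problem, it can help to first check whether the site’s robots.txt is reachable directly. The sketch below is a baseline check only: checkRobots is a hypothetical helper, it uses Node’s built-in fetch (Node 18 or later), and it does not go through your proxy. If this check succeeds but Lightpanda behind the proxy behaves unexpectedly, the proxy configuration is the more likely cause.

```javascript
// Direct (non-proxied) reachability check for a site's robots.txt.
// fetchImpl is injectable so the function can be tested without a network.
async function checkRobots(targetUrl, fetchImpl = fetch) {
  const robotsUrl = new URL("/robots.txt", targetUrl).href;
  try {
    const res = await fetchImpl(robotsUrl, { redirect: "follow" });
    return { robotsUrl, reachable: true, status: res.status };
  } catch (error) {
    // DNS failures, refused connections, and TLS errors land here.
    return { robotsUrl, reachable: false, error: String(error) };
  }
}
```

A reachable result with a 404 status means the site has no robots.txt, which is different from a network failure and should be read accordingly.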

Summary: one flag, both modes

To enable robots.txt compliance in Lightpanda:

  • For fetch mode:

    ./lightpanda fetch --obey_robots --dump html https://example.com/

  • For serve mode (CDP server):

    ./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222

Once you standardize on --obey_robots in these entrypoints, all your CLI jobs, crawlers, and CDP-driven agents inherit the same robots.txt policy automatically.
