How do I enable robots.txt compliance in Lightpanda (CLI flags) for both fetch and serve modes?


Most automation stacks ignore robots.txt until something breaks: you overload a small site, get blocked, or receive a complaint. When you’re running crawlers, LLM data collection, or test suites across thousands of pages, robots compliance isn’t a “nice to have” — it’s part of keeping your infrastructure safe and sustainable.

In Lightpanda, robots.txt compliance is an explicit, first-class behavior you enable via a single CLI flag: --obey_robots. You can turn it on both for one-off fetches and for long-running CDP servers used by Puppeteer, Playwright, or chromedp.

This guide walks through:

  • Which CLI flag to use for robots.txt compliance
  • How to enable it in fetch mode
  • How to enable it in serve (CDP server) mode
  • How it behaves in real crawlers and test runs
  • How to combine robots compliance with proxies and rate awareness

The core flag: --obey_robots

Lightpanda exposes robots.txt compliance via a single CLI option:

  • --obey_robots:
    • Fetches the target site’s robots.txt (if available)
    • Evaluates allowed/disallowed paths
    • Enforces those rules for HTTP requests Lightpanda makes

Once this flag is on, every navigation coming from Lightpanda respects robots.txt for that origin.

You can attach --obey_robots to:

  • ./lightpanda fetch — for direct CLI fetches
  • ./lightpanda serve — for CDP servers that Puppeteer/Playwright/chromedp connect to

There’s nothing to add in your Puppeteer/Playwright scripts: the robots logic runs inside the browser process, not in your client code.
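The actual robots.txt evaluation lives inside Lightpanda, so you never implement it yourself. Still, it helps to know roughly what "evaluates allowed/disallowed paths" means. Here is a minimal, illustrative sketch of the matching semantics (longest matching rule wins, per RFC 9309); the function names are ours, not Lightpanda's API, and this simplified version only considers the `User-agent: *` group:

```javascript
// Illustrative sketch of robots.txt path matching (NOT Lightpanda's actual code).
// Collects Allow/Disallow rules for the "*" user-agent group, then applies
// longest-rule-wins semantics as described in RFC 9309.
function parseRobots(txt) {
  const rules = [];
  let applies = false;
  for (const raw of txt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === 'user-agent') applies = value === '*'; // simplification: "*" group only
    else if (applies && (key === 'allow' || key === 'disallow') && value)
      rules.push({ allow: key === 'allow', path: value });
  }
  return rules;
}

function isAllowed(rules, path) {
  let best = null;
  for (const r of rules) {
    if (path.startsWith(r.path) && (!best || r.path.length > best.path.length))
      best = r; // the most specific (longest) matching rule wins
  }
  return best ? best.allow : true; // no matching rule means allowed
}
```

For example, with `Disallow: /admin` plus `Allow: /admin/public`, a request to `/admin/settings` is refused while `/admin/public/report` goes through, because the longer `Allow` rule wins for the latter.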


Enabling robots.txt compliance in fetch mode

fetch is the simplest place to start. It’s the subcommand you use when you want to hit a URL directly from Lightpanda, with no CDP clients involved.

Basic robots-aware fetch

./lightpanda fetch \
  --obey_robots \
  --dump html \
  https://demo-browser.lightpanda.io/campfire-commerce/

What this does:

  • Downloads https://demo-browser.lightpanda.io/robots.txt (if present)
  • Validates that /campfire-commerce/ is allowed
  • Only then performs the GET request and executes page JavaScript
  • Dumps the final HTML to stdout

If robots.txt disallows that path, Lightpanda will not request it — your script fails fast instead of silently violating the rules.

Using fetch with custom host/port and robots compliance

fetch can also talk to a Lightpanda CDP server instead of starting its own internal browser. The same --obey_robots behavior applies.

Start a server:

./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222

Then route fetch through it (robots.txt is already enforced at the server level; this example just shows the host/port knobs):

./lightpanda fetch \
  --host 127.0.0.1 \
  --port 9222 \
  --dump html \
  https://demo-browser.lightpanda.io/campfire-commerce/

Defaults if you don’t set them:

  • --host default: 127.0.0.1
  • --port default: 9222

In practice, when you’re just using fetch, you can stick to:

./lightpanda fetch --obey_robots --dump html https://example.com/

Enabling robots.txt compliance in serve mode (CDP server)

Most teams hit Lightpanda through Puppeteer or Playwright. In that flow, robots.txt compliance is controlled on the server process, not in your Node/Go/Python code.

Start a robots-aware CDP server

./lightpanda serve \
  --obey_robots \
  --host 127.0.0.1 \
  --port 9222

You’ll see something like:

INFO  app : server running . . . . . . . . . . . . . . . . .  [+0ms] address = 127.0.0.1:9222

From now on, every navigation done through this server (via CDP) is checked against robots.txt for each origin.

Connect Puppeteer to the robots-aware server

Your Puppeteer script doesn’t change except for the browserWSEndpoint:

const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://127.0.0.1:9222',
  });

  const page = await browser.newPage();

  // If robots.txt disallows /admin, this navigation will be blocked by Lightpanda.
  await page.goto('https://example.com/admin', { waitUntil: 'networkidle0' });

  const html = await page.content();
  console.log(html);

  await browser.close();
})();

Key point: robots compliance is enforced inside Lightpanda, so the same script connected to a different browser (e.g., Chrome) will behave differently. That’s intentional — the browser is where this policy belongs.
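Because a disallowed path surfaces as a failed page.goto, it pays to separate policy blocks from genuine crashes in your own code rather than let the rejection bubble up. A sketch of one way to do this; the { ok, url, error } shape is our own convention, not a Puppeteer API, and Lightpanda’s exact error text may vary:

```javascript
// Sketch: wrap page.goto so a blocked or failed navigation becomes a
// structured result instead of an unhandled promise rejection.
async function tryGoto(page, url, options = { waitUntil: 'networkidle0' }) {
  try {
    await page.goto(url, options);
    return { ok: true, url };
  } catch (err) {
    // Robots-blocked navigations land here alongside network failures.
    return { ok: false, url, error: String(err.message || err) };
  }
}
```

In a crawler loop, log and skip the `!ok` results rather than retrying them: a robots block is a deliberate signal from the site owner, not a transient error.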

Connect Playwright or chromedp the same way

For Playwright, you point connectOverCDP to the same endpoint:

import { chromium } from 'playwright';

(async () => {
  const browser = await chromium.connectOverCDP('http://127.0.0.1:9222');
  const context = browser.contexts()[0];
  const page = context.pages()[0] ?? (await context.newPage());

  await page.goto('https://example.com/secret-report');
  // If disallowed by robots.txt, Lightpanda will block this.

  await browser.close();
})();

Again, the robots behavior is on the server; the client code remains idiomatic Playwright.


How robots.txt compliance behaves in real workflows

With --obey_robots enabled, Lightpanda acts as a gatekeeper for each HTTP request it makes on your behalf.

When robots.txt is present

  • Lightpanda fetches https://hostname/robots.txt the first time you touch that host.
  • It parses the file and evaluates if the path is allowed.
  • Disallowed paths are not requested; you’ll see failures instead of silent success.

For large crawls, this is the difference between:

  • Accidentally hammering /search or /private endpoints on every domain
  • Being constrained to the paths the site owner explicitly permits

When robots.txt is missing

If there is no robots.txt:

  • Lightpanda proceeds as if there were no restrictions.
  • You still need to control your rate of requests (robots.txt is not a rate limiter).
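Since robots.txt says nothing about request frequency, pairing --obey_robots with a per-domain delay in your orchestrator is a good habit. A minimal sketch; the interval value is an assumption to tune per target, and none of this is a Lightpanda feature:

```javascript
// Sketch: enforce a minimum delay between requests to the same host.
// Lives in your orchestrator code, in front of whatever drives the browser.
class PerDomainThrottle {
  constructor(minIntervalMs = 1000) {
    this.minIntervalMs = minIntervalMs;
    this.lastRequest = new Map(); // hostname -> timestamp of the last slot handed out
  }

  // Resolves once it is polite to hit `url`'s host again.
  async wait(url) {
    const host = new URL(url).hostname;
    const now = Date.now();
    const last = this.lastRequest.get(host) ?? 0;
    const delay = Math.max(0, last + this.minIntervalMs - now);
    this.lastRequest.set(host, now + delay); // reserve the next slot
    if (delay > 0) await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

Call `await throttle.wait(url)` before each navigation; requests to different hosts proceed immediately, while same-host requests queue up at the configured interval.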

Failure modes to expect

You should be prepared for:

  • Navigations that fail because robots.txt explicitly disallows them.
  • Some sites using broad disallow rules you need to respect by design.

In other words, a 100% “success” rate is not the goal when you enable robots — policy correctness is.


Combining robots.txt compliance with proxies

Lightpanda supports HTTP/HTTPS proxies via --http_proxy. You can combine this with --obey_robots in both fetch and serve modes.

Robots + proxy with fetch

./lightpanda fetch \
  --obey_robots \
  --http_proxy http://user:pass@proxy.example.com:8080 \
  --dump html \
  https://example.com/

Robots + proxy with serve

./lightpanda serve \
  --obey_robots \
  --http_proxy http://user:pass@proxy.example.com:8080 \
  --host 127.0.0.1 \
  --port 9222

Robots behavior is unchanged; it still enforces the site’s rules. The proxy only shifts where traffic originates from.


Why robots.txt compliance matters when you’re 10× faster

On an AWS EC2 m5.large, our internal Puppeteer benchmark (100 pages) showed Lightpanda completing in ~2.3s vs Headless Chrome’s ~25.2s, with memory peak ~24MB vs 207MB. That’s roughly:

  • ~11× faster execution
  • ~9× less memory

The downside of speed at this magnitude is that it’s very easy to accidentally overwhelm small sites if you ignore robots.txt and basic rate control. With instant startup and near-zero footprint, spawning hundreds of concurrent sessions stops being an infrastructure challenge and becomes an ethical one.

That’s why:

  • We ship --obey_robots as a core primitive, not an afterthought.
  • We explicitly recommend you:
    • Respect robots.txt at scale.
    • Avoid high-frequency requesting against small or fragile infrastructures.
    • Treat DDoS risk as real: at these speeds, accidentally overwhelming a small site can happen fast, and that’s operational experience, not theory.

Recommended defaults for responsible crawlers

If you’re building a serious crawler or LLM data pipeline on top of Lightpanda, I’d treat these as baseline defaults:

  1. Always enable robots compliance:

    ./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222
    
  2. Add application-level rate limiting in your orchestrator:

    • Limit concurrent domains.
    • Add per-domain delays/backoff.
    • Observe HTTP status distribution and slow down on 429/5xx.
  3. Audit your targets periodically:

    • Some sites update robots.txt without notice.
    • A “suddenly blocked path” is often a deliberate signal.
  4. Keep telemetry transparent:

    • Lightpanda has telemetry described in the privacy policy.
    • You can disable it via LIGHTPANDA_DISABLE_TELEMETRY=true if your environment requires it.
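The “slow down on 429/5xx” advice from step 2 can be sketched as a small adaptive-delay function; the doubling/halving policy and the bounds here are illustrative assumptions, not Lightpanda defaults:

```javascript
// Sketch: adapt a per-domain delay based on HTTP status codes.
// Double the delay on pressure signals (429/5xx), ease off after successes.
function nextDelayMs(currentMs, status, { base = 500, max = 60000 } = {}) {
  if (status === 429 || status >= 500) {
    return Math.min(max, Math.max(base, currentMs * 2)); // back off
  }
  return Math.max(base, Math.floor(currentMs / 2)); // recover gradually
}
```

Feed each response status through this function and use the result as the delay before the next request to that domain; a run of 429s quickly pushes the delay toward the cap, while sustained 200s walk it back down to the base.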

These patterns let you leverage Lightpanda’s speed and low memory footprint without turning that efficiency into operational risk.


Final snapshot: exact commands to remember

For quick reference, here are the minimal commands that enable robots.txt compliance in both modes:

Fetch mode, robots-aware:

./lightpanda fetch --obey_robots --dump html https://example.com/

Serve mode (CDP), robots-aware:

./lightpanda serve --obey_robots --host 127.0.0.1 --port 9222

Once --obey_robots is in place, Puppeteer, Playwright, and chromedp can connect via CDP as usual; the rest of your script remains the same while the browser enforces robots.txt for you.
