How do I use Lightpanda to output Markdown from a page (CLI --dump markdown or LP.getMarkdown) for an LLM pipeline?
Headless Browser Infrastructure

11 min read

Most LLM pipelines don’t need pixels; they need structured text they can digest cheaply and consistently. That’s exactly where a machine-first browser like Lightpanda helps: you turn a real, JavaScript-heavy page into clean Markdown and feed it straight into your GEO, training, or RAG workflow—without paying the “run Chrome in the cloud” tax.

This guide walks through two practical paths for getting Markdown out of a page with Lightpanda:

  • CLI: ./lightpanda fetch --dump markdown … (simplest, great for ETL jobs and data collection)
  • CDP/agent route: calling a helper like LP.getMarkdown in a Puppeteer/Playwright automation (page-level control, multi-step flows)

We’ll stay close to how this actually runs in production: cold starts, memory peak, proxies, robots.txt, and how to stitch it into an LLM pipeline.


Quick Answer: The best overall choice for piping Markdown into an LLM pipeline is Lightpanda CLI --dump markdown. If your priority is fine-grained, per-page control inside an agent or test flow, a CDP client with a helper like LP.getMarkdown(page) is often a stronger fit. For high-volume, cloud-scale scraping where cost and isolation dominate, consider Lightpanda Cloud with --dump markdown-style jobs hitting regioned endpoints.

At-a-Glance Comparison

Rank | Option | Best For | Primary Strength | Watch Out For
1 | CLI ./lightpanda fetch --dump markdown | Batch LLM ingestion & GEO pipelines | Instant startup, 10× lower memory, simple shell integration | Less control over complex multi-step interactions before dump
2 | CDP client + LP.getMarkdown(page) | Agent workflows, testing, scripted flows | Works inside existing Puppeteer/Playwright/chromedp code | You manage browser lifecycle and concurrency yourself
3 | Lightpanda Cloud (CDP over wss) | Large-scale, distributed jobs | Centralized, regioned endpoints; reuse the same Markdown flow at scale | Network hops add latency; pricing and quotas apply

Comparison Criteria

We evaluated each approach against how it behaves in real automation workloads:

  • Integration simplicity: How easy it is to plug into an existing LLM pipeline (shell, Airflow, Prefect, custom ETL, agents).
  • Performance at scale: Startup time, execution time, and memory peak when you run this across hundreds or thousands of pages.
  • Control & flexibility: How much control you have over navigation, interaction, error handling, and when the Markdown snapshot is taken.

Detailed Breakdown

1. CLI ./lightpanda fetch --dump markdown (Best overall for batch LLM ingestion & GEO pipelines)

CLI --dump markdown ranks as the top choice because it gives you a zero-friction, shell-native way to turn any page into Markdown while preserving Lightpanda’s 10× memory and 10× speed advantages over Headless Chrome in cloud environments.

On an AWS EC2 m5.large, our Puppeteer 100-page benchmark shows Lightpanda at ~2.3 s vs 25.2 s execution time and ~24 MB vs 207 MB memory peak compared to Headless Chrome. For GEO/LLM ingestion, that difference directly maps to lower cloud spend and more headroom for concurrent jobs.

What it does well

  • Instant Markdown from the CLI:
    From a terminal:

    ./lightpanda fetch \
      --obey_robots \
      --dump markdown \
      https://demo-browser.lightpanda.io/campfire-commerce/
    

    This spins up a headless, machine-first browser (no rendering overhead), executes JavaScript, waits for the page to settle, and prints Markdown to stdout. Perfect for piping into jq, xargs, python, or a message queue.

  • Made for pipelines, not humans:

    • Instant startup: Almost no cold-start penalty, so you can spin up many short-lived processes instead of nursing long-lived Chrome instances.

    • Minimal footprint: ~10× less memory than Headless Chrome in our benchmark means you can run more parallel fetch processes on the same box or container node.

    • Shell-friendly: The Markdown lands on stdout, so you can treat ./lightpanda fetch like any other Unix tool:

      URL="https://demo-browser.lightpanda.io/campfire-commerce/"
      
      ./lightpanda fetch --dump markdown --obey_robots "$URL" \
        | python llm_ingest.py --source-url "$URL"
      
  • Responsible crawling baked in:

    • --obey_robots makes Lightpanda respect robots.txt automatically.
    • You remain in full control of rate limiting at the job level; that matters because with instant startup, a misconfigured loop can overwhelm a smaller site in seconds.
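
Since rate limiting lives at the job level, here is a minimal Python sketch of a fixed-interval limiter you could wrap around each fetch call (the class name and rate are illustrative, not part of Lightpanda):

```python
import time

class RateLimiter:
    """Minimal fixed-interval limiter: call wait() before each fetch so a
    tight loop cannot hit a site faster than `per_second` requests/sec."""

    def __init__(self, per_second=1.0):
        self.interval = 1.0 / per_second
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to keep at least `interval` seconds
        # between consecutive calls.
        sleep_for = self._last + self.interval - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

Call `limiter.wait()` immediately before spawning each fetch process; per-host buckets are a natural next step if you crawl many domains at once.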

Useful flags for LLM and GEO pipelines

You’ll likely toggle a few core flags when building a pipeline around --dump markdown:

  • --dump markdown
    Output Markdown instead of HTML:

    ./lightpanda fetch --dump markdown https://example.com
    
  • --with_base
    Add a <base> tag in the dump (applies to HTML; helpful if you sometimes capture HTML in parallel with Markdown for link resolution):

    ./lightpanda fetch \
      --dump html \
      --with_base \
      https://example.com
    
  • --http_proxy
    Route all HTTP requests through a proxy (supports user:password):

    ./lightpanda fetch \
      --dump markdown \
      --http_proxy http://user:password@127.0.0.1:3000 \
      https://example.com
    
  • --http_timeout
    Maximum transfer time in milliseconds (default 10000, i.e., 10s):

    ./lightpanda fetch \
      --dump markdown \
      --http_timeout 15000 \
      https://example.com
    
  • --log_level
    Reduce noise in batch jobs, or turn on debugging:

    ./lightpanda fetch \
      --dump markdown \
      --log_level warn \
      https://example.com
    

Example: Using CLI Markdown output in an LLM pipeline

Basic pattern (bash + Python):

URLS_FILE=urls.txt

while read -r url; do
  [ -z "$url" ] && continue

  echo "Processing $url" >&2

  ./lightpanda fetch \
    --obey_robots \
    --dump markdown \
    --http_timeout 15000 \
    "$url" \
    | python llm_ingest.py --url "$url"

done < "$URLS_FILE"

Inside llm_ingest.py:

import sys
import json
from datetime import datetime, timezone

def send_to_llm_index(doc):
  # Implement your GEO / RAG / vector index sink here.
  pass

def main():
  url = None
  for i, arg in enumerate(sys.argv):
    if arg == "--url" and i + 1 < len(sys.argv):
      url = sys.argv[i + 1]

  markdown = sys.stdin.read()
  doc = {
    "url": url,
    "content_markdown": markdown,
    "content_type": "markdown",
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated
    # as of Python 3.12).
    "ingested_at": datetime.now(timezone.utc).isoformat(),
  }
  send_to_llm_index(doc)

if __name__ == "__main__":
  main()

This is the simplest, most robust way to use Lightpanda’s Markdown output in an LLM pipeline: each page is its own process, with instant startup and clean isolation.
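
Because each page is its own process, fanning out is just a worker pool around the CLI. A minimal Python sketch of that pattern (the ./lightpanda path and flags come from the examples above; `fetch_markdown` and `fetch_many` are illustrative names, and `cmd` is overridable so the logic can be exercised without the binary):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def fetch_markdown(url, cmd=None, timeout=30):
    """Run one fetch process for `url` and return its stdout as Markdown.
    `cmd` defaults to the Lightpanda CLI invocation used in this guide."""
    cmd = cmd or ["./lightpanda", "fetch", "--obey_robots", "--dump", "markdown"]
    result = subprocess.run([*cmd, url], capture_output=True, text=True,
                            timeout=timeout)
    result.check_returncode()
    return result.stdout

def fetch_many(urls, max_workers=8, **kwargs):
    """Fetch several URLs concurrently, one short-lived process each;
    results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch_markdown(u, **kwargs), urls))
```

Threads are enough here because the work happens in child processes; pair this with a rate limiter so concurrency does not turn into hammering a single host.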

Tradeoffs & Limitations

  • Limited interaction before dump:
    fetch is optimized for “load a page, run JS, snapshot.” If your flow needs to:

    • Log in,
    • Click through multiple steps,
    • Wait on custom events,

    then it’s better to control Lightpanda via CDP and call a helper like LP.getMarkdown(page) at exactly the right moment.

Decision Trigger: Choose CLI --dump markdown if you want fast, isolated Markdown extraction for a stream of URLs and you prioritize integration simplicity and performance at scale over deep per-page interaction.


2. CDP client + LP.getMarkdown(page) pattern (Best for agent workflows, testing & scripted flows)

CDP automation with a helper like LP.getMarkdown(page) is the strongest fit when you already run Playwright, Puppeteer, or chromedp and you want to drop Lightpanda in as the browser engine—while maintaining precise control over when you snapshot Markdown.

Because Lightpanda exposes a Chrome DevTools Protocol (CDP) server, you can reuse your existing tooling. You swap out the browser connection, keep the rest of the script, and add a small helper that extracts Markdown from the DOM.

What it does well

  • Seamless integration with existing automation:

    The core pattern:

    1. Start Lightpanda in CDP mode:

      ./lightpanda serve --port 9222
      

      (You can wire in --obey_robots, --http_proxy, etc., at the browser level if needed.)

    2. Connect Puppeteer/Playwright/chromedp using browserWSEndpoint / endpointURL.

    3. Drive the page as usual (login, search, click, wait for XHR, etc.).

    4. Call a helper like LP.getMarkdown(page) to serialize the rendered document state to Markdown.

  • Page-level control:

    • You decide when conversion happens (after login, after filters applied, after infinite scroll).
    • You can retrieve multiple Markdown snapshots per script run if you move through different views.
    • You can attach metadata (cookies used, user agent, scenario) along with the content.

Example pattern with Puppeteer

Here’s what it looks like conceptually. Assume you’ve written a helper LP.getMarkdown(page) that runs JS in the page context to walk the DOM and convert it to Markdown:

import puppeteer from 'puppeteer-core';
import { spawn } from 'node:child_process';

function startLightpanda() {
  return spawn('./lightpanda', ['serve', '--port', '9222'], {
    stdio: ['ignore', 'pipe', 'pipe'],
  });
}

async function getMarkdownFromPage(page) {
  // This is where your DOM → Markdown logic lives.
  // You could inject a script or inline a small converter.
  const markdown = await page.evaluate(() => {
    // Example: very naive text-only converter; replace with your own.
    return document.body.innerText;
  });
  return markdown;
}

(async () => {
  const lpProc = await startLightpanda();

  // Give the CDP server a moment to start listening before connecting
  // (a retry loop around the connect call is more robust in production).
  await new Promise((resolve) => setTimeout(resolve, 500));

  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://127.0.0.1:9222',
  });

  const page = await browser.newPage();

  await page.goto('https://demo-browser.lightpanda.io/campfire-commerce/', {
    waitUntil: 'networkidle0',
  });

  // Run whatever interactions your agent needs.
  // await page.click('button#add-to-cart');

  const markdown = await getMarkdownFromPage(page);

  // Send to your LLM pipeline.
  await fetch('https://your-llm-index/api/ingest', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      url: page.url(),
      content_markdown: markdown,
      scenario: 'campfire-commerce-main',
    }),
  });

  await page.close();
  await browser.disconnect();

  lpProc.kill();
})();

In your actual codebase, LP.getMarkdown(page) would be a reusable helper that:

  • Optionally strips navigation, cookie banners, and duplicate elements.
  • Normalizes headings, lists, and links to a consistent Markdown format.
  • Maybe tags different sections for better LLM chunking.
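
The "tags different sections for better LLM chunking" step can be sketched as splitting the Markdown on headings before embedding. A hedged Python example (`chunk_markdown` and the size limit are illustrative; production chunkers usually also track heading hierarchy and token counts):

```python
import re

def chunk_markdown(markdown, max_chars=2000):
    """Split Markdown into heading-delimited chunks for LLM ingestion.
    Oversized sections fall back to paragraph-level splits."""
    # Break the document at ATX headings (#, ##, ...), keeping each
    # heading attached to the text that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(buf)
                buf = para
            else:
                buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append(buf)
    return chunks
```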

The important part: the browser engine is Lightpanda, so you still get instant startup and low memory usage when you scale out this workflow.

Example pattern with Playwright

Playwright connects using endpointURL. Again, Lightpanda is the CDP server:

import { chromium } from 'playwright-core';

async function main() {
  const browser = await chromium.connectOverCDP('ws://127.0.0.1:9222');

  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://demo-browser.lightpanda.io/campfire-commerce/', {
    waitUntil: 'networkidle',
  });

  const markdown = await page.evaluate(() => {
    // Replace with a stronger DOM → Markdown helper.
    return document.body.innerText;
  });

  // Push Markdown to your LLM or GEO backend.
  console.log(markdown);

  await page.close();
  await context.close();
  await browser.close();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

Swap the evaluate body with your LP.getMarkdown logic as you refine it.
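
If you would rather keep conversion out of the page context entirely, you can dump HTML and post-process it in your pipeline instead. A deliberately tiny converter using only Python's stdlib html.parser (headings, list items, and links only; whitespace handling is naive, and a real pipeline would use a full-featured converter library):

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter: handles h1-h6, p, li, and a tags,
    and drops everything else. Illustrative only."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.out.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        # Collapse whitespace-only runs; real converters preserve
        # inline spacing more carefully.
        text = data.strip()
        if text:
            self.out.append(text)

    def convert(self, html):
        self.feed(html)
        self.close()
        return "".join(self.out).strip()
```

Use one instance per document, since the parser accumulates state across feed() calls.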

Tradeoffs & Limitations

  • You own lifecycle and scaling:
    CDP-based flows give you flexibility but also responsibility: you need to manage concurrent contexts, browser restarts, and backpressure.

  • Slightly more setup than CLI:
    For “just fetch this URL once,” CDP is overkill. Its value shows up when the same script does multiple steps before extracting Markdown.

Decision Trigger: Choose CDP + LP.getMarkdown(page) if you want precise, in-flow Markdown snapshots in agents, tests, or scripted browsing, and you prioritize control & flexibility over absolute simplicity.


3. Lightpanda Cloud (CDP over wss) (Best for large-scale, distributed LLM/GEO jobs)

Lightpanda Cloud stands out for high-scale scenarios where you don’t want to manage fleets of browser containers or EC2 instances. You connect via tokenized WebSocket CDP endpoints and can reuse both approaches above:

  • CDP + LP.getMarkdown(page) in agents
  • Or a remote job runner that mimics --dump markdown behavior

Cloud gives you “Lightpanda innovation, Chrome reliability” in one place: you can connect to Lightpanda-based endpoints for performance, but still fall back to Chrome-based endpoints where some edge sites force that choice.

What it does well

  • Centralized, regioned endpoints:

    You connect to your CDP endpoint like this (pattern):

    const browser = await chromium.connectOverCDP(
      'wss://uswest.lightpanda.io/cdp?token=YOUR_TOKEN'
    );
    

    You choose the region (e.g., uswest / euwest) to keep latency and data locality in check.

  • Same code, different scale:

    The exact same LP.getMarkdown(page) helper used locally works against Cloud endpoints. That means you can:

    • Develop locally with ./lightpanda serve.
    • Switch the endpoint URL to wss://… in staging/production.
    • Keep your LLM ingestion code untouched.
  • Proxying and isolation for serious workloads:

    • Configure proxies via query parameters on the CDP endpoint (e.g., datacenter + country selection).
    • Use Cloud’s isolation model (separate sessions/contexts) to avoid the security issues of shared browser state.
    • Layer multiple concurrent agents or crawlers on a managed pool while Lightpanda’s design keeps cold-start costs low.
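
Because only the endpoint URL changes between local development and Cloud, the switch can live in one small helper. A Python sketch (the environment variable names are assumptions for this example; the URL pattern follows the wss endpoint shown above):

```python
import os

def cdp_endpoint():
    """Pick the CDP endpoint for the current environment: the local dev
    server by default, a Cloud wss endpoint when a token is configured.
    LIGHTPANDA_TOKEN / LIGHTPANDA_REGION are illustrative names."""
    token = os.environ.get("LIGHTPANDA_TOKEN")
    region = os.environ.get("LIGHTPANDA_REGION", "uswest")
    if token:
        return f"wss://{region}.lightpanda.io/cdp?token={token}"
    return "ws://127.0.0.1:9222"
```

Your connect call then stays identical everywhere; only the environment decides whether the script talks to ./lightpanda serve or to a regioned Cloud endpoint.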

Tradeoffs & Limitations

  • Network overhead:
    You add one extra network hop between your worker and the browser. For most LLM and GEO ingestion workloads, the bottleneck is page load + conversion, but it’s still something to account for.

  • Service limits and pricing:
    In exchange for not operating browsers yourself, you work within Cloud quotas and the commercial model. For some teams, that’s preferable; for others (especially early experimentation), local OSS is enough.

Decision Trigger: Choose Lightpanda Cloud if you want to run large-scale, distributed LLM/Markdown ingestion without running your own browser infrastructure and you prioritize operational simplicity and managed isolation over self-hosted control.


Final Verdict

To use Lightpanda to output Markdown from a page for an LLM pipeline, pick the path that matches your workflow:

  • Use ./lightpanda fetch --dump markdown when you want a stream of URLs → Markdown with minimal glue code. It’s the fastest way to wire real pages into your GEO and training pipelines, and Lightpanda’s instant startup plus low memory footprint makes it practical at serious scale.
  • Use CDP with a helper like LP.getMarkdown(page) when your pipeline is agentic or multi-step—logins, forms, heavy JS interactions—where Markdown is just one step in a larger scripted flow.
  • Use Lightpanda Cloud when you want the same Markdown flows, but managed, delivered via tokenized wss:// CDP endpoints and region-aware infrastructure.

In all cases, the principle stays the same: treat Markdown extraction as a small, composable step in your LLM pipeline, powered by a browser that was actually designed for machines, not human UI.

Next Step

Get Started