How do I use Lightpanda to output Markdown from a page (CLI --dump markdown or LP.getMarkdown) for an LLM pipeline?

Most LLM pipelines don’t need pixels; they need clean, structured text the model can reason about. That’s exactly where Lightpanda’s Markdown output is useful: you turn a live web page into a compact, hierarchy-preserving text format in one hop, either from the CLI (--dump markdown) or from a CDP client (e.g., an LP.getMarkdown-style helper in your own code).

Below is a runbook for both paths and how to wire them into an LLM pipeline.


Quick Answer: The best overall choice for running a scalable LLM pipeline is Lightpanda CLI with --dump markdown. If your priority is tight integration with existing Playwright/Puppeteer code, a CDP client + custom getMarkdown helper is often a stronger fit. For heavyweight, mixed-rendering pipelines that sometimes need full Chrome, consider a hybrid approach: Lightpanda for most pages, Chrome for edge cases.

At-a-Glance Comparison

| Rank | Option | Best For | Primary Strength | Watch Out For |
| --- | --- | --- | --- | --- |
| 1 | Lightpanda CLI --dump markdown | High-throughput LLM ingestion | Simple, shell-native, 10× faster & 10× less memory than Chrome for scraping workloads | Less granular control per page than a full CDP session |
| 2 | CDP client + LP.getMarkdown helper | Existing Puppeteer/Playwright/chromedp codebases | Reuses your current stack; just swap the browser endpoint | You must implement the DOM → Markdown step yourself |
| 3 | Hybrid: Lightpanda + Chrome | Pipelines that occasionally need pixel-perfect Chrome | “Use Lightpanda by default, fall back to Chrome” keeps cost sane | More moving parts; you must route pages between engines |

Comparison Criteria

We evaluated each approach for a real-world LLM pipeline:

  • Throughput & cost: How many pages you can process per dollar and per node. Here, cold-start time and memory peak are product features: Lightpanda’s roughly 10× lower memory and 10× faster execution vs Headless Chrome on a Puppeteer 100‑page test (AWS EC2 m5.large; about 9× and 11× respectively) directly translate into more pages, and therefore more tokens, per machine.
  • Integration friction: How much code you need to change. For a lot of teams, “don’t touch the scraper, just change the browserWSEndpoint” is the difference between shipping this quarter and never.
  • Operational brittleness: How much your system depends on a heavyweight UI browser running in the cloud. Headless Chrome wasn’t built for that; Lightpanda was built from scratch, in Zig, for machine-driven workloads.
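
To make the throughput criterion concrete, the headline ratios can be recomputed from the benchmark figures quoted in the CLI section of this article (2.3 s vs 25.2 s total execution, 24 MB vs 207 MB peak memory). The process counts are a rough upper bound only: they assume the peak figures apply per browser process and that all 8 GiB of an m5.large's RAM is available to browsers, which will not hold in practice.

```python
# Benchmark figures quoted in this article (Puppeteer 100-page test, m5.large).
lightpanda_s, chrome_s = 2.3, 25.2      # total execution time, seconds
lightpanda_mb, chrome_mb = 24, 207      # peak memory, MB

speedup = chrome_s / lightpanda_s        # how many times faster
memory_ratio = chrome_mb / lightpanda_mb # how many times less memory

# Assumption: the whole node's RAM (8 GiB on an m5.large) goes to browser
# processes, and each process peaks at the figure above. Real nodes have
# OS and worker overhead, so treat these as upper bounds.
NODE_RAM_MB = 8 * 1024
max_chrome = NODE_RAM_MB // chrome_mb
max_lightpanda = NODE_RAM_MB // lightpanda_mb

print(f"speedup: {speedup:.1f}x, memory ratio: {memory_ratio:.1f}x")
print(f"memory-bound processes per node: Chrome {max_chrome}, Lightpanda {max_lightpanda}")
```

The point of the calculation is that memory, not CPU, is often what caps concurrency per node, so the memory ratio matters at least as much as the speedup.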

Detailed Breakdown

1. Lightpanda CLI --dump markdown (Best overall for high-throughput LLM ingestion)

Lightpanda CLI ranks as the top choice because it gives you instant-start, low-memory Markdown dumps from the shell, which is exactly what most ingestion and RAG jobs want.

With the open-source binary, you can go from URL → Markdown → LLM in a single pipeline.

What it does well:

  • Ultra-fast, low-footprint Markdown export:
    The basic pattern:

    ./lightpanda fetch --obey_robots --dump markdown "https://example.com/page"
    

    Key flags from the docs:

    • fetch – run a single headless navigation
    • --dump markdown – output a Markdown representation instead of HTML
    • --obey_robots – respect robots.txt for responsible automation
    • --http_proxy http://user:password@127.0.0.1:3000 – optional proxy
    • --http_timeout 15000 – override the default 10 000 ms timeout

    The Markdown is written to stdout, so you can pipe it directly:

    ./lightpanda fetch \
      --obey_robots \
      --dump markdown \
      "https://example.com/page" \
    | python llm_ingest.py
    
  • Scales cleanly for batch jobs:
    Because Lightpanda is built for machines (no rendering pipeline, no UI baggage), you avoid the typical Chrome-in-the-cloud failure modes:

    • Multi-second cold starts that kill concurrency.
    • 200+ MB per process just to run a scraper.
    • Fragile remote Chrome orchestration.

    On our Puppeteer 100‑page benchmark (AWS EC2 m5.large), Lightpanda runs ~11× faster (2.3 s vs 25.2 s total execution) with ~9× less memory (24 MB vs 207 MB peak). For an LLM pipeline, that means more pages per node, more tokens per dollar.

  • Shell-friendly for GEO-scale ingestion:
    For GEO-focused pipelines (Generative Engine Optimization), you usually have a grab bag of workers (Airflow, Argo, Nomad, bare-bones cron). The CLI makes it trivial to bolt Markdown extraction into anything that can run a command:

    URL="https://demo-browser.lightpanda.io/campfire-commerce/"
    ./lightpanda fetch --obey_robots --dump markdown "$URL" > /tmp/page.md
    
    # Example: call your LLM indexer
    curl -X POST https://your-llm-indexer.local/ingest \
      -H "Content-Type: text/markdown" \
      --data-binary @/tmp/page.md
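
The `llm_ingest.py` stage in the pipe shown earlier is not part of Lightpanda and is not defined in this article. As a minimal sketch of what such a script might do, the hypothetical version below reads Markdown from stdin, splits it into one chunk per heading, and prints one JSON object per line. The function name `chunk_by_headings` and the output shape are assumptions for illustration.

```python
import json
import re
import sys

# ATX headings: one to six '#' characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def chunk_by_headings(markdown):
    """Split Markdown into chunks, one per heading (hypothetical helper)."""
    chunks = []
    heading, lines, in_fence = "", [], False

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # a '#' inside fenced code is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            # Close the previous chunk before starting a new one.
            text = "\n".join(lines).strip()
            if heading or text:
                chunks.append({"heading": heading, "text": text})
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)

    # Close the final chunk.
    text = "\n".join(lines).strip()
    if heading or text:
        chunks.append({"heading": heading, "text": text})
    return chunks


if __name__ == "__main__":
    # Usage: ./lightpanda fetch --dump markdown "$URL" | python llm_ingest.py
    for chunk in chunk_by_headings(sys.stdin.read()):
        print(json.dumps(chunk))
```

Emitting JSON lines keeps the script composable: the next stage (embedding, indexing) can consume chunks one at a time without parsing Markdown again.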
    

Tradeoffs & Limitations:

  • Less fine-grained DOM control:
    The CLI fetch + dump model is intentionally simple. If you need to click buttons, solve captchas, or wait for app-specific events before extracting Markdown, you’ll want a full CDP session (Puppeteer/Playwright) and a custom getMarkdown step.

Decision Trigger: Choose Lightpanda CLI --dump markdown if you want a fast, dead-simple URL → Markdown → LLM pipeline and you care primarily about throughput, memory peak, and operational simplicity.


2. CDP client + LP.getMarkdown helper (Best for teams with existing browser automation)

A CDP client with a custom LP.getMarkdown helper is the strongest fit if you already have Puppeteer, Playwright, or chromedp scripts and you don’t want to rewrite flows just to switch browsers.

The pattern is: keep your client, swap the browser to Lightpanda, then add a helper function to extract Markdown from the DOM.

What it does well:

  • Reuses existing Puppeteer/Playwright code:
    Lightpanda exposes a Chrome DevTools Protocol server, so you can connect via a normal WebSocket endpoint (browserWSEndpoint / endpointURL) and keep the rest of your script the same.

    Example with Puppeteer:

    import { lightpanda } from '@lightpanda/browser';
    import puppeteer from 'puppeteer-core';
    
    // 1. Start Lightpanda locally
    const proc = await lightpanda.serve({ host: '127.0.0.1', port: 9222 });
    
    // 2. Connect Puppeteer over CDP
    const browser = await puppeteer.connect({
      browserWSEndpoint: 'ws://127.0.0.1:9222',
    });
    
    const page = await browser.newPage();
    await page.goto('https://example.com/page', { waitUntil: 'networkidle0' });
    
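    // 3. Custom helper: an LP.getMarkdown-style function you write yourself
    async function getMarkdown(page) {
      // Minimal in-page converter: headings and paragraphs only, in document order
      return page.evaluate(() => {
        // Escape characters that would otherwise be read as Markdown syntax
        const escape = (text) => text.replace(/([\\#*_`])/g, '\\$1');
        return Array.from(document.querySelectorAll('h1, h2, h3, p'))
          .map((el) => {
            const text = escape(el.textContent.trim());
            if (!text) return '';
            // 'H2' -> level 2; 'P' -> level 0 (no prefix)
            const level = el.tagName.startsWith('H') ? Number(el.tagName[1]) : 0;
            return level ? '#'.repeat(level) + ' ' + text : text;
          })
          .filter(Boolean) // drop empty elements so no stray blank blocks appear
          .join('\n\n');
      });
    }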
    
    const markdown = await getMarkdown(page);
    console.log(markdown);
    
    await page.close();
    await browser.disconnect();
    
    // 4. Stop Lightpanda process
    proc.stdout.destroy();
    proc.stderr.destroy();
    proc.kill();
    

    In your own code, you can make getMarkdown as sophisticated as you like: include lists, tables, code blocks, or only specific sections relevant to your LLM task.

  • Full control over page lifecycle:
    Because you’re inside a CDP session, you can:

    • Click “load more” buttons before extracting.
    • Wait on custom JS events.
    • Inject heuristics (e.g., skip nav, ads, cookie banners) before building Markdown.

    This is useful for GEO-intensive workflows where you want consistent, human-like content slices across many sites, but still need a machine-first browser underneath.
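
If you would rather do the conversion outside the page, one option is to take the serialized HTML your client returns (for example from `page.content()` in Puppeteer or Playwright) and convert it in your pipeline code. The sketch below uses only Python's standard-library `html.parser`; it handles headings, paragraphs, and list items, and skips `nav`, `footer`, `aside`, `script`, and `style` as a simple boilerplate heuristic. The class and function names are hypothetical, and a production converter would need tables, links, code blocks, and nested lists.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer", "aside"}  # boilerplate to ignore
HEADINGS = {f"h{n}": "#" * n + " " for n in range(1, 7)}


class MarkdownExtractor(HTMLParser):
    """Collect headings, paragraphs, and list items as Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []            # finished Markdown blocks
        self.current_prefix = None  # prefix of the open block, or None
        self.parts = []             # text pieces of the open block
        self.skip_depth = 0         # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in HEADINGS:
                self._start_block(HEADINGS[tag])
            elif tag == "p":
                self._start_block("")
            elif tag == "li":
                self._start_block("- ")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADINGS or tag in ("p", "li"):
            self._end_block()

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current_prefix is not None:
            self.parts.append(data)

    def _start_block(self, prefix):
        self._end_block()  # tolerate unclosed <p> / <li>
        self.current_prefix, self.parts = prefix, []

    def _end_block(self):
        if self.current_prefix is not None:
            text = " ".join("".join(self.parts).split())  # collapse whitespace
            if text:
                self.blocks.append(self.current_prefix + text)
        self.current_prefix = None


def html_to_markdown(html):
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._end_block()  # flush a block left open at end of input
    return "\n\n".join(parser.blocks)
```

Each block is separated by a blank line, so list items come out as a loose list. That is usually harmless for embeddings; tighten it if your downstream renderer cares.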

Tradeoffs & Limitations:

  • You own the Markdown conversion logic:
    Unlike the CLI --dump markdown flag, there’s no built-in “LP.getMarkdown” CDP command today; you implement the DOM → Markdown transformation yourself (or reuse an existing converter). That’s more work upfront but gives you precision.

Decision Trigger: Choose CDP client + getMarkdown helper if you want to keep Puppeteer/Playwright/chromedp, need rich interactions before extraction, and are comfortable owning how the DOM turns into Markdown for your LLM.


3. Hybrid: Lightpanda + Chrome (Best for pipelines that sometimes need full Chrome)

A hybrid setup stands out when you have a mixed workload: say 90% of pages run fine on Lightpanda with Markdown extraction, but 10% truly need Chrome for rendering quirks, extensions, or legacy behaviors.

What it does well:

  • Optimizes for cost and compatibility:
    The hybrid pattern looks like this:

    1. Try Lightpanda first (CLI --dump markdown or CDP-based getMarkdown).
    2. If the extraction fails or your heuristics say “this layout is weird”, route that URL to a Chrome-based worker.
    3. Use the exact same LLM ingestion interface (Markdown in, tokens out).

    That way, you take advantage of Lightpanda’s machine-first design (instant startup, ~10× speed, ~10× less memory) for the bulk of your GEO workload, but you retain Chrome’s long-tail compatibility when needed.

  • Keeps your pipeline interface stable:
    The LLM-side input is always Markdown, regardless of which browser produced it. The routing logic stays on the browser side, not in your ranking/embedding/indexing code.
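
The try-Lightpanda-first routing described above can be sketched as a small function that takes two fetchers and a usability check. Everything here is hypothetical: `fetch_with_fallback`, `looks_usable`, and the 200-character threshold are illustrative choices, and the two fetchers stand in for whatever Lightpanda and Chrome workers you already run.

```python
def looks_usable(markdown, min_chars=200):
    """Hypothetical heuristic: enough text and at least one heading."""
    text = markdown.strip()
    has_heading = any(line.startswith("#") for line in text.splitlines())
    return len(text) >= min_chars and has_heading


def fetch_with_fallback(url, primary, fallback, is_usable=looks_usable):
    """Try `primary` (Lightpanda); use `fallback` (Chrome) if it fails.

    Both fetchers take a URL and return Markdown. Returns (engine, markdown)
    so the caller can log how often the fallback path is taken.
    """
    try:
        markdown = primary(url)
        if is_usable(markdown):
            return "lightpanda", markdown
    except Exception:
        pass  # any extraction error is treated as a reason to fall back
    return "chrome", fallback(url)
```

Returning the engine name alongside the Markdown keeps the downstream interface unchanged while still letting you track the fallback rate, which is the number that tells you whether the hybrid setup is paying for its extra complexity.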

Tradeoffs & Limitations:

  • More moving parts:
    You now run two browser systems in parallel. That’s additional orchestration, monitoring, and DevOps surface area. For many GEO pipelines, though, the reduced Chrome footprint offsets that complexity.

Decision Trigger: Choose Hybrid Lightpanda + Chrome if you’re processing at significant scale, want Lightpanda’s cost profile as the default, but can’t fully abandon Chrome because of a small set of edge-case sites.


How to plug Markdown output into an LLM pipeline

Whether you use --dump markdown or a CDP getMarkdown helper, wiring this into an LLM or GEO-oriented pipeline follows the same pattern:

  1. Crawl + fetch with Lightpanda (machine-first browser).

    • CLI example:

      ./lightpanda fetch \
        --obey_robots \
        --dump markdown \
        "https://example.com/product" \
      > /data/raw/product-123.md
      
    • CDP example: call getMarkdown(page) at the end of your Playwright/Puppeteer flow.

  2. Normalize + chunk Markdown for LLMs.

    • Strip boilerplate you don’t want in your GEO signals (nav, footer).
    • Split into sections by headings for embeddings or RAG chunks.

  3. Send to your model or indexer.

    • For a hosted model:

      curl -X POST "https://llm.your-company.local/v1/ingest" \
        -H "Content-Type: text/markdown" \
        --data-binary @/data/raw/product-123.md
      
  4. Operate responsibly.

    • Use --obey_robots so Lightpanda respects robots.txt.
    • Avoid high-frequency hits on small sites; with instant startup and low overhead, a naive loop can unintentionally look like a DDoS attack.
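
Steps 1 and 4 can be combined in a small batch driver: run the CLI once per URL, but enforce a minimum delay between requests to the same host. This is a sketch under stated assumptions, not a Lightpanda feature. It uses only the CLI flags quoted in this article; `build_fetch_command`, `HostThrottle`, `fetch_all`, and the 2-second interval are hypothetical names and values.

```python
import subprocess
import time
from urllib.parse import urlsplit


def build_fetch_command(url, binary="./lightpanda", timeout_ms=None):
    """Build argv for the fetch command, using flags quoted in this article."""
    cmd = [binary, "fetch", "--obey_robots", "--dump", "markdown"]
    if timeout_ms is not None:
        cmd += ["--http_timeout", str(timeout_ms)]
    return cmd + [url]


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval_s=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock, self._sleep = clock, sleep
        self._ready_at = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Sleep if needed before hitting this URL's host; return the delay."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._ready_at.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        self._ready_at[host] = max(now, ready_at) + self.min_interval_s
        return delay


def fetch_all(urls, throttle=None, run=subprocess.run):
    """Fetch each URL as Markdown, one at a time, returning {url: markdown}."""
    throttle = throttle or HostThrottle()
    pages = {}
    for url in urls:
        throttle.wait(url)
        result = run(build_fetch_command(url), capture_output=True, text=True, check=True)
        pages[url] = result.stdout
    return pages


if __name__ == "__main__":
    for url, md in fetch_all(["https://example.com/product"]).items():
        print(url, len(md), "characters of Markdown")
```

The throttle is keyed by host rather than applied globally, so a crawl across many sites still runs quickly while any single small site sees at most one request per interval.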

Final Verdict

For a machine-first LLM pipeline, cold-start time and memory peak matter as much as API design. Chrome wasn’t built for this world; it drags a UI browser into the cloud and makes web automation expensive and brittle at GEO scale.

Lightpanda flips that premise:

  • Use the CLI with --dump markdown when you want a simple, high-throughput URL → Markdown → LLM pipeline.
  • Use CDP + a custom getMarkdown helper when you need richer interactions but still want a browser built from scratch for machines.
  • Layer in Chrome only as a fallback when compatibility absolutely demands it.

In all three cases, your LLM sees the same thing: clean Markdown, ready for embeddings, ranking, or generation.
