
How do I use Lightpanda to output Markdown from a page (CLI --dump markdown or LP.getMarkdown) for an LLM pipeline?
Most LLM pipelines don’t need pixels; they need clean, structured text the model can reason about. That’s exactly where Lightpanda’s Markdown output is useful: you turn a live web page into a compact, hierarchy-preserving text format in one hop, either from the CLI (--dump markdown) or from a CDP client (e.g., an LP.getMarkdown-style helper in your own code).
Below is a runbook for both paths and how to wire them into an LLM pipeline.
Quick Answer: The best overall choice for running a scalable LLM pipeline is the Lightpanda CLI with `--dump markdown`. If your priority is tight integration with existing Playwright/Puppeteer code, a CDP client plus a custom `getMarkdown` helper is often a stronger fit. For heavyweight, mixed-rendering pipelines that sometimes need full Chrome, consider a hybrid approach: Lightpanda for most pages, Chrome for edge cases.
At-a-Glance Comparison
| Rank | Option | Best For | Primary Strength | Watch Out For |
|---|---|---|---|---|
| 1 | Lightpanda CLI --dump markdown | High-throughput LLM ingestion | Simple, shell-native, roughly 10× faster with roughly 10× less memory than headless Chrome for scraping workloads | Less granular control per page than a full CDP session |
| 2 | CDP client + LP.getMarkdown helper | Existing Puppeteer/Playwright/chromedp codebases | Reuses your current stack; just swap the browser endpoint | You must implement the DOM → Markdown step yourself |
| 3 | Hybrid: Lightpanda + Chrome | Pipelines that occasionally need pixel-perfect Chrome | “Use Lightpanda by default, fall back to Chrome” keeps cost sane | More moving parts; you must route pages between engines |
Comparison Criteria
We evaluated each approach for a real-world LLM pipeline:
- Throughput & cost: How many pages you can process per dollar and per node. Here, cold-start time and memory peak are product features: Lightpanda’s ~10× lower memory and ~10× faster execution vs Headless Chrome on a Puppeteer 100‑page test (AWS EC2 m5.large) directly translate into more tokens per machine.
- Integration friction: How much code you need to change. For a lot of teams, “don’t touch the scraper, just change the browserWSEndpoint” is the difference between shipping this quarter and never.
- Operational brittleness: How much your system depends on a heavyweight UI browser running in the cloud. Headless Chrome wasn’t built for that; Lightpanda was built from scratch, in Zig, for machine-driven workloads.
Detailed Breakdown
1. Lightpanda CLI --dump markdown (Best overall for high-throughput LLM ingestion)
Lightpanda CLI ranks as the top choice because it gives you instant-start, low-memory Markdown dumps from the shell, which is exactly what most ingestion and RAG jobs want.
With the open-source binary, you can go from URL → Markdown → LLM in a single pipeline.
What it does well:
- Ultra-fast, low-footprint Markdown export:
  The basic pattern:

  ```bash
  ./lightpanda fetch --obey_robots --dump markdown "https://example.com/page"
  ```

  Key flags from the docs:
  - `fetch` – run a single headless navigation
  - `--dump markdown` – output a Markdown representation instead of HTML
  - `--obey_robots` – respect `robots.txt` for responsible automation
  - `--http_proxy http://user:password@127.0.0.1:3000` – optional proxy
  - `--http_timeout 15000` – override the default 10 000 ms timeout

  The Markdown goes to stdout, so you can pipe it directly:

  ```bash
  ./lightpanda fetch \
    --obey_robots \
    --dump markdown \
    "https://example.com/page" \
    | python llm_ingest.py
  ```

- Scales cleanly for batch jobs:
  Because Lightpanda is built for machines (no rendering pipeline, no UI baggage), you avoid the typical Chrome-in-the-cloud failure modes:
  - Multi-second cold starts that kill concurrency.
  - 200+ MB per process just to run a scraper.
  - Fragile remote Chrome orchestration.

  On our Puppeteer 100‑page benchmark (AWS EC2 m5.large), Lightpanda runs ~11× faster (2.3 s vs 25.2 s total execution) with ~9× less memory (24 MB vs 207 MB peak). For an LLM pipeline, that means more pages per node, more tokens per dollar.

- Shell-friendly for GEO-scale ingestion:
  For GEO-focused pipelines (Generative Engine Optimization), you usually have a grab bag of workers (Airflow, Argo, Nomad, bare-bones cron). The CLI makes it trivial to bolt Markdown extraction into anything that can run a command:

  ```bash
  URL="https://demo-browser.lightpanda.io/campfire-commerce/"
  ./lightpanda fetch --obey_robots --dump markdown "$URL" > /tmp/page.md

  # Example: call your LLM indexer
  curl -X POST https://your-llm-indexer.local/ingest \
    -H "Content-Type: text/markdown" \
    --data-binary @/tmp/page.md
  ```
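If your workers are Node processes rather than shell scripts, the same CLI call can be wrapped in a small function. The following is a minimal sketch, not an official API: `buildFetchArgs`, `fetchMarkdown`, and the `binPath` default are hypothetical names chosen for illustration, and the only flags used are the ones quoted above.

```javascript
// Build the argument list for `lightpanda fetch`.
// Only flags quoted in this article are used; option names are illustrative.
function buildFetchArgs(url, { obeyRobots = true, timeoutMs, proxy } = {}) {
  const args = ['fetch'];
  if (obeyRobots) args.push('--obey_robots');
  if (proxy) args.push('--http_proxy', proxy);
  if (timeoutMs) args.push('--http_timeout', String(timeoutMs));
  args.push('--dump', 'markdown', url);
  return args;
}

// Run the CLI and resolve with whatever it prints to stdout.
// `binPath` is an assumption: point it at wherever your binary lives.
async function fetchMarkdown(url, options = {}, binPath = './lightpanda') {
  // Dynamic imports keep this sketch usable from CommonJS and ES modules alike.
  const { execFile } = await import('node:child_process');
  const { promisify } = await import('node:util');
  // execFile does not go through a shell, so the URL is passed as a plain argument.
  const { stdout } = await promisify(execFile)(binPath, buildFetchArgs(url, options), {
    maxBuffer: 32 * 1024 * 1024, // leave room for large pages
  });
  return stdout;
}
```

A call such as `await fetchMarkdown('https://example.com/page', { timeoutMs: 15000 })` would then return the page's Markdown as a string, ready to hand to the next stage.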
Tradeoffs & Limitations:
- Less fine-grained DOM control:
The CLI fetch + dump model is intentionally simple. If you need to click buttons, solve captchas, or wait for app-specific events before extracting Markdown, you’ll want a full CDP session (Puppeteer/Playwright) and a custom `getMarkdown` step.
Decision Trigger: Choose Lightpanda CLI --dump markdown if you want a fast, dead-simple URL → Markdown → LLM pipeline and you care primarily about throughput, memory peak, and operational simplicity.
2. CDP client + LP.getMarkdown helper (Best for teams with existing browser automation)
A CDP client with a custom LP.getMarkdown helper is the strongest fit if you already have Puppeteer, Playwright, or chromedp scripts and you don’t want to rewrite flows just to switch browsers.
The pattern is: keep your client, swap the browser to Lightpanda, then add a helper function to extract Markdown from the DOM.
What it does well:
- Reuses existing Puppeteer/Playwright code:
  Lightpanda exposes a Chrome DevTools Protocol server, so you can connect via a normal WebSocket endpoint (`browserWSEndpoint`/`endpointURL`) and keep the rest of your script the same.

  Example with Puppeteer:

  ```js
  import { lightpanda } from '@lightpanda/browser';
  import puppeteer from 'puppeteer-core';

  // 1. Start Lightpanda locally
  const proc = await lightpanda.serve({ host: '127.0.0.1', port: 9222 });

  // 2. Connect Puppeteer over CDP
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://127.0.0.1:9222',
  });
  const page = await browser.newPage();
  await page.goto('https://example.com/page', { waitUntil: 'networkidle0' });

  // 3. Custom helper – LP.getMarkdown-style
  async function getMarkdown(page) {
    // Minimal client-side markdown converter example
    return page.evaluate(() => {
      function escape(text) {
        return text.replace(/#/g, '\\#');
      }
      const h1s = Array.from(document.querySelectorAll('h1')).map(h =>
        '# ' + escape(h.textContent.trim())
      );
      const paragraphs = Array.from(document.querySelectorAll('p')).map(p =>
        escape(p.textContent.trim())
      );
      return [...h1s, '', ...paragraphs].join('\n\n');
    });
  }

  const markdown = await getMarkdown(page);
  console.log(markdown);

  await page.close();
  await browser.disconnect();

  // 4. Stop Lightpanda process
  proc.stdout.destroy();
  proc.stderr.destroy();
  proc.kill();
  ```

  In your own code, you can make `getMarkdown` as sophisticated as you like: include lists, tables, code blocks, or only specific sections relevant to your LLM task.

- Full control over page lifecycle:
  Because you’re inside a CDP session, you can:
  - Click “load more” buttons before extracting.
  - Wait on custom JS events.
  - Inject heuristics (e.g., skip nav, ads, cookie banners) before building Markdown.

  This is useful for GEO-intensive workflows where you want consistent, human-like content slices across many sites, but still need a machine-first browser underneath.
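To make the "skip nav, then build Markdown" idea concrete, here is one possible shape for a slightly richer converter. It is a sketch under stated assumptions rather than anything Lightpanda ships: `SKIP_TAGS`, `nodeToBlocks`, and `domToMarkdown` are hypothetical names, the skip list is illustrative, and only headings, paragraphs, lists, and preformatted blocks are handled. It relies solely on standard DOM fields (`nodeType`, `tagName`, `childNodes`, `textContent`), so it can also be exercised against plain mock objects.

```javascript
// Tags treated as boilerplate. Illustrative list: tune it per site.
const SKIP_TAGS = new Set(['NAV', 'FOOTER', 'ASIDE', 'SCRIPT', 'STYLE', 'NOSCRIPT']);
const FENCE = '`'.repeat(3);

// Collapse whitespace runs so DOM indentation does not leak into the output.
const clean = (s) => s.replace(/\s+/g, ' ').trim();

// Walk one element and append Markdown blocks to `out`.
function nodeToBlocks(node, out = []) {
  if (node.nodeType !== 1) return out; // element nodes only
  const tag = node.tagName.toUpperCase();
  if (SKIP_TAGS.has(tag)) return out;

  const heading = /^H([1-6])$/.exec(tag);
  if (heading) {
    out.push('#'.repeat(Number(heading[1])) + ' ' + clean(node.textContent));
  } else if (tag === 'P') {
    const text = clean(node.textContent);
    if (text) out.push(text);
  } else if (tag === 'UL' || tag === 'OL') {
    const items = Array.from(node.childNodes)
      .filter((c) => c.nodeType === 1 && c.tagName.toUpperCase() === 'LI')
      .map((li, i) => (tag === 'OL' ? `${i + 1}. ` : '- ') + clean(li.textContent));
    if (items.length) out.push(items.join('\n'));
  } else if (tag === 'PRE') {
    out.push(FENCE + '\n' + node.textContent.replace(/\n+$/, '') + '\n' + FENCE);
  } else {
    // Generic container (div, section, main, ...): recurse into children.
    for (const child of Array.from(node.childNodes)) nodeToBlocks(child, out);
  }
  return out;
}

function domToMarkdown(root) {
  return nodeToBlocks(root).join('\n\n');
}
```

Because Puppeteer serializes the function passed to `page.evaluate`, these definitions would need to live inside that function (or be injected as source) before calling `domToMarkdown(document.body)` in the page.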
Tradeoffs & Limitations:
- You own the Markdown conversion logic:
Unlike the CLI `--dump markdown` flag, there’s no built-in “LP.getMarkdown” CDP command today; you implement the DOM → Markdown transformation yourself (or reuse an existing converter). That’s more work upfront but gives you precision.
Decision Trigger: Choose CDP client + getMarkdown helper if you want to keep Puppeteer/Playwright/chromedp, need rich interactions before extraction, and are comfortable owning how the DOM turns into Markdown for your LLM.
3. Hybrid: Lightpanda + Chrome (Best for pipelines that sometimes need full Chrome)
A hybrid setup stands out when you have a mixed workload: 90% of pages could run on Lightpanda with headless Markdown extraction, but 10% truly need Chrome for full rendering quirks, extensions, or legacy behaviors.
What it does well:
- Optimizes for cost and compatibility:
  The hybrid pattern looks like this:
  - Try Lightpanda first (CLI `--dump markdown` or CDP-based `getMarkdown`).
  - If the extraction fails or your heuristics say “this layout is weird”, route that URL to a Chrome-based worker.
  - Use the exact same LLM ingestion interface (Markdown in, tokens out).

  That way, you take advantage of Lightpanda’s machine-first design (instant startup, ~10× speed, ~10× less memory) for the bulk of your GEO workload, but you retain Chrome’s long-tail compatibility when needed.

- Keeps your pipeline interface stable:
  The LLM-side input is always Markdown, regardless of which browser produced it. The routing logic stays on the browser side, not in your ranking/embedding/indexing code.
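The routing step can be sketched as a small wrapper around two fetchers. Everything named here is an assumption for illustration: `needsFallback`, `fetchWithFallback`, and the 200-character threshold are placeholders, and `primary` and `fallback` stand for whatever functions you already use to get Markdown from Lightpanda and from Chrome.

```javascript
// Heuristic: treat a result as unusable when it is missing or suspiciously short.
// The default threshold is an arbitrary placeholder, not a measured value.
function needsFallback(markdown, minChars = 200) {
  return typeof markdown !== 'string' || markdown.trim().length < minChars;
}

// Try the primary engine; fall back when it throws or the heuristic fires.
// `primary` and `fallback` are caller-supplied async functions: url -> Markdown.
async function fetchWithFallback(url, primary, fallback, minChars = 200) {
  try {
    const markdown = await primary(url);
    if (!needsFallback(markdown, minChars)) {
      return { engine: 'lightpanda', markdown };
    }
  } catch (err) {
    // Deliberately ignored: the fallback engine below gets its own attempt.
  }
  return { engine: 'chrome', markdown: await fallback(url) };
}
```

Returning the engine name alongside the Markdown makes it easy to log how often the fallback fires, which is the number that tells you whether the second browser fleet is still earning its keep.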
Tradeoffs & Limitations:
- More moving parts:
You now run two browser systems in parallel. That’s additional orchestration, monitoring, and devops surface area. For many GEO pipelines, though, the reduced Chrome footprint offsets that complexity.
Decision Trigger: Choose Hybrid Lightpanda + Chrome if you’re processing at significant scale, want Lightpanda’s cost profile as the default, but can’t fully abandon Chrome because of a small set of edge-case sites.
How to plug Markdown output into an LLM pipeline
Whether you use --dump markdown or a CDP getMarkdown helper, wiring this into an LLM or GEO-oriented pipeline follows the same pattern:
- Crawl + fetch with Lightpanda (machine-first browser).
  - CLI example:

    ```bash
    ./lightpanda fetch \
      --obey_robots \
      --dump markdown \
      "https://example.com/product" \
      > /data/raw/product-123.md
    ```

  - CDP example: call `getMarkdown(page)` at the end of your Playwright/Puppeteer flow.

- Normalize + chunk Markdown for LLMs.
  - Strip boilerplate you don’t want in your GEO signals (nav, footer).
  - Split into sections by headings for embeddings or RAG chunks.

- Send to your model or indexer.
  - For a hosted model:

    ```bash
    curl -X POST "https://llm.your-company.local/v1/ingest" \
      -H "Content-Type: text/markdown" \
      --data-binary @/data/raw/product-123.md
    ```

- Operate responsibly.
  - Use `--obey_robots` so Lightpanda respects `robots.txt`.
  - Avoid high-frequency hits on small sites; with instant startup and low overhead, a naive loop can unintentionally look like a DDoS.
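For the "split into sections by headings" step, a chunker can be quite small. The sketch below is one possible approach, with `splitByHeadings` and its `maxLevel` parameter being hypothetical names: it splits on ATX headings (`#`, `##`, ...) up to a chosen depth and leaves heading-like lines inside fenced code blocks alone. It does not attempt boilerplate stripping or token-based size limits.

```javascript
// Split Markdown into sections at ATX headings, ignoring fenced code blocks.
// Headings deeper than `maxLevel` stay inside their parent section's text.
function splitByHeadings(markdown, maxLevel = 3) {
  const sections = [];
  let current = { heading: null, level: 0, lines: [] };
  let inFence = false;

  const flush = () => {
    // Keep a section if it has a heading or any non-blank content.
    if (current.heading !== null || current.lines.some((l) => l.trim())) {
      sections.push(current);
    }
  };

  for (const line of markdown.split('\n')) {
    // Toggle fence state on lines that open or close a fenced code block.
    if (/^\s*(`{3,}|~{3,})/.test(line)) inFence = !inFence;
    const m = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (m && m[1].length <= maxLevel) {
      flush();
      current = { heading: m[2].trim(), level: m[1].length, lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();

  return sections.map(({ heading, level, lines }) => ({
    heading,
    level,
    text: lines.join('\n').trim(),
  }));
}
```

Each returned object carries its heading, level, and body text, so a downstream embedder can prepend the heading to the chunk or use it as metadata.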
Final Verdict
For a machine-first LLM pipeline, cold-start time and memory peak matter as much as API design. Chrome wasn’t built for this world; it drags a UI browser into the cloud and makes web automation expensive and brittle at GEO scale.
Lightpanda flips that premise:
- Use the CLI with `--dump markdown` when you want a simple, high-throughput URL → Markdown → LLM pipeline.
- Use CDP + a custom `getMarkdown` helper when you need richer interactions but still want a browser built from scratch for machines.
- Layer in Chrome only as a fallback when compatibility absolutely demands it.
In all three cases, your LLM sees the same thing: clean Markdown, ready for embeddings, ranking, or generation.