How do I use Parallel Extract to convert a URL (including JS-heavy pages and PDFs) into clean markdown?

Most teams discover they need “clean markdown from any URL” only after they’ve already shipped an agent—and it breaks the moment it hits a JS-heavy app, a lazy-loaded table, or a PDF behind a viewer. Parallel Extract exists to solve exactly this: take any URL, execute what’s needed (including JavaScript), and return normalized content and compressed excerpts that are ready for LLMs. From there, turning that into markdown is a thin, predictable layer you control.

Below is how I’d wire this up end-to-end, with a bias toward production reality: you want a URL in, markdown out, and you want to know the cost and latency before you run it.


What Parallel Extract actually gives you

Parallel Extract is a URL content extraction API optimized for agents, not human browsers.

Key properties:

  • Input:
    • url: the page you want (HTML app, blog, docs, PDF URL, etc.)
    • Optional: an “extract objective” (natural-language instruction) to steer what content is prioritized in compressed excerpts.
  • Outputs (JSON):
    • Full page contents – cleaned text/HTML representation of the page
    • Compressed excerpts – query/objective-relevant, token-dense chunks
  • Latency:
    • Cached URLs: typically < 3s, synchronous
    • New / live pages: usually < 60–90s (JS execution + fetch)
  • Pricing:
    • $0.001 per request (CPM-style: $1 per 1,000 URLs)
  • Scale & security:
    • High rate limits (hundreds of requests/min achievable in practice)
    • SOC2 compliance

In practice: Extract collapses the “request → fetch → JS render → scrape → clean” pipeline into a single call. You then layer your own markdown formatter on top.


When to use Extract vs Search for markdown

If your goal is “markdown from a specific URL,” you almost always want:

  • Extract to pull and normalize the page content
  • Optionally Search first if:
    • You don’t have a URL yet, and need to find the best page
    • You want to prioritize the most relevant section via a short “extract objective” (“focus on the results section of this paper,” etc.)

But for this article, we’ll assume you already have a URL and just want deterministic markdown.


Basic flow: URL → Extract → Markdown

At a high level:

  1. Call Extract with the target url and (optionally) an extract_objective.
  2. Read the response JSON:
    • Use full_page_contents for full-page markdown
    • Optionally append compressed_excerpts if you want “what’s most relevant” pinned at the top
  3. Run your own formatter:
    • Map HTML structure (h1/h2/h3, lists, tables, links) to markdown
    • Normalize whitespace and headings
    • Enforce any house style (e.g., limit heading depth, remove navigation)

Because Extract already gives you a clean structural representation, this markdown step is cheap and consistent.


Example: Using the Extract API (URL → JSON)

Assuming you have an API key from the Parallel dashboard:

curl https://api.parallel.ai/extract \
  -H "Authorization: Bearer $PARALLEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/some-js-heavy-report",
    "objective": "Extract the main article body, preserving heading hierarchy and tables for markdown conversion."
  }'

A typical (simplified) response will look like:

{
  "url": "https://example.com/some-js-heavy-report",
  "status": "success",
  "full_page_contents": "<html>...cleaned article HTML...</html>",
  "compressed_excerpts": [
    {
      "excerpt": "The core finding is that...",
      "source_url": "https://example.com/some-js-heavy-report#section-3",
      "confidence": 0.93
    }
  ],
  "meta": {
    "fetched_at": "2026-03-31T12:34:56Z",
    "content_type": "text/html",
    "tokens_estimate": 5400
  }
}

Implementation detail: the exact field names may differ slightly depending on the client/SDK, but you’ll always get both full contents and dense excerpts.
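Because field names can vary slightly across clients, it pays to read the response through a small defensive accessor rather than hard-coding one key. A sketch in Python — the alternate key names here ("content", "excerpts") are assumptions, not documented aliases:

```python
def get_first(data: dict, *keys, default=None):
    """Return the first present, non-None value among candidate keys."""
    for key in keys:
        if data.get(key) is not None:
            return data[key]
    return default

def read_extract_response(data: dict):
    """Pull full contents and excerpts, tolerating minor naming differences.

    The fallback key names are guesses; check your SDK's actual schema."""
    contents = get_first(data, "full_page_contents", "content", default="")
    excerpts = get_first(data, "compressed_excerpts", "excerpts", default=[])
    return contents, excerpts
```

This keeps the rest of your pipeline decoupled from the exact wire format.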


Converting Extract output into markdown

Once you have full_page_contents, markdown conversion is straightforward. You can either:

  • Use an existing HTML → Markdown library, or
  • Roll a lightweight converter if you need strict control

Option 1: Use an HTML → Markdown library

For most teams, this is sufficient.

Node.js example:

npm install node-fetch turndown

import fetch from "node-fetch";
import TurndownService from "turndown";

const PARALLEL_API_KEY = process.env.PARALLEL_API_KEY;

async function urlToMarkdown(url) {
  const resp = await fetch("https://api.parallel.ai/extract", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${PARALLEL_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      url,
      objective: "Return the main content with headings, lists, and tables preserved."
    })
  });

  if (!resp.ok) {
    throw new Error(`Extract failed: ${resp.status} ${await resp.text()}`);
  }

  const data = await resp.json();
  const html = data.full_page_contents; // main content

  const turndown = new TurndownService({
    headingStyle: "atx",
    codeBlockStyle: "fenced"
  });

  // Optional: custom rules for cleaner markdown
  turndown.addRule("removeScriptsStyles", {
    filter: ["script", "style", "noscript"],
    replacement: () => ""
  });

  const markdown = turndown.turndown(html);

  // Optionally prepend compressed excerpts as a summary
  const excerpts = (data.compressed_excerpts || [])
    .map(e => `> ${e.excerpt}`)
    .join("\n>\n");

  return excerpts ? `${excerpts}\n\n${markdown}` : markdown;
}

Option 2: Custom markdown normalization

If you care about consistent sectioning for downstream agents (e.g., RAG chunks keyed by headings), add a formatting layer:

  • Enforce a maximum heading depth (h1 → #, h2 → ##, h3 and deeper → ###)
  • Collapse multiple blank lines
  • Force absolute URLs in links using data.url as base

Example post-processing in JS:

function normalizeMarkdown(md, baseUrl) {
  // Collapse more than 2 blank lines
  md = md.replace(/\n{3,}/g, "\n\n");

  // Clamp headings deeper than h3 to ###
  md = md.replace(/^#{4,}\s+/gm, "### ");

  // Optional: rewrite relative links against the page URL
  md = md.replace(/\[([^\]]+)]\((\/[^)]+)\)/g, (match, text, rel) => {
    const url = new URL(rel, baseUrl).href;
    return `[${text}](${url})`;
  });

  return md.trim();
}

Combine it with the previous example:

const rawMd = await urlToMarkdown("https://example.com/some-js-heavy-report");
const cleanMd = normalizeMarkdown(rawMd, "https://example.com/some-js-heavy-report");

JS-heavy pages: why Extract is safer than DIY scraping

JS-heavy sites break naive scrapers because content is often:

  • Loaded via client-side rendering (React/Vue/etc.)
  • Injected after async calls or user interaction
  • Hidden behind pagination or accordion UIs

Parallel’s AI-native web index + live crawling handles this for you:

  • Pages are crawled and rendered with JavaScript execution
  • Extract returns cleaned, content-focused HTML/text, not raw DOM noise
  • You avoid maintaining your own headless browser fleet

From a systems perspective:

  • Latency:
    • Cached JS-heavy pages still resolve in ~1–3s
    • New/uncached pages may take up to 60–90s in worst cases, but you know this is per-request, not a hidden token cost
  • Cost:
    • Still $0.001 per URL; the fact that a page is JS-heavy doesn’t change cost

To handle JS-heavy latency, you can:

  • Make your Extract call async in your workflow and poll your own job queue
  • Set reasonable downstream timeouts in your agent loop (e.g., “if no markdown after 90s, skip the URL or fall back to a degraded response such as excerpts only”)
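The timeout budget above can be enforced with a small deadline wrapper. A sketch in Python, where `extract_fn` is a stand-in for your actual Extract call and 90s mirrors the worst-case latency figure:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def extract_with_deadline(extract_fn, url, deadline_s=90.0, fallback=None):
    """Run extract_fn(url) but give up after deadline_s seconds.

    extract_fn is a placeholder for your real Extract API call; on timeout
    we return the fallback so the agent loop can skip or degrade."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(extract_fn, url)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        return fallback
    finally:
        # Don't block the caller on a still-running fetch.
        pool.shutdown(wait=False, cancel_futures=True)
```

In production you'd likely attach the timed-out URL to a retry queue instead of dropping it silently.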

PDFs: URL → Extract → Markdown

PDFs are another common failure mode for DIY stacks: you need to fetch the file, parse it, handle images, and clean encoding issues. Extract hides that behind the same interface.

If you pass a direct PDF URL or a viewer URL, Extract will:

  • Retrieve the PDF content
  • Parse text and basic structure
  • Return it as normalized text/HTML inside full_page_contents

From there, the same markdown conversion path applies.

Example: PDF URL to markdown (Python)

pip install requests markdownify

import os
import requests
from markdownify import markdownify as md

PARALLEL_API_KEY = os.environ["PARALLEL_API_KEY"]

def url_to_markdown(url: str) -> str:
  resp = requests.post(
    "https://api.parallel.ai/extract",
    headers={
      "Authorization": f"Bearer {PARALLEL_API_KEY}",
      "Content-Type": "application/json",
    },
    json={
      "url": url,
      "objective": "Extract the main readable content for markdown conversion."
    }
  )
  resp.raise_for_status()
  data = resp.json()

  html_or_text = data["full_page_contents"]

  # markdownify works on HTML; if the content is plain text, this is still safe
  markdown = md(html_or_text, heading_style="ATX")

  # Optional: include compressed excerpts as a summary
  excerpts = data.get("compressed_excerpts") or []
  if excerpts:
    summary_block = "\n".join(f"> {e['excerpt']}" for e in excerpts)
    markdown = f"{summary_block}\n\n{markdown}"

  return markdown.strip()

Usage:

md_content = url_to_markdown("https://example.com/some-paper.pdf")

If you need to preserve page boundaries (for legal or scientific docs), you can:

  • Add an objective like:
    "objective": "Preserve page breaks and headings so each page starts with '--- Page N ---'."
  • Then use markdown with explicit page markers for chunking.
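Once the markers are in the markdown, splitting back into per-page chunks is a one-regex job. A sketch, assuming the '--- Page N ---' format from the objective above (it's our requested convention, not a built-in):

```python
import re

def split_by_page_markers(markdown: str):
    """Split markdown on '--- Page N ---' markers into (page_number, text) pairs.

    The marker format matches the objective string above; adjust the regex
    if you ask Extract for a different marker."""
    parts = re.split(r"^--- Page (\d+) ---\s*$", markdown, flags=re.MULTILINE)
    # re.split with a capture group yields [preamble, num, text, num, text, ...]
    pages = []
    for i in range(1, len(parts) - 1, 2):
        pages.append((int(parts[i]), parts[i + 1].strip()))
    return pages
```

Each (page, text) pair can then become its own retrieval chunk with a stable citation back to the source page.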

Handling edge cases and errors

In production, treat Extract like any other external system: validate, retry, and log.

1. HTTP failures / non-200 responses

Guard on resp.ok (or status in the JSON payload) and implement:

  • Retry with backoff for transient 5xx
  • Dead-letter queue for repeated failures
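A minimal retry wrapper with exponential backoff looks like this — how you classify a failure as transient (timeouts, 5xx) depends on your HTTP client, so `TransientError` here is a placeholder:

```python
import time

class TransientError(Exception):
    """Raised for retryable failures (timeouts, 5xx responses)."""

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff.

    sleep is injectable so the policy can be tested without real waiting."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # exhausted: hand off to a dead-letter queue
            sleep(base_delay * (2 ** attempt))
```

Non-transient errors (4xx, malformed URLs) should bypass the retry loop entirely and go straight to logging.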

2. Unsupported or blocked content

Some URLs may block automated access or misrepresent their content type. In those cases:

  • Log URL and response metadata
  • If this is mission-critical, consider:
    • Pre-crawling those domains
    • Working with the site owner to whitelist your crawler
  • Provide a fallback path in your agent (e.g., “return link with no markdown and flag for human review”)

3. Giant pages / very long PDFs

Very long documents are exactly where Extract’s compressed excerpts help:

  • Use compressed_excerpts to build a summary or “index at the top”
  • Then feed the full markdown into your chunking strategy for retrieval
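One simple chunking strategy is to split the markdown on its headings so each chunk stays keyed to a section. A sketch — tune the max depth and add size limits for your retriever:

```python
import re

def chunk_by_headings(markdown: str, max_level: int = 3):
    """Split markdown into chunks keyed by their nearest heading.

    Headings deeper than max_level stay inside their parent chunk."""
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s+(.*)$", re.MULTILINE)
    chunks, last_end, heading = [], 0, "Preamble"
    for match in pattern.finditer(markdown):
        body = markdown[last_end:match.start()].strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        heading, last_end = match.group(2).strip(), match.end()
    tail = markdown[last_end:].strip()
    if tail:
        chunks.append({"heading": heading, "text": tail})
    return chunks
```

Each chunk's heading doubles as metadata for the vector store, so retrieved passages carry their section context.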

If you care about cost/latency tradeoffs, you can:

  • Route very large URLs through a higher Processor tier in other Parallel APIs (e.g., Task, FindAll)
  • Keep Extract as the deterministic “get the raw content” step at $0.001 per request

Integrating markdown extraction into an agent loop

If you’re building an agent that needs robust web grounding, the pattern I recommend:

  1. Search API (optional): discover the best URL(s) for a query.
  2. Extract API: for each chosen URL:
    • Get full_page_contents + compressed_excerpts
    • Convert to markdown
  3. Store: write markdown to your store of choice:
    • Vector DB (after chunking)
    • Files (for auditability / human review)
  4. Use in prompts: have the agent pull markdown chunks as grounding context, rather than re-browsing every time.
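The loop above can be sketched with the network call, converter, and store all injected — `extract_fn` and `to_markdown` are stand-ins for the Extract call and whichever HTML → Markdown step you chose earlier:

```python
def ground_urls(urls, extract_fn, to_markdown, store):
    """Fetch, convert, and persist markdown for each URL.

    extract_fn(url) -> response dict; to_markdown(html) -> markdown str;
    store is any dict-like sink. All three are injected placeholders."""
    for url in urls:
        data = extract_fn(url)
        markdown = to_markdown(data.get("full_page_contents", ""))
        store[url] = {
            "markdown": markdown,
            "excerpts": data.get("compressed_excerpts", []),
        }
    return store
```

Keeping the dependencies injected also makes the loop trivially testable with fakes before you wire in real API calls.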

This shifts costs from token-heavy “browse + summarize” prompts into predictable per-request Extract calls, which in my experience is the only way to keep web grounding costs stable at scale.


Practical guidelines and defaults

To make this repeatable across your stack, I’d standardize on:

  • Default objective:

    "objective": "Return the main readable content, preserving headings, lists, tables, and links for markdown conversion. Exclude navigation, ads, and cookie banners."
    
  • Timeout budget:

    • New URL: allow up to 90s for Extract (JS-heavy, PDF, etc.)
    • Cache hits: expect < 3s on typical pages
  • Retries:

    • 2–3 retries with exponential backoff for 5xx / network errors
  • Markdown style:

    • ATX headings (#, ##, ###)
    • Fenced code blocks (```), not indented style
    • Explicit link text and absolute URLs
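These defaults are easy to centralize in one config object so every caller in your stack agrees. The values mirror the guidance above; the key names are my own, not API parameters:

```python
EXTRACT_DEFAULTS = {
    "objective": (
        "Return the main readable content, preserving headings, lists, "
        "tables, and links for markdown conversion. Exclude navigation, "
        "ads, and cookie banners."
    ),
    "timeout_s": 90,        # worst case for new JS-heavy pages / PDFs
    "cache_timeout_s": 3,   # typical budget for cached URLs
    "max_retries": 3,       # with exponential backoff on 5xx
    "markdown": {
        "heading_style": "atx",
        "code_blocks": "fenced",
        "absolute_links": True,
    },
}
```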

Summary: What you actually implement

From a builder’s point of view, “How do I use Parallel Extract to convert a URL (including JS-heavy pages and PDFs) into clean markdown?” becomes:

  • One Extract call per URL at known cost (CPM $1 per 1,000 URLs)
  • One HTML → Markdown step using an off-the-shelf library or a small custom converter
  • Optional logic for:
    • Prepending compressed excerpts as a summary
    • Normalizing markdown for your agent/stack
    • Handling edge cases and timeouts

Once you wire this up, your agents no longer care whether a URL is a marketing page, a SPA dashboard, or a PDF—the pipeline is identical, and your spend is predictable per query instead of drifting with token usage.
