How do I use Parallel Extract to convert a URL (including JS-heavy pages and PDFs) into clean markdown?

Most agents can’t reliably turn arbitrary URLs—especially JS-heavy apps and PDFs—into clean, token-efficient markdown without brittle scraping stacks. Parallel’s Extract API is designed to do exactly that in a single call: you give it a URL (and an optional extraction objective), and it returns full page contents plus dense, LLM-ready excerpts you can easily convert into markdown.

This guide walks through how to use Parallel Extract end-to-end to convert URLs into clean markdown, how it behaves on JS-heavy pages and PDFs, and how to wire it into your agent or backend.


What Parallel Extract actually gives you

Parallel Extract is a synchronous API for URL content extraction:

  • Input:
    • url – any public web URL (including JS-heavy sites and PDFs)
    • optional objective – natural-language description of what you care about
  • Output:
    • Full page contents – normalized HTML/text (no need to run your own crawler or parser)
    • Compressed excerpts – dense, query-relevant snippets optimized for LLMs
  • Performance:
    • Latency: typically <3s for cached content
    • Price: $0.001 per request (per URL)
    • Synchronous: returns results in a single HTTP call
  • Best for:
    • Turning arbitrary URLs into structured text your agent can reason over
    • Replacing ad-hoc “browse → scrape → clean” pipelines with a predictable extraction layer
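
Because pricing is flat per request, budgeting for a batch is simple arithmetic. The sketch below assumes the $0.001 figure quoted above still holds and that each URL is billed as exactly one request; the function name is illustrative, not part of any Parallel SDK.

```python
from decimal import Decimal

# Price quoted in this guide; treat it as an assumption and confirm current pricing.
PRICE_PER_REQUEST_USD = Decimal("0.001")


def estimate_extract_cost(num_urls: int) -> Decimal:
    """Estimate spend in USD, assuming one billed request per URL."""
    if num_urls < 0:
        raise ValueError("num_urls must be non-negative")
    return PRICE_PER_REQUEST_USD * num_urls


if __name__ == "__main__":
    # Decimal avoids float rounding noise when multiplying small prices.
    print(estimate_extract_cost(10_000))  # 10.000
```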

Once you have Extract’s output, converting it to markdown is straightforward post-processing: you’re working with clean text rather than raw, messy HTML or a PDF binary.


When to use Extract vs Search

You’ll typically pair Extract with Search:

  • Search API: “Find me the most relevant URLs about X.”
  • Extract API: “Given this URL, give me the contents in a form my model can use.”

Use Extract when:

  • You already know the URL (e.g., from Parallel Search, user input, or a dataset)
  • You need full-page detail or clean text from:
    • JS-heavy, client-rendered sites
    • PDF documents
    • Documentation portals, blogs, or long-form content
  • You want a predictable cost per URL (no token-metered scraping or summarization)
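
The Search and Extract pairing can be sketched as a small handoff function. This guide does not show the Search API's request shape, so both `search_fn` and `extract_fn` are hypothetical callables you supply; reusing the query as the extraction objective is one possible design choice, not a requirement.

```python
from typing import Callable, Iterable


def search_then_extract(
    query: str,
    search_fn: Callable[[str], Iterable[str]],
    extract_fn: Callable[[str, str], dict],
    max_urls: int = 3,
) -> list[dict]:
    """Find candidate URLs, then extract each distinct URL once.

    search_fn(query) yields URLs; extract_fn(url, objective) returns the
    Extract response for that URL. Both are stand-ins you provide.
    """
    seen: set[str] = set()
    results: list[dict] = []
    for url in search_fn(query):
        if len(results) >= max_urls:
            break
        if url in seen:
            continue  # skip duplicates so each URL costs one request
        seen.add(url)
        results.append(extract_fn(url, query))
    return results
```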

Core workflow: URL → Extract → markdown

At a high level, the pipeline looks like this:

  1. Call Parallel Extract with a URL (and optional objective)
  2. Read the structured response: full contents and compressed excerpts
  3. Transform contents to markdown (client-side)
  4. Feed markdown into your agent / store it / index it

Below are the details and code patterns for each step.


1. Calling Parallel Extract

Basic API request

Assuming you have an API key from the Parallel platform:

curl -X POST https://api.parallel.ai/extract \
  -H "Authorization: Bearer $PARALLEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/some-article",
    "objective": "Return the main article content and ignore nav, ads, and cookie banners."
  }'

A typical JSON response (schema simplified; the // comments are annotations and are not valid JSON) will look like:

{
  "url": "https://example.com/some-article",
  "contents": {
    "html": "<!doctype html> ...",          // normalized content
    "text": "Main article text ...",       // primary text content
    "title": "Some Article",
    "metadata": {
      "content_type": "text/html",
      "language": "en"
    }
  },
  "excerpts": [
    {
      "text": "A dense, relevant excerpt...",
      "start_index": 123,
      "end_index": 456,
      "score": 0.92
    }
  ]
}

You’ll use either:

  • contents.text as the basis for your markdown conversion, or
  • excerpts if you want a shorter, compressed markdown representation for your model.
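
For the second option, a compact markdown view can be assembled from the excerpts. This is a minimal sketch that assumes the simplified field names shown above (`excerpts[].text`, `score`, `start_index`, `contents.title`); check them against the live response before relying on them.

```python
def excerpts_to_markdown(extract_result: dict, max_excerpts: int = 5,
                         min_score: float = 0.0) -> str:
    """Render the top-scoring excerpts as a short markdown bullet list."""
    excerpts = extract_result.get("excerpts") or []
    kept = [e for e in excerpts
            if e.get("text") and e.get("score", 0.0) >= min_score]

    # Keep the highest-scoring excerpts, then restore document order.
    kept.sort(key=lambda e: e.get("score", 0.0), reverse=True)
    kept = kept[:max_excerpts]
    kept.sort(key=lambda e: e.get("start_index", 0))

    title = (extract_result.get("contents") or {}).get("title")
    lines = [f"# {title}", ""] if title else []
    # Collapse internal whitespace so each excerpt stays on one bullet line.
    lines += ["- " + " ".join(e["text"].split()) for e in kept]

    body = "\n".join(lines).strip()
    return body + "\n" if body else ""
```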

Design note: the objective steers Extract’s cleaning and excerpt selection, but the cost per request is fixed. You can be quite specific (“focus on steps and parameters, ignore marketing copy”) without worrying about token-based pricing.


2. Handling JS-heavy pages

JS-heavy pages are where DIY scrapers usually break—you’d need a headless browser, timeouts, and a lot of retry logic. Parallel abstracts that away via its AI-native web index and live crawling.

For your application, this means:

  • You call Extract the same way, regardless of whether the site is static or heavily client-rendered.
  • Extract will return normalized text capturing the rendered content, not raw JS bundles.
  • Your job is simply to convert contents.text (or HTML) to markdown.

You don’t need to:

  • Maintain a fleet of headless browsers
  • Handle different frameworks (React, Next.js, Vue, etc.)
  • Special-case navigation or infinite scroll

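If you’re building an agent tool, you can expose a single “extract(url)” tool that works on both simple and JS-heavy pages with identical call semantics. The <3s latency figure quoted earlier applies to cached content, so allow extra headroom in your timeouts for pages that may need a live crawl.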


3. Handling PDFs

For PDFs, the request is the same—only the underlying content type changes:

curl -X POST https://api.parallel.ai/extract \
  -H "Authorization: Bearer $PARALLEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/manual.pdf",
    "objective": "Extract text and keep headings and list structure where possible."
  }'

The response might look like:

{
  "url": "https://example.com/manual.pdf",
  "contents": {
    "text": "Section 1: Overview\n...\n- Bullet item 1\n- Bullet item 2\n",
    "metadata": {
      "content_type": "application/pdf",
      "pages": 42
    }
  }
}

PDF-specific considerations:

  • Layout: Extract returns linearized text, not page images. You can reconstruct headings and lists based on line breaks and heuristics when generating markdown.
  • Tables/figures: For highly structured PDFs (tables, diagrams), you may want a second pass in your own code to detect patterns and generate markdown tables where feasible.
  • Consistency: You still pay per URL request and stay within the same latency band (seconds), instead of juggling separate PDF toolchains.
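
For the tables case, a second pass might look like the sketch below. It assumes table rows arrive as lines whose columns are separated by two or more spaces (or a tab), which is only a heuristic: real PDF text varies, and anything that does not look like a uniform grid is left alone.

```python
import re
from typing import Optional

# Two or more whitespace characters, or a single tab, mark a column boundary.
COLUMN_GAP = re.compile(r"\s{2,}|\t")


def rows_to_markdown_table(lines: list[str]) -> Optional[str]:
    """Convert aligned text rows into a markdown table, or return None."""
    rows = [COLUMN_GAP.split(line.strip()) for line in lines if line.strip()]
    if len(rows) < 2:
        return None
    width = len(rows[0])
    # Require at least two columns and the same column count on every row.
    if width < 2 or any(len(row) != width for row in rows):
        return None

    def fmt(cells: list[str]) -> str:
        # Escape literal pipes so they do not break the table layout.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])
```

Treating the first row as the header is a guess; for PDFs without header rows you may prefer to emit an empty header line instead.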

4. Converting Extract contents into markdown

Once you have contents.text (and optionally contents.html), you can implement a deterministic markdown converter in your own stack. The exact logic will depend on how structured you want the output to be, but a common pattern is:

  1. Use HTML → markdown for HTML content
  2. Use a text → markdown heuristic for PDF or plain-text outputs
  3. Optionally keep both:
    • A full markdown document (for storage, offline reading, or indexing)
    • A compressed markdown excerpt (using Extract’s excerpts) for agent context
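
For step 2, a text → markdown heuristic could start as small as this. The rules (lines beginning with "Section", "Chapter", or "Appendix", or short all-caps lines, become headings; common bullet glyphs become "-") are assumptions about typical PDF text, not behavior guaranteed by Extract, so tune them to your documents.

```python
import re

HEADING_RE = re.compile(r"^(Section|Chapter|Appendix)\s+\w+[:.]?\s+\S")
BULLET_RE = re.compile(r"^\s*[•\-\*▪◦]\s+(.*)$")
NUMBERED_RE = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")


def text_to_markdown(text: str) -> str:
    """Apply line-by-line heuristics to turn linearized text into markdown."""
    out = []
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            out.append("")  # keep paragraph breaks
        elif (m := BULLET_RE.match(raw)):
            out.append(f"- {m.group(1)}")  # normalize bullet glyphs
        elif (m := NUMBERED_RE.match(raw)):
            out.append(f"{m.group(1)}. {m.group(2)}")  # normalize "1)" to "1."
        elif HEADING_RE.match(stripped) or (stripped.isupper() and len(stripped) <= 60):
            out.append(f"## {stripped}")  # likely a section heading
        else:
            out.append(stripped)
    return "\n".join(out)
```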

Example: Node.js HTML → markdown

import TurndownService from "turndown";

const turndownService = new TurndownService({ headingStyle: "atx" });

function extractToMarkdown(extractResponse) {
  // Tolerate a missing contents object instead of throwing.
  const contents = extractResponse?.contents ?? {};

  // Prefer HTML if present for better structure; fall back to text.
  if (contents.html) {
    return turndownService.turndown(contents.html);
  }

  // Plain text (e.g., from PDFs) passes through unchanged.
  // Add custom text → markdown heuristics here if you need them.
  return contents.text ?? "";
}

Example: Python text/HTML → markdown

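import os

import requests
from markdownify import markdownify as md

# Read the key from the environment, as in the curl examples above.
PARALLEL_API_KEY = os.environ.get("PARALLEL_API_KEY", "YOUR_API_KEY")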

def call_extract(url, objective=None):
    payload = {"url": url}
    if objective:
        payload["objective"] = objective

    resp = requests.post(
        "https://api.parallel.ai/extract",
        headers={"Authorization": f"Bearer {PARALLEL_API_KEY}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def to_markdown(extract_result):
    contents = extract_result["contents"]
    html = contents.get("html")
    text = contents.get("text", "")

    if html:
        return md(html, heading_style="ATX")
    return text  # or apply custom text->markdown formatting

if __name__ == "__main__":
    result = call_extract(
        "https://example.com/docs",
        objective="Extract the main documentation body and headings."
    )
    markdown_doc = to_markdown(result)
    print(markdown_doc[:2000])  # preview

You’re free to make this as lightweight or as opinionated as you want—Parallel’s job is to give you structured, clean content so your markdown logic stays simple and predictable.


5. Pattern: Agent tool for “convert URL to markdown”

If you’re building an agent or MCP tool, wrap Extract plus your markdown conversion behind a single function.

Tool schema (conceptual)

{
  "name": "extract_markdown_from_url",
  "description": "Given a URL, fetch its contents via Parallel Extract and return clean markdown.",
  "input_schema": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "The URL to extract from (supports JS-heavy pages and PDFs)."
      },
      "focus": {
        "type": "string",
        "description": "Optional: what to focus on (e.g., 'main article', 'API reference', 'steps only').",
        "nullable": true
      }
    },
    "required": ["url"]
  }
}

Tool implementation (Python sketch)

def extract_markdown_from_url(url: str, focus: str | None = None) -> dict:
    objective = (
        f"Return the main content as clean text suitable for markdown. Focus on: {focus}."
        if focus
        else "Return the main content as clean text suitable for markdown."
    )

    extract_result = call_extract(url, objective=objective)
    markdown = to_markdown(extract_result)

    return {
        "url": url,
        "markdown": markdown,
        "metadata": extract_result.get("contents", {}).get("metadata", {})
    }

With this tool, your agent can:

  • Accept user inputs like “Summarize this PDF into 5 bullet points”
  • Call extract_markdown_from_url once to get markdown
  • Run summarization or reasoning directly on a compact, predictable text format

You avoid multiple “browse → scrape → reformat” loops and keep per-request costs bounded.
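
On the host side, the tool call still has to be validated and dispatched. The sketch below is one way to do that; `handle_tool_call` and the error-dictionary shape are illustrative assumptions, and `tool_fn` stands in for the `extract_markdown_from_url` function above so the snippet stays self-contained.

```python
import json
from typing import Callable, Optional


def handle_tool_call(name: str, arguments_json: str,
                     tool_fn: Callable[[str, Optional[str]], dict]) -> dict:
    """Validate a model-issued tool call and run it, returning errors as data."""
    if name != "extract_markdown_from_url":
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON arguments: {exc.msg}"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}

    url = args.get("url")
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        return {"error": "'url' must be an http(s) URL string"}
    focus = args.get("focus")
    if focus is not None and not isinstance(focus, str):
        return {"error": "'focus' must be a string when provided"}

    try:
        return tool_fn(url, focus)
    except Exception as exc:  # surface failures to the agent instead of crashing
        return {"error": f"extraction failed: {exc}"}
```

Returning errors as structured data lets the agent decide whether to retry, ask the user for a different URL, or give up.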


6. Objectives that improve markdown quality

Because Extract supports an objective string, you can steer the output toward markdown-friendly structures before you even touch your converter. Some effective patterns:

  • Documentation pages

    “Extract the main documentation content. Preserve headings, code blocks, and bullet lists; ignore navigation, cookie banners, and footers.”

  • Blog posts and articles

    “Extract the main article body, including headings and subheadings. Ignore comments, related posts, and newsletter popups.”

  • API references

    “Focus on endpoint definitions, parameters, response schemas, and examples. Ignore marketing sections, pricing, and unrelated content.”

  • PDF manuals

    “Extract the text, preserving section boundaries, headings, and lists. Ignore page numbers, running headers, and footers.”

These objectives don’t change your cost—they change how Extract prioritizes content and excerpts, which in turn simplifies your markdown conversion and downstream reasoning.
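
One way to keep these objectives consistent across a codebase is a small lookup table. The keys (`docs`, `article`, `api_reference`, `pdf_manual`) and the `.pdf` suffix check are assumptions made for this sketch; the objective strings are the ones listed above.

```python
from typing import Optional
from urllib.parse import urlparse

OBJECTIVES = {
    "docs": ("Extract the main documentation content. Preserve headings, code blocks, "
             "and bullet lists; ignore navigation, cookie banners, and footers."),
    "article": ("Extract the main article body, including headings and subheadings. "
                "Ignore comments, related posts, and newsletter popups."),
    "api_reference": ("Focus on endpoint definitions, parameters, response schemas, and "
                      "examples. Ignore marketing sections, pricing, and unrelated content."),
    "pdf_manual": ("Extract the text, preserving section boundaries, headings, and lists. "
                   "Ignore page numbers, running headers, and footers."),
}
DEFAULT_OBJECTIVE = "Return the main content as clean text suitable for markdown."


def objective_for(url: str, kind: Optional[str] = None) -> str:
    """Pick an objective by explicit kind, else guess PDFs from the URL path."""
    if kind is None and urlparse(url).path.lower().endswith(".pdf"):
        kind = "pdf_manual"
    # Unknown or missing kinds fall back to the generic objective.
    return OBJECTIVES.get(kind, DEFAULT_OBJECTIVE)
```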


7. Reliability, performance, and cost characteristics

When you use Extract as the backbone of your “URL → markdown” flow, you get predictable system-level behavior:

  • Latency:

    • Typical extractions return in <3s for cached content
    • Works well inside a single agent step without long blocking times
  • Cost predictability:

    • $0.001 per request (per URL), independent of page length
    • No surprises from token-heavy browsing or summarization calls
  • Coverage and freshness:

    • Backed by Parallel’s AI-native web index and live crawler
    • The same infrastructure that powers Search, so you get consistent handling across static, JS-heavy, and PDF content
  • Operational simplicity:

    • Collapses “fetch → headless render → scrape → clean” into a single API call
    • No custom parsers per site, no headless browser orchestration, no PDF-specific stack to maintain
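
Since each call is synchronous and priced per URL, a batch can be fanned out with a thread pool. This is a sketch under assumptions: `extract_fn` is whatever single-URL function you already have (for example, the `call_extract` helper above), and `max_workers=8` is an arbitrary default because this guide does not state rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def extract_many(urls: Iterable[str], extract_fn: Callable[[str], dict],
                 max_workers: int = 8) -> dict:
    """Extract URLs concurrently; one failure does not abort the batch."""
    unique_urls = list(dict.fromkeys(urls))  # dedupe, preserving order
    results: dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_fn, url): url for url in unique_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = {"ok": True, "result": future.result()}
            except Exception as exc:
                # Record the error per URL so callers can retry selectively.
                results[url] = {"ok": False, "error": str(exc)}
    return results
```

Deduplicating before submission keeps the request count, and therefore the bill, equal to the number of distinct URLs.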

For regulated or high-stakes settings, you can layer Extract under Parallel’s Task or FindAll APIs (which use the Basis framework for citations and confidence), but for straightforward “URL to markdown” conversion, Extract gives you a simple and robust primitive with very clear cost and latency bounds.


8. Putting it all together: a minimal, end-to-end pipeline

To summarize the full pattern:

  1. Receive a URL from a user, dataset, or Parallel Search result.
  2. Call Parallel Extract with a URL and a markdown-aware objective.
  3. Convert contents.html or contents.text to markdown with a small, deterministic converter.
  4. Use the markdown:
    • As context for your LLM/agent
    • As an artifact stored in your own system
    • As input to further processing (chunking, indexing, etc.)
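
For the chunking and indexing case, the markdown's own headings make natural split points. The sketch below is one simple approach, not a prescribed method: it splits on ATX headings, ignores `#` lines inside fenced code, and hard-cuts oversized sections at `max_chars`, a limit chosen arbitrarily here.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[dict]:
    """Split markdown into chunks keyed by the heading they fall under."""
    chunks: list[dict] = []
    heading = ""
    buf: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(buf).strip()
        # Hard-split long sections so every chunk fits the size budget.
        for i in range(0, len(body), max_chars):
            chunks.append({"heading": heading, "text": body[i:i + max_chars]})
        buf.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # track fenced code so "# comment" is not a heading
            buf.append(line)
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            heading = match.group(2).strip()
        else:
            buf.append(line)
    flush()
    return chunks
```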

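There’s no fetch-time branching for “HTML vs JS vs PDF”: the Extract call is identical for all three, and Parallel’s infrastructure does the heavy lifting behind the scenes. The only content-dependent choice left in your application is whether the converter starts from contents.html or contents.text.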

