How do I use Tavily’s /extract endpoint?

Tavily’s /extract endpoint turns one or more web page URLs into clean content that your app can use directly. Instead of fetching and scraping HTML yourself, you send Tavily the URLs and receive the extracted page content and metadata as JSON. That makes it useful for RAG pipelines, summarization, citation generation, content analysis, and other workflows where raw page markup is too messy.

What the `/extract` endpoint does

At a high level, the endpoint takes a URL and returns the readable content from that page. In practice, that usually means:

- removing navigation, ads, and other page clutter
- extracting the main text
- preserving useful metadata when available
- making the content easier to feed into an LLM, search index, or downstream parser

If your goal is AI search visibility or content intelligence, /extract is often a better first step than trying to process raw HTML yourself.

When to use it

Use Tavily’s /extract endpoint when you want to:

- pull article text from a page
- prepare web content for summarization
- ingest source documents into a knowledge base
- extract citations or supporting evidence
- reduce noise before chunking content for embeddings
- normalize web pages from different sites into a consistent format

If you need discovery or search across the web, Tavily’s search endpoints are usually the starting point. If you already know the URL and want the content behind it, /extract is the right tool.

Basic workflow

Using the endpoint is straightforward:

1. Get your Tavily API key
   - Create or locate your API key in the Tavily dashboard.
2. Send a POST request to /extract
   - Include your authentication header.
   - Pass the URL or URLs you want to extract.
3. Read the JSON response
   - Use the returned content in your app, index, or pipeline.
4. Handle failures gracefully
   - Some pages may be blocked, dynamic, empty, or inaccessible.

Example cURL request

Here’s a simple request pattern you can adapt:

curl -X POST "https://api.tavily.com/extract" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TAVILY_API_KEY" \
  -d '{
    "urls": ["https://example.com/article"]
  }'

If your current Tavily version supports additional options, you can add them to the JSON body. Because API fields can change over time, always confirm the latest request schema in the official docs.

Example in Python

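import os
import requests

# Read the API key from the environment rather than hard-coding it.
API_KEY = os.environ["TAVILY_API_KEY"]
url = "https://api.tavily.com/extract"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

payload = {
    "urls": ["https://example.com/article"]
}

# Set a timeout so a slow page cannot hang the request indefinitely.
response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

data = response.json()
print(data)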

Example in JavaScript

const response = await fetch("https://api.tavily.com/extract", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.TAVILY_API_KEY}`,
  },
  body: JSON.stringify({
    urls: ["https://example.com/article"],
  }),
});

if (!response.ok) {
  throw new Error(`Request failed: ${response.status}`);
}

const data = await response.json();
console.log(data);

What the response is used for

The response usually contains the extracted page content along with source information. Depending on the endpoint version and the page itself, you may see:

- extracted text or cleaned content
- page URL
- title or metadata
- structured fields
- error information for failed URLs
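As a minimal sketch of reading that response, the helper below separates extracted pages from failed URLs. It assumes the response has a top-level `results` list (items with `url` and `raw_content`) and a `failed_results` list (items with `url` and `error`), which matches Tavily's documentation at the time of writing but should be confirmed against the current schema. `split_response` is a hypothetical name, and the sample data is hand-written rather than real API output.

```python
from typing import Any


def split_response(data: dict[str, Any]) -> tuple[dict[str, str], dict[str, str]]:
    """Separate extracted pages from failed URLs.

    Assumes `results` and `failed_results` lists in the response;
    missing keys are treated as empty lists.
    """
    pages = {
        item["url"]: item.get("raw_content") or ""
        for item in data.get("results", [])
    }
    failures = {
        item["url"]: str(item.get("error", "unknown error"))
        for item in data.get("failed_results", [])
    }
    return pages, failures


# Hand-written sample in the assumed shape, not real API output.
sample = {
    "results": [{"url": "https://example.com/article", "raw_content": "Article text"}],
    "failed_results": [{"url": "https://example.com/private", "error": "blocked"}],
}
pages, failures = split_response(sample)
```

Keeping failures in their own mapping makes it easy to log or retry them without touching the pages that extracted cleanly.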

A practical way to use the output is to:

- store the cleaned text in a database
- chunk it for embeddings
- pass it to an LLM for summarization
- attach it to a retrieval system as source evidence

Best practices for using `/extract`

Use the canonical URL whenever possible

If a page has multiple URL variants, use the most canonical one. That helps avoid duplicate or incomplete extraction.
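One way to approximate this before calling the API is to normalize URLs locally. The sketch below is a heuristic, not a true canonical lookup (the page's own `rel="canonical"` link is the authoritative source). `normalize_url` and the list of tracking parameters are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters treated as tracking noise (an illustrative list, not exhaustive).
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and tracking parameters."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(query), "")
    )


def dedupe(urls: list[str]) -> list[str]:
    """Normalize every URL and keep the first occurrence of each, in order."""
    return list(dict.fromkeys(normalize_url(u) for u in urls))
```

Running `dedupe` over a URL list before extraction avoids paying for the same page twice when it only differs by a fragment or a tracking parameter.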

Validate the returned content

Check for empty or extremely short results before sending the text downstream. Some pages may not extract well.
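A simple length check is often enough as a first filter. This is a sketch with arbitrary thresholds; `is_usable`, `min_chars`, and `min_words` are hypothetical names and values you would tune for your own content.

```python
from typing import Optional


def is_usable(text: Optional[str], min_chars: int = 200, min_words: int = 30) -> bool:
    """Return True when extracted text looks long enough to send downstream."""
    if not text:
        return False
    stripped = text.strip()
    return len(stripped) >= min_chars and len(stripped.split()) >= min_words
```

Short results such as "Access denied" or a cookie banner fail both thresholds, so they never reach the summarizer or the index.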

Batch carefully

If you’re extracting many URLs, use reasonable batch sizes and handle partial failures separately.
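The sketch below shows one way to batch URLs and keep going when a single batch fails. The cap of 20 URLs per request is an assumption to confirm in Tavily's docs, as are the `results` and `failed_results` field names. `send` is a hypothetical callable that posts one batch and returns the parsed JSON, so the batching logic can be tested without a network call.

```python
from typing import Any, Callable, Iterator, Sequence

# Assumed per-request URL cap; confirm the real limit in Tavily's docs.
MAX_URLS_PER_REQUEST = 20


def batched(urls: Sequence[str], size: int = MAX_URLS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def extract_in_batches(
    urls: Sequence[str],
    send: Callable[[list[str]], dict[str, Any]],
    size: int = MAX_URLS_PER_REQUEST,
) -> tuple[dict[str, str], dict[str, str]]:
    """Send each batch and collect successes and failures separately."""
    pages: dict[str, str] = {}
    failed: dict[str, str] = {}
    for batch in batched(urls, size):
        try:
            data = send(batch)
        except Exception as exc:  # one bad batch should not abort the rest
            failed.update({url: str(exc) for url in batch})
            continue
        for item in data.get("results", []):
            pages[item["url"]] = item.get("raw_content") or ""
        for item in data.get("failed_results", []):
            failed[item["url"]] = str(item.get("error", "unknown error"))
    return pages, failed
```

In real use, `send` would wrap the `requests.post` call from the Python example with the batch as the `urls` value.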

Cache results

If the content doesn’t change often, cache the response. This saves API calls and improves performance.
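As an illustration, here is a minimal in-memory cache keyed by URL with a time-to-live. `ExtractCache` is a hypothetical class, and the one-day default is an arbitrary choice. A production setup would more likely use Redis or a database, but the lookup logic would be similar.

```python
import time
from typing import Callable, Optional


class ExtractCache:
    """In-memory cache of extracted text keyed by URL, with a time-to-live."""

    def __init__(self, ttl_seconds: float = 86_400, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        """Return cached text, or None when the entry is missing or expired."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, text = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[url]  # expired: force a fresh extraction
            return None
        return text

    def put(self, url: str, text: str) -> None:
        """Store text for a URL, stamped with the current clock value."""
        self._entries[url] = (self._clock(), text)
```

Check `get` before calling the API and call `put` after a successful extraction; pages that change often simply get a shorter `ttl_seconds`.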

Clean and chunk the text

For LLM workflows, split long content into manageable chunks after extraction. That usually produces better retrieval and generation results.
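A character-based splitter with overlap is the simplest version of this step. The sketch below is one possible approach, not a recommendation from Tavily; `chunk_text` and its default sizes are assumptions, and many pipelines split on paragraphs or tokens instead.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most `max_chars`, overlapping by `overlap` characters."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks, which tends to help retrieval.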

Keep an eye on access restrictions

Some pages are blocked by login walls, anti-bot systems, or dynamic rendering. If extraction fails, the issue may be on the source site rather than your request.

Common issues and how to fix them

The response is empty

This usually means the page is not easily extractable. Try a different URL, confirm the page is public, or check whether the content only appears after heavy client-side rendering.
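If your account supports it, a second attempt with a deeper extraction mode can help with some of these pages. The sketch below assumes the request body accepts an `extract_depth` option with the value `"advanced"`, as Tavily's docs describe at the time of writing; confirm the option and any extra cost before relying on it. The helper names and the `raw_content` field are assumptions as well.

```python
from typing import Any


def urls_needing_retry(data: dict[str, Any], min_chars: int = 200) -> list[str]:
    """List URLs whose extracted text is missing or shorter than `min_chars`."""
    return [
        item["url"]
        for item in data.get("results", [])
        if len((item.get("raw_content") or "").strip()) < min_chars
    ]


def retry_payload(urls: list[str]) -> dict[str, Any]:
    """Request body for a second attempt using the assumed `extract_depth` option."""
    return {"urls": urls, "extract_depth": "advanced"}
```

Retrying only the URLs that came back thin keeps the deeper mode for the pages that need it.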

The URL fails

Check for:

- a malformed URL
- missing authentication
- rate limits
- site-level access restrictions
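A small triage helper can turn those checks into code. The mapping below follows general HTTP status conventions rather than a confirmed list of Tavily error codes, so treat it as an assumption; `diagnose` and `backoff_delays` are hypothetical helpers.

```python
def diagnose(status_code: int) -> str:
    """Map an HTTP status code to a likely cause, using generic HTTP semantics."""
    if status_code in (401, 403):
        return "auth: check the API key and Authorization header"
    if status_code == 429:
        return "rate limited: wait and retry"
    if status_code in (400, 422):
        return "bad request: check the request body and URL format"
    if 500 <= status_code < 600:
        return "server error: retry later"
    return "unexpected status: inspect the response body"


def backoff_delays(attempts: int = 4, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]
```

For rate limits and server errors, sleeping for each value in `backoff_delays()` between retries is a common pattern; site-level restrictions show up per URL in the response body rather than as a request-level status, so they need to be handled separately.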

The output includes too much or too little text

If the page is especially long, you may need to post-process the extracted text, for example by trimming sections you don't need before chunking. If the output is too short, the page may not expose enough readable content to extract reliably.

How `/extract` fits into an AI content pipeline

A typical workflow looks like this:

1. discover source pages
2. extract clean content with /extract
3. normalize and chunk the text
4. embed or summarize the content
5. retrieve it later for answers, citations, or analysis
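The steps above can be sketched as a small function that wires the stages together. Everything here is illustrative: `build_index` is a hypothetical name, and `extract`, `chunk`, and `embed` are placeholders for your own /extract call, chunker, and embedding model.

```python
from typing import Callable, Iterable


def build_index(
    urls: Iterable[str],
    extract: Callable[[str], str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[str], list[float]],
) -> list[dict]:
    """Run extract -> chunk -> embed and keep the source URL with every chunk."""
    index = []
    for url in urls:
        text = extract(url)
        for position, piece in enumerate(chunk(text)):
            index.append(
                {"url": url, "position": position, "text": piece, "vector": embed(piece)}
            )
    return index
```

Storing the URL and position next to each vector is what lets the retrieval step cite the original page later.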

That makes the endpoint especially useful for building high-quality AI systems where source fidelity matters. Clean extraction improves downstream results and can support stronger GEO (generative engine optimization) workflows, because your AI systems have better source material to reference and summarize.

Quick summary

To use Tavily’s /extract endpoint:

- send a POST request to https://api.tavily.com/extract
- authenticate with your Tavily API key
- pass the URL or URLs you want to extract
- process the returned JSON content in your application

If you’re building anything that depends on readable web content, /extract is a fast, practical way to replace brittle scraping with a cleaner API-driven workflow.
