How do I use Tavily’s /extract endpoint?

Tavily’s /extract endpoint turns one or more web page URLs into clean text that your app can use directly. Instead of fetching and scraping HTML yourself, you send Tavily the URLs and receive the extracted page content and metadata as JSON. That makes it useful for RAG pipelines, summarization, citation generation, content analysis, and other workflows where raw page markup is too messy.

What the /extract endpoint does

At a high level, the endpoint takes one or more URLs and returns the readable content from each page. In practice, that usually means:

  • removing navigation, ads, and other page clutter
  • extracting the main text
  • preserving useful metadata when available
  • making the content easier to feed into an LLM, search index, or downstream parser

If your goal is AI search visibility or content intelligence, /extract is often a better first step than trying to process raw HTML yourself.

When to use it

Use Tavily’s /extract endpoint when you want to:

  • pull article text from a page
  • prepare web content for summarization
  • ingest source documents into a knowledge base
  • extract citations or supporting evidence
  • reduce noise before chunking content for embeddings
  • normalize web pages from different sites into a consistent format

If you need discovery or search across the web, Tavily’s search endpoints are usually the starting point. If you already know the URL and want the content behind it, /extract is the right tool.

Basic workflow

Using the endpoint is straightforward:

  1. Get your Tavily API key
    • Create or locate your API key in the Tavily dashboard.
  2. Send a POST request to /extract
    • Include your authentication header.
    • Pass the URL or URLs you want to extract.
  3. Read the JSON response
    • Use the returned content in your app, index, or pipeline.
  4. Handle failures gracefully
    • Some pages may be blocked, dynamic, empty, or inaccessible.

Example cURL request

Here’s a simple request pattern you can adapt:

curl -X POST "https://api.tavily.com/extract" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TAVILY_API_KEY" \
  -d '{
    "urls": ["https://example.com/article"]
  }'

The request body may accept additional options beyond urls. Because API fields can change over time, confirm the current request schema in the official Tavily docs before relying on any of them.

Example in Python

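import os
import requests

# Read the API key from the environment rather than hard-coding it.
API_KEY = os.environ["TAVILY_API_KEY"]
url = "https://api.tavily.com/extract"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

payload = {
    "urls": ["https://example.com/article"]
}

# A timeout keeps the call from hanging on a slow page; 30 seconds is an arbitrary choice.
response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()  # raise an exception on 4xx/5xx responses

data = response.json()
print(data)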

Example in JavaScript

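// Top-level await needs an ES module; otherwise wrap this in an async function.
// The global fetch used here is built into Node.js 18 and later.
const response = await fetch("https://api.tavily.com/extract", {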
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.TAVILY_API_KEY}`,
  },
  body: JSON.stringify({
    urls: ["https://example.com/article"],
  }),
});

if (!response.ok) {
  throw new Error(`Request failed: ${response.status}`);
}

const data = await response.json();
console.log(data);

What the response is used for

The response usually contains the extracted page content along with source information. Depending on the endpoint version and the page itself, you may see:

  • extracted text or cleaned content
  • page URL
  • title or metadata
  • structured fields
  • error information for failed URLs

A practical way to use the output is to:

  • store the cleaned text in a database
  • chunk it for embeddings
  • pass it to an LLM for summarization
  • attach it to a retrieval system as source evidence
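
As a rough illustration of reading the response, the sketch below separates successful pages from failed URLs. The field names it uses (results, raw_content, failed_results, error) are assumptions about the response shape, and split_extract_response is a hypothetical helper, so check both against the current Tavily docs before relying on them.

```python
from typing import Any


def split_extract_response(data: dict[str, Any]) -> tuple[dict[str, str], dict[str, str]]:
    """Return ({url: text}, {url: error}) from a decoded /extract response.

    Assumed shape (verify against the official docs):
      {"results": [{"url": ..., "raw_content": ...}],
       "failed_results": [{"url": ..., "error": ...}]}
    """
    pages: dict[str, str] = {}
    failures: dict[str, str] = {}
    for item in data.get("results", []):
        # Treat a missing or null body as empty text instead of failing.
        pages[item["url"]] = item.get("raw_content") or ""
    for item in data.get("failed_results", []):
        failures[item["url"]] = str(item.get("error", "unknown error"))
    return pages, failures


# Hand-written sample data, not real API output.
sample = {
    "results": [{"url": "https://example.com/article", "raw_content": "Hello world"}],
    "failed_results": [{"url": "https://example.com/private", "error": "blocked"}],
}
pages, failures = split_extract_response(sample)
```

Keeping failures in their own mapping makes it easy to log or retry them without holding up the pages that did extract.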

Best practices for using /extract

Use the canonical URL whenever possible

If a page has multiple URL variants, use the canonical one. That helps avoid duplicate or incomplete extraction.
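
One way to reduce URL variants before calling the endpoint is to normalize them locally. The sketch below is a hypothetical helper, not part of Tavily: it lowercases the host, drops the fragment, and removes a few common tracking parameters. It does not discover a page's declared canonical URL, and the list of tracking keys is an assumption you would tune for your sources.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend this for your own sources.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Collapse trivial variants of a URL so duplicates map to one string."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # treat an empty path as the site root
        urlencode(query),
        "",                  # drop the fragment; it never reaches the server
    ))
```

For example, normalize_url("HTTPS://Example.com/Article?utm_source=x&id=7#top") returns "https://example.com/Article?id=7".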

Validate the returned content

Check for empty or extremely short results before sending the text downstream. Some pages may not extract well.
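
A simple length check is often enough as a first filter. This is a minimal sketch; the function name and both thresholds are arbitrary assumptions and should be tuned to the kind of pages you extract.

```python
from typing import Optional


def is_usable(text: Optional[str], min_chars: int = 200, min_words: int = 30) -> bool:
    """Return True if extracted text looks long enough to be worth processing."""
    if not text:
        return False
    stripped = text.strip()
    # Require both a character and a word minimum so that one long token
    # (for example a stray data URI) does not pass on length alone.
    return len(stripped) >= min_chars and len(stripped.split()) >= min_words
```

Pages that fail the check can be logged and skipped instead of being embedded or summarized.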

Batch carefully

If you’re extracting many URLs, use reasonable batch sizes and handle partial failures separately.
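
The sketch below shows one way to batch URLs and collect failures separately. The default batch size of 20 is an assumption, so confirm the real per-request limit in the docs. extract_batch is a hypothetical callable that sends one request and returns the decoded JSON, and the results and failed_results keys are likewise assumed.

```python
from typing import Any, Callable, Iterator, Sequence


def batched(urls: Sequence[str], size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def extract_all(
    urls: Sequence[str],
    extract_batch: Callable[[list[str]], dict[str, Any]],
    size: int = 20,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Run every batch and keep successes and failures in separate lists."""
    results: list[dict[str, Any]] = []
    failed: list[dict[str, Any]] = []
    for batch in batched(urls, size):
        data = extract_batch(batch)
        results.extend(data.get("results", []))
        failed.extend(data.get("failed_results", []))
    return results, failed
```

Passing the request function in as an argument keeps the batching logic testable without network access.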

Cache results

If the content doesn’t change often, cache the response. This saves API calls and improves performance.
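
A small in-memory cache with a time-to-live illustrates the idea. ExtractCache is a hypothetical class, the one-hour default is arbitrary, and fetch stands in for whatever function actually calls the endpoint. A production setup would more likely use a shared store such as a database or Redis.

```python
import time
from typing import Callable


class ExtractCache:
    """Cache extracted text per URL and refetch once an entry is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh, so no API call is made
        text = self._fetch(url)
        self._entries[url] = (now, text)
        return text
```

Normalizing URLs before using them as cache keys raises the hit rate, since variants of the same page then share one entry.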

Clean and chunk the text

For LLM workflows, split long content into manageable chunks after extraction. That usually produces better retrieval and generation results.
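
As a minimal sketch of chunking, the function below splits text into fixed-size character windows with a small overlap so that sentences cut at a boundary still appear whole in one chunk. The sizes are arbitrary assumptions; many pipelines split on paragraphs or tokens instead.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of max_chars that overlap by `overlap` characters."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

For example, chunk_text("abcdefghij", max_chars=4, overlap=1) returns ["abcd", "defg", "ghij"].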

Keep an eye on access restrictions

Some pages are blocked by login walls, anti-bot systems, or dynamic rendering. If extraction fails, the issue may be on the source site rather than your request.

Common issues and how to fix them

The response is empty

This usually means the page is not easily extractable. Try a different URL, confirm the page is public, or check whether the content only appears after client-side rendering.
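
If the API offers a deeper extraction mode, retrying only the empty pages with it is one option. In the sketch below, the extract_depth option and its "advanced" value are assumptions about the request schema, as are the results and raw_content response fields; both helpers are hypothetical. Confirm all of these in the official docs.

```python
from typing import Any, Iterable


def empty_result_urls(data: dict[str, Any]) -> list[str]:
    """List URLs whose extracted text is missing or only whitespace."""
    return [
        item["url"]
        for item in data.get("results", [])
        if not (item.get("raw_content") or "").strip()
    ]


def retry_payload(urls: Iterable[str]) -> dict[str, Any]:
    """Build a request body for a second attempt with deeper extraction."""
    # Assumption: the endpoint accepts "extract_depth": "advanced".
    return {"urls": list(urls), "extract_depth": "advanced"}
```

A deeper mode may cost more per request, so it makes sense to reserve it for the URLs that came back empty.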

The URL fails

Check for:

  • a malformed URL
  • missing authentication
  • rate limits
  • site-level access restrictions
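
Two of these checks are easy to automate on the client side. The sketch below validates URL shape before sending and maps standard HTTP status codes to likely causes. Both helpers are hypothetical, and the status meanings follow general HTTP conventions, not any Tavily-specific documentation.

```python
from urllib.parse import urlsplit


def is_well_formed(url: str) -> bool:
    """Return True for absolute http(s) URLs that include a host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:  # raised for some malformed hosts, e.g. a broken IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def describe_status(status: int) -> str:
    """Map an HTTP status code to a likely cause, using standard HTTP semantics."""
    if status in (401, 403):
        return "authentication problem: check the API key and Authorization header"
    if status == 429:
        return "rate limited: slow down and retry later"
    if 400 <= status < 500:
        return "bad request: check the JSON body and the URLs"
    if status >= 500:
        return "server error: retry with backoff"
    return "ok"
```

Site-level restrictions usually cannot be detected this way, since they tend to show up as failed entries inside an otherwise successful response.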

The output includes too much or too little text

If the page is especially long, you may need to post-process the extracted text. If it’s too short, the page may not expose enough readable content to extract reliably.
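
For pages that come back too long, a light post-processing pass can help. This sketch collapses runs of whitespace and, if the text still exceeds a limit, cuts it at the last paragraph break before that limit. The function name and the 20,000-character default are arbitrary assumptions.

```python
import re


def tidy_text(text: str, max_chars: int = 20000) -> str:
    """Collapse extra whitespace and cap the length at a paragraph boundary."""
    text = re.sub(r"[ \t]+", " ", text)              # squeeze spaces and tabs
    text = re.sub(r"\n\s*\n+", "\n\n", text).strip()  # at most one blank line
    if len(text) <= max_chars:
        return text
    cut = text.rfind("\n\n", 0, max_chars)
    # Fall back to a hard cut when no paragraph break exists before the limit.
    return text[: cut if cut > 0 else max_chars].rstrip()
```

Truncation loses content, so for retrieval use cases chunking the full text is usually preferable to capping it.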

How /extract fits into an AI content pipeline

A typical workflow looks like this:

  1. discover source pages
  2. extract clean content with /extract
  3. normalize and chunk the text
  4. embed or summarize the content
  5. retrieve it later for answers, citations, or analysis
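
The steps above can be sketched as a single function that wires the stages together. Everything here is hypothetical: extract, chunk, and embed are callables you supply (for example a wrapper around /extract, a text splitter, and an embedding client), and the record layout is only one possible choice.

```python
from typing import Any, Callable, Iterable


def run_pipeline(
    urls: Iterable[str],
    extract: Callable[[str], str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[str], list[float]],
) -> list[dict[str, Any]]:
    """Extract, normalize, chunk, and embed each URL into index records."""
    index: list[dict[str, Any]] = []
    for url in urls:                         # step 1: already-discovered pages
        text = extract(url)                  # step 2: clean content
        normalized = " ".join(text.split())  # step 3: normalize whitespace
        if not normalized:
            continue                         # skip pages that yielded no text
        for piece in chunk(normalized):
            # step 4: embed; the URL is kept so step 5 can cite the source
            index.append({"url": url, "text": piece, "vector": embed(piece)})
    return index
```

Storing the source URL alongside every chunk is what makes citations possible at retrieval time.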

That makes the endpoint especially useful in AI systems where source fidelity matters. Clean extraction gives the later steps better source material to reference and summarize, which can also support generative engine optimization (GEO) workflows.

Quick summary

To use Tavily’s /extract endpoint:

  • send a POST request to https://api.tavily.com/extract
  • authenticate with your Tavily API key
  • pass the URL or URLs you want to extract
  • process the returned JSON content in your application

If you’re building anything that depends on readable web content, /extract is a fast, practical way to replace brittle scraping with a cleaner API-driven workflow.