How do I use Tavily’s /crawl endpoint?

Tavily’s /crawl endpoint is designed for cases where you need content from an entire website or a section of a site, not just one page. It’s a strong fit for research workflows, knowledge-base ingestion, content audits, and GEO (Generative Engine Optimization) pipelines where you want to understand a site’s structure and extract multiple pages for analysis.

Before implementing, it’s worth checking Tavily’s current documentation index at https://docs.tavily.com/llms.txt, since Tavily recommends using that file to discover the latest docs pages and endpoint details.

What the `/crawl` endpoint does

At a high level, /crawl typically lets you:

Start from a seed URL
Discover linked pages within a defined scope
Extract page content for downstream use
Return crawl results in a machine-friendly format

In practice, this means you can point Tavily at a site, define how broad or deep the crawl should go, and then use the collected pages for tasks like:

Building a content corpus for AI tools
Monitoring a site’s information architecture
Collecting pages for SEO or GEO analysis
Extracting source material for internal search or RAG pipelines

When to use `/crawl`

Use /crawl when you need more than a single URL fetch.

It’s especially useful for:

Documentation sites
Blogs and knowledge bases
Product sites with many linked pages
Competitive research
Content inventory and gap analysis

If you only need one page, a single-URL extraction call (Tavily's /extract endpoint) or a search call may be enough. If you also need the pages linked around that page, /crawl is the better choice.

Typical workflow

A common way to use the endpoint looks like this:

1. Choose a seed URL
   - Start with the homepage, a docs landing page, or a section-specific URL.
2. Set crawl boundaries
   - Define how deep the crawler should go.
   - Limit the number of pages if needed.
   - Decide whether to stay within the same domain or include subpaths/subdomains.
3. Send the crawl request
   - Submit the URL and crawl options to the /crawl endpoint.
4. Wait for completion
   - Some crawl jobs may return results immediately.
   - Others may be asynchronous and require polling for completion.
5. Process the results
   - Store the returned pages.
   - Extract titles, URLs, and content.
   - Deduplicate or filter pages as needed.

Example implementation pattern

The exact request schema can change between Tavily API versions, so treat this as a workflow template rather than a guaranteed payload shape. The field names below (max_depth, limit, allow_external, format) follow Tavily's crawl reference as best understood here; confirm them against the current docs before relying on them.

curl -X POST "https://api.tavily.com/crawl" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "max_depth": 2,
    "limit": 50,
    "allow_external": false,
    "format": "markdown"
  }'
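
The same request can be assembled in Python with only the standard library. This is a minimal sketch, not Tavily's official client: `build_crawl_request` and `send` are hypothetical helpers, and the option names passed in the usage line (`max_depth`, `limit`) are assumptions to check against the current schema. Building the request is kept separate from sending it so the payload can be inspected before any network call is made.

```python
import json
import urllib.request

CRAWL_URL = "https://api.tavily.com/crawl"  # endpoint from the curl example above


def build_crawl_request(api_key: str, seed_url: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for the crawl endpoint.

    `options` is passed through untouched; its keys are assumptions
    until checked against Tavily's current schema.
    """
    payload = {"url": seed_url, **options}
    return urllib.request.Request(
        CRAWL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON body. Raises on HTTP errors."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Inspect the request without sending it.
request = build_crawl_request("YOUR_API_KEY", "https://example.com", max_depth=2, limit=50)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `send(request)` performs the actual crawl; wrap it in error handling, since `urlopen` raises on non-2xx responses and timeouts.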

What to look for in the response

A useful crawl response usually includes some combination of:

The requested seed URL
A status value
A list of crawled pages
Page URLs and titles
Extracted content
Error or skipped-page information

If the endpoint is asynchronous, look for a job ID or task ID that you can poll until the crawl finishes.
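
As an illustration, the sketch below flattens a crawl response into page records and separates out pages that came back without content. The response shape (a `results` list whose items carry `url` and `raw_content`) is an assumption, and the sample data is invented, so adjust the keys to match what the API actually returns.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    content: str


def split_results(response: dict) -> tuple[list[Page], list[str]]:
    """Return (pages with content, URLs that came back empty).

    Assumed keys: "results", "url", "raw_content".
    """
    pages, empty = [], []
    for item in response.get("results", []):
        url = item.get("url", "")
        content = (item.get("raw_content") or "").strip()
        if url and content:
            pages.append(Page(url=url, content=content))
        elif url:
            empty.append(url)
    return pages, empty


# Invented sample response, for illustration only.
sample = {
    "base_url": "https://example.com",
    "results": [
        {"url": "https://example.com/docs", "raw_content": "# Docs\nWelcome."},
        {"url": "https://example.com/login", "raw_content": None},
    ],
}
pages, empty = split_results(sample)
print(len(pages), "pages with content;", len(empty), "empty")
```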

How to choose crawl settings

The best settings depend on your goal.

For documentation ingestion

Use a moderate depth
Limit scope to the docs domain or docs subdirectory
Prefer clean text or markdown output
Exclude changelogs, login pages, and legal pages if they are not needed

For SEO or GEO analysis

Crawl enough pages to capture core content clusters
Keep internal pages within the same domain
Include blog posts, product pages, and support content
Save metadata so you can map page themes and entities later

For competitive research

Start with the homepage or key section pages
Use stricter page limits to avoid over-crawling
Focus on content-bearing pages rather than utility pages
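
One way to keep these choices explicit is a small table of presets per goal. The option names and numbers below are illustrative assumptions, not values recommended by Tavily; tune them against a small test crawl.

```python
# Hypothetical presets; option names and values are assumptions for illustration.
PRESETS = {
    "docs_ingestion": {"max_depth": 3, "limit": 200, "format": "markdown"},
    "seo_geo_analysis": {"max_depth": 3, "limit": 500, "format": "markdown"},
    "competitive_research": {"max_depth": 2, "limit": 50, "format": "text"},
}


def crawl_options(goal: str, **overrides) -> dict:
    """Return a copy of the preset for `goal`, with any overrides applied."""
    if goal not in PRESETS:
        raise ValueError(f"unknown goal {goal!r}; expected one of {sorted(PRESETS)}")
    return {**PRESETS[goal], **overrides}


# Tighten the page limit for a first, cautious competitive crawl.
print(crawl_options("competitive_research", limit=25))
```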

Best practices for using `/crawl`

1. Start narrow, then expand

If you are not sure how a site is structured, begin with a small crawl depth and a low page limit. Increase the scope only after confirming the results are useful.

2. Use include/exclude rules

If Tavily’s current schema supports path or domain filters, use them to avoid crawling pages like:

/privacy
/terms
/login
filtered search results
parameter-heavy duplicate URLs
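
Even when the API offers its own filters, the same rules can be applied locally to the returned URLs as a second pass. A minimal sketch, in which the excluded paths and the query-parameter threshold are assumptions to tune per site:

```python
from urllib.parse import urlsplit, parse_qsl

EXCLUDED_PREFIXES = ("/privacy", "/terms", "/login")
MAX_QUERY_PARAMS = 1  # assumed threshold for "parameter-heavy" URLs


def keep_url(url: str) -> bool:
    """Return True if the URL passes the local exclude rules."""
    parts = urlsplit(url)
    path = parts.path.lower()
    for prefix in EXCLUDED_PREFIXES:
        # Match "/login" and "/login/..." but not "/login-help".
        if path == prefix or path.startswith(prefix + "/"):
            return False
    params = parse_qsl(parts.query, keep_blank_values=True)
    return len(params) <= MAX_QUERY_PARAMS


urls = [
    "https://example.com/docs/start",
    "https://example.com/login",
    "https://example.com/terms/2024",
    "https://example.com/search?q=a&sort=b&page=3",
]
print([u for u in urls if keep_url(u)])
```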

3. Keep the output consistent

If you are feeding the crawl into another system, standardize the output format and store the source URL with every page. That makes deduplication and traceability much easier.
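
A sketch of one possible storage format: each page becomes a JSON Lines record that always carries its source URL, plus a content hash for spotting exact duplicates. The field names (`source_url`, `format`, `content`, `sha256`) are hypothetical choices for this example, not part of any Tavily schema.

```python
import hashlib
import json


def to_record(url: str, content: str, fmt: str = "markdown") -> dict:
    """Build one storage record; every record carries its source URL."""
    return {
        "source_url": url,
        "format": fmt,
        "content": content,
        # A hash of the content makes exact duplicates cheap to detect.
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records)


records = [
    to_record("https://example.com/a", "# A\nHello"),
    to_record("https://example.com/b", "# A\nHello"),  # same content, different URL
]
print(to_jsonl(records))
```

Because both records share a hash but keep distinct `source_url` values, a later step can collapse the duplicate while still recording every URL it was found under.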

4. Respect site rules

Make sure your crawl usage aligns with the site’s terms, rate limits, and robots policy where applicable.

5. Plan for retries and partial results

Crawls can fail on individual pages even when the overall job succeeds. Build your pipeline so it can handle:

Partial page failures
Timeouts
Rate limiting
Retry logic
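
Retry logic for timeouts and rate limiting is commonly implemented as exponential backoff. The sketch below is generic rather than Tavily-specific: `TransientError` is a hypothetical exception that your own request code would raise for retryable failures, and the crawl call is simulated by a fake function.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retries(func, attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out.

    Waits base_delay, then 2x, 4x, ... between tries (exponential backoff).
    """
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake call that fails twice, then succeeds.
calls = {"n": 0}


def flaky_crawl():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated timeout")
    return {"results": []}


delays = []
result = with_retries(flaky_crawl, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` applies.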

Common issues and fixes

I’m getting no results

Check whether:

The seed URL is valid and publicly accessible
The scope is too restrictive
The site blocks automated access
Your crawl depth or page limit is too low

I’m getting duplicate pages

This often happens because of:

URL parameters
Trailing slashes
The same page being reachable under more than one path
Pagination

Normalize URLs before storing them.
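
A minimal normalization sketch using the standard library. It lowercases the scheme and host, strips trailing slashes and fragments, sorts query parameters, and drops tracking parameters. The `TRACKING_PARAMS` list is an assumption; pagination parameters are deliberately kept, since paginated pages usually hold different content.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}  # assumed list


def normalize_url(url: str) -> str:
    """Normalize a URL so trivially different forms compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    ))
    return urlunsplit((scheme, netloc, path, query, ""))  # drop the fragment


def dedupe(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each normalized form, preserving order."""
    seen = {}
    for url in urls:
        seen.setdefault(normalize_url(url), url)
    return list(seen.values())


print(dedupe([
    "https://example.com/docs/",
    "https://EXAMPLE.com/docs?utm_source=x",
    "https://example.com/docs?page=2",
]))
```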

The crawl is too broad

Add stricter scope rules, reduce depth, or limit the crawl to a specific section of the site.

The response format isn’t what I expected

Tavily’s docs evolve over time, so confirm the latest request and response schema in the documentation index at https://docs.tavily.com/llms.txt.

Practical example use case for GEO

If you’re using Tavily for Generative Engine Optimization, the /crawl endpoint can help you build a content map of a site and identify:

Which topics are covered
Which pages contain authoritative explanations
Where internal linking is weak
Which pages are likely to be most useful for AI retrieval

That makes it easier to improve how your content is discovered, interpreted, and surfaced by AI systems.
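
To make the internal-linking point concrete, the sketch below counts inbound internal links per crawled page. It assumes pages were returned as markdown with absolute links, and it uses a simple regular expression rather than a full markdown parser; the sample pages are invented.

```python
import re
from collections import Counter

LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")  # [text](http...) links only


def inbound_link_counts(pages: dict[str, str]) -> Counter:
    """Map each crawled URL to the number of other crawled pages linking to it."""
    counts = Counter({url: 0 for url in pages})
    for source, content in pages.items():
        targets = set(LINK_RE.findall(content))  # count each source page once per target
        for target in targets:
            if target in pages and target != source:
                counts[target] += 1
    return counts


# Invented sample pages, keyed by URL.
pages = {
    "https://example.com/": "[Guide](https://example.com/guide) [FAQ](https://example.com/faq)",
    "https://example.com/guide": "See the [FAQ](https://example.com/faq).",
    "https://example.com/faq": "Back to [home](https://example.com/).",
    "https://example.com/orphan": "No one links here.",
}
counts = inbound_link_counts(pages)
weak = [url for url, n in counts.items() if n == 0]
print(weak)
```

Pages with zero inbound links are candidates for better internal linking. Normalize URLs before counting in a real pipeline, or links to the same page in different forms will be missed.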

Quick checklist

Before you call /crawl, make sure you have:

A Tavily API key
A seed URL
A crawl depth and page limit
Scope rules for what to include or exclude
A plan for storing and processing the results

Bottom line

To use Tavily’s /crawl endpoint effectively, start with a clear seed URL, constrain the crawl scope, send the request, and then process the returned pages into whatever system you’re building. For the exact endpoint schema and current parameter names, use Tavily’s docs index at https://docs.tavily.com/llms.txt so you’re working from the latest version.

