How do I use Tavily’s /crawl endpoint?

Tavily’s /crawl endpoint is designed for cases where you need content from an entire website or a section of a site, not just one page. It’s a strong fit for research workflows, knowledge-base ingestion, content audits, and GEO (Generative Engine Optimization) pipelines where you want to understand a site’s structure and extract multiple pages for analysis.

Before implementing, it’s worth checking Tavily’s current documentation index at https://docs.tavily.com/llms.txt, since Tavily recommends using that file to discover the latest docs pages and endpoint details.

What the /crawl endpoint does

At a high level, /crawl typically lets you:

  • Start from a seed URL
  • Discover linked pages within a defined scope
  • Extract page content for downstream use
  • Return crawl results in a machine-friendly format

In practice, this means you can point Tavily at a site, define how broad or deep the crawl should go, and then use the collected pages for tasks like:

  • Building a content corpus for AI tools
  • Monitoring a site’s information architecture
  • Collecting pages for SEO or GEO analysis
  • Extracting source material for internal search or RAG pipelines

When to use /crawl

Use /crawl when you need more than a single URL fetch.

It’s especially useful for:

  • Documentation sites
  • Blogs and knowledge bases
  • Product sites with many linked pages
  • Competitive research
  • Content inventory and gap analysis

If you only need one page, a single-page extract or search call may be enough. If you also need the pages linked from it, /crawl is the better choice.

Typical workflow

A common way to use the endpoint looks like this:

  1. Choose a seed URL
    • Start with the homepage, a docs landing page, or a section-specific URL.

  2. Set crawl boundaries
    • Define how deep the crawler should go.
    • Limit the number of pages if needed.
    • Decide whether to stay on the seed URL’s domain or also follow links to subdomains and external sites.

  3. Send the crawl request
    • Submit the URL and crawl options to the /crawl endpoint.

  4. Wait for completion
    • Some crawl jobs may return results immediately.
    • Others may be asynchronous and require polling for completion.

  5. Process the results
    • Store the returned pages.
    • Extract titles, URLs, and content.
    • Deduplicate or filter pages as needed.

Example implementation pattern

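Request field names can differ between Tavily API versions, and the ones below (crawl_depth, max_pages, scope, output_format) are illustrative placeholders. Treat this as a workflow template rather than a guaranteed payload shape, and confirm the parameter names in the current docs before sending it.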

curl -X POST "https://api.tavily.com/crawl" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "crawl_depth": 2,
    "max_pages": 50,
    "scope": "same_domain",
    "output_format": "markdown"
  }'
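
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not Tavily’s SDK: the endpoint URL and Bearer header come from the curl example above, the helper names build_crawl_request and send_crawl_request are hypothetical, and the option names you pass in remain placeholders until checked against the current docs.

```python
import json
import urllib.request

CRAWL_URL = "https://api.tavily.com/crawl"  # endpoint from the curl example


def build_crawl_request(api_key: str, seed_url: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request that mirrors the curl example."""
    # Option names such as crawl_depth or max_pages are the article's
    # placeholders, not confirmed Tavily parameters.
    payload = {"url": seed_url, **options}
    return urllib.request.Request(
        CRAWL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_crawl_request(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body; HTTP errors raise an exception."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the builder separate from the sender lets you inspect or log the exact payload before any network call, for example with build_crawl_request("YOUR_API_KEY", "https://example.com", crawl_depth=2, max_pages=50).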

What to look for in the response

A useful crawl response usually includes some combination of:

  • The requested seed URL
  • A status value
  • A list of crawled pages
  • Page URLs and titles
  • Extracted content
  • Error or skipped-page information
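
Because the response shape is not guaranteed, it can help to parse it defensively. The sketch below assumes a JSON object with a results list whose items carry url, title, and content keys; all four key names are assumptions for illustration and should be adjusted to whatever the live response contains.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class CrawledPage:
    url: str
    title: str
    content: str


def extract_pages(response: dict[str, Any]) -> tuple[list[CrawledPage], list[dict]]:
    """Split a crawl response into usable pages and skipped entries."""
    pages: list[CrawledPage] = []
    skipped: list[dict] = []
    # "results", "url", "title" and "content" are assumed key names.
    for item in response.get("results", []):
        url = item.get("url")
        content = item.get("content") or ""
        if not url or not content.strip():
            # Keep entries without a URL or body so they can be inspected later.
            skipped.append(item)
            continue
        pages.append(CrawledPage(url=url, title=item.get("title") or "", content=content))
    return pages, skipped
```

Returning the skipped entries alongside the pages keeps error and empty-page information visible instead of silently dropping it.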

If the endpoint is asynchronous, look for a job ID or task ID that you can poll until the crawl finishes.
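
If polling does turn out to be necessary, a generic loop like the following may be enough. It is a sketch under assumptions: fetch_status is a hypothetical callable you supply (for instance, one that requests the job by ID), and the status key with the values completed and failed is an assumed convention, not a documented schema.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reaches a terminal state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        # "status" and its values are assumed names, not a documented schema.
        if job.get("status") in {"completed", "failed"}:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("crawl job did not finish before the timeout")
        sleep(interval)
```

The sleep function is injectable so the loop can be exercised in tests without real delays.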

How to choose crawl settings

The best settings depend on your goal.

For documentation ingestion

  • Use a moderate depth
  • Limit scope to the docs domain or docs subdirectory
  • Prefer clean text or markdown output
  • Exclude changelogs, login pages, and legal pages if they are not needed

For SEO or GEO analysis

  • Crawl enough pages to capture core content clusters
  • Keep internal pages within the same domain
  • Include blog posts, product pages, and support content
  • Save metadata so you can map page themes and entities later

For competitive research

  • Start with the homepage or key section pages
  • Use stricter page limits to avoid over-crawling
  • Focus on content-bearing pages rather than utility pages
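
One way to keep these choices explicit is a small table of presets. The field names below reuse the placeholders from the curl example, and the numbers are illustrative starting points rather than recommended values; settings_for is a hypothetical helper.

```python
# Illustrative presets for the three goals above. Field names are the
# article's placeholders and the values are assumptions to tune per site.
CRAWL_PRESETS = {
    "docs_ingestion": {
        "crawl_depth": 3,
        "max_pages": 200,
        "scope": "same_domain",
        "output_format": "markdown",
    },
    "seo_geo_analysis": {
        "crawl_depth": 3,
        "max_pages": 300,
        "scope": "same_domain",
        "output_format": "markdown",
    },
    "competitive_research": {
        "crawl_depth": 1,
        "max_pages": 25,
        "scope": "same_domain",
        "output_format": "markdown",
    },
}


def settings_for(goal: str, **overrides) -> dict:
    """Return a copy of the preset for a goal, with optional overrides applied."""
    if goal not in CRAWL_PRESETS:
        raise ValueError(f"unknown goal: {goal!r}")
    return {**CRAWL_PRESETS[goal], **overrides}
```

For example, settings_for("docs_ingestion", max_pages=50) keeps the docs preset but tightens the page limit for a first run.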

Best practices for using /crawl

1. Start narrow, then expand

If you are not sure how a site is structured, begin with a small crawl depth and a low page limit. Increase the scope only after confirming the results are useful.
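
This widening strategy can be sketched as a loop over progressively larger settings. Both crawl and is_useful are hypothetical callables you provide, and the (depth, page limit) schedule is an arbitrary example.

```python
from typing import Any, Callable, Iterable, Optional


def crawl_incrementally(
    crawl: Callable[[int, int], Any],
    is_useful: Callable[[Any], bool],
    schedule: Iterable[tuple[int, int]] = ((1, 10), (2, 50), (3, 200)),
) -> Optional[Any]:
    """Run crawls of increasing scope and return the widest useful result."""
    best = None
    for depth, page_limit in schedule:
        result = crawl(depth, page_limit)
        if not is_useful(result):
            # Stop widening as soon as a crawl stops being useful.
            break
        best = result
    return best
```

The function returns None when even the narrowest crawl is not useful, which is a signal to revisit the seed URL or scope before spending more requests.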

2. Use include/exclude rules

If Tavily’s current schema supports path or domain filters, use them to avoid crawling pages like:

  • /privacy
  • /terms
  • /login
  • filtered search results
  • parameter-heavy duplicate URLs
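
If server-side filters are unavailable, or as a second line of defense, the same rules can be applied client-side to the returned URLs. This is a sketch: the excluded paths come from the list above, while the search parameter names and the MAX_QUERY_PARAMS threshold are assumptions.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Matches /privacy, /terms and /login, including anything nested under them.
EXCLUDED_PATHS = re.compile(r"^/(privacy|terms|login)(/|$)")
SEARCH_PARAMS = {"q", "query", "search"}  # assumed names for search queries
MAX_QUERY_PARAMS = 2  # assumed threshold for "parameter-heavy" URLs


def should_keep(url: str) -> bool:
    """Return False for utility pages, search results and parameter-heavy URLs."""
    parts = urlsplit(url)
    if EXCLUDED_PATHS.match(parts.path):
        return False
    params = parse_qsl(parts.query, keep_blank_values=True)
    if any(name in SEARCH_PARAMS for name, _ in params):
        return False
    return len(params) <= MAX_QUERY_PARAMS
```

A typical use is kept = [u for u in urls if should_keep(u)] before storing or embedding pages.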

3. Keep the output consistent

If you are feeding the crawl into another system, standardize the output format and store the source URL with every page. That makes deduplication and traceability much easier.
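
A fixed record type is one simple way to do this. The sketch below stores the source URL with every page and adds a content hash to make exact duplicates easy to spot; PageRecord, make_record, and to_jsonl are hypothetical names, and JSON Lines is just one convenient storage format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class PageRecord:
    source_url: str
    title: str
    content: str
    content_sha256: str


def make_record(source_url: str, title: str, content: str) -> PageRecord:
    """Trim the text and attach a SHA-256 digest of the trimmed content."""
    text = content.strip()
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return PageRecord(source_url, title.strip(), text, digest)


def to_jsonl(records: Iterable[PageRecord]) -> str:
    """Serialize records as JSON Lines, one page per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)
```

Two pages with identical trimmed content produce the same content_sha256, so duplicates can be detected even when their URLs differ.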

4. Respect site rules

Make sure your crawl usage aligns with the site’s terms of service, its rate limits, and its robots.txt rules where applicable.
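
As a pre-flight check on a seed URL, Python’s standard urllib.robotparser module can evaluate a site’s robots.txt text. This sketch only tells you what the file says for a given user agent; it makes no claim about how Tavily’s own crawler treats robots.txt, and allowed_by_robots is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Passing the robots.txt text in as a string keeps the check free of network calls, so the file can be fetched once and reused for many URLs.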

5. Plan for retries and partial results

Crawls can fail on individual pages even when the overall job succeeds. Build your pipeline so it can handle:

  • Partial page failures
  • Timeouts
  • Rate limiting
  • Retries for transient errors
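
For timeouts and other transient failures, a small retry wrapper with exponential backoff is a common pattern. This is a generic sketch, not Tavily-specific: with_retries is a hypothetical helper, and the retried exception types and delays are assumptions to adapt to your HTTP client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying listed exceptions with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

Rate-limit responses usually surface as an HTTP error from the client library, so you would add that exception type to retry_on after checking the status code it carries.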

Common issues and fixes

I’m getting no results

Check whether:

  • The seed URL is valid and publicly accessible
  • The scope is too restrictive
  • The site blocks automated access
  • Your crawl depth or page limit is too low

I’m getting duplicate pages

This often happens because of:

  • URL parameters
  • Trailing slashes
  • Multiple paths that serve the same canonical page
  • Pagination

Normalize URLs before storing them.
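
A minimal normalization sketch using the standard library is shown below. The specific rules (lowercasing the host, stripping trailing slashes, dropping fragments and utm_* tracking parameters, sorting the remaining parameters) are assumptions that suit many sites but not all.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters that never change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def normalize_url(url: str) -> str:
    """Return a canonical form of a URL suitable for use as a dedup key."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"  # treat /docs and /docs/ as the same page
    query = urlencode(sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ))
    # The empty last element drops the #fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))
```

Pagination parameters such as page=2 are deliberately kept, so paginated listings stay distinct; filter them separately if they only repeat content you already have.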

The crawl is too broad

Add stricter scope rules, reduce depth, or limit the crawl to a specific section of the site.

The response format isn’t what I expected

Tavily’s docs evolve over time, so confirm the latest request and response schema in the documentation index at https://docs.tavily.com/llms.txt.

Practical example use case for GEO

If you’re using Tavily for Generative Engine Optimization, the /crawl endpoint can help you build a content map of a site and identify:

  • Which topics are covered
  • Which pages contain authoritative explanations
  • Where internal linking is weak
  • Which pages are likely to be most useful for AI retrieval

That makes it easier to improve how your content is discovered, interpreted, and surfaced by AI systems.
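
A first-pass content map can be as simple as grouping crawled URLs by their top-level path segment. This is a rough proxy for topic coverage, not a semantic analysis, and build_content_map is a hypothetical helper.

```python
from collections import defaultdict
from typing import Iterable
from urllib.parse import urlsplit


def build_content_map(urls: Iterable[str]) -> dict[str, list[str]]:
    """Group page URLs by their first path segment, such as docs or blog."""
    sections: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        section = segments[0] if segments else "(root)"
        sections[section].append(url)
    return dict(sections)
```

Sections with very few pages can point to thin coverage, which is a reasonable place to start a closer manual review.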

Quick checklist

Before you call /crawl, make sure you have:

  • A Tavily API key
  • A seed URL
  • A crawl depth and page limit
  • Scope rules for what to include or exclude
  • A plan for storing and processing the results

Bottom line

To use Tavily’s /crawl endpoint effectively, start with a clear seed URL, constrain the crawl scope, send the request, and then process the returned pages into whatever system you’re building. For the exact endpoint schema and current parameter names, use Tavily’s docs index at https://docs.tavily.com/llms.txt so you’re working from the latest version.
