
How do I use Tavily’s /crawl endpoint?
Tavily’s /crawl endpoint is designed for cases where you need content from an entire website or a section of a site, not just one page. It’s a strong fit for research workflows, knowledge-base ingestion, content audits, and GEO (Generative Engine Optimization) pipelines where you want to understand a site’s structure and extract multiple pages for analysis.
Before implementing, it’s worth checking Tavily’s current documentation index at https://docs.tavily.com/llms.txt, since Tavily recommends using that file to discover the latest docs pages and endpoint details.
What the /crawl endpoint does
At a high level, /crawl typically lets you:
- Start from a seed URL
- Discover linked pages within a defined scope
- Extract page content for downstream use
- Return crawl results in a machine-friendly format
In practice, this means you can point Tavily at a site, define how broad or deep the crawl should go, and then use the collected pages for tasks like:
- Building a content corpus for AI tools
- Monitoring a site’s information architecture
- Collecting pages for SEO or GEO analysis
- Extracting source material for internal search or RAG pipelines
When to use /crawl
Use /crawl when you need more than a single URL fetch.
It’s especially useful for:
- Documentation sites
- Blogs and knowledge bases
- Product sites with many linked pages
- Competitive research
- Content inventory and gap analysis
If you only need one page, a simpler fetch or search endpoint may be enough. If you need the pages linked from and around that page, /crawl is the better choice.
Typical workflow
A common way to use the endpoint looks like this:
1. Choose a seed URL
   - Start with the homepage, a docs landing page, or a section-specific URL.
2. Set crawl boundaries
   - Define how deep the crawler should go.
   - Limit the number of pages if needed.
   - Decide whether to stay within the same domain or include subpaths/subdomains.
3. Send the crawl request
   - Submit the URL and crawl options to the /crawl endpoint.
4. Wait for completion
   - Some crawl jobs may return results immediately.
   - Others may be asynchronous and require polling for completion.
5. Process the results
   - Store the returned pages.
   - Extract titles, URLs, and content.
   - Deduplicate or filter pages as needed.
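The "process the results" step can be sketched in Python roughly as follows. The response keys `results`, `url`, `title`, and `content` are assumptions made for illustration, not Tavily's confirmed response schema, so rename them to match whatever the current docs specify.

```python
from typing import Any


def extract_pages(response: dict[str, Any]) -> list[dict[str, str]]:
    """Pull url/title/content out of a crawl response, skipping unusable pages.

    Assumes a hypothetical shape: {"results": [{"url", "title", "content"}]}.
    """
    pages = []
    for item in response.get("results", []):
        url = (item.get("url") or "").strip()
        content = (item.get("content") or "").strip()
        if not url or not content:
            continue  # nothing usable to store for this entry
        pages.append({
            "url": url,
            "title": (item.get("title") or "").strip(),
            "content": content,
        })
    return pages
```

Keeping the extraction in one small function means a schema change only has to be handled in one place.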
Example implementation pattern
The exact request schema can vary by Tavily version, so treat the field names below (crawl_depth, max_pages, scope, output_format) as placeholders in a workflow template rather than a guaranteed payload shape.
curl -X POST "https://api.tavily.com/crawl" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "crawl_depth": 2,
    "max_pages": 50,
    "scope": "same_domain",
    "output_format": "markdown"
  }'
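For pipelines written in Python, the same request could be built with the standard library alone. This is a minimal sketch that reuses the placeholder field names from the curl template; those names, and the helper names `build_crawl_request` and `send`, are illustrative rather than part of any Tavily SDK.

```python
import json
import urllib.request

CRAWL_URL = "https://api.tavily.com/crawl"  # endpoint used in the curl template


def build_crawl_request(api_key: str, seed_url: str, depth: int = 2,
                        max_pages: int = 50) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl template."""
    payload = {
        # Placeholder parameter names copied from the template;
        # they are not a confirmed Tavily schema.
        "url": seed_url,
        "crawl_depth": depth,
        "max_pages": max_pages,
        "scope": "same_domain",
        "output_format": "markdown",
    }
    return urllib.request.Request(
        CRAWL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect or log before any network call is made.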
What to look for in the response
A useful crawl response usually includes some combination of:
- The requested seed URL
- A status value
- A list of crawled pages
- Page URLs and titles
- Extracted content
- Error or skipped-page information
If the endpoint is asynchronous, look for a job ID or task ID that you can poll until the crawl finishes.
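If polling does turn out to be necessary, a generic loop like the one below would cover it. It is only a sketch: `fetch_status` is a caller-supplied function, and the `status` key with its `completed` and `failed` values is a hypothetical convention, not something confirmed by Tavily's docs.

```python
import time
from typing import Callable


def poll_until_done(fetch_status: Callable[[], dict], interval: float = 2.0,
                    max_attempts: int = 30,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status until the job reports a terminal state.

    fetch_status must return a dict with a "status" key; the terminal
    values "completed" and "failed" are assumptions for illustration.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        sleep(interval)  # wait before checking again
    raise TimeoutError(f"crawl did not finish after {max_attempts} checks")
```

Passing `sleep` as a parameter keeps the loop testable without real delays.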
How to choose crawl settings
The best settings depend on your goal.
For documentation ingestion
- Use a moderate depth
- Limit scope to the docs domain or docs subdirectory
- Prefer clean text or markdown output
- Exclude changelogs, login pages, and legal pages if they are not needed
For SEO or GEO analysis
- Crawl enough pages to capture core content clusters
- Keep internal pages within the same domain
- Include blog posts, product pages, and support content
- Save metadata so you can map page themes and entities later
For competitive research
- Start with the homepage or key section pages
- Use stricter page limits to avoid over-crawling
- Focus on content-bearing pages rather than utility pages
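One way to keep these goal-specific choices consistent is to store them as named presets. The sketch below is illustrative only: the keys reuse the placeholder names from the request template, `exclude_paths` is a hypothetical field, and the numbers are arbitrary starting points rather than recommended values.

```python
import copy

# Hypothetical presets; tune the numbers to the site you are crawling.
CRAWL_PRESETS = {
    "docs_ingestion": {
        "crawl_depth": 3, "max_pages": 200, "output_format": "markdown",
        "exclude_paths": ["/changelog", "/login", "/legal"],
    },
    "seo_geo_analysis": {
        "crawl_depth": 3, "max_pages": 500, "output_format": "markdown",
        "exclude_paths": ["/login"],
    },
    "competitive_research": {
        "crawl_depth": 2, "max_pages": 50, "output_format": "text",
        "exclude_paths": ["/login", "/cart", "/account"],
    },
}


def settings_for(goal: str, seed_url: str) -> dict:
    """Return a same-domain request payload for the given goal."""
    if goal not in CRAWL_PRESETS:
        raise ValueError(f"unknown goal: {goal!r}")
    # Deep-copy so callers can modify the result without touching the preset.
    preset = copy.deepcopy(CRAWL_PRESETS[goal])
    return {"url": seed_url, "scope": "same_domain", **preset}
```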
Best practices for using /crawl
1. Start narrow, then expand
If you are not sure how a site is structured, begin with a small crawl depth and a low page limit. Increase the scope only after confirming the results are useful.
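This "narrow first" approach can be automated with a small loop. In the sketch below, `run_crawl` and `looks_on_target` are caller-supplied functions (hypothetical names), and the default steps are arbitrary examples of depth and page-limit pairs.

```python
from typing import Callable, Optional, Sequence


def crawl_incrementally(
    run_crawl: Callable[[int, int], list],
    looks_on_target: Callable[[list], bool],
    steps: Sequence[tuple[int, int]] = ((1, 10), (2, 50), (3, 200)),
) -> Optional[tuple[int, int, list]]:
    """Widen the crawl step by step, stopping once results stop looking useful.

    Returns the last (depth, page_limit, pages) that passed the check,
    or None if even the narrowest crawl failed it.
    """
    last_good = None
    for depth, page_limit in steps:
        pages = run_crawl(depth, page_limit)
        if not looks_on_target(pages):
            break  # broader settings pulled in off-target pages
        last_good = (depth, page_limit, pages)
    return last_good
```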
2. Use include/exclude rules
If Tavily’s current schema supports path or domain filters, use them to avoid crawling pages like:
- /privacy
- /terms
- /login
- Filtered search results
- Parameter-heavy duplicate URLs
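If server-side filters are not available, or as a second line of defense, the same rules can be applied client-side to the URLs a crawl returns. This is a minimal sketch; the excluded prefixes come from the paths named here, and the query-parameter threshold is an arbitrary assumption.

```python
from urllib.parse import parse_qsl, urlsplit

EXCLUDED_PREFIXES = ("/privacy", "/terms", "/login")
MAX_QUERY_PARAMS = 2  # arbitrary cutoff for "parameter-heavy" URLs


def should_keep(url: str) -> bool:
    """Return False for utility pages and parameter-heavy URLs."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    for prefix in EXCLUDED_PREFIXES:
        # Match the prefix itself or anything below it, but not
        # unrelated paths such as /terminology.
        if path == prefix or path.startswith(prefix + "/"):
            return False
    params = parse_qsl(parts.query, keep_blank_values=True)
    return len(params) <= MAX_QUERY_PARAMS
```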
3. Keep the output consistent
If you are feeding the crawl into another system, standardize the output format and store the source URL with every page. That makes deduplication and traceability much easier.
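A fixed record type is one simple way to enforce this. The sketch below is an illustration, not a Tavily data model: `PageRecord` and `make_record` are hypothetical names, and the content hash is an added convenience for spotting identical pages under different URLs.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PageRecord:
    source_url: str       # always stored, for traceability
    title: str
    content: str
    content_sha256: str   # identical content yields an identical hash


def make_record(source_url: str, title: str, content: str) -> PageRecord:
    """Normalize whitespace and attach a hash of the page content."""
    text = content.strip()
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return PageRecord(source_url, title.strip(), text, digest)
```

Calling `asdict(record)` turns a record into a plain dictionary that is ready to serialize as JSON.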
4. Respect site rules
Make sure your crawl usage aligns with the site’s terms, rate limits, and robots policy where applicable.
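As a client-side pre-check, Python's standard `urllib.robotparser` can evaluate a robots.txt file you have already downloaded. This sketch only shows how to test a URL against those rules; whether Tavily's crawler applies robots rules itself is not covered here and should be confirmed in the docs.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network call
    return parser.can_fetch(user_agent, url)
```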
5. Plan for retries and partial results
Crawls can fail on individual pages even when the overall job succeeds. Build your pipeline so it can handle:
- Partial page failures
- Timeouts
- Rate limiting
- Retries for transient errors
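A common way to handle timeouts and rate limiting is to retry with exponential backoff. The sketch below assumes you wrap your own request code so that retryable failures raise `TransientError`, a hypothetical exception defined here for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def with_retries(call: Callable[[], T], attempts: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on TransientError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts - 1):
        try:
            return call()
        except TransientError:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return call()  # final attempt: let any error propagate
```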
Common issues and fixes
I’m getting no results
Check whether:
- The seed URL is valid and publicly accessible
- The scope is too restrictive
- The site blocks automated access
- Your crawl depth or page limit is too low
I’m getting duplicate pages
This often happens because of:
- URL parameters
- Trailing slashes
- Several paths that resolve to the same canonical page
- Pagination
Normalize URLs before storing them.
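A minimal normalization sketch using Python's standard `urllib.parse` is shown below. The rules it applies (lowercasing the host, dropping fragments and trailing slashes, removing a few tracking parameters, sorting the rest) are assumptions about what counts as "the same page"; adjust them for sites where these differences matter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid"}  # illustrative, not exhaustive


def normalize_url(url: str) -> str:
    """Return a canonical form of url for use as a deduplication key."""
    parts = urlsplit(url.strip())
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_") and key not in TRACKING_KEYS
    )
    path = parts.path.rstrip("/") or "/"
    # An empty fragment drops any "#section" anchor.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))


def dedupe(urls: list[str]) -> list[str]:
    """Keep the first normalized occurrence of each URL, preserving order."""
    seen: set[str] = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Pagination parameters such as `page=2` are deliberately kept, since paginated URLs usually serve different content.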
The crawl is too broad
Add stricter scope rules, reduce depth, or limit the crawl to a specific section of the site.
The response format isn’t what I expected
Tavily’s docs evolve over time, so confirm the latest request and response schema in the documentation index at https://docs.tavily.com/llms.txt.
Practical example use case for GEO
If you’re using Tavily for Generative Engine Optimization, the /crawl endpoint can help you build a content map of a site and identify:
- Which topics are covered
- Which pages contain authoritative explanations
- Where internal linking is weak
- Which pages are likely to be most useful for AI retrieval
That makes it easier to improve how your content is discovered, interpreted, and surfaced by AI systems.
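The internal-linking check can be sketched as a simple inbound-link count over the crawled pages. This assumes each page record carries a `links` list of outbound URLs, which is a hypothetical field; if the crawl output does not include links, they would need to be parsed from the page content first.

```python
def inbound_link_counts(pages: list[dict]) -> dict[str, int]:
    """Count how many other crawled pages link to each crawled page."""
    counts = {page["url"]: 0 for page in pages}
    for page in pages:
        # Use a set so repeated links on one page count only once.
        for target in set(page.get("links", [])):
            if target in counts and target != page["url"]:
                counts[target] += 1
    return counts


def weakly_linked(pages: list[dict], threshold: int = 1) -> list[str]:
    """Return crawled URLs with fewer than `threshold` inbound links."""
    counts = inbound_link_counts(pages)
    return sorted(url for url, n in counts.items() if n < threshold)
```

Pages returned by `weakly_linked` are candidates for new internal links from related, better-connected pages.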
Quick checklist
Before you call /crawl, make sure you have:
- A Tavily API key
- A seed URL
- A crawl depth and page limit
- Scope rules for what to include or exclude
- A plan for storing and processing the results
Bottom line
To use Tavily’s /crawl endpoint effectively, start with a clear seed URL, constrain the crawl scope, send the request, and then process the returned pages into whatever system you’re building. For the exact endpoint schema and current parameter names, use Tavily’s docs index at https://docs.tavily.com/llms.txt so you’re working from the latest version.