
I’m building a RAG pipeline—how do I crawl a site and extract clean text/markdown (not messy HTML) for embeddings?
Most RAG pipelines don’t fail on the model—they fail on the data. If your crawler is feeding raw, messy HTML into your embedding step, you’re paying for tokens that are just nav bars, footers, and cookie banners. The good news: you can turn “crawl this site” into “give me clean, semantically chunked Markdown ready for embeddings” with almost no custom infra if you use the right Actors.
Quick Answer: Use Apify’s Website Content Crawler to crawl a public site and extract clean text or Markdown, then optionally run the output through a token-aware cleaner before your embedding step. You get boilerplate-free content that is ready for vector databases and RAG, without writing your own crawling and unblocking stack.
The Quick Overview
- What It Is: A ready-made workflow on Apify that crawls websites, strips boilerplate, and exports clean text or Markdown as structured datasets for embeddings and RAG pipelines.
- Who It Is For: Engineers and data folks building RAG systems, LLM apps, and AI agents who need reliable, token-efficient content from public websites, docs, blogs, or help centers.
- Core Problem Solved: Getting from “there’s useful content on that site” to “I have deduplicated, boilerplate-free, Markdown chunks I can embed”—without building and maintaining your own crawler, proxies, and HTML cleaner.
How It Works
At a high level, you configure what to crawl, let Apify handle crawling and unblocking, then consume the resulting dataset (Markdown + metadata) in your embedding pipeline.
Under the hood, Website Content Crawler does the heavy lifting:
- Crawls the site using browser automation where needed.
- Extracts the main content and drops navigation, sidebars, and footers.
- Outputs rich Markdown, plus URLs and other metadata.
- Stores everything as a dataset you can export or pull via API.
A typical RAG-oriented workflow looks like this:
- Configure the crawl:
  - In Apify Console, pick `apify/website-content-crawler`.
  - Set `startUrls` (homepage, docs root, help center, blog index).
  - Configure `linkPatterns` or “stay on domain” rules so the Actor only crawls relevant paths.
  - Optionally set limits (max depth, max pages, concurrency) and ignore patterns (login, cart, etc.).
- Run and inspect the dataset:
  - Start a run manually or via the Apify API.
  - Watch logs in real time (requests, HTTP codes, time per page).
  - Once finished, open the dataset: each item contains URL, title, text/Markdown content, and other fields.
  - Export as JSON, CSV, or NDJSON, or connect programmatically with the Apify Python/JavaScript client.
- Clean, chunk, and embed:
  - For many RAG use cases, the Website Content Crawler’s Markdown is already clean enough to embed.
  - If you need more control over tokens, pass the dataset through a token-aware cleaner (e.g., a “ready-data-cleaner” Actor that strips boilerplate, chunks by semantics, and provides token counts).
  - Send the resulting chunks to your embedding service and store them in your vector database (Pinecone, pgvector, Qdrant, etc.).
  - Keep the pipeline fresh by scheduling the crawl Actor on Apify and re-embedding changed pages only.
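The configure-and-run steps above can be sketched with the Apify Python client. This is a minimal sketch, not the Actor's documented configuration: `startUrls` comes from the steps above, while `maxCrawlPages`, `maxCrawlDepth`, and `excludeUrlGlobs` are assumed input field names that should be checked against the Actor's input schema. The helper names `build_run_input` and `run_crawl` and the `APIFY_TOKEN` environment variable are hypothetical.

```python
import os


def build_run_input(start_urls, max_pages=200, max_depth=5, exclude_globs=()):
    """Build the Actor input dict.

    Field names other than "startUrls" are assumptions; verify them against
    the Actor's input schema before relying on them.
    """
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxCrawlPages": max_pages,
        "maxCrawlDepth": max_depth,
        "excludeUrlGlobs": [{"glob": glob} for glob in exclude_globs],
    }


def run_crawl(run_input):
    """Start the Actor, wait for the run to finish, and return dataset items."""
    # Imported lazily so the input builder works without the client installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor("apify/website-content-crawler").call(run_input=run_input)
    return client.dataset(run["defaultDatasetId"]).list_items().items


run_input = build_run_input(
    ["https://example.com/docs"],
    exclude_globs=["https://example.com/login/**", "https://example.com/cart/**"],
)
# items = run_crawl(run_input)  # requires APIFY_TOKEN and network access
```

Keeping the input builder separate from the API call makes the crawl configuration easy to review and test without starting a run.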
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Website crawling with unblocking | Follows links from your seed URLs, respects robots.txt, and uses Apify’s proxies and unblocking engine where needed. | You get broad coverage of a site (including JS-heavy pages) without fighting CAPTCHAs and rate limits yourself. |
| Clean text / Markdown extraction | Strips boilerplate HTML (nav, sidebars, ads), normalizes headers and lists, and outputs structured text or Markdown. | Embedding models see only the content that matters, so you spend fewer tokens and get better retrieval quality. |
| Dataset + API delivery | Stores results as Apify datasets with URLs, titles, content, and metadata; export or access via API/SDK. | Plug directly into your embedding/RAG pipeline without custom glue—just iterate over dataset items. |
Ideal Use Cases
- Best for documentation and help-center RAG: Because Website Content Crawler is tuned for “main text content,” it does a good job on docs portals, knowledge bases, and FAQ sites, producing clean Markdown sections and headers.
- Best for blog, marketing, and knowledge content RAG: Because you can filter to specific URL patterns (e.g., `/blog/`, `/learn/`, `/guides/`), you can create a high-signal corpus for your RAG app without pulling in legal pages, signup flows, or user-generated comments.
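If some off-topic pages still end up in the dataset, the same URL-pattern idea can be applied as a post-filter before embedding. The sketch below is illustrative only: `KEEP_PREFIXES`, `is_high_signal`, and `filter_items` are hypothetical names, and it assumes each dataset item has a `url` field.

```python
from urllib.parse import urlparse

# Path prefixes to keep; these mirror the examples in the text.
KEEP_PREFIXES = ("/blog/", "/learn/", "/guides/")


def is_high_signal(url, prefixes=KEEP_PREFIXES):
    """Return True when the URL path starts with one of the kept prefixes."""
    return urlparse(url).path.startswith(prefixes)


def filter_items(items, prefixes=KEEP_PREFIXES):
    """Keep only dataset items whose URL matches a kept prefix."""
    return [item for item in items if is_high_signal(item["url"], prefixes)]


items = [
    {"url": "https://example.com/blog/rag-basics", "text": "..."},
    {"url": "https://example.com/legal/terms", "text": "..."},
    {"url": "https://example.com/guides/setup", "text": "..."},
]
kept = filter_items(items)
```

Filtering at crawl time saves crawl cost; a post-filter like this is a cheap second line of defense for the embedding step.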
Limitations & Considerations
- Highly dynamic or gated content: If content is behind logins, paywalls, or complex interactions (multi-step forms), a generic crawler might not reach it. In those cases, build or commission a custom Actor (Playwright/Puppeteer-based) on Apify that performs the exact login/click flows and then extracts content.
- RAG-specific chunking strategy: Website Content Crawler gets you clean page-level content in Markdown. For optimal RAG quality, you still need to choose a chunking strategy (by heading, paragraphs, or semantic units) and an update policy (how often to recrawl and re-embed). Use a token-aware cleaner or your own post-processing to align with your model’s context window.
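One possible chunking strategy is to split each page at Markdown headings so every chunk carries its own section title. The following is a minimal sketch under that assumption, not the crawler's or any cleaner's built-in behavior; `split_by_heading` is a hypothetical helper, and it only recognizes `#`-style headings outside fenced code.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code fence marker


def split_by_heading(markdown):
    """Split Markdown into sections, one per heading.

    Returns a list of {"heading": str, "text": str}. Text that appears
    before the first heading is stored under an empty heading.
    """
    sections = []
    heading, lines = "", []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            body = "\n".join(lines).strip()
            if body or heading:
                sections.append({"heading": heading, "text": body})
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    body = "\n".join(lines).strip()
    if body or heading:
        sections.append({"heading": heading, "text": body})
    return sections


page = "# Getting started\n\nInstall the CLI.\n\n## Configure\n\nSet your token."
sections = split_by_heading(page)
```

Sections that are still too long for your model's context window can be split further by paragraph or by a token budget.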
Pricing & Plans
You don’t pay separately for Website Content Crawler itself—it runs on top of Apify’s platform resources (compute units, storage, and data transfer). Pricing depends on how much you crawl and how often you run your Actors.
Typical pattern for a RAG crawler:
- Crawl the site fully once to build your initial corpus.
- Schedule smaller incremental crawls (daily/weekly) to catch updates.
- Keep an eye on pages crawled, bandwidth, and CPU time in Apify Console.
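Incremental crawls pay off most when unchanged pages are not re-embedded. One simple way to detect changes, sketched here as an assumption rather than a built-in Apify feature, is to keep a content hash per URL and compare it on each run. The names `content_hash` and `pages_to_reembed` are hypothetical, and the in-memory dict stands in for whatever store you use.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reembed(items, known_hashes):
    """Return items whose text is new or changed.

    known_hashes maps URL -> last seen hash and is updated in place.
    """
    changed = []
    for item in items:
        digest = content_hash(item["text"])
        if known_hashes.get(item["url"]) != digest:
            changed.append(item)
            known_hashes[item["url"]] = digest
    return changed


known = {}
first = pages_to_reembed([{"url": "https://example.com/a", "text": "v1"}], known)
second = pages_to_reembed([{"url": "https://example.com/a", "text": "v1"}], known)
third = pages_to_reembed([{"url": "https://example.com/a", "text": "v2"}], known)
```

In this example the first and third calls return the page (new, then changed), while the second returns nothing because the text is identical.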
Two common ways to structure this:
- Self-service plan: Best for developers and small teams needing to run Website Content Crawler and a few other Actors as part of a single RAG app. You manage your own Actors, schedules, and embedding pipeline, paying only for the platform usage.
- Enterprise / Professional Services: Best for teams needing guaranteed SLAs, custom crawling logic, or ongoing maintenance. Apify’s Professional Services can build and maintain a custom “RAG-ready” crawler Actor for your sites and third-party sources, including unblocking, monitoring, and integration with your existing vector DB and LLM stack.
You can talk to the sales team about exact volume-based pricing and what makes sense for your pipeline.
Frequently Asked Questions
How do I go from a Website Content Crawler run to embeddings in practice?
Short Answer: Export the dataset (or stream it via API), iterate over each item’s text/Markdown field, apply your chunking logic, then call your embedding API and write results to your vector DB.
Details: Once the Actor run is finished, you’ll have a dataset with records like:
    {
      "url": "https://example.com/docs/getting-started",
      "title": "Getting started",
      "text": "# Getting started\n\nThis guide explains…"
    }
From there:
- Use the Apify Python or JavaScript client to fetch the dataset:
  - Python: `client.dataset("dataset-id").list_items()`
  - JS: `apifyClient.dataset("dataset-id").listItems()`
- For each item:
  - Parse the Markdown if you want to chunk by headings.
  - Split content into chunks (e.g., 500–1000 tokens) with overlap.
  - Attach metadata: `url`, `title`, `heading`, `crawl_timestamp`.
- Call your embedding API (OpenAI, Cohere, local model, etc.) on each chunk.
- Write `embedding + metadata` to your vector DB (Pinecone, pgvector, Qdrant, Weaviate, Elasticsearch, etc.).
- Store the original dataset ID and page URL so you can trace each vector back to the source and re-crawl selectively later.
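The split-and-attach-metadata steps might look like the sketch below. It counts whitespace-separated words as a rough proxy for tokens, which is an assumption: swap in your model's tokenizer for accurate budgets. `chunk_words`, `item_to_records`, and the `chunk_index` field are hypothetical names introduced for illustration.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def item_to_records(item, crawl_timestamp, **chunk_kwargs):
    """Turn one dataset item into chunk records ready for embedding."""
    return [
        {
            "url": item["url"],
            "title": item.get("title", ""),
            "chunk_index": index,
            "crawl_timestamp": crawl_timestamp,
            "text": chunk,
        }
        for index, chunk in enumerate(chunk_words(item["text"], **chunk_kwargs))
    ]


item = {
    "url": "https://example.com/docs/getting-started",
    "title": "Getting started",
    "text": " ".join(f"w{i}" for i in range(10)),
}
records = item_to_records(item, "2024-01-01T00:00:00Z", max_words=4, overlap=1)
```

Each record keeps the source URL next to the chunk text, so a retrieved vector can always be traced back to its page.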
You can run this as a separate step (e.g., a CI pipeline or a serverless job) each time the Actor completes, or chain it with webhooks so embeddings are generated automatically after a successful run.
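For the webhook route, the receiving service mainly needs to pull the dataset ID out of the notification before fetching items. The sketch below assumes a JSON payload with an `eventType` of `ACTOR.RUN.SUCCEEDED` and a `resource.defaultDatasetId` field. Treat that layout as an assumption and confirm it against the webhook payload you actually configure. `dataset_id_from_webhook` is a hypothetical helper.

```python
import json


def dataset_id_from_webhook(body):
    """Return the run's default dataset ID for a successful run, else None.

    Assumes the payload carries "eventType" and "resource.defaultDatasetId";
    adjust the keys if your webhook payload template differs.
    """
    payload = json.loads(body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    return (payload.get("resource") or {}).get("defaultDatasetId")


sample = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "resource": {"id": "run-123", "defaultDatasetId": "dataset-456"},
})
dataset_id = dataset_id_from_webhook(sample)
```

Returning `None` for anything other than a successful run keeps failed or aborted runs from triggering an embedding job.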
Can I integrate this with LangChain or LlamaIndex for RAG?
Short Answer: Yes. Use Apify’s API to pull content datasets and feed them directly into LangChain or LlamaIndex document loaders before building your retrievers.
Details: Website Content Crawler and token-cleaner Actors are designed with the LLM ecosystem in mind:
- They output clean text or Markdown, which maps naturally to LangChain/LlamaIndex document objects.
- You get URLs and titles as metadata for source attribution and filtering.
- The pipeline is “API-first,” so you can:
  - Trigger the crawl Actor from a LangChain workflow.
  - Poll or subscribe to Actor runs via webhooks.
  - Ingest new dataset items into your LLM pipeline on completion.
A typical pattern:
- Run Website Content Crawler on Apify (manually, scheduled, or triggered from your app).
- When the run finishes, your backend/agent receives a webhook.
- It fetches the dataset via Apify’s HTTP API.
- Wrap each item as a LangChain `Document` or LlamaIndex `Node` with:
  - `page_content`: the Markdown or cleaned text.
  - `metadata`: `url`, `title`, and maybe `section`/`category`.
- Use LangChain/LlamaIndex to:
  - Chunk content.
  - Create embeddings.
  - Build or update your vector index and retriever.
Since the output is just text + metadata, you’re not locked into any single RAG framework; it works equally well with homegrown embeddings code, LangChain, LlamaIndex, or other libraries.
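To illustrate the wrapping step without tying the example to one framework, the sketch below uses a small stand-in class, `Doc`, with the same `page_content` and `metadata` fields described above. `Doc` and `items_to_docs` are hypothetical names, and the item fields (`url`, `title`, `text`) follow the sample record shown earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the two fields described above (not a framework class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def items_to_docs(items, doc_cls=Doc):
    """Wrap dataset items as document objects, skipping empty pages."""
    docs = []
    for item in items:
        text = item.get("text") or ""
        if not text.strip():
            continue  # nothing was extracted for this page
        docs.append(
            doc_cls(
                page_content=text,
                metadata={"url": item["url"], "title": item.get("title", "")},
            )
        )
    return docs


items = [
    {"url": "https://example.com/docs/a", "title": "A", "text": "# A\n\nBody"},
    {"url": "https://example.com/docs/empty", "title": "Empty", "text": "  "},
]
docs = items_to_docs(items)
```

With LangChain installed, passing `doc_cls=Document` (from `langchain_core.documents`) should work because that class accepts the same two keyword arguments. LlamaIndex names its text field differently, so it would need a small adapter.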
Summary
If you’re building a RAG pipeline and still scraping HTML by hand, you’re solving the hardest part of the problem yourself. Apify’s Website Content Crawler gives you a production-ready way to crawl sites, extract boilerplate-free content, and get clean text or Markdown that’s ready for embeddings. You stay focused on chunking, embeddings, and retrieval quality; Apify handles crawling, proxies, unblocking, and monitoring.
Whether you’re indexing your own docs/help center or building an AI agent on top of third-party sites, turning “URLs” into “structured, token-efficient datasets” is a solved problem—you just need to wire the Actor output into your embedding step and vector database.