
How do I use Apify’s Website Content Crawler to generate clean text/markdown for LangChain/LlamaIndex and a vector DB?
Apify’s Website Content Crawler is one of the fastest ways I know to go from “we need content from this site in our RAG stack” to “we have clean Markdown flowing into LangChain/LlamaIndex and a vector database.” You get crawling, unblocking, HTML cleaning, and Markdown output handled for you, so your code can stay focused on embeddings, indexes, and queries.
Quick Answer: Use the Website Content Crawler Actor to crawl your target URLs, configure it to output cleaned text/Markdown, then pull the dataset via Python/JavaScript. From there, feed each item into LangChain or LlamaIndex document objects, compute embeddings, and upsert into your vector DB (Pinecone, Qdrant, pgvector, etc.).
The Quick Overview
- What It Is: A managed crawling Actor in the Apify Store that extracts text content from websites, cleans HTML, and outputs Markdown plus structured fields, ready for AI pipelines.
- Who It Is For: Engineers and data teams building RAG pipelines, AI agents, and search over web content who don’t want to maintain scraping infra, proxies, or brittle selectors.
- Core Problem Solved: Getting reliable, clean, chunkable text from arbitrary websites without having to own crawling, unblocking, and HTML post-processing yourself.
How It Works
At a high level, Website Content Crawler turns a list of URLs, sitemaps, or domains into a dataset of cleaned, structured page records. Each record typically includes:
- URL, title, and metadata
- Clean text or Markdown content
- Optional raw HTML / additional metadata
You run it as an Actor on Apify’s platform. Apify takes care of:
- HTTP requests and crawling
- Proxies and unblocking
- JavaScript rendering where needed
- HTML cleaning and Markdown generation
- Dataset storage and export (JSON, CSV, etc.)
From there, you plug the dataset into LangChain or LlamaIndex, chunk it, compute embeddings, and write into your vector DB.
A standard workflow looks like this:
- Configure & run Website Content Crawler:
  - Choose what to crawl (URLs, sitemaps, domain).
  - Set rules for which pages to include/exclude.
  - Enable cleaned text/Markdown output.
  - Run the Actor in Apify Console or via API.
- Fetch the dataset for your AI pipeline:
  - Download the dataset in JSON.
  - Or stream it into Python/JavaScript via Apify’s official clients or HTTP API.
  - Optionally plug into tools like LangChain or LlamaIndex directly.
- Chunk, embed, and store in a vector DB:
  - Turn each scraped item into a document object.
  - Chunk content using semantic or token-based strategies.
  - Generate embeddings and upsert to Pinecone, Qdrant, pgvector, or similar.
Step 1: Run Website Content Crawler on Apify
1.1 Find and open the Actor
- Go to the Apify Store.
- Search for “Website Content Crawler” or go directly to apify/website-content-crawler.
- Click Try for free to open the Actor in Apify Console.
This Actor is built specifically to:
- Crawl websites at scale.
- Extract text content suitable for AI models.
- Output clean Markdown that works well with LangChain, LlamaIndex, and vector databases.
1.2 Configure the input
On the Actor’s Input tab, you’ll see parameters like:
- Start URLs / Sitemaps / Domain:
  - Use startUrls for a few specific pages.
  - Use sitemaps to cover a large site systematically.
  - Use crawlerType + link selectors for domain-wide crawling.
- Content selection:
  - CSS/XPath selectors to grab main content areas (if needed).
  - Exclusion rules for navigation, footers, etc.
- Output options (key for AI use):
  - Enable Markdown or plain text output.
  - Keep title, URL, and metadata.
  - Optionally keep raw HTML if you want to post-process further.
For a first test, I usually:
- Add 1–3 URLs in startUrls.
- Leave defaults for content selection.
- Enable Markdown output.
- Limit the max pages (e.g., 50) to keep the run quick.
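The same first-test setup can also be started from code instead of the Console. The sketch below is a minimal version under stated assumptions: `build_run_input` and `run_crawler` are helper names of my own, and the input field names `maxCrawlPages` and `saveMarkdown` are my assumption about the Actor's input schema, so check them against the Input tab before relying on them. It also assumes the Python client returns the finished run as a dict with a `defaultDatasetId` key.

```python
from typing import Any


def build_run_input(urls: list[str], max_pages: int = 50) -> dict[str, Any]:
    """Build a small test input for the crawler (field names are assumptions)."""
    return {
        "startUrls": [{"url": u} for u in urls],
        "maxCrawlPages": max_pages,  # assumed name of the page-limit field
        "saveMarkdown": True,        # assumed name of the Markdown toggle
    }


def run_crawler(token: str, run_input: dict[str, Any]) -> str:
    """Start the Actor, wait for it to finish, and return its dataset ID."""
    # Imported lazily so build_run_input works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return run["defaultDatasetId"]
```

Calling `run_crawler(token, build_run_input(["https://example.com/docs"]))` should block until the run finishes and hand back the dataset ID used in the later steps.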
1.3 Run the Actor and inspect the dataset
- Click Run.
- Watch logs in the Runs view for errors or blocking.
- After completion, open the Dataset tab:
You’ll see items with fields like:
{
  "url": "https://example.com/blog/post-1",
  "title": "My Example Post",
  "markdown": "# My Example Post\n\nThis is the cleaned content…",
  "text": "My Example Post\n\nThis is the cleaned content…",
  "language": "en",
  "metadata": {
    "description": "SEO description…",
    "keywords": ["example", "blog"],
    "author": "…"
  }
}
This is exactly what you want for LangChain/LlamaIndex: clean text/Markdown plus URL and metadata.
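Since not every page is guaranteed to carry every field, it can help to normalize items once before handing them to a framework. The sketch below assumes the item shape shown in the sample above; `item_to_record` is a hypothetical helper name, and your dataset may have additional or missing fields.

```python
from typing import Any, Optional


def item_to_record(item: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return {'content', 'metadata'} for one dataset item, or None if it has no text."""
    # Prefer Markdown, fall back to plain text, and drop whitespace-only pages.
    content = (item.get("markdown") or item.get("text") or "").strip()
    if not content:
        return None
    page_meta = item.get("metadata") or {}
    return {
        "content": content,
        "metadata": {
            "source": item.get("url"),
            "title": item.get("title"),
            "language": item.get("language"),
            "description": page_meta.get("description"),
        },
    }
```

Keeping the metadata flat (strings only) tends to be the safest choice, because several vector DBs restrict metadata values to simple types.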
Step 2: Access the dataset for LangChain/LlamaIndex
Once your Actor run finishes, the dataset is accessible at a stable URL and via all Apify APIs/SDKs.
2.1 Get dataset ID / API URL
In the run’s Dataset view:
- The URL will look like: https://api.apify.com/v2/datasets/xxxxxxxxx/items?format=json
- The dataset ID is the xxxxxxxxx part.
You’ll use this in code. You can also export manually from the UI as JSON if you want to prototype quickly.
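If you would rather not install a client, the items endpoint above can be read with the standard library alone. This is a rough sketch, not the official client: `dataset_items_url` and `fetch_all_items` are names I made up, and it assumes the endpoint accepts `offset` and `limit` query parameters for paging and a bearer token in the `Authorization` header.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, offset: int = 0, limit: int = 1000) -> str:
    """Build the items URL for one page of results."""
    query = urlencode({"format": "json", "offset": offset, "limit": limit})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


def fetch_all_items(dataset_id: str, token: str, page_size: int = 1000) -> list[dict]:
    """Page through the dataset until an empty page comes back."""
    items: list[dict] = []
    offset = 0
    while True:
        request = Request(
            dataset_items_url(dataset_id, offset, page_size),
            headers={"Authorization": f"Bearer {token}"},
        )
        with urlopen(request) as response:
            page = json.load(response)
        if not page:
            return items
        items.extend(page)
        offset += len(page)
```

For anything beyond a quick prototype, the official clients shown next are the more comfortable route, since they handle paging and retries for you.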
2.2 Python example: Fetch items for LangChain
Install the client:
pip install apify-client
Then:
from apify_client import ApifyClient
APIFY_TOKEN = "YOUR_APIFY_TOKEN"
DATASET_ID = "your-dataset-id"
client = ApifyClient(APIFY_TOKEN)
items = list(client.dataset(DATASET_ID).iterate_items())
print(f"Loaded {len(items)} pages")
# Example of accessing fields:
print(items[0]["url"])
print(items[0].get("markdown") or items[0].get("text"))
2.3 JavaScript/TypeScript example
npm install apify-client
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({
  token: process.env.APIFY_TOKEN,
});
const datasetId = 'your-dataset-id';
const { items } = await client.dataset(datasetId).listItems();
console.log(`Loaded ${items.length} pages`);
console.log(items[0].url);
console.log(items[0].markdown ?? items[0].text);
Both approaches give you a list/array of page objects ready to feed into LangChain or LlamaIndex.
Step 3: Use Website Content Crawler output in LangChain
3.1 Convert items to LangChain documents (Python)
For LangChain, you typically create Document objects with page_content and metadata.
pip install apify-client langchain langchain-openai langchain-pinecone
Example skeleton:
from apify_client import ApifyClient
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec
APIFY_TOKEN = "YOUR_APIFY_TOKEN"
DATASET_ID = "your-dataset-id"
client = ApifyClient(APIFY_TOKEN)
items = list(client.dataset(DATASET_ID).iterate_items())
docs = []
for item in items:
    # Prefer Markdown, fall back to plain text, skip empty pages
    content = item.get("markdown") or item.get("text")
    if not content:
        continue
    metadata = {
        "source": item.get("url"),
        "title": item.get("title"),
        "language": item.get("language"),
    }
    docs.append(Document(page_content=content, metadata=metadata))
print(f"Prepared {len(docs)} documents")
3.2 Chunk the content
Recursive splitting works well with Markdown:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(docs)
print(f"After chunking: {len(chunks)} chunks")
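Because the crawler output is Markdown, one optional refinement is to cut pages at headings first and only then apply the character-based splitter to each section. The sketch below is a plain-Python illustration of that idea, not something the Actor or LangChain does for you; `split_by_headings` is a hypothetical helper. LangChain also ships a `MarkdownHeaderTextSplitter` if you prefer a library version.

```python
import re

# Matches ATX-style headings ("# ", "## ", ... up to six #) at the start of a line.
HEADING_RE = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING_RE.finditer(markdown)]
    # Keep any text that appears before the first heading as its own section.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]
```

One known limitation of this sketch: a line starting with `# ` inside a fenced code block (a Python comment, for example) is treated as a heading, so pages with lots of code may need a smarter parser.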
3.3 Embed and store in a vector DB (Pinecone example)
# Initialize embeddings model (replace with your provider); reads OPENAI_API_KEY from the environment
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Initialize Pinecone
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index_name = "website-content"
# Create the index on first use
if index_name not in [idx["name"] for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=1536,  # match your embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(index_name)
# Upsert chunks into Pinecone using the LangChain vector store
# (PineconeVectorStore reads PINECONE_API_KEY from the environment)
from langchain_pinecone import PineconeVectorStore
vectorstore = PineconeVectorStore.from_documents(
    chunks, embeddings, index_name=index_name
)
Now your Website Content Crawler output is queryable via LangChain:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
# Chains are invoked with a dict; calling the chain object directly is deprecated
result = qa_chain.invoke({"query": "What pricing tiers are mentioned on the site?"})
print(result["result"])
for doc in result["source_documents"]:
    print("Source:", doc.metadata["source"])
Step 4: Use Website Content Crawler output in LlamaIndex
LlamaIndex works similarly: you wrap each item into a Document, then build an index.
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
4.1 Build documents from the dataset
from apify_client import ApifyClient
from llama_index.core import Document as LlamaDocument
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
APIFY_TOKEN = "YOUR_APIFY_TOKEN"
DATASET_ID = "your-dataset-id"
client = ApifyClient(APIFY_TOKEN)
items = list(client.dataset(DATASET_ID).iterate_items())
documents = []
for item in items:
    # Prefer Markdown, fall back to plain text, skip empty pages
    content = item.get("markdown") or item.get("text")
    if not content:
        continue
    metadata = {
        "source": item.get("url"),
        "title": item.get("title"),
        "language": item.get("language"),
    }
    documents.append(LlamaDocument(text=content, metadata=metadata))
print(f"Prepared {len(documents)} documents")
4.2 Parse into nodes and build a vector index
# Global Settings replace ServiceContext, which recent llama-index releases no longer provide
Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Split documents into nodes (chunks), then index those nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the main features described on the site.")
print(response)
Under the hood, LlamaIndex will handle chunking (nodes), embeddings, and index building. Your Apify dataset is simply the content source.
Step 5: Keep your vector DB in sync with fresh crawls
Once your pipeline works end-to-end, you’ll likely want to refresh content regularly.
5.1 Schedule Website Content Crawler runs
In Apify Console:
- Go to the Actor run configuration.
- Click Schedules.
- Create a schedule (e.g., every 6 hours, daily).
- Point it at your Actor with your chosen input.
Apify will:
- Trigger new runs on schedule.
- Store each run’s dataset separately.
- Handle crawling, proxies, and monitoring.
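Since each scheduled run writes to its own dataset, ingestion code needs a way to find the newest one. The sketch below assumes the Python client exposes `actor(...).last_run(status=...)` and `dataset().iterate_items()` as used here; `latest_successful_items` is a helper name of my own. The client is passed in as an argument so the function is easy to test with a stub.

```python
from typing import Any

ACTOR_ID = "apify/website-content-crawler"


def latest_successful_items(client: Any, actor_id: str = ACTOR_ID) -> list[dict]:
    """Return all items from the dataset of the Actor's most recent successful run."""
    # `client` is expected to be an apify_client.ApifyClient instance.
    last_run = client.actor(actor_id).last_run(status="SUCCEEDED")
    return list(last_run.dataset().iterate_items())
```

Note that this picks the latest successful run of the Actor in your account, whatever started it. If you run the crawler with several different inputs, you would probably want to track run IDs per schedule instead.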
5.2 Incremental updates in your ingestion code
You have a few options:
- Per-run ingestion: For each new run, use the run’s dataset ID and upsert all chunks into your vector DB (idempotency via id = hash(url + slug + index)).
- Use createdAt timestamps: Only process items newer than your last ingestion checkpoint.
- Soft-delete outdated chunks: Maintain a mapping from URL → vector DB IDs, so you can remove or update embeddings when a page changes.
Because Apify datasets are immutable per run, you can rerun ingestion if something breaks, and you always know the source of truth for that batch.
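The idempotent-ID and URL-to-IDs ideas from the options above can be sketched in a few lines. This is an illustration under assumptions: `chunk_id` and `plan_sync` are hypothetical helpers, the ID is derived from the URL and chunk index only, and the mapping lives in a plain dict where a real pipeline would persist it. It uses `hashlib` because Python's built-in `hash()` is salted per process for strings, so it does not produce stable IDs across runs.

```python
import hashlib


def chunk_id(url: str, index: int) -> str:
    """Deterministic ID for the index-th chunk of a page."""
    raw = f"{url}#{index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


def plan_sync(
    previous: dict[str, list[str]], url: str, n_chunks: int
) -> tuple[list[str], list[str]]:
    """Return (ids_to_upsert, ids_to_delete) for one page and update the mapping."""
    new_ids = [chunk_id(url, i) for i in range(n_chunks)]
    keep = set(new_ids)
    # IDs stored for this URL last time that the new version no longer produces.
    stale = [old for old in previous.get(url, []) if old not in keep]
    previous[url] = new_ids
    return new_ids, stale
```

Re-ingesting the same page overwrites the same IDs, and when a page shrinks from three chunks to two, `plan_sync` reports the third ID as stale so it can be deleted from the vector DB.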
Common configuration tips for clean AI-ready text
From actually using Website Content Crawler in RAG pipelines, a few settings matter a lot:
- Prefer Markdown over plain text: Headings, lists, and code blocks survive, which helps both chunking and answer quality.
- Tune content selectors for complex sites: If a site wraps main content in .article-body or [data-test=content], set that as the selector to avoid nav/footer noise.
- Limit depth and follow rules: Use link include/exclude patterns to avoid crawling tags, archives, or irrelevant sections.
- Respect robots/limits: The Actor supports polite crawling patterns; keep them enabled unless you have explicit permission to be more aggressive.
- Combine with a cleaner if needed: If you still see boilerplate, pair this with a dedicated cleaner like ready-data-cleaner from the Apify Store to strip nav/boilerplate and chunk by semantics/token count.
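As a safety net for the include/exclude tip, I sometimes also filter on the ingestion side, so that tag or archive pages that slipped through a crawl never reach the embedding step. The sketch below is a client-side complement to the Actor's own rules, not a replacement for them; the patterns and the `keep_item` helper are hypothetical examples.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; adjust them to the URL layout of the site you crawl.
EXCLUDE_PATTERNS = ["*/tag/*", "*/archive/*", "*/page/*"]


def keep_item(item: dict, exclude: list[str] = EXCLUDE_PATTERNS) -> bool:
    """Return True if the item has a URL that matches none of the exclude patterns."""
    url = item.get("url") or ""
    return bool(url) and not any(fnmatchcase(url, pattern) for pattern in exclude)
```

Applied as `items = [i for i in items if keep_item(i)]` right after loading the dataset, this keeps the filter in one place and leaves the rest of the pipeline unchanged.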
Why use Apify for this instead of rolling your own crawler?
If you’ve tried to DIY this with Playwright/Scrapy, you already know the pain:
- Proxies. Unblocking. Headless browsers.
- Monitoring, retries, and rate-limiting.
- HTML cleaning and content detection.
- Storage, exports, scheduling, and API access.
Website Content Crawler + Apify gives you:
- Crawling at scale: Proxies, unblocking, and JS rendering included.
- AI-ready output: Cleaned HTML → Markdown, ready for LangChain/LlamaIndex, vector DBs, and RAG pipelines.
- Operational stack included: 99.95% uptime, monitoring, logs, retries, scheduling, datasets.
- Programmatic access: Official Python/JavaScript clients, HTTP API, OpenAPI, MCP clients.
You spend your time on embeddings, vector schemas, and prompts, not on keeping scrapers alive.
Summary
To generate clean text/Markdown for LangChain, LlamaIndex, and your vector database using Apify’s Website Content Crawler:
- Run the Website Content Crawler Actor on your target URLs or domain, configured to output Markdown.
- Fetch the resulting dataset via the Apify API or clients (Python/JS).
- Wrap items into documents, chunk them, and compute embeddings.
- Upsert chunks into a vector DB like Pinecone, Qdrant, or pgvector.
- Schedule recurring runs and build a small ingestion loop to keep your index in sync.
You get stable, AI-ready content from the web without owning crawling infrastructure, and you can plug it directly into LangChain, LlamaIndex, and your RAG stack.