How do I use Apify’s Website Content Crawler to generate clean text/markdown for LangChain/LlamaIndex and a vector DB?

Apify’s Website Content Crawler is one of the fastest ways I know to go from “we need content from this site in our RAG stack” to “we have clean Markdown flowing into LangChain/LlamaIndex and a vector database.” You get crawling, unblocking, HTML cleaning, and Markdown output handled for you, so your code can stay focused on embeddings, indexes, and queries.

Quick Answer: Use the Website Content Crawler Actor to crawl your target URLs, configure it to output cleaned text/Markdown, then pull the dataset via Python/JavaScript. From there, feed each item into LangChain or LlamaIndex document objects, compute embeddings, and upsert into your vector DB (Pinecone, Qdrant, pgvector, etc.).

The Quick Overview

What It Is: A managed crawling Actor in the Apify Store that extracts text content from websites, cleans HTML, and outputs Markdown plus structured fields, ready for AI pipelines.
Who It Is For: Engineers and data teams building RAG pipelines, AI agents, and search over web content who don’t want to maintain scraping infra, proxies, or brittle selectors.
Core Problem Solved: Getting reliable, clean, chunkable text from arbitrary websites without having to own crawling, unblocking, and HTML post-processing yourself.

How It Works

At a high level, Website Content Crawler turns a list of URLs, sitemaps, or domains into a dataset of cleaned, structured page records. Each record typically includes:

URL, title, and metadata
Clean text or Markdown content
Optional raw HTML / additional metadata

You run it as an Actor on Apify’s platform. Apify takes care of:

HTTP requests and crawling
Proxies and unblocking
JavaScript rendering where needed
HTML cleaning and Markdown generation
Dataset storage and export (JSON, CSV, etc.)

From there, you plug the dataset into LangChain or LlamaIndex, chunk it, compute embeddings, and write into your vector DB.

A standard workflow looks like this:

Configure & run Website Content Crawler:
- Choose what to crawl (URLs, sitemaps, domain).
- Set rules for which pages to include/exclude.
- Enable cleaned text/Markdown output.
- Run the Actor in Apify Console or via API.
Fetch the dataset for your AI pipeline:
- Download the dataset in JSON.
- Or stream it into Python/JavaScript via Apify’s official clients or HTTP API.
- Optionally plug into tools like LangChain or LlamaIndex directly.
Chunk, embed, and store in a vector DB:
- Turn each scraped item into a document object.
- Chunk content using semantic or token-based strategies.
- Generate embeddings and upsert to Pinecone, Qdrant, pgvector, or similar.

Step 1: Run Website Content Crawler on Apify

1.1 Find and open the Actor

Go to the Apify Store.
Search for “Website Content Crawler” or go directly to apify/website-content-crawler.
Click Try for free to open the Actor in Apify Console.

This Actor is built specifically to:

Crawl websites at scale.
Extract text content suitable for AI models.
Output clean Markdown that works well with LangChain, LlamaIndex, and vector databases.

1.2 Configure the input

On the Actor’s Input tab, you’ll see parameters like:

Start URLs / Sitemaps / Domain:
- Use startUrls for a few specific pages.
- Use sitemaps to cover a large site systematically.
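- For domain-wide crawling, start from the site's root URL and let the crawler follow links; crawlerType picks the crawling engine (headless browser vs. plain HTTP), and link include/exclude patterns keep the crawl on target.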
Content selection:
- CSS/XPath selectors to grab main content areas (if needed).
- Exclusion rules for navigation, footers, etc.
Output options (key for AI use):
- Enable Markdown or plain text output.
- Keep title, URL, and metadata.
- Optionally keep raw HTML if you want to post-process further.

For a first test, I usually:

Add 1–3 URLs in startUrls.
Leave defaults for content selection.
Enable Markdown output.
Limit the max pages (e.g., 50) to keep the run quick.
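
If you'd rather kick off that first test from code than from the Console, here is a minimal sketch. The input field names (startUrls, maxCrawlPages, saveMarkdown) are my assumptions about the Actor's input schema, so check them against the Input tab before relying on them. The client argument is expected to be an ApifyClient instance; it is passed in so the input builder can be used on its own.

```python
from typing import Any


def build_run_input(urls: list[str], max_pages: int = 50) -> dict[str, Any]:
    """Build a small test input for the Website Content Crawler Actor.

    Field names here are assumptions; confirm them on the Actor's Input tab.
    """
    if not urls:
        raise ValueError("at least one start URL is required")
    return {
        "startUrls": [{"url": u} for u in urls],
        "maxCrawlPages": max_pages,  # keep the first run quick
        "saveMarkdown": True,        # ask for Markdown output
    }


def run_crawler(client: Any, run_input: dict[str, Any]) -> str:
    """Start the Actor, wait for it to finish, and return its dataset ID.

    `client` is expected to be an apify_client.ApifyClient instance.
    """
    run = client.actor("apify/website-content-crawler").call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return run["defaultDatasetId"]
```

Typical usage would be run_crawler(ApifyClient(token), build_run_input(["https://example.com/docs"], max_pages=50)), which gives you the dataset ID used in Step 2.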

1.3 Run the Actor and inspect the dataset

Click Run.
Watch logs in the Runs view for errors or blocking.
After completion, open the Dataset tab:

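You’ll see items with fields like these (illustrative; exact field names and nesting can differ between Actor versions, so check a real item from your own run):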

{
  "url": "https://example.com/blog/post-1",
  "title": "My Example Post",
  "markdown": "# My Example Post\n\nThis is the cleaned content…",
  "text": "My Example Post\n\nThis is the cleaned content…",
  "language": "en",
  "metadata": {
    "description": "SEO description…",
    "keywords": ["example", "blog"],
    "author": "…"
  }
}

This is exactly what you want for LangChain/LlamaIndex: clean text/Markdown plus URL and metadata.
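
Before wiring items into a framework, I find it useful to normalize them into one predictable shape. The sketch below assumes the fields from the sample item above and also looks for a title and language code nested under metadata, in case your dataset puts them there. The nested names title and languageCode are assumptions, not something the sample confirms.

```python
from typing import Any, Optional


def item_to_record(item: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return {"content": ..., "metadata": {...}}, or None if the page has no text."""
    content = item.get("markdown") or item.get("text")
    if not content or not content.strip():
        return None
    nested = item.get("metadata") or {}
    return {
        "content": content.strip(),
        "metadata": {
            "source": item.get("url"),
            # Look at the top level first, then inside "metadata" (assumed names).
            "title": item.get("title") or nested.get("title"),
            "language": item.get("language") or nested.get("languageCode"),
        },
    }


records = [
    record
    for record in map(item_to_record, [
        {"url": "https://example.com/a", "title": "A", "markdown": "# A\n\nBody"},
        {"url": "https://example.com/empty", "markdown": "   "},
    ])
    if record is not None
]
```

Preferring Markdown and falling back to plain text mirrors what the later LangChain and LlamaIndex snippets do, so the same record can feed either one.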

Step 2: Access the dataset for LangChain/LlamaIndex

Once your Actor run finishes, the dataset is accessible at a stable URL and via all Apify APIs/SDKs.

2.1 Get dataset ID / API URL

In the run’s Dataset view:

The URL will look like:
https://api.apify.com/v2/datasets/xxxxxxxxx/items?format=json
The dataset ID is the xxxxxxxxx part.

You’ll use this in code. You can also export manually from the UI as JSON if you want to prototype quickly.
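
If you don't want a client library at all, the same URL can be called directly. This is a rough sketch using only the standard library. The clean, limit, and offset query parameters and the Bearer token header reflect my understanding of the Apify API, so confirm them in the API reference. The function names are my own.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, limit: Optional[int] = None, offset: int = 0) -> str:
    """Build the dataset items URL; `clean` is assumed to skip empty items."""
    params = {"format": "json", "clean": "true"}
    if limit is not None:
        params["limit"] = str(limit)
    if offset:
        params["offset"] = str(offset)
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


def fetch_items(dataset_id: str, token: str, **paging: int) -> list[dict]:
    """Download one page of items as a list of dicts (makes a network call)."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id, **paging),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

For large datasets, calling fetch_items in a loop with a fixed limit and a growing offset keeps memory use flat.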

2.2 Python example: Fetch items for LangChain

Install the client:

pip install apify-client

Then:

from apify_client import ApifyClient

APIFY_TOKEN = "YOUR_APIFY_TOKEN"
DATASET_ID = "your-dataset-id"

client = ApifyClient(APIFY_TOKEN)

items = list(client.dataset(DATASET_ID).iterate_items())
print(f"Loaded {len(items)} pages")

# Example of accessing fields:
print(items[0]["url"])
print(items[0].get("markdown") or items[0].get("text"))

2.3 JavaScript/TypeScript example

npm install apify-client

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
  token: process.env.APIFY_TOKEN,
});

const datasetId = 'your-dataset-id';

const { items } = await client.dataset(datasetId).listItems();
console.log(`Loaded ${items.length} pages`);
console.log(items[0].url);
console.log(items[0].markdown ?? items[0].text);

Both approaches give you a list/array of page objects ready to feed into LangChain or LlamaIndex.

Step 3: Use Website Content Crawler output in LangChain

3.1 Convert items to LangChain documents (Python)

For LangChain, you typically create Document objects with page_content and metadata.

pip install langchain langchain-openai langchain-pinecone

Example skeleton:

from apify_client import ApifyClient
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec

APIFY_TOKEN = "YOUR_APIFY_TOKEN"
DATASET_ID = "your-dataset-id"

client = ApifyClient(APIFY_TOKEN)
items = list(client.dataset(DATASET_ID).iterate_items())

docs = []
for item in items:
    content = item.get("markdown") or item.get("text")
    if not content:
        continue

    metadata = {
        "source": item.get("url"),
        "title": item.get("title"),
        "language": item.get("language"),
    }

    docs.append(Document(page_content=content, metadata=metadata))

print(f"Prepared {len(docs)} documents")

3.2 Chunk the content

Recursive splitting works well with Markdown:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = text_splitter.split_documents(docs)
print(f"After chunking: {len(chunks)} chunks")
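
Since the crawler output is Markdown, another option is to cut on headings first and only fall back to size limits inside long sections. The function below is a plain-Python sketch of that idea, not a LangChain API. The name split_by_headings and the 1000-character default are my own choices.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")


def split_by_headings(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks that each start at a heading where possible.

    Sections longer than max_chars are packed paragraph by paragraph.
    Limitations: a single paragraph longer than max_chars is kept whole, and
    a line starting with '#' inside a code block is treated as a heading.
    """
    sections: list[list[str]] = [[]]
    for line in markdown.splitlines():
        if HEADING.match(line) and sections[-1]:
            sections.append([])  # a new heading starts a new section
        sections[-1].append(line)

    chunks: list[str] = []
    for lines in sections:
        section = "\n".join(lines).strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: pack paragraphs greedily up to max_chars.
        current = ""
        for para in section.split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Each returned string can be wrapped in a Document with the page's metadata, which keeps a heading and its body together in the same chunk more often than pure character splitting does.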

3.3 Embed and store in a vector DB (Pinecone example)

# Initialize embeddings model (replace with your provider)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize Pinecone
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index_name = "website-content"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # match your embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(index_name)

# Upsert chunks into Pinecone using LangChain vectorstore
from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore.from_documents(
    chunks, embeddings, index_name=index_name
)

Now your Website Content Crawler output is queryable via LangChain:

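# RetrievalQA is a legacy chain; on recent LangChain releases it may need to be imported from a separate legacy package.
from langchain.chains import RetrievalQA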
from langchain_openai import ChatOpenAI

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What pricing tiers are mentioned on the site?"})
print(result["result"])
for doc in result["source_documents"]:
    print("Source:", doc.metadata["source"])

Step 4: Use Website Content Crawler output in LlamaIndex

LlamaIndex works similarly: you wrap each item into a Document, then build an index.

pip install llama-index llama-index-llms-openai llama-index-embeddings-openai

4.1 Build documents from the dataset

from apify_client import ApifyClient
from llama_index.core import Document as LlamaDocument
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext

APIFY_TOKEN = "YOUR_APIFY_TOKEN"
DATASET_ID = "your-dataset-id"

client = ApifyClient(APIFY_TOKEN)
items = list(client.dataset(DATASET_ID).iterate_items())

documents = []
for item in items:
    content = item.get("markdown") or item.get("text")
    if not content:
        continue

    metadata = {
        "source": item.get("url"),
        "title": item.get("title"),
        "language": item.get("language"),
    }

    documents.append(LlamaDocument(text=content, metadata=metadata))

print(f"Prepared {len(documents)} documents")

4.2 Parse into nodes and build a vector index

from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# Settings replaces the deprecated ServiceContext in llama-index 0.10+.
Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Chunk the documents into nodes explicitly, then index those nodes.
parser = SentenceSplitter(chunk_size=1024, chunk_overlap=150)
nodes = parser.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the main features described on the site.")
print(response)

Under the hood, LlamaIndex will handle chunking (nodes), embeddings, and index building. Your Apify dataset is simply the content source.

Step 5: Keep your vector DB in sync with fresh crawls

Once your pipeline works end-to-end, you’ll likely want to refresh content regularly.

5.1 Schedule Website Content Crawler runs

In Apify Console:

Go to the Actor run configuration.
Click Schedules.
Create a schedule (e.g., every 6 hours, daily).
Point it at your Actor with your chosen input.

Apify will:

Trigger new runs on schedule.
Store each run’s dataset separately.
Handle crawling, proxies, and monitoring.
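
Because every scheduled run gets its own dataset, your ingestion job needs a way to find the newest one. A small sketch, assuming the Python client's last_run(status=...) and dataset() helpers behave as I understand them. The client argument is expected to be an ApifyClient instance.

```python
from typing import Any, Iterator


def iter_latest_items(
    client: Any, actor_id: str = "apify/website-content-crawler"
) -> Iterator[dict]:
    """Yield items from the dataset of the Actor's most recent successful run.

    Note: this looks at your latest run of the Actor overall, not at a
    specific schedule, so it assumes one crawl configuration per Actor.
    """
    last_run = client.actor(actor_id).last_run(status="SUCCEEDED")
    yield from last_run.dataset().iterate_items()
```

If you run the same Actor with several different inputs, tracking run IDs or dataset IDs explicitly is safer than relying on "latest".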

5.2 Incremental updates in your ingestion code

You have a few options:

Per-run ingestion: For each new run, use the run’s dataset ID and upsert all chunks into your vector DB. Give each chunk a deterministic ID, such as a SHA-256 hash of the page URL plus the chunk index, so re-ingesting the same run overwrites rather than duplicates. Python’s built-in hash() is not suitable here because string hashes change between processes.
Use timestamps: Only process runs or items newer than your last ingestion checkpoint.
Soft-delete outdated chunks: Maintain a mapping from URL → vector DB IDs, so you can remove or update embeddings when a page changes.

Because each run writes to its own dataset and stored items are not modified afterwards, you can rerun ingestion if something breaks, and you always know the source of truth for that batch.
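
The options above can be combined into a small planning step that runs before any embedding call. This is a sketch under my own assumptions: you persist a url → content fingerprint mapping somewhere between runs. The function names are hypothetical.

```python
import hashlib


def chunk_id(url: str, chunk_index: int) -> str:
    """Stable ID: the same URL and chunk position always map to the same ID."""
    return hashlib.sha256(f"{url}#{chunk_index}".encode("utf-8")).hexdigest()[:32]


def content_fingerprint(text: str) -> str:
    """Hash of the page content, used to detect changes between crawls."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(pages: dict[str, str], seen: dict[str, str]) -> dict[str, list[str]]:
    """Compare crawled pages (url -> content) with stored fingerprints (url -> hash).

    Returns URLs to upsert (new or changed), skip (unchanged), and delete
    (present in the store but missing from this crawl).
    """
    plan: dict[str, list[str]] = {"upsert": [], "skip": [], "delete": []}
    for url, text in pages.items():
        if seen.get(url) == content_fingerprint(text):
            plan["skip"].append(url)
        else:
            plan["upsert"].append(url)
    plan["delete"] = sorted(set(seen) - set(pages))
    return plan
```

Only the URLs under upsert need re-chunking and re-embedding, which is where most of the cost sits. When a changed page produces fewer chunks than before, remove the leftover higher-index IDs using your URL → vector DB ID mapping.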

Common configuration tips for clean AI-ready text

From actually using Website Content Crawler in RAG pipelines, a few settings matter a lot:

Prefer Markdown over plain text:
Headings, lists, and code blocks survive, which helps both chunking and answer quality.
Tune content selectors for complex sites:
If a site wraps main content in .article-body or [data-test=content], set that as the selector to avoid nav/footer noise.
Limit depth and follow rules:
Use link include/exclude patterns to avoid crawling tags, archives, or irrelevant sections.
Respect robots/limits:
The Actor supports polite crawling patterns; keep them enabled unless you have explicit permission to be more aggressive.
Combine with a cleaner if needed:
If you still see boilerplate, pair this with a dedicated cleaner like ready-data-cleaner from the Apify Store to strip nav/boilerplate and chunk by semantics/token count.
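
If a few tag or archive pages still slip through, you can apply the same include/exclude idea on the ingestion side before embedding anything. A minimal sketch using shell-style patterns from the standard library. The patterns and the keep_url name are illustrative, and this matching is not guaranteed to behave exactly like the Actor's own URL pattern syntax.

```python
from fnmatch import fnmatchcase


def keep_url(url: str, include: list[str], exclude: list[str]) -> bool:
    """True if url matches no exclude pattern and at least one include pattern.

    An empty include list means "include everything not excluded".
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return not include or any(fnmatchcase(url, pattern) for pattern in include)


INCLUDE = ["https://example.com/blog/*"]
EXCLUDE = ["*/tag/*", "*/archive/*"]

urls = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/tag/python",
    "https://example.com/pricing",
]
kept = [u for u in urls if keep_url(u, INCLUDE, EXCLUDE)]
```

Filtering at crawl time is still preferable, since it saves crawl cost as well as embedding cost. This post-filter is just a safety net.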

Why use Apify for this instead of rolling your own crawler?

If you’ve tried to DIY this with Playwright/Scrapy, you already know the pain:

Proxies. Unblocking. Headless browsers.
Monitoring, retries, and rate-limiting.
HTML cleaning and content detection.
Storage, exports, scheduling, and API access.

Website Content Crawler + Apify gives you:

Crawling at scale: Proxies, unblocking, and JS rendering included.
AI-ready output: Cleaned HTML → Markdown, ready for LangChain/LlamaIndex, vector DBs, and RAG pipelines.
Operational stack included: 99.95% uptime, monitoring, logs, retries, scheduling, datasets.
Programmatic access: Official Python/JavaScript clients, HTTP API, OpenAPI, MCP clients.

You spend your time on embeddings, vector schemas, and prompts—not keeping scrapers alive.

Summary

To generate clean text/Markdown for LangChain, LlamaIndex, and your vector database using Apify’s Website Content Crawler:

Run the Website Content Crawler Actor on your target URLs or domain, configured to output Markdown.
Fetch the resulting dataset via the Apify API or clients (Python/JS).
Wrap items into documents, chunk them, and compute embeddings.
Upsert chunks into a vector DB like Pinecone, Qdrant, or pgvector.
Schedule recurring runs and build a small ingestion loop to keep your index in sync.

You get stable, AI-ready content from the web without owning crawling infrastructure, and you can plug it directly into LangChain, LlamaIndex, and your RAG stack.
