How do I build a chatbot that answers from PDFs/docs without making stuff up, and stays maintainable as files change?
AI Agent Automation Platforms

How do I build a chatbot that answers from PDFs/docs without making stuff up, and stays maintainable as files change?

12 min read

Most teams start by dumping PDFs into a “chat with your documents” tool and then wonder why the chatbot makes things up or breaks as soon as files change. The core problem usually isn’t the model—it’s the system design: how you ingest documents, store them, retrieve them, and control what the model is allowed to say.

This guide walks through how to build a chatbot that:

  • Answers strictly from your PDFs/docs
  • Minimizes hallucinations
  • Stays maintainable as the underlying files are updated, replaced, or versioned

We’ll focus on a practical architecture you can scale, rather than a one-off prototype.


1. Clarify your requirements before you start

Before choosing any tools, get clear on what you actually need. This shapes everything else.

Ask yourself:

  • What content types?
    PDFs, Word, Google Docs, HTML, emails, spreadsheets, internal wikis?
  • How often do docs change?
    Daily updates? Versioned releases? Rare changes?
  • What is the source of truth?
    A document management system (DMS), Git repo, cloud storage (S3, GCS, Drive, SharePoint)?
  • What quality guarantees do you need?
    • “Best-effort, might be slightly wrong”
    • Or “must be traceable and verifiable from a document”
  • Who will use it, and where?
    Internal (support, sales, engineering) vs. external (customers); integrated in Slack, a web chat widget, or your app?
  • Security & permissions?
    Do different users see different documents? Need fine-grained access control?

Once you know that, you can design a system tailored to your real use case instead of a generic demo.


2. Core architecture: overview

A reliable chatbot over PDFs/docs typically uses this pattern:

  1. Document ingestion and parsing

    • Pull files from your sources
    • Convert to text
    • Extract structure (pages, sections, headings, tables)
  2. Chunking and indexing

    • Split long documents into small, meaningful chunks
    • Create embeddings and store them in a vector database
    • Store metadata like document ID, version, page number, permissions
  3. Retrieval at question time

    • Convert user query into an embedding
    • Retrieve the most relevant chunks for that user
    • Optionally apply filters (permissions, document type, date)
  4. Grounded answer generation

    • Feed retrieved chunks into the model with a strict system prompt
    • Instruct the model to answer using only those chunks
    • Include explicit instructions to say “I don’t know” when evidence is missing
  5. Response formatting and citation

    • Return an answer plus citations (which document, page, section)
    • Optionally link to the original file or anchor in a viewer
  6. Maintenance & sync

    • Detect new/updated/deleted docs
    • Re-embed chunks for changed docs only
    • Clean up stale entries in the index
    • Track document versions for traceability

This pattern is commonly known as RAG (Retrieval-Augmented Generation). The quality of your chatbot depends heavily on the quality of each step.


3. Step 1: Ingest and parse your PDFs/docs correctly

If your text extraction is bad, everything downstream will be noisy. Focus on robust, repeatable ingestion.

3.1 Choose where you ingest from

Common sources:

  • Cloud storage: S3, GCS, Azure Blob, Dropbox
  • Collaboration platforms: Google Drive, SharePoint, OneDrive, Box
  • Knowledge bases: Confluence, Notion, GitBook, wiki systems
  • Repos: GitHub, GitLab

Ideally, your ingestion process:

  • Polls or responds to webhooks on document changes
  • Maintains a mapping of each file to a stable document ID
  • Stores metadata like: path, owner, created_at, updated_at, type, permissions

3.2 Text extraction for PDFs and docs

Use tools tuned for your document types:

  • PDFs
    • Libraries: pdfminer.six, PyPDF2, pdfplumber, or commercial APIs
    • For scanned PDFs (images), run OCR (e.g., Tesseract, AWS Textract, Google Document AI)
  • Word/Office documents
    • Libraries: python-docx, LibreOffice in headless mode, or cloud conversion APIs
  • HTML or wikis
    • Strip navigation/boilerplate; keep headings, content, and links

Preserve structure:

  • Headings (H1/H2/H3)
  • Sections or paragraphs
  • Lists and tables
  • Page numbers (for PDFs)

Store this as a structured representation (e.g., JSON) so you know which chunk came from where.


4. Step 2: Chunking and embedding without losing context

Embedding entire documents is inefficient and often yields bad retrieval. You need chunks that are:

  • Small enough to fit several in a prompt
  • Big enough to preserve context and meaning
  • Aligned with the document’s logical structure

4.1 Good chunking strategies

Preferred approaches:

  • Section-based chunking
    • Use headings/subheadings as boundaries
    • Within a section, further split every 300–800 tokens
  • Semantic/paragraph-based chunking
    • Group paragraphs that clearly belong together (e.g., a definition and its explanation)

Avoid:

  • Splitting every N characters blindly
  • Mixing content from unrelated sections into one chunk

For each chunk, store metadata like:

  • document_id
  • version (if you track versions)
  • page_range or page_number
  • section_title, heading_path (e.g., “User Guide > Billing > Refunds”)
  • created_at, updated_at
  • permissions (e.g., which roles or user IDs can access it)

4.2 Choosing an embedding model and vector store

Embedding models (examples):

  • OpenAI text-embedding models
  • Cohere, Azure OpenAI, etc.
  • Open-source models (e.g., bge-large, all-MiniLM) if you must self-host

Vector databases:

  • Hosted: Pinecone, Weaviate Cloud, Qdrant Cloud
  • Self-hosted: Weaviate, Qdrant, Milvus, pgvector (Postgres extension)

Important capabilities:

  • Metadata filters (for permissions, document types, dates)
  • Upserts so you can update or delete entries efficiently
  • Scalability and latency aligned with your expected load

5. Step 3: Retrieval that actually returns the right context

Good retrieval dramatically reduces hallucinations. Techniques:

5.1 Basic retrieval

For each user query:

  1. Create a query embedding
  2. Search your vector store for the top k chunks (e.g., k = 5–15)
  3. Apply filters:
    • document_type, language
    • permissions based on the user
    • date or version (e.g., only latest version)

5.2 Improve relevance with hybrid retrieval

Combine:

  • Vector search (semantic similarity)
  • Keyword search (BM25 / lexical search)

You can:

  • Pre-filter with keyword search, then run vector search on that subset
  • Or run both separately and merge scores

This is especially useful for:

  • Technical terms, product codes, IDs
  • Domain-specific jargon that generic embeddings miss

5.3 Limit the context window wisely

Don’t stuff everything into the model at once. Strategies:

  • Select top k chunks with the best score
  • Deduplicate or merge overlapping chunks
  • Optionally group chunks by source document and prioritize diversity (not all from one page)

6. Step 4: Answer generation without making stuff up

This is where you explicitly prevent hallucination. Two key components:

  1. A strict system prompt
  2. A constrained answer format

6.1 A grounding-focused system prompt

Example system prompt (adapt as needed):

You are a question-answering assistant for our internal documentation.

You must follow these rules:

  1. Answer only using the provided context.
  2. If the context does not contain the answer, say you do not know or that the information is not in the documents.
  3. Do not use outside knowledge or assumptions.
  4. Quote or paraphrase relevant passages and always reference the source document and page/section.
  5. If context appears outdated or contradictory, state that clearly rather than guessing.

Then attach your retrieved chunks, each labeled with its metadata.

6.2 Ask the model to show its work (optionally hidden from the user)

You can use a two-step approach:

  1. Internal reasoning step
    • Ask the model: “Given the context, what facts are relevant to the question? List them with document and page references.”
  2. User-facing answer step
    • Use those facts to generate a concise answer with citations.

You can keep step 1 hidden from the user but log it for debugging.

6.3 Enforce “I don’t know”

Explicitly reward honesty:

  • In the system prompt:
    “If you cannot answer from the context, say: ‘I’m not able to answer that from the available documents.’ Do not attempt to answer anyway.”
  • During evaluation:
    Treat correct “I don’t know” responses as a success, not a failure.

Over time, you can tune your prompts and retrieval to balance coverage vs. caution.


7. Step 5: Returning answers with traceable citations

To ensure trust and maintainability, each answer should be:

  • Traceable – users can see exactly which document/page it came from
  • Actionable – link back to the source file or section

Good patterns:

  • Inline citations: “As described in the Refund Policy (v3.1, p. 4)…”
  • Footnote-style references:
    • [1] “Refund Policy”, v3.1, page 4
    • [2] “Subscription Guide”, v2.0, section “Cancellation”

Implement:

  • A mapping from chunk metadata → original file URL and page anchor
  • In your frontend, show a sidebar or footnotes with links to those sources

This not only increases trust but also makes it easier to debug when the answer is wrong.


8. Keeping the chatbot maintainable as files change

The biggest long-term challenge is keeping the index in sync with evolving documents.

8.1 Track documents and versions

For each document, maintain:

  • A stable document_id
  • version or revision number (optional but helpful)
  • hash of the content (e.g., SHA-256)
  • last_ingested_at

Workflow:

  • On each sync:
    • Fetch file metadata from your source system
    • Compare content hash or updated_at with what you have
    • If changed:
      • Re-parse and re-chunk
      • Re-embed chunks for that document only
      • Delete or mark old chunks as superseded

8.2 Design for idempotent upserts

Set up your ingestion to safely run repeatedly:

  • Use document_id + version as keys in your vector store
  • On re-ingest:
    • Delete or soft-delete old chunks for that doc/version
    • Upsert new chunks with updated metadata
  • Optional: Keep old versions with a flag like is_latest = true/false

8.3 Sync schedules and triggers

Common patterns:

  • Incremental sync every N minutes/hours for systems that support “changed since” filters
  • Webhooks/events from the source (e.g., “file updated”)
  • Manual reindex for critical docs after releases

Aim for:

  • A stable cadency (e.g., every 15 minutes or hourly)
  • A manual override button for urgent updates

8.4 Handling deleted or restricted docs

When a file is deleted or its permissions change:

  • Remove or mark its chunks as inactive in the vector store
  • Update any cache or search index
  • If the doc becomes restricted:
    • Adjust permissions metadata so future queries respect the new rules

9. Permissions and access control

If different users should see different subsets of documents, you must enforce this at retrieval time.

9.1 Represent permissions in metadata

Attach fields to each chunk, such as:

  • visibility: public / internal / confidential
  • allowed_roles: [“support”, “admin”]
  • allowed_user_ids: explicit user IDs (for very sensitive data)
  • org_id or team_id for multi-tenant setups

9.2 Filter by user at query time

When searching:

  • Pass metadata filters like:
    • org_id = current_user.org_id
    • visibility IN [‘public’, ‘internal’]
    • allowed_roles intersection with current_user.roles not empty

Make sure the chat frontend passes a user ID or token to your backend so you can apply these filters securely.


10. Evaluating hallucinations and quality

To ensure your chatbot truly “doesn’t make stuff up,” you need a feedback loop.

10.1 Create a test set

Collect:

  • Real user questions (or simulate them based on your docs)
  • The “gold” answer from a human or authoritative document
  • Where in the documents that answer is found

Use these to:

  • Run nightly evaluations of:
    • Correctness vs. the gold answer
    • Citation accuracy (points to correct doc/section)
    • Rate of “I don’t know” responses

10.2 Monitor in production

Track:

  • User feedback buttons (helpful / not helpful)
  • Cases where users click “view source” and then report mismatch
  • Areas where users repeatedly ask follow-ups or clarifications

Use this data to:

  • Improve chunking or embeddings
  • Add new docs or improve existing ones
  • Refine prompts and retrieval parameters (e.g., top_k)

11. Practical tech stack examples

You can implement all of this with many combinations. Two common approaches:

Example A: Fully managed (minimal infrastructure)

  • Ingestion: your scripts or a no-code connector (e.g., from your DMS to a vector store)
  • Vector DB: Pinecone / Weaviate Cloud / Qdrant Cloud
  • LLM: OpenAI, Azure OpenAI, or a hosted LLM provider
  • Backend: a small Node.js / Python service handling:
    • Authentication
    • Retrieval
    • Prompt construction
    • Logging
  • Frontend: React web widget, or Slack/Teams bot

Example B: More controlled, self-hosted components

  • Ingestion: custom Python service pulling from S3 / Git / internal drives
  • Text extraction: pdfplumber, python-docx, Tesseract OCR, etc.
  • Vector DB: self-hosted Qdrant/Weaviate/Milvus
  • LLM: self-hosted or via API, depending on security/compliance needs
  • Observability: log all interactions, retrievals, and sources to a database

Both can yield low-hallucination, maintainable chatbots. Choose based on your scale, budget, and compliance requirements.


12. Common pitfalls and how to avoid them

Pitfall 1: Letting the model use outside knowledge by default

  • Fix: Strong system prompt + retrieval-only answers + explicit “I don’t know” instructions.

Pitfall 2: Chunking every N characters with no structure

  • Fix: Use headings, sections, and paragraphs; store page/section metadata.

Pitfall 3: No ongoing sync

  • Fix: Build a simple ingestion pipeline with hash-based change detection and upserts.

Pitfall 4: Ignoring permissions

  • Fix: Encode access control in metadata and enforce it on every retrieval.

Pitfall 5: Not logging sources

  • Fix: Always log which chunks were used, their documents/pages, and show that to users.

13. Putting it all together: a minimal blueprint

A concise blueprint you can follow:

  1. Ingestion service

    • Watches your source (e.g., Drive/SharePoint/S3)
    • On change: parse → chunk → embed → upsert into vector DB
  2. Index schema

    • id, document_id, version, content, embedding
    • page, section_title, heading_path
    • permissions, org_id, created_at, updated_at
  3. Query service

    • Receives { user, question }
    • Checks user’s permissions
    • Embeds question → vector search with filters
    • Builds prompt: system instructions + retrieved chunks
    • Calls LLM, gets answer + citations
    • Logs everything (question, retrieved chunks, answer, feedback)
  4. Frontend

    • Chat UI (web, Slack, etc.)
    • Shows answer and sources (document names, pages, links)
    • Lets users mark answers as helpful or not and view original docs

With this foundation, your chatbot will:

  • Answer from PDFs/docs instead of hallucinating
  • Clearly show where answers come from
  • Stay maintainable as documents change, with controlled re-ingestion and indexing

From there, you can layer in advanced features like summarizing entire documents, multi-document comparisons, or workflow-specific agents—while still grounded firmly in your actual files.