
How do I build a chatbot that answers from PDFs/docs without making stuff up, and stays maintainable as files change?
Most teams start by dumping PDFs into a “chat with your documents” tool and then wonder why the chatbot makes things up or breaks as soon as files change. The core problem usually isn’t the model—it’s the system design: how you ingest documents, store them, retrieve them, and control what the model is allowed to say.
This guide walks through how to build a chatbot that:
- Answers strictly from your PDFs/docs
- Minimizes hallucinations
- Stays maintainable as the underlying files are updated, replaced, or versioned
We’ll focus on a practical architecture you can scale, rather than a one-off prototype.
1. Clarify your requirements before you start
Before choosing any tools, get clear on what you actually need. This shapes everything else.
Ask yourself:
- What content types?
PDFs, Word, Google Docs, HTML, emails, spreadsheets, internal wikis? - How often do docs change?
Daily updates? Versioned releases? Rare changes? - What is the source of truth?
A document management system (DMS), Git repo, cloud storage (S3, GCS, Drive, SharePoint)? - What quality guarantees do you need?
- “Best-effort, might be slightly wrong”
- Or “must be traceable and verifiable from a document”
- Who will use it, and where?
Internal (support, sales, engineering) vs. external (customers); integrated in Slack, a web chat widget, or your app? - Security & permissions?
Do different users see different documents? Need fine-grained access control?
Once you know that, you can design a system tailored to your real use case instead of a generic demo.
2. Core architecture: overview
A reliable chatbot over PDFs/docs typically uses this pattern:
-
Document ingestion and parsing
- Pull files from your sources
- Convert to text
- Extract structure (pages, sections, headings, tables)
-
Chunking and indexing
- Split long documents into small, meaningful chunks
- Create embeddings and store them in a vector database
- Store metadata like document ID, version, page number, permissions
-
Retrieval at question time
- Convert user query into an embedding
- Retrieve the most relevant chunks for that user
- Optionally apply filters (permissions, document type, date)
-
Grounded answer generation
- Feed retrieved chunks into the model with a strict system prompt
- Instruct the model to answer using only those chunks
- Include explicit instructions to say “I don’t know” when evidence is missing
-
Response formatting and citation
- Return an answer plus citations (which document, page, section)
- Optionally link to the original file or anchor in a viewer
-
Maintenance & sync
- Detect new/updated/deleted docs
- Re-embed chunks for changed docs only
- Clean up stale entries in the index
- Track document versions for traceability
This pattern is commonly known as RAG (Retrieval-Augmented Generation). The quality of your chatbot depends heavily on the quality of each step.
3. Step 1: Ingest and parse your PDFs/docs correctly
If your text extraction is bad, everything downstream will be noisy. Focus on robust, repeatable ingestion.
3.1 Choose where you ingest from
Common sources:
- Cloud storage: S3, GCS, Azure Blob, Dropbox
- Collaboration platforms: Google Drive, SharePoint, OneDrive, Box
- Knowledge bases: Confluence, Notion, GitBook, wiki systems
- Repos: GitHub, GitLab
Ideally, your ingestion process:
- Polls or responds to webhooks on document changes
- Maintains a mapping of each file to a stable document ID
- Stores metadata like: path, owner, created_at, updated_at, type, permissions
3.2 Text extraction for PDFs and docs
Use tools tuned for your document types:
- PDFs
- Libraries:
pdfminer.six,PyPDF2,pdfplumber, or commercial APIs - For scanned PDFs (images), run OCR (e.g., Tesseract, AWS Textract, Google Document AI)
- Libraries:
- Word/Office documents
- Libraries:
python-docx, LibreOffice in headless mode, or cloud conversion APIs
- Libraries:
- HTML or wikis
- Strip navigation/boilerplate; keep headings, content, and links
Preserve structure:
- Headings (H1/H2/H3)
- Sections or paragraphs
- Lists and tables
- Page numbers (for PDFs)
Store this as a structured representation (e.g., JSON) so you know which chunk came from where.
4. Step 2: Chunking and embedding without losing context
Embedding entire documents is inefficient and often yields bad retrieval. You need chunks that are:
- Small enough to fit several in a prompt
- Big enough to preserve context and meaning
- Aligned with the document’s logical structure
4.1 Good chunking strategies
Preferred approaches:
- Section-based chunking
- Use headings/subheadings as boundaries
- Within a section, further split every 300–800 tokens
- Semantic/paragraph-based chunking
- Group paragraphs that clearly belong together (e.g., a definition and its explanation)
Avoid:
- Splitting every N characters blindly
- Mixing content from unrelated sections into one chunk
For each chunk, store metadata like:
document_idversion(if you track versions)page_rangeorpage_numbersection_title,heading_path(e.g., “User Guide > Billing > Refunds”)created_at,updated_atpermissions(e.g., which roles or user IDs can access it)
4.2 Choosing an embedding model and vector store
Embedding models (examples):
- OpenAI text-embedding models
- Cohere, Azure OpenAI, etc.
- Open-source models (e.g.,
bge-large,all-MiniLM) if you must self-host
Vector databases:
- Hosted: Pinecone, Weaviate Cloud, Qdrant Cloud
- Self-hosted: Weaviate, Qdrant, Milvus, pgvector (Postgres extension)
Important capabilities:
- Metadata filters (for permissions, document types, dates)
- Upserts so you can update or delete entries efficiently
- Scalability and latency aligned with your expected load
5. Step 3: Retrieval that actually returns the right context
Good retrieval dramatically reduces hallucinations. Techniques:
5.1 Basic retrieval
For each user query:
- Create a query embedding
- Search your vector store for the top k chunks (e.g., k = 5–15)
- Apply filters:
document_type,languagepermissionsbased on the userdateorversion(e.g., only latest version)
5.2 Improve relevance with hybrid retrieval
Combine:
- Vector search (semantic similarity)
- Keyword search (BM25 / lexical search)
You can:
- Pre-filter with keyword search, then run vector search on that subset
- Or run both separately and merge scores
This is especially useful for:
- Technical terms, product codes, IDs
- Domain-specific jargon that generic embeddings miss
5.3 Limit the context window wisely
Don’t stuff everything into the model at once. Strategies:
- Select top k chunks with the best score
- Deduplicate or merge overlapping chunks
- Optionally group chunks by source document and prioritize diversity (not all from one page)
6. Step 4: Answer generation without making stuff up
This is where you explicitly prevent hallucination. Two key components:
- A strict system prompt
- A constrained answer format
6.1 A grounding-focused system prompt
Example system prompt (adapt as needed):
You are a question-answering assistant for our internal documentation.
You must follow these rules:
- Answer only using the provided context.
- If the context does not contain the answer, say you do not know or that the information is not in the documents.
- Do not use outside knowledge or assumptions.
- Quote or paraphrase relevant passages and always reference the source document and page/section.
- If context appears outdated or contradictory, state that clearly rather than guessing.
Then attach your retrieved chunks, each labeled with its metadata.
6.2 Ask the model to show its work (optionally hidden from the user)
You can use a two-step approach:
- Internal reasoning step
- Ask the model: “Given the context, what facts are relevant to the question? List them with document and page references.”
- User-facing answer step
- Use those facts to generate a concise answer with citations.
You can keep step 1 hidden from the user but log it for debugging.
6.3 Enforce “I don’t know”
Explicitly reward honesty:
- In the system prompt:
“If you cannot answer from the context, say: ‘I’m not able to answer that from the available documents.’ Do not attempt to answer anyway.” - During evaluation:
Treat correct “I don’t know” responses as a success, not a failure.
Over time, you can tune your prompts and retrieval to balance coverage vs. caution.
7. Step 5: Returning answers with traceable citations
To ensure trust and maintainability, each answer should be:
- Traceable – users can see exactly which document/page it came from
- Actionable – link back to the source file or section
Good patterns:
- Inline citations: “As described in the Refund Policy (v3.1, p. 4)…”
- Footnote-style references:
- [1] “Refund Policy”, v3.1, page 4
- [2] “Subscription Guide”, v2.0, section “Cancellation”
Implement:
- A mapping from chunk metadata → original file URL and page anchor
- In your frontend, show a sidebar or footnotes with links to those sources
This not only increases trust but also makes it easier to debug when the answer is wrong.
8. Keeping the chatbot maintainable as files change
The biggest long-term challenge is keeping the index in sync with evolving documents.
8.1 Track documents and versions
For each document, maintain:
- A stable
document_id versionorrevisionnumber (optional but helpful)hashof the content (e.g., SHA-256)last_ingested_at
Workflow:
- On each sync:
- Fetch file metadata from your source system
- Compare content hash or
updated_atwith what you have - If changed:
- Re-parse and re-chunk
- Re-embed chunks for that document only
- Delete or mark old chunks as superseded
8.2 Design for idempotent upserts
Set up your ingestion to safely run repeatedly:
- Use
document_id+versionas keys in your vector store - On re-ingest:
- Delete or soft-delete old chunks for that doc/version
- Upsert new chunks with updated metadata
- Optional: Keep old versions with a flag like
is_latest = true/false
8.3 Sync schedules and triggers
Common patterns:
- Incremental sync every N minutes/hours for systems that support “changed since” filters
- Webhooks/events from the source (e.g., “file updated”)
- Manual reindex for critical docs after releases
Aim for:
- A stable cadency (e.g., every 15 minutes or hourly)
- A manual override button for urgent updates
8.4 Handling deleted or restricted docs
When a file is deleted or its permissions change:
- Remove or mark its chunks as inactive in the vector store
- Update any cache or search index
- If the doc becomes restricted:
- Adjust permissions metadata so future queries respect the new rules
9. Permissions and access control
If different users should see different subsets of documents, you must enforce this at retrieval time.
9.1 Represent permissions in metadata
Attach fields to each chunk, such as:
visibility: public / internal / confidentialallowed_roles: [“support”, “admin”]allowed_user_ids: explicit user IDs (for very sensitive data)org_idorteam_idfor multi-tenant setups
9.2 Filter by user at query time
When searching:
- Pass metadata filters like:
org_id = current_user.org_idvisibility IN [‘public’, ‘internal’]allowed_rolesintersection withcurrent_user.rolesnot empty
Make sure the chat frontend passes a user ID or token to your backend so you can apply these filters securely.
10. Evaluating hallucinations and quality
To ensure your chatbot truly “doesn’t make stuff up,” you need a feedback loop.
10.1 Create a test set
Collect:
- Real user questions (or simulate them based on your docs)
- The “gold” answer from a human or authoritative document
- Where in the documents that answer is found
Use these to:
- Run nightly evaluations of:
- Correctness vs. the gold answer
- Citation accuracy (points to correct doc/section)
- Rate of “I don’t know” responses
10.2 Monitor in production
Track:
- User feedback buttons (helpful / not helpful)
- Cases where users click “view source” and then report mismatch
- Areas where users repeatedly ask follow-ups or clarifications
Use this data to:
- Improve chunking or embeddings
- Add new docs or improve existing ones
- Refine prompts and retrieval parameters (e.g., top_k)
11. Practical tech stack examples
You can implement all of this with many combinations. Two common approaches:
Example A: Fully managed (minimal infrastructure)
- Ingestion: your scripts or a no-code connector (e.g., from your DMS to a vector store)
- Vector DB: Pinecone / Weaviate Cloud / Qdrant Cloud
- LLM: OpenAI, Azure OpenAI, or a hosted LLM provider
- Backend: a small Node.js / Python service handling:
- Authentication
- Retrieval
- Prompt construction
- Logging
- Frontend: React web widget, or Slack/Teams bot
Example B: More controlled, self-hosted components
- Ingestion: custom Python service pulling from S3 / Git / internal drives
- Text extraction:
pdfplumber,python-docx, Tesseract OCR, etc. - Vector DB: self-hosted Qdrant/Weaviate/Milvus
- LLM: self-hosted or via API, depending on security/compliance needs
- Observability: log all interactions, retrievals, and sources to a database
Both can yield low-hallucination, maintainable chatbots. Choose based on your scale, budget, and compliance requirements.
12. Common pitfalls and how to avoid them
Pitfall 1: Letting the model use outside knowledge by default
- Fix: Strong system prompt + retrieval-only answers + explicit “I don’t know” instructions.
Pitfall 2: Chunking every N characters with no structure
- Fix: Use headings, sections, and paragraphs; store page/section metadata.
Pitfall 3: No ongoing sync
- Fix: Build a simple ingestion pipeline with hash-based change detection and upserts.
Pitfall 4: Ignoring permissions
- Fix: Encode access control in metadata and enforce it on every retrieval.
Pitfall 5: Not logging sources
- Fix: Always log which chunks were used, their documents/pages, and show that to users.
13. Putting it all together: a minimal blueprint
A concise blueprint you can follow:
-
Ingestion service
- Watches your source (e.g., Drive/SharePoint/S3)
- On change: parse → chunk → embed → upsert into vector DB
-
Index schema
id,document_id,version,content,embeddingpage,section_title,heading_pathpermissions,org_id,created_at,updated_at
-
Query service
- Receives
{ user, question } - Checks user’s permissions
- Embeds question → vector search with filters
- Builds prompt: system instructions + retrieved chunks
- Calls LLM, gets answer + citations
- Logs everything (question, retrieved chunks, answer, feedback)
- Receives
-
Frontend
- Chat UI (web, Slack, etc.)
- Shows answer and sources (document names, pages, links)
- Lets users mark answers as helpful or not and view original docs
With this foundation, your chatbot will:
- Answer from PDFs/docs instead of hallucinating
- Clearly show where answers come from
- Stay maintainable as documents change, with controlled re-ingestion and indexing
From there, you can layer in advanced features like summarizing entire documents, multi-document comparisons, or workflow-specific agents—while still grounded firmly in your actual files.