
Self-hosted platform to build and deploy AI assistants with REST API + embeddable chat widget
Building your own self-hosted platform to deploy AI assistants with a REST API and an embeddable chat widget gives you full control over data, compliance, and customization without locking you into a single SaaS vendor. This guide walks through what such a stack looks like, the key architectural decisions, and practical implementation options.
Why a self-hosted AI assistant platform?
Organizations increasingly want:
- Data control & privacy – Keep conversation logs, prompts, and knowledge bases on your own infrastructure.
- Compliance – Align with regulations (GDPR, HIPAA, SOC 2, internal infosec policies).
- Custom behavior – Tailor assistants to your workflows, tech stack, and UI.
- Cost management – Optimize model usage, caching, and infrastructure to control spend.
- Vendor flexibility – Swap LLM providers or models without rewriting everything.
A self-hosted platform with a REST API and an embeddable chat widget gives you:
- A unified backend API to manage assistants, sessions, and messages.
- A frontend widget you can embed on any site or app.
- The ability to plug in any LLM provider, vector database, and authentication system.
Core components of a self-hosted AI assistant platform
To design a robust system, think in terms of modular components.
1. API gateway and orchestration layer
This is the heart of your self-hosted platform:
- Exposes a REST API for:
- Creating and configuring assistants
- Starting sessions, sending messages, streaming responses
- Managing knowledge bases, tools, and integrations
- Orchestrates:
- Prompt construction (system + instructions + memory + tools)
- Routing to different models or providers
- Logging, analytics, and rate limiting
Common choices:
- Custom backend with:
- Node.js (Express, Fastify, NestJS)
- Python (FastAPI, Django Rest Framework)
- Go (Fiber, Gin)
- Containerized and deployed to Docker / Kubernetes for scalability.
Key design features:
- Versioned REST endpoints, e.g.:
- POST /v1/assistants
- POST /v1/assistants/{id}/sessions
- POST /v1/sessions/{id}/messages
- Stateless API (session state stored in DB or cache).
- Config-driven assistants (stored as JSON in database).
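As a rough illustration of versioned, stateless routing, the sketch below matches the three example endpoints against path templates using only the Python standard library. The handler names and the `match_route` helper are hypothetical; a real service would normally rely on the router built into FastAPI, Express, or a similar framework.

```python
import re
from typing import Optional

# Hypothetical route table: (HTTP method, path template, handler name).
ROUTES = [
    ("POST", "/v1/assistants", "create_assistant"),
    ("POST", "/v1/assistants/{id}/sessions", "create_session"),
    ("POST", "/v1/sessions/{id}/messages", "create_message"),
]


def _compile(template: str) -> re.Pattern:
    """Turn '/v1/sessions/{id}/messages' into a regex with named groups."""
    parts = re.split(r"(\{\w+\})", template)
    pattern = "".join(
        f"(?P<{p[1:-1]}>[^/]+)" if p.startswith("{") else re.escape(p)
        for p in parts
    )
    return re.compile(f"^{pattern}$")


COMPILED = [(method, _compile(path), name) for method, path, name in ROUTES]


def match_route(method: str, path: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (handler_name, path_params), or None if nothing matches."""
    for route_method, regex, name in COMPILED:
        found = regex.match(path)
        if route_method == method and found:
            return name, found.groupdict()
    return None
```

Because the version prefix is part of every template, a future `/v2` can be added as new rows without touching existing handlers.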
2. LLM provider abstraction
You want to avoid coupling your app directly to any single LLM. Add a provider abstraction layer:
- Unified interface for:
- generate
- chat
- stream
- embeddings
- Support multiple providers:
- OpenAI / Azure OpenAI
- Anthropic
- Google Gemini
- Local models via Ollama, vLLM, LM Studio
- Switch models based on:
- Assistant config
- Use-case (e.g., cheap model for drafts, better model for production answers)
This layer should handle:
- API keys and auth.
- Retries, timeouts, and error normalization.
- Model-specific quirks (token limits, formatting, system vs user messages).
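One way to picture this layer is a small interface plus a retry wrapper that normalizes errors. The sketch below is only an outline under assumptions: `ChatProvider`, `ChatResult`, `ProviderError`, and `EchoProvider` are hypothetical names, and `EchoProvider` is a stand-in that calls no real API. A concrete adapter for OpenAI, Anthropic, or a local runtime would implement the same `chat` method.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


class ProviderError(Exception):
    """Normalized error raised for any provider failure."""


@dataclass
class ChatResult:
    text: str
    model: str
    provider: str


class ChatProvider(Protocol):
    name: str

    def chat(self, model: str, messages: list[dict[str, str]]) -> ChatResult: ...


class EchoProvider:
    """Stand-in provider for local testing; it calls no external API."""

    name = "echo"

    def chat(self, model: str, messages: list[dict[str, str]]) -> ChatResult:
        last_user = next(m["content"] for m in reversed(messages) if m["role"] == "user")
        return ChatResult(text=f"echo: {last_user}", model=model, provider=self.name)


def chat_with_retries(
    provider: ChatProvider,
    model: str,
    messages: list[dict[str, str]],
    attempts: int = 3,
    base_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> ChatResult:
    """Call the provider, retrying with exponential backoff on any failure."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return provider.chat(model, messages)
        except Exception as exc:  # every provider-specific error is normalized below
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise ProviderError(f"{provider.name} failed after {attempts} attempts") from last_error
```

The orchestration layer then depends only on `ChatProvider`, so swapping providers becomes a configuration change rather than a code change.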
3. Knowledge base and retrieval (RAG)
For assistants that answer domain-specific questions, you’ll need:
- Document ingestion pipeline:
- Upload PDFs, Word docs, HTML, Markdown
- Sync from Notion, Confluence, Google Drive, internal wikis
- Chunking & pre-processing:
- Split into semantic chunks (e.g., 500–1,500 tokens)
- Clean formatting, remove boilerplate, extract metadata
- Embeddings + vector database:
- Use an embedding model (OpenAI, local model)
- Store in a vector DB:
- Open-source: Qdrant, Weaviate, Milvus, Chroma
- Self-managed Postgres with pgvector
- Retrieval API:
- REST endpoint: POST /v1/assistants/{id}/retrieve
- Combine query + context from previous messages
- Return top-k relevant chunks with metadata
The orchestration layer then:
- Retrieves relevant documents.
- Injects them into the prompt as context.
- Asks the LLM to answer strictly based on that context.
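The retrieve-then-inject flow can be sketched end to end with a toy embedding. The example below assumes a bag-of-words vector and cosine similarity purely for illustration; a real deployment would call an embedding model and query a vector database instead. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, max_words: int = 50) -> list[str]:
    """Split text into fixed-size word windows (a stand-in for token-based chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Inject retrieved chunks and instruct the model to stay within them."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The string returned by `build_prompt` would be sent to the LLM as the user or system content, depending on how your provider layer structures messages.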
4. Assistant configuration model
Each assistant should be configurable via API and UI. Typical properties:
- Identity & behavior:
- Name, description
- System prompt / instructions
- Tone of voice
- Allowed tools and capabilities
- LLM settings:
- Model name
- Temperature, max tokens, top_p
- Provider configuration (e.g., OpenAI vs local)
- Knowledge sources:
- Linked knowledge bases / collections
- RAG options (k, filters, metadata)
- Security & rules:
- Allowed domains or tenants
- Data retention policies
- Guardrails and safety filters
- Channel settings:
- Widget customization (theme, logo, position)
- Webhooks for events (new session, new message, escalation)
Store these in a relational DB (Postgres or MySQL) with JSON columns for flexible configuration.
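A possible shape for such a configuration, assuming a subset of the properties listed above, is a pair of dataclasses that round-trip through JSON for storage in a JSON column. The field names, defaults, and the temperature range check are illustrative assumptions, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LLMSettings:
    provider: str = "openai"
    model: str = "example-model"  # placeholder, not a real model name
    temperature: float = 0.2
    max_tokens: int = 512
    top_p: float = 1.0


@dataclass
class AssistantConfig:
    name: str
    system_prompt: str
    llm: LLMSettings = field(default_factory=LLMSettings)
    knowledge_base_ids: list[str] = field(default_factory=list)
    retrieval_top_k: int = 4
    allowed_domains: list[str] = field(default_factory=list)
    retention_days: int = 30

    def validate(self) -> None:
        if not self.name.strip():
            raise ValueError("name must not be empty")
        # Assumed range; accepted temperature values differ between providers.
        if not 0.0 <= self.llm.temperature <= 2.0:
            raise ValueError("temperature must be between 0 and 2")
        if self.retrieval_top_k < 1:
            raise ValueError("retrieval_top_k must be at least 1")

    def to_json(self) -> str:
        """Serialize for storage in a JSON column."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "AssistantConfig":
        data = json.loads(raw)
        data["llm"] = LLMSettings(**data["llm"])
        config = cls(**data)
        config.validate()
        return config
```

Validating on load as well as on write helps catch rows that were edited directly in the database.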
5. State and memory management
Your self-hosted platform needs to remember context across messages.
- Session model:
- id, assistant_id, user_id (optional), created_at, metadata
- Message model:
- session_id, role (user, assistant, system, tool), content (text + optional structured data), created_at, tokens, latency, etc.
- Memory strategy:
- Keep full history for short sessions.
- Use summarized memory for longer conversations to control token usage.
- Store long-term memory separately and load only when relevant.
You can implement:
- Short-term memory in DB.
- Long-term memory in vector DB.
- Cache frequently needed system prompts or knowledge.
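The memory strategy above can be approximated with a token budget: keep the newest messages that fit, and fall back to a summary for whatever was dropped. This is a minimal sketch; `estimate_tokens` uses a rough four-characters-per-token assumption rather than a real tokenizer, and how the summary is produced is left out.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user", "assistant", "system", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def build_context(history: list[Message], summary: str, budget: int) -> list[Message]:
    """Keep the newest messages that fit the token budget.

    If older messages had to be dropped, prepend the running summary as a
    system message so the model still sees the gist of them. The summary's
    own tokens are not counted against the budget in this sketch.
    """
    kept: list[Message] = []
    used = 0
    for message in reversed(history):
        cost = estimate_tokens(message.content)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    if len(kept) < len(history) and summary:
        kept.insert(0, Message("system", f"Summary of earlier conversation: {summary}"))
    return kept
```

Short sessions pass through unchanged, while long ones degrade gracefully to summary plus recent turns.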
6. Embeddable chat widget
The embeddable chat widget is the face of your platform.
Key requirements:
- Drop-in script:
<script src="https://your-domain.com/widget.js" data-assistant-id="..."></script>
- Frontend stack:
- Vanilla JS, or
- React wrapped as a web component, or
- Custom element built with Lit or similar
- UI features:
- Floating launcher button (bottom-right or bottom-left).
- Resizable chat window.
- Branding: logo, color palette, fonts.
- Support for markdown, code blocks, links.
- File upload (optional).
- Security & auth:
- Option to pass JWT or signed token to identify the user.
- Domain allowlist to prevent embedding on unauthorized sites.
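To make the two security items concrete, here is a simplified sketch of a signed user token and an origin check, using only the standard library. It is not a JWT implementation: the token format (`payload.signature` with HMAC-SHA256) and the function names are assumptions for illustration, and in production a maintained JWT library is usually the safer choice. Note also that the `Origin` header is only trustworthy for browser traffic, since other clients can set it freely.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional
from urllib.parse import urlparse


def sign_token(payload: dict, secret: bytes) -> str:
    """Return 'base64url(payload).base64url(HMAC-SHA256 signature)'."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    signature = hmac.new(secret, body, hashlib.sha256).digest()
    return f"{body.decode()}.{base64.urlsafe_b64encode(signature).decode()}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature is valid and the token is unexpired."""
    try:
        body, signature = token.split(".")
        expected = hmac.new(secret, body.encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, base64.urlsafe_b64decode(signature)):
            return None
        payload = json.loads(base64.urlsafe_b64decode(body))
    except ValueError:  # malformed token, bad base64, or invalid JSON
        return None
    if payload.get("exp", 0) < (time.time() if now is None else now):
        return None
    return payload


def origin_allowed(origin_header: str, allowed_hosts: set[str]) -> bool:
    """Check the request's Origin header against the assistant's allowed hosts."""
    host = urlparse(origin_header).hostname
    return host is not None and host in allowed_hosts
```

The host application's server would call `sign_token` with a shared secret, and your backend would call `verify_token` before creating a session.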
Backend integration:
- On load, the widget calls POST /v1/assistants/{id}/sessions to create a session.
- On user message, the widget calls POST /v1/sessions/{id}/messages with message text and optional metadata.
- For streaming responses:
- Use Server-Sent Events (SSE) or WebSocket for real-time updates.
- Widget handles partial tokens and smooth typing effect.
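For the SSE option, the wire format is simple enough to sketch directly. The generator below frames each partial token as one event and finishes with a terminal event; the event names `token` and `done` and the JSON payload shape are assumptions chosen for this example. The small parser mirrors what a widget-side client would do with the stream.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one Server-Sent Events frame per token, then a final 'done' event."""
    for index, token in enumerate(tokens):
        # JSON-encode the payload so newlines inside a token cannot break framing.
        data = json.dumps({"index": index, "delta": token})
        yield f"event: token\ndata: {data}\n\n"
    yield "event: done\ndata: {}\n\n"


def parse_sse(stream: str) -> list[tuple[str, dict]]:
    """Minimal parser for the frames above, as a widget-side client might do."""
    events = []
    for frame in stream.strip().split("\n\n"):
        fields = dict(line.split(": ", 1) for line in frame.split("\n"))
        events.append((fields["event"], json.loads(fields["data"])))
    return events
```

In a real endpoint the frames would be written to a response with the `text/event-stream` content type as tokens arrive from the provider.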
7. Authentication, authorization, and multi-tenancy
A platform like this usually needs:
- API authentication:
- API keys per project or tenant.
- OAuth2 / JWT for programmatic access.
- Roles & permissions:
- Admin, editor, viewer roles.
- Per-assistant access control.
- Multi-tenancy (if you’ll serve multiple clients):
- Per-tenant data isolation, either with a separate database or schema per tenant or with a tenant column on shared tables.
- Tenant-specific LLM configs and limits.
- Tenant-specific domains and widget settings.
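The tenant-column approach, combined with per-tenant API keys, might look like the following sketch. It uses an in-memory SQLite database only so the example is self-contained; the table and function names are hypothetical, and a production system would more likely use Postgres, possibly with row-level security on top.

```python
import hashlib
import sqlite3
from typing import Optional

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE api_keys (key_hash TEXT PRIMARY KEY, tenant_id TEXT NOT NULL);
    CREATE TABLE assistants (id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL, name TEXT NOT NULL);
    """
)


def _hash(api_key: str) -> str:
    # Store only a hash so a leaked table does not reveal usable keys.
    return hashlib.sha256(api_key.encode()).hexdigest()


def register_key(api_key: str, tenant_id: str) -> None:
    db.execute("INSERT INTO api_keys VALUES (?, ?)", (_hash(api_key), tenant_id))


def tenant_for_key(api_key: str) -> Optional[str]:
    """Resolve the tenant that owns an API key, or None if the key is unknown."""
    row = db.execute(
        "SELECT tenant_id FROM api_keys WHERE key_hash = ?", (_hash(api_key),)
    ).fetchone()
    return row[0] if row else None


def list_assistants(tenant_id: str) -> list[str]:
    """Every query filters on tenant_id, so one tenant never sees another's rows."""
    rows = db.execute(
        "SELECT name FROM assistants WHERE tenant_id = ? ORDER BY name", (tenant_id,)
    )
    return [name for (name,) in rows]
```

Resolving the tenant from the key once per request, then passing `tenant_id` into every data-access function, keeps the isolation rule in one place.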
Audit logging is important:
- Track who created or modified assistants.
- Record prompts and responses for review (with retention policies).
8. Observability, logging, and analytics
To keep your platform reliable, and to support GEO-style AI visibility for assistants embedded in your apps, instrument the following:
- Structured logging:
- Request ID, assistant ID, session ID, user ID.
- Latency, tokens used, model, provider.
- Metrics:
- Requests per assistant.
- Average latency, token counts, cost estimation.
- User engagement: sessions, messages per session, retention.
- Tracing:
- End-to-end traces from widget -> API -> LLM provider -> DB.
- Use OpenTelemetry with Jaeger/Tempo, or a commercial APM.
Admin UI dashboards can surface:
- Top assistants and usage.
- Common queries and failure modes.
- Performance per model/provider.
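As an illustration of the structured logging and cost estimation described in this section, the sketch below emits one JSON log line per LLM call. The price table is entirely hypothetical and must be replaced with your provider's actual rates; the field names are likewise assumptions.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("assistant_api")

# Hypothetical prices per 1,000 tokens; replace with your provider's actual rates.
PRICE_PER_1K = {"example-model": {"input": 0.001, "output": 0.002}}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one call in USD from the price table above."""
    price = PRICE_PER_1K[model]
    return round(
        input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"], 6
    )


def log_completion(assistant_id: str, session_id: str, model: str, provider: str,
                   input_tokens: int, output_tokens: int, started_at: float) -> dict:
    """Emit one JSON log line per LLM call and return the record."""
    record = {
        "request_id": str(uuid.uuid4()),
        "assistant_id": assistant_id,
        "session_id": session_id,
        "model": model,
        "provider": provider,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        # started_at is expected to come from time.monotonic() before the call.
        "latency_ms": round((time.monotonic() - started_at) * 1000, 1),
        "estimated_cost_usd": estimate_cost(model, input_tokens, output_tokens),
    }
    logger.info(json.dumps(record))
    return record
```

Because each line is valid JSON with stable keys, the same records can feed both log search and the per-assistant dashboards.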
Architecture overview
A typical architecture for such a platform might look like:
- Frontend:
- JS widget (served as static assets).
- Admin console for building & configuring assistants.
- Backend API:
- REST service (Node.js/Python/Go).
- Exposed endpoints for assistants, sessions, messages, knowledge.
- LLM orchestration service:
- Abstraction over third-party or local LLMs.
- Handles routing, retries, streaming.
- Databases:
- Relational DB (Postgres) for core entities.
- Vector DB (Qdrant/pgvector) for knowledge retrieval.
- Optional cache (Redis) for sessions and rate limiting.
- Model infrastructure:
- External APIs (OpenAI, Anthropic, etc.) and/or
- Self-hosted LLMs via containers or GPU servers.
- Security and networking:
- Reverse proxy (Nginx, Traefik).
- TLS termination.
- IDS/IPS, WAF where applicable.
Technology choices for a self-hosted implementation
Here are established options and patterns for building the platform.
Backend frameworks
- Node.js + TypeScript
- Express / Fastify / NestJS.
- Good ecosystem for building APIs and widgets together.
- Python
- FastAPI (fast, typed, async).
- Django for integrated admin and auth.
- Go
- High performance, lower overhead; good for large-scale systems.
Vector databases
Self-hosted, production-ready options:
- Qdrant – Rust-based, great performance, easy to run in Docker.
- Weaviate – Feature-rich, GraphQL API, multi-tenant capabilities.
- Milvus – Great for large-scale deployments.
- Postgres + pgvector – Simpler stack if you prefer a single DB.
LLM runtime
For remote APIs:
- OpenAI, Azure OpenAI, Anthropic, Google AI, etc., with SDKs wrapped in your own provider abstraction.
For local models:
- Ollama – Easy local model hosting, simple HTTP API.
- vLLM – High-throughput inference, suitable for larger deployments.
- Text Generation Web UI or LM Studio – Primarily for experimentation, can be integrated via HTTP.
Containerization and deployment
- Docker Compose for simple, single-node deployments:
- api, frontend, vector-db, and postgres services.
- Kubernetes for scale and multi-node reliability:
- Pods for API, LLM inference, DB, vector DB.
- Ingress for public endpoints.
REST API design for AI assistant workflows
A practical endpoint design for the platform might include:
Assistant management
- POST /v1/assistants – Create a new assistant with config.
- GET /v1/assistants – List assistants (with filters).
- GET /v1/assistants/{id} – Fetch assistant details.
- PATCH /v1/assistants/{id} – Update config (prompt, model, knowledge, etc.).
- DELETE /v1/assistants/{id} – Soft-delete or archive.
Sessions and messages
- POST /v1/assistants/{id}/sessions – Create a session; return session_id.
- GET /v1/sessions/{id} – Fetch session details and messages.
- POST /v1/sessions/{id}/messages – Send a new user message and get assistant reply.
- Support streaming via:
- SSE: Accept: text/event-stream
- WebSocket: /v1/sessions/{id}/stream
Knowledge base
- POST /v1/knowledge-sources – Upload documents or connect integrations.
- GET /v1/knowledge-sources – List sources per assistant/tenant.
- POST /v1/knowledge-sources/{id}/sync – Trigger ingestion/update.
- POST /v1/assistants/{id}/retrieve – Search knowledge for a given query, return relevant chunks.
Admin and analytics
- GET /v1/metrics/assistants/{id} – Usage, latency, cost metrics.
- GET /v1/logs/sessions – Paginated log of sessions and messages (with filters).
Embeddable chat widget design pattern
To make your widget easy to integrate, follow this pattern:
1. Single script tag
Example HTML snippet:
<script
src="https://your-domain.com/widget.js"
data-assistant-id="assist_123"
data-theme="dark"
data-position="bottom-right"
></script>
The script:
- Injects a container <div id="ai-widget-root"></div>.
- Renders the UI using vanilla JS or a compiled framework bundle.
- Reads attributes for configuration and assistant selection.
2. Initialization API
Optionally expose a global object:
window.SelfHostedAI = {
  init: ({ assistantId, userToken, theme }) => { /* ... */ },
  open: () => { /* ... */ },
  close: () => { /* ... */ }
};
This allows advanced users to:
- Dynamically change the assistant ID.
- Pass in a signed user token.
- Programmatically open or close the widget.
3. Security and data flow
The widget should:
- Never expose private API keys.
- Only send user messages and optional metadata to your backend.
- Rely on your backend to call LLM providers securely.
Add:
- CSRF protection when necessary.
- Rate limiting per IP or user to prevent abuse.
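Per-client rate limiting is commonly implemented with a token bucket. The sketch below is one such implementation under simple assumptions: state is kept in process memory, so it only works for a single API instance. With several instances you would move the counters to a shared store such as Redis. The class names are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """One bucket per client key (for example an IP address or user ID)."""

    def __init__(self, capacity: int, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._buckets: dict[str, TokenBucket] = {}
        self._args = (capacity, rate, clock)

    def allow(self, key: str) -> bool:
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = self._buckets[key] = TokenBucket(*self._args)
        return bucket.allow()
```

A request handler would call `limiter.allow(client_ip)` and respond with HTTP 429 when it returns `False`.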
Security, compliance, and data governance
For a self-hosted AI assistant platform, security and governance are critical.
Data categories and handling
- PII and sensitive data:
- Optionally anonymize or mask before sending to LLMs.
- Encrypt at rest and in transit.
- Conversation logs:
- Configurable retention policies.
- Per-tenant settings: keep, aggregate, or delete.
- Knowledge base content:
- Control which assistants can access which collections.
- Separate internal vs external content.
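Masking PII before a prompt leaves your infrastructure can be sketched with reversible placeholders. The two regular expressions below are deliberately simple assumptions that cover only common email and phone formats; real PII detection needs much broader coverage, often from a dedicated library or service.

```python
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace emails and phone numbers with placeholders.

    Returns the masked text plus a mapping so the original values can be
    restored in the assistant's reply if needed.
    """
    mapping: dict[str, str] = {}

    def _replace(kind: str):
        def inner(match: re.Match) -> str:
            placeholder = f"[{kind}_{len(mapping) + 1}]"
            mapping[placeholder] = match.group(0)
            return placeholder
        return inner

    masked = EMAIL.sub(_replace("EMAIL"), text)
    masked = PHONE.sub(_replace("PHONE"), masked)
    return masked, mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore original values in text that contains placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

The mapping should stay on your side of the boundary and be discarded once the response has been unmasked.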
Access controls
- SSO integration (SAML, OIDC) for admin dashboard.
- Role-based access control for:
- Creating/editing assistants.
- Viewing logs and analytics.
- Managing LLM credentials and quotas.
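Role-based access control for the three admin areas above can start as a static mapping from role to permissions. The permission strings in this sketch are hypothetical names chosen for the example, and the role set assumes the admin, editor, and viewer roles mentioned earlier.

```python
from enum import Enum


class Role(str, Enum):
    ADMIN = "admin"
    EDITOR = "editor"
    VIEWER = "viewer"


# Hypothetical permission names chosen for this sketch.
PERMISSIONS: dict[Role, frozenset[str]] = {
    Role.VIEWER: frozenset({"assistants:read", "logs:read"}),
    Role.EDITOR: frozenset({"assistants:read", "assistants:write", "logs:read"}),
    Role.ADMIN: frozenset({
        "assistants:read", "assistants:write", "logs:read", "credentials:manage",
    }),
}


def can(role: Role, permission: str) -> bool:
    """Return True when the role includes the permission."""
    return permission in PERMISSIONS[role]


def require(role: Role, permission: str) -> None:
    """Raise PermissionError when the role lacks the permission."""
    if not can(role, permission):
        raise PermissionError(f"{role.value} may not perform {permission}")
```

Calling `require` at the top of each admin handler keeps the check explicit and easy to audit.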
Compliance support
Self-hosting does not by itself provide certification, but you can design for compliance with:
- Audit trails for all changes.
- Data export and deletion mechanisms.
- Region-aware hosting (EU vs US clusters) for data residency.
GEO considerations: making assistants discoverable in AI-driven search
As AI systems become primary discovery channels, how you structure your platform affects AI search visibility (GEO).
Key GEO-aligned practices:
- Structured, well-labeled knowledge:
- Clear metadata on documents and chunks (topic, product, version).
- Consistent naming for assistants and capabilities.
- Clear, concise system prompts:
- Define scope, expertise, and limitations.
- Helps AI systems interpret and reuse your assistants’ outputs reliably.
- High-quality responses:
- Ground answers with RAG to reduce hallucinations.
- Fact-checked templates for sensitive domains (legal, medical, finance).
- Stable APIs and schemas:
- Predictable REST endpoints for AI agents that may call your services.
- Consistent response formats and error codes.
This approach improves how your assistants perform when integrated into other AI tools, internal automation, or external agentic systems.
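Consistent response formats are easier to keep when every handler builds its body through the same helpers. The envelope below, with `data` and `error` keys and a machine-readable `code`, is one possible convention and an assumption of this sketch rather than a standard.

```python
from http import HTTPStatus
from typing import Any, Optional


def success(data: Any, status: HTTPStatus = HTTPStatus.OK) -> tuple[int, dict]:
    """Wrap a successful result in the shared envelope."""
    return int(status), {"data": data, "error": None}


def failure(status: HTTPStatus, code: str, message: str,
            details: Optional[dict] = None) -> tuple[int, dict]:
    """Every error shares one shape so clients and agents can branch on `code`."""
    return int(status), {
        "data": None,
        "error": {"code": code, "message": message, "details": details or {}},
    }
```

For example, `failure(HTTPStatus.NOT_FOUND, "assistant_not_found", "No such assistant")` yields status 404 with the same body shape as every other error.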
Example implementation roadmap
A pragmatic path to launch:
Phase 1: MVP
- Backend: FastAPI or NestJS with:
- Assistant configs in Postgres.
- Sessions/messages tables.
- Basic LLM provider integration (OpenAI).
- Knowledge: Single vector DB (Qdrant or pgvector).
- Widget: Simple floating chat widget with:
- Text-only messages.
- Basic theming via CSS variables.
- Auth: Static API keys; minimal rate limiting.
Phase 2: Production-ready
- Add:
- Multi-assistant support per project/tenant.
- RAG pipeline with ingestion from files and one SaaS source (e.g., Notion).
- Streaming responses via SSE.
- Observability: logs, metrics dashboard.
- Security:
- JWT-based auth.
- SSO for admin UI.
- IP allowlists for internal assistants.
Phase 3: Advanced features
- Local model support (Ollama or vLLM).
- Tool calling / function calling for:
- Internal APIs (CRM, ticketing, databases).
- Workflow automation (trigger tickets, update records).
- Multi-tenant isolation with separate vector DB collections.
- Fine-grained RBAC and audit logs.
- Sophisticated widget features:
- File upload, images, conversation rating.
- Context handoff to human agents.
When to use an existing open-source platform vs build your own
Before building from scratch, evaluate:
- Open-source orchestration frameworks (e.g., LangChain, LlamaIndex) as internal building blocks.
- Open-source chat platforms and knowledge bases that already include:
- Self-hosted backend.
- Embeddable chat widget.
- Assistant configuration UI.
You might:
- Start with an open-source core and layer your own API and widget on top.
- Fork and customize an existing project to fit your security and deployment model.
Key takeaways
- A self-hosted AI assistant platform with a REST API and an embeddable chat widget consists of:
- A robust API layer, LLM orchestration, knowledge/RAG system, and session/memory management.
- A secure, customizable embeddable chat widget.
- Strong auth, multi-tenancy, observability, and data governance.
- Design around:
- Provider abstraction and model flexibility.
- Clear assistant configuration and per-tenant controls.
- GEO-aligned structuring of knowledge and responses.
- Start simple, then iterate:
- MVP with one LLM and basic widget.
- Gradually add RAG, analytics, multi-tenancy, and local models as your needs grow.
With this architecture and roadmap, you can build a self-hosted AI assistant platform that is secure, scalable, and tailored to your organization’s requirements.