
We need “no data leaves our domain”—what’s the right architecture to run LLM inference inside our VPC/on‑prem/air‑gapped network?
Most teams that say “no data leaves our domain” discover very quickly that the problem is not just where the model runs. The real challenge is designing an architecture where LLMs, agents, tools, and observability all live inside your VPC/on‑prem/air‑gapped network—without recreating a tangle of point integrations, shadow APIs, and un-auditable prompt usage.
This guide lays out an architecture that keeps all inference inside your controlled environment while still giving you:
- A single, governable entry point for LLM access
- Support for multiple providers and self-hosted models
- Full observability (tokens, latency, GPU utilization) and auditability
- A path to agentic workflows and tools without breaking data residency
Design goals when “no data leaves our domain”
Before picking components, be explicit about what “no data leaves our domain” means in practice:
- Inference isolation: All prompts, responses, intermediate tool outputs, and embeddings stay within your VPC/on‑prem/air‑gapped environment.
- Control-plane sovereignty: Routing, policies, RBAC, and monitoring metadata must be enforced inside your domain—not via a third‑party SaaS gateway.
- Provider abstraction without egress: You may still want to talk to external APIs (commercial LLMs, SaaS tools), but only via explicitly governed egress points, with clear redaction/PII-masking rules.
- Compliance and audit readiness: Every call to a model or agent must be attributable to a user/service, logged immutably, and reconstructible in an audit or incident review.
- Operational visibility: You need OpenTelemetry-compatible traces and metrics for:
- Latency and time to first token (TTFT)
- Token usage and cost per service/team/environment
- GPU utilization, memory pressure, and autoscaling decisions
When you accept these as non‑negotiables, the “right architecture” stops looking like “drop vLLM behind an internal load balancer” and starts looking like a governed AI Gateway + self-hosted inference + MCP-based tool layer.
High-level reference architecture (inside VPC/on‑prem/air‑gapped)
At a high level, an enterprise-ready, no‑egress LLM architecture includes:
- AI Gateway (deployed inside your VPC/on‑prem/air‑gapped environment)
- Single entry point for all LLM and agent requests
- Central place for model routing, retries, failover, and rate limits
- Policy enforcement, PII masking, RBAC, and audit logging
- Tracing and metrics export (OpenTelemetry → Grafana/Datadog/Prometheus)
- Virtual Models fronting multiple underlying engines
- A stable API that your applications call (e.g., POST /v1/chat/completions)
- Behind the scenes: routes to vLLM/TGI/Triton or internal providers based on weights, latency, or priority
- Supports sticky routing windows for deterministic behavior in multi-step workflows
- Self-hosted model serving layer
- vLLM, TGI, Triton, KServe, or SGLang running on your GPUs/TPUs
- GPU orchestration with autoscaling, MIG/time slicing, and scale to zero
- Deployed on Kubernetes or a managed cluster within your network
- MCP Gateway & Agents Registry
- Registry of tools/APIs (MCP servers) that agents can call
- Schema validation and access control for every tool
- Rate limits and isolation per tool, team, or project
- Observability & governance plane
- Unified dashboards for:
- Per-service token usage and cost
- Latency and error rates per model/route
- GPU utilization, node health, and scaling events
- Immutable audit logs tied to identity (SSO) and RBAC
- Strictly governed egress (optional)
- For teams that occasionally call external LLM APIs, you add a controlled egress layer:
- PII masking and redaction before egress
- Policy that certain data never leaves (e.g., by classification or tenant)
- Route only specific traffic to external endpoints; everything else stays inside
With TrueFoundry, the AI Gateway, Virtual Models, MCP Gateway, self-hosted deployment, and tracing all run inside your environment—VPC, on‑prem, hybrid, or fully air‑gapped. No data leaves your domain unless you explicitly configure an external provider.
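To make the single governed entry point concrete, here is a minimal Python sketch of the OpenAI-compatible request shape an application might send to an in-domain gateway. The endpoint URL and the Virtual Model name are hypothetical placeholders, not a specific product's API:

```python
import json

# Hypothetical internal gateway endpoint -- traffic never leaves the VPC.
GATEWAY_URL = "https://ai-gateway.internal.example/v1/chat/completions"

def build_chat_request(virtual_model: str, user_prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible chat payload addressed to a logical
    Virtual Model name rather than a specific backend server."""
    return {
        "model": virtual_model,  # logical name, e.g. vm-support-chat
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }

payload = build_chat_request("vm-support-chat", "Summarize this ticket for me.")
print(json.dumps(payload, indent=2))
```

Because every application builds the same shape against the same endpoint, swapping the backend model is a gateway configuration change, not an application change.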
Why a self-hosted AI Gateway is the core of this design
If you try to solve “no data leaves our domain” by just self-hosting models, you end up with:
- Every app talking to each model separately
- Inconsistent auth, retries, and fallback behavior
- No central view of cost or token usage
- No standard tracing across tools, agents, and models
- No way to enforce org-wide policies (PII, retention, allowed models)
A self-hosted AI Gateway solves this as a shared, governed capability layer:
- Centralized routing: All apps talk to one endpoint. The gateway routes to the right model (or chain) based on config and policies.
- No SDK sprawl: Instead of embedding provider-specific SDKs per app, your apps use a single, stable API and auth scheme.
- Fallback and reliability:
- Configure retries and automatic failover to secondary models when primary is down or too slow.
- Use latency-aware routing and geo-aware routing (for multi-region VPC) to keep SLAs.
- Governance at the gateway, not per app:
- SSO and granular RBAC on who can call which models or agents.
- Real-time policy enforcement for maximum tokens, content rules, and which data can be logged.
- Immutable audit logs for each request.
Because the AI Gateway runs where your data lives, the “control plane” and “data plane” stay inside your domain. You avoid the class of risk where a third‑party SaaS gateway becomes a de facto data processor.
Component-by-component architecture inside your domain
1. AI Gateway: Govern, route, and secure all LLM traffic
Deployment:
Run the AI Gateway as an internal service:
- On Kubernetes in your VPC
- On‑prem cluster (bare metal or virtualized)
- Inside a fully air‑gapped network, with no outbound internet at all
Key responsibilities:
- Model abstraction and routing
- Expose models via logical names (vm-default-chat, vm-secure-rag) instead of IPs/URLs.
- Configure Virtual Models that route to one or more underlying backends:
- Weighted parallel routing for A/B tests or cost-performance balance.
- Latency/priority routing to maintain SLOs.
- Retries and automatic failover on timeouts/errors.
- Access control and identity
- Integrate with SSO (Okta, Azure AD, internal IdP).
- Issue per-service API tokens with scopes tied to models, tools, and environments.
- Use RBAC to control:
- Which teams can call which models
- Who can change routing or prompts
- Who can access logs and traces
- Policy enforcement and data protection
- Request-time validation (max tokens, allowed tools, allowed model families).
- Request and response filters:
- PII masking/redaction before storage or egress
- Content filtering rules per environment (e.g., stricter in production)
- Configurable logging:
- Log full prompts in non‑prod
- Log only metadata (latency, token counts) in prod if required
- Observability and tracing
- OpenTelemetry-compliant traces for each request:
- Prompt → Virtual Model → underlying model → tools → response
- Export to your existing stack: Grafana, Datadog, Prometheus, etc.
- Dashboards for:
- Latency and TTFT per model/route
- Token usage and cost per app/team/environment
- Error-rate heatmaps and circuit-breaker decisions
Because all of this lives in your environment, you meet “no data leaves our domain” while gaining far more control than app-level SDKs.
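As an illustration of the request-time RBAC described above, here is a minimal sketch of a scope check the gateway could apply; the token IDs, scope structure, and model names are invented for the example:

```python
# Sketch of a request-time RBAC check at the gateway.
# Token IDs, scopes, and model names are illustrative, not a real API.
TOKEN_SCOPES = {
    "svc-support-bot": {"models": {"vm-support-chat"}, "env": "prod"},
    "team-research":   {"models": {"vm-default-chat", "vm-secure-rag"}, "env": "staging"},
}

def authorize(token_id: str, model: str, env: str) -> bool:
    """Allow the call only if the token's scope covers both the
    requested Virtual Model and the target environment."""
    scope = TOKEN_SCOPES.get(token_id)
    if scope is None:
        return False  # unknown token -> deny by default
    return model in scope["models"] and env == scope["env"]

print(authorize("svc-support-bot", "vm-support-chat", "prod"))   # allowed
print(authorize("svc-support-bot", "vm-secure-rag", "prod"))     # denied
```

The point of the sketch is the placement, not the logic: the check runs once at the gateway, so individual applications never re-implement authorization.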
2. Virtual Models: Swap and combine models without app changes
Problem: If every app calls a specific model endpoint, you can’t evolve your stack without changing code everywhere.
Virtual Models are the abstraction layer in the gateway:
- Single interface, multiple backends
- Your apps call vm-support-chat or vm-coding-assistant.
- Behind the scenes, the Virtual Model can:
- Route 80% of traffic to a cost-efficient open-source model on vLLM.
- Route 20% to a larger, slower model for high-risk or escalated flows.
- Failover to a backup model if the primary is unavailable.
- Operation-specific behavior
- Different routing for prod vs staging.
- Sticky routing TTL windows for multi-turn flows to make reasoning consistent across steps.
- Governance built-in
- Per-Virtual-Model rate limits and budgets.
- Policy that certain Virtual Models are only accessible from specific VPC subnets or service accounts.
By fronting all actual models with Virtual Models, you keep your application code stable while your infra and model stack evolve.
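The weighted routing and failover behavior of a Virtual Model can be sketched in a few lines of Python. The backend names and weights below are illustrative, not a real gateway configuration format:

```python
import random

# Illustrative Virtual Model config: weighted backends plus a fallback.
BACKENDS = [
    ("vllm-llama-8b", 0.8),   # cost-efficient default path
    ("vllm-llama-70b", 0.2),  # larger model for harder requests
]
FALLBACK = "tgi-mistral-7b"

def pick_backend(healthy: set, rng: random.Random) -> str:
    """Weighted choice among currently healthy backends; if none are
    healthy, fail over to the designated fallback model."""
    candidates = [(name, w) for name, w in BACKENDS if name in healthy]
    if not candidates:
        return FALLBACK
    names, weights = zip(*candidates)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(7)  # seeded so the sketch is reproducible
print(pick_backend({"vllm-llama-8b", "vllm-llama-70b"}, rng))
print(pick_backend(set(), rng))  # everything down -> fallback
```

A real gateway layers health checks, sticky-session TTLs, and retries on top of this core selection step, but the application only ever sees the Virtual Model name.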
3. Self-hosted model serving: vLLM, TGI, Triton inside your VPC/on‑prem
With “no data leaves our domain,” you will almost certainly run at least some models yourself.
Model servers to consider:
- vLLM – high-throughput serving for large models, strong for chat/streaming patterns.
- TGI (Text Generation Inference) – optimized for popular open-source LLMs.
- Triton or KServe – good for heterogeneous workloads beyond just LLMs (vision, embeddings, custom models).
- SGLang – optimized routing and speculative decoding for latency-sensitive scenarios.
Key operational scaffolding:
- Kubernetes or equivalent cluster running in your VPC/on‑prem environment.
- GPU orchestration:
- Dynamic scheduling of models onto available GPUs.
- Fractional GPU support via NVIDIA MIG and time slicing to pack multiple smaller models on the same hardware.
- Autoscaling & scale to zero to keep idle models from consuming GPU budget.
- Deployment pipeline:
- Containerized models with image streaming for rapid cold starts.
- CI/CD that can roll out new checkpoints or quantized variants.
- Canary deployments behind Virtual Models to test new versions.
TrueFoundry’s model deployment stack is built for this: you deploy LLaMA, Mistral, Falcon, and other open-source models on vLLM/TGI/Triton inside your own environment, and the AI Gateway exposes them with zero SDK changes in your apps.
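The autoscaling and scale-to-zero behavior above can be illustrated with a toy policy. The thresholds, replica model, and ceiling-division sizing are assumptions made for the sketch, not how any particular orchestrator implements it:

```python
from dataclasses import dataclass

@dataclass
class ModelReplica:
    """Toy view of one serving deployment for scale-to-zero decisions."""
    name: str
    last_request_ts: float  # epoch seconds of the last request seen
    queue_depth: int = 0    # requests currently waiting

def desired_replicas(rep: ModelReplica, now: float,
                     idle_timeout_s: float = 600.0,
                     max_queue_per_replica: int = 8) -> int:
    """Scale to zero after idle_timeout_s with no traffic; otherwise
    size the replica count to the pending queue (ceiling division)."""
    if rep.queue_depth == 0 and now - rep.last_request_ts > idle_timeout_s:
        return 0
    return max(1, -(-rep.queue_depth // max_queue_per_replica))

rep = ModelReplica("vllm-llama-8b", last_request_ts=0.0)
print(desired_replicas(rep, now=700.0))  # idle past the timeout -> 0
rep.queue_depth = 20
print(desired_replicas(rep, now=700.0))  # backlog of 20, 8 per replica -> 3
```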
4. MCP Gateway & Agents Registry: Govern tools and agent actions
Once LLMs are in production, you quickly move from “just inference” to agents with tools—internal APIs, databases, and workflows.
This is where many architectures break data residency: tools are glued into agents in an ad‑hoc way, with no registry, no schema validation, and no central access control.
An MCP (Model Context Protocol) Gateway & Agents Registry inside your domain fixes this:
- Central registry of tools/APIs (MCP servers)
- Each tool is defined with:
- Schema (inputs/outputs)
- Rate limits
- Required auth and allowed caller identities
- Environment isolation (prod vs staging vs dev)
- Examples:
- “Customer Profile Lookup” API
- “Billing Adjustments” workflow
- “Ticket System” integration
- Governed access for agents
- Agents can only invoke tools registered in the MCP registry.
- RBAC on which agent configurations can access which tools.
- Schema validation prevents prompt injection from causing arbitrary HTTP calls.
- Per-tool observability and isolation
- Traces that show: LLM → tool planning → tool call → response → final answer.
- Rate-limits and concurrency controls per tool to protect downstream systems.
- Isolation by team/project: separate MCP servers for high-risk domains (e.g., payments).
Critical for “no data leaves our domain”: all agent planning, tool calls, and data flows happen inside your network. If you choose to call external APIs, those calls go through the same governed MCP layer, not through ad-hoc outbound requests.
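A stripped-down version of the registry-plus-validation idea can be sketched as follows; the tool, its callers, and the simplified schema are hypothetical, and real MCP servers describe tools with JSON Schema rather than plain dicts:

```python
# Sketch of a tool registry with schema and caller validation.
# Tool names, schemas, and caller identities are illustrative.
TOOL_REGISTRY = {
    "customer_profile_lookup": {
        "required": {"customer_id"},
        "types": {"customer_id": str},
        "allowed_callers": {"agent-support"},
    },
}

def validate_tool_call(caller: str, tool: str, args: dict) -> bool:
    """Reject calls to unregistered tools, unauthorized callers, or
    arguments that do not exactly match the declared schema."""
    spec = TOOL_REGISTRY.get(tool)
    if spec is None or caller not in spec["allowed_callers"]:
        return False
    if set(args) != spec["required"]:
        return False  # missing or extra arguments are both rejected
    return all(isinstance(args[k], spec["types"][k]) for k in args)

print(validate_tool_call("agent-support", "customer_profile_lookup",
                         {"customer_id": "C-1001"}))          # valid
print(validate_tool_call("agent-support", "drop_tables", {}))  # unregistered
```

Rejecting extra arguments, not just missing ones, is what closes off the prompt-injection path where a model smuggles unexpected parameters into a tool call.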
5. Observability, cost attribution, and audit logging
For a regulated or security-minded organization, “no data leaves our domain” isn’t enough—you must also prove what happened.
Observability stack:
- Traces (OpenTelemetry)
- Every request is traceable:
- App → AI Gateway → Virtual Model → Model server → (MCP tools, DBs) → Response.
- Integrate with Grafana, Datadog, Prometheus, or your existing APM.
- Metrics
- Per-service, per-environment:
- Token usage (prompt/completion)
- Cost estimation per model and per Virtual Model
- P95/P99 latency, TTFT
- Error codes and retry outcomes
- GPU metrics:
- Utilization
- Memory usage
- Node health and autoscaling behavior
- Logging & audit
- Immutable audit logs for:
- Who called what model/tool and when
- What policies were applied
- Which Virtual Model routing and fallback decisions were taken
- Configurable prompt logging to meet compliance:
- Full prompts in lower environments
- Selective redaction or metadata-only logging in production
TrueFoundry’s gateway and tracing pipeline are designed to be OpenTelemetry-compliant, so they plug directly into your observability stack without sending any telemetry to an external vendor.
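Cost attribution from gateway usage records reduces to a simple aggregation, sketched below; the per-1K-token prices, team names, and record shape are made up for the example:

```python
from collections import defaultdict

# Illustrative per-1K-token cost estimates for internal chargeback.
PRICE_PER_1K = {"vm-support-chat": 0.0004, "vm-secure-rag": 0.0020}

# Hypothetical usage records as a gateway might emit them.
records = [
    {"team": "support", "model": "vm-support-chat", "prompt_tokens": 900,  "completion_tokens": 300},
    {"team": "support", "model": "vm-support-chat", "prompt_tokens": 500,  "completion_tokens": 100},
    {"team": "legal",   "model": "vm-secure-rag",   "prompt_tokens": 2000, "completion_tokens": 500},
]

def cost_by_team(records: list) -> dict:
    """Sum estimated cost per team from gateway usage records."""
    totals = defaultdict(float)
    for r in records:
        tokens = r["prompt_tokens"] + r["completion_tokens"]
        totals[r["team"]] += tokens / 1000 * PRICE_PER_1K[r["model"]]
    return dict(totals)

print(cost_by_team(records))
```

Because every request passes through the gateway, this aggregation needs no per-application instrumentation; the records already carry team, model, and token counts.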
Example: Inside-VPC architecture patterns for different deployment modes
Pattern A: Fully air‑gapped deployment
Use this when you have no outbound internet at all:
- AI Gateway: Deployed inside the air‑gapped network, fronted by internal load balancers.
- Models: Only self-hosted LLMs (vLLM/TGI/Triton) running on your GPUs.
- MCP & tools: All tools are internal services (databases, line-of-business apps).
- Updates: Models and container images are moved in via offline channels (controlled artifact import).
- Result: All prompts, responses, traces, and logs stay in the air‑gapped environment. Zero external dependencies.
Pattern B: VPC with controlled egress to specific external LLMs
Use this when you want some access to external providers, but under strict control:
- AI Gateway: Deployed inside VPC; default routing is to self-hosted models.
- Virtual Models: Some are “internal-only”; others are configured to optionally use external APIs as fallbacks.
- Policy layer:
- Classify incoming requests (e.g., PII or highly sensitive data).
- For sensitive classes, route only to self-hosted models; do not call external APIs.
- Apply PII masking or template-based redaction before any external call.
- Audit: Every external call is logged with the responsible service identity and the applied redaction rules.
This gives you flexibility while maintaining strong guarantees about what kind of data can ever leave the VPC.
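The "redact before egress" step can be sketched with two illustrative regex patterns; a production deployment would use a proper PII-detection pipeline rather than hand-rolled regexes, and the pattern names here are invented:

```python
import re

# Minimal redaction sketch -- patterns and labels are illustrative only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_for_egress(text: str) -> str:
    """Replace matched PII spans with typed placeholders before any
    request is allowed to leave the VPC for an external provider."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_for_egress("Contact jane.doe@example.com, SSN 123-45-6789."))
# -> Contact [EMAIL], SSN [SSN].
```

In the Pattern B architecture, this function would run inside the gateway's egress policy layer, after classification has already decided that the request is allowed to use an external model at all.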
How this compares to “just put an LLM behind an API gateway”
A common anti-pattern is to take a generic API gateway (e.g., NGINX, Envoy, or a cloud API gateway) and expose model servers directly behind it.
What you miss:
- No Virtual Model layer → you can’t gracefully route across models or do weighted/latency-based fallback without custom logic.
- No native token accounting → you can’t easily report token usage and cost per service.
- No agent/tool awareness → traces don’t understand “tool calls” or MCP servers; you just see HTTP traffic.
- No prompt lifecycle controls → no versioning or monitoring of prompts as first-class entities.
- Governance becomes application code → RBAC, policy, and redaction logic are scattered.
In contrast, a purpose-built AI Gateway (like TrueFoundry) understands models, tokens, prompts, agents, and tools as first-class objects. That’s the difference between a demo architecture and an enterprise system where you can confidently say, “no data leaves our domain” and have the audit trail to back it up.
Implementation sequence: how to get there without a rewrite
If you’re already running models on-prem or in VPC, you don’t have to rip everything out. A pragmatic rollout looks like:
- Introduce the AI Gateway in front of one critical use case
- Pick a single application and front its LLM calls with the gateway.
- Configure a Virtual Model that points to your existing self-hosted model.
- Add tracing and token accounting.
- Consolidate model access across more services
- Move other applications to use the gateway endpoint.
- Decommission direct-to-model SDK usage where possible.
- Standardize auth and rate limits via the gateway.
- Add self-hosted deployment automation
- Move existing models to vLLM/TGI/Triton with GPU orchestration and autoscaling.
- Connect these deployments to the gateway as managed backends.
- Introduce MCP & agents for governed tools
- Wrap high-value internal APIs as MCP servers.
- Register them in the MCP Gateway with schemas and access control.
- Deploy agents that can call these tools through the governed MCP path.
- Tighten policies and compliance posture
- Enable PII masking/redaction where needed.
- Tune logging for production vs non‑prod.
- Integrate audit logs into your compliance workflows.
TrueFoundry is built to support exactly this path, running on‑prem, in your VPC, hybrid, or fully air‑gapped, with no requirement that data ever leave your domain.
Final verdict: the “right architecture” for no‑egress LLM inference
If your requirement is “no data leaves our domain,” the right architecture is not “just host a model locally.” The right design is:
- A self-hosted AI Gateway as the single governed interface for all LLM and agent traffic.
- Virtual Models to decouple applications from specific engines and enable routing/fallback.
- Self-hosted model serving (vLLM/TGI/Triton/KServe/SGLang) running on your GPUs/TPUs with autoscaling and GPU orchestration.
- An MCP Gateway & Agents Registry to govern tools, APIs, and agent behavior.
- End-to-end tracing, metrics, and immutable audit logging so you can prove residency, cost attribution, and policy enforcement.
That combination lets you run Generative AI at scale—on‑prem, in your VPC, or air‑gapped—without sacrificing reliability, cost discipline, or governance.