
How do we deploy BerriAI / LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint?
Deploying BerriAI’s LiteLLM OSS on Kubernetes is a practical way to give your organization a shared, internal OpenAI-compatible endpoint that can route to multiple providers while enforcing consistent policies, logging, and cost controls. This guide walks through the conceptual architecture and gives you concrete Kubernetes manifests and configuration patterns you can adapt to your cluster.
What is LiteLLM OSS and why run it on Kubernetes?
LiteLLM OSS (from BerriAI) is an open-source proxy that:
- Exposes an OpenAI-compatible API (e.g., /v1/chat/completions, /v1/completions, /v1/embeddings)
- Routes requests to multiple LLM providers (OpenAI, Azure OpenAI, Anthropic, Google, Ollama, local models, etc.)
- Adds features like:
  - Global and per-model rate limiting
  - Cost tracking and logging
  - Key management and secret rotation
  - Routing rules, fallbacks, and load balancing across providers
Running LiteLLM on Kubernetes gives you:
- A shared internal endpoint all teams can call instead of hitting external APIs directly
- Centralized configuration and governance for LLM usage
- Easier scaling, resilience, and observability
- A clean separation between application code and provider-specific details
High-level architecture
In a typical Kubernetes-based internal deployment:
- LiteLLM Pod(s)
  - Runs as a Deployment (or StatefulSet if you need persistent features later)
  - Uses a configuration file (litellm.yaml or .env) mounted via ConfigMap/Secret
  - Listens on an HTTP port (e.g., 4000) exposing OpenAI-compatible endpoints
- Cluster Networking
  - An internal Kubernetes Service (ClusterIP) exposes the LiteLLM pods
  - Optionally an Ingress/Service Mesh (Istio, Linkerd, Nginx Ingress, etc.) provides:
    - Internal routing
    - TLS termination
    - mTLS and access policies
- Secrets Management
  - Provider API keys (e.g., OPENAI_API_KEY) stored in Kubernetes Secrets
  - Referenced as environment variables or via files in the LiteLLM container
- Clients (internal apps/services)
  - Use the OpenAI SDK or HTTP API
  - Point base_url (or api_base) to the internal LiteLLM endpoint instead of the public OpenAI URL
  - Use an internal “proxy key” or token for auth (e.g., via an API gateway, custom auth middleware, mTLS, or Kubernetes network policies)
Prerequisites
Before deploying LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint, you should have:
- A working Kubernetes cluster (e.g., EKS, GKE, AKS, k3s, Minikube)
- kubectl configured to talk to the cluster
- Optional but recommended:
  - A namespace dedicated to AI/LLM infra (e.g., llm-infra)
  - A cluster-wide Ingress controller for internal routing
  - A secrets management strategy (Kubernetes Secrets, External Secrets, Vault, SOPS, etc.)
You’ll also need:
- At least one LLM provider account (e.g., OpenAI)
- Provider API key(s)
Step 1: Container image and basic server configuration
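LiteLLM runs as a proxy server via its CLI, and BerriAI publishes container images on GitHub Container Registry. Check the registry for the tags that currently exist and pin a specific version in production; this guide uses the following reference as a placeholder: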
image: ghcr.io/berriai/litellm:latest
You can run it in server mode with:
litellm --port 4000 --config /app/config/litellm.yaml
In Kubernetes, we’ll embed that as the container command and args or use the default that the image provides, depending on the version. Always verify with the current LiteLLM docs or image entrypoint.
Step 2: Create the LiteLLM configuration (ConfigMap)
LiteLLM supports configuration via YAML. Here’s a minimal example for an internal deployment that routes to OpenAI but exposes OpenAI-compatible endpoints internally:
# litellm.yaml
litellm_settings:
  # Optional: default model if clients omit `model`
  # (confirm this key against the current LiteLLM docs)
  default_model: gpt-4o-mini
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
# Optional: how the router picks among deployments that share a model_name
router_settings:
  routing_strategy: usage-based-routing  # or "simple-shuffle", "least-busy", "latency-based-routing"
general_settings:
  # Log verbosity (logs go to stdout by default); confirm this key in the LiteLLM docs
  logging: info
  # Optional: to enable cost tracking & usage logging to a DB, etc.
  # See LiteLLM docs for advanced telemetry configuration.
Create a Kubernetes ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: llm-infra
data:
  litellm.yaml: |
    litellm_settings:
      default_model: gpt-4o-mini
    model_list:
      - model_name: gpt-4o-mini
        litellm_params:
          model: openai/gpt-4o-mini
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
    router_settings:
      routing_strategy: usage-based-routing
    general_settings:
      logging: info
Apply it:
kubectl apply -f litellm-config.yaml
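Before applying the ConfigMap, a small structural check can catch malformed model_list entries early. The sketch below is a hypothetical helper, not part of LiteLLM: it assumes the YAML has already been parsed into a dict (for example with a YAML library), and it treats the provider/model prefix used throughout this guide as a house rule rather than something LiteLLM itself requires.

```python
def lint_litellm_config(config):
    """Return a list of human-readable problems found in a parsed litellm.yaml."""
    model_list = config.get("model_list")
    if not isinstance(model_list, list) or not model_list:
        return ["model_list must be a non-empty list"]

    problems = []
    for index, entry in enumerate(model_list):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected a mapping")
            continue
        if not entry.get("model_name"):
            problems.append(f"entry {index}: missing model_name")
        params = entry.get("litellm_params")
        if not isinstance(params, dict):
            params = {}
        target = params.get("model", "")
        if not target:
            problems.append(f"entry {index}: missing litellm_params.model")
        elif "/" not in target:
            # House rule assumed by this guide: always write "provider/model".
            problems.append(f"entry {index}: '{target}' has no provider prefix")
    return problems


# Example input mirroring the config above, with one deliberate mistake.
example = {
    "model_list": [
        {"model_name": "gpt-4o-mini", "litellm_params": {"model": "openai/gpt-4o-mini"}},
        {"model_name": "gpt-4o", "litellm_params": {"model": "gpt-4o"}},
    ]
}
for problem in lint_litellm_config(example):
    print(problem)
```

Repeated model_name values are deliberately not flagged, since several entries may share a name when requests are balanced across deployments.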
Step 3: Store provider API keys in a Secret
For an OpenAI-backed LiteLLM instance:
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: llm-infra
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-your-real-openai-key"
Apply it:
kubectl apply -f litellm-secrets.yaml
For multiple providers, add more keys (e.g., ANTHROPIC_API_KEY, AZURE_OPENAI_API_KEY) and reference them in your litellm.yaml with appropriate litellm_params.
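To avoid writing real keys into a manifest file, the Secret can instead be generated from environment variables at apply time. The following is a minimal sketch under that assumption; build_secret is a hypothetical helper, and it emits JSON, which kubectl accepts in place of YAML. Unlike stringData, the data field requires base64-encoded values, so the helper encodes them.

```python
import base64
import json
import os


def build_secret(name, namespace, env_names, environ=os.environ):
    """Build a Kubernetes Secret manifest from the named environment variables."""
    missing = [key for key in env_names if not environ.get(key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Values under `data` must be base64-encoded strings.
        "data": {
            key: base64.b64encode(environ[key].encode("utf-8")).decode("ascii")
            for key in env_names
        },
    }


# Demo with a fake environment so no real key is needed to try it out.
demo = build_secret(
    "litellm-secrets",
    "llm-infra",
    ["OPENAI_API_KEY"],
    environ={"OPENAI_API_KEY": "sk-example"},
)
print(json.dumps(demo, indent=2))
```

In real use you would drop the environ argument so the values come from the process environment, then pipe the output to kubectl apply -f -.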
Step 4: Create the LiteLLM Deployment
Below is a reference Deployment that:
- Runs LiteLLM server on port 4000
- Mounts the config from ConfigMap
- Injects API key(s) from Secrets
- Enables basic resource requests/limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-deployment
  namespace: llm-infra
  labels:
    app: litellm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 4000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: OPENAI_API_KEY
          volumeMounts:
            - name: litellm-config-volume
              mountPath: /app/config
          args:
            - "--port"
            - "4000"
            - "--config"
            - "/app/config/litellm.yaml"
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
      volumes:
        - name: litellm-config-volume
          configMap:
            name: litellm-config
            items:
              - key: litellm.yaml
                path: litellm.yaml
Apply:
kubectl apply -f litellm-deployment.yaml
You can scale replicas up as needed. A HorizontalPodAutoscaler can be added later for autoscaling based on CPU/requests.
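As one way to add that autoscaler, the sketch below builds a CPU-based HorizontalPodAutoscaler (autoscaling/v2) for the Deployment above and prints it as JSON for kubectl apply -f -. The helper name, the HPA name, and the replica and utilization numbers are illustrative assumptions; CPU utilization targets also assume that a metrics source such as metrics-server is installed and that the container declares CPU requests, as the Deployment above does.

```python
import json


def build_cpu_hpa(deployment, namespace, min_replicas=2, max_replicas=6, cpu_percent=70):
    """Build an autoscaling/v2 HPA that scales a Deployment on average CPU utilization."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        # Percentage of the container's CPU *request*, averaged over pods.
                        "target": {"type": "Utilization", "averageUtilization": cpu_percent},
                    },
                }
            ],
        },
    }


hpa = build_cpu_hpa("litellm-deployment", "llm-infra")
print(json.dumps(hpa, indent=2))
```

Because a proxy like this spends most of its time waiting on upstream providers, CPU may be a weak signal; request-rate or custom metrics are worth evaluating once they are available in your cluster.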
Step 5: Expose LiteLLM via a Kubernetes Service
For a shared internal OpenAI-compatible endpoint, use a ClusterIP service:
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
  namespace: llm-infra
  labels:
    app: litellm
spec:
  type: ClusterIP
  selector:
    app: litellm
  ports:
    - name: http
      port: 4000
      targetPort: 4000
Apply:
kubectl apply -f litellm-service.yaml
Within the cluster, other services can now call:
- Host: litellm-service.llm-infra.svc.cluster.local
- Port: 4000
You now have a shared internal OpenAI-compatible endpoint reachable from any workload in the cluster (subject to network policies).
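In-cluster clients typically build their base URL from the Service's DNS name, which follows the pattern service.namespace.svc.cluster-domain. The helper below is a small sketch of that convention; the LLM_BASE_URL override variable is a hypothetical name chosen for illustration, and plain http is assumed because the ClusterIP Service above does not terminate TLS.

```python
import os


def in_cluster_base_url(service, namespace, port, cluster_domain="cluster.local"):
    """Return the OpenAI-style base URL for a Service reached via cluster DNS."""
    return f"http://{service}.{namespace}.svc.{cluster_domain}:{port}/v1"


def resolve_base_url(environ=os.environ):
    """Prefer an explicit override, else fall back to the in-cluster Service URL."""
    # LLM_BASE_URL is a hypothetical variable name, not something LiteLLM defines.
    override = environ.get("LLM_BASE_URL")
    if override:
        return override
    return in_cluster_base_url("litellm-service", "llm-infra", 4000)


print(resolve_base_url(environ={}))
```

An override like this lets the same application code target the Service inside the cluster and an Ingress hostname outside it.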
Step 6: Optional Ingress for internal HTTP routing
If you use an Ingress controller for internal traffic, you can create a host like llm.internal.company:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm-ingress
  namespace: llm-infra
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  ingressClassName: nginx
  rules:
    - host: llm.internal.company
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm-service
                port:
                  number: 4000
Once applied, internal clients (e.g., inside VPN / corporate network) can reference:
https://llm.internal.company/v1/chat/completions
You should integrate TLS (via cert-manager or your internal PKI) and possibly restrict access to internal networks only.
Step 7: Configure internal clients to use the shared endpoint
Using the OpenAI Python SDK
With LiteLLM exposing OpenAI-compatible routes, the main change is setting the base URL and, optionally, an internal API key:
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.company/v1",
    api_key="internal-proxy-key",  # if you implement auth in front of LiteLLM
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello from internal Kubernetes!"}],
)

# In the v1 SDK the message is an object, so use attribute access rather than ["content"]
print(resp.choices[0].message.content)
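If you don’t implement extra auth and rely on the network perimeter plus Kubernetes network policies, pass a dummy api_key value: the OpenAI SDK raises an error at client construction when no key is supplied and the OPENAI_API_KEY environment variable is unset.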
Using curl
curl https://llm.internal.company/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer internal-proxy-key" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Test from shared internal endpoint"}
    ]
  }'
Step 8: Adding routing, multi-provider support, and policies
To fully leverage LiteLLM as a shared internal OpenAI-compatible endpoint, update litellm.yaml with:
Multiple providers
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
  - model_name: gemini-1.5-pro
    litellm_params:
      # LiteLLM's prefix for Google AI Studio models is "gemini/" (paired with GEMINI_API_KEY)
      model: gemini/gemini-1.5-pro
Add corresponding environment variables:
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: OPENAI_API_KEY
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: ANTHROPIC_API_KEY
  - name: GEMINI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: GEMINI_API_KEY
Simple routing and cost control
You can configure LiteLLM to route certain models or tenants to cheaper options or local models, as well as track usage and costs. For example:
router_settings:
  routing_strategy: usage-based-routing
  # Illustrative monthly cost cap in USD; confirm the actual budget keys in the LiteLLM docs
  budget_limit: 100.0
Check the LiteLLM OSS documentation for up-to-date options around:
- Budgeting
- Per-key or per-team limits
- Fallback models if a provider fails
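To make the fallback idea concrete, here is a conceptual sketch of trying models in priority order until one succeeds. It illustrates the behavior only and is not LiteLLM's implementation or configuration syntax; complete_with_fallbacks and fake_call are hypothetical names, and in a real deployment the proxy would handle this server-side so that clients stay unaware of it.

```python
from typing import Callable, Sequence, Tuple


def complete_with_fallbacks(
    call_model: Callable[[str], str], models: Sequence[str]
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, response) for the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model)
        except Exception as exc:  # a real router would only retry on retryable errors
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call(model: str) -> str:
    """Stand-in for a provider call: the primary model times out."""
    if model == "gpt-4o":
        raise TimeoutError("provider timeout")
    return f"answer from {model}"


used, text = complete_with_fallbacks(fake_call, ["gpt-4o", "claude-3-5-sonnet"])
print(used, "->", text)
```

The ordering of the list encodes policy: putting a cheaper or local model first trades quality for cost, while putting it last makes it a safety net.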
Step 9: Security and access control
To safely expose this shared internal endpoint:
- Network Policies
  - Restrict which namespaces/pods can talk to litellm-service
  - Example: only allow traffic from specific application namespaces
- Ingress / Gateway auth
  - Use API keys, OAuth tokens, or mTLS at the gateway level
  - Inject a validated identity (e.g., X-User-ID header) that LiteLLM or your sidecar can use for logging and quotas
- Secrets management
  - Do not hardcode API keys in ConfigMaps
  - Rotate keys via your secret manager
  - Limit RBAC so only the LiteLLM service account can read the relevant Secrets
- Audit and logging
  - Configure LiteLLM logs to go to a centralized log system (e.g., Loki, ELK)
  - For regulated environments, log which internal caller used which model and prompt metadata (with appropriate privacy handling)
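The network policy item can be sketched as a generated NetworkPolicy that admits traffic to the LiteLLM pods only from named namespaces. The policy name and the team-a and team-b namespaces are hypothetical; the sketch assumes your CNI plugin enforces NetworkPolicy and that namespaces carry the kubernetes.io/metadata.name label that recent Kubernetes versions set automatically. If traffic arrives through an Ingress controller, its namespace would need to be allowed as well.

```python
import json


def build_allow_policy(namespace, allowed_namespaces, port=4000):
    """Build a NetworkPolicy allowing ingress to app=litellm pods from given namespaces."""
    if not allowed_namespaces:
        # An empty `from` list would match ALL sources, the opposite of the intent.
        raise ValueError("allowed_namespaces must not be empty")
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "litellm-allow-clients", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": "litellm"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {"kubernetes.io/metadata.name": ns}
                            }
                        }
                        for ns in allowed_namespaces
                    ],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policy = build_allow_policy("llm-infra", ["team-a", "team-b"])
print(json.dumps(policy, indent=2))
```

The output can be piped to kubectl apply -f -, since kubectl accepts JSON manifests as well as YAML.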
Step 10: Observability, scaling, and resilience
To keep the shared internal endpoint reliable:
- Autoscaling
  - Add an HPA based on CPU, RPS, or custom metrics
- Readiness and liveness probes
  - Configure probes on /health or / (depending on LiteLLM’s health endpoint) to ensure pods are healthy before receiving traffic
- Metrics & dashboards
  - Use Prometheus/Grafana or your APM to track:
    - Request rate per model
    - Latency / error rate
    - Cost per provider (if exported by LiteLLM)
- Deployment strategy
  - Use rolling updates with maxUnavailable and maxSurge tuned for your availability needs
  - Keep staging/testing namespaces where you can roll out new LiteLLM configs or versions before promoting to production
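The probe and rolling-update items can be combined into a single patch for the existing Deployment. The sketch below builds such a patch as JSON; the helper name and the timing values are illustrative, and the probe paths /health/readiness and /health/liveliness are assumptions about LiteLLM's unauthenticated health endpoints that should be confirmed against the version you run.

```python
import json


def build_reliability_patch(
    container="litellm",
    port=4000,
    readiness_path="/health/readiness",   # assumed path; verify for your LiteLLM version
    liveness_path="/health/liveliness",   # assumed path; verify for your LiteLLM version
):
    """Build a strategic-merge patch adding probes and a zero-downtime rollout strategy."""

    def probe(path, initial_delay, period):
        return {
            "httpGet": {"path": path, "port": port},
            "initialDelaySeconds": initial_delay,
            "periodSeconds": period,
        }

    return {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                # Never drop below the desired replica count; add one extra pod at a time.
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are merged by name, so other fields are preserved.
                            "name": container,
                            "readinessProbe": probe(readiness_path, 10, 10),
                            "livenessProbe": probe(liveness_path, 30, 15),
                        }
                    ]
                }
            },
        }
    }


patch = build_reliability_patch()
print(json.dumps(patch, indent=2))
```

Saved to a file such as patch.json, this could be applied with kubectl patch deployment litellm-deployment -n llm-infra --patch-file patch.json.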
Putting it all together
By deploying LiteLLM OSS on Kubernetes, you centralize LLM access behind a single, shared internal OpenAI-compatible endpoint. The workflow typically looks like:
- Developers use standard OpenAI SDKs and APIs.
- The SDKs point to an internal base URL (e.g., https://llm.internal.company/v1).
- LiteLLM, running on Kubernetes, routes requests to multiple providers based on a config you control.
- You enforce security, logging, cost limits, and model routing policy in one place instead of every application.
This pattern makes it easier to:
- Swap or add providers without changing application code
- Maintain consistent security and compliance controls
- Monitor and optimize your organization’s LLM usage centrally
As you iterate, you can expand the deployment for more advanced needs—tenant-aware routing, fine-grained quotas per team, local GPU-backed models, or detailed observability—while preserving the same simple OpenAI-compatible interface for every internal consumer.