
How do we deploy BerriAI / LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint?
Deploying BerriAI / LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint is an effective way to centralize model access, standardize configuration, and control costs across your organization. This guide walks through the architecture, configuration, and deployment patterns you can use to expose LiteLLM as a private, OpenAI-style API inside your cluster.
Why run BerriAI / LiteLLM OSS on Kubernetes?
LiteLLM acts as a unified proxy that exposes an OpenAI-compatible API while routing requests to many different providers (OpenAI, Azure OpenAI, Anthropic, Google, local models, etc.). Running it on Kubernetes as a shared internal endpoint gives you:
- One consistent endpoint for all teams, e.g. http://litellm.ai-platform.svc.cluster.local:4000/v1/chat/completions
- Centralized key and provider management via environment variables, secrets, or config files
- Network isolation: keep LLM traffic inside your cluster or behind VPN / SSO
- Scalability & resilience: replicas, autoscaling, pod disruption budgets, rolling updates
- Governance and observability: shared logging, metrics, and access controls
The goal: developers can use standard OpenAI SDKs (Python, JS, etc.) pointing to your internal LiteLLM endpoint, without changing their code patterns.
High-level deployment architecture
At a high level, a Kubernetes deployment for LiteLLM OSS looks like this:
- Deployment
  - Container: ghcr.io/berriai/litellm:latest (or a pinned version)
  - Command runs the LiteLLM proxy server (FastAPI/HTTP)
  - Config passed via config file + environment variables
- Service
  - Type: ClusterIP for internal-only use, or LoadBalancer/Ingress for external access
  - Port: usually 4000 or 8000 → internal http://litellm:4000
- Config & Secrets
  - ConfigMap for LiteLLM configuration (models, routing, logging)
  - Secret for API keys to external LLM providers
  - Environment variables for optional tuning or per-namespace overrides
- Ingress / API Gateway (optional)
  - For internal DNS like https://llm-api.internal.yourcompany.com
  - TLS termination and auth (OIDC, mTLS, etc.)
- Observability
  - Logs to stdout, collected by your existing logging stack
  - Optional metrics endpoint scraped by Prometheus, visualized in Grafana
Preparing LiteLLM configuration
LiteLLM can be configured via YAML. To expose a shared internal OpenAI-compatible endpoint, you typically:
- Enable the OpenAI-style routes (/v1/chat/completions, /v1/completions, etc.)
- Define provider-specific models mapped to standardized names
- Optionally set rate limits, routing rules, and logging
Example LiteLLM config (config.yaml)
This example assumes you use OpenAI and Azure OpenAI, plus a local model via Ollama:
model_list:
  - model_name: gpt-4-turbo
    litellm_params:
      model: openai/gpt-4-turbo
      api_key: os.environ/OPENAI_API_KEY
      api_base: https://api.openai.com/v1
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      api_base: https://api.openai.com/v1
  - model_name: azure-gpt-4o
    litellm_params:
      model: azure/gpt-4o          # Azure deployment name, prefixed with the provider
      api_key: os.environ/AZURE_OPENAI_API_KEY
      api_base: os.environ/AZURE_OPENAI_ENDPOINT
      api_version: "2024-02-15-preview"
  - model_name: local-llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama.ollama.svc.cluster.local:11434

litellm_settings:
  drop_params: true    # drop provider-unsupported OpenAI params instead of erroring
  telemetry: false

router_settings:
  routing_strategy: simple-shuffle
Place this file into a ConfigMap so Kubernetes can mount it into the LiteLLM container.
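Before baking the file into a ConfigMap, it is worth checking that it parses and lists the model names you expect to expose. A minimal sketch, assuming PyYAML is available and using a trimmed-down config for illustration:

```python
# Sanity-check a LiteLLM config before loading it into a ConfigMap.
# The CONFIG string is a trimmed illustration of the config.yaml above.
import yaml

CONFIG = """
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
"""

def exposed_models(raw: str) -> list[str]:
    """Return the model_name values clients will be able to request."""
    cfg = yaml.safe_load(raw)
    return [entry["model_name"] for entry in cfg.get("model_list", [])]

print(exposed_models(CONFIG))  # ['gpt-4o-mini']
```

Running this in CI before `kubectl apply` catches indentation mistakes that would otherwise only surface as pod crash loops.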
Creating Kubernetes manifests
Below is a Kubernetes manifest set you can adapt: ConfigMap, Secret, Deployment, and Service. Adjust namespace, image version, and domains as needed.
1. ConfigMap for LiteLLM config
apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: ai-platform
data:
  config.yaml: |
    model_list:
      - model_name: gpt-4-turbo
        litellm_params:
          model: openai/gpt-4-turbo
          api_key: os.environ/OPENAI_API_KEY
          api_base: https://api.openai.com/v1
      - model_name: gpt-4o-mini
        litellm_params:
          model: openai/gpt-4o-mini
          api_key: os.environ/OPENAI_API_KEY
          api_base: https://api.openai.com/v1
      - model_name: azure-gpt-4o
        litellm_params:
          model: azure/gpt-4o
          api_key: os.environ/AZURE_OPENAI_API_KEY
          api_base: os.environ/AZURE_OPENAI_ENDPOINT
          api_version: "2024-02-15-preview"
    litellm_settings:
      drop_params: true
      telemetry: false
    router_settings:
      routing_strategy: simple-shuffle
2. Secret for provider API keys
Store provider keys in a Kubernetes Secret. Values must be base64-encoded.
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: ai-platform
type: Opaque
data:
  OPENAI_API_KEY: <base64-openai-key>
  AZURE_OPENAI_API_KEY: <base64-azure-openai-key>
  AZURE_OPENAI_ENDPOINT: <base64-azure-endpoint-url>
Encode a value:
echo -n 'sk-xxx' | base64
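If you would rather not base64-encode by hand, Kubernetes Secrets also accept plaintext values under stringData; the API server encodes them on write:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: ai-platform
type: Opaque
stringData:               # plaintext here; stored base64-encoded by the API server
  OPENAI_API_KEY: sk-xxx
```

This is convenient for templated manifests, but remember that plaintext keys in Git are still plaintext keys; use a secret store or sealed secrets for anything committed.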
3. Deployment for BerriAI / LiteLLM OSS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
  namespace: ai-platform
  labels:
    app: litellm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:latest
          imagePullPolicy: IfNotPresent
          # The image's entrypoint is the litellm CLI, so only flags are needed here
          args:
            - "--config"
            - "/config/config.yaml"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "4000"
          ports:
            - containerPort: 4000
              name: http
          envFrom:
            - secretRef:
                name: litellm-secrets
          volumeMounts:
            - name: config-volume
              mountPath: /config
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: http
            initialDelaySeconds: 10
            periodSeconds: 20
      volumes:
        - name: config-volume
          configMap:
            name: litellm-config
            items:
              - key: config.yaml
                path: config.yaml
Key notes:
- The litellm entrypoint serves HTTP routes including the OpenAI-compatible /v1 paths; --config /config/config.yaml points to your mounted ConfigMap.
- Probes should target LiteLLM's lightweight health routes (recent versions expose unauthenticated /health/liveliness and /health/readiness); the plain /health route requires auth and calls out to providers, so verify the exact paths against the LiteLLM docs for your version.
4. Service for internal access
Expose LiteLLM internally with a ClusterIP Service.
apiVersion: v1
kind: Service
metadata:
  name: litellm
  namespace: ai-platform
  labels:
    app: litellm
spec:
  type: ClusterIP
  selector:
    app: litellm
  ports:
    - name: http
      port: 4000
      targetPort: 4000
Your internal endpoint is now:
http://litellm.ai-platform.svc.cluster.local:4000
And OpenAI-style paths are under /v1, e.g.:
- /v1/chat/completions
- /v1/completions
- /v1/embeddings
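For pods calling the proxy directly, the base URL follows Kubernetes' standard service DNS convention. A tiny illustrative helper (service_base_url is not part of any SDK, just a sketch of the naming pattern):

```python
def service_base_url(service: str, namespace: str, port: int = 4000) -> str:
    """Build the in-cluster base URL for an OpenAI-compatible proxy Service."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}/v1"

print(service_base_url("litellm", "ai-platform"))
# http://litellm.ai-platform.svc.cluster.local:4000/v1
```

Keeping this pattern in one helper avoids hard-coded hostnames drifting across services when the namespace or port changes.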
Adding Ingress for a shared internal URL
If you want a friendly internal hostname and HTTPS, configure an Ingress. Example for NGINX Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm-ingress
  namespace: ai-platform
spec:
  ingressClassName: nginx   # replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
    - host: llm-api.internal.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm
                port:
                  number: 4000
  tls:
    - hosts:
        - llm-api.internal.yourcompany.com
      secretName: litellm-tls
Now your shared internal OpenAI-compatible endpoint is:
https://llm-api.internal.yourcompany.com/v1/chat/completions
Using the shared internal OpenAI-compatible endpoint
Once deployed, teams can use standard OpenAI SDKs by overriding the base_url and setting a “dummy” or shared key (or using per-team keys if you enforce auth).
Python example
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-api.internal.yourcompany.com/v1",
    api_key="internal-shared-key",  # or a real auth token if enforced
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain k8s in simple terms."},
    ],
)

print(response.choices[0].message.content)
Node.js example
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://llm-api.internal.yourcompany.com/v1",
  apiKey: "internal-shared-key",
});

const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize the benefits of LiteLLM on Kubernetes." },
  ],
});

console.log(completion.choices[0].message.content);
Because LiteLLM exposes OpenAI-compatible routes, the rest of your application logic looks the same as if you were calling OpenAI directly.
Multi-tenant and team-based configuration
To truly share LiteLLM as an internal endpoint across many teams, consider:
- Per-team API keys or tokens
  - Use an API gateway, OIDC, or a sidecar to authenticate users/teams.
  - Map tokens to usage quotas and model access policies.
- Multiple namespaces or environments
  - ai-platform-dev, ai-platform-prod with separate LiteLLM config and secrets.
  - Use different Ingress hostnames, e.g. llm-dev.internal.yourcompany.com.
- Model access control
  - Restrict sensitive models to certain teams via LiteLLM routing rules or request validation middleware.
  - Configure allowlists for specific model_name values per environment.
- Cost and usage monitoring
  - Enable LiteLLM logging with request metadata.
  - Ship logs to your observability stack and build per-team usage dashboards.
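If you would rather not front the proxy with a separate gateway, the LiteLLM proxy can also mint per-team "virtual keys" itself once a master key and a Postgres database are configured; teams then request scoped keys via the proxy's /key/generate endpoint. A hedged config sketch (the environment variable names are placeholders):

```yaml
# Sketch: enabling LiteLLM's built-in virtual keys.
# Requires a reachable Postgres instance; env var names are illustrative.
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # admin key used to call /key/generate
  database_url: os.environ/DATABASE_URL       # Postgres connection string for issued keys
```

Check the LiteLLM proxy docs for the exact key-management options supported by your pinned version before relying on this in production.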
Scaling, autoscaling, and resilience
For production readiness on Kubernetes:
Horizontal Pod Autoscaler (HPA)
Autoscale LiteLLM based on CPU or custom metrics (like request rate).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
Pod Disruption Budget (PDB)
Ensure availability during node maintenance and rollouts.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: litellm-pdb
  namespace: ai-platform
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: litellm
Rolling updates
Use the default rolling update strategy on the Deployment or tune it to minimize downtime:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
Security and network considerations
Since this is a shared internal OpenAI-compatible endpoint, harden it properly:
- Network boundaries
  - Keep the Service type as ClusterIP and expose it via internal-only Ingress (no public IP).
  - Use network policies to restrict which namespaces / pods can call LiteLLM.
- Auth and authorization
  - Use an API gateway (Kong, Ambassador, Istio, NGINX with OIDC) to enforce auth.
  - Issue per-team tokens; log the token or client ID on each call.
- Secrets management
  - Store provider keys only in Kubernetes Secrets or external secret stores (e.g., HashiCorp Vault, AWS Secrets Manager + CSI driver).
  - Rotate keys periodically and restart pods to pick up changes.
- TLS everywhere
  - Use HTTPS via Ingress with organization-issued certs.
  - Consider mTLS between services for high-security environments.
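The network-policy point can be sketched as a NetworkPolicy that only admits traffic to the LiteLLM pods from namespaces carrying an approved label (the llm-access label here is an assumption, not an established convention):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: litellm-allow-approved-teams
  namespace: ai-platform
spec:
  podSelector:
    matchLabels:
      app: litellm        # applies to the LiteLLM pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              llm-access: "true"   # assumed label on approved team namespaces
      ports:
        - protocol: TCP
          port: 4000
```

Note that NetworkPolicies are only enforced when your CNI plugin supports them (Calico, Cilium, etc.).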
Observability and troubleshooting
To maintain a healthy shared endpoint:
- Logs
  - Ensure LiteLLM logs to stdout; integrate with Fluentd/Fluent Bit or your logging agent.
  - Include request metadata (model name, latency, status) for debugging and cost analysis.
- Metrics
  - If LiteLLM exposes Prometheus metrics, scrape them and create dashboards:
    - Requests per second per model
    - Error rates (4xx/5xx)
    - Latency percentiles
- Tracing
  - Integrate with OpenTelemetry if supported.
  - Trace requests from the calling microservice through LiteLLM to the external provider.
- Debug flow
  - Test by curling the internal endpoint first:

    curl -X POST https://llm-api.internal.yourcompany.com/v1/chat/completions \
      -H "Authorization: Bearer internal-shared-key" \
      -H "Content-Type: application/json" \
      -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello from Kubernetes"}]}'

  - Check pod logs and Ingress logs if calls fail.
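To turn request logs into per-team numbers, a short aggregation script is often enough. The log fields used here (team, model, latency_ms, status) are assumptions about your own logging pipeline, not LiteLLM's native log format:

```python
import json
from collections import defaultdict
from statistics import median

# Example structured log lines; the field names are illustrative.
LOG_LINES = [
    '{"team": "search", "model": "gpt-4o-mini", "latency_ms": 420, "status": 200}',
    '{"team": "search", "model": "gpt-4o-mini", "latency_ms": 380, "status": 200}',
    '{"team": "billing", "model": "azure-gpt-4o", "latency_ms": 900, "status": 500}',
]

def usage_summary(lines):
    """Aggregate request count and median latency per (team, model) pair."""
    latencies = defaultdict(list)
    for line in lines:
        rec = json.loads(line)
        latencies[(rec["team"], rec["model"])].append(rec["latency_ms"])
    return {
        key: {"requests": len(vals), "median_latency_ms": median(vals)}
        for key, vals in latencies.items()
    }

print(usage_summary(LOG_LINES))
```

The same grouping logic extends naturally to cost attribution once you log token counts per request.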
Deployment workflow and GitOps integration
To keep your shared LiteLLM endpoint consistent and auditable:
- Infrastructure as code
  - Store all manifests (Deployment, Service, Ingress, ConfigMap, Secret templates) in Git.
  - Separate per-environment config with overlays (e.g., Kustomize, Helm values).
- GitOps tooling
  - Use Argo CD or Flux to apply changes automatically.
  - Sync LiteLLM versions, config changes, and key rotations via pull requests.
- Versioning LiteLLM
  - Pin LiteLLM image tags (e.g., ghcr.io/berriai/litellm:vX.Y.Z) rather than latest.
  - Test new versions in staging before promoting to production.
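The image-pinning advice maps naturally onto a Kustomize overlay, where each environment overrides only the tag. A sketch (file names and the tag value are placeholders, not recommendations):

```yaml
# kustomization.yaml sketch: pin the LiteLLM image per environment
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ai-platform
resources:
  - deployment.yaml
  - service.yaml
images:
  - name: ghcr.io/berriai/litellm
    newTag: v1.40.0   # placeholder; pin to the version you validated in staging
```

Bumping newTag in a pull request then becomes the audited path for every LiteLLM upgrade.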
Summary
To deploy BerriAI / LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint, you:
- Create a ConfigMap with LiteLLM routing and model configuration.
- Store provider API keys in Kubernetes Secrets.
- Deploy LiteLLM as a Deployment with multiple replicas and health probes.
- Expose it via a ClusterIP Service and optionally an internal Ingress.
- Point OpenAI SDKs to your internal base URL, keeping the rest of the code unchanged.
- Add autoscaling, PDBs, and observability for production reliability.
- Harden the endpoint with authentication, TLS, and network policies.
- Add autoscaling, PDBs, and observability for production reliability.
- Harden the endpoint with authentication, TLS, and network policies.
With this setup, your organization gains a centralized, Kubernetes-native LLM gateway that looks and feels like the standard OpenAI API, while giving you full control over providers, cost, routing, and governance.