How do we deploy BerriAI / LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint?

Deploying BerriAI’s LiteLLM OSS on Kubernetes is a practical way to give your organization a shared, internal OpenAI-compatible endpoint that can route to multiple providers while enforcing consistent policies, logging, and cost controls. This guide walks through the conceptual architecture and gives you concrete Kubernetes manifests and configuration patterns you can adapt to your cluster.

What is LiteLLM OSS and why run it on Kubernetes?

LiteLLM OSS (from BerriAI) is an open-source proxy that:

Exposes an OpenAI-compatible API (e.g., /v1/chat/completions, /v1/completions, /v1/embeddings)
Routes requests to multiple LLM providers (OpenAI, Azure OpenAI, Anthropic, Google, Ollama, local models, etc.)
Adds features like:
- Global and per-model rate limiting
- Cost tracking and logging
- Key management and secret rotation
- Routing rules, fallbacks, and load balancing across providers

Running LiteLLM on Kubernetes gives you:

A shared internal endpoint all teams can call instead of hitting external APIs directly
Centralized configuration and governance for LLM usage
Easier scaling, resilience, and observability
A clean separation between application code and provider-specific details

High-level architecture

In a typical Kubernetes-based internal deployment:

LiteLLM Pod(s)
- Runs as a Deployment (or StatefulSet if you need persistent features later)
- Uses a configuration file (litellm.yaml or .env) mounted via ConfigMap/Secret
- Listens on an HTTP port (e.g., 4000) exposing OpenAI-compatible endpoints
Cluster Networking
- An internal Kubernetes Service (ClusterIP) exposes the LiteLLM pods
- Optionally an Ingress/Service Mesh (Istio, Linkerd, Nginx Ingress, etc.) provides:
  - Internal routing
  - TLS termination
  - mTLS and access policies
Secrets Management
- Provider API keys (e.g., OPENAI_API_KEY) stored in Kubernetes Secrets
- Referenced as environment variables or via files in the LiteLLM container
Clients (internal apps/services)
- Use the OpenAI SDK or HTTP API
- Point base_url (or api_base) to the internal LiteLLM endpoint instead of the public OpenAI URL
- Use an internal “proxy key” or token for auth (e.g., via an API gateway, custom auth middleware, mTLS, or Kubernetes network policies)

Prerequisites

Before deploying LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint, you should have:

A working Kubernetes cluster (e.g., EKS, GKE, AKS, k3s, Minikube)
kubectl configured to talk to the cluster
Optional but recommended:
- A namespace dedicated to AI/LLM infra (e.g., llm-infra)
- A cluster-wide Ingress controller for internal routing
- A secrets management strategy (Kubernetes Secrets, External Secrets, Vault, SOPS, etc.)

You’ll also need:

At least one LLM provider account (e.g., OpenAI)
Provider API key(s)

Step 1: Container image and basic server configuration

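LiteLLM can run as a proxy server from its CLI. The project publishes container images on GitHub Container Registry, and an image reference typically looks like the line below. Check the registry for the tags currently published, and pin a specific version in production rather than relying on a floating tag: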

image: ghcr.io/berriai/litellm:latest

You can run it in server mode with:

litellm --port 4000 --config /app/config/litellm.yaml

In Kubernetes, we’ll embed that as the container command and args or use the default that the image provides, depending on the version. Always verify with the current LiteLLM docs or image entrypoint.

Step 2: Create the LiteLLM configuration (ConfigMap)

LiteLLM supports configuration via YAML. Here’s a minimal example for an internal deployment that routes to OpenAI but exposes OpenAI-compatible endpoints internally:

# litellm.yaml
litellm_settings:
  # Optional: default model if clients omit `model`
  default_model: gpt-4o-mini  

model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o

# Optional: how the router picks among deployments that share a model_name
router_settings:
  routing_strategy: usage-based-routing  # or "simple-shuffle", "least-busy", "latency-based-routing"

general_settings:
  # Where to send logs (stdout by default)
  logging: info
  # Optional: to enable cost tracking & usage logging to a DB, etc.
  # See LiteLLM docs for advanced telemetry configuration.

Create a Kubernetes ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: llm-infra
data:
  litellm.yaml: |
    litellm_settings:
      default_model: gpt-4o-mini

    model_list:
      - model_name: gpt-4o-mini
        litellm_params:
          model: openai/gpt-4o-mini

      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o

    router_settings:
      routing_strategy: usage-based-routing

    general_settings:
      logging: info

Apply it:

kubectl apply -f litellm-config.yaml

Step 3: Store provider API keys in a Secret

For an OpenAI-backed LiteLLM instance:

apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: llm-infra
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-your-real-openai-key"

Apply it:

kubectl apply -f litellm-secrets.yaml

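For multiple providers, add more keys (e.g., ANTHROPIC_API_KEY, AZURE_OPENAI_API_KEY) to the Secret, expose them to the container as environment variables, and reference them in your litellm.yaml under litellm_params, for example with api_key: os.environ/ANTHROPIC_API_KEY.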

Step 4: Create the LiteLLM Deployment

Below is a reference Deployment that:

Runs LiteLLM server on port 4000
Mounts the config from ConfigMap
Injects API key(s) from Secrets
Enables basic resource requests/limits

apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-deployment
  namespace: llm-infra
  labels:
    app: litellm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 4000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: litellm-secrets
                  key: OPENAI_API_KEY
          volumeMounts:
            - name: litellm-config-volume
              mountPath: /app/config
          args:
            - "--port"
            - "4000"
            - "--config"
            - "/app/config/litellm.yaml"
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
      volumes:
        - name: litellm-config-volume
          configMap:
            name: litellm-config
            items:
              - key: litellm.yaml
                path: litellm.yaml

Apply:

kubectl apply -f litellm-deployment.yaml

You can scale replicas up as needed. A HorizontalPodAutoscaler can be added later to scale automatically on CPU utilization or request rate.
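
If you template manifests from code, a minimal sketch of an autoscaling/v2 HorizontalPodAutoscaler for the Deployment above might look like this. The name litellm-hpa, the replica bounds, and the 70% CPU target are illustrative assumptions, not values recommended by LiteLLM. kubectl accepts JSON manifests, so the printed output can be piped to kubectl apply -f -.

```python
import json


def build_hpa(deployment: str, namespace: str,
              min_replicas: int = 2, max_replicas: int = 6,
              cpu_target_percent: int = 70) -> dict:
    """Return an autoscaling/v2 HPA manifest that scales on average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        # "litellm-hpa" is a hypothetical name chosen for this example.
        "metadata": {"name": "litellm-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization",
                               "averageUtilization": cpu_target_percent},
                },
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_hpa("litellm-deployment", "llm-infra"), indent=2))
```

CPU-based scaling depends on a metrics source such as metrics-server and on the CPU requests already set in the Deployment, since utilization is computed relative to the request.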

Step 5: Expose LiteLLM via a Kubernetes Service

For a shared internal OpenAI-compatible endpoint, use a ClusterIP service:

apiVersion: v1
kind: Service
metadata:
  name: litellm-service
  namespace: llm-infra
  labels:
    app: litellm
spec:
  type: ClusterIP
  selector:
    app: litellm
  ports:
    - name: http
      port: 4000
      targetPort: 4000

Apply:

kubectl apply -f litellm-service.yaml

Within the cluster, other services can now call:

Host: litellm-service.llm-infra.svc.cluster.local
Port: 4000

You now have a shared internal OpenAI-compatible endpoint reachable from any workload in the cluster (subject to network policies).
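
In-cluster clients need the Service DNS name turned into a base URL. The helper below is a small sketch of that; the function name cluster_base_url is hypothetical, and it assumes the default cluster.local cluster domain and plain HTTP, because the ClusterIP Service above does not terminate TLS.

```python
def cluster_base_url(service: str, namespace: str, port: int,
                     cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster base URL for an OpenAI-compatible client."""
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    # Plain HTTP: traffic goes straight to the ClusterIP Service, not through an Ingress.
    return f"http://{host}:{port}/v1"


print(cluster_base_url("litellm-service", "llm-infra", 4000))
# prints http://litellm-service.llm-infra.svc.cluster.local:4000/v1
```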

Step 6: Optional Ingress for internal HTTP routing

If you use an Ingress controller for internal traffic, you can create a host like llm.internal.company:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm-ingress
  namespace: llm-infra
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  ingressClassName: nginx
  rules:
    - host: llm.internal.company
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm-service
                port:
                  number: 4000

Once applied, internal clients (e.g., inside VPN / corporate network) can reference:

https://llm.internal.company/v1/chat/completions

The https URL above assumes TLS is terminated at the Ingress, which the manifest shown does not configure. Add a tls section backed by cert-manager or your internal PKI, and restrict access to internal networks only.

Step 7: Configure internal clients to use the shared endpoint

Using the OpenAI Python SDK against the proxy

With LiteLLM exposing OpenAI-compatible routes, the main change is setting the base URL and, optionally, an internal API key:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.company/v1",
    api_key="internal-proxy-key"  # if you implement auth in front of LiteLLM
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello from internal Kubernetes!"}],
)
print(resp.choices[0].message.content)

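If you don't add an auth layer and rely on the network perimeter plus Kubernetes network policies, still pass a placeholder api_key: the OpenAI Python SDK raises an error when the client is constructed if no key is supplied and the OPENAI_API_KEY environment variable is unset.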

Using curl

curl https://llm.internal.company/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer internal-proxy-key" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Test from shared internal endpoint"}
    ]
  }'
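
The same request can be built with Python's standard library, which is handy for smoke tests in images that lack the OpenAI SDK. This is a minimal sketch; the helper names build_chat_request and send are illustrative, and the URL and key are the placeholder values used above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str,
                       model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


req = build_chat_request("https://llm.internal.company/v1", "internal-proxy-key",
                         "gpt-4o-mini", "Test from shared internal endpoint")
# Requires network access to the endpoint:
# reply = send(req)["choices"][0]["message"]["content"]
```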

Step 8: Adding routing, multi-provider support, and policies

To fully leverage LiteLLM as a shared internal OpenAI-compatible endpoint, update litellm.yaml with:

Multiple providers

model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o

  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022

  - model_name: gemini-1.5-pro
    litellm_params:
      model: gemini/gemini-1.5-pro

Add corresponding environment variables:

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: OPENAI_API_KEY
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: ANTHROPIC_API_KEY
  - name: GEMINI_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: GEMINI_API_KEY
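
A missing key usually surfaces only when the first request for that provider fails, so a preflight check can help. The sketch below compares the provider prefixes in a parsed model_list with the environment. The function name and the prefix-to-variable table are assumptions that cover only the providers in this guide; adjust them to the prefixes in your own config.

```python
import os

# Provider prefix -> environment variable. Limited to this guide's providers.
REQUIRED_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def missing_provider_keys(model_list, environ=None):
    """Return sorted env var names that model_list needs but environ lacks."""
    env = os.environ if environ is None else environ
    missing = set()
    for entry in model_list:
        # "openai/gpt-4o" -> "openai"
        provider = entry["litellm_params"]["model"].split("/", 1)[0]
        var = REQUIRED_ENV.get(provider)
        if var and not env.get(var):
            missing.add(var)
    return sorted(missing)


models = [
    {"model_name": "gpt-4o", "litellm_params": {"model": "openai/gpt-4o"}},
    {"model_name": "claude-3-5-sonnet",
     "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20241022"}},
]
print(missing_provider_keys(models, {"OPENAI_API_KEY": "sk-example"}))
# prints ['ANTHROPIC_API_KEY']
```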

Simple routing and cost control

You can configure LiteLLM to route certain models or tenants to cheaper options or local models, as well as track usage and costs. For example:

router_settings:
  routing_strategy: usage-based-routing

litellm_settings:
  max_budget: 100.0      # proxy-wide cost cap in USD, example only
  budget_duration: 30d   # window after which tracked spend resets

Check the LiteLLM OSS documentation for up-to-date options around:

Budgeting
Per-key or per-team limits
Fallback models if a provider fails
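
To make the fallback idea concrete, here is a conceptual sketch of trying models in order until one succeeds. It illustrates the pattern only and is not LiteLLM's implementation; call_model and fake_call are hypothetical stand-ins for a real provider call.

```python
from typing import Callable, Sequence


def complete_with_fallbacks(call_model: Callable[[str], str],
                            models: Sequence[str]) -> tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model)
        except Exception as exc:  # a real proxy would retry only on retryable errors
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call(model: str) -> str:
    # Stand-in for a provider call; simulates an outage on the primary model.
    if model == "gpt-4o":
        raise TimeoutError("provider timeout")
    return f"reply from {model}"


print(complete_with_fallbacks(fake_call, ["gpt-4o", "claude-3-5-sonnet"]))
# prints ('claude-3-5-sonnet', 'reply from claude-3-5-sonnet')
```

Doing this inside the proxy rather than in each client keeps the fallback order in one config file, which is the point of centralizing it.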

Step 9: Security and access control

To safely expose this shared internal endpoint:

Network Policies
- Restrict which namespaces/pods can talk to litellm-service
- Example: only allow traffic from specific application namespaces
Ingress / Gateway auth
- Use API keys, OAuth tokens, or mTLS at the gateway level
- Inject a validated identity (e.g., X-User-ID header) that LiteLLM or your sidecar can use for logging and quotas
Secrets management
- Do not hardcode API keys in ConfigMaps
- Rotate keys via your secret manager
- Limit RBAC so only the LiteLLM service account can read the relevant Secrets
Audit and logging
- Configure LiteLLM logs to go to a centralized log system (e.g., Loki, ELK)
- For regulated environments, log which internal caller used which model and prompt metadata (with appropriate privacy handling)
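
As an illustration of the network policy item, the sketch below generates a NetworkPolicy that admits traffic to the LiteLLM pods only from named namespaces. The policy name litellm-allow-clients and the namespaces team-a and team-b are hypothetical. It relies on the kubernetes.io/metadata.name label that Kubernetes sets on namespaces, and it only takes effect if your CNI plugin enforces NetworkPolicy.

```python
import json


def build_network_policy(namespace: str, allowed_namespaces: list[str],
                         port: int = 4000) -> dict:
    """Return a NetworkPolicy allowing ingress to app=litellm pods from given namespaces."""
    if not allowed_namespaces:
        # An empty "from" list would match every source, which defeats the policy.
        raise ValueError("allowed_namespaces must not be empty")
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "litellm-allow-clients", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": "litellm"}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [
                    {"namespaceSelector": {
                        "matchLabels": {"kubernetes.io/metadata.name": ns}}}
                    for ns in allowed_namespaces
                ],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


if __name__ == "__main__":
    policy = build_network_policy("llm-infra", ["team-a", "team-b"])
    print(json.dumps(policy, indent=2))
```

If clients reach LiteLLM through an Ingress controller, that controller's namespace needs to be in the allowed list as well.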

Step 10: Observability, scaling, and resilience

To keep the shared internal endpoint reliable:

Autoscaling
- Add an HPA based on CPU, RPS, or custom metrics
Readiness and liveness probes
- Configure probes on LiteLLM's health endpoints so pods receive traffic only when healthy (recent versions expose /health/liveliness and /health/readiness; confirm the paths for the version you run)
Metrics & dashboards
- Use Prometheus/Grafana or your APM to track:
  - Request rate per model
  - Latency / error rate
  - Cost per provider (if exported by LiteLLM)
Deployment strategy
- Use rolling updates with maxUnavailable and maxSurge tuned for your availability needs
- Keep staging/testing namespaces where you can roll out new LiteLLM configs or versions before promoting to production
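
The per-model figures listed under metrics reduce to a simple aggregation. The sketch below shows it over an in-memory list; the (model, latency_seconds, ok) record shape and the summarize function are assumptions made for illustration, and in practice Prometheus or your APM would compute these from exported metrics or logs.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate (model, latency_seconds, ok) tuples into per-model statistics."""
    by_model = defaultdict(list)
    for model, latency, ok in records:
        by_model[model].append((latency, ok))

    summary = {}
    for model, rows in by_model.items():
        errors = sum(1 for _, ok in rows if not ok)
        summary[model] = {
            "requests": len(rows),
            "error_rate": errors / len(rows),
            "max_latency_s": max(latency for latency, _ in rows),
        }
    return summary


records = [
    ("gpt-4o-mini", 0.4, True),
    ("gpt-4o-mini", 1.2, False),
    ("gpt-4o", 2.0, True),
]
for model, stats in summarize(records).items():
    print(model, stats)
```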

Putting it all together

By deploying LiteLLM OSS on Kubernetes, you centralize LLM access behind a single, shared internal OpenAI-compatible endpoint. The workflow typically looks like:

Developers use standard OpenAI SDKs and APIs.
The SDKs point to an internal base URL (e.g., https://llm.internal.company/v1).
LiteLLM, running on Kubernetes, routes requests to multiple providers based on a config you control.
You enforce security, logging, cost limits, and model routing policy in one place instead of every application.

This pattern makes it easier to:

Swap or add providers without changing application code
Maintain consistent security and compliance controls
Monitor and optimize your organization’s LLM usage centrally

As you iterate, you can expand the deployment for more advanced needs (tenant-aware routing, fine-grained quotas per team, local GPU-backed models, or detailed observability) while preserving the same simple OpenAI-compatible interface for every internal consumer.