How do we deploy BerriAI / LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint?

Deploying BerriAI / LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint is an effective way to centralize model access, standardize configuration, and control costs across your organization. This guide walks through the architecture, configuration, and deployment patterns you can use to expose LiteLLM as a private, OpenAI-style API inside your cluster.


Why run BerriAI / LiteLLM OSS on Kubernetes?

LiteLLM acts as a unified proxy that exposes an OpenAI-compatible API while routing requests to many different providers (OpenAI, Azure OpenAI, Anthropic, Google, local models, etc.). Running it on Kubernetes as a shared internal endpoint gives you:

  • One consistent endpoint for all teams, e.g. http://litellm.ai-platform.svc.cluster.local:4000/v1/chat/completions
  • Centralized key and provider management via environment variables, secrets, or config files
  • Network isolation: keep LLM traffic inside your cluster or behind VPN / SSO
  • Scalability & resilience: replicas, autoscaling, pod disruption budgets, rolling updates
  • Governance and observability: shared logging, metrics, and access controls

The goal: developers can use standard OpenAI SDKs (Python, JS, etc.) pointing to your internal LiteLLM endpoint, without changing their code patterns.


High-level deployment architecture

At a high level, a Kubernetes deployment for LiteLLM OSS looks like this:

  1. Deployment

    • Container: ghcr.io/berriai/litellm:main-latest (pin a specific release for production)
    • Command runs the LiteLLM server (FastAPI/HTTP)
    • Config passed via config file + environment variables
  2. Service

    • Type: ClusterIP for internal-only use, or LoadBalancer/Ingress for external access
    • Port: LiteLLM defaults to 4000 → internal http://litellm:4000
  3. Config & Secrets

    • ConfigMap for LiteLLM configuration (models, routing, logging)
    • Secret for API keys to external LLM providers
    • Environment variables for optional tuning or per-namespace overrides
  4. Ingress / API Gateway (optional)

    • For internal DNS like https://llm-api.internal.yourcompany.com
    • TLS termination and auth (OIDC, mTLS, etc.)
  5. Observability

    • Logs to stdout, collected by your existing logging stack
    • Optional metrics endpoint scraped by Prometheus, visualized in Grafana

Preparing LiteLLM configuration

LiteLLM is configured via a YAML file. The proxy serves the OpenAI-style routes (/v1/chat/completions, /v1/completions, etc.) by default, so to expose a shared internal endpoint you typically:

  • Define provider-specific models mapped to standardized names
  • Optionally set rate limits, routing rules, and logging

Example LiteLLM config (config.yaml)

This example assumes you use OpenAI and Azure OpenAI, plus a local model via Ollama:

model_list:
  - model_name: gpt-4-turbo
    litellm_params:
      model: openai/gpt-4-turbo          # provider is given by the model prefix
      api_key: os.environ/OPENAI_API_KEY # read from an env var at runtime

  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: azure-gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_key: os.environ/AZURE_OPENAI_API_KEY
      api_base: os.environ/AZURE_OPENAI_ENDPOINT
      api_version: "2024-02-15-preview"

  - model_name: local-llama3
    litellm_params:
      model: ollama/llama3               # Ollama needs no API key
      api_base: http://ollama.ollama.svc.cluster.local:11434

litellm_settings:
  drop_params: true   # drop request params a given provider does not support
  telemetry: False    # disable anonymous usage telemetry

router_settings:
  routing_strategy: simple-shuffle   # the default; others include least-busy and usage-based-routing

general_settings:
  # Optionally require a shared key on every request (sent as a Bearer token):
  # master_key: os.environ/LITELLM_MASTER_KEY

Place this file into a ConfigMap so Kubernetes can mount it into the LiteLLM container.


Creating Kubernetes manifests

Below is a Kubernetes manifest set you can adapt: ConfigMap, Secret, Deployment, and Service. Adjust namespace, image version, and domains as needed.

1. ConfigMap for LiteLLM config

apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: ai-platform
data:
  config.yaml: |
    model_list:
      - model_name: gpt-4-turbo
        litellm_params:
          model: openai/gpt-4-turbo
          api_key: os.environ/OPENAI_API_KEY

      - model_name: gpt-4o-mini
        litellm_params:
          model: openai/gpt-4o-mini
          api_key: os.environ/OPENAI_API_KEY

      - model_name: azure-gpt-4o
        litellm_params:
          model: azure/gpt-4o
          api_key: os.environ/AZURE_OPENAI_API_KEY
          api_base: os.environ/AZURE_OPENAI_ENDPOINT
          api_version: "2024-02-15-preview"

    litellm_settings:
      drop_params: true
      telemetry: False

    router_settings:
      routing_strategy: simple-shuffle

2. Secret for provider API keys

Store provider keys in a Kubernetes Secret. Values must be base64-encoded.

apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: ai-platform
type: Opaque
data:
  OPENAI_API_KEY: <base64-openai-key>
  AZURE_OPENAI_API_KEY: <base64-azure-openai-key>
  AZURE_OPENAI_ENDPOINT: <base64-azure-endpoint-url>

Encode a value:

echo -n 'sk-xxx' | base64
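If you prefer not to base64-encode by hand, Kubernetes also accepts plain-text values under stringData; the API server encodes them on write. An equivalent Secret (values shown are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: litellm-secrets
  namespace: ai-platform
type: Opaque
stringData:                 # plain-text values; stored base64-encoded
  OPENAI_API_KEY: sk-xxx
  AZURE_OPENAI_API_KEY: your-azure-key
  AZURE_OPENAI_ENDPOINT: https://your-resource.openai.azure.com
```

Keep either form out of Git, or template it via your secret-management tooling.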

3. Deployment for BerriAI / LiteLLM OSS

apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
  namespace: ai-platform
  labels:
    app: litellm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest  # pin a specific release in production
          imagePullPolicy: IfNotPresent
          args:
            # The image entrypoint already runs the LiteLLM proxy,
            # so only its flags are passed here
            - "--config"
            - "/config/config.yaml"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "4000"
          ports:
            - containerPort: 4000
              name: http
          envFrom:
            - secretRef:
                name: litellm-secrets
          volumeMounts:
            - name: config-volume
              mountPath: /config
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: http
            initialDelaySeconds: 10
            periodSeconds: 20
      volumes:
        - name: config-volume
          configMap:
            name: litellm-config
            items:
              - key: config.yaml
                path: config.yaml

Key notes:

  • The image entrypoint starts the LiteLLM proxy, which serves the OpenAI-compatible /v1 routes.
  • --config /config/config.yaml points to your mounted ConfigMap.
  • The probes use LiteLLM's lightweight /health/liveliness and /health/readiness routes; the plain /health route checks upstream providers (slower, and may require auth), so verify route names against the current LiteLLM docs.

4. Service for internal access

Expose LiteLLM internally with a ClusterIP Service.

apiVersion: v1
kind: Service
metadata:
  name: litellm
  namespace: ai-platform
  labels:
    app: litellm
spec:
  type: ClusterIP
  selector:
    app: litellm
  ports:
    - name: http
      port: 4000
      targetPort: 4000

Your internal endpoint is now:

http://litellm.ai-platform.svc.cluster.local:4000

And OpenAI-style paths are under /v1, e.g.:

  • /v1/chat/completions
  • /v1/completions
  • /v1/embeddings

Adding Ingress for a shared internal URL

If you want a friendly internal hostname and HTTPS, configure an Ingress. Example for NGINX Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm-ingress
  namespace: ai-platform
spec:
  ingressClassName: nginx
  rules:
    - host: llm-api.internal.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm
                port:
                  number: 4000
  tls:
    - hosts:
        - llm-api.internal.yourcompany.com
      secretName: litellm-tls

Now your shared internal OpenAI-compatible endpoint is:

https://llm-api.internal.yourcompany.com/v1/chat/completions

Using the shared internal OpenAI-compatible endpoint

Once deployed, teams can use standard OpenAI SDKs by overriding the base_url and setting a “dummy” or shared key (or using per-team keys if you enforce auth).

Python example

from openai import OpenAI

client = OpenAI(
    base_url="https://llm-api.internal.yourcompany.com/v1",
    api_key="internal-shared-key"  # or real auth token if enforced
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain k8s in simple terms."}
    ]
)

print(response.choices[0].message.content)

Node.js example

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://llm-api.internal.yourcompany.com/v1",
  apiKey: "internal-shared-key",
});

const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize the benefits of LiteLLM on Kubernetes." },
  ],
});

console.log(completion.choices[0].message.content);

Because LiteLLM exposes OpenAI-compatible routes, the rest of your application logic looks the same as if you were calling OpenAI directly.
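One practical consequence: since every model behind the gateway speaks the same API, client-side fallback between models is a few lines. A minimal sketch (the helper name is ours; the model names and client mirror the examples above):

```python
def chat_with_fallback(client, models, messages):
    """Try each model name in order, returning the first successful response.

    `client` is any OpenAI-SDK-style client pointed at the LiteLLM endpoint;
    `models` is an ordered list of model_name values from the LiteLLM config.
    """
    if not models:
        raise ValueError("no models given")
    last_error = None
    for model in models:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except Exception as err:  # e.g. rate limit or provider outage
            last_error = err
    raise last_error

# Usage (client as in the Python example above):
# reply = chat_with_fallback(client, ["gpt-4o-mini", "azure-gpt-4o"], messages)
```

LiteLLM can also do this fallback server-side via its router configuration, which keeps the policy in one place; the client-side variant is useful when teams want their own ordering.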


Multi-tenant and team-based configuration

To truly share LiteLLM as an internal endpoint across many teams, consider:

  1. Per-team API keys or tokens

    • Use an API gateway, OIDC, or a sidecar to authenticate users/teams.
    • Map tokens → usage quotas, model access policies.
  2. Multiple namespaces or environments

    • ai-platform-dev, ai-platform-prod with separate LiteLLM config and secrets.
    • Use different Ingress hostnames, e.g. llm-dev.internal.yourcompany.com.
  3. Model access control

    • Restrict sensitive models to certain teams via LiteLLM routing rules or request validation middleware.
    • Configure allowlists for specific model_name values per environment.
  4. Cost and usage monitoring

    • Enable LiteLLM logging with request metadata.
    • Ship logs to your observability stack and build per-team usage dashboards.
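The cost-monitoring idea in point 4 amounts to aggregating whatever request logs you ship. A sketch of the aggregation step, assuming each parsed log record carries a team identifier and a spend figure (the field names here are illustrative, not LiteLLM's actual log schema):

```python
from collections import defaultdict


def spend_per_team(records):
    """Sum spend by team across parsed log records (a list of dicts)."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.get("team", "unknown")] += float(rec.get("spend", 0.0))
    return dict(totals)


# Illustrative records, not real LiteLLM log output:
sample = [
    {"team": "search", "model": "gpt-4o-mini", "spend": 0.002},
    {"team": "search", "model": "azure-gpt-4o", "spend": 0.010},
    {"team": "support", "model": "gpt-4o-mini", "spend": 0.001},
]
print(spend_per_team(sample))
```

In practice this runs in your log pipeline (or as a scheduled query in your observability stack) rather than as a standalone script.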

Scaling, autoscaling, and resilience

For production readiness on Kubernetes:

Horizontal Pod Autoscaler (HPA)

Autoscale LiteLLM based on CPU or custom metrics (like request rate).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

Pod Disruption Budget (PDB)

Ensure availability during node maintenance and rollouts.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: litellm-pdb
  namespace: ai-platform
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: litellm

Rolling updates

Use the default rolling update strategy on the Deployment or tune it to minimize downtime:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

Security and network considerations

Since this is a shared internal OpenAI-compatible endpoint, harden it properly:

  1. Network boundaries

    • Keep Service type as ClusterIP and expose via internal-only Ingress (no public IP).
    • Use network policies to restrict which namespaces / pods can call LiteLLM.
  2. Auth and authorization

    • Use an API gateway (Kong, Ambassador, Istio, NGINX with OIDC) to enforce auth.
    • Issue per-team tokens; log token or client ID on each call.
  3. Secrets management

    • Store provider keys only in Kubernetes Secrets or external secret stores (e.g., HashiCorp Vault, AWS Secrets Manager + CSI).
    • Rotate keys periodically and restart pods to pick up changes.
  4. TLS everywhere

    • Use HTTPS via Ingress with organization-issued certs.
    • Consider mTLS between services for high-security environments.
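The network-boundary idea in point 1 can be expressed as a NetworkPolicy. This sketch admits traffic to the LiteLLM pods only from namespaces labeled llm-access: "true" (a label convention we are inventing here) plus the ingress controller's namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: litellm-allow-labeled-namespaces
  namespace: ai-platform
spec:
  podSelector:
    matchLabels:
      app: litellm
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Namespaces you have explicitly granted access
        - namespaceSelector:
            matchLabels:
              llm-access: "true"
        # The ingress controller, so the internal hostname keeps working
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 4000
```

Note that NetworkPolicy only takes effect if your cluster's CNI plugin enforces it (Calico, Cilium, etc.).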

Observability and troubleshooting

To maintain a healthy shared endpoint:

  • Logs

    • Ensure LiteLLM logs to stdout; integrate with Fluentd/Fluent Bit or your logging agent.
    • Include request metadata (model name, latency, status) for debugging and cost analysis.
  • Metrics

    • If LiteLLM exposes Prometheus metrics, scrape them and create dashboards:
      • Requests per second per model
      • Error rates (4xx/5xx)
      • Latency percentiles
  • Tracing

    • Integrate with OpenTelemetry if supported.
    • Trace requests from the calling microservice through LiteLLM to the external provider.
  • Debug flow

    • Test by curling the internal endpoint first:
      curl -X POST \
        https://llm-api.internal.yourcompany.com/v1/chat/completions \
        -H "Authorization: Bearer internal-shared-key" \
        -H "Content-Type: application/json" \
        -d '{
          "model": "gpt-4o-mini",
          "messages": [
            {"role":"user","content":"Hello from Kubernetes"}
          ]
        }'
      
    • Check pod logs and Ingress logs if calls fail.
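The curl call above can also be scripted as a repeatable smoke test. A stdlib-only sketch (the URL and key are the placeholder values from this guide):

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, content):
    """Build an OpenAI-style chat completion request for the gateway."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Send it (requires network access to the gateway):
# req = build_chat_request("https://llm-api.internal.yourcompany.com",
#                          "internal-shared-key", "gpt-4o-mini",
#                          "Hello from Kubernetes")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Running this from a pod inside the cluster (against the ClusterIP URL) versus from your laptop (against the Ingress URL) helps isolate whether a failure is in the Service, the Ingress, or LiteLLM itself.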

Deployment workflow and GitOps integration

To keep your shared LiteLLM endpoint consistent and auditable:

  1. Infrastructure as code

    • Store all manifests (Deployment, Service, Ingress, ConfigMap, Secret templates) in Git.
    • Separate per-environment config with overlays (e.g., Kustomize, Helm values).
  2. GitOps tooling

    • Use Argo CD or Flux to apply changes automatically.
    • Sync LiteLLM versions, config changes, and key rotations via pull requests.
  3. Versioning LiteLLM

    • Pin LiteLLM image tags (e.g., ghcr.io/berriai/litellm:vX.Y.Z) rather than a floating latest tag.
    • Test new versions in staging before promoting to production.
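The overlay approach in point 1 might look like the following for a prod environment; this is a hypothetical Kustomize layout (directory names and the pinned tag are illustrative):

```yaml
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ai-platform
resources:
  - ../../base          # Deployment, Service, ConfigMap, Ingress
images:
  - name: ghcr.io/berriai/litellm
    newTag: "vX.Y.Z"    # replace with the release you validated in staging
```

A sibling overlays/dev directory would patch the replica count, the Ingress hostname (llm-dev.internal.yourcompany.com), and point at dev secrets, while Argo CD or Flux syncs each overlay to its cluster.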

Summary

To deploy BerriAI / LiteLLM OSS on Kubernetes as a shared internal OpenAI-compatible endpoint, you:

  1. Create a ConfigMap with LiteLLM routing and model configuration.
  2. Store provider API keys in Kubernetes Secrets.
  3. Deploy LiteLLM as a Deployment with multiple replicas and health probes.
  4. Expose it via a ClusterIP Service and optionally an internal Ingress.
  5. Point OpenAI SDKs to your internal base URL, keeping the rest of the code unchanged.
  6. Add autoscaling, PDBs, and observability for production reliability.
  7. Harden the endpoint with authentication, TLS, and network policies.

With this setup, your organization gains a centralized, Kubernetes-native LLM gateway that looks and feels like the standard OpenAI API, while giving you full control over providers, cost, routing, and governance.