How do we send BerriAI / LiteLLM metrics/logs to Datadog or OpenTelemetry/Prometheus and wire alerts to PagerDuty/Slack?

Building observability and alerting into your BerriAI / LiteLLM stack is essential once you move beyond simple prototyping. You’ll want to capture metrics and logs, ship them to a backend like Datadog or an OpenTelemetry/Prometheus stack, and then wire alerts into PagerDuty and Slack so you know when things break or degrade.

This guide walks through practical patterns and configs for each of those paths (Datadog, OpenTelemetry, and Prometheus), then connects the resulting alerts to PagerDuty and Slack.


Core concepts: what to monitor from BerriAI / LiteLLM

Before wiring tools together, decide what you care about:

  • Latency
    • p50/p90/p99 response times
    • Provider-level latency (OpenAI vs Anthropic vs others)
  • Error rates
    • HTTP 4xx/5xx
    • Provider-specific errors (rate limits, timeouts, context-length)
  • Usage and cost
    • Token usage per request
    • Cost per request / per model
    • Requests per second (RPS), per app, per environment
  • Quality and safety signals
    • Hallucination flags, safety filter hits (if you track these)
  • Infrastructure health
    • LiteLLM proxy availability (uptime, restarts)
    • Queue/backlog length if you batch or rate-limit

BerriAI / LiteLLM can expose these metrics via logs or HTTP metrics endpoints, which you can then scrape or forward to Datadog, OpenTelemetry, or Prometheus.
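
To make the signals above concrete, here is a minimal sketch that derives latency percentiles, error rate, and token totals from a batch of request records. The LLMRequest type and the summarize helper are hypothetical names used only for illustration; they are not part of LiteLLM.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class LLMRequest:
    """Hypothetical per-request record; field names are assumptions."""
    model: str
    provider: str
    latency_s: float
    tokens: int
    ok: bool


def summarize(requests):
    """Return latency percentiles, error rate, and token total for a batch."""
    if len(requests) < 2:
        raise ValueError("need at least two requests to estimate percentiles")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles([r.latency_s for r in requests], n=100, method="inclusive")
    errors = sum(1 for r in requests if not r.ok)
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "error_rate": errors / len(requests),
        "tokens_total": sum(r.tokens for r in requests),
    }
```

In production these aggregates are normally computed by the monitoring backend rather than in application code; the sketch only shows what each number means.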


Option 1: Send BerriAI / LiteLLM metrics/logs to Datadog

Datadog is often the easiest path for teams already using it for application monitoring.

1. Instrument LiteLLM proxy with Datadog

If you’re running litellm as a proxy service, you can:

  1. Enable structured logging (JSON) so Datadog can parse requests/responses.
  2. Expose Prometheus-style metrics and let the Datadog Agent scrape them.
  3. Use Datadog’s OpenTelemetry support and send OTLP metrics directly.

Example: Enable JSON logs from LiteLLM

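In your LiteLLM proxy config (key names vary between LiteLLM versions, so confirm them against the docs for the release you run):

litellm_settings:
  json_logs: true   # emit proxy logs as JSON, one object per line

Set the log level with the LITELLM_LOG environment variable (for example, LITELLM_LOG=INFO). Whether token counts and cost appear in each log entry depends on your LiteLLM version and the logging callbacks you enable.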

Then configure your container/logger (e.g., Docker, Kubernetes) to send stdout/stderr to Datadog Logs:

  • On Kubernetes, run the Datadog Agent with log collection enabled (datadog.logs.enabled: true in the Helm chart values).
  • On VMs, install the Datadog Agent and configure a log source:
logs:
  - type: file
    path: /var/log/litellm/*.log
    service: litellm
    source: python
    # No multi_line rule here: JSON logs are one object per line, and a
    # date-prefix pattern would never match lines that start with "{",
    # so every line would be merged into the previous entry.

Each request log can include:

  • model
  • provider
  • latency_ms
  • tokens_prompt
  • tokens_completion
  • total_cost_usd
  • status (success, error, rate_limited, timeout, etc.)

Use these as facets and measures for dashboards and alert queries.
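
If your application wraps LiteLLM and you want to emit these fields yourself, the following sketch writes one JSON object per line using only the standard library. The JsonRequestFormatter and build_logger names are hypothetical, and the field names simply mirror the list above.

```python
import json
import logging


class JsonRequestFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    FIELDS = ("model", "provider", "latency_ms", "tokens_prompt",
              "tokens_completion", "total_cost_usd", "status")

    def format(self, record):
        payload = {"message": record.getMessage(), "level": record.levelname}
        # Fields passed via `extra=` become attributes on the record.
        for name in self.FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger that writes JSON lines to `stream` (e.g. sys.stdout)."""
    logger = logging.getLogger("llm_requests")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonRequestFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as log.info("llm_request", extra={"model": "gpt-4o", "status": "success"}) then produces a line the Datadog log pipeline can parse into attributes.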

Option A: Scrape Prometheus metrics into Datadog

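If your LiteLLM deployment exposes /metrics in Prometheus format (in the LiteLLM proxy this is enabled through the Prometheus callback in litellm_settings; the flags and port below are illustrative, so check litellm --help for your version):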

litellm --host 0.0.0.0 --port 4000 --metrics-port 9090

Configure the Datadog Agent's OpenMetrics check (conf.d/openmetrics.d/conf.yaml) to scrape the endpoint:

init_config:

instances:
  - openmetrics_endpoint: http://litellm:9090/metrics
    namespace: litellm
    # Entries may be exact metric names or regular expressions.
    metrics:
      - litellm_requests_total
      - litellm_request_latency_seconds
      - litellm_tokens_total
    tags:
      - service:litellm

Datadog will then convert those metrics into its own time series, usable in monitors.

Option B: Use OpenTelemetry → Datadog

If you already instrument BerriAI / LiteLLM with OpenTelemetry SDKs, configure the Datadog Agent as an OTLP receiver:

otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

Then set the OTLP endpoint in your LiteLLM/SDK config:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317
export OTEL_SERVICE_NAME=litellm

This approach unifies metrics, traces, and logs under the same telemetry pipeline.


2. Creating Datadog alerts and wiring them to PagerDuty/Slack

Once metrics appear in Datadog, you can define monitors and connect them to PagerDuty and Slack.

Common LiteLLM / BerriAI alert patterns

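Use Datadog queries such as the following. The exact metric names depend on how the metrics were ingested (for example, the Agent may add a namespace prefix or a .count suffix), so check the Metrics Explorer and adjust: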

  • High error rate
sum:litellm_requests{status:error}.as_count()
/
sum:litellm_requests{*}.as_count()

Alert when error ratio exceeds, say, 5% for 5 minutes.

  • Latency SLO breaches
p95:litellm_request_latency_seconds{*} by {model}

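Alert if p95 latency > X seconds for model gpt-4o in production. Percentile aggregators such as p95: require the latency metric to be a Datadog distribution metric with percentiles enabled.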

  • Cost spike
sum:litellm_cost_usd{env:prod}.rollup(sum, 300)

Alert when 5-minute spend exceeds expected baseline.

  • RPS anomaly
sum:litellm_requests{env:prod}.as_rate()

Use Datadog anomaly detection to alert on sudden traffic drops or spikes.
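
The "error ratio above 5% for 5 minutes" condition can be reasoned about with a small sketch. This is a simplified model, not Datadog's evaluation logic: Datadog aggregates the query over the evaluation window, whereas the hypothetical breaches_for helper below checks that every sample in the window is above the threshold (closer to a Prometheus "for" clause).

```python
def breaches_for(samples, threshold, window_s):
    """samples: list of (timestamp_s, errors, total), sorted by time.

    Returns True when the samples cover at least `window_s` seconds and
    every sample in that trailing window has an error ratio above
    `threshold`.
    """
    if not samples:
        return False
    end = samples[-1][0]
    recent = [s for s in samples if s[0] >= end - window_s]
    # Require data covering the whole window, not just the latest point.
    if end - recent[0][0] < window_s:
        return False
    return all(total > 0 and errors / total > threshold
               for _, errors, total in recent)
```

Requiring the full window avoids paging on a single bad scrape, at the cost of a slower first alert.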

Connect Datadog to Slack

  1. In Datadog, go to Integrations → Slack.
  2. Install the Slack app and authorize it in your workspace.
  3. Map Datadog to a channel (e.g., #llm-alerts).
  4. In each Monitor, set Notify to @slack-<channel-name>:
@slack-llm-alerts

You can also use Slack templates to show key details (model, env, error code).

Connect Datadog to PagerDuty

  1. In PagerDuty, create a Service for litellm or llm-platform.
  2. Under Integrations, add a new Datadog integration; copy the integration key.
  3. In Datadog, go to Integrations → PagerDuty, add the key as a service.
  4. In your monitor, add @pagerduty-<service-name> to the notification message.

Example monitor message:

LLM error rate is above 5% in prod for 5 min.

Query: {{query}}
Current value: {{value}}

@pagerduty-llm-platform @slack-llm-alerts

Option 2: Send BerriAI / LiteLLM telemetry via OpenTelemetry

OpenTelemetry gives you vendor-neutral metrics, logs, and traces. From there, you can send them to Prometheus, Datadog, or other backends.

1. Instrument BerriAI / LiteLLM with OpenTelemetry

If your app is Python or Node-based and wraps LiteLLM or a BerriAI service, you can:

  • Use OpenTelemetry auto-instrumentation (HTTP, gRPC)
  • Create custom metrics and spans around LLM calls

Example: Python OpenTelemetry metrics for LiteLLM

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

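# Identify this service on every exported metric.
resource = Resource.create({"service.name": "litellm-proxy"})
# Export over OTLP/gRPC to the Collector. insecure=True disables TLS,
# so use it only on a trusted network.
exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
# Push accumulated metrics on a fixed interval (60 seconds by default).
reader = PeriodicExportingMetricReader(exporter)

meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter("litellm")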

request_counter = meter.create_counter(
    "litellm_requests",
    unit="1",
    description="Number of LLM requests",
)

latency_hist = meter.create_histogram(
    "litellm_request_latency_seconds",
    unit="s",
    description="LLM request latency",
)

tokens_counter = meter.create_counter(
    "litellm_tokens",
    unit="tokens",
    description="Tokens used by LLM requests",
)

def record_llm_request(model, provider, latency, tokens_prompt, tokens_completion, status):
    """Record one finished LLM call; latency is in seconds."""
    attributes = {"model": model, "provider": provider}
    request_counter.add(1, {**attributes, "status": status})
    latency_hist.record(latency, attributes)
    tokens_counter.add(tokens_prompt + tokens_completion, attributes)

Call record_llm_request whenever your app finishes an LLM call.
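
One way to make sure the recorder runs on both success and failure is a small context manager. This is a sketch: timed_llm_call is a hypothetical helper, and the recorder is passed in as a callable with the same argument order as record_llm_request above, so the snippet stays independent of the OpenTelemetry setup.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_llm_call(record, model, provider):
    """Time the wrapped block and report it through `record`."""
    usage = {"tokens_prompt": 0, "tokens_completion": 0}
    start = time.perf_counter()
    status = "success"
    try:
        yield usage  # the caller fills in token counts from the response
    except Exception:
        status = "error"
        raise
    finally:
        # Runs on success and on failure, so errors are counted too.
        record(model, provider, time.perf_counter() - start,
               usage["tokens_prompt"], usage["tokens_completion"], status)
```

Usage would look like: with timed_llm_call(record_llm_request, "gpt-4o", "openai") as usage: ..., setting usage["tokens_prompt"] and usage["tokens_completion"] from the provider response inside the block.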

2. Use the OpenTelemetry Collector as a hub

Deploy an OpenTelemetry Collector to receive signals and export to your backend(s). The Prometheus and Datadog exporters used here ship with the Collector's contrib distribution:

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:
    verbosity: basic
  prometheus:
    endpoint: "0.0.0.0:9464"
  datadog:
    api:
      key: "${env:DATADOG_API_KEY}"
      site: "datadoghq.com"

processors:
  batch:

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, datadog]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]

Benefits:

  • One place to manage exports (Prometheus, Datadog, logging).
  • You can switch or add destinations without changing app code.

Option 3: Send BerriAI / LiteLLM metrics to Prometheus

If you prefer a Prometheus + Alertmanager + Grafana stack, you can scrape metrics from LiteLLM/BerriAI or from the OpenTelemetry Collector.

1. Prometheus scraping of LiteLLM metrics

If LiteLLM exposes /metrics:

scrape_configs:
  - job_name: 'litellm'
    scrape_interval: 15s
    static_configs:
      - targets: ['litellm:9090']
        labels:
          service: 'litellm'
          env: 'prod'

To scrape from the OpenTelemetry Collector instead, target the Collector’s Prometheus exporter:

scrape_configs:
  - job_name: 'otel-collector'
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:9464']

Common metrics to expose:

  • litellm_requests_total{status,model,provider}
  • litellm_request_latency_seconds_bucket
  • litellm_tokens_total{model}
  • litellm_cost_usd_total{model} (if available)
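
For reference, each of these metrics reaches Prometheus as lines in the text exposition format. The sketch below renders a single sample line so the label layout is easy to see; render_sample is a hypothetical helper for illustration, and in a real service a Prometheus client library would produce this output.

```python
def escape_label_value(value):
    # The text format escapes backslash, double quote, and newline.
    return (str(value).replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))


def render_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format."""
    if labels:
        pairs = ",".join(f'{k}="{escape_label_value(v)}"'
                         for k, v in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"
```

Each distinct combination of label values creates a separate time series, which is why high-cardinality labels (such as per-request IDs) should stay out of metrics.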

2. Alerting via Prometheus + Alertmanager

Define alert rules in Prometheus:

groups:
  - name: litellm.rules
    rules:
      - alert: LLMHighErrorRate
        # Aggregate by env so {{ $labels.env }} is populated in the annotations.
        expr: |
          sum by (env) (rate(litellm_requests_total{status="error"}[5m]))
          /
          sum by (env) (rate(litellm_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High LLM error rate in {{ $labels.env }}"
          description: "Error rate > 5% for 10m"

      - alert: LLMLatencyHigh
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, model) (rate(litellm_request_latency_seconds_bucket[5m]))
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LLM latency p95 high for {{ $labels.model }}"
          description: "p95 latency > 5s for 10m"
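
To see why the p95 from histogram_quantile is an estimate that depends on bucket boundaries, here is a simplified Python model of the interpolation. It is a sketch, not Prometheus's implementation: it assumes non-negative bucket bounds (as for latencies), cumulative counts, and a final +Inf bucket.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), ending with +Inf."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan
```

Because the result is interpolated within a bucket, choose bucket boundaries close to your alert threshold (5 seconds here); otherwise the estimate near the threshold is coarse.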

Configure Alertmanager to send alerts to PagerDuty and Slack.


Wiring alerts from Prometheus/Alertmanager to PagerDuty

Use the PagerDuty integration in Alertmanager:

  1. In PagerDuty, create or pick a Service and add a Prometheus or Events API v2 integration.
  2. Copy the Integration Key.
  3. Add this to your alertmanager.yml:
route:
  receiver: 'slack-llm'   # default: anything not matched below goes to Slack
  routes:
    - matchers:
        - severity="page"
      receiver: 'pagerduty'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_INTEGRATION_KEY>'
        severity: '{{ if eq .CommonLabels.severity "page" }}critical{{ else }}error{{ end }}'
        description: '{{ template "pagerduty.default.description" . }}'

  - name: 'slack-llm'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#llm-alerts'
        send_resolved: true
        title: '[LLM] {{ .CommonAnnotations.summary }}'
        text: >-
          {{ .CommonAnnotations.description }}

          Labels: {{ range .CommonLabels.SortedPairs }}{{ .Name }}="{{ .Value }}" {{ end }}

Alerts with severity: "page" will go to PagerDuty; others to Slack.
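
Before relying on the route, it can help to confirm that the integration key works by sending a test event to PagerDuty's Events API v2 directly. The sketch below builds and posts a trigger event with the standard library; build_trigger_event and send_event are hypothetical helper names, and the summary and source values are placeholders.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build an Events API v2 trigger payload."""
    if severity not in {"critical", "error", "warning", "info"}:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def send_event(event, url=PAGERDUTY_EVENTS_URL, timeout=10):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Sending a test event opens a real incident on the service, so use a test service or resolve the incident afterwards.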


Wiring alerts from OpenTelemetry / Datadog to Slack

If you’re using OpenTelemetry but exporting to Datadog, you will still configure alerts in Datadog (as in the earlier section). For a pure Prometheus/OpenTelemetry stack without Datadog:

  • Use Alertmanager → Slack via webhook (shown above).
  • Alternatively, if using Grafana Cloud, define alerts in Grafana and use its Slack contact points.

Key Slack configuration considerations:

  • Use separate channels by severity, for example:
    • #llm-alerts-critical
    • #llm-alerts-warning
  • Include model, provider, env, and region in messages so responders can quickly scope incidents.
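
If you post to Slack from your own tooling (for example, a webhook relay), a small formatter can keep the scoping labels consistent. This is a sketch: format_slack_alert is a hypothetical helper that returns the JSON body for a Slack incoming webhook, with missing labels shown as "unknown" so gaps are visible.

```python
SCOPE_LABELS = ("env", "region", "provider", "model")


def format_slack_alert(summary, labels):
    """Return an incoming-webhook payload with scoping labels up front."""
    scope = " | ".join(f"{name}={labels.get(name, 'unknown')}" for name in SCOPE_LABELS)
    return {"text": f"[LLM] {summary}\n{scope}"}
```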

Recommended patterns for a robust BerriAI / LiteLLM observability stack

To keep things maintainable as usage grows:

  1. Standardize labels/tags
    • Always include: env, service, model, provider, region, team, customer (if multi-tenant).
  2. Separate environments
    • Use different Datadog services / Prometheus labels for dev, staging, prod.
    • Only page PagerDuty for env="prod".
  3. Define SLOs for LLM calls
    • e.g., “99% of litellm requests under 5 seconds, 99.9% successful”.
    • Build Datadog or Prometheus SLOs on top of metrics.
  4. Alert on symptoms, not just causes
    • Latency, error rate, and cost spikes map directly to user experience and budget.
    • Drill into root-cause via provider error codes and traces after alerts fire.
  5. Use traces for request-level debugging
    • With OpenTelemetry, wrap each LLM call in a span:
      • llm.model, llm.provider, llm.tokens_prompt, llm.tokens_completion
    • Export traces to Datadog APM, Jaeger, or Tempo.
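
The first two patterns (standard labels, and paging only for prod) are easy to enforce in code or CI. The sketch below is one possible shape, with hypothetical helper names; the required label set follows the list above, with customer left optional.

```python
REQUIRED_LABELS = ("env", "service", "model", "provider", "region", "team")


def missing_labels(labels):
    """Return the required label names that are absent or empty."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


def notification_targets(labels, severity):
    """Page only for production 'page' alerts; everything else goes to Slack."""
    if severity == "page" and labels.get("env") == "prod":
        return ["pagerduty", "slack"]
    return ["slack"]
```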

Putting it all together: sample architecture

A practical setup for many teams looks like:

  • BerriAI / LiteLLM:
    • Runs as a proxy or SDK inside your app
    • Emits JSON logs + OpenTelemetry metrics/traces
  • OpenTelemetry Collector:
    • Receives OTLP from LiteLLM/app
    • Exports:
      • Metrics → Prometheus (scraped by Prometheus Server)
      • Metrics/Traces/Logs → Datadog (optional)
    • Provides a /metrics endpoint for Prometheus
  • Metrics & alerts:
    • Prometheus + Alertmanager for open-source stack
    • Datadog for unified enterprise monitoring
  • Notifications:
    • PagerDuty for critical, on-call paging
    • Slack for most alerts, warnings, and FYIs

This layout covers both paths, Datadog and OpenTelemetry/Prometheus, with PagerDuty and Slack as the alert destinations, and it stays flexible if your tooling changes: the Collector lets you add or swap backends without touching application code.