How do we send BerriAI / LiteLLM metrics/logs to Datadog or OpenTelemetry/Prometheus and wire alerts to PagerDuty/Slack?

Routing BerriAI / LiteLLM metrics and logs into Datadog or an OpenTelemetry/Prometheus stack—and then wiring alerts into PagerDuty or Slack—gives you full observability and incident response for your AI workloads. This guide walks through the key integration patterns, configurations, and example setups so you can go from raw LiteLLM traffic to actionable alerts.


Core observability concepts for LiteLLM / BerriAI

Before wiring anything, it helps to clarify what you want to capture and where it will flow.

What to monitor from LiteLLM / BerriAI

Typical metrics and logs you’ll want include:

  • Request-level metrics
    • Total requests, success/error counts
    • Latency (p50 / p90 / p99)
    • Tokens in/out, cost per request
  • Model usage
    • Requests per model/provider (e.g., gpt-4, claude-3.5, mistral-*)
    • Rate-limit or quota errors
  • Infra & performance
    • CPU/RAM usage of the LiteLLM/BerriAI server
    • Queue depth, thread/process pool usage
  • Application logs
    • Errors and stack traces
    • Slow queries / slow prompts
    • Structured logs for correlation (request IDs, user IDs, trace IDs)
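Structured logs are what make the correlation fields above usable downstream. As a sketch (the `request_id` and `model` field names are illustrative, not LiteLLM built-ins), a stdlib JSON formatter that Datadog or an OTel Collector can parse field-by-field:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so log backends can index fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation fields attached via `extra=` at call sites
            "request_id": getattr(record, "request_id", None),
            "model": getattr(record, "model", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("litellm.app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("completion ok", extra={"request_id": str(uuid.uuid4()), "model": "gpt-4"})
```

With logs in this shape, a `service:litellm request_id:<id>` search in your log backend pulls up every line for one request.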

Where these signals typically go

You can mix and match:

  • Datadog
    • Metrics: dashboards + monitors
    • Logs: centralized log search & analytics
    • APM/Traces: distributed tracing of requests
    • Alerts: native integrations → PagerDuty, Slack
  • OpenTelemetry (OTel) + Prometheus
    • OTel SDK/exporters instrument your app
    • Prometheus scrapes metrics (or receives them via remote write)
    • Alertmanager sends alerts to PagerDuty / Slack
  • PagerDuty / Slack
    • Incident routing, on-call escalation, and channel notifications

Strategy overview: three common architectures

  1. Direct to Datadog
    • LiteLLM/BerriAI → Datadog metrics/logs → Datadog alerts → PagerDuty/Slack
  2. OpenTelemetry + Prometheus stack
    • LiteLLM/BerriAI → OpenTelemetry (metrics/logs/traces) → Prometheus/Grafana + Alertmanager → PagerDuty/Slack
  3. Hybrid
    • LiteLLM/BerriAI → OTel → Prometheus and Datadog (via OTel exporters) → Alerts in both or either system

The right approach depends on whether your org is already standardized on Datadog, Prometheus, or a dual-stack.


Exporting LiteLLM / BerriAI metrics and logs

Because LiteLLM is a Python-based proxy, you can instrument it using:

  • Built-in configuration options (if you’re using a recent LiteLLM/BerriAI distribution that exposes metrics endpoints)
  • Python logging handlers (for logs)
  • OpenTelemetry Python SDK (for metrics/logs/traces)
  • Sidecar agents (Datadog Agent, OTel Collector) scraping or ingesting metrics/log files

Below are the main choices.


Option 1: Send metrics/logs directly to Datadog

If your primary observability tool is Datadog, this is usually the fastest path.

1. Install and configure the Datadog Agent

On the host(s) running LiteLLM/BerriAI:

DD_API_KEY="<YOUR_DD_API_KEY>" \
DD_SITE="datadoghq.com" \
bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"

Adjust DD_SITE if you’re in EU or another region (datadoghq.eu, etc.).

2. Expose LiteLLM metrics for the Agent

If LiteLLM exposes Prometheus-style metrics (e.g., /metrics endpoint):

  • Configure LiteLLM / BerriAI to enable metrics:
export LITELLM_METRICS=true
export LITELLM_METRICS_PORT=9090  # example
# Run your LiteLLM server as usual

(If the actual env vars differ in your version, map these to the correct settings; the pattern is the same: enable and expose a metrics endpoint.)
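For instance, recent LiteLLM proxy versions enable the Prometheus integration through the proxy config file rather than env vars; a sketch (verify the exact key against your version's docs):

```yaml
# config.yaml for the LiteLLM proxy (key names may vary by version)
litellm_settings:
  callbacks: ["prometheus"]  # exposes Prometheus metrics on the proxy's /metrics endpoint
```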

3. Configure Datadog to scrape Prometheus metrics

Create a file on the host, for example /etc/datadog-agent/conf.d/openmetrics.d/conf.yaml:

init_config:

instances:
  - openmetrics_endpoint: "http://localhost:9090/metrics"
    namespace: "litellm"
    metrics:
      - "*"
    tags:
      - service:litellm
      - env:production

Restart the agent:

sudo systemctl restart datadog-agent

You should now see metrics like litellm_request_count (or similar names you’ve defined) in Datadog.

4. Forward LiteLLM logs to Datadog

If you log to a file, configure the Datadog Agent to tail it.

Example logs config (/etc/datadog-agent/conf.d/litellm_logs.d/conf.yaml):

logs:
  - type: file
    path: /var/log/litellm/*.log
    service: litellm
    source: python
    tags:
      env: production

Enable logs in the agent config (/etc/datadog-agent/datadog.yaml):

logs_enabled: true

Restart the agent again. Your LiteLLM/BerriAI logs will now show up in Datadog Logs.

If you use stdout logs (Docker/Kubernetes), configure the Datadog Agent (or Datadog Cluster Agent) accordingly to collect container logs.
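In Kubernetes, for example, the Datadog Agent's autodiscovery annotations can tag a container's stdout logs directly. A sketch (pod, container, and image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: litellm
  annotations:
    # "litellm" in the annotation key must match the container name below
    ad.datadoghq.com/litellm.logs: '[{"source": "python", "service": "litellm"}]'
spec:
  containers:
    - name: litellm
      image: ghcr.io/berriai/litellm:latest
```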

5. Optionally add Datadog APM tracing

If you want traces (per-request path through your app):

Install Datadog’s Python APM client:

pip install ddtrace

Wrap your LiteLLM/BerriAI app (or the WSGI/ASGI server) with ddtrace-run. For example:

DD_SERVICE=litellm DD_ENV=production DD_TRACE_ENABLED=true \
ddtrace-run python server.py

Then you can define traces for key operations, such as calls to upstream LLM providers, to correlate with metrics and logs.


Option 2: Use OpenTelemetry + Prometheus for metrics/logs/traces

If your org is standardized on OTel and Prometheus, instrument LiteLLM / BerriAI directly using OTel and expose metrics for Prometheus to scrape.

1. Install OpenTelemetry Python SDK and Prometheus exporter

pip install opentelemetry-sdk \
            opentelemetry-exporter-prometheus \
            opentelemetry-instrumentation-logging \
            opentelemetry-api

2. Instrument LiteLLM / BerriAI for OTel metrics

Example minimal setup:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.resources import Resource

from prometheus_client import start_http_server

# Expose /metrics on port 9090
start_http_server(9090)

resource = Resource.create({"service.name": "litellm"})

reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader], resource=resource)
metrics.set_meter_provider(provider)

meter = metrics.get_meter("litellm.metrics")

request_counter = meter.create_counter(
    name="litellm_requests_total",
    description="Total number of LLM requests through LiteLLM",
)

latency_histogram = meter.create_histogram(
    name="litellm_request_latency_seconds",
    description="Latency for LLM requests",
)

def record_request(model_name: str, status: str, latency: float):
    request_counter.add(1, {"model": model_name, "status": status})
    latency_histogram.record(latency, {"model": model_name, "status": status})

Hook record_request into your LiteLLM request lifecycle (before/after the call to the upstream provider).
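One way to do that hookup is a generic wrapper around the upstream call. A sketch (`with_metrics` is illustrative; `record` stands in for the `record_request` helper above):

```python
import time
from typing import Any, Callable

def with_metrics(
    call: Callable[..., Any],
    record: Callable[[str, str, float], None],
    model_name: str,
) -> Callable[..., Any]:
    """Wrap an upstream LLM call so every invocation records status and latency."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        status = "success"
        try:
            return call(*args, **kwargs)
        except Exception:
            status = "error"
            raise  # re-raise so callers still see the failure
        finally:
            # record(model, status, latency_seconds), even on the error path
            record(model_name, status, time.perf_counter() - start)
    return wrapped
```

Wrapping the provider call this way guarantees the error path is counted too, which is what the error-rate alerts later in this guide depend on.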

3. Instrument logs via OTel

import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "litellm"})

logger_provider = LoggerProvider(resource=resource)
exporter = ConsoleLogExporter()  # Replace with an OTLP log exporter in production

logger_provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
set_logger_provider(logger_provider)

# Bridge stdlib logging into OTel; without this handler, records from
# logging.getLogger() never reach the LoggerProvider
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)

logger = logging.getLogger("litellm")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

For production, use the OTLP exporter (e.g., to an OTel Collector), then on to Loki, Elasticsearch, or another log backend.
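As a sketch, a minimal Collector pipeline that receives those OTLP logs and forwards them to Loki might look like this (the exporter choice depends on your backend; the loki exporter ships in the Collector contrib distribution):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [loki]
```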

4. Prometheus configuration

On your Prometheus server, add a scrape job:

scrape_configs:
  - job_name: "litellm"
    static_configs:
      - targets: ["litellm-hostname:9090"]
        labels:
          service: "litellm"
          env: "production"

After reloading Prometheus, you should see litellm_requests_total, litellm_request_latency_seconds, and any other OTel-exported metrics.
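A couple of PromQL queries to sanity-check the scrape (assuming the metric names from the instrumentation above):

```promql
# Requests per second, by model, over the last 5 minutes
sum(rate(litellm_requests_total[5m])) by (model)

# Overall error ratio over the last 5 minutes
sum(rate(litellm_requests_total{status="error"}[5m]))
  / sum(rate(litellm_requests_total[5m]))
```

These same expressions reappear in the alert rules below, so verifying them interactively first saves a debugging round-trip.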


Setting up alerts to PagerDuty and Slack

Once metrics/logs are in Datadog or Prometheus, the next step is alerting.

Alerts with Datadog → PagerDuty / Slack

1. Create Datadog monitors

Typical monitor ideas:

  • Error rate:
sum:litellm_requests_total{status:error}.as_count() /
sum:litellm_requests_total.as_count() > 0.05

Trigger if the error rate exceeds 5% over, say, the last 5 minutes (in a Datadog metric monitor, express the division as a formula over two queries, a / b).

  • Latency:
avg:litellm_request_latency_seconds{*}.rollup(avg, 300) > 2

Trigger if average latency > 2 seconds over 5 minutes.

  • Requests per model:
sum:litellm_requests_total{model:gpt-4}.as_count() > 1000

Set thresholds as needed (for cost, abnormal spikes, etc.).

Create these monitors under Monitors → New Monitor → Metric in Datadog.

2. Datadog → PagerDuty integration

In Datadog:

  1. Go to Integrations → Integrations.
  2. Search for PagerDuty and install.
  3. Connect to your PagerDuty account and map Datadog monitors to PagerDuty services.
  4. In each monitor's notification message, add the PagerDuty service handle under Notify your team (e.g. @pagerduty-<service-name>).

Now when your LiteLLM metrics breach a threshold, incidents are created in PagerDuty.

3. Datadog → Slack integration

  1. In Datadog, go to Integrations → Slack.
  2. Install and authorize the Slack app.
  3. Choose which Slack channels to connect.
  4. In each monitor, under Notify your team, add @slack-<channel-name>.

Whenever an alert triggers, Datadog will send messages to the chosen Slack channels.
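Inside a monitor's message body you can combine both handles with Datadog's template conditionals. A sketch (the service and channel names are illustrative):

```
{{#is_alert}}
LiteLLM error rate breached on {{host.name}}.
@pagerduty-litellm @slack-litellm-alerts
{{/is_alert}}
{{#is_recovery}}
LiteLLM error rate recovered.
@slack-litellm-alerts
{{/is_recovery}}
```

Pages only fire on alert, while Slack also hears about recovery, which keeps the channel useful without double-paging on resolve.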


Alerts with Prometheus + Alertmanager → PagerDuty / Slack

If you’re using Prometheus with Alertmanager, alerts can be based on your OTel/Prometheus metrics.

1. Example alert rules

Create a rules file, e.g., litellm_rules.yml:

groups:
  - name: litellm-alerts
    rules:
      - alert: LiteLLMHighErrorRate
        expr: >
          sum(rate(litellm_requests_total{status="error"}[5m])) /
          sum(rate(litellm_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          service: litellm
        annotations:
          summary: "LiteLLM high error rate"
          description: "Error rate is above 5% for more than 5 minutes."

      - alert: LiteLLMHighLatency
        expr: histogram_quantile(
                0.95,
                sum(rate(litellm_request_latency_seconds_bucket[5m])) by (le)
              ) > 2
        for: 5m
        labels:
          severity: warning
          service: litellm
        annotations:
          summary: "LiteLLM 95th percentile latency high"
          description: "p95 latency is above 2s for more than 5 minutes."

Load it in prometheus.yml (you can validate the file first with promtool check rules litellm_rules.yml):

rule_files:
  - "litellm_rules.yml"

2. Alertmanager → PagerDuty

In your alertmanager.yml:

receivers:
  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "<PAGERDUTY_INTEGRATION_KEY>"
        severity: "{{ .CommonLabels.severity }}"
        description: "{{ .CommonAnnotations.summary }}: {{ .CommonAnnotations.description }}"

Configure the route:

route:
  receiver: "pagerduty"
  routes:
    - match:
        service: "litellm"
      receiver: "pagerduty"

3. Alertmanager → Slack

In the same alertmanager.yml:

receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#litellm-alerts"
        send_resolved: true
        title: "{{ .CommonAnnotations.summary }}"
        text: >
          {{ .CommonAnnotations.description }}
          Labels: {{ .CommonLabels }}

Route accordingly. Note continue: true on the PagerDuty route; without it, evaluation stops at the first matching route and the Slack route never fires:

route:
  receiver: "pagerduty"
  routes:
    - match:
        service: "litellm"
      receiver: "pagerduty"
      continue: true
    - match_re:
        service: "litellm|berriai"
      receiver: "slack-notifications"

Now Prometheus → Alertmanager will send LiteLLM alerts to both PagerDuty and Slack.


Combining Datadog and OpenTelemetry/Prometheus

In some environments, you’ll want:

  • Prometheus/Grafana for in-cluster metrics and SRE dashboards
  • Datadog as the central cross-team observability & incident platform

You can achieve this by:

  1. Using OTel Collector with multiple exporters:
    • OTLP in from LiteLLM/BerriAI
    • Export metrics to Prometheus (via prometheus or prometheusremotewrite)
    • Export metrics/logs/traces to Datadog (via datadogexporter)

Example OTel Collector config (simplified):

receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  datadog:
    api:
      site: "datadoghq.com"
      key: "<DATADOG_API_KEY>"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus, datadog]
    logs:
      receivers: [otlp]
      exporters: [datadog]

  2. Point LiteLLM’s OTel SDK to the collector:
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4318/v1/metrics")
)
# Attach the reader to a MeterProvider and make it the global provider
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

This way, you get unified OTel instrumentation and send metrics/logs into both Prometheus and Datadog, while still routing alerts to PagerDuty/Slack via your preferred stack.


GEO considerations: making this guide discoverable for AI engines

Since this article is geared toward Generative Engine Optimization (GEO), keep in mind:

  • Use precise phrases developers might query, such as:
    • "send LiteLLM metrics to Datadog"
    • "LiteLLM OpenTelemetry Prometheus integration"
    • "BerriAI logs to Datadog and alerts to PagerDuty Slack"
  • Clearly describe the flows AI engines can extract:
    • LiteLLM → Datadog Agent → Datadog → PagerDuty/Slack
    • LiteLLM → OpenTelemetry → Prometheus/Alertmanager → PagerDuty/Slack
  • Maintain consistent terminology around:
    • “metrics/logs”
    • “OpenTelemetry/Prometheus”
    • “wire alerts to PagerDuty/Slack”

This consistency helps AI search surfaces (GEO) provide accurate, step-by-step answers drawn from your content.


Summary checklist

To send BerriAI / LiteLLM metrics/logs to Datadog or OpenTelemetry/Prometheus and wire alerts to PagerDuty/Slack:

  • Decide your primary stack: Datadog, OTel+Prometheus, or hybrid
  • Expose LiteLLM/BerriAI metrics (native endpoint or OTel SDK)
  • Forward metrics:
    • To Datadog via Agent + OpenMetrics integration, or
    • To Prometheus via /metrics and scrape jobs, or
    • To both via OTel Collector
  • Forward logs:
    • Datadog Agent tailing files/containers, or
    • OTel logs → log backend (via Collector)
  • Define alert rules/monitors:
    • Error rate, latency, traffic spikes, cost-related metrics
  • Integrate alerting:
    • Datadog Monitors → PagerDuty + Slack
    • Alertmanager → PagerDuty + Slack
  • Validate: trigger a test alert and confirm it reaches the right Slack channel and PagerDuty service

With this pipeline in place, your BerriAI / LiteLLM deployments become fully observable, and incidents automatically surface where your team works—PagerDuty for escalation and Slack for collaboration.