
How do we send BerriAI / LiteLLM metrics/logs to Datadog or OpenTelemetry/Prometheus and wire alerts to PagerDuty/Slack?
Building observability and alerting into your BerriAI / LiteLLM stack is essential once you move beyond simple prototyping. You’ll want to capture metrics and logs, ship them to a backend like Datadog or an OpenTelemetry/Prometheus stack, and then wire alerts into PagerDuty and Slack so you know when things break or degrade.
This guide walks through practical patterns and configs to send BerriAI / LiteLLM metrics/logs to Datadog or OpenTelemetry/Prometheus, and then connect them to PagerDuty/Slack alerts.
Core concepts: what to monitor from BerriAI / LiteLLM
Before wiring tools together, decide what you care about:
- Latency
  - p50/p90/p99 response times
  - Provider-level latency (OpenAI vs Anthropic vs others)
- Error rates
  - HTTP 4xx/5xx
  - Provider-specific errors (rate limits, timeouts, context-length)
- Usage and cost
  - Token usage per request
  - Cost per request / per model
  - Requests per second (RPS), per app, per environment
- Quality and safety signals
  - Hallucination flags, safety filter hits (if you track these)
- Infrastructure health
  - LiteLLM proxy availability (uptime, restarts)
  - Queue/backlog length if you batch or rate-limit
BerriAI / LiteLLM can expose these metrics via logs or HTTP metrics endpoints, which you can then scrape or forward to Datadog, OpenTelemetry, or Prometheus.
Option 1: Send BerriAI / LiteLLM metrics/logs to Datadog
Datadog is often the easiest path for teams already using it for application monitoring.
1. Instrument LiteLLM proxy with Datadog
If you’re running litellm as a proxy service, you can:
- Enable structured logging (JSON) so Datadog can parse requests/responses.
- Expose Prometheus-style metrics and let the Datadog Agent scrape them.
- Use Datadog’s OpenTelemetry support and send OTLP metrics directly.
Example: Enable JSON logs from LiteLLM
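In your LiteLLM config (YAML or environment variables). Recent LiteLLM releases document json_logs: true under litellm_settings for JSON output, so verify the key names below against the version you run: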
litellm_params:
  log_format: json
  log_level: info
  # optional: log tokens and cost if supported
  log_model_cost: true
  log_tokens: true
Then configure your container/logger (e.g., Docker, Kubernetes) to send stdout/stderr to Datadog Logs:
- On Kubernetes, use the Datadog Agent with logs.enabled: true.
- On VMs, install the Datadog Agent and configure a log source:
logs:
  - type: file
    path: /var/log/litellm/*.log
    service: litellm
    source: python
    log_processing_rules:
      - type: multi_line
        name: new_log_start_with_date
        pattern: "\\d{4}-\\d{2}-\\d{2}"
Each request log can include:
- model
- provider
- latency_ms
- tokens_prompt
- tokens_completion
- total_cost_usd
- status (success, error, rate_limited, timeout, etc.)
Use these as facets and measures for dashboards and alert queries.
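If your application wraps LiteLLM and you want to emit these fields yourself, the following is a minimal sketch using only the Python standard library. The logger name, the `llm` attribute, and the helper names are assumptions made for illustration; they are not part of LiteLLM.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Fields passed via extra={"llm": {...}} are merged at the top level
        # so a log backend can treat them as facets and measures.
        payload.update(getattr(record, "llm", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("litellm.requests")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def log_llm_request(logger, *, model, provider, latency_ms, tokens_prompt,
                    tokens_completion, total_cost_usd, status):
    """Emit one request log carrying the fields listed above."""
    logger.info("llm_request", extra={"llm": {
        "model": model,
        "provider": provider,
        "latency_ms": latency_ms,
        "tokens_prompt": tokens_prompt,
        "tokens_completion": tokens_completion,
        "total_cost_usd": total_cost_usd,
        "status": status,
    }})
```

Writing one JSON object per line to stdout keeps the multi-line rule unnecessary for these records, since each event is already a single line.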
Option A: Scrape Prometheus metrics into Datadog
If LiteLLM exposes /metrics in Prometheus format:
litellm --host 0.0.0.0 --port 4000 --metrics-port 9090
Configure the Datadog Agent to scrape:
prometheus_scrape:
  enabled: true
  configurations:
    - name: litellm
      metrics:
        - litellm_requests_total
        - litellm_request_latency_seconds
        - litellm_tokens_total
      namespace: litellm
      labels:
        service: litellm
      endpoints:
        - http://litellm:9090/metrics
Datadog will then convert those metrics into its own time series, usable in monitors.
Option B: Use OpenTelemetry → Datadog
If you already instrument BerriAI / LiteLLM with OpenTelemetry SDKs, configure the Datadog Agent as an OTLP receiver:
apm_config:
  enabled: true
otlp_config:
  receiver:
    protocols:
      grpc: # default port 4317
      http: # default port 4318
Then set the OTLP endpoint in your LiteLLM/SDK config:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317
export OTEL_SERVICE_NAME=litellm
This approach unifies metrics, traces, and logs under the same telemetry pipeline.
2. Creating Datadog alerts and wiring them to PagerDuty/Slack
Once metrics appear in Datadog, you can define monitors and connect them to PagerDuty and Slack.
Common LiteLLM / BerriAI alert patterns
Use Datadog queries such as:
- High error rate
  sum:litellm_requests{status:error}.as_count()
  /
  sum:litellm_requests{*}.as_count()
  Alert when the error ratio exceeds, say, 5% for 5 minutes.
- Latency SLO breaches
  p95:litellm_request_latency_seconds{*} by {model}
  Alert if p95 latency > X seconds for model gpt-4o in production.
- Cost spike
  sum:litellm_cost_usd{env:prod}.rollup(sum, 300)
  Alert when 5-minute spend exceeds the expected baseline.
- RPS anomaly
  sum:litellm_requests{env:prod}.as_rate()
  Use Datadog anomaly detection to alert on sudden traffic drops or spikes.
Connect Datadog to Slack
- In Datadog, go to Integrations → Slack.
- Install the Slack app and authorize it in your workspace.
- Map Datadog to a channel (e.g., #llm-alerts).
- In each Monitor, set Notify to @slack-<channel-name>, for example:
@slack-llm-alerts
You can also use Slack templates to show key details (model, env, error code).
Connect Datadog to PagerDuty
- In PagerDuty, create a Service for litellm or llm-platform.
- Under Integrations, add a new Datadog integration; copy the integration key.
- In Datadog, go to Integrations → PagerDuty, add the key as a service.
- In your monitor, add @pagerduty-<service-name> to the notification message.
Example monitor message:
LLM error rate is above 10% in prod for 5 min.
Query: {{query}}
Current value: {{value}}
@pagerduty-llm-platform @slack-llm-alerts
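If you manage monitors as code, the error-rate query and the notification handles can be combined into one monitor definition. The sketch below only builds the JSON body and sends nothing. The metric name, tags, and handles come from the examples above; the function name and the exact field layout are assumptions to check against the Datadog Monitors API reference.

```python
def build_error_rate_monitor(threshold=0.05, window="last_5m", env="prod",
                             notify=("@pagerduty-llm-platform", "@slack-llm-alerts")):
    """Build a metric-monitor definition for the LLM error ratio (illustrative)."""
    # Monitor queries prefix the metric expression with an evaluation window
    # and end with the comparison against the alert threshold.
    query = (
        f"sum({window}):"
        f"sum:litellm_requests{{status:error,env:{env}}}.as_count() / "
        f"sum:litellm_requests{{env:{env}}}.as_count() > {threshold}"
    )
    # {{value}} is left as a literal template variable for Datadog to fill in.
    message = (
        f"LLM error rate is above {threshold:.0%} in {env}.\n"
        "Current value: {{value}}\n"
        + " ".join(notify)
    )
    return {
        "name": f"[{env}] LiteLLM error rate",
        "type": "query alert",
        "query": query,
        "message": message,
        "tags": ["service:litellm", f"env:{env}"],
        "options": {"thresholds": {"critical": threshold}},
    }
```

Keeping the threshold in one argument means the query, the message text, and the critical threshold cannot drift apart.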
Option 2: Send BerriAI / LiteLLM telemetry via OpenTelemetry
OpenTelemetry gives you vendor-neutral metrics, logs, and traces. From there, you can send them to Prometheus, Datadog, or other backends.
1. Instrument BerriAI / LiteLLM with OpenTelemetry
If your app is Python or Node-based and wraps LiteLLM or a BerriAI service, you can:
- Use OpenTelemetry auto-instrumentation (HTTP, gRPC)
- Create custom metrics and spans around LLM calls
Example: Python OpenTelemetry metrics for LiteLLM
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Export metrics over OTLP/gRPC to a Collector (or Datadog Agent) on port 4317.
resource = Resource.create({"service.name": "litellm-proxy"})
exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter)
# Named meter_provider so it does not clash with the `provider` argument below.
meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter("litellm")

request_counter = meter.create_counter(
    "litellm_requests",
    unit="1",
    description="Number of LLM requests",
)
latency_hist = meter.create_histogram(
    "litellm_request_latency_seconds",
    unit="s",
    description="LLM request latency",
)
tokens_counter = meter.create_counter(
    "litellm_tokens",
    unit="tokens",
    description="Tokens used by LLM requests",
)


def record_llm_request(model, provider, latency, tokens_prompt, tokens_completion, status):
    """Record one finished LLM call; latency is in seconds."""
    attrs = {"model": model, "provider": provider}
    request_counter.add(1, {**attrs, "status": status})
    latency_hist.record(latency, attrs)
    tokens_counter.add(tokens_prompt + tokens_completion, attrs)
Call record_llm_request whenever your app finishes an LLM call.
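One way to make sure every call is recorded, including failures, is to wrap the call in a context manager. This is a sketch rather than a LiteLLM feature: `timed_llm_call` is a hypothetical helper, and `record` is any callable with the same argument order as record_llm_request.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_llm_call(record, model, provider):
    """Time an LLM call and report it through `record` when the block exits."""
    # The caller fills in token counts once the response is available.
    usage = {"tokens_prompt": 0, "tokens_completion": 0}
    status = "success"
    start = time.perf_counter()
    try:
        yield usage
    except TimeoutError:
        status = "timeout"
        raise
    except Exception:
        status = "error"
        raise
    finally:
        # Runs on success and on failure, so errors are counted too.
        record(model, provider, time.perf_counter() - start,
               usage["tokens_prompt"], usage["tokens_completion"], status)
```

Inside the `with` block you would make the LLM call and copy the token counts from the response into `usage`; exceptions are re-raised after being recorded, so existing error handling keeps working.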
2. Use the OpenTelemetry Collector as a hub
Deploy an OpenTelemetry Collector to receive signals and export to your backend(s):
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  # The debug exporter replaces the older logging exporter in recent Collector releases.
  debug:
    verbosity: normal
  prometheus:
    endpoint: "0.0.0.0:9464"
  datadog:
    api:
      key: "${DATADOG_API_KEY}"
      site: "datadoghq.com"
processors:
  batch:
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, datadog]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]
Benefits:
- One place to manage exports (Prometheus, Datadog, logging).
- You can switch or add destinations without changing app code.
Option 3: Send BerriAI / LiteLLM metrics to Prometheus
If you prefer a Prometheus + Alertmanager + Grafana stack, you can scrape metrics from LiteLLM/BerriAI or from the OpenTelemetry Collector.
1. Prometheus scraping of LiteLLM metrics
If LiteLLM exposes /metrics:
scrape_configs:
  - job_name: 'litellm'
    scrape_interval: 15s
    static_configs:
      - targets: ['litellm:9090']
        labels:
          service: 'litellm'
          env: 'prod'
If metrics flow through the OpenTelemetry Collector instead, scrape the Collector’s Prometheus exporter:
scrape_configs:
  - job_name: 'otel-collector'
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:9464']
Common metrics to expose:
- litellm_requests_total{status,model,provider}
- litellm_request_latency_seconds_bucket
- litellm_tokens_total{model}
- litellm_cost_usd_total{model} (if available)
2. Alerting via Prometheus + Alertmanager
Define alert rules in Prometheus:
groups:
  - name: litellm.rules
    rules:
      - alert: LLMHighErrorRate
        # Aggregate by env so {{ $labels.env }} is populated in the summary.
        expr: |
          sum by (env) (rate(litellm_requests_total{status="error"}[5m]))
          /
          sum by (env) (rate(litellm_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High LLM error rate in {{ $labels.env }}"
          description: "Error rate > 5% for 10m"
      - alert: LLMLatencyHigh
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(litellm_request_latency_seconds_bucket[5m])) by (le, model)
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LLM latency p95 high for {{ $labels.model }}"
          description: "p95 latency > 5s for 10m"
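The latency rule depends on how histogram_quantile estimates a percentile from cumulative buckets. The sketch below mirrors that estimate (linear interpolation inside the bucket that contains the target rank) in plain Python. It is a simplified illustration, not Prometheus source code, and the bucket boundaries in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by bound,
    ending with (math.inf, total_count), like the `le` series of a histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: the best available
                # answer is the highest finite bucket boundary.
                return buckets[-2][0]
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Example buckets (seconds): 95 of 100 requests finished within 5s.
example = [(1.0, 50), (2.5, 80), (5.0, 95), (10.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because the estimate can never exceed the highest finite boundary, choose bucket bounds that straddle your alert threshold; with a top bucket of 5s, a rule of the form "> 5" could never fire.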
Configure Alertmanager to send alerts to PagerDuty and Slack.
Wiring alerts from Prometheus/Alertmanager to PagerDuty
Use the PagerDuty integration in Alertmanager:
- In PagerDuty, create or pick a Service and add a Prometheus or Events API v2 integration.
- Copy the Integration Key.
- Add this to your alertmanager.yml:
route:
  receiver: 'pagerduty'
  routes:
    - match:
        severity: 'page'
      receiver: 'pagerduty'
    - match:
        severity: 'warning'
      receiver: 'slack-llm'
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_INTEGRATION_KEY>'
        severity: '{{ if eq .CommonLabels.severity "page" }}critical{{ else }}error{{ end }}'
        description: '{{ template "pagerduty.default.description" . }}'
  - name: 'slack-llm'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#llm-alerts'
        send_resolved: true
        title: '[LLM] {{ .CommonAnnotations.summary }}'
        text: >-
          {{ .CommonAnnotations.description }}
          Labels: {{ range .CommonLabels.SortedPairs }}{{ .Name }}="{{ .Value }}" {{ end }}
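Alerts with severity: "page" go to PagerDuty and alerts with severity: "warning" go to Slack. Any other severity matches neither child route and falls through to the root receiver, which is also PagerDuty in this config; set the root receiver to slack-llm if unlabeled alerts should not page.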
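The routing tree in that config can be traced with a few lines of Python. This is a simplified model written for illustration: it handles only exact `match` labels and first-match-wins ordering, and it ignores regex matchers, `continue`, and grouping.

```python
# Mirror of the route section in the alertmanager.yml example.
ROUTE = {
    "receiver": "pagerduty",
    "routes": [
        {"match": {"severity": "page"}, "receiver": "pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "slack-llm"},
    ],
}


def pick_receiver(labels, route=ROUTE):
    """Return the receiver an alert with these labels would reach."""
    for child in route.get("routes", []):
        # A child route matches when every label in `match` is equal.
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(labels, child)
    # No child matched: the alert stays with this node's receiver.
    return route["receiver"]


for severity in ("page", "warning", "info"):
    print(severity, "->", pick_receiver({"severity": severity}))
```

Running it shows that an alert with an unexpected severity such as "info" lands on the root receiver, which is worth checking before choosing PagerDuty as the default.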
Wiring alerts from OpenTelemetry / Datadog to Slack
If you’re using OpenTelemetry but exporting to Datadog, you will still configure alerts in Datadog (as in the earlier section). For a pure Prometheus/OpenTelemetry stack without Datadog:
- Use Alertmanager → Slack via webhook (shown above).
- Alternatively, if using Grafana Cloud, define alerts in Grafana and use its Slack contact points.
Key Slack configuration considerations:
- Use separate channels for:
  - #llm-alerts-critical
  - #llm-alerts-warning
- Include model, provider, env, and region in messages so responders can quickly scope incidents.
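If you post to Slack from your own code rather than through Alertmanager, a message builder along these lines keeps the scoping fields consistent. It is a sketch: the function and mapping names are hypothetical, the channel is returned separately so you can pick the matching webhook URL, and nothing is sent.

```python
import json

# Channel names taken from the list above; the mapping itself is an assumption.
CHANNEL_BY_SEVERITY = {
    "page": "#llm-alerts-critical",
    "warning": "#llm-alerts-warning",
}


def build_slack_alert(summary, *, severity, model, provider, env, region):
    """Return (channel, JSON body) for a Slack incoming-webhook post."""
    # Unknown severities default to the lower-urgency channel.
    channel = CHANNEL_BY_SEVERITY.get(severity, "#llm-alerts-warning")
    scope = f"model={model} provider={provider} env={env} region={region}"
    body = {"text": f"[LLM][{severity}] {summary}\n{scope}"}
    return channel, json.dumps(body)
```

Putting the scope on its own line in key=value form makes it easy to search channel history for a specific model or region during an incident.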
Recommended patterns for a robust BerriAI / LiteLLM observability stack
To keep things maintainable as usage grows:
- Standardize labels/tags
  - Always include: env, service, model, provider, region, team, customer (if multi-tenant).
- Separate environments
  - Use different Datadog services / Prometheus labels for dev, staging, prod.
  - Only page PagerDuty for env="prod".
- Define SLOs for LLM calls
  - e.g., “99% of litellm requests under 5 seconds, 99.9% successful”.
  - Build Datadog or Prometheus SLOs on top of metrics.
- Alert on symptoms, not just causes
  - Latency, error rate, and cost spikes map directly to user experience and budget.
  - Drill into root-cause via provider error codes and traces after alerts fire.
- Use traces for request-level debugging
  - With OpenTelemetry, wrap each LLM call in a span with attributes such as llm.model, llm.provider, llm.tokens_prompt, llm.tokens_completion.
  - Export traces to Datadog APM, Jaeger, or Tempo.
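To make the SLO example concrete, the arithmetic behind "99% under 5 seconds, 99.9% successful" can be sketched as follows. The function name and the request counts in the usage note are illustrative assumptions, not measurements.

```python
def slo_report(total, failed, slow, success_target=0.999, latency_target=0.99):
    """Evaluate success and latency SLOs from request counts in one window.

    `failed` is the number of unsuccessful requests and `slow` the number
    that exceeded the latency threshold (5 seconds in the example SLO).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < success_target < 1:
        raise ValueError("success_target must be between 0 and 1")
    success_ratio = (total - failed) / total
    fast_ratio = (total - slow) / total
    # The error budget is the number of failures the target tolerates.
    allowed_failures = total * (1 - success_target)
    return {
        "success_ratio": success_ratio,
        "fast_ratio": fast_ratio,
        "success_slo_met": success_ratio >= success_target,
        "latency_slo_met": fast_ratio >= latency_target,
        # 1.0 means untouched budget; negative means the budget is overspent.
        "error_budget_remaining": 1 - failed / allowed_failures,
    }
```

For example, with 100,000 requests, 40 failures, and 600 slow responses, both SLOs are met and roughly 60% of the error budget remains, since 99.9% of 100,000 allows about 100 failures.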
Putting it all together: sample architecture
A practical setup for many teams looks like:
- BerriAI / LiteLLM:
  - Runs as a proxy or SDK inside your app
  - Emits JSON logs + OpenTelemetry metrics/traces
- OpenTelemetry Collector:
  - Receives OTLP from LiteLLM/app
  - Exports:
    - Metrics → Prometheus (scraped by Prometheus Server)
    - Metrics/Traces/Logs → Datadog (optional)
  - Provides a /metrics endpoint for Prometheus
- Metrics & alerts:
  - Prometheus + Alertmanager for open-source stack
  - Datadog for unified enterprise monitoring
- Notifications:
  - PagerDuty for critical, on-call paging
  - Slack for most alerts, warnings, and FYIs
This approach makes it straightforward to send BerriAI / LiteLLM metrics/logs to Datadog or OpenTelemetry/Prometheus and wire alerts to PagerDuty/Slack, while staying flexible if your tooling changes over time.