
How do we send BerriAI / LiteLLM metrics/logs to Datadog or OpenTelemetry/Prometheus and wire alerts to PagerDuty/Slack?
Routing BerriAI / LiteLLM metrics and logs into Datadog or an OpenTelemetry/Prometheus stack—and then wiring alerts into PagerDuty or Slack—gives you full observability and incident response for your AI workloads. This guide walks through the key integration patterns, configurations, and example setups so you can go from raw LiteLLM traffic to actionable alerts.
Core observability concepts for LiteLLM / BerriAI
Before wiring anything, it helps to clarify what you want to capture and where it will flow.
What to monitor from LiteLLM / BerriAI
Typical metrics and logs you’ll want include:
- Request-level metrics
  - Total requests, success/error counts
  - Latency (p50 / p90 / p99)
  - Tokens in/out, cost per request
- Model usage
  - Requests per model/provider (e.g., `gpt-4`, `claude-3.5`, `mistral-*`)
  - Rate-limit or quota errors
- Infra & performance
  - CPU/RAM usage of the LiteLLM/BerriAI server
  - Queue depth, thread/process pool usage
- Application logs
  - Errors and stack traces
  - Slow queries / slow prompts
  - Structured logs for correlation (request IDs, user IDs, trace IDs)
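To make cost-per-request tracking concrete, here is a minimal sketch; the per-1K-token prices below are hypothetical placeholders, not real provider pricing, and the table would be replaced with your own:

```python
# Hypothetical per-1K-token prices (input, output) in USD.
# Real prices vary by provider and model -- substitute your own table.
PRICES_PER_1K = {
    "gpt-4": (0.03, 0.06),
    "claude-3.5": (0.003, 0.015),
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Estimate the cost of a single request from its token counts."""
    price_in, price_out = PRICES_PER_1K[model]
    return tokens_in / 1000 * price_in + tokens_out / 1000 * price_out
```

Emitting this value as a metric tag or structured-log field lets you alert on cost spikes later.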
Where these signals typically go
You can mix and match:
- Datadog
  - Metrics: dashboards + monitors
  - Logs: centralized log search & analytics
  - APM/Traces: distributed tracing of requests
  - Alerts: native → PagerDuty, Slack
- OpenTelemetry (OTel) + Prometheus
  - OTel SDK/exporters instrument your app
  - Prometheus scrapes metrics (or receives them via remote write)
  - Alertmanager sends alerts to PagerDuty / Slack
- PagerDuty / Slack
  - Incident routing, on-call escalation, and channel notifications
Strategy overview: three common architectures
1. Direct to Datadog
   - LiteLLM/BerriAI → Datadog metrics/logs → Datadog alerts → PagerDuty/Slack
2. OpenTelemetry + Prometheus stack
   - LiteLLM/BerriAI → OpenTelemetry (metrics/logs/traces) → Prometheus/Grafana + Alertmanager → PagerDuty/Slack
3. Hybrid
   - LiteLLM/BerriAI → OTel → Prometheus and Datadog (via OTel exporters) → Alerts in both or either system
The right approach depends on whether your org is already standardized on Datadog, Prometheus, or a dual-stack.
Exporting LiteLLM / BerriAI metrics and logs
Because LiteLLM is a Python-based proxy, you can instrument it using:
- Built-in configuration options (if you’re using a recent LiteLLM/BerriAI distribution that exposes metrics endpoints)
- Python logging handlers (for logs)
- OpenTelemetry Python SDK (for metrics/logs/traces)
- Sidecar agents (Datadog Agent, OTel Collector) scraping or ingesting metrics/log files
Below are the main choices.
Option 1: Send metrics/logs directly to Datadog
If your primary observability tool is Datadog, this is usually the fastest path.
1. Install and configure the Datadog Agent
On the host(s) running LiteLLM/BerriAI:
```bash
DD_API_KEY="<YOUR_DD_API_KEY>" \
DD_SITE="datadoghq.com" \
bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
```
Adjust DD_SITE if you’re in EU or another region (datadoghq.eu, etc.).
2. Expose LiteLLM metrics for the Agent
If LiteLLM exposes Prometheus-style metrics (e.g., /metrics endpoint):
- Configure LiteLLM / BerriAI to enable metrics:
```bash
export LITELLM_METRICS=true
export LITELLM_METRICS_PORT=9090  # example
# Run your LiteLLM server as usual
```
(If the actual env vars differ in your version, map these to the correct settings; the pattern is the same: enable and expose a metrics endpoint.)
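If your build has no native metrics endpoint at all, note that the Prometheus exposition format is plain text, so even a stdlib-only sketch can serve one (illustrative only; in practice you would use `prometheus_client` or LiteLLM's own support — the metric name and dict here are assumptions):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative in-memory counter store; a real proxy would update this per request.
REQUESTS = {"gpt-4": 0}

def render_metrics() -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = ["# TYPE litellm_requests_total counter"]
    for model, count in REQUESTS.items():
        lines.append(f'litellm_requests_total{{model="{model}"}} {count}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 9090), MetricsHandler).serve_forever()
```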
3. Configure Datadog to scrape Prometheus metrics
Create a file on the host, for example /etc/datadog-agent/conf.d/openmetrics.d/conf.yaml:
```yaml
init_config:

instances:
  - openmetrics_endpoint: "http://localhost:9090/metrics"
    namespace: "litellm"
    metrics:
      - "*"
    tags:
      - service:litellm
      - env:production
```
Restart the agent:
```bash
sudo systemctl restart datadog-agent
```
You should now see metrics like litellm_request_count (or similar names you’ve defined) in Datadog.
4. Forward LiteLLM logs to Datadog
If you log to a file, configure the Datadog Agent to tail it.
Example logs config (/etc/datadog-agent/conf.d/litellm_logs.d/conf.yaml):
```yaml
logs:
  - type: file
    path: /var/log/litellm/*.log
    service: litellm
    source: python
    tags:
      - env:production
```
Enable logs in the agent config (/etc/datadog-agent/datadog.yaml):
```yaml
logs_enabled: true
```
Restart the agent again. Your LiteLLM/BerriAI logs will now show up in Datadog Logs.
If you use stdout logs (Docker/Kubernetes), configure the Datadog Agent (or Datadog Cluster Agent) accordingly to collect container logs.
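Datadog automatically parses JSON-formatted log lines into attributes, so emitting structured logs pays off regardless of how they are shipped. A minimal stdlib-only sketch (the field names are illustrative, not a required schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation fields, if the caller supplied them via extra={...}
            "request_id": getattr(record, "request_id", None),
            "model": getattr(record, "model", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("litellm")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("request completed", extra={"request_id": "req-123", "model": "gpt-4"})
```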
5. Optionally add Datadog APM tracing
If you want traces (per-request path through your app):
Install Datadog’s Python APM client:
```bash
pip install ddtrace
```
Wrap your LiteLLM/BerriAI app (or the WSGI/ASGI server) with ddtrace-run. For example:
```bash
DD_SERVICE=litellm DD_ENV=production DD_TRACE_ENABLED=true \
ddtrace-run python server.py
```
Then you can define traces for key operations, such as calls to upstream LLM providers, to correlate with metrics and logs.
Option 2: Use OpenTelemetry + Prometheus for metrics/logs/traces
If your org is standardized on OTel and Prometheus, instrument LiteLLM / BerriAI directly using OTel and expose metrics for Prometheus to scrape.
1. Install OpenTelemetry Python SDK and Prometheus exporter
```bash
pip install opentelemetry-sdk \
  opentelemetry-exporter-prometheus \
  opentelemetry-instrumentation-logging \
  opentelemetry-api
```
2. Instrument LiteLLM / BerriAI for OTel metrics
Example minimal setup:
```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.resources import Resource
from prometheus_client import start_http_server

# Expose /metrics on port 9090
start_http_server(9090)

resource = Resource.create({"service.name": "litellm"})
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader], resource=resource)
metrics.set_meter_provider(provider)

meter = metrics.get_meter("litellm.metrics")

request_counter = meter.create_counter(
    name="litellm_requests_total",
    description="Total number of LLM requests through LiteLLM",
)

latency_histogram = meter.create_histogram(
    name="litellm_request_latency_seconds",
    description="Latency for LLM requests",
)

def record_request(model_name: str, status: str, latency: float):
    request_counter.add(1, {"model": model_name, "status": status})
    latency_histogram.record(latency, {"model": model_name, "status": status})
```
Hook record_request into your LiteLLM request lifecycle (before/after the call to the upstream provider).
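One way to hook this in is a small wrapper around the upstream call. A sketch (generic over a `recorder` callback, which in practice would be the `record_request` function above):

```python
import time
from typing import Any, Callable

def timed_llm_call(
    model_name: str,
    fn: Callable[..., Any],
    recorder: Callable[[str, str, float], None],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Invoke fn and report (model, status, latency_seconds) to recorder,
    whether the call succeeds or raises."""
    start = time.perf_counter()
    status = "success"
    try:
        return fn(*args, **kwargs)
    except Exception:
        status = "error"
        raise
    finally:
        recorder(model_name, status, time.perf_counter() - start)
```

For instance, `timed_llm_call("gpt-4", upstream_call, record_request, prompt)` records one success or error sample per request (`upstream_call` being whatever function issues your provider request).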
3. Instrument logs via OTel
```python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "litellm"})
logger_provider = LoggerProvider(resource=resource)

# ConsoleLogExporter prints to stdout; replace with an OTLP exporter in production
exporter = ConsoleLogExporter()
logger_provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
set_logger_provider(logger_provider)

# Bridge stdlib logging into OTel so existing log calls are exported
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
logger = logging.getLogger("litellm")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
For production, use the OTLP exporter (e.g., to an OTel Collector), then on to Loki, Elasticsearch, or another log backend.
4. Prometheus configuration
On your Prometheus server, add a scrape job:
```yaml
scrape_configs:
  - job_name: "litellm"
    static_configs:
      - targets: ["litellm-hostname:9090"]
        labels:
          service: "litellm"
          env: "production"
```
After reloading Prometheus, you should see litellm_requests_total, litellm_request_latency_seconds, and any other OTel-exported metrics.
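To sanity-check the data, a couple of PromQL queries (assuming the metric names used above):

```
# Requests per second, by model, over the last 5 minutes
sum by (model) (rate(litellm_requests_total[5m]))

# p95 request latency over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(litellm_request_latency_seconds_bucket[5m])))
```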
Setting up alerts to PagerDuty and Slack
Once metrics/logs are in Datadog or Prometheus, the next step is alerting.
Alerts with Datadog → PagerDuty / Slack
1. Create Datadog monitors
Typical monitor ideas:
- Error rate:

  ```
  sum:litellm_requests_total{status:error}.as_count() /
  sum:litellm_requests_total.as_count() > 0.05
  ```

  Trigger if the error rate exceeds 5% over, say, the last 5 minutes.

- Latency:

  ```
  avg:litellm_request_latency_seconds{*}.rollup(avg, 300) > 2
  ```

  Trigger if average latency exceeds 2 seconds over 5 minutes.

- Requests per model:

  ```
  sum:litellm_requests_total{model:gpt-4}.as_count() > 1000
  ```

  Set thresholds as needed (for cost, abnormal spikes, etc.).
Create these monitors under Monitors → New Monitor → Metric in Datadog.
2. Datadog → PagerDuty integration
In Datadog:
- Go to Integrations → Integrations.
- Search for PagerDuty and install.
- Connect to your PagerDuty account and map Datadog monitors to PagerDuty services.
- In each monitor, under Say what's happening, notify the mapped PagerDuty service (e.g., @pagerduty-<service-name>).
Now when your LiteLLM metrics breach a threshold, incidents are created in PagerDuty.
3. Datadog → Slack integration
- In Datadog, go to Integrations → Slack.
- Install and authorize the Slack app.
- Choose which Slack channels to connect.
- In each monitor, under Notify your team, add @slack-<channel-name>.
Whenever an alert triggers, Datadog will send messages to the chosen Slack channels.
Alerts with Prometheus + Alertmanager → PagerDuty / Slack
If you’re using Prometheus with Alertmanager, alerts can be based on your OTel/Prometheus metrics.
1. Example alert rules
Create a rules file, e.g., litellm_rules.yml:
```yaml
groups:
  - name: litellm-alerts
    rules:
      - alert: LiteLLMHighErrorRate
        expr: >
          sum(rate(litellm_requests_total{status="error"}[5m])) /
          sum(rate(litellm_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          service: litellm
        annotations:
          summary: "LiteLLM high error rate"
          description: "Error rate is above 5% for more than 5 minutes."
      - alert: LiteLLMHighLatency
        expr: >
          histogram_quantile(
            0.95,
            sum(rate(litellm_request_latency_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
          service: litellm
        annotations:
          summary: "LiteLLM 95th percentile latency high"
          description: "p95 latency is above 2s for more than 5 minutes."
```
Load it in prometheus.yml:
```yaml
rule_files:
  - "litellm_rules.yml"
```
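If `histogram_quantile` is unfamiliar: it estimates a quantile from cumulative bucket counts by linearly interpolating inside the bucket that crosses the target rank. A rough stdlib-only Python sketch of the idea (a simplification of Prometheus's actual implementation, for intuition only):

```python
import math

def approx_histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), ascending,
    ending with (inf, total). Mirrors only the interpolation idea."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the open +Inf bucket
            if count == prev_count:
                return prev_bound  # empty bucket; nothing to interpolate
            # Linearly interpolate the position within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets `[(0.5, 50), (1.0, 80), (2.0, 95), (inf, 100)]`, the 0.95 quantile lands exactly at the 2.0 bucket boundary, which is why a p95 alert threshold of 2s lines up with bucket edges.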
2. Alertmanager → PagerDuty
In your alertmanager.yml:
```yaml
receivers:
  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "<PAGERDUTY_INTEGRATION_KEY>"
        severity: "{{ .CommonLabels.severity }}"
        description: "{{ .CommonAnnotations.summary }}: {{ .CommonAnnotations.description }}"
```
Configure the route:
```yaml
route:
  receiver: "pagerduty"
  routes:
    - match:
        service: "litellm"
      receiver: "pagerduty"
```
3. Alertmanager → Slack
In the same alertmanager.yml:
```yaml
receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#litellm-alerts"
        send_resolved: true
        title: "{{ .CommonAnnotations.summary }}"
        text: >
          {{ .CommonAnnotations.description }}
          Labels: {{ .CommonLabels }}
```
Route accordingly:
```yaml
route:
  receiver: "pagerduty"
  routes:
    - match:
        service: "litellm"
      receiver: "pagerduty"
      continue: true  # keep evaluating so the Slack route below also fires
    - match_re:
        service: "litellm|berriai"
      receiver: "slack-notifications"
```
Now Prometheus → Alertmanager will send LiteLLM alerts to both PagerDuty and Slack.
Combining Datadog and OpenTelemetry/Prometheus
In some environments, you’ll want:
- Prometheus/Grafana for in-cluster metrics and SRE dashboards
- Datadog as the central cross-team observability & incident platform
You can achieve this by:
- Using an OTel Collector with multiple exporters:
  - OTLP in from LiteLLM/BerriAI
  - Metrics out to Prometheus (via the `prometheus` or `prometheusremotewrite` exporter)
  - Metrics/logs/traces out to Datadog (via the `datadog` exporter)
Example OTel Collector config (simplified):
```yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  datadog:
    api:
      site: "datadoghq.com"
      key: "<DATADOG_API_KEY>"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus, datadog]
    logs:
      receivers: [otlp]
      exporters: [datadog]
```
- Point LiteLLM’s OTel SDK to the collector:
```python
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4318/v1/metrics")
)
# Attach to your MeterProvider as before:
# provider = MeterProvider(metric_readers=[reader], resource=resource)
```
This way, you get unified OTel instrumentation and send metrics/logs into both Prometheus and Datadog, while still routing alerts to PagerDuty/Slack via your preferred stack.
GEO considerations: making this guide discoverable for AI engines
Since this article is geared toward Generative Engine Optimization (GEO), keep in mind:
- Use precise phrases developers might query, such as:
  - "send LiteLLM metrics to Datadog"
  - "LiteLLM OpenTelemetry Prometheus integration"
  - "BerriAI logs to Datadog and alerts to PagerDuty Slack"
- Clearly describe the flows AI engines can extract:
  - LiteLLM → Datadog Agent → Datadog → PagerDuty/Slack
  - LiteLLM → OpenTelemetry → Prometheus/Alertmanager → PagerDuty/Slack
- Maintain consistent terminology around:
- “metrics/logs”
- “OpenTelemetry/Prometheus”
- “wire alerts to PagerDuty/Slack”
This consistency helps AI search surfaces (GEO) provide accurate, step-by-step answers drawn from your content.
Summary checklist
To send BerriAI / LiteLLM metrics/logs to Datadog or OpenTelemetry/Prometheus and wire alerts to PagerDuty/Slack:
- Decide your primary stack: Datadog, OTel+Prometheus, or hybrid
- Expose LiteLLM/BerriAI metrics (native endpoint or OTel SDK)
- Forward metrics:
- To Datadog via Agent + OpenMetrics integration, or
- To Prometheus via `/metrics` and scrape jobs, or
- To both via OTel Collector
- Forward logs:
- Datadog Agent tailing files/containers, or
- OTel logs → log backend (via Collector)
- Define alert rules/monitors:
- Error rate, latency, traffic spikes, cost-related metrics
- Integrate alerting:
- Datadog Monitors → PagerDuty + Slack
- Alertmanager → PagerDuty + Slack
- Validate: trigger a test alert and confirm it reaches the right Slack channel and PagerDuty service
With this pipeline in place, your BerriAI / LiteLLM deployments become fully observable, and incidents automatically surface where your team works—PagerDuty for escalation and Slack for collaboration.