
BerriAI / LiteLLM: how do we enforce RPM/TPM limits and monthly budgets per team, with alerts when they’re close to the cap?
Most teams adopting BerriAI and LiteLLM quickly realize they need more than just an API key—they need guardrails. You want to cap requests per minute (RPM), tokens per minute (TPM), and set monthly budgets per team, with clear alerts before anything blows past the limit. The good news: LiteLLM’s proxy and BerriAI’s orchestration features are designed to make this kind of governance possible with relatively little code.
Below is a practical walkthrough explaining how to enforce RPM/TPM limits and monthly budgets per team, with alerts when costs approach (or hit) your caps.
Why you need RPM, TPM, and budget limits per team
When multiple teams share the same LLM infrastructure (via BerriAI or LiteLLM), a few risks appear:
- One team can accidentally exhaust your OpenAI / Anthropic / Azure quota.
- Cost can spike unexpectedly without visibility.
- A rogue script can easily exceed rate limits and trigger provider throttling or bans.
Enforcing per-team RPM/TPM and monthly spend limits, plus alerts, gives you:
- Fair usage: Each team has a predictable share of the capacity.
- Cost control: You cap total spend per team and globally.
- Operational safety: Rate limiting prevents sudden bursts that trigger provider errors.
The rest of this guide shows how to configure this using LiteLLM proxy (often alongside BerriAI) with real examples.
Architecture overview: where BerriAI and LiteLLM fit
A common pattern looks like this:
- Clients / apps → call BerriAI or your own app backend.
- BerriAI / your backend → sends requests to LiteLLM Proxy instead of directly to OpenAI, Anthropic, etc.
- LiteLLM Proxy:
  - Applies RPM/TPM limits per team.
  - Tracks usage and cost.
  - Enforces monthly budgets per team.
  - Emits logs/metrics that you use for alerts.
Key concept: you treat LiteLLM proxy as the single choke point for all LLM traffic. This is where you enforce limits and budgets.
Prerequisites
Before enforcing RPM/TPM limits and monthly budgets per team, make sure you have:
- A running LiteLLM proxy (e.g., via Docker, Kubernetes, or bare metal).
- Your provider keys (OpenAI / Anthropic / Azure / etc.).
- A way to identify which team each request belongs to (e.g., custom header, API key, or auth token).
- Basic logging and alerting infrastructure (e.g., Prometheus + Grafana, Datadog, or a simple email/Slack integration).
If you’re using BerriAI, you’ll typically configure BerriAI to call the LiteLLM proxy endpoint instead of provider APIs directly.
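Since the LiteLLM proxy speaks the OpenAI chat-completions format, routing traffic through it is mostly a URL and key change. The sketch below builds the request pieces for such a call; the proxy URL and key are placeholders for your own deployment:

```python
LITELLM_PROXY_URL = "https://your-litellm-proxy/v1/chat/completions"  # placeholder

def build_proxy_request(team_key: str, model: str, messages: list) -> dict:
    """Assemble the HTTP pieces for a chat call through the LiteLLM proxy.

    The proxy accepts the OpenAI chat-completions payload, so only the URL
    and the (per-team) key differ from a direct provider call.
    """
    return {
        "url": LITELLM_PROXY_URL,
        "headers": {"Authorization": f"Bearer {team_key}"},
        "json": {"model": model, "messages": messages},
    }

req = build_proxy_request(
    "sk-marketing-key", "gpt-4.1-mini",
    [{"role": "user", "content": "Hello"}],
)
# requests.post(req["url"], headers=req["headers"], json=req["json"], timeout=30)
```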
Step 1: Define teams and how to identify them
To enforce per-team limits, LiteLLM must know which team a request belongs to. The most common strategies:
- Per-team API keys: Each team uses a unique LiteLLM key (simplest).
- Auth / JWT claims: A claim like `team_id` is attached by your auth layer.
- Custom headers: e.g., `X-Team-ID: marketing`, `X-Team-ID: research`.
In BerriAI, you can attach headers or tokens when it calls LiteLLM. For example, if BerriAI is part of your stack:
```python
import requests

def call_llm_via_litellm(team_id: str, payload: dict):
    response = requests.post(
        "https://your-litellm-proxy/v1/chat/completions",
        headers={"X-Team-ID": team_id},
        json=payload,
        timeout=30,
    )
    return response.json()
```
Your LiteLLM configuration (or middleware) can then read X-Team-ID and map it to the appropriate quota and budget.
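That header-to-quota mapping can be as simple as a lookup with a fallback. The registry and default numbers below are illustrative assumptions, not LiteLLM internals:

```python
# Hypothetical in-process registry mapping team IDs to their limits.
TEAM_LIMITS = {
    "marketing": {"rpm": 30, "tpm": 15_000, "monthly_budget_usd": 100.0},
    "research":  {"rpm": 120, "tpm": 60_000, "monthly_budget_usd": 500.0},
}
DEFAULT_LIMITS = {"rpm": 60, "tpm": 30_000, "monthly_budget_usd": 50.0}

def limits_for_request(headers: dict) -> dict:
    """Resolve X-Team-ID to that team's quota, falling back to defaults."""
    team_id = headers.get("X-Team-ID", "")
    return TEAM_LIMITS.get(team_id, DEFAULT_LIMITS)
```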
Step 2: Configure RPM and TPM limits per team in LiteLLM
LiteLLM provides rate-limiting support through its proxy. While the exact config format can evolve, the general pattern is:
- Enable rate limiting globally.
- Define per-identity (team) limits for RPM and TPM.
- Use the team identifier (e.g., header or API key) as the key for rate limiting.
An example config.yaml snippet (conceptual):
```yaml
proxy:
  enabled: true
  host: 0.0.0.0
  port: 8000

rate_limits:
  enabled: true
  # Identify the client / team by this header or token
  identifier_header: "X-Team-ID"   # or "Authorization" / "X-API-Key"

  # Default limits (used if no team-specific rule matches)
  default:
    rpm: 60       # 60 requests per minute
    tpm: 30000    # 30k tokens per minute

  # Team-specific limits
  teams:
    marketing:
      rpm: 30     # marketing: up to 30 RPM
      tpm: 15000  # marketing: up to 15k TPM
    research:
      rpm: 120    # research: up to 120 RPM
      tpm: 60000
    product:
      rpm: 80
      tpm: 40000
```
With this structure:
- Each incoming request uses `X-Team-ID` as a key.
- LiteLLM maintains counters per team for the current time window.
- If a team exceeds RPM or TPM, LiteLLM returns a 429 error (or similar) until the window resets.
In BerriAI-driven apps, you can surface these errors back to the user (e.g., “Team rate limit exceeded, try again in a minute”).
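Conceptually, those per-team window counters behave like the minimal fixed-window sketch below. This is an illustration of the mechanism, not LiteLLM's actual implementation:

```python
import time
from collections import defaultdict

class TeamRateLimiter:
    """Fixed one-minute windows of request and token counts per team."""

    def __init__(self, limits: dict):
        self.limits = limits  # e.g. {"marketing": {"rpm": 30, "tpm": 15000}}
        self.windows = defaultdict(lambda: {"start": 0.0, "requests": 0, "tokens": 0})

    def allow(self, team_id: str, tokens: int, now=None) -> bool:
        now = time.time() if now is None else now
        win = self.windows[team_id]
        if now - win["start"] >= 60:  # window expired: start a fresh one
            win.update(start=now, requests=0, tokens=0)
        limit = self.limits[team_id]
        if win["requests"] + 1 > limit["rpm"] or win["tokens"] + tokens > limit["tpm"]:
            return False  # caller should return HTTP 429
        win["requests"] += 1
        win["tokens"] += tokens
        return True
```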
Step 3: Track usage and cost per team
To enforce monthly budgets, you must track:
- Number of requests.
- Tokens consumed.
- Provider cost (per model) over time.
- Grouped by team.
LiteLLM typically exposes:
- Logs: Structured logs with model, prompt tokens, completion tokens, cost, and metadata.
- Metrics: Prometheus endpoints or similar for usage per key/team.
For example, the proxy might log JSON lines like:
```json
{
  "timestamp": "2026-04-01T12:30:05Z",
  "team_id": "marketing",
  "model": "gpt-4.1-mini",
  "prompt_tokens": 250,
  "completion_tokens": 200,
  "total_tokens": 450,
  "cost_usd": 0.0025,
  "request_id": "req_123",
  "status": 200
}
```
You can:
- Push these logs into Postgres, BigQuery, or a data warehouse.
- Or scrape metrics via Prometheus and aggregate by `team_id`.
In BerriAI pipelines, you can also join these metrics with application-level data (which workflows, which users, etc.) for more detailed cost attribution.
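Given logs in the JSON-lines shape above, a per-team rollup takes only a few lines of Python (field names follow the example log entry):

```python
import json
from collections import defaultdict

def cost_by_team(log_lines):
    """Sum cost_usd, total_tokens, and request counts per team_id
    from JSON-lines proxy logs."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "tokens": 0, "requests": 0})
    for line in log_lines:
        entry = json.loads(line)
        t = totals[entry["team_id"]]
        t["cost_usd"] += entry["cost_usd"]
        t["tokens"] += entry["total_tokens"]
        t["requests"] += 1
    return dict(totals)
```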
Step 4: Define monthly budgets per team
Next, translate your financial constraints into a config your platform can enforce.
For example:
```yaml
budgets:
  enabled: true
  reset_day: 1                  # Day of month when budgets reset (UTC)
  teams:
    marketing:
      monthly_budget_usd: 100.0
      soft_threshold_pct: 0.8   # Alert at 80% of budget
      hard_cap: true            # Block when budget reached
    research:
      monthly_budget_usd: 500.0
      soft_threshold_pct: 0.85
      hard_cap: true
    product:
      monthly_budget_usd: 250.0
      soft_threshold_pct: 0.9
      hard_cap: false           # Allow exceeding, but log warnings
```
Internally, you’ll maintain a usage table or metrics series like:
```sql
CREATE TABLE team_usage (
    team_id         text,
    period_start    date,     -- month start
    period_end      date,     -- month end
    total_cost_usd  numeric,
    total_tokens    bigint,
    total_requests  bigint,
    PRIMARY KEY (team_id, period_start)
);
```
A scheduled job (cron, Airflow, or a small serverless function) can:
- Aggregate logs/metrics into this table.
- Compare `total_cost_usd` vs. `monthly_budget_usd`.
- Update a “current status” cache (e.g., Redis, in-memory store, or config DB) used by the LiteLLM proxy at request time.
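The aggregation step of that job can be sketched as an upsert into the usage table. SQLite stands in here for the Postgres table above (dropping `period_end` for brevity); the function itself is a hypothetical helper:

```python
import datetime
import sqlite3

# In-memory SQLite stand-in for the team_usage table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE team_usage (
        team_id         text,
        period_start    text,
        total_cost_usd  real,
        total_tokens    integer,
        total_requests  integer,
        PRIMARY KEY (team_id, period_start)
    )"""
)

def upsert_usage(conn, team_id: str, cost_usd: float, tokens: int, requests: int):
    """Add a batch of aggregated usage to the current month's row for a team."""
    period_start = datetime.date.today().replace(day=1).isoformat()
    conn.execute(
        """
        INSERT INTO team_usage VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (team_id, period_start) DO UPDATE SET
            total_cost_usd = total_cost_usd + excluded.total_cost_usd,
            total_tokens   = total_tokens   + excluded.total_tokens,
            total_requests = total_requests + excluded.total_requests
        """,
        (team_id, period_start, cost_usd, tokens, requests),
    )

upsert_usage(conn, "marketing", 1.5, 450, 1)
upsert_usage(conn, "marketing", 0.5, 100, 1)  # second batch merges into the same row
```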
Step 5: Enforce budgets at request time
To make LiteLLM actually block (or warn) when budgets are hit, you can:
- Use LiteLLM’s pre-request middleware hooks or interceptors.
- Or run a custom proxy layer in front of LiteLLM that does budget checks.
Conceptual pseudo-code for a pre-request check:
```python
def pre_request_hook(request, team_id: str):
    # 1. Load team budget and usage (from cache/DB)
    budget = get_team_budget(team_id)         # e.g., 100 USD
    usage = get_current_month_usage(team_id)  # e.g., 83 USD

    if budget.hard_cap and usage >= budget.monthly_budget_usd:
        # Reject immediately
        return {
            "allowed": False,
            "status_code": 403,
            "message": "Team budget cap reached for this month.",
        }

    # Optional: if close to budget, you might still allow but flag
    return {"allowed": True}
```
Then at post-response, you update usage:
```python
def post_response_hook(request, team_id: str, response):
    # Calculate cost from tokens & model
    cost = estimate_cost(response.model, response.prompt_tokens, response.completion_tokens)
    increment_monthly_usage(team_id, cost, response.total_tokens)

    # If the new usage crosses thresholds, trigger alerts
    check_and_fire_alerts(team_id)
```
LiteLLM offers extensibility points for this kind of behavior; check its proxy docs for the latest hook/middleware interface.
BerriAI will simply see a standard HTTP 403/429 when a request is rejected; you can handle that in your application or UI.
Step 6: Configure alerts when teams approach their cap
Alerts are critical for “near cap” visibility. Typical alert thresholds:
- Soft threshold (e.g., 70–90% of budget) → Slack/email alert.
- Hard threshold (>=100% of budget) → escalation and maybe auto-block.
You can implement alerts with:
Option A: Metrics-based (Prometheus / Grafana / Datadog)
Expose a metric like:
```text
litellm_team_monthly_cost_usd{team_id="marketing"} 83.5
litellm_team_monthly_budget_usd{team_id="marketing"} 100
```
Then define an alert rule:
```yaml
- alert: TeamApproachingBudgetCap
  expr: (litellm_team_monthly_cost_usd / litellm_team_monthly_budget_usd) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Team {{ $labels.team_id }} approaching monthly budget cap"
    description: "Team {{ $labels.team_id }} has used >80% of its monthly budget."
```
The alert manager then routes notifications to Slack, email, or PagerDuty.
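If you'd rather not pull in a metrics library, the two gauges above can be rendered by hand in the Prometheus text exposition format. A minimal sketch (the metric names come from the example above; the function is hypothetical):

```python
def render_budget_metrics(usage: dict, budgets: dict) -> str:
    """Render per-team cost and budget gauges in Prometheus text format.

    `usage` maps team_id -> month-to-date cost in USD;
    `budgets` maps team_id -> monthly budget in USD.
    """
    lines = []
    for team_id, cost in sorted(usage.items()):
        lines.append(f'litellm_team_monthly_cost_usd{{team_id="{team_id}"}} {cost}')
        lines.append(f'litellm_team_monthly_budget_usd{{team_id="{team_id}"}} {budgets[team_id]}')
    return "\n".join(lines) + "\n"
```

Serve this string from a `/metrics` endpoint and Prometheus can scrape it like any other exporter.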
Option B: Log / DB-based cron job
If you don’t use metrics:
- Run a job every 5–10 minutes.
- Query your `team_usage` table.
- For each team:
  - Compute `usage / budget`.
  - If it crosses the soft threshold and `alert_sent` is false:
    - Send Slack/email.
    - Mark `alert_sent = true` so you don’t spam.
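The cron check itself can be sketched as follows; the row shape and the `send_alert` callable are assumptions standing in for your DB query and notifier:

```python
def check_team_budgets(rows, send_alert, soft_threshold=0.8):
    """Scan (team_id, usage_usd, budget_usd, alert_sent) rows and fire
    one-time alerts. Returns the team_ids alerted this run."""
    alerted = []
    for row in rows:
        pct = row["usage_usd"] / row["budget_usd"]
        if pct >= soft_threshold and not row["alert_sent"]:
            send_alert(row["team_id"], pct, row["budget_usd"])
            row["alert_sent"] = True  # persist this flag in your DB in practice
            alerted.append(row["team_id"])
    return alerted
```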
Example Slack notification (via webhook):
```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/your-webhook"  # placeholder

def send_budget_alert(team_id, pct_used, budget_usd):
    text = (
        f"⚠️ *LLM Budget Alert*\n"
        f"Team: `{team_id}`\n"
        f"Usage: {pct_used:.0%} of ${budget_usd:.2f} monthly budget.\n"
        f"Consider throttling usage or increasing budget if needed."
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": text})
```
BerriAI teams can also add an internal “usage” dashboard so stakeholders can see where they stand against their cap.
Step 7: Combining RPM/TPM limits with budget caps
You should treat RPM/TPM and budgets as complementary:
- RPM/TPM: Prevent short-term spikes that overrun provider limits and degrade reliability.
- Monthly budgets: Govern long-term cost.
Recommended pattern:
- Configure conservative RPM/TPM defaults, slightly below provider hard limits.
- Adjust team-specific RPM/TPM based on their usage profile.
- Layer monthly budgets on top so even if a team runs near their max RPM/TPM constantly, they still can’t exceed their monthly spend.
Edge cases to handle:
- A team hits the RPM limit but is under budget:
  - Return 429 with “Rate limit exceeded, try again later.”
- A team is under RPM but the budget is exhausted:
  - Return 403 (or 402) with “Budget cap reached, contact admin.”
- A team hits the soft budget threshold:
  - Continue serving traffic.
  - Send warnings and optionally reduce RPM automatically (dynamic throttling).
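These edge cases boil down to a small decision function; the status codes follow the mapping suggested above:

```python
def admission_decision(over_rpm: bool, over_budget: bool, hard_cap: bool):
    """Map rate-limit and budget state to an HTTP status and message.

    Budget exhaustion (with a hard cap) wins over rate limiting, since
    retrying later won't help a team that is out of budget.
    """
    if over_budget and hard_cap:
        return 403, "Budget cap reached, contact admin."
    if over_rpm:
        return 429, "Rate limit exceeded, try again later."
    return 200, "OK"
```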
Dynamic throttling example (pseudo-config):
```yaml
dynamic_rate_limits:
  on_soft_budget_hit:
    reduce_rpm_pct: 0.5   # halve the RPM when the soft threshold is reached
```
You’d implement this by adjusting in-memory or database-backed rate limit configs when alerts trigger.
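Applying the `reduce_rpm_pct` factor when the soft threshold fires might look like this sketch, with a plain dict standing in for whatever config store (Redis, DB, in-memory) the proxy reads at request time:

```python
def apply_soft_budget_throttle(rate_limits: dict, team_id: str,
                               reduce_rpm_pct: float = 0.5) -> int:
    """Scale down a team's RPM when its soft budget threshold is hit.

    `rate_limits` is a hypothetical mutable config store keyed by team_id;
    the floor of 1 RPM keeps the team from being silently zeroed out.
    """
    new_rpm = max(1, int(rate_limits[team_id]["rpm"] * reduce_rpm_pct))
    rate_limits[team_id]["rpm"] = new_rpm
    return new_rpm
```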
Step 8: Governance and admin workflows
To make this sustainable, add simple workflows:
- Admin UI / config for:
  - Creating a team.
  - Assigning a monthly budget.
  - Setting RPM/TPM caps.
- Audit logs:
  - Record when budgets or limits are changed, by whom, and why.
- Self-service visibility for teams:
  - A dashboard showing:
    - Month-to-date spend.
    - Remaining budget.
    - Current RPM/TPM usage vs. cap.
    - Recent alerts.
BerriAI can sit at the top of this stack as the “experience layer” for your users, while LiteLLM remains the enforcement engine.
Practical example: putting it all together
Imagine an organization with three teams using BerriAI + LiteLLM:
- Marketing: many small queries; low budget.
- Research: heavy experimentation; high budget.
- Product: moderate usage; medium budget.
You might configure:
- Marketing:
  - RPM: 30
  - TPM: 15k
  - Monthly budget: $100
  - Soft alert: 80%
- Research:
  - RPM: 200
  - TPM: 100k
  - Monthly budget: $1,000
  - Soft alert: 85%
- Product:
  - RPM: 80
  - TPM: 40k
  - Monthly budget: $300
  - Soft alert: 90%
LiteLLM proxy:
- Reads `X-Team-ID` from each request (set by BerriAI or your backend).
- Applies rate limits per team.
- Logs every call with cost and tokens.
Your budget service:
- Aggregates usage hourly.
- Updates per-team monthly totals.
- Fires alerts when thresholds are crossed.
- Tells LiteLLM when to block or throttle (via config/DB).
This pattern gives you predictable performance, cost control, and clear alerts without rewriting your BerriAI apps or deeply customizing every service.
Best practices and tips
To keep BerriAI / LiteLLM usage under control and predictable:
- Centralize all provider access through LiteLLM proxy: Avoid direct-to-provider calls that bypass your limits.
- Use consistent team identifiers: Stick to one scheme (e.g., `X-Team-ID`) across all apps and BerriAI workflows.
- Start with generous but safe limits: Log and observe, then tighten as you understand usage patterns.
- Automate budget resets: Use UTC month boundaries and automatic reset tasks, not manual resets.
- Integrate with your existing alerting: Reuse Slack channels and on-call flows your teams already trust.
- Periodically review per-team configuration: Usage grows; revisit RPM/TPM and budgets monthly or quarterly.
With these steps, BerriAI and LiteLLM become not only your LLM gateway but also your control plane for rate limits, spend governance, and proactive alerts—exactly what you need to safely scale AI usage across multiple teams.