LLM Gateway & Routing

BerriAI / LiteLLM: how do we enforce RPM/TPM limits and monthly budgets per team, with alerts when they’re close to the cap?

10 min read

Most teams adopting BerriAI and LiteLLM quickly realize they need more than just an API key—they need guardrails. You want to cap requests per minute (RPM), tokens per minute (TPM), and set monthly budgets per team, with clear alerts before anything blows past the limit. The good news: LiteLLM’s proxy and BerriAI’s orchestration features are designed to make this kind of governance possible with relatively little code.

Below is a practical walkthrough of how to enforce RPM/TPM limits and monthly budgets per team, with alerts when costs approach (or hit) your caps.


Why you need RPM, TPM, and budget limits per team

When multiple teams share the same LLM infrastructure (via BerriAI or LiteLLM), a few risks appear:

  • One team can accidentally exhaust your OpenAI / Anthropic / Azure quota.
  • Cost can spike unexpectedly without visibility.
  • A rogue script can easily exceed rate limits and trigger provider throttling or bans.

Enforcing per-team RPM/TPM and monthly spend limits, plus alerts, gives you:

  • Fair usage: Each team has a predictable share of the capacity.
  • Cost control: You cap total spend per team and globally.
  • Operational safety: Rate limiting prevents sudden bursts that trigger provider errors.

The rest of this guide shows how to configure this using LiteLLM proxy (often alongside BerriAI) with real examples.


Architecture overview: where BerriAI and LiteLLM fit

A common pattern looks like this:

  1. Clients / apps → call BerriAI or your own app backend.
  2. BerriAI / your backend → sends requests to LiteLLM Proxy instead of directly to OpenAI, Anthropic, etc.
  3. LiteLLM Proxy:
    • Applies RPM/TPM limits per team.
    • Tracks usage and cost.
    • Enforces monthly budgets per team.
    • Emits logs/metrics that you use for alerts.

Key concept: you treat LiteLLM proxy as the single choke point for all LLM traffic. This is where you enforce limits and budgets.


Prerequisites

Before enforcing RPM/TPM limits and monthly budgets per team, make sure you have:

  • A running LiteLLM proxy (e.g., via Docker, Kubernetes, or bare metal).
  • Your provider keys (OpenAI / Anthropic / Azure / etc.).
  • A way to identify which team each request belongs to (e.g., custom header, API key, or auth token).
  • Basic logging and alerting infrastructure (e.g., Prometheus + Grafana, Datadog, or a simple email/Slack integration).

If you’re using BerriAI, you’ll typically configure BerriAI to call the LiteLLM proxy endpoint instead of provider APIs directly.
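Because the LiteLLM proxy speaks the OpenAI-compatible /v1/chat/completions API, pointing a client at it is mostly a matter of swapping the base URL and tagging each request with the team. A minimal sketch (the proxy URL, key, and X-Team-ID header below are placeholders for your own setup, not fixed LiteLLM names):

```python
import json

LITELLM_PROXY_URL = "https://your-litellm-proxy/v1/chat/completions"  # placeholder

def build_chat_request(team_id: str, model: str, messages: list) -> tuple:
    """Build an OpenAI-compatible request routed through the LiteLLM proxy.

    Returns (url, headers, body) so the caller can send it with any HTTP client.
    """
    headers = {
        "Content-Type": "application/json",
        "X-Team-ID": team_id,                      # team identifier the proxy keys limits on
        "Authorization": "Bearer sk-litellm-key",  # placeholder proxy key
    }
    body = json.dumps({"model": model, "messages": messages})
    return LITELLM_PROXY_URL, headers, body
```

Keeping request construction in one helper makes it hard for any app to "forget" the team tag and silently fall into the default quota bucket.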


Step 1: Define teams and how to identify them

To enforce per-team limits, LiteLLM must know which team a request belongs to. The most common strategies:

  • Per-team API keys: Each team uses a unique LiteLLM key (simplest).
  • Auth / JWT claims: A claim like team_id is attached by your auth layer.
  • Custom headers: e.g., X-Team-ID: marketing, X-Team-ID: research.

In BerriAI, you can attach headers or tokens when it calls LiteLLM. For example, if BerriAI is part of your stack:

import requests

def call_llm_via_litellm(team_id: str, payload: dict):
    """Send a chat completion request through the LiteLLM proxy, tagged with the team."""
    response = requests.post(
        "https://your-litellm-proxy/v1/chat/completions",
        headers={"X-Team-ID": team_id},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()  # surfaces 429 (rate limit) / 403 (budget) to the caller
    return response.json()

Your LiteLLM configuration (or middleware) can then read X-Team-ID and map it to the appropriate quota and budget.


Step 2: Configure RPM and TPM limits per team in LiteLLM

LiteLLM provides rate-limiting support through its proxy. While config format can evolve, the pattern is:

  • Enable rate limiting globally.
  • Define per-identity (team) limits for RPM and TPM.
  • Use the team identifier (e.g., header or API key) as the key for rate limiting.

An example config.yaml snippet (conceptual):

proxy:
  enabled: true
  host: 0.0.0.0
  port: 8000

rate_limits:
  enabled: true
  # Identify the client / team by this header or token
  identifier_header: "X-Team-ID"   # or "Authorization" / "X-API-Key"

  # Default limits (used if no team-specific rule matches)
  default:
    rpm: 60         # 60 requests per minute
    tpm: 30000      # 30k tokens per minute

  # Team-specific limits
  teams:
    marketing:
      rpm: 30       # marketing: up to 30 RPM
      tpm: 15000    # marketing: up to 15k TPM
    research:
      rpm: 120      # research: up to 120 RPM
      tpm: 60000
    product:
      rpm: 80
      tpm: 40000

With this structure:

  • Each incoming request uses X-Team-ID as a key.
  • LiteLLM maintains counters per team for the current time window.
  • If a team exceeds RPM or TPM, LiteLLM returns a 429 error (or similar) until the window resets.

In BerriAI-driven apps, you can surface these errors back to the user (e.g., “Team rate limit exceeded, try again in a minute”).
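On the client side, a simple way to soften those 429s is exponential backoff before surfacing the error to the user. A sketch, with the actual HTTP call injected as `send_fn` so it works with any client library:

```python
import time

def call_with_backoff(send_fn, max_retries=3, base_delay=1.0):
    """Retry a proxy call when it is rate limited (HTTP 429).

    `send_fn` performs the request and returns (status_code, body).
    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(max_retries + 1):
        status, body = send_fn()
        if status != 429:
            return status, body
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))
    return status, body  # still rate limited after all retries
```

If the proxy returns a Retry-After header, prefer that value over the computed delay.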


Step 3: Track usage and cost per team

To enforce monthly budgets, you must track:

  • Number of requests.
  • Tokens consumed.
  • Provider cost (per model) over time.
  • Grouped by team.

LiteLLM typically exposes:

  • Logs: Structured logs with model, prompt tokens, completion tokens, cost, and metadata.
  • Metrics: Prometheus endpoints or similar for usage per key/team.

For example, the proxy might log JSON lines like:

{
  "timestamp": "2026-04-01T12:30:05Z",
  "team_id": "marketing",
  "model": "gpt-4.1-mini",
  "prompt_tokens": 250,
  "completion_tokens": 200,
  "total_tokens": 450,
  "cost_usd": 0.0025,
  "request_id": "req_123",
  "status": 200
}

You can:

  • Push these logs into Postgres, BigQuery, or a data warehouse.
  • Or scrape metrics via Prometheus and aggregate by team_id.

In BerriAI pipelines, you can also join these metrics with application-level data (which workflows, which users, etc.) for more detailed cost attribution.
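As a starting point, a small aggregator over those JSON log lines gives you per-team totals before anything lands in a warehouse. A sketch assuming the log fields shown above (team_id, cost_usd, total_tokens):

```python
import json
from collections import defaultdict

def aggregate_costs(log_lines):
    """Sum cost, tokens, and request count per team from JSON log lines."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "total_tokens": 0, "requests": 0})
    for line in log_lines:
        rec = json.loads(line)
        t = totals[rec["team_id"]]
        t["cost_usd"] += rec["cost_usd"]
        t["total_tokens"] += rec["total_tokens"]
        t["requests"] += 1
    return dict(totals)
```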


Step 4: Define monthly budgets per team

Next, translate your financial constraints into a config your platform can enforce.

For example:

budgets:
  enabled: true
  reset_day: 1      # Day of month when budgets reset (UTC)

  teams:
    marketing:
      monthly_budget_usd: 100.0
      soft_threshold_pct: 0.8    # Alert at 80% of budget
      hard_cap: true             # Block when budget reached
    research:
      monthly_budget_usd: 500.0
      soft_threshold_pct: 0.85
      hard_cap: true
    product:
      monthly_budget_usd: 250.0
      soft_threshold_pct: 0.9
      hard_cap: false            # Allow exceed but log warnings

Internally, you’ll maintain a usage table or metrics series like:

CREATE TABLE team_usage (
  team_id         text,
  period_start    date,         -- month start
  period_end      date,         -- month end
  total_cost_usd  numeric,
  total_tokens    bigint,
  total_requests  bigint,
  PRIMARY KEY (team_id, period_start)
);

A scheduled job (cron, Airflow, or a small serverless function) can:

  1. Aggregate logs/metrics into this table.
  2. Compare total_cost_usd vs. monthly_budget_usd.
  3. Update a “current status” cache (e.g., Redis, in-memory store, or config DB) used by the LiteLLM proxy at request time.
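Step 1 of that job can be sketched with an upsert against the team_usage table. SQLite (stdlib) is used here for illustration; the ON CONFLICT clause translates directly to Postgres:

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS team_usage (
  team_id         TEXT,
  period_start    TEXT,   -- month start (ISO date)
  period_end      TEXT,   -- month end (ISO date)
  total_cost_usd  REAL,
  total_tokens    INTEGER,
  total_requests  INTEGER,
  PRIMARY KEY (team_id, period_start)
);
"""

def init_db(conn):
    conn.executescript(DDL)

def upsert_team_usage(conn, team_id, period_start, period_end,
                      cost_usd, tokens, requests_):
    """Add one aggregation batch to a team's month-to-date totals."""
    conn.execute(
        """
        INSERT INTO team_usage (team_id, period_start, period_end,
                                total_cost_usd, total_tokens, total_requests)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(team_id, period_start) DO UPDATE SET
            total_cost_usd = total_cost_usd + excluded.total_cost_usd,
            total_tokens   = total_tokens + excluded.total_tokens,
            total_requests = total_requests + excluded.total_requests
        """,
        (team_id, period_start, period_end, cost_usd, tokens, requests_),
    )
```

Because the upsert is additive, the job can safely process logs in batches; just make sure each log line is counted exactly once (e.g., track a high-water mark on request_id or timestamp).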

Step 5: Enforce budgets at request time

To make LiteLLM actually block (or warn) when budgets are hit, you can:

  • Use LiteLLM’s pre-request middleware hooks or interceptor.
  • Or run a custom proxy layer in front of LiteLLM that does budget checks.

Conceptual pseudo-code for a pre-request check:

def pre_request_hook(request, team_id: str):
    # 1. Load team budget and current month-to-date usage (from cache/DB)
    budget = get_team_budget(team_id)          # e.g., monthly_budget_usd=100, hard_cap=True
    usage = get_current_month_usage(team_id)   # e.g., 83 USD

    if budget.hard_cap and usage >= budget.monthly_budget_usd:
        # reject immediately
        return {
            "allowed": False,
            "status_code": 403,
            "message": "Team budget cap reached for this month."
        }

    # Optional: if close to budget, you might still allow but flag
    return {"allowed": True}

Then at post-response, you update usage:

def post_response_hook(request, team_id: str, response):
    # calculate cost from tokens & model
    cost = estimate_cost(response.model, response.prompt_tokens, response.completion_tokens)
    increment_monthly_usage(team_id, cost, response.total_tokens)

    # If new usage crosses thresholds, trigger alerts
    check_and_fire_alerts(team_id)

LiteLLM offers extensibility points for this kind of behavior; check its proxy docs for the latest hook/middleware interface.

BerriAI will simply see a standard HTTP 403/429 when a request is rejected; you can handle that in your application or UI.


Step 6: Configure alerts when teams approach their cap

Alerts are critical for “near cap” visibility. Typical alert thresholds:

  • Soft threshold (e.g., 70–90% of budget) → Slack/email alert.
  • Hard threshold (>=100% of budget) → escalation and maybe auto-block.

You can implement alerts with:

Option A: Metrics-based (Prometheus / Grafana / Datadog)

Expose a metric like:

litellm_team_monthly_cost_usd{team_id="marketing"} 83.5
litellm_team_monthly_budget_usd{team_id="marketing"} 100

Then define an alert rule:

- alert: TeamApproachingBudgetCap
  expr: (litellm_team_monthly_cost_usd / litellm_team_monthly_budget_usd) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Team {{ $labels.team_id }} approaching monthly budget cap"
    description: "Team {{ $labels.team_id }} has used >80% of its monthly budget."

The alert manager then routes notifications to Slack, email, or PagerDuty.

Option B: Log / DB-based cron job

If you don’t use metrics:

  1. Run a job every 5–10 minutes.
  2. Query your team_usage table.
  3. For each team:
    • Compute usage / budget.
    • If it crosses the soft threshold and alert_sent is false:
      • Send Slack/email.
      • Mark alert_sent = true so you don’t spam.
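That check loop can be sketched as a pure function (hypothetical helper shape; an in-memory set stands in for the alert_sent column here):

```python
def evaluate_budget_alerts(rows, alerted):
    """Decide which teams need a soft-threshold alert right now.

    `rows`: iterable of (team_id, cost_usd, budget_usd, soft_threshold_pct).
    `alerted`: set of team_ids already notified this period (dedupe state).
    Returns a list of (team_id, pct_used) to notify, and updates `alerted`.
    """
    to_notify = []
    for team_id, cost, budget, soft_pct in rows:
        pct = cost / budget if budget else 0.0
        if pct >= soft_pct and team_id not in alerted:
            to_notify.append((team_id, pct))
            alerted.add(team_id)
    return to_notify
```

Remember to clear the dedupe state when budgets reset at the start of each month, or the first alert of the new period will be swallowed.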

Example Slack notification (via webhook):

import os
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def send_budget_alert(team_id, pct_used, budget_usd):
    text = (
        f"⚠️ *LLM Budget Alert*\n"
        f"Team: `{team_id}`\n"
        f"Usage: {pct_used:.0%} of ${budget_usd:.2f} monthly budget.\n"
        f"Consider throttling usage or increasing the budget if needed."
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

BerriAI teams can also add an internal “usage” dashboard so stakeholders can see where they stand against their cap.


Step 7: Combining RPM/TPM limits with budget caps

You should treat RPM/TPM and budgets as complementary:

  • RPM/TPM: Prevent short-term spikes that overrun provider limits and degrade reliability.
  • Monthly budgets: Govern long-term cost.

Recommended pattern:

  1. Configure conservative RPM/TPM defaults, slightly below provider hard limits.
  2. Adjust team-specific RPM/TPM based on their usage profile.
  3. Layer monthly budgets on top so even if a team runs near their max RPM/TPM constantly, they still can’t exceed their monthly spend.

Edge cases to handle:

  • A team hits RPM limit but is under budget:
    • Return 429 with “Rate limit exceeded, try again later.”
  • A team is under RPM but budget is exhausted:
    • Return 403 (or 402) with “Budget cap reached, contact admin.”
  • A team hits soft budget threshold:
    • Continue serving traffic.
    • Send warnings and optionally reduce RPM automatically (dynamic throttling).

Dynamic throttling example (pseudo-config):

dynamic_rate_limits:
  on_soft_budget_hit:
    reduce_rpm_pct: 0.5    # halve the RPM when soft threshold reached

You’d implement this by adjusting in-memory or database-backed rate limit configs when alerts trigger.
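A sketch of that adjustment, assuming your rate limiter reads its per-team limits from a mutable config mapping (a Redis hash or config table works the same way):

```python
def apply_soft_budget_throttle(limits, team_id, reduce_rpm_pct=0.5):
    """Cut a team's RPM (by half, by default) when its soft budget threshold is hit.

    `limits` is a mutable mapping like {"marketing": {"rpm": 30, "tpm": 15000}}
    that the rate limiter consults on each request.
    Returns the new RPM so the caller can log the change.
    """
    team = limits[team_id]
    team["rpm"] = max(1, int(team["rpm"] * reduce_rpm_pct))
    return team["rpm"]
```

Record the original RPM somewhere before throttling so it can be restored when the budget resets.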


Step 8: Governance and admin workflows

To make this sustainable, add simple workflows:

  • Admin UI / config for:
    • Creating a team.
    • Assigning a monthly budget.
    • Setting RPM/TPM caps.
  • Audit logs:
    • Record when budgets or limits are changed, by whom, and why.
  • Self-service visibility for teams:
    • A dashboard showing:
      • Month-to-date spend.
      • Remaining budget.
      • Current RPM/TPM usage vs. cap.
      • Recent alerts.

BerriAI can sit at the top of this stack as the “experience layer” for your users, while LiteLLM remains the enforcement engine.


Practical example: putting it all together

Imagine an organization with three teams using BerriAI + LiteLLM:

  • Marketing: many small queries; low budget.
  • Research: heavy experimentation; high budget.
  • Product: moderate usage; medium budget.

You might configure:

  • Marketing:
    • RPM: 30
    • TPM: 15k
    • Monthly budget: $100
    • Soft alert: 80%
  • Research:
    • RPM: 200
    • TPM: 100k
    • Monthly budget: $1,000
    • Soft alert: 85%
  • Product:
    • RPM: 80
    • TPM: 40k
    • Monthly budget: $300
    • Soft alert: 90%
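If you manage teams through LiteLLM's management API, its /team/new endpoint accepts per-team budget and rate-limit fields. A sketch that builds those request bodies for the three teams above (the field names — team_alias, max_budget, rpm_limit, tpm_limit, budget_duration — follow LiteLLM's docs at the time of writing; verify them against your proxy version before relying on this):

```python
import json

TEAM_CONFIGS = {
    "marketing": {"rpm": 30,  "tpm": 15_000,  "budget_usd": 100.0},
    "research":  {"rpm": 200, "tpm": 100_000, "budget_usd": 1000.0},
    "product":   {"rpm": 80,  "tpm": 40_000,  "budget_usd": 300.0},
}

def build_team_payload(team_id: str) -> str:
    """Build a /team/new request body for LiteLLM's management API.

    Field names are taken from LiteLLM's docs at the time of writing;
    check your proxy version before sending these.
    """
    cfg = TEAM_CONFIGS[team_id]
    return json.dumps({
        "team_alias": team_id,
        "max_budget": cfg["budget_usd"],
        "rpm_limit": cfg["rpm"],
        "tpm_limit": cfg["tpm"],
        "budget_duration": "30d",   # budget reset window
    })
```

POST each payload to the proxy's /team/new endpoint with an admin key, then issue each team's virtual keys under its team so usage rolls up automatically.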

LiteLLM proxy:

  • Reads X-Team-ID from each request (set by BerriAI or your backend).
  • Applies rate limits per team.
  • Logs every call with cost and tokens.
  • Your budget service:
    • Aggregates usage hourly.
    • Updates per-team monthly totals.
    • Fires alerts when thresholds are crossed.
    • Tells LiteLLM when to block or throttle (via config/DB).

This pattern gives you predictable performance, cost control, and clear alerts without rewriting your BerriAI apps or deeply customizing every service.


Best practices and tips

To keep BerriAI / LiteLLM usage under control and predictable:

  • Centralize all provider access through LiteLLM proxy
    Avoid direct-to-provider calls that bypass your limits.

  • Use consistent team identifiers
    Stick to one scheme (e.g., X-Team-ID) across all apps and BerriAI workflows.

  • Start with generous but safe limits
    Log and observe, then tighten as you understand usage patterns.

  • Automate budget resets
    Use UTC month boundaries and automatic reset tasks, not manual resets.

  • Integrate with your existing alerting
    Reuse Slack channels and on-call flows your teams already trust.

  • Periodically review per-team configuration
    Usage grows; revisit RPM/TPM and budgets monthly or quarterly.

With these steps, BerriAI and LiteLLM become not only your LLM gateway but also your control plane for rate limits, spend governance, and proactive alerts—exactly what you need to safely scale AI usage across multiple teams.