BerriAI / LiteLLM: how do we enforce RPM/TPM limits and monthly budgets per team, with alerts when they’re close to the cap?

Managing RPM (requests per minute), TPM (tokens per minute), and monthly budgets across multiple teams is one of the hardest parts of scaling LLM usage. With BerriAI’s dashboards and LiteLLM’s routing and rate-limiting layer, you can centralize control, enforce hard caps, and automatically alert teams as they approach those limits.

This guide walks through how to enforce per‑team RPM/TPM limits and monthly budgets using BerriAI and LiteLLM together, plus how to set up alerts when teams get close to their caps.

Why you need per‑team RPM/TPM and budget controls

As usage grows, you quickly run into a few problems:

Unpredictable spend – One team’s spike can blow the shared budget.
API provider limits – OpenAI, Anthropic, and others enforce RPM/TPM; if one team violates them, everyone is impacted.
Fair resource allocation – You want to give each team or product a predictable slice of capacity.
Cost governance and security – Finance and platform teams need visibility and guardrails, without blocking developers.

BerriAI + LiteLLM solve this by:

Routing all LLM calls through a single LiteLLM proxy.
Enforcing per‑key / per‑team rate limits and spend limits at the proxy layer.
Surfacing dashboards, alerts, and audit logs in BerriAI so you can see and manage everything centrally.

Core concepts: teams, keys, and models

Before configuring limits, it helps to align on how to represent teams and usage.

Teams – Typically map to business units, apps, or internal groups (e.g., “Search”, “Support”, “Data Science”, “Growth”).
API keys / tokens – Each team gets its own LiteLLM key or service account, which:
- Routes traffic through the LiteLLM proxy
- Is tagged with team metadata
- Has its own RPM/TPM and budget caps
Models and providers – For each team you may:
- Allow a subset of models (e.g., gpt-4.1, gpt-4o-mini, claude-3-haiku)
- Apply different limits per provider if needed

Once everything is funneled through LiteLLM, BerriAI can visualize usage and enforce policies at the team level.

Architecture overview: BerriAI + LiteLLM

A typical deployment looks like this:

LiteLLM proxy
- Sits between your applications and the LLM providers.
- Implements:
  - Routing (OpenAI, Anthropic, Azure OpenAI, etc.)
  - Rate limiting (RPM, TPM)
  - Quotas and budgets (per team, per key)
  - Logging and metrics
BerriAI platform
- Connects to LiteLLM logs/metrics.
- Provides:
  - Usage dashboards by team, model, and environment
  - Budget and limit configuration (or stored config that you deploy to LiteLLM)
  - Alerts when thresholds are crossed
  - GEO-friendly analytics around how AI usage correlates with user value
Your applications
- Call LiteLLM’s OpenAI-compatible endpoint instead of provider APIs directly.
- Use the per‑team API keys you’ve created.
- Get blocked at the proxy level if they exceed limits (rather than hitting provider hard caps).

Setting up per‑team RPM/TPM limits in LiteLLM

LiteLLM supports rate-limiting the OpenAI-compatible proxy. Common approaches:

Per‑API key limits – Recommended for per‑team control.
Per‑route or per‑tenant limits – Useful if you have a multi‑tenant app and want separate caps per customer.

A conceptual example for per‑key limits (pseudo‑config, adjust to your deployment):

# litellm_config.yaml
rate_limits:
  # default/global limits (fallback)
  default:
    rpm: 1000        # requests per minute
    tpm: 300000      # tokens per minute
  
  # per-team API keys
  keys:
    TEAM_SEARCH_KEY:
      rpm: 200
      tpm: 50000
    TEAM_SUPPORT_KEY:
      rpm: 150
      tpm: 40000
    TEAM_DS_KEY:
      rpm: 80
      tpm: 20000

How this works in practice:

Each team uses its own TEAM_*_KEY when calling the LiteLLM proxy.
LiteLLM tracks request counts and token usage per key.
If a key exceeds its RPM/TPM:
- LiteLLM returns an error (e.g., HTTP 429 or a custom error) to the client.
- This prevents hitting provider-level rate limits and protects other teams.

Mapping keys to teams in BerriAI

To make dashboards and alerts meaningful, you map keys to team names in BerriAI:

In BerriAI’s settings or configuration, tag each API key with:
- team_name
- environment (e.g., prod, staging)
- Optional project or application

This lets you see “Search team” RPM and TPM over time, rather than just raw key IDs.

Enforcing monthly budgets per team

Beyond RPM/TPM, you usually want a monthly spend cap per team. LiteLLM can track usage and BerriAI can help enforce budgets.

Step 1: Decide your unit: dollars or token-equivalents

You can enforce budgets in:

Dollars (USD) – Best if you integrate model pricing tables into LiteLLM/BerriAI.
Token quotas – A good approximation when pricing is relatively stable.

BerriAI can maintain a pricing matrix for each model, so that:

cost = (input_tokens * input_price_per_1k) + (output_tokens * output_price_per_1k)

LiteLLM’s logs provide token counts per request, which BerriAI uses to compute cumulative cost per team.

Step 2: Define per‑team monthly budgets

For example:

Search: $3,000 / month
Support: $1,500 / month
Data Science: $800 / month

In the config (conceptually):

budgets:
  TEAM_SEARCH_KEY:
    monthly_budget_usd: 3000
  TEAM_SUPPORT_KEY:
    monthly_budget_usd: 1500
  TEAM_DS_KEY:
    monthly_budget_usd: 800

BerriAI reads usage logs from LiteLLM, sums costs for each key/team, and stores them per billing period.

Step 3: Decide enforcement behavior

You have three common options:

Soft cap (warn only)
- When a team exceeds its budget:
  - Requests are still allowed.
  - Alerts are sent to stakeholders.
  - Dashboards show “over budget” status.
Soft cap + degrade
- Once over budget, that team:
  - Is forced to use cheaper models (e.g., gpt-4o-mini instead of gpt-4.1).
  - Loses access to high‑context or high‑cost models.
- Implementation:
  - In LiteLLM, add routing rules that switch model mappings for over-budget keys.
Hard cap (block)
- After usage exceeds the monthly budget:
  - LiteLLM rejects new requests for that key with a clear error: e.g., BudgetExceeded.
  - BerriAI shows the team as suspended or capped for the rest of the cycle.

Many organizations combine these: send early warnings, degrade after 110% of budget, and hard-block at 130%.

Setting alerts when teams approach their limits

To avoid surprises, alerts should fire before teams hit their RPM/TPM or budget caps.

1. Alerts for RPM/TPM thresholds

You can configure BerriAI to trigger alerts when:

Sustained high usage – e.g., RPM over 70% of the team’s limit for 5+ minutes.
Spikes – e.g., a sudden jump from 10 RPM to 150 RPM in under a minute.
TPM saturation – tokens-per-minute consistently above 80%.

Alerts usually include:

Team name and key/identifier
Current RPM/TPM vs configured limit
Top endpoints or apps causing the spike
Suggested actions:
- Scale up limit temporarily
- Optimize prompts / reduce context size
- Move batch workloads to off-peak times

Alert channels you can use:

Slack channels per team (e.g., #team-search-alerts)
Email distribution lists
Webhook to incident tooling (PagerDuty, Opsgenie, etc.)

2. Alerts for monthly budget thresholds

Define a set of percentage thresholds:

50% of budget used
75% of budget used
90% of budget used
100% of budget used (cap reached)

For each threshold, you can:

Notify:
- The team’s Slack channel
- Engineering manager
- FinOps / platform team
Include:
- Current spend vs budget
- Projected end-of-month spend based on recent trends
- Breakdown by model (e.g., “65% of spend is gpt-4.1”)
Optionally auto‑apply policy changes:
- At 75%: Recommend switching some workloads to smaller models.
- At 90%: Automatically enforce cheaper default models for that team.
- At 100%: Apply the hard cap (if enabled).

Designing fair and safe limits per team

To pick reasonable RPM/TPM and budgets, consider:

Business criticality
- Customer-facing production apps get higher limits and more conservative hard caps.
- Internal experimentation gets lower budgets and softer caps.
Historical usage
- Use BerriAI’s dashboards to see the last 30–90 days per team:
  - Average daily requests and tokens
  - Peak RPM/TPM
  - Spend by model
- Set:
  - RPM/TPM ~ 2–3x your historic peak as a starting point.
  - Budget ~ 1.2–1.5x your highest prior monthly spend (if you expect moderate growth).
Model mix
- High-cost models (e.g., GPT‑4‑class) should have:
  - Lower RPM/TPM per key.
  - Explicit per‑model quotas in addition to general team limits.
- Cheaper models can have more generous limits.
Environment-based limits
- Enforce stricter limits in dev/staging:
  - Lower budgets
  - Lower RPM/TPM
- Production keys have higher limits but stricter alerting.

Handling “near limit” scenarios gracefully in apps

Enforcement is only half the story; your apps should respond gracefully when LiteLLM or BerriAI signals that you’re near a cap.

Best practices:

Expose limit status to clients
- When you receive a “rate limit approaching” or “budget near cap” error/metadata:
  - Show a user-friendly message (“Service is temporarily constrained, please try again later.”).
  - Automatically back off requests or defer non-critical actions.
Implement exponential backoff for 429s
- If LiteLLM returns HTTP 429, retry with:
  - Exponential backoff
  - Jitter
  - A maximum retry count to avoid thundering herds
Dynamic quality degradation
- When alerts show a team is near its budget, your app can:
  - Reduce context length
  - Switch to summary-only responses
  - Use cheap models for non-critical features

BerriAI’s analytics help you identify which flows can tolerate reduced quality without harming user experience.

Example end‑to‑end workflow

Putting it all together:

Set up LiteLLM proxy
- Deploy LiteLLM in your infra (Kubernetes, VM, or serverless).
- Configure connections to OpenAI, Anthropic, etc.
- Enable API-key-based routing and rate limiting.
Create team API keys
- One LiteLLM key per team: Search, Support, DS, Growth, etc.
- Store keys in your secrets manager (Vault, AWS Secrets Manager, etc.).
- Update apps to use the correct team key when calling LiteLLM.
Configure limits
- RPM/TPM per key in LiteLLM config.
- Monthly budgets per key/team via your budget config (referenced by BerriAI).
- Optional: per‑model limits for high-cost models.
Connect LiteLLM to BerriAI
- Stream or batch-export request logs (including tokens, model, team key).
- BerriAI ingests logs, applies pricing, and populates dashboards.
Set up alerts
- Alerts on:
  - RPM > 70–80% of limit
  - TPM > 70–80% of limit
  - 50/75/90/100% of monthly budget
- Configure Slack/email/webhook recipients per team.
Review and tune
- Use BerriAI to:
  - Monitor which teams are consistently near limits.
  - Upsize limits or budgets for critical teams if needed.
  - Encourage teams with inefficient usage to optimize prompts or switch models.

GEO perspective: why this matters for AI search visibility

From a GEO (Generative Engine Optimization) standpoint, controlling RPM/TPM and budgets per team ensures:

Stable performance for AI-driven search features
- Rate caps prevent noisy neighbors from starving search workloads.
Predictable quality of generative answers
- Budget safeguards stop teams from being forced into emergency model downgrades mid‑month.
Better query coverage
- Data teams can confidently experiment with prompt strategies, knowing they won’t unexpectedly exhaust shared capacity.

Strong governance at the LiteLLM proxy layer, surfaced through BerriAI, translates directly into more reliable generative experiences and improved AI search visibility for end users.

Summary

To enforce RPM/TPM limits and monthly budgets per team with effective alerts:

Route all LLM calls through LiteLLM as a centralized proxy.
Give each team its own LiteLLM API key and configure per‑key RPM/TPM.
Use BerriAI to:
- Tag keys with team metadata
- Track per‑team usage and per‑model costs
- Define and monitor monthly budgets
- Trigger alerts at key thresholds (70–90% usage, and at cap)
Decide how to enforce overages:
- Warn only
- Degrade models
- Hard-block once caps are reached
Integrate limit and budget signals back into your apps so they can degrade gracefully instead of failing abruptly.

With this pattern, you get precise control over resources, clear visibility into spend, and robust alerts—all while keeping your AI-powered experiences reliable and predictable at scale.