
Modal Team plan: does it increase GPU concurrency from 10 to 50 and containers from 100 to 1000?
Most teams look at the Modal Team plan when they’re about to hit the limits of the individual “Starter” tier—usually because GPU-heavy workloads are saturating concurrent jobs and container counts. The natural question is: does upgrading actually bump GPU concurrency from 10 to 50 and container limits from 100 to 1000?
Quick Answer: The Team plan is designed for higher concurrency and larger fleets than individual tiers, but exact numbers (like “10 → 50 GPUs” or “100 → 1000 containers”) are not hard-coded public guarantees. In practice, Team customers get higher default limits and can request custom ceilings for GPUs and containers based on real workloads. If you need 50+ concurrent GPUs or 1000+ containers, Modal will size and approve those limits with you rather than via a fixed “Team = 5×” rule.
Why This Matters
If you’re running serious inference, training, or batch workloads, concurrency limits are not a pricing footnote—they’re the difference between finishing a job in minutes vs. hours, or absorbing a traffic spike vs. dropping requests. Moving from a solo account to a Team plan is usually about one thing: unlocking enough GPU concurrency and container capacity that your infra stops being the bottleneck.
Key Benefits:
- Higher default concurrency ceilings: Team plans are tuned for production workloads, so you can fan out more GPUs and containers before you hit “limit exceeded” errors.
- Elastic capacity when you need it: Access to Modal’s multi-cloud GPU pool means you can burst for evals, RL environments, and MCP servers without manually juggling quotas.
- Configurable limits via support: Instead of guessing what’s allowed, you can work with Modal to set concurrency and container caps that match your actual traffic and batch patterns.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| GPU concurrency | How many GPU-backed containers can run at the same time in your account (e.g., A10Gs / A100s concurrently executing). | Controls how much parallelism you can achieve for inference, training, or batch runs. Directly impacts wall-clock time for workloads. |
| Container count / concurrent functions | The number of Modal containers (CPU or GPU) that can be live and running your functions simultaneously. | Governs throughput: how many tasks, requests, or jobs you can process in parallel before queueing. |
| Plan-based limits | Default ceilings Modal applies per account (Starter vs. Team vs. custom enterprise) for safety, capacity management, and cost control. | Determines when you’ll need to talk to Modal about raising limits; prevents surprise bills and noisy-neighbor problems. |
How It Works (Step-by-Step)
Here’s how concurrency and container capacity typically evolve when you move to a Team plan—and how to think about “10 → 50 GPUs / 100 → 1000 containers” in practice.
1. Start from default limits, not marketing numbers
Modal doesn’t publish an official table that says “Starter = 10 GPUs, Team = 50 GPUs” or “Starter = 100 containers, Team = 1000 containers.” The reality is:
- Starter/individual tiers have conservative defaults, tuned for experimentation and small production loads.
- Team accounts start with higher defaults for both GPU concurrency and total containers, because they’re expected to run real production workloads and spikes.
- Very high numbers (like 50+ concurrent GPUs or 1000+ containers) are usually approved on request, rather than auto-enabled on day one.
Think of the Team plan less as “exactly 5× the Starter limits” and more as “unlocks the ability to negotiate realistic production limits.”
2. Scope your workload and required concurrency
Before you ask “do I get 50 concurrent GPUs?”, it’s useful to quantify what you actually need. For example:
- Inference service:
- Latency target: <200ms including model time and routing overhead
- Peak QPS: 200 requests/second
- Model type: 7B parameter LLM on A10G
- With `@app.cls` and `@modal.concurrent`, you might find one GPU container can handle ~10–20 concurrent requests without tail latency blowing up.
- That implies ~10–20 GPUs at peak, plus headroom.
- Batch/evals fan-out:
- 1M items to process
- Each task takes ~2 seconds on CPU
- You want it done in 10 minutes → that’s ~3,300 task-seconds of work per second → on the order of ~3,300 CPU containers (if single-task containers).
Once you have numbers like this, “10 vs 50 GPUs” and “100 vs 1000 containers” stop being abstract and become ops requirements.
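The two sizing examples above reduce to a few lines of arithmetic. A minimal sketch (the function names are mine, and the per-GPU throughput and headroom figures are illustrative assumptions, not Modal-measured numbers):

```python
import math

def gpus_for_inference(peak_qps: float, qps_per_gpu: float,
                       headroom: float = 1.3) -> int:
    """GPU containers needed to sustain peak_qps, with spare headroom."""
    return math.ceil(peak_qps / qps_per_gpu * headroom)

def containers_for_batch(n_items: int, secs_per_item: float,
                         target_secs: float) -> int:
    """Single-task CPU containers needed to finish n_items on time."""
    total_task_seconds = n_items * secs_per_item
    return math.ceil(total_task_seconds / target_secs)

# Inference: 200 QPS, assuming one A10G container sustains ~15 req/s
print(gpus_for_inference(200, 15))                 # -> 18 GPUs

# Batch: 1M items x 2s each, finished in 10 minutes (600s)
print(containers_for_batch(1_000_000, 2.0, 600))   # -> 3334 containers
```

Note how quickly the batch number blows past a ~100-container default—that gap, not the GPU count, is often what forces the plan conversation.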
3. Use the Team plan to unlock higher limits & work with Modal
On a Team plan, the flow generally looks like this:
- You upgrade from Starter → Team (billing + team controls + governance).
- Your account comes with higher default limits for concurrent containers and GPUs, suitable for most mid-sized production use cases.
- If your workloads look like “hundreds of GPUs” or “thousands of containers,” you reach out via support or your Modal contact with concrete requirements (model type, hardware, regions, target concurrency, budget constraints).
- Modal allocates appropriate capacity within its multi-cloud pool and sets plan-specific limits (e.g., 64 concurrent A10Gs, 2000 containers) so you can operate reliably.
That’s why the right mental model is:
Team plan: a higher baseline + the ability to get 50+ GPUs and 1000+ containers
Not: a fixed, guaranteed “10 → 50” and “100 → 1000” multiplier baked into the pricing page.
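In code, those account-level caps interact with per-function settings. A minimal sketch of where the knobs live, assuming Modal’s current Python SDK (parameter names like `max_containers` and `max_inputs` have changed across SDK versions, so check the docs for yours):

```python
import modal

app = modal.App("capacity-example")  # hypothetical app name

# Each GPU container serves several in-flight requests; the account-level
# plan cap is what ultimately bounds how many containers run at once.
@app.cls(gpu="A10G", max_containers=50)  # autoscaler ceiling for this class
@modal.concurrent(max_inputs=16)         # ~16 in-flight requests per container
class Inference:
    @modal.method()
    def generate(self, prompt: str) -> str:
        ...  # model call goes here
```

The point is that your code declares the shape of the fan-out; the plan (and any custom limits you negotiate) determines how far that shape can actually scale.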
Common Mistakes to Avoid
- Assuming hard-coded limits from blog posts or old docs:
  Limits evolve as Modal adds capacity and new customers. Always treat “10 GPUs” or “100 containers” as examples, not contract terms.
  How to avoid it: Check the current limits in your account or ask Modal support when you’re doing capacity planning.
- Upgrading plans without sizing your workload:
  Moving to Team without a clear idea of whether you need 20 vs 200 GPUs can still leave you underprovisioned.
  How to avoid it: Do a quick back-of-the-envelope capacity plan: peak QPS, latency budget, model time per request, and parallelism from `@modal.concurrent`, `.map()`, and `.spawn()`.
Real-World Example
Imagine you’re building an eval system that runs RL-style environments for an agent. You want to:
- Spin up hundreds of short-lived GPU containers in parallel for simulation.
- Run massive spikes in volume during experiment cycles.
- Keep everything expressed in Python code—no YAML, no manual autoscaling tuning.
On a Starter plan, you might initially be constrained by:
- A modest GPU concurrency cap—say you can only get ~10 concurrent GPUs before jobs start queuing.
- A total container limit that tops out around 100 concurrent containers, which is fine for small eval sets but not for large runs.
You move to a Team plan and talk to Modal about your target:
- “We want to run 500 parallel GPU simulations for 10 minutes, a few times per day.”
- “Each environment uses an A10G, with a maximum timeout of 1 hour.”
- “We also need ~1000 CPU containers for data preprocessing in front of the GPUs.”
Modal responds by:
- Enabling higher default limits on your Team account.
- Setting custom caps that might look like:
- GPU concurrency: 64–128 A10Gs (or a mix of A100s/H100s)
- Containers: 1500–2000 concurrent containers across CPU and GPU
- Validating that the workloads fit within cost and capacity constraints.
The result:
- Your eval runs go from “pipeline that takes half a day because everything queues” to “pipeline that finishes in under 30 minutes,” because concurrency is now shaped by your code and hardware choices, not arbitrary small-account caps.
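Under those limits, the whole fan-out stays plain Python. A hedged sketch of what the eval driver might look like, assuming Modal’s current SDK (the app and function names here are hypothetical):

```python
import modal

app = modal.App("rl-evals")  # hypothetical app name

@app.function(gpu="A10G", timeout=3600)  # 1-hour cap per simulation
def run_env(seed: int) -> dict:
    ...  # run one RL environment episode on the GPU

@app.function()  # CPU-only preprocessing in front of the GPUs
def preprocess(item: dict) -> dict:
    ...

@app.local_entrypoint()
def main():
    # Fan out 500 GPU simulations in parallel; Modal autoscales containers
    # up to whatever concurrency cap your plan allows.
    results = list(run_env.map(range(500)))
```

No YAML and no autoscaling config: the parallelism is just `.map()` over 500 seeds, bounded by your negotiated limits.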
Pro Tip: When you request higher limits on a Team plan, include a short spec: expected QPS or batch size, average and p99 runtime per job, target completion time, preferred GPU types (e.g., A10G, A100:2), and whether you need specific regions or data residency controls. You’ll get a much faster and more accurate answer than “can I have 50 GPUs?”
Summary
The Modal Team plan is the entry point to “real” production capacity: higher default concurrency, more containers, and a multi-cloud GPU pool that can handle aggressive fan-out. But the numbers “10 → 50 GPUs” and “100 → 1000 containers” are not literal, fixed upgrades baked into the plan. Instead, Team gives you:
- Bigger built-in ceilings than Starter.
- The ability to negotiate concrete concurrency and container limits (including 50+ GPUs and 1000+ containers) based on your workload.
- Operational guardrails—timeouts, autoscaling, observability—that let you actually use that capacity without chaos.
If your workloads are already brushing up against individual-tier limits, Team isn’t just about more seats; it’s about removing the concurrency ceiling as a blocker and turning your infra back into a library call.