Modal Team plan: does it increase GPU concurrency from 10 to 50 and containers from 100 to 1000?

Most teams start on Modal’s Free or Solo plan, hit the default GPU and container limits, and then wonder: does upgrading to the Team plan simply “turn the dials up” from 10 to 50 concurrent GPUs and from 100 to 1000 containers? The short version: the Team plan is what unlocks higher, flexible capacity and real production guardrails, but the exact numbers depend on policy and workload rather than on a hardcoded 10→50 or 100→1000 jump.

Quick Answer: The Modal Team plan does increase your practical GPU concurrency and container capacity beyond the default Free/Solo levels, but not as a fixed “10 to 50” or “100 to 1000” guarantee. Instead, Team gives you access to higher ceilings, elastic GPU pools across clouds, and Modal’s engineering team to right‑size limits for your specific workloads and spend profile.

Why This Matters

If you’re running serious AI workloads—LLM inference, fine-tuning, massive eval sweeps, RL environments, MCP servers—the default concurrency and container caps are usually the first bottleneck you hit. You don’t want a marketing-friendly number; you want predictable capacity that doesn’t force you to overprovision or throttle your own customers.

The Team plan is Modal’s line between “fast experimentation” and “this is production infrastructure.” It’s where you get:

  • Higher and tunable concurrency limits.
  • Access to thousands of GPUs across clouds with intelligent scheduling.
  • Operational controls (retries, timeouts, observability, security) that let you safely push those limits.

Key Benefits:

  • Higher scalable concurrency: Move beyond starter limits on GPUs and containers so batch jobs, evals, and endpoints can fan out in parallel without constant quota juggling.
  • Elastic GPU capacity across clouds: Tap into Modal’s multi-cloud capacity pool for spiky workloads without managing reservations, quotas, or regional GPU hunting.
  • Production-grade control & support: Use programmable limits, observability, and direct access to Modal’s engineering team to design capacity that matches your SLOs and budget.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| GPU concurrency | How many GPU-backed containers (e.g., gpu="A100") can be executing at the same time under your account/organization. | Directly controls how many inferences, training jobs, or eval shards you can run in parallel without queuing. |
| Container capacity | The total number of containers (CPU and GPU) you can have running concurrently across all apps and functions. | Determines how wide you can fan out batch workloads (.map(), .spawn()) and how many replicas can back your endpoints. |
| Programmable scaling | Defining concurrency, hardware, and autoscaling behavior in Python (functions, classes, decorators) instead of external YAML/config. | Lets you tie capacity settings directly to code, tests, and deployments, with no separate infra layer to keep in sync. |

How It Works (Step-by-Step)

Think of Modal’s plans as permission envelopes around the same runtime. The runtime can scale to thousands of GPUs and containers; your plan defines how far you’re allowed to push it and how much help you get from humans when you do.

  1. You define your workload in code

    You don’t provision “50 GPUs” manually. You tell Modal what a unit of work looks like, what hardware it needs, and how it should scale. For example:

    import modal
    
    app = modal.App("team-plan-capacity-example")
    
    image = (
        modal.Image.debian_slim()
        .pip_install("torch", "transformers")
    )
    
    @app.function(
        image=image,
        gpu="A10G",
        timeout=600,           # max per-call runtime, in seconds
        concurrency_limit=50,  # cap on this function's containers (named max_containers in newer Modal releases)
    )
    async def run_inference(batch):
        # ...run your model...
        return ...
    

    Here you’re requesting something like “up to 50 concurrent GPU-backed containers for this function,” but whether you actually get 50 depends on your account limits and current load.

  2. Your account plan sets global ceilings

    Under the hood, Modal enforces:

    • An organization-level cap on total containers (CPU + GPU).
    • An organization-level GPU concurrency cap.
    • Optional per-function or per-app limits you define (concurrency_limit, modal.Retries, timeouts).

    On Free/Solo, these ceilings are conservative so you can’t accidentally spin up huge GPU fleets with a stray .map(). On Team, those ceilings are significantly higher and negotiable. You might start with something like “50 concurrent GPUs, 1000 containers” if that matches your workload and budget, but those numbers come from a conversation with Modal, not a universal default.

  3. Autoscaling and scheduling fill the envelope

    When you drive load (e.g., run_inference.map(batches)):

    batches = [...]
    results = list(run_inference.map(batches))
    

    Behind the scenes, Modal:

    • Schedules containers across its multi-cloud capacity pool (H100s, A100s, A10Gs, etc.).
    • Respects both your plan limits and your function-level concurrency_limit.
    • Scales replicas up to handle spikes, then scales back to zero when idle.

    If your Team plan has a GPU concurrency cap of 50 and your function’s concurrency_limit is 50, Modal will happily fan out to ~50 parallel GPUs. If you try to push beyond the org cap (say you declare concurrency_limit=200), jobs will queue rather than provision more GPUs than your plan allows.
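The interaction between the three limits described in the steps above can be sketched in plain Python. This is a back-of-the-envelope model, not Modal's scheduler: it assumes the org-level GPU cap counts GPU-backed containers (as in the definitions table), that each container handles one input at a time, and that no other function is competing for the same caps. The helper names `effective_parallelism` and `waves_needed` are hypothetical.

```python
import math


def effective_parallelism(function_limit, org_gpu_cap, org_container_cap):
    """Containers of one GPU function that can run at once.

    Assumption: the tightest of the three caps wins.
    """
    return max(0, min(function_limit, org_gpu_cap, org_container_cap))


def waves_needed(num_inputs, parallelism):
    """Rough number of 'waves' needed to drain a .map() over num_inputs.

    Assumption: one input per container at a time, all inputs take
    about the same time, and the rest simply queue.
    """
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    return math.ceil(num_inputs / parallelism)


# Declaring a limit of 200 under an org GPU cap of 50 still yields 50.
parallel = effective_parallelism(function_limit=200, org_gpu_cap=50, org_container_cap=1000)
print(parallel)                      # 50
print(waves_needed(1000, parallel))  # 20
```

The point of the sketch is that raising a function-level limit above the org cap changes nothing except how much work sits in the queue.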

Common Mistakes to Avoid

  • Assuming fixed numeric jumps across all Team plans:
    Don’t hard-code your architecture around “Team = 50 GPUs and 1000 containers.” Those are plausible numbers but not guaranteed or universal. Always design with backpressure and retries (modal.Retries) so you can operate under whatever limit you negotiate.

  • Ignoring function-level concurrency limits:
    If you only think at the plan level (“we have 50 GPUs”), it’s easy to create a single hot function with no concurrency_limit that crowds out everything else. Treat per-function concurrency as your first safety valve; use it to partition capacity between endpoints, training, and batch jobs.

Real-World Example

Say your team is building:

  • A low-latency LLM inference endpoint.
  • A nightly eval sweep that hammers the model with 100k test prompts.
  • A training/fine-tuning job that occasionally uses multi-GPU nodes (gpu="A100:2").

On a starter plan, you might:

  • Hit a GPU concurrency cap while evals are running, causing endpoint latency to spike because new containers can’t scale up.
  • Queue training jobs for hours because evals are hogging the limited GPUs.

On the Team plan, you work with Modal to set:

  • An org-level GPU concurrency cap that’s high enough to run evals and production traffic together.

  • Separate per-function concurrency patterns, e.g.:

    from fastapi import Request
    
    @app.function(gpu="A10G", concurrency_limit=10)
    @modal.fastapi_endpoint()
    async def chat_endpoint(request: Request):
        ...
    
    @app.function(gpu="A10G", concurrency_limit=40)
    def eval_job(batch):
        ...
    
    # Two A100s per container, at most two containers, four-hour timeout.
    @app.function(gpu="A100:2", concurrency_limit=2, timeout=60 * 60 * 4)
    def finetune(*args, **kwargs):  # placeholder signature
        ...
    

Now evals can fan out aggressively (up to 40 containers) and training gets its own band, but neither can grow past its limit. A per-function limit is a ceiling rather than a reservation, so the endpoint's slice (up to 10 concurrent A10G containers) is protected only as long as the org-level cap covers the sum of the per-function limits. The Team plan gives you a big enough concurrency envelope that this partitioning actually works, plus access to the Modal engineering team if you need to bump limits when traffic or eval volume grows.
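A quick sanity check on a partition like this is to add up what the per-function limits could demand at peak and compare that with the org-level cap. The sketch below is plain Python with hypothetical helper names. Whether your cap counts containers or individual GPUs is something to confirm with Modal, so it reports both.

```python
def gpus_per_container(gpu_spec):
    """Parse a GPU spec string: 'A10G' -> 1, 'A100:2' -> 2."""
    _, _, count = gpu_spec.partition(":")
    return int(count) if count else 1


def peak_demand(plan):
    """Return (containers, gpus) if every function hits its limit at once."""
    containers = sum(limit for _, limit in plan.values())
    gpus = sum(limit * gpus_per_container(spec) for spec, limit in plan.values())
    return containers, gpus


# (GPU spec, per-function container limit), taken from the example above.
plan = {
    "chat_endpoint": ("A10G", 10),
    "eval_job": ("A10G", 40),
    "finetune": ("A100:2", 2),
}

containers, gpus = peak_demand(plan)
print(containers, gpus)  # 52 54
```

With these limits the worst case slightly exceeds 50 on either count, so under a cap of exactly 50 one of the three functions would occasionally queue. Lowering the eval limit or asking for a somewhat higher cap closes the gap.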

Pro Tip: When you talk to Modal about a Team plan, come with rough numbers: peak TPS, acceptable p95 latency, desired eval batch sizes, training schedule, and which GPU types you care about. That makes it much easier to set GPU concurrency and container caps that don’t surprise you later.
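To turn peak TPS and a latency target into a first-guess container count, Little's law (requests in flight ≈ arrival rate × time in system) is a reasonable starting point. The sketch below is an estimate under stated assumptions, not a Modal formula: it uses p95 latency in place of mean latency, which errs on the conservative side, and `concurrent_per_container` and `headroom` are hypothetical tuning knobs.

```python
import math


def containers_for_endpoint(peak_tps, p95_latency_s,
                            concurrent_per_container=1, headroom=1.0):
    """Estimate containers needed to serve peak_tps within the latency target."""
    in_flight = peak_tps * p95_latency_s  # Little's law: L = lambda * W
    return math.ceil(in_flight * headroom / concurrent_per_container)


# 20 requests/s at 0.5 s each keeps about 10 requests in flight.
print(containers_for_endpoint(20, 0.5))                # 10
print(containers_for_endpoint(20, 0.5, headroom=1.5))  # 15
```

The same arithmetic applied to eval sweeps and training schedules gives you the rough per-function numbers to bring to that conversation.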

Summary

The Modal Team plan doesn’t ship with a universal “10 GPUs → 50 GPUs, 100 containers → 1000 containers” upgrade sticker. Instead, it lifts you out of the starter sandbox into a regime where:

  • You can run serious parallel workloads on elastic GPU capacity across clouds.
  • You can express concurrency, hardware, and autoscaling in Python and expect the runtime to keep up.
  • You can work with Modal’s engineering team to set and adjust GPU concurrency and container limits that match your workloads and SLOs.

If you’re bumping into current caps or planning workloads that obviously will—massive evals, RL environments, high-volume MCP servers—the Team plan is the point where Modal stops being a toy and becomes your primary AI infrastructure layer.
