
Should I choose Modal Starter or Team if I need more than 3 seats and higher GPU concurrency?
If you’re bumping into the 3-seat cap or hitting GPU concurrency limits, the choice between Modal Starter and Team is no longer just a billing decision; it’s an operational constraint on how fast your engineers can ship. The right plan determines whether you can fan out thousands of GPU jobs for evals, run multiple LLM endpoints at once, and onboard more than a couple of developers without juggling shared logins.
Quick Answer: If you need more than 3 seats and consistently push higher GPU concurrency (multiple simultaneous GPU jobs, heavy batch fan-out, or always-on LLM endpoints), you almost certainly want the Team plan. Starter is great for early prototyping and small workloads, but Team is designed for real production traffic, multi-engineer teams, and higher, predictable access to GPU capacity.
Why This Matters
Plan choice directly impacts how your AI infra behaves under load. On Starter, you can get pretty far with a couple of engineers, a single main LLM endpoint, and some batch jobs. But once you have a small team running inference, training/fine-tuning, evals, and batch pipelines on the same account, constrained seats and concurrency turn into slow feedback loops, queued jobs, and “who’s using the only GPU?” conversations.
Modal is built to let humans iterate at full speed: sub-second cold starts, instant autoscaling, and access to thousands of GPUs across clouds, all defined in Python. That only pays off if your plan actually lets your team use those capabilities concurrently.
Key Benefits:
- Unblocked collaboration: Team-level seat limits match how many engineers actually touch infra, so you can stop sharing credentials or serializing work.
- Higher GPU concurrency: Run more concurrent GPU functions, endpoints, and batch jobs without fighting the scheduler or throttling your own experiments.
- Production-ready guardrails: Get the scaling, governance, and visibility you need to treat Modal as your primary AI runtime, not just a playground.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Seats | The number of individual users (engineers, data scientists, ops) that can be part of your Modal organization. | Determines how many people can independently deploy, monitor, and debug apps without shared accounts. |
| GPU concurrency | How many GPU-backed containers (functions, endpoints, batch jobs) can run at the same time under your plan. | Controls whether you can run training, inference, evals, and R&D workloads in parallel instead of queueing them. |
| Starter vs Team plans | Two pricing tiers: Starter optimized for early-stage usage and smaller teams; Team designed for multi-engineer orgs with sustained and spiky GPU demand. | Choosing the right plan ensures your infra won’t become a bottleneck as your workloads grow and your headcount increases. |
How It Works (Step-by-Step)
The decision boils down to two questions: (1) how many humans need to touch Modal daily, and (2) how much parallel GPU work you want to do.
- Map your team and workloads
  - Count how many people will actually run `modal run` and `modal deploy`, inspect logs, tweak Images, and manage Secrets.
  - List the GPU-heavy flows you care about:
    - LLM inference endpoints (e.g., `@modal.fastapi_endpoint` or `@modal.web_server`)
    - Fine-tuning or training (`@app.function(gpu="A100:2")` jobs that run up to 24 hours)
    - Batch fan-out (`fn.map()` across thousands of inputs)
    - Sandboxed code execution on GPUs (e.g., eval frameworks, MCP servers)
  - If you already need more than 3 people in that loop, or you regularly want multiple GPU jobs running at once, you’re outside the sweet spot of Starter.
- Translate this into Modal primitives and concurrency
  - Every Modal GPU workload maps to functions and classes:
    - Stateless inference: `@app.function(gpu="A10G")` exposed via `@modal.fastapi_endpoint`
    - Stateful servers: `@app.cls(gpu="A100")` with `@modal.enter` for one-time model loading
    - Batch jobs: `fn.map()` or `.spawn()` fan-outs across GPUs
  - The more of these you run at the same time, the more GPU concurrency you need. On Starter, concurrency is intentionally constrained to keep it friendly for early usage and smaller teams.
  - On Team, you’re buying the right to actually use Modal’s multi-cloud capacity pool and intelligent scheduling at scale: more parallel containers, more GPUs, more endpoints.
- Choose based on failure modes, not just price
  - If you stay on Starter while your org needs Team-level resources, you’ll see:
    - Jobs queued instead of starting immediately when you hit concurrency limits.
    - Engineers waiting on each other to finish big runs.
    - Awkward workarounds: merging workloads, serializing pipelines, or only running heavy jobs at off-hours.
  - On Team, you trade a higher base cost for:
    - Enough seats that everyone can deploy their own apps.
    - Higher (and more predictable) GPU concurrency, so autoscaling actually keeps up with traffic and experiment load.
  - If your most important workloads are production APIs, high-volume evals, or regular fine-tuning, the cost of not upgrading is usually higher than the price difference between the plans.
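The two questions in this section can be reduced to a small check. The sketch below is illustrative only: `Needs`, `recommend_plan`, and the `starter_gpu_cap` value are hypothetical names and numbers, not part of Modal's API or its published limits. Only the 3-seat cap comes from this article; substitute the real GPU concurrency limit from Modal's pricing page.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    seats: int          # people who regularly deploy, debug, or read logs
    peak_gpu_jobs: int  # GPU containers you want running at the same time


def recommend_plan(needs: Needs, starter_seat_cap: int, starter_gpu_cap: int) -> str:
    """Return "Team" if either Starter limit is exceeded, otherwise "Starter"."""
    if needs.seats > starter_seat_cap or needs.peak_gpu_jobs > starter_gpu_cap:
        return "Team"
    return "Starter"


# starter_gpu_cap=4 is a placeholder value, not Modal's actual limit.
startup = Needs(seats=5, peak_gpu_jobs=3)
print(recommend_plan(startup, starter_seat_cap=3, starter_gpu_cap=4))
```

Either limit on its own is enough to tip the decision, which is why the check uses `or` rather than `and`: five engineers need Team even if their GPU usage would fit on Starter.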
Common Mistakes to Avoid
- Treating seats as a “nice to have” instead of a constraint: Shared logins and one-person bottlenecks look cheap on paper but destroy iteration speed. If multiple engineers are touching LLM prompts, data transforms, infra config, and batch logic, they should each have their own seat, and Starter’s 3-seat cap fights that.
- Underestimating GPU concurrency needs: It’s easy to look at “we only need one GPU model” and forget:
  - You’ll want a shadow deployment for new versions.
  - You’ll run eval sweeps and batch backfills.
  - You’ll do interactive debugging in Sandboxes or Notebooks.

  Add those up and “just one GPU” turns into “we need multiple concurrent GPU containers.” Don’t pick a plan that forces everything through a single serial pipeline.
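To make the "add those up" step concrete, here is a minimal tally. The workload names and container counts are made-up assumptions for illustration; replace them with your own numbers.

```python
def peak_gpu_containers(workloads: dict[str, int]) -> int:
    """Sum the GPU containers that could be running at the same moment."""
    return sum(workloads.values())


# Hypothetical counts for a team that "only needs one GPU model".
workloads = {
    "production endpoint": 1,
    "shadow deployment": 1,
    "eval sweep workers": 4,
    "interactive debugging": 2,
}

peak = peak_gpu_containers(workloads)
print(f"Peak concurrent GPU containers: {peak}")
```

With these example counts, the single-model setup already wants 8 GPU containers at once.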
Real-World Example
Imagine a small AI startup with:
- 5 engineers (3 ML, 2 full-stack)
- One main LLM inference endpoint (A10G) serving user traffic
- A fine-tuning pipeline that runs nightly on an A100
- A batch eval job that runs 10k prompts across different temperature settings
- A sandboxed environment used for trying new models and running user-submitted code safely
In Modal, this might look like:

    import modal

    app = modal.App("prod-llm-stack")

    image = (
        modal.Image.debian_slim()
        .pip_install("torch", "transformers", "accelerate")
    )


    @app.cls(gpu="A10G", image=image)
    class InferenceServer:
        @modal.enter()
        def load_model(self):
            # Load weights once per container (stateful server)
            ...

        @modal.method()
        def generate(self, prompt: str):
            ...


    @app.function(gpu="A100:2", image=image, timeout=24 * 60 * 60)
    def finetune(dataset_path: str):
        ...


    @app.function(gpu="A10G", image=image)
    def eval_batch(prompts: list[str]):
        # generate is a method on InferenceServer, so fan out through an instance handle
        return list(InferenceServer().generate.map(prompts))


    @app.function(gpu="A10G", image=image)
    def sandbox_task(code_snippet: str):
        # Run untrusted code in a gVisor sandbox
        ...

On Starter, with lower GPU concurrency:
- When you kick off `finetune.remote(...)`, you may starve the inference server if concurrency is tight: autoscaling can’t give you as many A10Gs/A100s simultaneously as you’d like.
- Running `eval_batch.remote(...)` while traffic spikes on your inference endpoint competes for the same constrained GPU pool.
- Only 3 engineers can be formally in the org; others are either waiting or using workarounds.
On Team:
- Multiple `InferenceServer` containers can autoscale to handle spiky traffic while `finetune` consumes A100s in parallel.
- `eval_batch` can fan out across many A10Gs, finishing in minutes instead of hours.
- All 5 engineers have seats, and each can `modal run` their own experiments, deploy features, and inspect logs in the apps page without stepping on each other.
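A rough back-of-the-envelope model shows where "minutes instead of hours" can come from. This is a simplified sketch, not a description of Modal's scheduler: it assumes each container handles one prompt at a time, every prompt takes the same time, and cold starts are ignored. The 2-second latency and the concurrency values of 2 and 50 are hypothetical.

```python
import math


def batch_wall_clock_seconds(n_inputs: int, seconds_per_input: float, concurrency: int) -> float:
    """Estimate wall-clock time when inputs are processed in waves of `concurrency`."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    waves = math.ceil(n_inputs / concurrency)
    return waves * seconds_per_input


# 10k prompts from the example above; latency and caps are assumed values.
for cap in (2, 50):
    seconds = batch_wall_clock_seconds(10_000, seconds_per_input=2.0, concurrency=cap)
    print(f"concurrency={cap}: about {seconds / 60:.1f} minutes")
```

Under these assumptions the same eval job takes roughly 167 minutes with 2 concurrent containers and under 7 minutes with 50, so the concurrency cap, not the GPU type, dominates turnaround time.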
Pro Tip: Before upgrading, instrument your actual needs. Run a week where you don’t compromise: let everyone use GPUs when they want, run evals as often as they’d like, and track how many concurrent GPU jobs you would have used. If that number is consistently higher than what Starter allows, move to Team and size your concurrency to match that peak plus headroom.
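One way to do that tracking is to record a start and end time for every GPU job during the week and compute the peak overlap afterwards. The sketch below assumes you have exported such a log yourself; the `(start, end)` tuple format and the function name `peak_concurrency` are assumptions for illustration, not a Modal feature.

```python
def peak_concurrency(intervals: list[tuple[float, float]]) -> int:
    """Return the maximum number of jobs running at the same time."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a job starts
        events.append((end, -1))    # a job ends
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a job ending at t does not overlap one starting at t.
    events.sort()

    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical job log: (start_second, end_second) per GPU job.
jobs = [(0, 60), (10, 30), (20, 90), (60, 120)]
print(peak_concurrency(jobs))
```

For this sample log the peak is 3 concurrent jobs (between seconds 20 and 30). Compare your own weekly peak, plus headroom, against the plan's concurrency limit.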
Summary
If your only pain is “I want to try Modal,” the Starter plan is perfect: low friction, good for prototyping, fine for a solo dev or tiny team running a single LLM endpoint or a modest training job here and there.
The moment your reality looks like:
- More than 3 engineers working on AI infra, and
- Multiple GPU workloads that should run at the same time (inference, evals, fine-tuning, batch pipelines, sandboxes),
then you should treat the Team plan as the default. It’s the tier that aligns with Modal’s core promise: define everything in Python, hit modal deploy, and let the platform’s multi-cloud capacity pool and intelligent scheduling give you the parallelism you actually designed for, not the serial pipeline your plan forced you into.