
Modal vs Beam Cloud: cold starts, GPU availability, and pricing for bursty inference workloads
Bursty inference workloads are the worst of both worlds: you need GPUs ready to serve traffic in milliseconds, but you don’t want to pay for them sitting idle. This is exactly the workload where differences between Modal and Beam Cloud show up clearly—especially around cold starts, GPU availability, and how pricing behaves when traffic spikes 100–1000x for short windows.
Quick Answer: For bursty inference, Modal is optimized for sub‑second cold starts, instant autoscaling, and access to thousands of GPUs across clouds with no reservations. Beam offers a simpler GPU “serverless” experience, but with more traditional cold start behavior, tighter capacity constraints, and pricing that often pushes you toward keeping GPUs warm. If your main constraint is high‑volume spikes with strict latency SLOs, Modal’s AI‑native runtime, stateful servers, and Python‑defined infra generally give you more predictable performance and better cost control.
Why This Matters
When your LLM, vision model, or audio stack gets hit by a traffic spike, you have about 100–300 ms to respond before the UX feels broken. On most platforms, that conflicts with three realities:
- GPUs take time to initialize models.
- Capacity is limited (or requires manual reservations).
- Pricing makes you choose between paying to keep capacity warm, or suffering cold starts.
The whole point of “serverless GPUs” is to abstract this away. But in practice, implementation details—how cold starts actually work, where the capacity pool lives, and how billing is structured—determine if you can run your bursty workloads without overprovisioning or firefighting.
Key Benefits:
- Lower p95 latency during spikes: Modal’s sub‑second cold starts and stateful containers let you absorb sudden surges without blowing your latency SLOs.
- Higher effective GPU availability: Modal’s multi‑cloud capacity pool with intelligent scheduling is designed for “always have GPUs when you need them” without quotas or reservations.
- More predictable cost for bursty traffic: With scale‑to‑zero and job‑level billing, Modal lets you keep GPUs “logically warm” without paying 24/7 for idle capacity.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Cold start behavior | The time to start a container, initialize your model, and serve the first request after scale‑to‑zero or a new replica spin‑up. | Bursty inference lives or dies on p95/p99 latency; a 10–30s cold start is unacceptable when traffic is spiky. |
| GPU availability & scaling | How many GPUs you can get, how fast, and whether you need quotas, reservations, or manual cluster tuning. | If capacity isn’t truly elastic, your “serverless” setup will either throttle or fall over when you run evals, RL rollouts, or launch a campaign. |
| Pricing for bursty workloads | How the platform bills GPU time: per‑second vs hourly, idle vs active, minimum commitments, and scale‑to‑zero behavior. | Determines whether you can economically keep infra elastic, or if you’re forced into overprovisioned, always‑on GPU servers. |
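To make the pricing row concrete, here is a rough cost sketch comparing an always-on fleet with per-second, scale-to-zero billing. The hourly rate, fleet size, and traffic shape are hypothetical placeholders, not quoted prices from Modal or Beam; plug in real numbers before drawing conclusions.

```python
# Hypothetical rate for illustration only; substitute your provider's real pricing.
GPU_PRICE_PER_HOUR = 3.00  # assumed USD per GPU-hour, same rate in both scenarios


def always_on_cost(num_gpus: int, days: int = 30) -> float:
    """Cost of keeping `num_gpus` warm around the clock."""
    return num_gpus * GPU_PRICE_PER_HOUR * 24 * days


def per_second_cost(busy_gpu_seconds_per_day: float, days: int = 30) -> float:
    """Cost when you are billed only for the GPU-seconds you actually use."""
    return busy_gpu_seconds_per_day / 3600 * GPU_PRICE_PER_HOUR * days


# Assumed traffic shape: one 10-minute spike per day that needs 50 GPUs,
# plus a baseline equivalent to 1 GPU busy for 2 hours per day.
spike_seconds = 50 * 10 * 60
baseline_seconds = 1 * 2 * 3600
busy = spike_seconds + baseline_seconds

print(f"always-on (50 GPUs): ${always_on_cost(50):,.2f}/month")
print(f"per-second billing:  ${per_second_cost(busy):,.2f}/month")
```

The gap narrows as utilization rises, so the comparison is only as good as your estimate of busy GPU-seconds (which should include any time spent on cold starts).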
How It Works (Step-by-Step)
Let’s walk through how you’d actually run a bursty inference service on Modal, then contrast what typically happens on a more conventional GPU serverless platform like Beam.
1. Define infra and model server in Python
On Modal you define the environment, hardware, and endpoint all in code—no YAML, no extra orchestration:
```python
import modal

app = modal.App("bursty-inference")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch==2.2.0",
        "transformers==4.39.0",
        "fastapi[standard]",  # needed if you serve a FastAPI app from this image
    )
)


@app.cls(
    image=image,
    gpu="A100:1",
)
@modal.concurrent(max_inputs=64)  # how many concurrent requests per container
class ModelServer:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline

        # Load once per container, not per request.
        # device=0 places the model on the GPU instead of the CPU default.
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        out = self.pipe(prompt, max_new_tokens=128)
        return out[0]["generated_text"]
```
Here you’ve:
- Defined hardware (`gpu="A100:1"`), environment (`image`), and concurrency in Python.
- Used `@app.cls` + `@modal.enter` to load the model once per container and keep it in memory for every call that container serves, so only a fresh container pays the model-load cost and warm requests skip it entirely.
2. Expose a web endpoint with instant autoscaling
Now wrap it in a FastAPI endpoint:
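```python
# Continues app.py from step 1 (uses `modal`, `app`, `image`, and `ModelServer`).
from fastapi import FastAPI
from pydantic import BaseModel

web_app = FastAPI()
model = ModelServer()


class Request(BaseModel):
    prompt: str


@web_app.post("/generate")
async def generate(req: Request):
    # .remote.aio() is the awaitable form of .remote()
    return {"output": await model.generate.remote.aio(req.prompt)}


# Serve the whole FastAPI app as one ASGI endpoint.
@app.function(image=image)
@modal.asgi_app()
def serve():
    return web_app
```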
Deploy with:
```bash
modal deploy app.py
```
At this point:
- Modal handles instant autoscaling: as QPS spikes, more ModelServer containers spin up across the multi‑cloud capacity pool.
- Modal advertises sub-second cold starts for container launch, because its runtime is optimized for GPU workloads (up to “100x faster than Docker” initialization). Model load in `@modal.enter` runs once per new container and adds time that grows with model size.
On Beam, you’d define something similar, but cold start behavior is closer to:
- Start a GPU container (seconds to tens of seconds depending on image size).
- Load the model on first request (extra seconds).
- Time to first token can easily jump into multi‑second territory when scaling from zero or scaling up rapidly.
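To see why even a small share of cold requests matters, here is a minimal sketch that builds a synthetic latency sample and reads off tail percentiles. The 150 ms warm latency and 12 s cold latency are assumed values chosen for illustration, not measurements from either platform.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def simulated_latencies(total, cold_fraction, warm_ms=150.0, cold_ms=12_000.0):
    """Build a latency sample in which a fixed share of requests hit a cold container."""
    cold = round(total * cold_fraction)
    return [cold_ms] * cold + [warm_ms] * (total - cold)


for frac in (0.01, 0.04, 0.06, 0.10):
    lat = simulated_latencies(1000, frac)
    print(
        f"cold={frac:.0%}  "
        f"p50={percentile(lat, 50):.0f} ms  "
        f"p95={percentile(lat, 95):.0f} ms  "
        f"p99={percentile(lat, 99):.0f} ms"
    )
```

Once more than 5% of requests land on a cold container, p95 jumps from the warm latency straight to the cold-start latency, which is why scale-up speed dominates tail latency during a spike.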
3. Absorb bursty traffic without manual scaling
To handle traffic that spikes 100x for 10 minutes (e.g., a marketing campaign) and then drops to near zero:
- Modal’s autoscaler spins up containers across thousands of GPUs across clouds, with no quotas or advance reservations required.
- When traffic drops, those containers scale back to zero. You don’t pay for idle GPUs.
Your code stays the same. You tune:
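- per-container request concurrency with `@modal.concurrent`,
- the container ceiling with `max_containers` (called `concurrency_limit` in older Modal releases), plus `.map()` fan-out for batch workloads,
- timeouts and retries (`modal.Retries`) at the function level.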
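When picking those values, a back-of-the-envelope estimate from Little's law (in-flight requests = arrival rate × time in system) is a useful starting point. The sketch below assumes a fixed 0.5 s per-request latency and that one container can really sustain 64 concurrent requests at that latency; both are illustrative assumptions you should replace with measured numbers.

```python
import math


def containers_needed(rps: float, latency_s: float, concurrency_per_container: int) -> int:
    """Estimate container count from Little's law, rounding up to whole containers."""
    in_flight = rps * latency_s  # average number of requests being served at once
    return max(1, math.ceil(in_flight / concurrency_per_container))


baseline = containers_needed(rps=10, latency_s=0.5, concurrency_per_container=64)
spike = containers_needed(rps=1000, latency_s=0.5, concurrency_per_container=64)

print(f"baseline containers: {baseline}")
print(f"spike containers:    {spike}")
```

This gives a floor, not a guarantee: latency usually degrades as a GPU saturates, so validate the estimate with a load test.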
On a more traditional GPU serverless stack, you often have to:
- Pin min replicas to >0 to avoid brutal cold starts.
- Accept higher p95 in exchange for lower cost, or overpay for constantly warm GPUs.
- Manually manage limits to avoid hitting resource caps when traffic spikes.
Common Mistakes to Avoid
- Treating cold starts as “just a config issue”: Trying to tune cold starts away with keep-warm hacks or extra min replicas misses the root cause: container startup + model load + slow orchestration. On Modal, put heavy initialization in `@modal.enter` and let the AI-native runtime handle fast container launch instead of fighting the platform.
- Ignoring GPU availability until launch day: It’s easy to prototype on a single GPU and only discover capacity limits when you run full-scale evals or launch your product. Modal’s “access to thousands of GPUs across clouds” and the absence of a quota/reservation dance are specifically designed to avoid that fire drill. Design your pipeline assuming you can burst hard, then validate with load tests using `modal run` and aggressive `.spawn()`/`.map()` fan-out.
Real-World Example
Imagine you’re running a GPT‑style code assistant that mostly idles at 5–10 RPS but occasionally gets slammed to 1000+ RPS when a large customer rolls out to their org.
On a conventional GPU serverless setup:
- You either keep a bunch of A100s warm 24/7 (expensive), or
- You run with 1–2 GPUs and accept that any big rollout causes minutes of multi‑second latency while new containers spin up and models reload.
On Modal, you instead:
- Run the code assistant as a stateful `@app.cls` with the model loaded once in `@modal.enter`, GPU set to `"A100:1"` or `"H100:1"`.
- Set per-container concurrency to a high value with `@modal.concurrent` and let Modal autoscale the number of containers.
- Front it with `@modal.fastapi_endpoint` and use `requires_proxy_auth=True` for secure customer endpoints.
- Use Volumes if you need cached artifacts or fine-tuning checkpoints.
When a customer rollout hits, Modal’s runtime pulls from its multi‑cloud capacity pool, spins up dozens or hundreds of containers in seconds, and your P95 stays near “warm” latency. After rollout, everything scales to zero and you’re not billed for idle GPUs.
Pro Tip: For bursty inference, treat model load as part of container lifecycle, not request handling. Put it in `@modal.enter`, keep your per-request method thin, and pair that with aggressive autoscaling. Then run a synthetic spike test using `.spawn()` to mimic your worst case and check logs + metrics in the Modal apps page before you ever expose it to customers.
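The shape of such a spike test can be sketched locally before you point it at a real deployment. The harness below is a generic asyncio load generator, not Modal's API: `fake_generate` is a hypothetical stand-in that you would swap for your actual client call, and the request count and in-flight cap are arbitrary.

```python
import asyncio
import math
import time


async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a remote inference call; replace with your real client."""
    await asyncio.sleep(0.01)  # pretend the model takes about 10 ms
    return prompt.upper()


async def timed_call(prompt: str, limiter: asyncio.Semaphore) -> float:
    """Issue one request and return its latency in seconds."""
    async with limiter:  # cap in-flight requests, like a client-side connection pool
        start = time.perf_counter()
        await fake_generate(prompt)
        return time.perf_counter() - start


async def spike(num_requests: int, max_in_flight: int) -> list:
    """Fire all requests at once and collect per-request latencies."""
    limiter = asyncio.Semaphore(max_in_flight)
    calls = [timed_call(f"prompt {i}", limiter) for i in range(num_requests)]
    return await asyncio.gather(*calls)


def p95(latencies):
    """Nearest-rank 95th percentile."""
    ordered = sorted(latencies)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


latencies = asyncio.run(spike(num_requests=200, max_in_flight=50))
print(f"requests={len(latencies)}  p95={p95(latencies) * 1000:.1f} ms")
```

Measuring latency from inside the limiter keeps client-side queueing out of the numbers, so what you see reflects the service rather than the test harness.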
Summary
For bursty inference workloads, the practical differences between Modal and Beam Cloud come down to how the platform handles cold starts, GPU capacity, and billing:
- Modal is built as an AI-native runtime with sub-second cold starts, stateful model servers (`@app.cls` + `@modal.enter`), and a multi-cloud capacity pool offering elastic GPU scaling with no quotas or reservations.
- That architecture lets you handle huge, short-lived traffic spikes without pre-warming a fleet of GPUs or accepting multi-second p95.
- Pricing aligns with this model: scale‑to‑zero and per‑job billing mean you only pay when your functions or endpoints are actually doing work.
If your main requirement is to keep latency low and costs reasonable even when traffic is spiky and unpredictable, defining your infra and endpoints in Python on Modal is usually the more robust choice.