
Modal vs RunPod vs Baseten vs Replicate: cold starts, scale-to-zero cost, and how much custom Python code I can run
Most teams evaluating Modal against RunPod, Baseten, and Replicate are trying to answer three concrete questions: how bad are cold starts, what does scale-to-zero actually cost, and how much custom Python can I run before I fall off the “happy path” and start fighting the platform. Let’s walk through those questions from a systems point of view and be explicit about where Modal takes a different stance.
Quick Answer: Modal is a Python-first, serverless runtime designed around sub-second cold starts and elastic scale-to-zero on both CPU and GPU, where you express infra as code and can run almost arbitrary Python in containers. RunPod looks more like GPU rental; Baseten and Replicate center on model hosting and templates, with more constraints and more “platform behavior” once you go beyond vanilla inference.
Why This Matters
The cost envelope and reliability of your AI workload are dominated by three things: container startup latency, how long idle capacity sits around, and how much freedom you have to implement custom logic without hacking around a product’s assumptions. Cold starts dictate whether you can afford to scale to zero. Scale-to-zero policy dictates your idle GPU bill. And “how much Python can I run” is shorthand for: can I treat this like code-defined infrastructure or am I stuck inside a productized model host.
Key Benefits:
- Predictable low-latency inference: Sub-second cold starts and stateful servers keep LLM and vision endpoints inside tight latency budgets without paying for 24/7 warm replicas.
- Efficient scale-to-zero economics: Elastic autoscaling and billing on actual compute let you fan out to thousands of containers during spikes and drop to zero cost when traffic disappears.
- Full Python control plane: Defining environment, hardware, scaling, and endpoints in code gives you freedom to build non-trivial systems (agents, eval harnesses, RL loops, training jobs) without waiting for product features.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Cold start latency | Time from request/trigger to container being ready to run your code and return a response. | Determines if you can scale to zero without blowing your P95 latency budget. |
| Scale-to-zero cost model | How the platform bills when your app has zero active requests or jobs. | Dictates whether you can support spiky workloads without paying for idle GPUs/CPUs. |
| Custom Python surface area | How much arbitrary Python you can run: imports, system calls, background tasks, training loops, sandboxes. | Determines whether the platform can host your whole stack (agents, evals, batch, training) or “just the inference step.” |
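To make the scale-to-zero tradeoff concrete, here's a back-of-envelope sketch. All prices and utilization figures below are illustrative assumptions, not quotes from any platform:

```python
# Back-of-envelope: always-warm replica vs scale-to-zero, per month.
# The $/hour rate and busy-hours figure are illustrative assumptions.

def monthly_cost_always_warm(gpu_per_hour: float) -> float:
    """One replica running 24/7 for a 30-day month (720 hours billed)."""
    return gpu_per_hour * 24 * 30

def monthly_cost_scale_to_zero(gpu_per_hour: float, busy_gpu_hours: float) -> float:
    """Billed only for the hours containers are actually running."""
    return gpu_per_hour * busy_gpu_hours

rate = 1.10  # hypothetical $/hr for an A10G-class GPU
warm = monthly_cost_always_warm(rate)
spiky = monthly_cost_scale_to_zero(rate, 40.0)  # 40 busy GPU-hours/month
print(f"always warm: ${warm:.2f}/mo, scale-to-zero: ${spiky:.2f}/mo")
```

At low utilization the gap is dramatic, but the savings only materialize if cold starts fit your latency budget — which is why the cold-start and scale-to-zero questions are coupled.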
How It Works (Step-by-Step)
Let’s anchor on a specific workload: you want to expose an LLM-powered endpoint with tight latency, run evals and batch jobs on the same model code, and maybe fine-tune or retrain periodically. Here’s how that looks on Modal, and how it contrasts with more model-hosting-centric platforms.
1. Express infra as Python with Modal
On Modal, you define the environment, hardware, and scaling in code. No YAML, no separate infra layer.
```python
import modal

app = modal.App("llm-service")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "transformers==4.39.0",
        "torch==2.2.0",
        "accelerate==0.28.0",
    )
)

@app.cls(
    image=image,
    gpu="A10G",  # or "A100:2", "H100", etc.
    concurrency_limit=32,  # cap on concurrently running containers
)
class LLMServer:
    @modal.enter()
    def load_model(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

    @modal.method()
    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        out = self.model.generate(**inputs, max_new_tokens=128)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)

@app.function(image=image)
@modal.fastapi_endpoint(method="POST")
def chat(data: dict):
    return {"output": LLMServer().generate.remote(data["prompt"])}
```
To deploy:
```shell
modal deploy llm_service.py
```
You now have:
- A stateful model server (`@app.cls` + `@modal.enter`) that loads weights once per container.
- A web endpoint (`@modal.fastapi_endpoint`) serving HTTP requests.
- Automatic autoscaling to multiple GPUs across a multi-cloud pool, with sub-second cold starts once your Image is built.
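Calling the deployed endpoint is plain HTTP. The URL below is a placeholder — Modal prints your real endpoint URL on deploy — and the payload shape matches the `chat` endpoint sketched above:

```python
# Hypothetical client for the deployed endpoint. ENDPOINT_URL is a
# placeholder; substitute the URL Modal prints on `modal deploy`.
import json
from urllib import request

ENDPOINT_URL = "https://your-workspace--llm-service-chat.modal.run"  # placeholder

def build_chat_request(prompt: str) -> request.Request:
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return request.Request(
        ENDPOINT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain cold starts in one sentence.")
# urllib infers POST when a request body is attached; send with
# request.urlopen(req) once the URL points at your deployment.
print(req.get_method())
```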
Cold start behavior here is largely:
- Image build time (amortized; you build it once).
- Container launch time (Modal's optimized runtime with gVisor-based isolation; overhead on the order of a few hundred milliseconds).
- Model load on the first `@modal.enter` per container (you control this; for large models, keep containers warm via traffic or periodic touches).
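One way to implement those "periodic touches" is a scheduled function that pings the model server, building on the `app` and `LLMServer` defined earlier. This is a sketch, not runnable standalone; the five-minute period and the ping prompt are arbitrary choices to tune against your idle window:

```python
# Sketch: a scheduled "touch" that keeps at least one LLMServer
# container warm. Assumes the app and LLMServer class defined above.
@app.function(schedule=modal.Period(minutes=5))
def keep_warm():
    LLMServer().generate.remote("ping")  # arbitrary warm-up prompt
```

This trades a small amount of billed compute for never paying a full weight-load cold start on the first real request.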
Contrast this with:
- RunPod: You typically stand up full GPU pods yourself. Cold start is “boot a VM / container,” usually on the order of tens of seconds, and you’re the one managing a server process inside. It behaves like renting a GPU box, not a function call.
- Baseten / Replicate: More model-hosting-centric. You often pick from templates, upload a model, and get an endpoint. Cold start is tied to their autoscaler and model load implementation. You have less direct control over the lifecycle hooks and exact init behavior, and you usually don’t think in terms of “arbitrary Python app with multiple Functions and Classes.”
2. Scale-to-zero and pay-for-what-you-use
On Modal, containers scale up and down based on actual calls:
- Functions invoked with `.remote()`, `.map()`, or `.spawn()` spin up containers on demand.
- When there's no traffic, your app scales down to zero containers.
- You pay for compute while your functions run, not for idle machines. The free plan includes $30/month of compute, so you can test "real" workloads without a bill.
Example batch fan-out:
```python
@app.function(
    image=image,
    timeout=900,  # seconds
    cpu=4.0,
)
def score_example(example_id: int):
    # Custom Python: DB reads, S3, RPCs, etc.
    ...

@app.local_entrypoint()
def run_eval():
    example_ids = range(1_000_000)
    # .map() fans out one call per input across containers
    results = score_example.map(example_ids)
    # Pull results back as the fan-out completes
    for result in results:
        ...
```
Modal will:
- Launch as many containers as needed (up to your account limits) across clouds.
- Scale back to zero when all calls finish.
- Let you track logs and function calls in the apps page, with optional `modal.Retries` for transient failures.
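In code, the `modal.Retries` option attaches to the function definition; the retry counts and delays below are illustrative values, not recommendations:

```python
# Sketch: automatic retries for transient failures (values illustrative).
# Assumes the app and image defined in the earlier examples.
@app.function(
    image=image,
    retries=modal.Retries(
        max_retries=3,
        initial_delay=1.0,        # seconds before the first retry
        backoff_coefficient=2.0,  # exponential backoff between attempts
    ),
)
def score_example_with_retries(example_id: int):
    ...
```

This keeps retry policy next to the code it protects, instead of in a separate orchestrator config.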
Comparisons:
- RunPod: Scale-to-zero is mostly manual; you decide when to stop a pod. If you leave it up to avoid cold starts, you pay for idle GPU hours.
- Baseten: Offers autoscaling and scale-to-zero for model endpoints, but it’s primarily tied to inference workloads. Batch jobs and more general compute often require separate orchestration or product surfaces.
- Replicate: Endpoints scale based on call volume; pricing is usually per-second of model compute. It’s friendly if you fit their model templates, but less of a general compute fabric where you run arbitrary batch systems or long-running training loops.
3. Run arbitrary Python – not just “model.predict”
Modal treats Python as the control plane. Inside a Modal function/class you can:
- Import arbitrary libraries, as long as you package them in the Image.
- Talk to databases, queues, HTTP APIs.
- Coordinate multiple GPUs or spawn child jobs with `.spawn()`.
- Run untrusted code in Sandboxes with gVisor isolation.
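The Sandbox primitive deserves a sketch of its own. This is a minimal example of executing untrusted code in an isolated container; the snippet being run is, of course, made up, and it assumes the `app` defined earlier:

```python
# Sketch: run untrusted code in a gVisor-isolated Sandbox.
# Assumes the modal.App defined in the earlier examples.
sb = modal.Sandbox.create(app=app)
proc = sb.exec("python", "-c", "print(2 + 2)")  # the "untrusted" code
print(proc.stdout.read())  # output captured from inside the sandbox
sb.terminate()
```

This is the building block for agent tool-execution and code-interpreter workloads, where you want arbitrary code to run without access to your own environment.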
For example, a training loop plus evals on separate GPUs:
```python
volume = modal.Volume.from_name("checkpoints", create_if_missing=True)

train_image = (
    modal.Image.debian_slim()
    .pip_install("torch==2.2.0", "datasets", "wandb")
)

@app.function(
    image=train_image,
    gpu="A100:2",
    timeout=24 * 60 * 60,  # 24-hour max timeout
    volumes={"/data": volume},
)
def train_model(config: dict):
    import torch
    from my_lib import build_model, train_loop

    model = build_model(config)
    train_loop(model, config)
    # Save the checkpoint to the mounted Volume
    model_path = "/data/checkpoints/latest.pt"
    torch.save(model.state_dict(), model_path)
    volume.commit()  # persist writes so other functions can read them
    return model_path
```
```python
eval_image = train_image  # reuse dependencies

@app.function(
    image=eval_image,
    gpu="A10G",
)
def eval_model(checkpoint_path: str, dataset_split: str):
    # Load the checkpoint from a Volume, run evals, log metrics
    ...
```
Use `.spawn()` to run many evals in parallel on separate worker GPUs:

```python
@app.local_entrypoint()
def main():
    ckpt = train_model.remote({"lr": 1e-4})
    splits = ["dev", "test", "ood"]
    calls = [eval_model.spawn(ckpt, s) for s in splits]
    results = [c.get() for c in calls]
    print(results)
```
This is more general than “host a single model endpoint”:
- You can implement full RL pipelines, eval harnesses, synthetic data generators, or MCP servers.
- You can keep using the same primitives (Functions, Classes, Images, Volumes) across training, inference, and offline batch.
On a spectrum:
- Modal: General-purpose Python infrastructure, where AI workloads are first-class but not the only thing.
- RunPod: General-purpose GPU VMs/containers; extreme freedom but you do your own orchestration.
- Baseten / Replicate: Higher-level, inference-centric platforms; great when you mostly want “POST to model,” more friction once you start building richer systems with lots of custom logic.
Common Mistakes to Avoid
- Treating all cold starts as the same: For LLMs, your "cold start" is often dominated by weight loading, not container spin-up. On Modal, use `@app.cls` with `@modal.enter` so you pay that cost once per container, and pin model weights on a Volume or in a shared cache when possible. Don't rebuild Images on every deploy.
- Forgetting to pin dependencies: If you don't pin Python package versions in your Modal Images, you'll eventually get mystery breakage. Pin `transformers==…`, `torch==…`, etc., so that rebuilds and new deployments behave like yesterday's.
Real-World Example
A typical pattern we see: a team has an agentic system hitting multiple models, a retrieval layer, and a non-trivial orchestration graph. They prototype it as Python scripts and notebooks locally, then need to ship a production endpoint that can handle big spikes during evals and batch backfills.
On Modal they:
- Wrap each "unit of work" in a `@app.function` (embedding calls, RAG lookups, tool invocations).
- Create a stateful `@app.cls` for any heavy models to keep them in GPU memory across requests.
- Expose the orchestrator as a `@modal.fastapi_endpoint`, while using `.spawn()` and `.map()` underneath for concurrency.
- Let Modal autoscale containers across a multi-cloud GPU pool, relying on sub-second cold starts for helper functions and long-lived model servers for the biggest models.
They don’t need to maintain a separate cluster or autoscaler. They don’t rewrite code into a hosted “model card” template. They just deploy their Python app and watch traffic graphs in the Modal dashboard.
Pro Tip: Start by modeling one end-to-end path (e.g., a single agent call) as a Modal app, then factor out heavy components into `@app.cls` servers and background Functions. Don't prematurely over-generalize your infra; the decorators and `.remote()`/`.spawn()` calls give you enough leverage to refactor in-place.
Summary
If you care about cold starts, genuine scale-to-zero, and running real Python rather than just “model.predict,” you want infra that looks like code, not a model-upload UI. Modal’s bet is that Python-defined infrastructure plus an AI-native runtime (fast container launch, stateful servers, multi-cloud GPUs) is the sweet spot: you keep iteration speed, you don’t pay for idle capacity, and you can evolve from a single endpoint to a full training + eval + batch system without switching platforms.
RunPod gives you raw GPU building blocks and full control, at the cost of managing your own servers and autoscaling. Baseten and Replicate make it easy to stand up model endpoints quickly but give you less leverage when the system around the model grows complex. Picking among them is really about where you want to sit on that spectrum of abstraction and control.