
Modal vs RunPod vs Baseten vs Replicate: cold starts, scale-to-zero cost, and how much custom Python code I can run
Most teams evaluating Modal against RunPod, Baseten, and Replicate are trying to answer three concrete questions: how bad are cold starts, what does scale-to-zero actually cost, and how much custom Python can I run before I fall off the “happy path” and start fighting the platform. Let’s walk through those questions from a systems point of view and be explicit about where Modal takes a different stance.
Quick Answer: Modal is a Python-first, serverless runtime designed around sub-second cold starts and elastic scale-to-zero on both CPU and GPU, where you express infra as code and can run almost arbitrary Python in containers. RunPod looks more like GPU rental; Baseten and Replicate center on model hosting and templates, with more constraints and more “platform behavior” once you go beyond vanilla inference.
Why This Matters
The cost envelope and reliability of your AI workload are dominated by three things: container startup latency, how long idle capacity sits around, and how much freedom you have to implement custom logic without hacking around a product’s assumptions. Cold starts dictate whether you can afford to scale to zero. Scale-to-zero policy dictates your idle GPU bill. And “how much Python can I run” is shorthand for: can I treat this like code-defined infrastructure or am I stuck inside a productized model host.
Key Benefits:
- Predictable low-latency inference: Sub-second cold starts and stateful servers keep LLM and vision endpoints inside tight latency budgets without paying for 24/7 warm replicas.
- Efficient scale-to-zero economics: Elastic autoscaling and billing on actual compute let you fan out to thousands of containers during spikes and drop to zero cost when traffic disappears.
- Full Python control plane: Defining environment, hardware, scaling, and endpoints in code gives you freedom to build non-trivial systems (agents, eval harnesses, RL loops, training jobs) without waiting for product features.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Cold start latency | Time from request/trigger to container being ready to run your code and return a response. | Determines if you can scale to zero without blowing your P95 latency budget. |
| Scale-to-zero cost model | How the platform bills when your app has zero active requests or jobs. | Dictates whether you can support spiky workloads without paying for idle GPUs/CPUs. |
| Custom Python surface area | How much arbitrary Python you can run: imports, system calls, background tasks, training loops, sandboxes. | Determines whether the platform can host your whole stack (agents, evals, batch, training) or “just the inference step.” |
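To make the scale-to-zero tradeoff concrete, here's a back-of-envelope sketch. All prices and utilization figures below are illustrative assumptions, not quotes from any platform:

```python
# Back-of-envelope: always-warm replica vs scale-to-zero, per month.
# The $/hour rate and busy-hours figure are illustrative assumptions.

def monthly_cost_always_warm(gpu_per_hour: float) -> float:
    """One replica running 24/7 for a 30-day month (720 hours billed)."""
    return gpu_per_hour * 24 * 30

def monthly_cost_scale_to_zero(gpu_per_hour: float, busy_gpu_hours: float) -> float:
    """Billed only for the hours containers are actually running."""
    return gpu_per_hour * busy_gpu_hours

rate = 1.10  # hypothetical $/hr for an A10G-class GPU
warm = monthly_cost_always_warm(rate)
spiky = monthly_cost_scale_to_zero(rate, 40.0)  # 40 busy GPU-hours/month
print(f"always warm: ${warm:.2f}/mo, scale-to-zero: ${spiky:.2f}/mo")
```

At low utilization the gap is dramatic, but the savings only materialize if cold starts fit your latency budget — which is why the cold-start and scale-to-zero questions are coupled.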
How It Works (Step-by-Step)
Let’s anchor on a specific workload: you want to expose an LLM-powered endpoint with tight latency, run evals and batch jobs on the same model code, and maybe fine-tune or retrain periodically. Here’s how that looks on Modal, and how it contrasts with more model-hosting-centric platforms.
1. Express infra as Python with Modal
On Modal, you define the environment, hardware, and scaling in code. No YAML, no separate infra layer.
```python
import modal

app = modal.App("llm-service")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "transformers==4.39.0",
        "torch==2.2.0",
        "accelerate==0.28.0",
    )
)

@app.cls(
    image=image,
    gpu="A10G",  # or "A100:2", "H100", etc.
    concurrency_limit=32,  # cap on concurrently running containers
)
class LLMServer:
    @modal.enter()
    def load_model(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

    @modal.method()
    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        out = self.model.generate(**inputs, max_new_tokens=128)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)

@app.function(image=image)
@modal.fastapi_endpoint(method="POST")
def chat(data: dict):
    return {"output": LLMServer().generate.remote(data["prompt"])}
```
To deploy:
```shell
modal deploy llm_service.py
```
You now have:
- A stateful model server (`@app.cls` + `@modal.enter`) that loads weights once per container.
- A web endpoint (`@modal.fastapi_endpoint`) serving HTTP requests.
- Automatic autoscaling to multiple GPUs across a multi-cloud pool, with sub-second cold starts once your Image is built.
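Calling the deployed endpoint is plain HTTP. The URL below is a placeholder — Modal prints your real endpoint URL on deploy — and the payload shape matches the `chat` endpoint sketched above:

```python
# Hypothetical client for the deployed endpoint. ENDPOINT_URL is a
# placeholder; substitute the URL Modal prints on `modal deploy`.
import json
from urllib import request

ENDPOINT_URL = "https://your-workspace--llm-service-chat.modal.run"  # placeholder

def build_chat_request(prompt: str) -> request.Request:
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return request.Request(
        ENDPOINT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain cold starts in one sentence.")
# urllib infers POST when a request body is attached; send with
# request.urlopen(req) once the URL points at your deployment.
print(req.get_method())
```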
Cold start behavior here is largely:
- Image build time (amortized; you build it once).
- Container launch time (Modal's optimized runtime with gVisor-based isolation; overhead on the order of a few hundred milliseconds).
- Model load on the first `@modal.enter` per container (you control this; for large models, keep containers warm via traffic or periodic touches).
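One way to implement those "periodic touches" is a scheduled function that pings the model server, building on the `app` and `LLMServer` defined earlier. This is a sketch, not runnable standalone; the five-minute period and the ping prompt are arbitrary choices to tune against your idle window:

```python
# Sketch: a scheduled "touch" that keeps at least one LLMServer
# container warm. Assumes the app and LLMServer class defined above.
@app.function(schedule=modal.Period(minutes=5))
def keep_warm():
    LLMServer().generate.remote("ping")  # arbitrary warm-up prompt
```

This trades a small amount of billed compute for never paying a full weight-load cold start on the first real request.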
Contrast this with:
- RunPod: You typically stand up full GPU pods yourself. Cold start is “boot a VM / container,” usually on the order of tens of seconds, and you’re the one managing a server process inside. It behaves like renting a GPU box, not a function call.
- Baseten / Replicate: More model-hosting-centric. You often pick from templates, upload a model, and get an endpoint. Cold start is tied to their autoscaler and model load implementation. You have less direct control over the lifecycle hooks and exact init behavior, and you usually don’t think in terms of “arbitrary Python app with multiple Functions and Classes.”
2. Scale-to-zero and pay-for-what-you-use
On Modal, containers scale up and down based on actual calls:
- Functions invoked with `.remote()`, `.map()`, or `.spawn()` spin up containers on demand.
- When there's no traffic, your app scales down to zero containers.
- You pay for compute while your functions run, not for idle machines. The free plan includes $30/month of compute, so you can test "real" workloads without a bill.
Example batch fan-out:
```python
@app.function(
    image=image,
    timeout=900,  # seconds
    cpu=4.0,
)
def score_example(example_id: int):
    # Custom Python: DB reads, S3, RPCs, etc.
    ...

@app.local_entrypoint()
def run_eval():
    example_ids = range(1_000_000)
    # .map() fans out one call per input across containers
    results = score_example.map(example_ids)
    # Pull results back as the fan-out completes
    for result in results:
        ...
```
Modal will:
- Launch as many containers as needed (up to your account limits) across clouds.
- Scale back to zero when all calls finish.
- Let you track logs and function calls in the apps page, with optional `modal.Retries` for transient failures.
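In code, the `modal.Retries` option attaches to the function definition; the retry counts and delays below are illustrative values, not recommendations:

```python
# Sketch: automatic retries for transient failures (values illustrative).
# Assumes the app and image defined in the earlier examples.
@app.function(
    image=image,
    retries=modal.Retries(
        max_retries=3,
        initial_delay=1.0,        # seconds before the first retry
        backoff_coefficient=2.0,  # exponential backoff between attempts
    ),
)
def score_example_with_retries(example_id: int):
    ...
```

This keeps retry policy next to the code it protects, instead of in a separate orchestrator config.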
Comparisons:
- RunPod: Scale-to-zero is mostly manual; you decide when to stop a pod. If you leave it up to avoid cold starts, you pay for idle GPU hours.
- Baseten: Offers autoscaling and scale-to-zero for model endpoints, but it’s primarily tied to inference workloads. Batch jobs and more general compute often require separate orchestration or product surfaces.
- Replicate: Endpoints scale based on call volume; pricing is usually per-second of model compute. It’s friendly if you fit their model templates, but less of a general compute fabric where you run arbitrary batch systems or long-running training loops.
3. Run arbitrary Python – not just “model.predict”
Modal treats Python as the control plane. Inside a Modal function/class you can:
- Import arbitrary libraries, as long as you package them in the Image.
- Talk to databases, queues, HTTP APIs.
- Coordinate multiple GPUs or spawn child jobs with `.spawn()`.
- Run untrusted code in Sandboxes with gVisor isolation.
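The Sandbox primitive deserves a sketch of its own. This is a minimal example of executing untrusted code in an isolated container; the snippet being run is, of course, made up, and it assumes the `app` defined earlier:

```python
# Sketch: run untrusted code in a gVisor-isolated Sandbox.
# Assumes the modal.App defined in the earlier examples.
sb = modal.Sandbox.create(app=app)
proc = sb.exec("python", "-c", "print(2 + 2)")  # the "untrusted" code
print(proc.stdout.read())  # output captured from inside the sandbox
sb.terminate()
```

This is the building block for agent tool-execution and code-interpreter workloads, where you want arbitrary code to run without access to your own environment.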
For example, a training loop plus evals on separate GPUs:
```python
volume = modal.Volume.from_name("checkpoints", create_if_missing=True)

train_image = (
    modal.Image.debian_slim()
    .pip_install("torch==2.2.0", "datasets", "wandb")
)

@app.function(
    image=train_image,
    gpu="A100:2",
    timeout=24 * 60 * 60,  # 24-hour max timeout
    volumes={"/data": volume},
)
def train_model(config: dict):
    import torch
    from my_lib import build_model, train_loop

    model = build_model(config)
    train_loop(model, config)
    # Save the checkpoint to the mounted Volume
    model_path = "/data/checkpoints/latest.pt"
    torch.save(model.state_dict(), model_path)
    volume.commit()  # persist writes so other functions can read them
    return model_path
```
```python
eval_image = train_image  # reuse dependencies

@app.function(
    image=eval_image,
    gpu="A10G",
)
def eval_model(checkpoint_path: str, dataset_split: str):
    # Load the checkpoint from a Volume, run evals, log metrics
    ...
```
Use `.spawn()` to run many evals in parallel on separate worker GPUs:

```python
@app.local_entrypoint()
def main():
    ckpt = train_model.remote({"lr": 1e-4})
    splits = ["dev", "test", "ood"]
    calls = [eval_model.spawn(ckpt, s) for s in splits]
    results = [c.get() for c in calls]
    print(results)
```
This is more general than “host a single model endpoint”:
- You can implement full RL pipelines, eval harnesses, synthetic data generators, or MCP servers.
- You can keep using the same primitives (Functions, Classes, Images, Volumes) across training, inference, and offline batch.
On a spectrum:
- Modal: General-purpose Python infrastructure, where AI workloads are first-class but not the only thing.
- RunPod: General-purpose GPU VMs/containers; extreme freedom but you do your own orchestration.
- Baseten / Replicate: Higher-level, inference-centric platforms; great when you mostly want “POST to model,” more friction once you start building richer systems with lots of custom logic.
Common Mistakes to Avoid
- Treating all cold starts as the same: For LLMs, your "cold start" is often dominated by weight loading, not container spin-up. On Modal, use `@app.cls` with `@modal.enter` so you pay that cost once per container, and pin model weights on a Volume or in a shared cache when possible. Don't rebuild Images on every deploy.
- Forgetting to pin dependencies: If you don't pin Python package versions in your Modal Images, you'll eventually get mystery breakage. Pin `transformers==…`, `torch==…`, etc., so that rebuilds and new deployments behave like yesterday's.
Real-World Example
A typical pattern we see: a team has an agentic system hitting multiple models, a retrieval layer, and a non-trivial orchestration graph. They prototype it as Python scripts and notebooks locally, then need to ship a production endpoint that can handle big spikes during evals and batch backfills.
On Modal they:
- Wrap each "unit of work" in a `@app.function` (embedding calls, RAG lookups, tool invocations).
- Create a stateful `@app.cls` for any heavy models to keep them in GPU memory across requests.
- Expose the orchestrator as a `@modal.fastapi_endpoint`, while using `.spawn()` and `.map()` underneath for concurrency.
- Let Modal autoscale containers across a multi-cloud GPU pool, relying on sub-second cold starts for helper functions and long-lived model servers for the biggest models.
They don’t need to maintain a separate cluster or autoscaler. They don’t rewrite code into a hosted “model card” template. They just deploy their Python app and watch traffic graphs in the Modal dashboard.
Pro Tip: Start by modeling one end-to-end path (e.g., a single agent call) as a Modal app, then factor out heavy components into `@app.cls` servers and background Functions. Don't prematurely over-generalize your infra; the decorators and `.remote()`/`.spawn()` calls give you enough leverage to refactor in-place.
Summary
If you care about cold starts, genuine scale-to-zero, and running real Python rather than just “model.predict,” you want infra that looks like code, not a model-upload UI. Modal’s bet is that Python-defined infrastructure plus an AI-native runtime (fast container launch, stateful servers, multi-cloud GPUs) is the sweet spot: you keep iteration speed, you don’t pay for idle capacity, and you can evolve from a single endpoint to a full training + eval + batch system without switching platforms.
RunPod gives you raw GPU building blocks and full control, at the cost of managing your own servers and autoscaling. Baseten and Replicate make it easy to stand up model endpoints quickly but give you less leverage when the system around the model grows complex. Picking among them is really about where you want to sit on that spectrum of abstraction and control.