Modal vs Baseten: cost comparison for spiky traffic (scale-to-zero) and production features like custom domains/rollbacks

Quick Answer: For spiky, scale-to-zero workloads, Modal typically comes out cheaper than Baseten because you only pay for actual CPU/GPU time, not always‑on replicas. On the production side, Modal gives you code-defined rollbacks, custom domains via standard reverse proxies, and a full Python runtime so you can treat infra as code instead of a UI-only model host.

Most teams looking at “Modal vs Baseten” have the same constraint: you want GPU-backed inference that can sit at zero for hours and then eat a traffic spike without blowing your latency budget or your cloud bill. At the same time, you need grown-up deployment features—rollbacks, custom domains, auth, logging—because this is production, not a weekend demo. This article breaks down how those constraints play out on cost, scale-to-zero behavior, and production mechanics, with a bias toward the one thing that actually matters: how much you spend to ship stable features fast.

Key Benefits:

Lower cost for spiky traffic: Modal’s serverless execution and sub‑second cold starts mean you don’t need warm replicas idling just to protect P95 latency.
Code-defined production workflows: Rollbacks, scaling, retries, and cron are all Python, not point‑and‑click state you forget to screenshot before a deploy.
Full AI infrastructure surface: Inference, training, evals, batch jobs, and sandboxes all share the same primitives (Images, Functions, Apps), so you don’t duplicate infra across tools.

Why This Matters

LLM and model-serving workloads are almost never smooth. You run evals in bursts, you get traffic spikes from a launch, and your usage looks more like a sawtooth than a flat line. If your provider makes you pay for always‑on replicas to hit latency targets, the infra bill quickly dominates your unit economics.

The right platform here doesn’t just “support” scale‑to‑zero—it lets you aggressively scale to zero between spikes without destroying tail latency, and it gives you production features (rollbacks, custom domains, auth, observability) as first‑class concepts. Otherwise, you end up duct-taping an inference tool into a real backend stack, with all the migration and incident pain that implies.

Core Concepts & Key Points

Concept	Definition	Why it's important
Scale-to-zero economics	Only paying for CPU/GPU when requests are actually running, with containers torn down between bursts.	Determines whether spiky workloads cost tens or thousands of dollars per month. Modal is optimized for this pattern with sub-second cold starts.
Programmable infra (Python-first)	Defining environment, hardware, scaling, endpoints, and deployment logic directly in Python code.	Keeps your infra versioned, reviewable, and testable. Rollbacks and release strategies become git operations instead of manual UI clicks.
Production controls	Features like custom domains, rollbacks, auth, logging, retries, and multi-GPU scheduling.	These are the difference between “model demo” and “system you can safely put behind a product and an on‑call rotation.” Modal leans heavily into this.

Cost Model: Spiky Traffic and Scale-to-Zero

Let’s anchor on a concrete workload: a GPU-backed inference endpoint that usually sits at zero but occasionally spikes to a few thousand requests per minute—for example, an internal eval tool or a product feature in early rollout.

How Modal charges for this scenario

Modal’s mental model is “you pay for compute while it’s doing work”:

No long‑lived replica requirement: You don’t need a constantly-warm GPU just to avoid catastrophic cold starts.
Sub-second cold starts: Containers boot in under a second, so scaling from zero on the hot path is realistic. Model weights still have to load on the first request, so measure end-to-end cold latency for your own model.
Elastic autoscaling: When the spike hits, Modal schedules containers across a multi‑cloud capacity pool (thousands of GPUs, including H100, A100, A10G, etc.) and scales back to zero when idle.
You control hardware in code: Want to swap from an A10G to an A100:2 for a heavy batch job? Change the decorator, redeploy.

For a spiky workload (say, 50k inferences/day with bursts and long idle periods) you’re paying for roughly “50k inferences × GPU time per inference,” plus the seconds containers spend starting up, loading the model, and idling briefly before scale-down. There’s no hidden “replica tax” to keep your SLOs plausible.
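
To make that formula concrete, here is a back-of-the-envelope sketch. Every number in it is a placeholder assumption (the per-second GPU rate, the 0.5 s per inference, the cold-start count and duration); none of them are published prices, so substitute your own measurements and the current rate card.

```python
def usage_based_monthly_cost(
    inferences_per_day: int,
    gpu_seconds_per_inference: float,
    gpu_rate_per_second: float,
    cold_starts_per_day: int = 0,
    seconds_per_cold_start: float = 0.0,
    days: int = 30,
) -> float:
    """Estimate monthly spend when billing follows container runtime."""
    # Seconds spent actually serving requests each day.
    busy_seconds = inferences_per_day * gpu_seconds_per_inference
    # Seconds spent booting containers and loading weights each day.
    startup_seconds = cold_starts_per_day * seconds_per_cold_start
    return (busy_seconds + startup_seconds) * gpu_rate_per_second * days


# Hypothetical inputs: 50k inferences/day at 0.5 s each, 20 cold starts/day
# at 30 s each, and an assumed GPU rate of $0.0003 per second.
estimate = usage_based_monthly_cost(50_000, 0.5, 0.0003, 20, 30.0)
print(f"Estimated monthly cost: ${estimate:,.2f}")
```

The useful property is that the estimate goes to zero as traffic goes to zero; the cold-start term is the only overhead that does not scale with request count.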

How Baseten typically behaves for the same pattern

Baseten is pitched as a model-serving platform: you deploy a model, set autoscaling parameters, and it runs on a cluster of managed GPUs/CPUs. For tight latency guarantees, you’ll usually:

Keep one or more replicas warm to avoid cold starts.
Configure min replicas > 0 for production endpoints.
Accept that you’re paying for idle capacity even when traffic is zero.

For flat or high-utilization workloads, that’s acceptable. For spiky workloads, it’s where cost balloons: your monthly spend has a floor of “min replicas × hours × hourly rate,” no matter how little compute you actually use.
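
A rough sketch of that floor, and of how much of it does useful work, might look like the following. The hourly rate and the busy-seconds figure are illustrative assumptions only and are not Baseten's pricing; the point is the shape of the calculation.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def always_on_monthly_floor(min_replicas: int, gpu_rate_per_hour: float) -> float:
    """Minimum monthly spend when replicas are billed around the clock."""
    return min_replicas * gpu_rate_per_hour * HOURS_PER_MONTH


def replica_utilization(busy_gpu_seconds: float, min_replicas: int) -> float:
    """Fraction of the paid replica time that actually served requests."""
    paid_seconds = min_replicas * HOURS_PER_MONTH * 3600
    return busy_gpu_seconds / paid_seconds


# Hypothetical inputs: one warm replica at an assumed $1.10/hour, and
# 750,000 busy GPU-seconds per month (50k inferences/day x 0.5 s x 30 days).
floor = always_on_monthly_floor(min_replicas=1, gpu_rate_per_hour=1.10)
util = replica_utilization(busy_gpu_seconds=750_000, min_replicas=1)
print(f"Floor: ${floor:,.2f}/month, utilization: {util:.1%}")
```

Low utilization is the signal to watch: the lower it is, the more of the bill is idle capacity rather than work.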

Because Modal is built around serverless execution and fast cold starts, it’s structurally better suited to bursty traffic: you scale to zero without losing the ability to handle spikes. That’s where the cost advantage shows up.

How Modal’s Scale-to-Zero Actually Works

You can think of Modal as “Python → scalable containers in seconds.” Let’s sketch the flow for a typical inference endpoint that must tolerate idle periods and spikes.

1. Define the environment and hardware in code

import modal

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch==2.2.1",
        "transformers==4.39.3",
        "accelerate==0.28.0",
    )
)

app = modal.App("spiky-inference")

GPU_TYPE = "A10G"  # or "A100", "H100", "A100:2" for multi-GPU

This is your container spec. No Dockerfile, no YAML, no separate build pipeline. The image build is cached and reused across cold starts.

2. Create a stateful model server with lifecycle hooks

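# Import heavy dependencies only inside containers built from `image`,
# so `modal deploy` works without torch/transformers installed locally.
with image.imports():
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer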

@app.cls(image=image, gpu=GPU_TYPE, timeout=600)
class ModelServer:
    @modal.enter()
    def setup(self):
        self.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
        self.model = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.2",
            torch_dtype=torch.float16,
            device_map="auto",
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        out = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)

Containers call setup() once per lifetime (not per request). Modal keeps them around while there’s traffic, then tears them down when idle.

3. Expose a web endpoint that scales to zero

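# FastAPI is layered onto the model image so the web container can import it.
# Pin the version in production, as with the other dependencies.
web_image = image.pip_install("fastapi[standard]")

@app.function(image=web_image)
@modal.asgi_app()  # serves a whole ASGI app; fastapi_endpoint wraps a single function
def fastapi_app():
    # Imported here so they resolve inside the container, not on your laptop.
    from fastapi import FastAPI
    from pydantic import BaseModel

    web = FastAPI()

    class GenerateRequest(BaseModel):
        prompt: str

    @web.post("/generate")
    async def generate(req: GenerateRequest):
        # This RPC fans out to a GPU container, scaling up on demand.
        # `.remote.aio` is the awaitable form of `.remote`.
        completion = await ModelServer().generate.remote.aio(req.prompt)
        return {"completion": completion}

    return web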

Deploy:

modal deploy spiky_inference.py

Modal will:

Start at zero containers.
On the first request, spin up a container, load the model, and serve traffic.
Keep containers warm as long as traffic continues.
Scale up on spikes (multiple containers) and down to zero when idle.

You’re billed for actual execution time + GPU usage, not idle replicas.

Production Features: Custom Domains, Rollbacks, and Safety Rails

Now let’s talk about the “grown-up” parts: custom domains, rollbacks, and operational controls you’d expect from a backend platform.

Custom domains

Modal gives each deployed app a stable URL. In production, most teams put Modal behind a reverse proxy or API gateway:

Cloudflare, Fastly, or a simple Nginx/Envoy layer.
Custom TLS termination and routing at your domain: api.yourcompany.com.
Modal endpoints protected via Proxy Auth Tokens (requires_proxy_auth=True) so only your edge can call them.

A minimal pattern:

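import subprocess

@app.function(image=image)
@modal.web_server(8000, requires_proxy_auth=True)
def web_app():
    # Optional simple web server / health endpoints.
    # Any process bound to port 8000 works; http.server is a placeholder.
    subprocess.Popen("python -m http.server 8000", shell=True)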

Then you configure your edge:

DNS api.yourcompany.com → edge
Edge routes /v1/inference → Modal endpoint URL
Auth at the edge + Proxy Auth Token on the Modal side

You get a proper custom domain with whatever security/post-processing stack you like, while Modal handles autoscaling and compute.
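
As a sketch of what the edge does when it forwards a request, the snippet below builds the upstream call with the proxy auth credentials attached. It assumes the token is passed in Modal-Key and Modal-Secret headers, which is how Modal's proxy auth documentation describes it at the time of writing; check the current docs before relying on this. The endpoint URL and token values are placeholders.

```python
import json
import urllib.request


def build_modal_request(
    endpoint_url: str, token_id: str, token_secret: str, prompt: str
) -> urllib.request.Request:
    """Build the upstream request an edge proxy would send to Modal."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed proxy-auth header names; verify against Modal's docs.
            "Modal-Key": token_id,
            "Modal-Secret": token_secret,
        },
    )


# Placeholder URL and credentials, for illustration only.
req = build_modal_request(
    "https://example--spiky-inference-fastapi-app.modal.run/generate",
    "wk-placeholder",
    "ws-placeholder",
    "hello",
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

In practice this logic lives in your gateway configuration rather than in Python, with the token stored as a secret at the edge so clients never see it.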

Rollbacks as code, not buttons

Because everything in Modal is defined in Python:

Deploys are modal deploy app.py from a git commit.
Rollback is just “deploy the previous commit”—the entire environment, image, and endpoint behavior are captured in code.

You don’t have to trust that a UI “rollback” button perfectly recreates a prior configuration; you can:

Tag releases (v1.2.3), deploy from those tags.
Pin dependency versions in your Image (torch==2.2.1, etc.).
Use feature flags in code to gate model variants.

Example:

git checkout v1.2.3
modal deploy spiky_inference.py

You’re back on that version, including its infra configuration.
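
If you want the rollback to be a single reviewed command rather than two typed by hand during an incident, a small wrapper is enough. This is a hypothetical helper, not part of Modal's CLI; it only shells out to the same two commands shown above, and it defaults to a dry run.

```python
import subprocess


def rollback_commands(tag: str, entrypoint: str) -> list[list[str]]:
    """Return the commands that redeploy a previously tagged release."""
    return [
        ["git", "checkout", tag],
        ["modal", "deploy", entrypoint],
    ]


def rollback(tag: str, entrypoint: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in rollback_commands(tag, entrypoint):
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True aborts the rollback at the first failing command.
            subprocess.run(cmd, check=True)


# Dry run: prints the two commands without touching git or Modal.
rollback("v1.2.3", "spiky_inference.py")
```

Keeping the command list in a separate function makes the script easy to test and to extend, for example with a smoke test after the deploy.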

Additional production primitives

Modal exposes a bunch of controls that matter once usage stops being a toy:

Retries: @app.function(retries=modal.Retries(max_retries=3)) for transient failures.
Timeouts: per-function and app-level time limits (up to 24 hours).
Scheduling: modal.Cron / modal.Period for periodic jobs (e.g., nightly evals).
Volumes: persistent storage for model weights, indexes, and checkpoints.
Secrets: environment-level secrets, not hard-coded tokens.
Isolation: gVisor sandboxing, SOC2 & HIPAA, data residency controls.
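
Several of these controls compose on a single function. The configuration sketch below shows one way a nightly eval job might combine them; the app name, volume name, secret name, and the HF_TOKEN key are all hypothetical and would need to exist in your own workspace.

```python
import modal

app = modal.App("nightly-evals")  # hypothetical app name

# Hypothetical resource names; create these in your own workspace first.
weights = modal.Volume.from_name("model-weights", create_if_missing=True)
hf_secret = modal.Secret.from_name("huggingface-token")


@app.function(
    retries=modal.Retries(max_retries=3),  # retry transient failures
    timeout=60 * 60,                       # one-hour limit per run
    schedule=modal.Cron("0 3 * * *"),      # every day at 03:00 UTC
    volumes={"/models": weights},          # persistent storage mount
    secrets=[hf_secret],                   # injected as environment variables
)
def nightly_eval():
    import os

    # Secret keys appear as environment variables inside the container.
    has_token = "HF_TOKEN" in os.environ  # hypothetical key name
    print(f"files in /models: {os.listdir('/models')}, token present: {has_token}")
```

Scheduled functions take no arguments, so anything the job needs should come from the volume, the secret, or constants in the code.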

Combined, you can treat Modal as the runtime for both your LLM inference and the surrounding plumbing (evals, batch jobs, periodic retraining) without switching tools.

How It Works (Step-by-Step) for a Spiky Production Endpoint

Here’s a minimal end-to-end flow to get a spiky, scale-to-zero production endpoint running on Modal, with rollbacks and a custom domain.

1. Model container & environment
- Define your Image with pinned deps.
- Choose GPU via the decorator (gpu="A10G" for cost-efficient inference).
- Commit to git.

2. Stateful server and endpoint
- Implement @app.cls with @modal.enter to load the model once per container.
- Expose a full FastAPI app via @modal.asgi_app, or a single function via @modal.fastapi_endpoint.
- Test with modal serve app.py, which runs a temporary deployment with live reload.

3. Production deployment & edge integration
- Deploy with modal deploy app.py.
- Configure your custom domain at your CDN/load balancer.
- Route traffic to the Modal URL, secure with Proxy Auth Tokens.

When traffic is quiet, you’re at zero containers. When the spike hits, Modal’s scheduler fans out across GPUs and keeps latency low—without you paying for idle replicas.

Common Mistakes to Avoid

Keeping replicas warm out of habit:
On a platform that isn’t designed for fast cold starts, you’re forced to pay for warm capacity. On Modal, start with scale-to-zero; only keep a floor of warm containers (min_containers in recent Modal releases) if you have extremely tight P95 targets and can measure a real need.
Splitting infra between “model host” and “real backend”:
If you run inference on a specialized model host and everything else on another platform, you inherit two deployment stories, two on-call surfaces, and more glue code. With Modal, keep inference, batch jobs, evals, and simple API surfaces in the same Python-defined app where it makes sense.
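
For the first mistake, the relevant knobs sit on the class decorator. The configuration sketch below assumes the parameter names used in recent Modal releases (min_containers and scaledown_window; older releases called these keep_warm and container_idle_timeout). The app and class names are hypothetical, and the setup body is a stand-in for real model loading.

```python
import modal

app = modal.App("spiky-inference-tuning")  # hypothetical app name


@app.cls(
    gpu="A10G",
    min_containers=0,      # pure scale-to-zero; raise to 1 only if measurements demand it
    scaledown_window=300,  # seconds an idle container lingers before teardown
)
class TunedServer:
    @modal.enter()
    def setup(self):
        # Stand-in for loading model weights once per container.
        self.ready = True

    @modal.method()
    def ping(self) -> bool:
        return self.ready
```

A longer scaledown window trades a little idle spend for fewer cold starts during a burst, while min_containers above zero reintroduces an always-on cost, so change one at a time and compare against measured latency.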

Real-World Example

Imagine you’re shipping an AI code-assistant feature for your SaaS product. Usage is wildly bursty: most of the day is quiet, but when you announce the feature or run a webinar, traffic spikes 50–100x for a few hours. You need:

Low latency (users are typing in an editor).
Scale-to-zero (the feature is in beta; you don’t want a fixed GPU bill).
Rollbacks (if a new model variant starts hallucinating, you revert fast).
Custom domain (you expose this as /v1/assist under your existing API).

On Baseten, the safe approach is to keep at least one GPU replica warm, maybe two for safety. You pay for those replicas 24/7, including the 95% of time with almost no traffic.

On Modal, you:

Wrap your model in an @app.cls server with @modal.enter loading.
Expose it via FastAPI + @modal.asgi_app.
Put Cloudflare in front at api.yourcompany.com.
Deploy from a tagged commit, with your entire infra and model versions in git.

When traffic is quiet, you pay nothing for compute. When the webinar hits, Modal fans out across GPUs in its multi-cloud capacity pool, keeps your P95 latency acceptable, and logs everything in the apps dashboard. If a new model version misbehaves, you deploy the previous tag and you’re back to the last known good state in seconds.

Pro Tip: For this pattern, benchmark your real cold start cost on Modal (first request after idle) and use that to drive your rollout strategy. In many cases, you can run with pure scale-to-zero in production and only introduce a floor of warm containers for the subset of endpoints where cold start latency actually shows up in user experience metrics.
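
One minimal way to run that benchmark is to time the first call after an idle period and compare it with the P95 of the calls that follow. The sketch below uses a fake endpoint that sleeps, so it runs anywhere; in a real test you would replace it with a function that sends an HTTP request to your deployed URL. The helper names and sleep durations are illustrative assumptions.

```python
import time
from typing import Callable


def time_call(call: Callable[[], object]) -> float:
    """Return the wall-clock seconds a single call takes."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def cold_vs_warm(call: Callable[[], object], warm_samples: int = 20) -> dict:
    """Time the first (cold) call, then the nearest-rank P95 of warm calls."""
    cold = time_call(call)
    warm = sorted(time_call(call) for _ in range(warm_samples))
    # Nearest-rank percentile, computed with integer math to avoid float drift.
    p95_index = max(0, (95 * len(warm) + 99) // 100 - 1)
    return {"cold_s": cold, "warm_p95_s": warm[p95_index]}


def make_fake_endpoint(cold_s: float = 0.2, warm_s: float = 0.001):
    """Stand-in for a real request: slow on the first call, fast afterwards."""
    state = {"warm": False}

    def call() -> None:
        time.sleep(warm_s if state["warm"] else cold_s)
        state["warm"] = True

    return call


result = cold_vs_warm(make_fake_endpoint(), warm_samples=10)
print(f"cold: {result['cold_s']:.3f}s, warm P95: {result['warm_p95_s']:.3f}s")
```

Run it after the endpoint has been idle long enough to scale to zero, and repeat a few times, since a single cold start is a noisy sample.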

Summary

For spiky workloads that must scale to zero, the core economic difference between Modal and Baseten is structural: Modal is designed around sub‑second cold starts and serverless GPU execution, so you pay for the work, not for idle replicas. In that regime, Modal usually wins on cost while giving you a broader programmable surface: training, evals, batch processing, sandboxes, and inference all share the same Python-defined infra.

On the production side, Modal’s model is “infra as code”: rollbacks are git operations, scaling is a decorator, and custom domains are handled through the same reverse proxies you already know. You’re not locked into a specialized model-hosting UI; you get a general-purpose backend for AI-heavy services with the same operational controls you’d expect from any serious runtime.
