Modal vs CoreWeave (or Lambda Labs): when is serverless cheaper than running always-on GPUs?

Quick Answer: Serverless GPUs on Modal are cheaper than always-on GPUs from CoreWeave or Lambda Labs whenever your workload is spiky, batched, or idle a lot of the time. If your GPUs aren’t busy at a high, steady utilization (think 40–60%+ around the clock), you’re usually paying more for capacity than for actual work—and that’s exactly where serverless wins.

Most teams move to CoreWeave or Lambda Labs because they’re tired of fighting general-purpose clouds for GPU capacity. But once you get past quotas and spot interruptions, a different problem shows up: utilization. Your H100 might be “cheap per hour,” but if it’s sitting at 5–10% utilization between evals, fine-tunes, or traffic spikes, your effective cost per token/step/request explodes.

Modal takes a different bet: you keep defining your infrastructure in Python, and we spin GPUs up and down around the actual work—LLM inference, fine-tuning runs, agent evals, batch jobs—so you’re charged for execution time, not for idle capacity. The question isn’t “Is Modal cheaper per GPU-hour?” but “Is my workload bursty enough that paying per busy second beats renting a box 24/7?”

This post walks through how to reason about that, with concrete rules of thumb you can plug your own numbers into.

Key Benefits:

  • Lower cost for bursty workloads: You pay for GPU time when your functions actually run, instead of 24/7 instance rent.
  • Fewer ops headaches at high scale: Modal’s autoscaling and multi-cloud GPU pool handle spiky traffic without you building schedulers or autoscalers.
  • Code-first control, not YAML or glue: Environment, scaling, and endpoints are Python functions and decorators you can run locally (modal run) and deploy with modal deploy.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| GPU utilization | The fraction of time a GPU is actively doing useful work (e.g. training, inference) versus sitting idle. | Low utilization is where always-on GPUs become expensive; serverless wins when you can’t keep GPUs busy. |
| Burstiness | How “peaky” your workload is: long idle periods, short intense spikes, or scheduled batches instead of steady 24/7 traffic. | The more bursty the workload, the more money you lose on always-on GPUs—and the more serverless saves. |
| Effective cost per unit work | What you pay per token, per training step, or per job—not per hour. | This is the only metric that really matters when comparing Modal vs CoreWeave / Lambda Labs. |

Why This Matters

Most infra debates get stuck at list price: “an A100 costs X/hr on CoreWeave, Y/hr on Lambda Labs, and Z/hr on Modal-equivalent hardware.” But you don’t ship GPU-hours. You ship: “this endpoint serves 100 RPS with p95 < 300ms,” or “this trainer finishes a fine-tune in 2 hours.”

For that, utilization and elasticity matter more than list price:

  • If you can keep a GPU cluster >60–70% busy 24/7, always-on capacity (CoreWeave, Lambda Labs) can be cost-efficient.
  • If your usage is spiky, or you carry a lot of “just in case” capacity, you’re burning money on idle GPUs—and a serverless runtime like Modal will usually be cheaper at the same or better latency.

The nice part is you can decide this empirically. You don’t need faith; you need a back-of-the-envelope model and a small POC.

How to Think About Cost: A Simple Model

Let’s define three variables:

  • C_on — hourly cost of an always-on GPU (CoreWeave / Lambda Labs).
  • C_sl — effective hourly cost of the same GPU-type capacity on Modal, when fully busy.
  • U_real — your real-world utilization (0–1) on the always-on GPU.

Your effective cost per hour of useful work with always-on GPUs is:

Cost_on_effective = C_on / U_real

On Modal, you pay close to:

Cost_sl_effective ≈ C_sl

because we’re only billing while your containers are actually running work (inference calls, training steps, batch tasks). You don’t pay when they’re scaled to zero.

Modal becomes cheaper when:

C_on / U_real > C_sl
→ U_real < C_on / C_sl

You don’t need exact numbers to use this; a rough ratio is enough. If the serverless-equivalent GPU-hour is, say, 1.5–2× the “raw” bare-metal price, then any workload with <50–67% real utilization is likely cheaper on Modal.

And in practice? Many teams discover their “busy” GPUs are idling at 5–20% when you factor in nights, weekends, and slow intervals.
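To make the crossover concrete, here is a tiny calculator for the model above. The prices are illustrative placeholders, not quotes from any provider; plug in your own numbers.

```python
def breakeven_utilization(c_on: float, c_sl: float) -> float:
    """Utilization below which serverless (C_sl) beats always-on (C_on)."""
    return c_on / c_sl

def effective_hourly_cost(c_on: float, u_real: float) -> float:
    """What an always-on GPU really costs per hour of useful work."""
    return c_on / u_real

c_on = 2.00  # always-on $/GPU-hr (placeholder)
c_sl = 3.50  # serverless $/GPU-hr when fully busy (placeholder)

u_star = breakeven_utilization(c_on, c_sl)
print(f"Serverless wins below {u_star:.0%} utilization")
print(f"At 15% busy, always-on really costs ${effective_hourly_cost(c_on, 0.15):.2f}/hr of work")
```

With these placeholder prices, the break-even point lands around 57% utilization: below that, always-on capacity costs more per hour of actual work.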

When Modal Is Usually Cheaper Than CoreWeave / Lambda Labs

Let’s anchor this to actual workloads instead of algebra.

1. Spiky LLM inference and agents

You’re running:

  • Chat endpoints where traffic spikes during working hours.
  • Agents or MCP servers that do long, GPU-heavy calls only when the user hits “Run.”
  • Model-based evals that run thousands of queries in a burst and then go quiet.

On always-on GPUs, you either:

  • Overprovision to handle spikes, and pay for idle capacity during troughs.
  • Underprovision and blow up latency / queue depth during spikes.

On Modal, you define a serverless endpoint in Python and let autoscaling absorb the spike:

import modal

app = modal.App("qa-inference")

image = (
    modal.Image.debian_slim()
    .pip_install("transformers", "torch", "accelerate", "fastapi[standard]")
)

@app.cls(gpu="A10G", image=image)
class QAModel:
    @modal.enter()
    def load_model(self):
        # Runs once per container; weights stay in memory across requests
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.2"
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.2",
            torch_dtype="auto",
            device_map="auto",
        )

    @modal.method()
    def generate(self, question: str) -> str:
        import torch

        prompt = f"Answer concisely:\n\n{question}"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            out = self.model.generate(**inputs, max_new_tokens=128)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)

@app.function(image=image)
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI

    web_app = FastAPI()

    @web_app.post("/qa")
    async def qa(body: dict):
        answer = await QAModel().generate.remote.aio(body["question"])
        return {"answer": answer}

    return web_app

To deploy:

modal deploy qa_inference.py

You get:

  • Sub-second cold starts (weights loaded once per container via @modal.enter).
  • Autoscaling from zero to thousands of replicas in minutes.
  • No GPU rent when nobody is asking questions.

If your traffic pattern is “weekday daytime spike, nighttime LLM ghost town,” Modal almost always beats always-on boxes on cost.
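Calling the deployed endpoint is just an HTTP POST. The URL below is a made-up placeholder; `modal deploy` prints the real one for your workspace.

```python
import json
import urllib.request

# Placeholder URL -- substitute the one `modal deploy` prints for your app
url = "https://your-workspace--qa-inference-fastapi-app.modal.run/qa"

req = urllib.request.Request(
    url,
    data=json.dumps({"question": "What is serverless?"}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment once deployed
# print(json.load(response)["answer"])
```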

2. Batch fan-out jobs and evals

Now consider:

  • Model-based evals against a large dataset.
  • Audio transcription of backlogs (“transcribe 5,000 podcasts by tomorrow”).
  • Embedding generation / feature extraction.

On CoreWeave/Lambda Labs, you:

  • Stand up a cluster.
  • Run the job.
  • Either tear it down afterwards (more automation for you), or let it sit idle “just in case” you need to rerun.

On Modal, one function plus .map() gives you elastic parallelism on-demand:

import modal

app = modal.App("batch-evals")

image = (
    modal.Image.debian_slim()
    .pip_install("openai", "tiktoken")
)

@app.function(
    image=image,
    gpu="A10G",
    timeout=60 * 60,  # max 1 hour per task
    retries=modal.Retries(max_retries=3),
)
def eval_example(prompt: str) -> float:
    # Run your eval logic against a model here
    score = ...  # compute a numerical score for this prompt
    return score

@app.local_entrypoint()
def main():
    prompts = load_prompts_somewhere()
    # Fan out across hundreds/thousands of containers; .map() streams
    # results back as tasks complete.
    scores = list(eval_example.map(prompts))
    persist(scores)

You pay for GPU time while the evals run. When they’re done, the GPUs disappear. If your evals run 2 hours/day, you’re not paying for 24.

This is where customers report “we transcribed podcasts on hundreds of GPUs in a fraction of the time” and still paid less than leaving a smaller cluster idle for days.
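A quick sanity check of that math, with made-up prices and job sizes:

```python
def serverless_batch_cost(n_tasks: int, hrs_per_task: float, c_sl: float) -> float:
    # You pay for GPU-hours actually executing, regardless of parallelism
    return n_tasks * hrs_per_task * c_sl

def always_on_batch_cost(n_gpus: int, hrs_reserved: float, c_on: float) -> float:
    # You pay for the full reservation window, busy or idle
    return n_gpus * hrs_reserved * c_on

# 5,000 transcriptions at ~3 GPU-minutes each, at a placeholder $3.50/GPU-hr
sl = serverless_batch_cost(5_000, 3 / 60, c_sl=3.50)
# vs. a 10-GPU cluster kept up for a week "just in case", at $2.00/GPU-hr
on = always_on_batch_cost(10, 24 * 7, c_on=2.00)
print(f"serverless: ${sl:,.0f}  always-on: ${on:,.0f}")
```

Notice that the serverless number is the same whether the 5,000 tasks run on 10 GPUs or 500; only the wall-clock time changes.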

3. Sandboxes and untrusted code

Another pattern: “we give users a way to run arbitrary compute-heavy code (maybe with GPUs) and we want to isolate them.”

You could:

  • Build a multi-tenant scheduler on top of CoreWeave/Lambda Labs.
  • Keep a pool of GPUs alive just in case someone submits a job.

Or you let Modal Sandboxes handle this:

import modal

app = modal.App.lookup("sandboxed-gpu-code", create_if_missing=True)

image = modal.Image.debian_slim().pip_install("torch", "numpy")

def run_user_code(code: str) -> str:
    # One isolated container per submission; it exists only while running
    sb = modal.Sandbox.create(app=app, image=image, gpu="A10G", timeout=60 * 10)
    try:
        proc = sb.exec("python", "-c", code)
        return proc.stdout.read()
    finally:
        sb.terminate()

Each call is a separate sandboxed container with gVisor-based isolation. Again: no baseline capacity cost—just per-execution cost.

4. “Lab-style” workflows, notebooks, and experiments

A lot of teams rent long-lived GPUs for experimentation:

  • Jupyter/VSCode connected to an always-on instance.
  • Manual “start/stop” patterns that realistically drift into “always running.”

On Modal, you can:

  • Use Modal Notebooks or modal shell / modal run to prototype.
  • Spin up GPUs for minutes at a time while iterating.
  • Scale experiments to many GPUs using the same code you used interactively.

If you’re not actively training 24/7, this is almost always cheaper than keeping a big A100/H100 idling for “convenience.”

When CoreWeave / Lambda Labs Can Be Cheaper Than Modal

You should also be explicit about where always-on wins. There are real cases.

Modal is probably not cheaper if:

  1. You have extremely steady, high-utilization training.

    • Example: an H100 cluster training a frontier model 24/7 for months.
    • Utilization is 70–90%+ on every GPU, around the clock.
    • You’re comfortable managing your own schedulers, data loaders, restarts, etc.
  2. You need exotic topologies or custom networking that Modal doesn’t expose as a first-class primitive yet.

    • Very tight, custom interconnect setups and cluster topologies that require low-level control.
  3. Your workload is operation-heavy but compute-light (e.g., you mostly move data around, do a bit of CPU-bound work, and need massive bandwidth to custom storage).

    • In that scenario, you might optimize cost with CPU-heavy, bandwidth-optimized instances instead of GPU-centric serverless.

The rule of thumb: if you can guarantee high utilization and you’re willing to own cluster lifecycle and scheduling, bare-metal / “GPU cloud” will usually win on raw dollars per FLOP. If you can’t guarantee that utilization, or you care about iteration speed more than cluster ownership, serverless often wins on both money and time.

How It Works (Step-by-Step)

Here’s a concrete way to decide “Modal or CoreWeave/Lambda Labs?” for a given workload.

  1. Measure or estimate your real utilization

    • If you already run on CoreWeave/Lambda Labs, look at:
      • GPU utilization over a week (cloud metrics, nvidia-smi dmon, etc.).
      • Idle time at night, weekends, off-peak hours.
    • If you’re greenfield:
      • Estimate traffic patterns: RPS, batch windows, eval schedules.
      • Ask: “Do we need these GPUs 24/7, or only in bursts?”
  2. Approximate your effective cost per unit of work

    • On always-on:
      • Compute total bill for a month.
      • Estimate units of work: total tokens served, total training steps, total prompts evaluated.
      • Divide: $/token, $/step, or $/job.
    • On Modal:
      • Build a small POC function/endpoints.
      • Run realistic traffic or a batch through it.
      • Inspect the apps page for execution time and cost.
    • Compare: if Modal’s cost per token/step/job is lower at the scale you care about, serverless wins.
  3. Prototype the workload on Modal in Python code

    • Pick the correct primitive:
      • Long-lived, stateful model server → @app.cls + @modal.enter.
      • HTTP inference endpoint → @modal.fastapi_endpoint or @modal.web_server.
      • Batch jobs and evals → @app.function + .map() or .spawn().
      • Schedules → modal.Period / modal.Cron.
    • Keep things production-ready from day one:
      • Use modal.Retries for robustness.
      • Pin dependencies in Image builds.
      • Use Volumes for checkpoints or large cached assets.
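Step 2 above boils down to one division on each side. A sketch with hypothetical bills and token counts:

```python
def cost_per_unit(total_bill: float, units_of_work: float) -> float:
    """Dollars per token, step, or job: the only number worth comparing."""
    return total_bill / units_of_work

# Always-on: a $12,000 monthly cluster bill that served 2,000M tokens
on = cost_per_unit(12_000, 2_000)   # $/1M tokens
# Modal POC: $90 of metered execution time for 100M tokens
sl = cost_per_unit(90, 100)         # $/1M tokens
print(f"always-on: ${on:.2f}/1M tok  modal: ${sl:.2f}/1M tok")
```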

Example: a training+eval loop where evals run on separate worker GPUs via Modal:

import modal

app = modal.App("train-with-modal-evals")

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers")
)

@app.function(gpu="A10G", image=image, timeout=60 * 30)
def run_eval(model_ckpt_path: str, dataset_shard: str) -> float:
    # Load weights from your storage / Volume, then eval over dataset_shard
    score = ...
    return score

def train_step():
    # One step of your training loop, running on your existing infra
    ...

def main_training_loop(num_epochs: int, current_ckpt_path: str):
    for epoch in range(num_epochs):
        train_step()
        # Offload evals to Modal worker GPUs, fanned out one call per shard
        calls = [
            run_eval.spawn(current_ckpt_path, shard)
            for shard in get_dataset_shards()
        ]
        scores = [c.get() for c in calls]
        log_eval_scores(scores)

You might still train on CoreWeave/Lambda Labs, but push evals to Modal when you need burst capacity without reserving extra GPUs.

Common Mistakes to Avoid

  • Comparing list prices instead of utilization-adjusted cost.
    Always-on might be “cheaper per hour,” but if you’re at 10–20% utilization, your real cost per token/step is worse than a higher list price on Modal that only bills busy time. Always normalize.

  • Underestimating operational drag.
    Building your own autoscaling, queues, retry logic, and endpoint hardening on top of CoreWeave/Lambda Labs has a real cost: engineer time, failed deploys, and outages. Modal bakes a lot of this into the primitives (.map(), modal.Retries, Proxy Auth Tokens, SOC2/HIPAA, data residency controls).

  • Ignoring cold starts and model load patterns.
    Naive serverless setups pay a heavy cold start tax. On Modal, you load weights once per container (@modal.enter) and keep them warm across many calls, which is how you get sub-second cold starts and low per-request overhead. Don’t design as if every call reloads the model; that’s not how Modal works.

Real-World Example

Imagine an AI product with:

  • Daytime load: 100 RPS on a 7B LLM with strict p95 latency.
  • Nighttime load: 1–2 RPS, mostly automated jobs.
  • Weekends: ~10–20% of weekday traffic.

On always-on CoreWeave/Lambda Labs, you might run:

  • A cluster sized for 150 RPS to handle spikes.
  • GPUs idle or lightly loaded ~60–70% of the week.
  • A monthly bill sized for peak, not average.

On Modal, you:

  • Wrap the LLM server in a @app.cls with gpu="A10G" or gpu="A100" depending on needs.
  • Expose an HTTP endpoint via @modal.fastapi_endpoint.
  • Let autoscaling add replicas as RPS climbs and scale back to zero when traffic disappears.
  • Handle massive spikes (e.g. evals, batch jobs) with .map() for short bursts at high concurrency.

Result:

  • Your cost curve tracks actual traffic instead of peak capacity.
  • You get integrated logs and metrics in the Modal dashboard.
  • You reduce operational overhead: no cluster lifecycle, no hand-rolled autoscaler, no juggling jobs vs endpoints vs eval scripts.

Pro Tip: If you’re not sure where the crossover point is, run the same batch workload on both platforms for a day or two. On Modal, record total GPU execution time and cost from the apps page. On CoreWeave/Lambda Labs, take your cluster cost for the same window and divide by units of work. The numbers will tell you whether your utilization is high enough to justify always-on.

Summary

Serverless GPUs on Modal are cheaper than always-on capacity from CoreWeave or Lambda Labs whenever:

  • Your workload is bursty, spiky, or scheduled instead of 24/7 steady.
  • You can’t keep GPUs busy at 40–60%+ utilization around the clock.
  • You value not running your own schedulers, autoscalers, and endpoint machinery.

Always-on GPUs win when you have large, stable, high-utilization training clusters and you’re happy to own the operational complexity. For everything else—LLM inference, agents, evals, batch jobs, sandboxes—Modal’s Python-first, serverless runtime tends to win on both cost per unit of work and developer throughput.

Next Step

Get Started