Modal vs CoreWeave (or Lambda Labs): when is serverless cheaper than running always-on GPUs?

Quick Answer: Serverless GPUs on Modal are cheaper than always-on GPUs from CoreWeave or Lambda Labs whenever your workload is spiky, batched, or idle a lot of the time. If your GPUs aren’t busy at a high, steady utilization (think 40–60%+ around the clock), you’re usually paying more for capacity than for actual work—and that’s exactly where serverless wins.

Most teams move to CoreWeave or Lambda Labs because they’re tired of fighting general-purpose clouds for GPU capacity. But once you get past quotas and spot interruptions, a different problem shows up: utilization. Your H100 might be “cheap per hour,” but if it’s sitting at 5–10% utilization between evals, fine-tunes, or traffic spikes, your effective cost per token/step/request explodes.

Modal takes a different bet: you keep defining your infrastructure in Python, and we spin GPUs up and down around the actual work—LLM inference, fine-tuning runs, agent evals, batch jobs—so you’re charged for execution time, not for idle capacity. The question isn’t “Is Modal cheaper per GPU-hour?” but “Is my workload bursty enough that paying per busy second beats renting a box 24/7?”

This post walks through how to reason about that, with concrete rules of thumb you can plug your own numbers into.

Key Benefits:

  • Lower cost for bursty workloads: You pay for GPU time when your functions actually run, instead of 24/7 instance rent.
  • Fewer ops headaches at high scale: Modal’s autoscaling and multi-cloud GPU pool handle spiky traffic without you building schedulers or autoscalers.
  • Code-first control, not YAML or glue: Environment, scaling, and endpoints are Python functions and decorators you can run locally (modal run) and deploy with modal deploy.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| GPU utilization | The fraction of time a GPU is actively doing useful work (e.g. training, inference) versus sitting idle. | Low utilization is where always-on GPUs become expensive; serverless wins when you can’t keep GPUs busy. |
| Burstiness | How “peaky” your workload is: long idle periods, short intense spikes, or scheduled batches instead of steady 24/7 traffic. | The more bursty the workload, the more money you lose on always-on GPUs—and the more serverless saves. |
| Effective cost per unit work | What you pay per token, per training step, or per job—not per hour. | This is the only metric that really matters when comparing Modal vs CoreWeave / Lambda Labs. |

Why This Matters

Most infra debates get stuck at list price: “an A100 costs X/hr on CoreWeave, Y/hr on Lambda Labs, and Z/hr on Modal-equivalent hardware.” But you don’t ship GPU-hours. You ship: “this endpoint serves 100 RPS with p95 < 300ms,” or “this trainer finishes a fine-tune in 2 hours.”

For that, utilization and elasticity matter more than list price:

  • If you can keep a GPU cluster >60–70% busy 24/7, always-on capacity (CoreWeave, Lambda Labs) can be cost-efficient.
  • If your usage is spiky, or you carry a lot of “just in case” capacity, you’re burning money on idle GPUs—and a serverless runtime like Modal will usually be cheaper at the same or better latency.

The nice part is you can decide this empirically. You don’t need faith; you need a back-of-the-envelope model and a small POC.

How to Think About Cost: A Simple Model

Let’s define three variables:

  • C_on — hourly cost of an always-on GPU (CoreWeave / Lambda Labs).
  • C_sl — effective hourly cost of the same GPU-type capacity on Modal, when fully busy.
  • U_real — your real-world utilization (0–1) on the always-on GPU.

Your effective cost per hour of useful work with always-on GPUs is:

Cost_on_effective = C_on / U_real

On Modal, you pay close to:

Cost_sl_effective ≈ C_sl

because we’re only billing while your containers are actually running work (inference calls, training steps, batch tasks). You don’t pay when they’re scaled to zero.

Modal becomes cheaper when:

C_on / U_real > C_sl
→ U_real < C_on / C_sl

You don’t need exact numbers to use this; a rough ratio is enough. If the serverless-equivalent GPU-hour is, say, 1.5–2× the “raw” bare-metal price, then any workload with <50–67% real utilization is likely cheaper on Modal.

And in practice? Many teams discover their “busy” GPUs are idling at 5–20% when you factor in nights, weekends, and slow intervals.
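To make the crossover concrete, here is a tiny calculator for the model above. The prices are illustrative placeholders, not quotes from any provider; plug in your own numbers.

```python
def breakeven_utilization(c_on: float, c_sl: float) -> float:
    """Utilization below which serverless (C_sl) beats always-on (C_on)."""
    return c_on / c_sl

def effective_hourly_cost(c_on: float, u_real: float) -> float:
    """What an always-on GPU really costs per hour of useful work."""
    return c_on / u_real

c_on = 2.00  # always-on $/GPU-hr (placeholder)
c_sl = 3.50  # serverless $/GPU-hr when fully busy (placeholder)

u_star = breakeven_utilization(c_on, c_sl)
print(f"Serverless wins below {u_star:.0%} utilization")
print(f"At 15% busy, always-on really costs ${effective_hourly_cost(c_on, 0.15):.2f}/hr of work")
```

With these placeholder prices, the break-even point lands around 57% utilization: below that, always-on capacity costs more per hour of actual work.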

When Modal Is Usually Cheaper Than CoreWeave / Lambda Labs

Let’s anchor this to actual workloads instead of algebra.

1. Spiky LLM inference and agents

You’re running:

  • Chat endpoints where traffic spikes during working hours.
  • Agents or MCP servers that do long, GPU-heavy calls only when the user hits “Run.”
  • Model-based evals that run thousands of queries in a burst and then go quiet.

On always-on GPUs, you either:

  • Overprovision to handle spikes, and pay for idle capacity during troughs.
  • Underprovision and blow up latency / queue depth during spikes.

On Modal, you define a serverless endpoint in Python and let autoscaling absorb the spike:

import modal

app = modal.App("qa-inference")

image = (
    modal.Image.debian_slim()
    .pip_install("transformers", "torch", "accelerate", "fastapi[standard]")
)

@app.cls(gpu="A10G", image=image)
class QAModel:
    @modal.enter()
    def load_model(self):
        # Runs once per container; weights stay in memory across requests
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.2"
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.2",
            torch_dtype="auto",
            device_map="auto",
        )

    @modal.method()
    def generate(self, question: str) -> str:
        import torch

        prompt = f"Answer concisely:\n\n{question}"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            out = self.model.generate(**inputs, max_new_tokens=128)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)

@app.function(image=image)
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI

    web_app = FastAPI()

    @web_app.post("/qa")
    async def qa(body: dict):
        answer = await QAModel().generate.remote.aio(body["question"])
        return {"answer": answer}

    return web_app

To deploy:

modal deploy qa_inference.py

You get:

  • Sub-second cold starts (weights loaded once per container via @modal.enter).
  • Autoscaling from zero to thousands of replicas in minutes.
  • No GPU rent when nobody is asking questions.

If your traffic pattern is “weekday daytime spike, nighttime LLM ghost town,” Modal almost always beats always-on boxes on cost.
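Calling the deployed endpoint is just an HTTP POST. The URL below is a made-up placeholder; `modal deploy` prints the real one for your workspace.

```python
import json
import urllib.request

# Placeholder URL -- substitute the one `modal deploy` prints for your app
url = "https://your-workspace--qa-inference-fastapi-app.modal.run/qa"

req = urllib.request.Request(
    url,
    data=json.dumps({"question": "What is serverless?"}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment once deployed
# print(json.load(response)["answer"])
```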

2. Batch fan-out jobs and evals

Now consider:

  • Model-based evals against a large dataset.
  • Audio transcription of backlogs (“transcribe 5,000 podcasts by tomorrow”).
  • Embedding generation / feature extraction.

On CoreWeave/Lambda Labs, you:

  • Stand up a cluster.
  • Run the job.
  • Either tear it down afterwards (more automation for you), or let it sit idle “just in case” you need to rerun.

On Modal, one function plus .map() gives you elastic parallelism on-demand:

import modal

app = modal.App("batch-evals")

image = (
    modal.Image.debian_slim()
    .pip_install("openai", "tiktoken")
)

@app.function(
    image=image,
    gpu="A10G",
    timeout=60 * 60,  # max 1 hour per task
    retries=modal.Retries(max_retries=3),
)
def eval_example(prompt: str) -> float:
    # Run your eval logic against a model here
    score = ...  # compute a numerical score for this prompt
    return score

@app.local_entrypoint()
def main():
    prompts = load_prompts_somewhere()
    # Fan out across hundreds/thousands of containers; .map() streams
    # results back as tasks complete.
    scores = list(eval_example.map(prompts))
    persist(scores)

You pay for GPU time while the evals run. When they’re done, the GPUs disappear. If your evals run 2 hours/day, you’re not paying for 24.

This is where customers report “we transcribed podcasts on hundreds of GPUs in a fraction of the time” and still paid less than leaving a smaller cluster idle for days.
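A quick sanity check of that math, with made-up prices and job sizes:

```python
def serverless_batch_cost(n_tasks: int, hrs_per_task: float, c_sl: float) -> float:
    # You pay for GPU-hours actually executing, regardless of parallelism
    return n_tasks * hrs_per_task * c_sl

def always_on_batch_cost(n_gpus: int, hrs_reserved: float, c_on: float) -> float:
    # You pay for the full reservation window, busy or idle
    return n_gpus * hrs_reserved * c_on

# 5,000 transcriptions at ~3 GPU-minutes each, at a placeholder $3.50/GPU-hr
sl = serverless_batch_cost(5_000, 3 / 60, c_sl=3.50)
# vs. a 10-GPU cluster kept up for a week "just in case", at $2.00/GPU-hr
on = always_on_batch_cost(10, 24 * 7, c_on=2.00)
print(f"serverless: ${sl:,.0f}  always-on: ${on:,.0f}")
```

Notice that the serverless number is the same whether the 5,000 tasks run on 10 GPUs or 500; only the wall-clock time changes.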

3. Sandboxes and untrusted code

Another pattern: “we give users a way to run arbitrary compute-heavy code (maybe with GPUs) and we want to isolate them.”

You could:

  • Build a multi-tenant scheduler on top of CoreWeave/Lambda Labs.
  • Keep a pool of GPUs alive just in case someone submits a job.

Or you let Modal Sandboxes handle this:

import modal

app = modal.App.lookup("sandboxed-gpu-code", create_if_missing=True)

image = modal.Image.debian_slim().pip_install("torch", "numpy")

def run_user_code(code: str) -> str:
    # One isolated container per submission; it exists only while running
    sb = modal.Sandbox.create(app=app, image=image, gpu="A10G", timeout=60 * 10)
    try:
        proc = sb.exec("python", "-c", code)
        return proc.stdout.read()
    finally:
        sb.terminate()

Each call is a separate sandboxed container with gVisor-based isolation. Again: no baseline capacity cost—just per-execution cost.

4. “Lab-style” workflows, notebooks, and experiments

A lot of teams rent long-lived GPUs for experimentation:

  • Jupyter/VSCode connected to an always-on instance.
  • Manual “start/stop” patterns that realistically drift into “always running.”

On Modal, you can:

  • Use Modal Notebooks or modal shell / modal run to prototype.
  • Spin up GPUs for minutes at a time while iterating.
  • Scale experiments to many GPUs using the same code you used interactively.

If you’re not actively training 24/7, this is almost always cheaper than keeping a big A100/H100 idling for “convenience.”

When CoreWeave / Lambda Labs Can Be Cheaper Than Modal

You should also be explicit about where always-on wins. There are real cases.

Modal is probably not cheaper if:

  1. You have extremely steady, high-utilization training.

    • Example: an H100 cluster training a frontier model 24/7 for months.
    • Utilization is 70–90%+ on every GPU, around the clock.
    • You’re comfortable managing your own schedulers, data loaders, restarts, etc.
  2. You need exotic topologies or custom networking that Modal doesn’t expose as a first-class primitive yet.

    • Very tight, custom interconnect setups and cluster topologies that require low-level control.
  3. Your workload is operation-heavy but compute-light (e.g., you mostly move data around, do a bit of CPU-bound work, and need massive bandwidth to custom storage).

    • In that scenario, you might optimize cost with CPU-heavy, bandwidth-optimized instances instead of GPU-centric serverless.

The rule of thumb: if you can guarantee high utilization and you’re willing to own cluster lifecycle and scheduling, bare-metal / “GPU cloud” will usually win on raw dollars per FLOP. If you can’t guarantee that utilization, or you care about iteration speed more than cluster ownership, serverless often wins on both money and time.

How It Works (Step-by-Step)

Here’s a concrete way to decide “Modal or CoreWeave/Lambda Labs?” for a given workload.

  1. Measure or estimate your real utilization

    • If you already run on CoreWeave/Lambda Labs, look at:
      • GPU utilization over a week (cloud metrics, nvidia-smi dmon, etc.).
      • Idle time at night, weekends, off-peak hours.
    • If you’re greenfield:
      • Estimate traffic patterns: RPS, batch windows, eval schedules.
      • Ask: “Do we need these GPUs 24/7, or only in bursts?”
  2. Approximate your effective cost per unit of work

    • On always-on:
      • Compute total bill for a month.
      • Estimate units of work: total tokens served, total training steps, total prompts evaluated.
      • Divide: $/token, $/step, or $/job.
    • On Modal:
      • Build a small POC function/endpoints.
      • Run realistic traffic or a batch through it.
      • Inspect the apps page for execution time and cost.
    • Compare: if Modal’s cost per token/step/job is lower at the scale you care about, serverless wins.
  3. Prototype the workload on Modal in Python code

    • Pick the correct primitive:
      • Long-lived, stateful model server → @app.cls + @modal.enter.
      • HTTP inference endpoint → @modal.fastapi_endpoint or @modal.web_server.
      • Batch jobs and evals → @app.function + .map() or .spawn().
      • Schedules → modal.Period / modal.Cron.
    • Keep things production-ready from day one:
      • Use modal.Retries for robustness.
      • Pin dependencies in Image builds.
      • Use Volumes for checkpoints or large cached assets.
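Step 2 above boils down to one division on each side. A sketch with hypothetical bills and token counts:

```python
def cost_per_unit(total_bill: float, units_of_work: float) -> float:
    """Dollars per token, step, or job: the only number worth comparing."""
    return total_bill / units_of_work

# Always-on: a $12,000 monthly cluster bill that served 2,000M tokens
on = cost_per_unit(12_000, 2_000)   # $/1M tokens
# Modal POC: $90 of metered execution time for 100M tokens
sl = cost_per_unit(90, 100)         # $/1M tokens
print(f"always-on: ${on:.2f}/1M tok  modal: ${sl:.2f}/1M tok")
```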

Example: a training+eval loop where evals run on separate worker GPUs via Modal:

import modal

app = modal.App("train-with-modal-evals")

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers")
)

@app.function(gpu="A10G", image=image, timeout=60 * 30)
def run_eval(model_ckpt_path: str, dataset_shard: str) -> float:
    # Load weights from your storage / Volume, then eval over dataset_shard
    score = ...
    return score

def train_step():
    # One step of your training loop, running on your existing infra
    ...

def main_training_loop(num_epochs: int, current_ckpt_path: str):
    for epoch in range(num_epochs):
        train_step()
        # Offload evals to Modal worker GPUs, fanned out one call per shard
        calls = [
            run_eval.spawn(current_ckpt_path, shard)
            for shard in get_dataset_shards()
        ]
        scores = [c.get() for c in calls]
        log_eval_scores(scores)

You might still train on CoreWeave/Lambda Labs, but push evals to Modal when you need burst capacity without reserving extra GPUs.

Common Mistakes to Avoid

  • Comparing list prices instead of utilization-adjusted cost.
    Always-on might be “cheaper per hour,” but if you’re at 10–20% utilization, your real cost per token/step is worse than a higher list price on Modal that only bills busy time. Always normalize.

  • Underestimating operational drag.
    Building your own autoscaling, queues, retry logic, and endpoint hardening on top of CoreWeave/Lambda Labs has a real cost: engineer time, failed deploys, and outages. Modal bakes a lot of this into the primitives (.map(), modal.Retries, Proxy Auth Tokens, SOC2/HIPAA, data residency controls).

  • Ignoring cold starts and model load patterns.
    Naive serverless setups pay a heavy cold start tax. On Modal, you load weights once per container (@modal.enter) and keep them warm across many calls, which is how you get sub-second cold starts and low per-request overhead. Don’t design as if every call reloads the model; that’s not how Modal works.

Real-World Example

Imagine an AI product with:

  • Daytime load: 100 RPS on a 7B LLM with strict p95 latency.
  • Nighttime load: 1–2 RPS, mostly automated jobs.
  • Weekends: ~10–20% of weekday traffic.

On always-on CoreWeave/Lambda Labs, you might run:

  • A cluster sized for 150 RPS to handle spikes.
  • GPUs idle or lightly loaded ~60–70% of the week.
  • A monthly bill sized for peak, not average.

On Modal, you:

  • Wrap the LLM server in a @app.cls with gpu="A10G" or gpu="A100" depending on needs.
  • Expose an HTTP endpoint via @modal.fastapi_endpoint.
  • Let autoscaling add replicas as RPS climbs and scale back to zero when traffic disappears.
  • Handle massive spikes (e.g. evals, batch jobs) with .map() for short bursts at high concurrency.

Result:

  • Your cost curve tracks actual traffic instead of peak capacity.
  • You get integrated logs and metrics in the Modal dashboard.
  • You reduce operational overhead: no cluster lifecycle, no hand-rolled autoscaler, no juggling jobs vs endpoints vs eval scripts.

Pro Tip: If you’re not sure where the crossover point is, run the same batch workload on both platforms for a day or two. On Modal, record total GPU execution time and cost from the apps page. On CoreWeave/Lambda Labs, take your cluster cost for the same window and divide by units of work. The numbers will tell you whether your utilization is high enough to justify always-on.

Summary

Serverless GPUs on Modal are cheaper than always-on capacity from CoreWeave or Lambda Labs whenever:

  • Your workload is bursty, spiky, or scheduled instead of 24/7 steady.
  • You can’t keep GPUs busy at 40–60%+ utilization around the clock.
  • You value not running your own schedulers, autoscalers, and endpoint machinery.

Always-on GPUs win when you have large, stable, high-utilization training clusters and you’re happy to own the operational complexity. For everything else—LLM inference, agents, evals, batch jobs, sandboxes—Modal’s Python-first, serverless runtime tends to win on both cost per unit of work and developer throughput.

Next Step

Get Started