serverless GPU inference providers with low cold starts for large models

Most teams discover the hard way that “serverless GPUs” are easy until they care about two things at once: large models and low cold starts. Running a 70B LLM or a heavyweight diffusion model on demand is trivial in a benchmark, but keeping latency in the 200–500 ms range while autoscaling under spiky traffic is where most providers fall over.

Quick Answer: Serverless GPU inference with low cold starts for large models requires a platform that pre-warms containers, keeps model weights hot in memory, and can burst to many GPUs without quotas or manual orchestration. Modal is one of the few providers that gives you this combination in a Python-first, code-defined workflow, alongside other options like managed LLM APIs and generic container platforms that trade off flexibility, latency, or control.

Why This Matters

If your product is driven by LLMs, vision models, or generative pipelines, “serverless” and “GPU” are only half the story. The real bottlenecks are cold starts and capacity:

- A single cold start that takes 20–40 seconds to load a 70B model can blow your SLO and trigger retries, timeouts, and user churn.
- Spiky workloads (evals, batch generations, RL loops) either force you to permanently overprovision GPUs or accept intermittent failures.
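
To see where a figure like 20–40 seconds can come from, a rough back-of-the-envelope estimate helps. The sketch below assumes 16-bit weights (2 bytes per parameter) and two hypothetical storage throughputs; the function name and the throughput values are illustrative assumptions, not measurements of any provider.

```python
def weight_load_seconds(params_billions: float, bytes_per_param: int, read_gb_per_s: float) -> float:
    """Estimate the seconds needed to read model weights at a given throughput."""
    if read_gb_per_s <= 0:
        raise ValueError("read_gb_per_s must be positive")
    # Billions of parameters times bytes per parameter gives gigabytes.
    size_gb = params_billions * bytes_per_param
    return size_gb / read_gb_per_s


# 70B parameters at 2 bytes each is about 140 GB of weights.
slow = weight_load_seconds(70, 2, read_gb_per_s=4.0)   # hypothetical 4 GB/s storage
fast = weight_load_seconds(70, 2, read_gb_per_s=20.0)  # hypothetical 20 GB/s storage
print(f"{slow:.0f} s at 4 GB/s, {fast:.0f} s at 20 GB/s")
```

Under these assumptions the read alone takes 35 seconds on the slower storage and 7 seconds on the faster one, before any container boot or runtime initialization is counted.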

You want infrastructure that behaves like a fast local dev loop: sub-second cold starts, predictable tail latencies, and enough elasticity that you don’t think about quotas. That’s the bar for serverless GPU inference that actually holds up in production.

Key Benefits:

- Lower latency for large models: Keep model weights hot in memory and containers warm so most requests avoid full reinitialization.
- Elastic GPU scaling without quotas: Burst to hundreds or thousands of GPUs across clouds for evals, batch jobs, and traffic spikes.
- Code-defined infra, not YAML: Describe hardware, scaling, and endpoints in Python so you can ship and iterate quickly without wrestling with bespoke orchestration.

Core Concepts & Key Points

Concept	Definition	Why it's important
Serverless GPU inference	Running model inference on demand on GPU-backed containers, with the platform handling provisioning, scaling, and teardown.	You don’t manage nodes, ASGs, or GPU reservations; you just define functions or endpoints and let the platform scale them.
Cold starts	The time it takes to bring up a new container, initialize the runtime, and load model weights before serving a request.	For large models, this can be tens of seconds if not optimized; keeping it sub-second or low single-digit seconds is what makes an API feel “instant.”
Large-model friendly architecture	A runtime that’s designed around big weights (LLMs, diffusion, multimodal) with fast storage, smart scheduling, and long-lived containers.	Without this, you get repeated weight loads, GPU thrash, and unpredictable tail latencies as the platform churns containers.
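
The link between cold starts and tail latency can be made concrete with a small simulation. This is a sketch with hypothetical numbers (0.3 s warm latency, a 30 s cold start, 1000 requests); `percentile` and `simulate` are illustrative helpers, not part of any platform.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def simulate(total: int, cold: int, warm_s: float, cold_s: float) -> list[float]:
    """Latencies for `total` requests, of which `cold` also pay the cold-start penalty."""
    return [warm_s + cold_s] * cold + [warm_s] * (total - cold)


rare = simulate(1000, cold=5, warm_s=0.3, cold_s=30.0)     # 0.5% of requests are cold
common = simulate(1000, cold=20, warm_s=0.3, cold_s=30.0)  # 2% of requests are cold

print("p99 with 0.5% cold:", percentile(rare, 99))
print("p99 with 2% cold:  ", percentile(common, 99))
```

Once more than 1% of requests hit a cold container, the p99 is the cold-start latency itself, no matter how fast the warm path is.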

How It Works (Step-by-Step)

Providers that do well on cold starts for large models tend to converge on similar mechanics. Here’s how it typically works on Modal, which is built explicitly for this workload.

Define your environment and hardware in code

You describe your runtime image, dependencies, and GPU in Python. No YAML, no separate infra layer:

```
import modal

app = modal.App("llm-inference")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch",
        "transformers",
        "accelerate",
        "vllm",  # or your inference stack of choice
        "fastapi[standard]",  # needed by the web endpoint below
    )
)

@app.cls(
    gpu="H100:4",  # 70B weights in 16-bit are ~140 GB, more than one 80 GB H100 can hold
    image=image,
)
@modal.concurrent(max_inputs=8)  # number of concurrent requests per container
class LLMServer:
    @modal.enter()
    def load_model(self):
        # Runs once when the container starts, not once per request.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_name = "meta-llama/Meta-Llama-3-70B-Instruct"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

@app.function(image=image)
@modal.fastapi_endpoint()
async def infer(prompt: str):
    # Call the GPU-backed class remotely; warm containers skip load_model entirely.
    output = await LLMServer().generate.remote.aio(prompt)
    return {"output": output}
```

The key here is @app.cls plus @modal.enter(): weights load once per container, not per request. That’s the foundational cold-start optimization.
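
The effect of loading once per container can be seen without any GPU. The following is a plain-Python sketch, not Modal code: `FakeModel`, `stateless_handler`, and `WarmContainer` are hypothetical stand-ins that only count how often a "model" gets loaded.

```python
class FakeModel:
    """Hypothetical stand-in for a large model; counts how often it is loaded."""

    loads = 0

    def __init__(self) -> None:
        # A real load would read many gigabytes of weights here.
        FakeModel.loads += 1

    def generate(self, prompt: str) -> str:
        return prompt.upper()


def stateless_handler(prompt: str) -> str:
    # Anti-pattern: every request pays the full load cost.
    return FakeModel().generate(prompt)


class WarmContainer:
    """Mimics a long-lived container: load once at startup, then serve many requests."""

    def __init__(self) -> None:
        self.model = FakeModel()  # plays the role of the startup hook

    def handle(self, prompt: str) -> str:
        return self.model.generate(prompt)


prompts = [f"prompt {i}" for i in range(100)]

FakeModel.loads = 0
for p in prompts:
    stateless_handler(p)
stateless_loads = FakeModel.loads

FakeModel.loads = 0
container = WarmContainer()
for p in prompts:
    container.handle(p)
warm_loads = FakeModel.loads

print(stateless_loads, warm_loads)
```

For 100 requests the stateless handler loads the model 100 times and the warm container loads it once, which is the whole difference between paying the load cost per request and per container.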

Let the platform manage warm containers and autoscaling

Once you deploy (modal deploy llm_inference.py), Modal:
- Starts containers on H100s (or A10G/A100 etc.) with sub-second cold starts for the container itself.
- Keeps them hot across requests, so most calls skip model load entirely.
- Autoscales based on load: spins up more containers when QPS spikes and scales down to zero when idle, without you managing reservations or quotas.

Under the hood, this is where multi-cloud capacity and intelligent scheduling matter. Modal maintains a large pooled capacity of GPUs across clouds, and the scheduler can place your containers where GPU and storage locality give you fast model loading and low latency.
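
The scale-up and scale-to-zero behavior described above can be sketched as a simple sizing rule. This is a generic illustration, not Modal's scheduler: `desired_containers` and its parameters are hypothetical names, and a real autoscaler also accounts for startup time, idle timeouts, and placement.

```python
import math


def desired_containers(in_flight: int, per_container: int, max_containers: int) -> int:
    """Containers needed to serve `in_flight` requests at `per_container` concurrency."""
    if per_container <= 0:
        raise ValueError("per_container must be positive")
    if in_flight <= 0:
        return 0  # scale to zero when idle
    return min(max_containers, math.ceil(in_flight / per_container))


for load in (0, 8, 9, 10_000):
    print(load, "in flight ->", desired_containers(load, per_container=8, max_containers=100))
```

With 8 requests per container, 9 in-flight requests already need a second container, and very large bursts are capped at the configured maximum.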

Optimize for large models and production reality

With the basic pattern in place, you refine for large-model production:
- Pin dependencies tightly in your Image so you don’t get surprise regressions in torch or drivers.
- Use Volumes for cached assets if you need to store tokenizers, sharded checkpoints, or compiled kernels near the GPU.
- Add retries and timeouts with modal.Retries and function timeouts to keep bad actors and long requests from clogging your fleet.
- Use .map() or .spawn() for batch workloads—e.g., running evals on thousands of prompts in parallel.
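
The retry idea in the list above can be sketched in plain Python. This shows only the general pattern of bounded retries with exponential backoff; it is not how modal.Retries is implemented, and `call_with_retries` and its parameters are hypothetical names chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    initial_delay: float = 0.5,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays: list[float] = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, attempts["count"], delays)
```

Injecting `sleep` keeps the example instant to run; in production the bounded retry count matters as much as the backoff, since unbounded retries on a saturated fleet make the overload worse.
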
Example of a batch eval fan-out using the same model:
```
def score_output(output: str) -> float:
    # Placeholder: replace with your custom scoring logic.
    raise NotImplementedError

@app.function(image=image)
def evaluate_prompt(prompt: str) -> float:
    # Calls the GPU-backed class remotely; warm LLMServer containers reuse loaded weights.
    output = LLMServer().generate.remote(prompt)
    return score_output(output)

@app.function(image=image)
def run_eval(prompts: list[str]) -> list[float]:
    return list(evaluate_prompt.map(prompts))
```
evaluate_prompt.map(prompts) fans out across many containers, and each call reaches an LLMServer container that already holds the model in memory, which keeps the per-request cost dominated by the actual generation, not initialization.
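
For readers who want to try the fan-out shape locally, here is a rough standard-library analogue using a thread pool. It is only a sketch of the map-over-inputs pattern: `score` is a hypothetical toy scorer, and nothing here runs on GPUs or uses Modal.

```python
from concurrent.futures import ThreadPoolExecutor


def score(prompt: str) -> float:
    """Hypothetical toy scorer: fraction of characters that are alphabetic."""
    if not prompt:
        return 0.0
    return sum(c.isalpha() for c in prompt) / len(prompt)


def run_eval_local(prompts: list[str], workers: int = 4) -> list[float]:
    """Score prompts in parallel; pool.map yields results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, prompts))


eval_prompts = ["abcd", "ab12", "1234"]
scores = run_eval_local(eval_prompts)
print(scores)
```

The caller writes an ordinary map over inputs and gets results back in input order, while the pool decides how the work is spread across workers.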

Common Mistakes to Avoid

Treating “serverless” as “stateless”:
If you load weights on every request inside a plain function (no @app.cls and @modal.enter()), every request pays a model load measured in tens of seconds for large models. Always use long-lived containers for heavyweight models.
Ignoring capacity and quotas until launch day:
Some providers look serverless on the surface but are capacity-constrained behind the scenes. For large models, you want explicit guarantees: multi-cloud capacity pools, no fixed GPU quotas, and proven ability to run on “100s of GPUs in parallel” for real workloads.

Real-World Example

Imagine you’re shipping a coding agent that runs multi-turn conversations plus tool calls on a 70B model. Traffic is brutal: during US mornings, QPS spikes 20x; at night, it drops close to zero. You don’t want to hold 50 H100s idle all night, and you can’t tolerate 30-second cold starts when everyone logs in at 9am.

On Modal, you:

- Wrap the model server in @app.cls and load it once per container via @modal.enter().
- Expose it as an HTTP endpoint using @modal.fastapi_endpoint.
- Let the platform spin containers up and down across a multi-cloud GPU pool, keeping cold starts low because images are prebuilt, storage is close, and containers boot in sub-second timeframes.
- When your team runs massive evals or RL environments, you reuse the same image and hardware config, but fan out jobs with .map() or .spawn(), letting Modal schedule onto “thousands of GPUs across clouds”.

From your perspective, it’s just Python. From the user’s perspective, the agent feels instant—even when traffic jumps 10–20x in minutes.

Pro Tip: For your hottest endpoints, consider running a small, always-on baseline of containers to absorb sudden spikes with zero cold starts, then let Modal's autoscaling handle the burst layer. You get predictably low latency for the first N requests, while everything above the baseline still scales away when traffic disappears.
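
How large "the first N requests" is follows directly from the baseline size. The sketch below is illustrative arithmetic only: `split_spike` is a hypothetical helper, and it assumes every baseline container is idle when the spike arrives.

```python
def split_spike(burst_requests: int, baseline_containers: int, per_container: int) -> tuple[int, int]:
    """Split a sudden burst into requests served warm and requests that wait on new containers."""
    warm_capacity = baseline_containers * per_container
    warm = min(burst_requests, warm_capacity)
    return warm, burst_requests - warm


# Hypothetical baseline: 4 always-on containers, 8 concurrent requests each.
print(split_spike(20, baseline_containers=4, per_container=8))
print(split_spike(100, baseline_containers=4, per_container=8))
```

With that baseline, a burst of 20 requests is served entirely warm, while a burst of 100 leaves 68 requests waiting on newly started containers, which is the portion the burst layer has to cover.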

Summary

Serverless GPU inference for large models is only useful if you solve cold starts and capacity together. You want:

- A platform that keeps big weights hot in long-lived containers.
- Sub-second container cold starts and fast model loading from a storage layer designed for high throughput.
- Elastic GPU scaling across clouds without quotas, plus a runtime that’s programmable in Python instead of YAML-driven.

Modal is built around that exact workload: LLM inference, training/fine-tuning, batch evals, and sandboxed tools, all defined in code and backed by a multi-cloud GPU pool. If you’re fighting cold starts on large models today, it’s worth trying the @app.cls + @modal.enter() pattern and measuring the impact on tail latency.
