What’s the simplest way to run GPU inference only when requests come in so I’m not paying for idle GPUs?

Most teams hit the same wall the first time they move a GPU-heavy model toward production: the model serves a trickle of traffic, the GPU sits mostly idle, and the cloud bill looks like you’re training GPT-6. You want the model ready when requests come in, but you don’t want to rent an A100 24/7 just in case someone calls your endpoint.

Quick Answer: The simplest way to only pay for GPU inference when requests come in is to run your model on an autoscaled, serverless GPU endpoint that scales to zero between requests. On Modal, you write a normal Python function, add a decorator to request a GPU, and Modal spins up containers on demand with sub-second cold starts, so you pay for GPU time only while requests are running.

Why This Matters

GPU inference is usually bursty: eval spikes, user demos, internal tools, overnight batched jobs. Keeping a dedicated GPU server running “just in case” is basically paying a 90% idle tax. That’s fine at FAANG scale with reservation discounts; it’s a waste for most teams trying to get a product off the ground.

A serverless GPU model flips this: your ops model follows your traffic pattern. If no one is hitting your endpoint, your GPUs scale to zero. When you get a spike—say 10,000 evals during a new model rollout—you burst across many GPUs without touching Terraform, ASGs, or Kubernetes.

Key Benefits:

  • No idle GPU cost: Containers spin up on-demand and scale to zero, so you don’t pay for idle A10Gs / A100s sitting around waiting for a request.
  • Programmable autoscaling: You express hardware, concurrency, and scaling in Python; Modal handles scheduling across thousands of GPUs across clouds.
  • Low-latency cold starts: Modal’s AI-native runtime and Images keep container cold starts in the sub-second range, so scaling from zero doesn’t blow your latency budget.
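To make the idle-cost argument concrete, here is a back-of-the-envelope sketch. The hourly rates and busy hours below are made-up placeholders, not Modal's or any cloud's actual prices; plug in your own numbers.

```python
HOURS_IN_MONTH = 730.0  # average month length, used for the always-on baseline


def monthly_cost_always_on(hourly_rate: float) -> float:
    """Cost of a dedicated GPU that runs the whole month, busy or not."""
    return hourly_rate * HOURS_IN_MONTH


def monthly_cost_on_demand(hourly_rate: float, busy_hours: float) -> float:
    """Cost when you are billed only for hours spent serving requests."""
    return hourly_rate * busy_hours


def break_even_busy_hours(always_on_rate: float, on_demand_rate: float) -> float:
    """Busy hours per month at which both options cost the same."""
    return always_on_rate * HOURS_IN_MONTH / on_demand_rate


# Hypothetical inputs: a $2.00/h reserved GPU vs. a $3.00/h on-demand GPU
# that is actually busy 73 hours a month (10% utilization).
always_on = monthly_cost_always_on(hourly_rate=2.00)
on_demand = monthly_cost_on_demand(hourly_rate=3.00, busy_hours=73)
break_even = break_even_busy_hours(always_on_rate=2.00, on_demand_rate=3.00)

print(f"always-on: ${always_on:.2f}, on-demand: ${on_demand:.2f}")
print(f"break-even at {break_even:.0f} busy hours per month")
```

Under these assumed numbers, the on-demand option stays cheaper until the GPU is busy for roughly two thirds of the month, even at a higher hourly rate.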

Core Concepts & Key Points

Concept | Definition | Why it's important
--- | --- | ---
Serverless GPU inference | Running GPU-backed endpoints that are created on demand for requests and scaled down automatically when idle. | Lets you get GPU acceleration without managing clusters or paying for idle capacity.
Modal Functions & Apps | Python functions and classes decorated with Modal primitives (@app.function, @app.cls) that run in containers with specified hardware. | You describe your infrastructure as code: environment, GPU, scaling, and endpoints are all expressed in Python.
Scale-to-zero autoscaling | The runtime automatically deprovisions containers when they’re not serving traffic, and spins new ones up when requests arrive. | This is what ensures you only pay for GPUs while there’s actual work to do.
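The scale-to-zero idea can be illustrated with a toy simulation. This is a simplified model for intuition only, not Modal's actual scheduler; the function names and the `idle_ticks_before_scaledown` parameter are hypothetical.

```python
import math


def containers_needed(in_flight: int, max_inputs: int) -> int:
    """Containers required to serve `in_flight` requests at once."""
    return math.ceil(in_flight / max_inputs) if in_flight > 0 else 0


def simulate(in_flight_per_tick, max_inputs, idle_ticks_before_scaledown):
    """Return how many containers are running at each tick.

    Scale-up is immediate; scale-down waits until demand has been below
    the running count for `idle_ticks_before_scaledown` consecutive ticks.
    """
    running, idle_ticks, history = 0, 0, []
    for in_flight in in_flight_per_tick:
        needed = containers_needed(in_flight, max_inputs)
        if needed >= running:
            running, idle_ticks = needed, 0
        else:
            idle_ticks += 1
            if idle_ticks >= idle_ticks_before_scaledown:
                running, idle_ticks = needed, 0
        history.append(running)
    return history


# Hypothetical traffic: quiet, a burst, then silence.
traffic = [0, 3, 9, 9, 2, 0, 0, 0, 0]
history = simulate(traffic, max_inputs=4, idle_ticks_before_scaledown=2)
billed_ticks = sum(history)                   # container-ticks you pay for
always_on_ticks = max(history) * len(traffic)  # peak capacity held all the time

print(history, billed_ticks, always_on_ticks)
```

In this toy run the autoscaled setup is billed for 10 container-ticks, while holding peak capacity for the whole window would cost 27.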

How It Works (Step-by-Step)

At a high level, the simplest pattern is:

  • You define a Modal App and a function that loads your model and runs inference.
  • You attach a GPU spec and image (dependencies) to that function with a decorator.
  • You expose that function as a web endpoint, and Modal handles routing, autoscaling, and container lifecycle.

Here’s how that looks in practice.

1. Define your environment and GPU in Python

Let’s get the imports out of the way and define an Image with your dependencies:

# app.py
import modal

app = modal.App("gpu-inference-on-demand")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch==2.2.2",
        "transformers==4.40.0",
        "accelerate==0.29.0",
        "safetensors==0.4.3",
    )
)

You can (and should) pin versions tightly so the environment is reproducible.

2. Load the model once per container

For latency, you don’t want to reload weights on every request. Instead, you load them once per container using a class-based server with lifecycle hooks:

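# Import heavy dependencies only where the image provides them, so
# `modal deploy` works even if torch isn't installed locally.
with image.imports():
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

@app.cls(
    image=image,
    gpu="A10G",          # or "A100", "H100", etc.
)
@modal.concurrent(max_inputs=4)  # how many concurrent requests per container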
class LLMServer:
    @modal.enter()
    def setup(self):
        model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated repo: needs a Hugging Face token
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="cuda",
        )
        self.model.eval()

    @modal.method()
    def generate(self, prompt: str, max_new_tokens: int = 128) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.8,
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

What this buys you:

  • First request on a new container loads the model into GPU RAM once.
  • Subsequent requests reuse the weights already resident in GPU memory instead of reloading them from disk or the network on every call.
  • Containers come and go with traffic, but each sticks around to amortize the load cost across multiple requests.
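A quick worked example shows why amortizing the load matters. The load and inference times here are illustrative assumptions, not measurements of any particular model or GPU.

```python
def avg_latency_reload_every_request(load_s: float, infer_s: float) -> float:
    """Every request pays the full model load plus inference."""
    return load_s + infer_s


def avg_latency_load_once(load_s: float, infer_s: float, n_requests: int) -> float:
    """One load per container, spread across all requests it serves."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    return (load_s + n_requests * infer_s) / n_requests


# Assumed numbers: 30 s to load weights, 0.5 s per generation,
# 100 requests served by one container before it scales down.
LOAD_S, INFER_S, N = 30.0, 0.5, 100

naive = avg_latency_reload_every_request(LOAD_S, INFER_S)
amortized = avg_latency_load_once(LOAD_S, INFER_S, N)

print(f"reload every request: {naive:.1f} s average")
print(f"load once per container: {amortized:.1f} s average")
```

The more requests a container serves before it scales down, the closer the average gets to the pure inference time.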

3. Expose a serverless HTTP endpoint

Now you wrap the generate method in a web endpoint. This is where “run on GPUs only when requests come in” actually happens:

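# The web layer needs FastAPI but no GPU, so it gets its own lightweight image.
web_image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.function(image=web_image)
@modal.asgi_app()
def generate_endpoint():
    from fastapi import FastAPI
    from pydantic import BaseModel

    web_app = FastAPI()

    class GenerateRequest(BaseModel):
        prompt: str
        max_new_tokens: int = 128

    @web_app.post("/generate")
    async def generate(req: GenerateRequest):
        # .remote.aio() is the awaitable form of .remote(); the call runs
        # on a GPU container of LLMServer, not in this web container.
        result = await LLMServer().generate.remote.aio(
            req.prompt,
            max_new_tokens=req.max_new_tokens,
        )
        return {"completion": result}

    return web_app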

Deploy it:

modal deploy app.py

Modal will:

  • Create containers with the A10G GPU when HTTP requests hit /generate.
  • Keep a small pool warm under load.
  • Scale back to zero containers when there’s no traffic.
  • Route each request into a running container, or spin a new one in seconds if you’re scaling up.

From your perspective, it’s “just Python” plus modal deploy. From a cost perspective, you’ve turned “always-on GPU server” into “GPU micro-billing with autoscaling.”
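Once deployed, any HTTP client can call the endpoint. The sketch below uses only the standard library and assumes the request and response shapes from the handler above (`prompt`, `max_new_tokens`, `completion`); the base URL is a placeholder, so use the one printed by modal deploy.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, max_new_tokens: int = 128):
    """Build a POST request for the /generate route without sending it."""
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_generate(base_url: str, prompt: str, max_new_tokens: int = 128,
                  timeout: float = 120.0) -> str:
    """Send the request and return the generated text.

    The generous timeout leaves room for a cold start plus model load.
    """
    req = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["completion"]


# Usage (placeholder URL, replace with your deployment's URL):
# print(call_generate("https://your-workspace--your-app.example", "Hello!"))
```

If nothing is running when the first call arrives, that call absorbs the cold start; later calls within the warm window hit an already loaded model.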

Common Mistakes to Avoid

  • Reloading the model on every request:
    Doing AutoModelForCausalLM.from_pretrained(...) inside the endpoint handler kills latency and wastes GPU time. Use @app.cls with @modal.enter to load once per container and @modal.method for per-request methods.

  • Assuming serverless means infinite concurrency on one GPU:
    By default, a Modal container handles one input at a time. If you raise per-container concurrency too far (for example with @modal.concurrent(max_inputs=...)), requests contend for the same GPU and tail latency spikes. Keep the per-container limit modest, and let Modal scale out to more containers instead of routing everything to one.
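For a rough sense of how per-container concurrency translates into container count, Little's law (requests in flight = arrival rate × time per request) is a useful starting point. This is a sizing sketch with hypothetical numbers and names, not a description of how Modal's autoscaler decides.

```python
import math


def containers_for_load(requests_per_second: float,
                        seconds_per_request: float,
                        max_inputs_per_container: int) -> int:
    """Estimate containers needed to keep up with a steady request rate."""
    # Little's law: average number of requests being served at any moment.
    in_flight = requests_per_second * seconds_per_request
    return math.ceil(in_flight / max_inputs_per_container)


# Assumed load: 20 requests/s, each taking 1.5 s on the GPU.
modest = containers_for_load(20, 1.5, max_inputs_per_container=4)
crowded = containers_for_load(20, 1.5, max_inputs_per_container=30)

print(f"4 inputs per container  -> {modest} containers")
print(f"30 inputs per container -> {crowded} container")
```

The second configuration looks cheaper on paper, but all 30 in-flight requests would share one GPU, which is exactly the overload scenario described above.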

Real-World Example

Imagine you’re building an internal eval tool for a new LLM. During the day, you have bursts of activity as people run eval suites, then hours of silence. You don’t want to keep an A100 or H100 up 24/7, but you need the runs to be fast when they happen.

With the pattern above:

  • Your eval web UI calls /generate on Modal.
  • During an eval spike, Modal spins up enough GPU containers to hit your latency/throughput target. You can fan out across many GPUs simply by calling LLMServer().generate.map(prompts) and letting Modal run them in parallel.
  • When the team goes home, containers naturally drain and scale to zero—no cron job to shut things down, no admins logging in to stop instances.

You’ve effectively got “infinite eval cluster” semantics when you need them, and zero GPU spend when you don’t.

Pro Tip: For heavy eval or batch workloads, use .spawn() or .map() on the same GPU-bound LLMServer methods. You get queued jobs, fan-out across many GPUs, and still only pay while those jobs run. Combine this with modal.Retries to handle flaky upstream APIs or transient GPU errors without manual babysitting.

Summary

If you want the simplest way to run GPU inference only when requests come in—and stop paying for idle GPUs—the pattern is:

  • Express your environment and GPU in code via a Modal Image and @app.cls.
  • Load your model once per container with @modal.enter, expose per-request methods with @modal.method.
  • Publish a serverless HTTP endpoint using @modal.fastapi_endpoint or @modal.asgi_app.

You get sub-second cold starts, autoscaling to and from zero, and a pricing model that tracks actual work, not uptime. No clusters, no custom schedulers, and no “just to be safe” GPU reservations.
