How do I set the GPU type (T4 vs A100 vs H100) and concurrency limits in Modal for an inference service?

Quick Answer: In Modal, you choose the GPU type directly in Python on your function or class (for example gpu="T4", gpu="A100", or gpu="H100"). You cap how far a function scales out with concurrency_limit on @app.function or @app.cls (renamed max_containers in newer Modal releases), and you set how many requests each container handles at once with @modal.concurrent(max_inputs=...) (allow_concurrent_inputs in older releases). Because everything is code-defined, you can tune GPU type, count, and per-container concurrency per endpoint and environment, then ship with modal deploy.

Why This Matters

If you’re running an inference service, “T4 vs A100 vs H100” and “how many requests in flight?” aren’t academic questions—they dictate latency, cost, and how often you get paged. The usual pattern on legacy infra is slow iteration: edit Terraform, wait for deploys, guess at autoscaling behavior, then try to debug cold starts. With Modal, you define GPU hardware and concurrency in Python next to the model code, so you can iterate fast, run load tests, and dial in throughput without fighting YAML or reservations.

Key Benefits:

Hardware in code: Pick T4, A10G, A100, or H100 inline on the function that needs it—no separate infra stack to babysit.
Explicit concurrency guardrails: Use concurrency_limit to cap how many containers a function scales out to, and @modal.concurrent(max_inputs=...) to cap in-flight calls per container, so you don’t overload VRAM or hit latency cliffs.
Scale-to-zero with real capacity: Modal’s multi-cloud pool plus instant autoscaling means you can get thousands of GPUs when you spike and pay nothing when idle.

Core Concepts & Key Points

Concept	Definition	Why it's important
GPU selection	Choosing the GPU model (e.g. `"T4"`, `"A10G"`, `"A100"`, `"H100"`) and optionally count (e.g. `"H100:8"`) in your Modal function or class.	Matches hardware to workload: T4 for cheap/light models, A100/H100 for large LLMs and vision models that care about latency and throughput.
Function-level concurrency	A per-function `concurrency_limit` (called `max_containers` in newer releases) that caps how many containers can run the function at once.	Bounds fan-out and spend; once the cap is reached, extra calls queue instead of starting more containers.
Container-level parallelism	Running multiple requests per container with `@modal.concurrent(max_inputs=...)` on an `@app.cls` server, optionally with your own batching or queueing logic.	Better GPU utilization, especially on big GPUs, by ensuring each GPU processes multiple in-flight requests or batched tensors.

How It Works (Step-by-Step)

At a high level, you:

Define your environment and model code in Python using a Modal Image.
Attach a GPU type and concurrency limit to a function or class with decorators.
Deploy as a web endpoint, then observe load, adjust GPU and concurrency, and redeploy.

Let’s go through it.

1. Define an inference function with a GPU (T4 vs A100 vs H100)

Pick the workload. Say you’re serving a transformer model for text or vision. The core pattern with Modal:

Use @app.function to turn a Python function into a scalable unit.
Specify gpu="T4" / "A10G" / "A100" / "H100" (case-insensitive).
Optionally include a GPU count: "H100:8" if you need multiple GPUs per container.
Set CPU and memory if needed (cpu= in cores, memory= as an integer number of MiB) for auxiliary work.

import modal

app = modal.App("gpu-inference-demo")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch",
        "transformers",
    )
)

@app.function(
    image=image,
    gpu="T4",       # or "A10G", "A100", "H100", "H100:8", etc.
    cpu=4,
    memory=16384,   # MiB (16 GiB); pass an integer, not a "16Gi" string
    timeout=600,    # seconds
)
def run_inference(input_text: str) -> str:
    # Demo only: this reloads the model on every call. Section 3 loads it once per container.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # for demo purposes
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

    tokens = tokenizer(input_text, return_tensors="pt").to("cuda")
    out = model.generate(**tokens, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

Swap GPU types by changing the gpu= argument and redeploying. For example:

Cost-sensitive / lighter model: gpu="T4"
Heavier models (7–13B) or diffusion: gpu="A10G" or "A100"
Large LLMs, low latency, or multi-GPU tensor parallelism: gpu="H100" or "H100:8"

Modal treats string GPU names case-insensitively, so "h100" and "H100" are equivalent.
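If GPU choices arrive from config files or CLI flags, it can help to validate them before deploying. The following is a minimal sketch and not part of Modal's API: parse_gpu_spec and GpuSpec are hypothetical helpers, and KNOWN_GPUS lists only the types discussed in this article, so extend it to whatever your workspace supports.

```python
from typing import NamedTuple

# Only the GPU types discussed in this article (an assumption; extend as needed).
KNOWN_GPUS = {"T4", "A10G", "A100", "H100"}


class GpuSpec(NamedTuple):
    kind: str
    count: int


def parse_gpu_spec(spec: str) -> GpuSpec:
    """Split a string like "h100:8" into a normalized (kind, count) pair."""
    kind, _, count = spec.partition(":")
    kind = kind.strip().upper()  # names are case-insensitive, so normalize
    if kind not in KNOWN_GPUS:
        raise ValueError(f"unknown GPU type: {spec!r}")
    n = int(count) if count else 1  # no ":N" suffix means a single GPU
    if n < 1:
        raise ValueError(f"GPU count must be at least 1: {spec!r}")
    return GpuSpec(kind, n)


if __name__ == "__main__":
    print(parse_gpu_spec("h100:8"))
    print(parse_gpu_spec("T4"))
```

The validated value can then be passed straight through as the gpu= string, for example f"{spec.kind}:{spec.count}".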

2. Set concurrency limits for the inference function

Now wrap this in an HTTP endpoint and apply concurrency controls. There are two main levels:

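Function-level concurrency: How many containers Modal will run for this function at once (concurrency_limit, called max_containers in newer releases). With the default of one input per container, this also bounds concurrent invocations.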
Endpoint-level routing: How many simultaneous HTTP requests you expect; Modal maps these into function calls.

Here’s a FastAPI-style endpoint with an explicit concurrency limit:

@app.function(
    # A slim image for the web container; it only routes requests, so no GPU here
    image=modal.Image.debian_slim().pip_install("fastapi[standard]"),
    concurrency_limit=64,  # at most 64 containers (max_containers in newer releases)
    timeout=30,
)
@modal.asgi_app()
def inference_service():
    # Import inside the function so only this container needs FastAPI installed
    from fastapi import FastAPI
    from pydantic import BaseModel

    web_app = FastAPI()

    class InferenceRequest(BaseModel):
        prompt: str

    @web_app.post("/generate")
    async def generate(req: InferenceRequest):
        # Await the GPU-backed function without blocking the event loop
        return {"output": await run_inference.remote.aio(req.prompt)}

    return web_app

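Here’s what concurrency_limit=64 does:

Caps the number of containers Modal will run for inference_service at 64. By default each container handles one input at a time, so this also bounds in-flight requests at 64.
Under load, Modal scales containers up until either:
- Capacity is enough to keep latency low, or
- It hits the limit and starts queueing calls.
The limit applies only to the decorated function; run_inference scales independently unless you give it its own concurrency_limit.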

You can:

Set higher values (e.g. 256, 512) if your model is light and you want more parallelism.
Set lower values (e.g. 8–16) for big LLMs or diffusion models, where every container holds an expensive GPU and you want to bound spend.
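A rough way to pick a starting value is Little's law: requests in flight ≈ arrival rate × latency. The sketch below is a back-of-the-envelope helper, not something Modal provides. required_containers is a hypothetical name, and it assumes steady traffic and a fixed number of requests handled per container.

```python
import math


def required_containers(qps: float, latency_s: float, inputs_per_container: int = 1) -> int:
    """Estimate containers needed to sustain `qps` at a given per-request latency."""
    if inputs_per_container < 1:
        raise ValueError("inputs_per_container must be at least 1")
    in_flight = qps * latency_s  # Little's law: L = lambda * W
    return max(1, math.ceil(in_flight / inputs_per_container))


if __name__ == "__main__":
    # 50 requests/s at 0.5 s each is 25 in flight; at 4 per container that is 7 containers.
    print(required_containers(50, 0.5, 4))
```

Treat the result as a floor for the limit and leave headroom above it, since real traffic is burstier than the steady-state assumption.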

If you want stricter per-container behavior (e.g. one request per GPU per container), use a class-based server and control parallelism inside the container.

3. Use a stateful class to load the model once and tune per-container concurrency

For serious inference, you usually want to load the model once per container and reuse it. Modal’s @app.cls gives you a stateful model server with lifecycle hooks:

@app.cls(
    image=image,
    gpu="H100",
    cpu=8,
    memory=65536,  # MiB (64 GiB)
    timeout=60,    # per-call timeout; set on the class, not on the method
)
@modal.concurrent(max_inputs=4)  # up to 4 in-flight calls per container
class LlmServer:
    @modal.enter()
    def load_model(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # Gated model: the container needs a Hugging Face token to download it
        model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="auto",
            device_map="cuda",  # requires the accelerate package in the image
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        # This method runs inside a container with the preloaded model
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        out = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)

Key things happening:

gpu="H100" gives each container a full H100. You can use "H100:2" if you want two GPUs per container.
@modal.enter() loads the model once when the container starts. Subsequent calls reuse it.
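@modal.concurrent(max_inputs=4) on the class constrains per-container concurrency (older releases spell this allow_concurrent_inputs=4 on @app.cls); @modal.method itself takes no concurrency argument. If a container is already handling 4 generate calls, extra calls will be queued or routed to other containers.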

You then expose this as an endpoint:

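@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.asgi_app()  # returns an ASGI app; web_server is for processes listening on a port
def web():
    from fastapi import FastAPI
    from pydantic import BaseModel

    app_ = FastAPI()
    server = LlmServer()

    class Req(BaseModel):
        prompt: str

    @app_.post("/generate")
    async def generate(req: Req):
        # Await the remote call so the event loop stays free for other requests
        return {"output": await server.generate.remote.aio(req.prompt)}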

    return app_

Now your knobs are:

gpu= on the class (T4, A10G, A100, H100, plus counts like "H100:8").
max_inputs on @modal.concurrent, which sets per-container concurrency.
The autoscaling behavior: as QPS grows, Modal starts more containers, each respecting its own per-container limit, up to any concurrency_limit set on the class.

Picking between T4, A100, and H100 for inference

A rough heuristic you can actually operationalize:

Use T4 when:
- Models are small (≤1–2B parameters) and latency requirements are lax.
- You care a lot about cost per token/image and less about raw throughput.
- You expect bursty but low-volume traffic.
Use A10G / A100 when:
- You’re serving mid-sized LLMs (7–13B) or Stable Diffusion-like workloads.
- You need sub-second to low-second latency and steady throughput.
- You want a good balance of price/performance.
Use H100 when:
- You’re serving large LLMs, running big batches, or doing multi-GPU sharding ("H100:8").
- You care a lot about tail latency under load and want high arithmetic intensity.
- You’re pushing lots of concurrent inferences or running RL/eval workloads with massive spikes.
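One way to sanity-check the heuristic is to estimate whether the weights fit in VRAM at all. The sketch below is an approximation under stated assumptions: the VRAM figures are the common per-card sizes (the A100 also ships in an 80 GB variant), the 1.2× headroom for KV cache and activations is a guess to tune for your workload, and gpus_that_fit is a hypothetical helper.

```python
# Per-GPU memory in GB. A100 is listed at its smaller 40 GB size (assumption).
GPU_VRAM_GB = {"T4": 16, "A10G": 24, "A100": 40, "H100": 80}


def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size: 1e9 params x bytes per param / 1e9 bytes per GB."""
    return params_billions * bytes_per_param


def gpus_that_fit(params_billions: float, bytes_per_param: float = 2.0,
                  headroom: float = 1.2) -> list[str]:
    """Return GPU types whose VRAM covers the weights plus a headroom factor."""
    need = weights_gb(params_billions, bytes_per_param) * headroom
    return [gpu for gpu, vram in GPU_VRAM_GB.items() if vram >= need]


if __name__ == "__main__":
    print(gpus_that_fit(1.5))                       # small model, fp16
    print(gpus_that_fit(13))                        # 13B, fp16
    print(gpus_that_fit(13, bytes_per_param=1.0))   # 13B, 8-bit weights
```

Fitting in memory is only the first filter; latency and throughput still need a load test on the candidates that pass.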

Since GPU type is just a Python argument, you can register three variants of the same function in one app:

gpu_choices = ["T4", "A100", "H100"]

for gpu in gpu_choices:
    @app.function(
        image=image,
        gpu=gpu,
        concurrency_limit=32,
        timeout=30,
        name=f"infer-{gpu.lower()}",
        serialized=True,  # recent Modal releases may require this alongside a custom name=
    )
    def infer_variant(payload: dict):
        ...

Then run a simple load test against each variant and compare latency and cost.
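To make that comparison concrete, a small summary helper is usually enough. This is a sketch with hypothetical names (percentile, cost_per_1k_requests); the hourly price is whatever you look up for each GPU type and pass in, since no prices are assumed here.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile, using integer math to avoid float rounding."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]


def cost_per_1k_requests(mean_latency_s: float, price_per_hour: float,
                         in_flight_per_gpu: int = 1) -> float:
    """GPU cost of 1,000 requests, assuming the GPU is busy only while serving."""
    price_per_second = price_per_hour / 3600
    return 1000 * mean_latency_s * price_per_second / in_flight_per_gpu


if __name__ == "__main__":
    latencies = [0.8, 1.1, 0.9, 1.4, 1.0]  # seconds, from your own load test
    print("p95:", percentile(latencies, 95))
    # price_per_hour below is a placeholder, not a real quote
    print("cost per 1k:", cost_per_1k_requests(sum(latencies) / len(latencies), 3.6))
```

Comparing p95 and cost per 1k requests side by side for each variant makes it easier to see when a faster GPU pays for itself.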

Concurrency patterns that work well on Modal

Three common patterns:

One request per container, simple function
- Use @app.function(gpu="T4", concurrency_limit=N) with no extra per-container management.
- Let Modal scale container count to handle load.
- Good for simple, stateless inference where load is modest.
Stateful server, limited per-container concurrency
- Use @app.cls(gpu="A100") with @modal.concurrent(max_inputs=N), where N is around 2–8 for heavy models.
- Each container keeps the model in VRAM and handles a few parallel calls.
- Modal scales container count horizontally as traffic increases.
Batching inside a container
- Still use @app.cls but implement an internal queue.
- Each generate call enqueues a request; a background loop builds batches and runs a single forward pass on the GPU.
- You might set a higher max_inputs (e.g. 64) and use batching to keep GPU utilization high.

Example skeleton for simple batching:

import asyncio
from collections import deque

@app.cls(
    image=image,
    gpu="A100",
    cpu=8,
    memory=65536,  # MiB (64 GiB)
)
@modal.concurrent(max_inputs=64)  # logical requests in flight per container
class BatchedLlmServer:
    @modal.enter()
    async def start(self):
        # load tokenizer + model, move to GPU as before
        ...
        # Create asyncio state inside the running loop, then start the batcher
        self.queue = deque()
        self.batch_event = asyncio.Event()
        self.batch_task = asyncio.create_task(self._batch_loop())

    async def _batch_loop(self):
        while True:
            await self.batch_event.wait()
            self.batch_event.clear()

            requests = []
            while self.queue and len(requests) < 16:  # max batch size
                requests.append(self.queue.popleft())

            if not requests:
                continue
            if self.queue:
                self.batch_event.set()  # leftovers: wake up again for the next batch

            prompts = [r["prompt"] for r in requests]
            # run a single batched forward pass without blocking the event loop
            outputs = await asyncio.to_thread(self._generate_batch, prompts)

            for out, r in zip(outputs, requests):
                r["future"].set_result(out)

    def _generate_batch(self, prompts):
        # tokenize + generate in batch
        ...

    @modal.method()
    async def generate(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        self.queue.append({"prompt": prompt, "future": fut})
        self.batch_event.set()
        return await fut

This pattern lets you:

Set a large max_inputs (logical requests per container).
Still control GPU pressure via batch sizes and sequence lengths.
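The queue-and-batch idea can be tried locally before wiring it into a container. Below is a minimal, Modal-free sketch using only asyncio; MicroBatcher, fake_model, and the 5 ms collection window are illustrative assumptions, and the uppercase "model" stands in for a real batched forward pass.

```python
import asyncio
from typing import Callable, List, Optional, Tuple


class MicroBatcher:
    """Collects concurrent submit() calls and runs them through run_batch together."""

    def __init__(self, run_batch: Callable[[List[str]], List[str]],
                 max_batch: int = 16, max_wait_s: float = 0.005) -> None:
        self._run_batch = run_batch
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: Optional[asyncio.Queue] = None
        self._worker: Optional[asyncio.Task] = None

    async def submit(self, prompt: str) -> str:
        if self._queue is None:
            # Create loop-bound state lazily, inside the running event loop.
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._loop())
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        self._queue.put_nowait((prompt, fut))
        return await fut

    async def _loop(self) -> None:
        assert self._queue is not None
        while True:
            batch = [await self._queue.get()]
            # Brief pause so concurrent callers can join this batch.
            await asyncio.sleep(self._max_wait_s)
            while len(batch) < self._max_batch and not self._queue.empty():
                batch.append(self._queue.get_nowait())
            prompts = [p for p, _ in batch]
            try:
                # Run the blocking batch call in a thread, off the event loop.
                outputs = await asyncio.to_thread(self._run_batch, prompts)
            except Exception as exc:
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def demo() -> Tuple[List[str], List[int]]:
    sizes: List[int] = []

    def fake_model(prompts: List[str]) -> List[str]:
        sizes.append(len(prompts))  # record how large each batch was
        return [p.upper() for p in prompts]

    batcher = MicroBatcher(fake_model, max_batch=4)
    outputs = await asyncio.gather(*(batcher.submit(f"p{i}") for i in range(6)))
    return list(outputs), sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The two knobs that matter are max_batch, which bounds GPU memory pressure, and max_wait_s, which trades a little latency for fuller batches.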

Common Mistakes to Avoid

Oversizing GPU for tiny models: Throwing an H100 at a 350M-parameter model with low QPS just burns money. Start with "T4" or "A10G" and only move up when you see saturation.
Mis-sizing per-container concurrency: Without @modal.concurrent, each container handles one call at a time, which can leave a large GPU underused; with max_inputs set too high, simultaneous generations thrash VRAM and inflate tail latency.
Not pinning dependencies: If you don’t pin versions in pip_install, you’ll get run-to-run drift that makes performance debugging painful—pin to exact versions for reproducibility.
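A cheap guard against unpinned dependencies is a check that runs before deploy. This is a sketch under assumptions: unpinned is a hypothetical helper, the version numbers are examples rather than recommendations, and it only looks for an exact == pin.

```python
# Example pins only; choose versions you have actually tested together.
REQUIREMENTS = [
    "torch==2.4.0",
    "transformers==4.44.0",
]


def unpinned(requirements: list[str]) -> list[str]:
    """Return the requirement strings that lack an exact '==' version pin."""
    return [req for req in requirements if "==" not in req]


if __name__ == "__main__":
    missing = unpinned(REQUIREMENTS)
    if missing:
        raise SystemExit(f"Unpinned dependencies: {missing}")
    print("all dependencies pinned")
```

The same list can then be passed to the image definition, for example pip_install(*REQUIREMENTS), so the check and the build share one source of truth.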

Real-World Example

Imagine you’re serving a 13B LLM with a hard SLA: P95 latency < 1.5s at 100 QPS, with occasional spikes to 500 QPS during eval runs. Local tests show:

On a T4, single-request latency is ~2.8s and VRAM is tight.
On an A100, you get ~1.4s at batch size 1 and decent headroom.
On an H100, you can either:
- Run batch size 4–8 at similar latency, or
- Keep latency lower and handle spikes gracefully.

You decide to:

Deploy the production endpoint on gpu="H100".
Use @app.cls with @modal.concurrent(max_inputs=8) so each container handles up to 8 in-flight requests.
Let Modal autoscale out to dozens of containers during eval spikes.

Operationally:

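During normal traffic, if latency stays near 1.4s with 8 requests in flight per container, 100 QPS works out to about 140 in-flight requests, or roughly 18 H100 containers. Batching that raises per-container throughput shrinks that number.
During spikes, Modal spins up more containers from the multi-cloud pool. Container startup is fast, but each new container still runs @modal.enter() to load the 13B weights, so budget for that load time when sizing for spikes.
You watch logs and metrics on the Modal apps page, adjust the container cap, max_inputs, and batch size, and redeploy in minutes.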

Pro Tip: Treat GPU type and concurrency as tunable parameters, not fixed infrastructure. Keep them near your model code in a single module, and script simple load tests (using asyncio + httpx or locust) that you can run after every change to verify latency and utilization before you ship.
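A load test along those lines can be very small. The sketch below uses only the standard library and takes the request function as a parameter, so fake_send stands in for a real HTTP call (for example a POST through httpx.AsyncClient to your deployed URL). run_load and fake_send are hypothetical names.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_load(send: Callable[[int], Awaitable[None]],
                   total: int, concurrency: int) -> list[float]:
    """Fire `total` requests with at most `concurrency` in flight; return latencies in seconds."""
    sem = asyncio.Semaphore(concurrency)
    latencies: list[float] = []

    async def one(i: int) -> None:
        async with sem:  # blocks while `concurrency` requests are already running
            start = time.perf_counter()
            await send(i)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one(i) for i in range(total)))
    return latencies


async def fake_send(i: int) -> None:
    # Stand-in for a real request to the inference endpoint.
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    results = asyncio.run(run_load(fake_send, total=50, concurrency=10))
    print(f"{len(results)} requests, slowest {max(results):.3f}s")
```

Running it once per GPU type and concurrency setting gives comparable latency samples for each configuration.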

Summary

You set GPU type and concurrency in Modal directly in Python:

Pick GPUs with gpu="T4", "A10G", "A100", or "H100" (and counts like "H100:8") on functions or classes.
Use concurrency_limit (max_containers in newer releases) to cap how many containers a function or class scales out to.
Use @modal.concurrent(max_inputs=...) on @app.cls servers to control per-container parallelism and protect VRAM.
Let Modal handle autoscaling and capacity, then tune these values based on real latency/load tests.

Because hardware and scaling behavior live next to your model code, iteration becomes a tight loop: change a decorator, modal deploy, hit it with traffic, adjust.
