
How do I set the GPU type (T4 vs A100 vs H100) and concurrency limits in Modal for an inference service?
Quick Answer: In Modal, you choose the GPU type directly in Python on your function or class (for example `gpu="T4"`, `gpu="A100"`, or `gpu="H100"`), and you control concurrency with decorator arguments: a container cap on `@app.function` or `@app.cls` (`concurrency_limit`, renamed `max_containers` in recent Modal releases) and a per-container input limit (`@modal.concurrent(max_inputs=...)`, formerly `allow_concurrent_inputs`). Because everything is code-defined, you can tune GPU type, GPU count, and per-container concurrency per endpoint and environment, then deploy with `modal deploy`.
Why This Matters
If you’re running an inference service, “T4 vs A100 vs H100” and “how many requests in flight?” aren’t academic questions—they dictate latency, cost, and how often you get paged. The usual pattern on legacy infra is slow iteration: edit Terraform, wait for deploys, guess at autoscaling behavior, then try to debug cold starts. With Modal, you define GPU hardware and concurrency in Python next to the model code, so you can iterate fast, run load tests, and dial in throughput without fighting YAML or reservations.
Key Benefits:
- Hardware in code: Pick `T4`, `A10G`, `A100`, or `H100` inline on the function that needs it, with no separate infra stack to babysit.
- Explicit concurrency guardrails: Cap the number of containers and the number of in-flight calls per container so you don't overload VRAM or hit latency cliffs.
- Scale-to-zero with real capacity: Modal's multi-cloud pool plus instant autoscaling means you can get thousands of GPUs when you spike and pay nothing when idle.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| GPU selection | Choosing the GPU model (e.g. "T4", "A10G", "A100", "H100") and optionally count (e.g. "H100:8") in your Modal function or class. | Matches hardware to workload: T4 for cheap/light models, A100/H100 for large LLMs and vision models that care about latency and throughput. |
| Function-level concurrency | A per-function `concurrency_limit` (renamed `max_containers` in recent Modal releases) that caps how many containers run at once. With the default of one input per container, it also caps in-flight calls. | Prevents overload and backpressure issues; lets you shape how aggressively Modal fans out containers under load. |
| Container-level parallelism | Running multiple requests per container (e.g. concurrent inputs, batching, queues) using `@app.cls` and a per-container input limit. | Better GPU utilization, especially on big GPUs, by ensuring each GPU processes multiple in-flight requests or batched tensors. |
How It Works (Step-by-Step)
At a high level, you:
- Define your environment and model code in Python using a Modal `Image`.
- Attach a GPU type and concurrency limit to a function or class with decorators.
- Deploy as a web endpoint, then observe load, adjust GPU and concurrency, and redeploy.
Let’s go through it.
1. Define an inference function with a GPU (T4 vs A100 vs H100)
Pick the workload. Say you’re serving a transformer model for text or vision. The core pattern with Modal:
- Use `@app.function` to turn a Python function into a scalable unit.
- Specify `gpu="T4"` / `"A10G"` / `"A100"` / `"H100"` (case-insensitive).
- Optionally include a GPU count: `"H100:8"` if you need multiple GPUs per container.
- Pin CPU and memory if needed (`cpu=`, `memory=`) for auxiliary work.
```python
import modal

app = modal.App("gpu-inference-demo")

image = modal.Image.debian_slim().pip_install(
    "torch",
    "transformers",
)


@app.function(
    image=image,
    gpu="T4",  # or "A10G", "A100", "H100", "H100:8", etc.
    cpu=4,
    memory=16384,  # MiB (16 GiB); Modal takes memory as an integer number of MiB
    timeout=600,  # seconds
)
def run_inference(input_text: str) -> str:
    # Demo only: the model is reloaded on every call. For real traffic,
    # load it once per container instead.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # for demo purposes
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

    tokens = tokenizer(input_text, return_tensors="pt").to("cuda")
    out = model.generate(**tokens, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```
Swap GPU types by changing the gpu= argument and redeploy. For example:
- Cost-sensitive / lighter model: `gpu="T4"`
- Heavier models (7–13B) or diffusion: `gpu="A10G"` or `"A100"`
- Large LLMs, low latency, or multi-GPU tensor parallelism: `gpu="H100"` or `"H100:8"`
Modal treats string GPU names case-insensitively, so "h100" and "H100" are equivalent.
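To make the `"TYPE:COUNT"` convention concrete, here is a minimal sketch of how such a string could be split and normalized. `parse_gpu_spec` and `GpuSpec` are hypothetical names written for illustration; they are not part of Modal's API, and Modal does its own parsing and validation.

```python
from typing import NamedTuple


class GpuSpec(NamedTuple):
    gpu_type: str  # normalized to upper case, e.g. "H100"
    count: int  # GPUs per container


def parse_gpu_spec(spec: str) -> GpuSpec:
    """Split a string like "h100:8" into a (type, count) pair.

    Hypothetical helper for illustration only; not Modal's parser.
    """
    name, sep, count_text = spec.strip().partition(":")
    if not name:
        raise ValueError(f"missing GPU type in {spec!r}")
    count = int(count_text) if sep else 1  # no ":N" suffix means one GPU
    if count < 1:
        raise ValueError(f"GPU count must be at least 1, got {count}")
    return GpuSpec(name.upper(), count)


print(parse_gpu_spec("h100:8"))
print(parse_gpu_spec("T4"))
```

A helper like this can be handy in your own config layer, for example to log how many GPUs each container is expected to hold before you deploy.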
2. Set concurrency limits for the inference function
Now wrap this in an HTTP endpoint and apply concurrency controls. There are two main levels:
- Function-level concurrency: How many concurrent invocations of this function Modal will execute globally.
- Endpoint-level routing: How many simultaneous HTTP requests you expect; Modal maps these into function calls.
Here’s a FastAPI-style endpoint with an explicit concurrency limit:
```python
from fastapi import FastAPI
from pydantic import BaseModel

web_app = FastAPI()


class InferenceRequest(BaseModel):
    prompt: str


@app.function(
    image=image,  # assumes fastapi is also installed in the image
    gpu="A10G",  # note: the GPU work itself runs in run_inference's containers
    cpu=4,
    memory=24576,  # MiB (24 GiB)
    concurrency_limit=64,  # cap on containers; renamed max_containers in recent Modal releases
    timeout=30,
)
@modal.asgi_app()
def inference_service():
    @web_app.post("/generate")
    async def generate(req: InferenceRequest):
        # Call the GPU-backed function without blocking the event loop; can also batch, etc.
        output = await run_inference.remote.aio(req.prompt)
        return {"output": output}

    return web_app
```
Here’s what concurrency_limit=64 does:
- Caps the number of containers Modal will run for `inference_service`. With the default of one input per container, that is also the cap on in-flight calls.
- Under load, Modal scales containers up until either:
  - Capacity is enough to keep latency low, or
  - It hits the limit and starts queueing calls.
You can:
- Set higher values (e.g. 256, 512) if your model is light and you want more parallelism.
- Set lower values (e.g. 8–16) for big LLMs or diffusion models where VRAM is the bottleneck.
If you want stricter per-container behavior (e.g. one request per GPU per container), use a class-based server and control parallelism inside the container.
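When picking a starting value, Little's law gives a rough estimate: requests in flight are approximately the arrival rate times the average latency. The following is a minimal sketch of that arithmetic. The function name and the example numbers (40 requests/s, 0.5 s per request, 4 requests per container) are illustrative assumptions, not measurements.

```python
import math


def containers_needed(qps: float, avg_latency_s: float, per_container: int) -> int:
    """Estimate containers required to sustain `qps` at steady state."""
    if per_container < 1:
        raise ValueError("per_container must be at least 1")
    in_flight = qps * avg_latency_s  # Little's law: L = lambda * W
    return max(1, math.ceil(in_flight / per_container))


# Illustrative numbers: 40 req/s at 0.5 s each is 20 requests in flight,
# which needs 5 containers when each handles 4 requests at a time.
print(containers_needed(qps=40, avg_latency_s=0.5, per_container=4))  # 5
```

Treat the result as a floor rather than a target: real traffic is bursty, so a container cap set exactly at this value will queue requests during spikes.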
3. Use a stateful class to load the model once and tune per-container concurrency
For serious inference, you usually want to load the model once per container and reuse it. Modal’s @app.cls gives you a stateful model server with lifecycle hooks:
```python
@app.cls(
    image=image,
    gpu="H100",
    cpu=8,
    memory=65536,  # MiB (64 GiB)
    timeout=60,  # seconds per call
)
# Per-container concurrency is set on the class; older Modal releases
# used allow_concurrent_inputs=4 in @app.cls instead.
@modal.concurrent(max_inputs=4)
class LlmServer:
    @modal.enter()
    def load_model(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="auto",
            device_map="cuda",  # device_map needs the accelerate package in the image
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        # This method runs inside a container with the preloaded model
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        out = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)
```
Key things happening:
- `gpu="H100"` gives each container a full H100. You can use `"H100:2"` if you want two GPUs per container.
- `@modal.enter()` loads the model once when the container starts. Subsequent calls reuse it.
- Per-container concurrency is set on the class with `@modal.concurrent(max_inputs=4)` rather than on `@modal.method`, which takes no concurrency argument. If a container is already handling 4 `generate` calls, extra calls are queued or routed to other containers.
You then expose this as an endpoint:
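```python
@app.function(image=image)  # assumes fastapi is installed in the image
@modal.asgi_app()  # serves the returned FastAPI app; web_server expects a process listening on a port
def web():
    from fastapi import FastAPI
    from pydantic import BaseModel

    app_ = FastAPI()
    server = LlmServer()

    class Req(BaseModel):
        prompt: str

    @app_.post("/generate")
    async def generate(req: Req):
        # Await the remote call so the event loop stays free for other requests
        output = await server.generate.remote.aio(req.prompt)
        return {"output": output}

    return app_
```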
Now your knobs are:
- `gpu=` on the class (`T4`, `A10G`, `A100`, `H100`, plus counts like `"H100:8"`).
- The per-container input limit (`@modal.concurrent(max_inputs=...)` on the class), which sets per-container concurrency.
- The implicit autoscaling behavior: as QPS grows, Modal starts more containers, each respecting its own per-container limit.
Picking between T4, A100, and H100 for inference
A rough heuristic you can actually operationalize:
- Use T4 when:
  - Models are small (≤1–2B parameters) and latency requirements are lax.
  - You care a lot about cost per token/image and less about raw throughput.
  - You expect bursty but low-volume traffic.
- Use A10G / A100 when:
  - You’re serving mid-sized LLMs (7–13B) or Stable Diffusion-like workloads.
  - You need sub-second to low-second latency and steady throughput.
  - You want a good balance of price/performance.
- Use H100 when:
  - You’re serving large LLMs, running big batches, or doing multi-GPU sharding (`"H100:8"`).
  - You care a lot about tail latency under load and want high arithmetic intensity.
  - You’re pushing lots of concurrent inferences or running RL/eval workloads with massive spikes.
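One way to ground this heuristic is to check whether the model's weights fit in GPU memory at all. The sketch below is a rough estimate under stated assumptions: 2 bytes per parameter (fp16/bf16 weights), a hypothetical 20% overhead factor for activations and KV cache, and the commonly published memory size of each card (the A100 also ships in an 80 GB variant). The function names are illustrative.

```python
# Published memory per card in GB; the A100 entry assumes the 40 GB variant.
GPU_MEMORY_GB = {"T4": 16, "A10G": 24, "A100": 40, "H100": 80}


def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone: billions of params times bytes per param."""
    return params_billions * bytes_per_param


def gpus_that_fit(params_billions: float, overhead: float = 1.2) -> list[str]:
    """List GPU types whose memory covers weights plus an assumed overhead."""
    needed = weight_memory_gb(params_billions) * overhead
    return [name for name, mem in GPU_MEMORY_GB.items() if mem >= needed]


print(gpus_that_fit(1.5))  # small model: every listed GPU fits
print(gpus_that_fit(13))   # 13B in fp16 needs about 31 GB with the assumed overhead
```

Quantization lowers the bytes per parameter, so a model that fails this check at fp16 may still fit on a smaller card at 8-bit or 4-bit precision. Memory fit is only a first filter; latency and throughput still need a load test.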
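Since GPU type is just a Python argument, you can register one variant of the same function per GPU type in a loop:
```python
gpu_choices = ["T4", "A100", "H100"]

for gpu in gpu_choices:

    @app.function(
        image=image,
        gpu=gpu,
        concurrency_limit=32,  # cap on containers; renamed max_containers in recent Modal releases
        timeout=30,
        name=f"infer-{gpu.lower()}",
        serialized=True,  # assumed necessary: defined in a loop under a custom name
    )
    def infer_variant(payload: dict):
        ...
```
Then hit each variant with a simple load test and compare latency and cost.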
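For the cost side of that comparison, a small calculation is usually enough: GPU-seconds consumed per request, priced at the hourly rate. The sketch below is illustrative only. The function name is made up, and the prices and latencies are placeholders rather than real Modal pricing or benchmark results.

```python
def cost_per_1k_requests(
    price_per_gpu_hour: float,
    avg_latency_s: float,
    in_flight_per_gpu: int = 1,
) -> float:
    """Cost of 1,000 requests, assuming the GPU is busy for the whole latency."""
    gpu_seconds = 1000 * avg_latency_s / in_flight_per_gpu
    return price_per_gpu_hour * gpu_seconds / 3600


# Placeholder numbers: a cheap, slow GPU versus a pricier, faster one.
slow = cost_per_1k_requests(price_per_gpu_hour=1.0, avg_latency_s=2.8)
fast = cost_per_1k_requests(price_per_gpu_hour=4.0, avg_latency_s=0.5)
print(f"slow GPU: {slow:.3f} per 1k requests, fast GPU: {fast:.3f} per 1k requests")
```

With these made-up inputs the faster GPU comes out cheaper per request despite the higher hourly price, which is why it is worth measuring latency on each GPU type rather than comparing hourly rates alone.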
Concurrency patterns that work well on Modal
Three common patterns:
- One request per container, simple function
  - Use `@app.function(gpu="T4", concurrency_limit=N)` with no extra per-container management.
  - Let Modal scale container count to handle load.
  - Good for simple, stateless inference where load is modest.
- Stateful server, limited per-container concurrency
  - Use `@app.cls(gpu="A100")` with a per-container input limit of 2–8 (`@modal.concurrent(max_inputs=...)`) for heavy models.
  - Each container keeps the model in VRAM and handles a few parallel calls.
  - Modal scales container count horizontally as traffic increases.
- Batching inside a container
  - Still use `@app.cls` but implement an internal queue.
  - Each `generate` call enqueues a request; a background loop builds batches and runs a single forward pass on the GPU.
  - You might set a higher per-container input limit (e.g. 64) and use batching to keep GPU utilization high.
Example skeleton for simple batching:
```python
import asyncio
from collections import deque


@app.cls(
    image=image,
    gpu="A100",
    cpu=8,
    memory=65536,  # MiB (64 GiB)
)
# Logical requests per container; older Modal releases used
# allow_concurrent_inputs=64 in @app.cls instead.
@modal.concurrent(max_inputs=64)
class BatchedLlmServer:
    @modal.enter()
    async def load_model(self):
        # load tokenizer + model, move to GPU as before
        ...
        # Create the queue and event inside the running event loop and start
        # the batch loop (assumes enter hooks and methods share one loop).
        self.queue = deque()
        self.batch_event = asyncio.Event()
        self._batch_task = asyncio.create_task(self._batch_loop())

    async def _batch_loop(self):
        while True:
            await self.batch_event.wait()
            self.batch_event.clear()
            while self.queue:  # drain everything queued so far, one batch at a time
                requests = []
                while self.queue and len(requests) < 16:  # max batch size
                    requests.append(self.queue.popleft())
                prompts = [r["prompt"] for r in requests]
                # run a single batched forward pass without blocking the event loop
                outputs = await asyncio.to_thread(self._generate_batch, prompts)
                for out, r in zip(outputs, requests):
                    r["future"].set_result(out)

    def _generate_batch(self, prompts):
        # tokenize + generate in batch
        ...

    @modal.method()
    async def generate(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        self.queue.append({"prompt": prompt, "future": fut})
        self.batch_event.set()
        return await fut
```
This pattern lets you:
- Set a large per-container concurrency limit (logical requests).
- Still control GPU pressure via batch sizes and sequence lengths.
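The queue-and-future mechanics can be exercised without a GPU or Modal. The following is a self-contained sketch using only `asyncio`. `MicroBatcher` and `fake_generate_batch` are hypothetical stand-ins for the server class and the batched forward pass, and the batch size of 2 is chosen only to make the batching visible.

```python
import asyncio
from collections import deque


def fake_generate_batch(prompts):
    """Stand-in for a batched forward pass: upper-cases each prompt."""
    return [p.upper() for p in prompts]


class MicroBatcher:
    def __init__(self, max_batch_size=2):
        self.max_batch_size = max_batch_size
        self.queue = deque()
        self.event = asyncio.Event()
        self.batch_sizes = []  # recorded so the batching can be inspected

    async def run(self):
        while True:
            await self.event.wait()
            self.event.clear()
            while self.queue:  # drain everything queued so far
                size = min(self.max_batch_size, len(self.queue))
                batch = [self.queue.popleft() for _ in range(size)]
                outputs = fake_generate_batch([prompt for prompt, _ in batch])
                self.batch_sizes.append(len(batch))
                for (_, fut), out in zip(batch, outputs):
                    fut.set_result(out)  # wake the caller waiting on this request

    async def generate(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((prompt, fut))
        self.event.set()
        return await fut


async def main():
    batcher = MicroBatcher(max_batch_size=2)
    worker = asyncio.create_task(batcher.run())
    prompts = ["a", "b", "c"]
    results = await asyncio.gather(*(batcher.generate(p) for p in prompts))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(main())
print(results, batch_sizes)
```

Each caller gets its own result back even though the work ran in shared batches, and no batch exceeds the configured size. A real server would replace `fake_generate_batch` with a tokenizer and model call, and would typically add a short wait before each batch so that more requests can accumulate.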
Common Mistakes to Avoid
- Oversizing GPU for tiny models: Throwing an H100 at a 350M-parameter model with low QPS just burns money. Start with `"T4"` or `"A10G"` and only move up when you see saturation.
- Mis-setting per-container concurrency: A class-based server defaults to one input per container, which can leave a large GPU underused, while an overly high limit allows too many simultaneous generations per GPU, thrashing VRAM and increasing tail latency.
- Not pinning dependencies: If you don’t pin versions in `pip_install`, you’ll get run-to-run drift that makes performance debugging painful. Pin to exact versions for reproducibility.
Real-World Example
Imagine you’re serving a 13B LLM with a hard SLA: P95 latency < 1.5s at 100 QPS, with occasional spikes to 500 QPS during eval runs. Local tests show:
- On a T4, single-request latency is ~2.8s and VRAM is tight.
- On an A100, you get ~1.4s at batch size 1 and decent headroom.
- On an H100, you can either:
  - Run batch size 4–8 at similar latency, or
  - Keep latency lower and handle spikes gracefully.
You decide to:
- Deploy the production endpoint on `gpu="H100"`.
- Use `@app.cls` with `@modal.concurrent(max_inputs=8)` so each container handles up to 8 in-flight requests.
- Let Modal autoscale out to dozens of containers during eval spikes.
Operationally:
- During normal traffic, you might run 2–3 H100 containers, each at moderate utilization.
- During spikes, Modal spins up more containers from the multi-cloud pool. Container startup is fast when the image is already cached, but each new container still has to load the model in its `@modal.enter()` hook before it can serve, so factor that load time into your spike latency.
- You watch logs and metrics on the Modal apps page, adjust the concurrency limits and batch size, and redeploy in minutes.
Pro Tip: Treat GPU type and concurrency as tunable parameters, not fixed infrastructure. Keep them near your model code in a single module, and script simple load tests (using `asyncio` + `httpx` or `locust`) that you can run after every change to verify latency and utilization before you ship.
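A load test along these lines can be sketched with the standard library alone. In the sketch below, `send` is any async callable that issues one request. `fake_send` is a hypothetical stand-in that sleeps instead of calling a real endpoint, so you would replace it with an `httpx` call against your own URL. The request count and concurrency are arbitrary.

```python
import asyncio
import math
import statistics
import time


async def load_test(send, total_requests, concurrency):
    """Issue `total_requests` calls to `send`, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []

    async def one_call(i):
        async with semaphore:
            start = time.perf_counter()
            await send(i)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one_call(i) for i in range(total_requests)))
    return latencies


def summarize(latencies):
    """Return the median and nearest-rank p95 of a non-empty latency list."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"p50": statistics.median(ordered), "p95": ordered[p95_index]}


async def fake_send(i):
    # Hypothetical stand-in for an HTTP request to the deployed endpoint.
    await asyncio.sleep(0.01)


latencies = asyncio.run(load_test(fake_send, total_requests=20, concurrency=5))
print(summarize(latencies))
```

Running the same script before and after a change to `gpu=` or the concurrency settings gives you a like-for-like p50 and p95 comparison.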
Summary
You set GPU type and concurrency in Modal directly in Python:
- Pick GPUs with `gpu="T4"`, `"A10G"`, `"A100"`, or `"H100"` (and counts like `"H100:8"`) on functions or classes.
- Use function-level `concurrency_limit` (renamed `max_containers` in recent Modal releases) to cap how many containers run, and with them global in-flight calls.
- Use a per-container input limit on `@app.cls` servers (`@modal.concurrent(max_inputs=...)`) to control per-container parallelism and protect VRAM.
- Let Modal handle autoscaling and capacity, then tune these values based on real latency/load tests.
Because hardware and scaling behavior live next to your model code, iteration becomes a tight loop: change a decorator, run `modal deploy`, hit it with traffic, adjust.