How do I set the GPU type (T4 vs A100 vs H100) and concurrency limits in Modal for an inference service?

Quick Answer: In Modal, you set the GPU type for an inference service directly in Python on the function or class (for example gpu="T4", gpu="A100", or gpu="H100", passed to @app.function(...) or @app.cls(...)). You control how far it scales out with concurrency_limit (a cap on concurrently running containers, renamed max_containers in newer Modal releases) and how much each container handles in parallel with the @modal.concurrent(max_inputs=N) decorator. You can also tune autoscaling behavior via timeout, retries, and other function options.

Most teams only realize they’ve chosen the wrong GPU or concurrency model when the service falls over under load or the cloud bill spikes. Modal tries to make those trade-offs explicit: you define GPU type, scaling, and concurrency in code, right next to your model, then push it to production with modal deploy. That keeps you in control of latency, throughput, and cost without wrangling YAML or custom orchestrators.

Key Benefits:

  • Hardware selection in code: Pin T4 vs A100 vs H100 right on the function or class that serves your model, with simple string identifiers like "H100" or "A100:2".
  • Predictable concurrency: Use concurrency_limit to cap how many containers run at once and @modal.concurrent to bound in-flight requests per container, avoiding GPU thrash and OOMs.
  • Autoscaling without reservations: Let Modal’s scheduler ramp up containers across a multi-cloud GPU pool while you specify the constraints, not the node counts.

Core Concepts & Key Points

  • GPU selection: the specific GPU model (e.g. T4, A10G, A100, H100) chosen as part of your Modal function/class definition using gpu="...". Why it matters: it lets you trade off cost, VRAM, and throughput, and ensure the model actually fits and hits your latency budget.
  • Function/class concurrency: how many containers a function or class may run at once (concurrency_limit) and how many requests each container handles in parallel (@modal.concurrent). Why it matters: it protects your model from overload, reduces tail latency, and avoids OOM while maximizing GPU utilization.
  • Autoscaling & capacity pool: Modal's ability to spin containers up and down on CPUs/GPUs in a multi-cloud capacity pool, driven by your function definitions. Why it matters: you get "just enough" compute for real workloads (massive spikes, eval sweeps, agent traffic) without manual node management.

How It Works (Step-by-Step)

Let’s walk through configuring GPU type and concurrency for a simple inference service, and then we’ll map it to T4 vs A100 vs H100 specifically.

1. Define an Image and App

First, define your environment as code. This is where you pin the runtime and libraries your model needs:

import modal

app = modal.App("gpu-inference-example")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch==2.2.2",
        "transformers==4.40.1",
    )
)

Opinionated best practice: pin versions tightly. Your future self will thank you when you can bisect a regression instead of chasing “works on my machine” bugs.

2. Choose GPU Type (T4 vs A100 vs H100)

Now attach GPU configuration directly on the function or class that does inference.

Example: Stateless endpoint with a T4

@app.function(
    image=image,
    gpu="T4",             # choose GPU type here
    concurrency_limit=50  # cap on concurrently running containers
)
@modal.fastapi_endpoint(method="POST")
async def generate(payload: dict):
    # load a small/medium model on first use or inside a class (see below)
    ...

For more advanced use, you can specify GPUs by string, including counts:

  • gpu="T4" – good for smaller models, lower cost.
  • gpu="A10G" – mid-range, more VRAM and throughput than T4.
  • gpu="A100" or gpu="A100:2" – large models, higher throughput; :2 means 2 GPUs.
  • gpu="H100" or gpu="H100:8" – top-end for big LLMs or massive batch throughput.

Modal accepts the GPU as a case-insensitive string (e.g. "H100" and "h100" are equivalent), and for multi-GPU you append :<count> like "H100:8".
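The string format is easy to build programmatically, for example when the GPU count comes from config. A tiny sketch (the helper name is mine, not part of Modal; Modal just takes the resulting string):

```python
def gpu_arg(model: str, count: int = 1) -> str:
    """Build a Modal-style GPU string such as "A100" or "H100:8".

    Illustrative helper only: Modal accepts the plain string directly.
    """
    return model if count == 1 else f"{model}:{count}"

print(gpu_arg("T4"))       # single T4 for a small model
print(gpu_arg("H100", 8))  # eight H100s attached to one container
```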

Example: Stateful model server on an A100

The more idiomatic pattern for heavy inference is a stateful class that loads weights once per container:

@app.cls(
    image=image,
    gpu="A100",
    concurrency_limit=4,  # at most 4 containers at once
)
class LLMServer:
    @modal.enter()
    def load_model(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
        self.model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-2-13b-chat-hf",
            device_map="auto",
            torch_dtype="float16",
        )

    @modal.method()
    async def generate(self, prompt: str) -> str:
        # do one generation call
        ...

Here, gpu="A100" ensures each container gets an A100. concurrency_limit=4 caps Modal at four A100-backed containers at once; within each container, generate() calls run one at a time unless you also apply @modal.concurrent.

3. Configure Concurrency

There are three layers you should think about:

  1. Container cap per function/class (concurrency_limit; newer Modal releases call it max_containers).
  2. Per-container input parallelism (@modal.concurrent(max_inputs=N)).
  3. Fan-out/fan-in at the caller level (.map(), .spawn(), FunctionCall.get()).
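The first two layers multiply: the container cap times per-container parallelism bounds total in-flight work. A back-of-envelope sketch (helper name is mine, for illustration only):

```python
def max_inflight(container_cap: int, inputs_per_container: int) -> int:
    """Upper bound on simultaneous requests a deployment can be executing.

    container_cap corresponds to concurrency_limit / max_containers;
    inputs_per_container corresponds to @modal.concurrent(max_inputs=...).
    """
    return container_cap * inputs_per_container

print(max_inflight(8, 4))  # 8 containers x 4 inputs each -> 32 concurrent requests
```

Everything beyond that bound queues at Modal's scheduler rather than hitting your GPUs.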

Container cap per function/class

On a function:

@app.function(
    image=image,
    gpu="T4",
    concurrency_limit=200,  # at most 200 containers at once
)
async def classify(batch: list[str]) -> list[str]:
    ...

On a class:

@app.cls(
    image=image,
    gpu="H100",
    concurrency_limit=8,  # at most 8 containers
)
class DiffusionServer:
    ...

concurrency_limit is a backpressure mechanism: it caps how many containers can run concurrently, so once every container is busy, new calls queue (or fail if your caller times out). This is the first knob to use to keep your GPUs from being flooded.
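To pick a sensible cap rather than guessing, Little's law is a good starting point: concurrent requests ≈ request rate × latency. A hedged sizing sketch (the function name and numbers are illustrative, not Modal defaults):

```python
import math

def containers_needed(rps: float, latency_s: float, inputs_per_container: int) -> int:
    """Little's law: in-flight requests ~= arrival rate x service time.

    Divide by per-container parallelism (what @modal.concurrent allows)
    to estimate a container count; set your cap at or above it with headroom.
    """
    in_flight = rps * latency_s
    return math.ceil(in_flight / inputs_per_container)

# 100 req/s at 2 s median latency, 4 in-flight calls per container:
print(containers_needed(100, 2.0, inputs_per_container=4))  # -> 50 containers
```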

Per-container concurrency with @modal.concurrent

If your handler spends most of its time waiting (on I/O, tokenization, or batched GPU calls) rather than monopolizing the GPU, you can safely run multiple requests in parallel inside one container. Note that @modal.concurrent is applied to the class (or function) as a whole, not to individual methods:

@app.cls(
    image=image,
    gpu="A10G",
    concurrency_limit=16,  # at most 16 containers
)
@modal.concurrent(max_inputs=8)  # up to 8 in-flight embed() calls per container
class EmbeddingServer:
    @modal.enter()
    def load_model(self):
        self.model = ...

    @modal.method()
    async def embed(self, texts: list[str]) -> list[list[float]]:
        ...

Here:

  • concurrency_limit=16 caps how many containers Modal will run at once.
  • @modal.concurrent(max_inputs=8) allows up to 8 embed() executions in parallel inside each container.

You typically set max_inputs to a small multiple of the GPU’s ideal batch size, then tune based on latency/throughput measurements.
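The effect of per-container parallelism is easy to see offline. This pure-Python simulation (no Modal required) uses a semaphore to mimic what a per-container input cap enforces; with handlers that mostly wait, raising the cap cuts wall-clock time almost linearly:

```python
import asyncio
import time

async def fake_infer() -> None:
    await asyncio.sleep(0.05)  # stand-in for one generation call

async def run(n_requests: int, max_inputs: int) -> float:
    """Run n_requests with at most max_inputs in flight at once,
    mimicking a single container under a per-container concurrency cap."""
    sem = asyncio.Semaphore(max_inputs)

    async def one() -> None:
        async with sem:
            await fake_infer()

    start = time.perf_counter()
    await asyncio.gather(*(one() for _ in range(n_requests)))
    return time.perf_counter() - start

serial = asyncio.run(run(16, max_inputs=1))    # ~0.8 s: requests run one by one
parallel = asyncio.run(run(16, max_inputs=8))  # ~0.1 s: two waves of eight
print(f"{serial:.2f}s vs {parallel:.2f}s")
```

Real GPU handlers saturate earlier than a pure sleep, which is why you measure instead of extrapolating.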

Fan-out at the caller

You then drive load from the caller, using .map() or .spawn():

# Fire off many inference calls in parallel
server = LLMServer()
calls = [server.generate.spawn(prompt) for prompt in prompts]
outputs = [c.get() for c in calls]

If you do this from within Modal, make sure your concurrency_limit is high enough and you understand the backpressure semantics.
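If you want to reason about the spawn/get pattern without deploying anything, the shape is the same as plain Python futures. This is an analogy, not the Modal API:

```python
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    return prompt.upper()  # stand-in for a remote generation call

# submit() returns a handle immediately (like .spawn()); result() blocks
# until the value is ready (like FunctionCall.get()).
with ThreadPoolExecutor(max_workers=8) as pool:
    handles = [pool.submit(generate, p) for p in ["fix bug", "write test"]]
    outputs = [h.result() for h in handles]

print(outputs)
```

The difference in Modal is that the "pool" is remote containers, so the effective worker count is bounded by your container cap and per-container concurrency rather than max_workers.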

4. Deploy and Validate

To deploy:

modal deploy gpu_inference.py

Then hit your endpoint and watch the Modal UI:

  • Apps page: see new containers spinning up on the GPUs you requested.
  • Logs & traces: verify that cold starts are within your latency budget (Modal targets sub-second cold starts for well-structured images).
  • Metrics: look for saturation/OOM and adjust concurrency_limit and GPU class.

Common Mistakes to Avoid

  • Overloading a single GPU with too much per-container concurrency: Setting @modal.concurrent(max_inputs=...) sky-high on an A100-backed LLM class looks impressive until latency explodes and you trigger OOM errors. Start conservative (e.g. 2–4 in-flight calls per container for large LLMs, 8–16 for embedding models) and increase while monitoring latency and memory.
  • Choosing the wrong GPU tier for your model size: Trying to fit a 70B model on a T4 will just fail; pushing a small 300M-parameter model to an H100 is usually wasteful. Verify model VRAM usage locally or in a Modal Sandbox, then choose T4/A10G/A100/H100 to match VRAM and throughput needs.
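The second mistake is avoidable with ten seconds of arithmetic: weights alone need roughly parameter count × bytes per parameter. A sketch (the 30% headroom factor is a rule of thumb of mine, not a Modal figure):

```python
def est_vram_gb(params_billions: float, bytes_per_param: int = 2,
                headroom: float = 1.3) -> float:
    """Rough VRAM estimate: weights (fp16 = 2 bytes/param) plus ~30%
    headroom for activations and KV cache. Always verify empirically."""
    return params_billions * bytes_per_param * headroom

print(est_vram_gb(13))  # 13B fp16 -> ~33.8 GB: too big for a 16 GB T4,
                        # fits on a 40/80 GB A100
print(est_vram_gb(70))  # 70B fp16 -> ~182 GB: needs multiple GPUs, e.g. "H100:4"
```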

Real-World Example

Imagine you’re serving a production LLM endpoint for a coding assistant. You start with:

  • gpu="A100" – enough memory for a 13B–34B model with room for KV cache.
  • concurrency_limit=4 on the LLMServer class (at most four containers).
  • @modal.concurrent(max_inputs=2) on the class.

You deploy, run load tests with 50–100 RPS, and observe:

  • Median latency is fine, but P95 creeps up whenever every in-flight slot is busy.
  • GPU utilization hovers around 60%.

You then bump max_inputs to 4 and reduce per-request max tokens to keep memory headroom. Now utilization hits ~85% and P95 latency stays within budget. As traffic grows, Modal spins up more A100-backed containers (up to your cap), each handling up to four concurrent generations, without you touching node groups or autoscaling rules.

Pro Tip: Use a Modal Volume to cache your model weights and tokenizer and combine it with a tightly pinned Image. This reduces cold start time on A100/H100 and lets you safely scale up to tens or hundreds of containers for evals or RL environments without repeatedly pulling multi-GB checkpoints from external storage.

Summary

Setting GPU type and concurrency limits in Modal is just Python:

  • Choose your GPU with a string like "T4", "A100", or "H100" (and optionally :<count>).
  • Attach it to the function/class doing inference with gpu="...".
  • Control the container cap with concurrency_limit (max_containers in newer releases) and per-container parallelism with @modal.concurrent(max_inputs=N).
  • Let Modal’s multi-cloud capacity pool handle the rest—sub-second cold starts, instant autoscaling, and thousands of GPUs on demand.

When you write the infrastructure in code, you can iterate on GPU choice and concurrency the same way you tune your model: profile, change a line, redeploy.
