Modal vs Google Cloud Run (GPU): which is easier for a Python-first team and avoids GPU quota headaches?

Quick Answer: For a Python‑first AI team that wants GPUs without quota roulette, Modal is almost always easier than Google Cloud Run. You describe your environment, GPU type, and scaling in Python, and Modal’s multi‑cloud capacity pool handles “do I have GPUs?” while Cloud Run still pushes you into quotas, manual capacity planning, and more YAML-shaped config.

Why This Matters

If you’re shipping GPU-heavy workloads (LLM inference, fine‑tuning, eval jobs, sandboxes), your bottleneck usually isn’t the model. It’s how fast you can iterate, get capacity, and keep latency under control when traffic spikes. The gap between “Python function → global GPU endpoint in minutes” and “tickets, quotas, and Terraform” shows up directly in how many experiments you can run per week.

Key Benefits:

  • Python-first, no YAML: Modal lets you define images, hardware, scaling, and endpoints in Python, so you stay in the language your team already uses instead of juggling Dockerfiles, Cloud Run config, and infra glue.
  • Fewer GPU quota headaches: Modal exposes a multi‑cloud GPU pool with instant autoscaling and no fixed “reservations”; Cloud Run still lives inside GCP quotas and per‑region capacity constraints.
  • Sub-second cold starts for AI workloads: Modal optimizes specifically for fast container and model startup, while Cloud Run’s cold starts plus generic container runtime often blow your latency budget for LLM and vision APIs.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| Python-defined infrastructure | Using Python code to define images, hardware, scaling, and endpoints instead of YAML/Terraform | Keeps infra close to your application logic, shortens feedback loops, and reduces context switching for Python‑first teams |
| Elastic GPU scaling | Automatically adding/removing GPU containers based on load without manual capacity planning | Lets you handle eval spikes, RL rollouts, or batch jobs without overprovisioning or getting throttled by quotas |
| Cold start behavior | How fast a new container starts and serves a request after scaling from zero | AI workloads pay a big penalty on slow starts; Modal is engineered for sub‑second cold starts and fast model load, while Cloud Run’s cold starts can dominate tail latency |
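
To see why cold starts show up in tail latency rather than in the median, here is a minimal sketch using a two-point model: every request is either warm or pays a fixed cold-start penalty. The function name and all numbers are illustrative assumptions, not measurements of either platform.

```python
def latency_percentile(warm_ms, cold_ms, cold_fraction, percentile):
    """Latency at a percentile when a fraction of requests hit a cold start.

    Simplified model: a request takes warm_ms, or warm_ms + cold_ms if it
    lands on a freshly started container.
    """
    # Sorted by latency, the warm requests come first and the cold ones last.
    warm_share = 1.0 - cold_fraction
    if percentile / 100.0 <= warm_share:
        return warm_ms
    return warm_ms + cold_ms


# Illustrative inputs only: 200 ms warm, 8 s cold-start penalty, 2% cold requests.
p50 = latency_percentile(warm_ms=200, cold_ms=8000, cold_fraction=0.02, percentile=50)
p99 = latency_percentile(warm_ms=200, cold_ms=8000, cold_fraction=0.02, percentile=99)
print(p50, p99)
```

In this toy model the median is untouched, but once more than 1% of requests are cold the p99 is the cold-start latency itself. That is why the article weighs cold-start time so heavily for interactive endpoints.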

How It Works (Step-by-Step)

At a high level, the trade‑off looks like this:

  • With Cloud Run + GPU, you’re wiring up Docker, GCP quotas, Cloud Run services, and often another layer (e.g., Vertex AI, GKE, or batch) for non‑HTTP jobs.
  • With Modal, you write Python functions/classes, decorate them, and deploy them as endpoints or jobs onto a multi‑cloud GPU pool.

Here’s how the same Python‑first team would deploy a GPU‑backed model server on each platform.

1. Define your environment

On Modal (just Python):

import modal

app = modal.App("gpu-llm-service")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch==2.2.1",
        "transformers==4.39.0",
        "accelerate==0.28.0",
    )
)

GPU_TYPE = "A100:1"  # or "H100", "A10G", etc.


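@app.cls(
    image=image,
    gpu=GPU_TYPE,
    max_containers=4,  # cap on running containers (concurrency_limit in older Modal releases)
    scaledown_window=600,  # keep idle containers warm for 10 minutes (formerly container_idle_timeout)
)
class LLMServer:
    # No custom __init__ needed: the @modal.enter() hook below runs once per
    # container and stores the model and tokenizer on self.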

    @modal.enter()
    def load(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # Gated Hugging Face repo: the container needs a Hugging Face token to download it.
        model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="cuda",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        inputs = self.tokenizer(
            prompt, return_tensors="pt"
        ).to(self.model.device)
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7,
        )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


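# The web container only needs FastAPI; GPU work stays in LLMServer.
web_image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=web_image)
@modal.asgi_app()  # serves the returned FastAPI app; fastapi_endpoint wraps a single function instead
def inference_app():
    from fastapi import FastAPI
    from pydantic import BaseModel

    server = LLMServer()
    web = FastAPI()

    class Request(BaseModel):
        prompt: str
        max_tokens: int = 256

    @web.post("/generate")
    async def generate(req: Request):
        # .remote.aio() is the awaitable form of .remote() for async handlers
        output = await server.generate.remote.aio(req.prompt, req.max_tokens)
        return {"output": output}

    return web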

To deploy:

modal deploy gpu_llm_service.py

You’ve just:

  • Defined the environment (Image).
  • Requested a GPU (gpu=GPU_TYPE).
  • Created a stateful model server (@app.cls + @modal.enter).
  • Exposed an HTTPS endpoint (the decorated inference_app function).

No YAML, no service manifests; you can exercise functions with modal run (or the web endpoint with modal serve) before you deploy.
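
Once the service is up, calling it is a plain JSON POST to the /generate route defined above. The sketch below builds that request with the standard library without sending it. The base URL is a hypothetical placeholder (use whatever URL your deployment prints), and the helper name is an assumption for illustration.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_tokens=256):
    """Build (but do not send) a POST request for the /generate route."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical URL: replace it with the endpoint URL of your own deployment.
req = build_generate_request("https://example.invalid", "Write a haiku about GPUs.")
print(req.get_method(), req.full_url)

# To actually send it (a generous timeout helps if the first call hits a cold container):
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["output"])
```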

On Cloud Run GPU:

You typically need to:

  1. Write a Dockerfile (pin the CUDA runtime and Python deps).
  2. Build and push to Artifact Registry.
  3. Make sure your project/region has GPU quota and GPU‑enabled Cloud Run.
  4. Create a Cloud Run service with GPU flags (via gcloud run deploy or console).
  5. Configure concurrency, autoscaling, min instances, and IAM.

That flow is more Docker- and infra-heavy than Python-heavy.
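
For comparison, steps 4 and 5 usually end up as one long gcloud run deploy invocation. Here is a rough sketch that assembles such a command line in Python. The service name, image path, region, and resource sizes are made-up placeholders, and the GPU-related flag names are as best understood from the gcloud reference, so verify them against your installed gcloud version before relying on them.

```python
import shlex


def cloud_run_gpu_deploy_cmd(service, image, region, max_instances=3, concurrency=4):
    """Assemble a `gcloud run deploy` argument list for a GPU-backed service."""
    return [
        "gcloud", "run", "deploy", service,
        "--image", image,
        "--region", region,
        "--gpu", "1",                  # assumed flag: GPUs per instance
        "--gpu-type", "nvidia-l4",     # assumed flag and GPU type
        "--cpu", "4",
        "--memory", "16Gi",
        "--no-cpu-throttling",         # keep CPU allocated outside of requests
        "--max-instances", str(max_instances),
        "--concurrency", str(concurrency),
    ]


# All three values below are hypothetical placeholders.
cmd = cloud_run_gpu_deploy_cmd(
    service="gpu-llm-service",
    image="us-docker.pkg.dev/my-project/my-repo/llm:latest",
    region="us-central1",
)
print(shlex.join(cmd))
```

Every value here lives outside your Python application code, which is the context switch the comparison is pointing at.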

2. Wire up autoscaling and concurrency

On Modal, scaling is part of the decorator:

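@app.cls(
    image=image,
    gpu="A10G",
    max_containers=8,  # cap on running containers (concurrency_limit in older Modal releases)
)
@modal.concurrent(max_inputs=8)  # inputs one container may handle at once; 8 is illustrative
class Embedder:
    ...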
Modal’s runtime:

  • Launches containers in seconds (cold start optimizations tuned for AI).
  • Maps .remote() calls onto container instances.
  • Scales out to thousands of GPUs across clouds when needed.
  • Scales back to zero when idle (you only pay for active compute).

On Cloud Run, autoscaling is configured in service settings:

  • --max-instances, --concurrency, --min-instances.
  • Scaling is bound to a single region and a fixed quota.
  • If you get a big eval spike, you hit max-instances and the overflow requests come back as 429s.

To “burst” you often need to coordinate quotas across multiple regions or fall back to another product (GKE, Vertex AI, or Batch).
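
A quick way to reason about whether a spike fits under an instance cap is Little's law: requests in flight equal arrival rate times latency. The sketch below applies it to a concurrency-limited service. The function names and the example numbers are illustrative assumptions, not figures from either platform.

```python
import math


def instances_needed(requests_per_s, avg_latency_s, concurrency):
    """Instances required to keep up with a steady request rate (Little's law)."""
    in_flight = requests_per_s * avg_latency_s
    return math.ceil(in_flight / concurrency)


def capacity_check(requests_per_s, avg_latency_s, concurrency, max_instances):
    """Compare the required instance count against a hard instance cap."""
    needed = instances_needed(requests_per_s, avg_latency_s, concurrency)
    # Throughput the capped fleet can sustain, in requests per second.
    served_rps = min(requests_per_s, max_instances * concurrency / avg_latency_s)
    return {
        "needed": needed,
        "capped": needed > max_instances,
        "overflow_rps": requests_per_s - served_rps,
    }


# Illustrative spike: 50 req/s, 2 s per generation, 4 concurrent requests per GPU instance.
result = capacity_check(requests_per_s=50, avg_latency_s=2.0, concurrency=4, max_instances=10)
print(result)
```

With these inputs the spike needs 25 instances, so a cap of 10 leaves 30 requests per second with nowhere to go. Running the same arithmetic against your regional GPU quota tells you whether a quota request is needed before the spike rather than during it.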

3. Handle non-HTTP workloads and spikes

Python-first AI teams rarely run only HTTP inference. You likely also need:

  • Evaluation jobs during training.
  • Batch data processing.
  • Sandboxes for untrusted code.
  • Periodic jobs (cron).

On Modal, these are just different ways of calling the same function:

@app.function(
    image=image,
    gpu="A10G",
    timeout=60 * 60,  # 1 hour training eval
)
def eval_model(checkpoint_path: str):
    # run eval on a separate worker GPU
    ...

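@app.local_entrypoint()
def main():
    # Fan out evaluation across up to 100 GPU containers; run with `modal run`.
    calls = [
        eval_model.spawn(f"s3://bucket/checkpoints/ckpt-{i}.pt")
        for i in range(100)
    ]
    # .get() blocks until each spawned call has finished
    results = [c.get() for c in calls]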

Same primitives, same image, same GPU types. No new product to learn.
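
If the spawn-then-get pattern is unfamiliar, it has the same shape as submitting futures to a local executor: one call returns a handle immediately, another blocks for the result. The sketch below is only a local analogy using the standard library, not how Modal implements it, and eval_checkpoint is a hypothetical stand-in for real GPU work.

```python
from concurrent.futures import ThreadPoolExecutor


def eval_checkpoint(path):
    """Stand-in for a GPU eval job; just echoes the checkpoint it was given."""
    return {"checkpoint": path, "status": "evaluated"}


paths = [f"s3://bucket/checkpoints/ckpt-{i}.pt" for i in range(100)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # submit() returns a handle right away, playing the role of .spawn()
    handles = [pool.submit(eval_checkpoint, p) for p in paths]
    # result() blocks until that job is done, playing the role of .get()
    results = [h.result() for h in handles]

print(len(results), results[0]["checkpoint"])
```

The difference in the hosted version is where the workers live: each handle refers to a call running in its own remote container rather than a thread on your machine.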

On Cloud Run, non‑HTTP workloads usually force you to:

  • Wrap work in HTTP handlers and trigger via Pub/Sub, or
  • Use a separate service like Cloud Run Jobs, Cloud Functions, or Vertex AI Batch, each with its own config, limits, and quota surface.

The operational story becomes “orchestrate across 3–4 products,” which compounds complexity for a Python‑first team.

Common Mistakes to Avoid

  • Treating GPUs as “set and forget” on Cloud Run:
    Cloud Run GPU is still gated by regional quotas, service-level caps, and per‑project limits. To avoid surprise throttling, you’d need proactive quota requests, synthetic load tests, and monitoring around max-instances. On Modal, the whole point of the multi‑cloud capacity pool is to sidestep that class of problem—take advantage of it instead of pinning yourself to a single-region quota.

  • Re‑implementing Modal’s runtime in your own infra:
    It’s tempting to build a bespoke stack: Docker + Cloud Run + Pub/Sub + Redis queues + some autoscaler. The maintenance cost (especially around cold starts and GPU fragmentation) is steep. If your workflow is “Python functions + containers + GPUs,” Modal already gives you cold‑start‑optimized containers, .map()/.spawn() fan‑out, and observability; use those primitives rather than rebuilding them on top of Cloud Run.

Real-World Example

Imagine you’re a small team training a code model and running heavy evals on each checkpoint. You want to:

  • Launch 200 parallel eval jobs on GPUs whenever a new checkpoint lands.
  • Keep latency tight on a public inference endpoint.
  • Avoid ever opening a quota ticket just to run another experiment.

On Cloud Run, a realistic setup might be:

  • A Cloud Run GPU service for inference.
  • A Cloud Run Job or Vertex AI Batch setup for evaluation.
  • Pub/Sub triggers wired from your training pipeline.
  • A Terraform or YAML layer to keep configs aligned.
  • Several rounds of quota requests (per GPU type, per region, per service).

When you hit a new usage pattern—say, RL training with thousands of short‑lived rollouts—you’ll revisit the entire stack to keep up.

On Modal, you could keep everything in one Python app:

import modal

app = modal.App("rl-system")

image = (
    modal.Image.debian_slim()
    # "rl-algorithms-lib" reads as an illustrative package name; swap in your own RL library.
    .pip_install("torch", "rl-algorithms-lib")
)

GPU = "A10G"


@app.cls(image=image, gpu=GPU, max_containers=4)  # concurrency_limit in older Modal releases
class PolicyServer:
    @modal.enter()
    def load(self):
        from rl_algorithms_lib import Policy

        # Assumes storage with the policy weights is mounted at /vol (mount not shown here).
        self.policy = Policy.load_from_volume("/vol/policy")
        self.device = "cuda"
        self.policy.to(self.device)

    @modal.method()
    def act(self, obs_batch):
        import torch

        obs = torch.tensor(obs_batch, device=self.device)
        return self.policy(obs).detach().cpu().tolist()


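@app.function(image=image, gpu=GPU)
def rollout(seed: int, num_steps: int = 1024):
    # Run environment rollouts against PolicyServer.
    # reset_env and step_env are environment helpers you supply; they are not defined here.
    server = PolicyServer()
    obs = reset_env(seed)
    steps = []
    for _ in range(num_steps):
        action = server.act.remote([obs])[0]
        next_obs, reward, done = step_env(obs, action)
        # Record the observation the action was chosen for, not the one that follows it.
        steps.append((obs, action, reward))
        obs = reset_env(seed) if done else next_obs
    return steps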


@app.function(
    image=image,
    schedule=modal.Cron("*/10 * * * *"),  # every 10 min
)
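def periodic_eval():
    seeds = range(256)
    # Spawn one rollout per seed, then block until all of them finish.
    calls = [rollout.spawn(s) for s in seeds]
    results = [c.get() for c in calls]
    log_eval(results)  # log_eval is your own reporting helper (not defined here)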

Operationally you get:

  • One codebase, one deployment (modal deploy rl_system.py).
  • Autoscaling GPUs for both serving and rollouts.
  • No separate queue system; .spawn() is your fan‑out.
  • Logs and metrics across all functions/containers on Modal’s apps page.

Pro Tip: Start by modeling your workflows as plain Python functions that you can run locally. Once they work, add Modal decorators (@app.function, @app.cls, @modal.fastapi_endpoint) and run with modal run to test on real GPUs. Only then modal deploy. This keeps the feedback loop tight and avoids debugging infra at the same time as your model code.
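
As a minimal sketch of that first step, here is an eval routine written as a plain function and smoke-tested locally with a toy model. The names evaluate and toy_predict and the example data are assumptions made up for illustration.

```python
def evaluate(examples, predict):
    """Return the fraction of (input, expected) pairs that predict() gets right."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for x, expected in examples if predict(x) == expected)
    return correct / len(examples)


def toy_predict(expr):
    """Trivial stand-in for a model: adds two integers written as 'a+b'."""
    a, b = expr.split("+")
    return str(int(a) + int(b))


# Local smoke test: no GPU or cloud account involved. The last label is
# deliberately wrong so the score is not trivially perfect.
examples = [("2+2", "4"), ("3+3", "6"), ("5+5", "11")]
score = evaluate(examples, toy_predict)
print(f"accuracy: {score:.2f}")
```

Because evaluate takes the predictor as an argument, the same function can later be decorated and pointed at a real model without changing its body.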

Summary

For a Python‑first team, the question is less “Modal vs Cloud Run as products” and more “Do you want to run GPU infra as code in Python, or do you want to be in the business of capacity planning GPUs on a general‑purpose container platform?”

  • Modal gives you Python-defined infrastructure, sub‑second cold starts tuned for AI, and elastic GPU scaling across a multi‑cloud capacity pool.
  • Cloud Run gives you a good generic container runtime, but GPU access is still quota‑gate‑heavy, and you’ll likely need a stack of adjacent GCP products to cover training, evals, batch, and sandboxes.

If your priority is fast iteration, fewer GPU quota headaches, and keeping your infra surface in Python rather than YAML and tickets, Modal is usually the simpler—and honestly more fun—choice.
