
Modal vs Northflank: do I end up managing Kubernetes anyway, and how do GPU workloads compare?
Quick Answer: With Northflank you’re still fundamentally running on Kubernetes primitives—clusters, services, pods, and YAML-shaped concepts—just with a nicer UI. Modal skips Kubernetes entirely from the user experience: you describe hardware, scaling, and endpoints in Python, and the platform handles autoscaling across a multi-cloud GPU pool. For GPU-heavy workloads, Modal is optimized for fast model initialization, elastic scaling to thousands of GPUs, and production-grade autoscaling without quotas or reservations.
Why This Matters
If you’re shipping real AI workloads—LLM inference, fine-tuning, eval farms, or batch jobs that spike 100×—the difference between “managed Kubernetes” and “no Kubernetes at all” is not aesthetics; it’s throughput and iteration speed. Time spent fighting cluster configuration, image builds, or GPU reservations is time you’re not shipping features or running experiments. A platform that’s truly AI-native changes how fast you can iterate on models, fix latency regressions, and survive traffic spikes without overprovisioning.
Key Benefits:
- Less operational drag: Modal removes Kubernetes and most infra boilerplate from your daily loop; you define everything in Python instead of juggling Helm charts and deployment specs.
- GPU scaling for real workloads: Modal’s multi-cloud GPU pool and intelligent scheduling handle spiky, high-concurrency AI jobs (evals, RL, agents) without manual capacity management or static reservations.
- Production mechanics built-in: Retries, observability, gVisor isolation, and endpoint controls like Proxy Auth Tokens are first-class, so you can go from notebook to production endpoint with the same code.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Code-defined infrastructure | Defining environment, hardware, scaling, and endpoints directly in Python (e.g., decorators, Images) instead of YAML, dashboards, or cluster manifests. | Eliminates config drift and makes infra part of your app code, so you can version, review, and test it like everything else. |
| AI-native runtime | A runtime engineered for model workloads: fast container startup, sub-second cold starts, GPU-aware scheduling, and high-throughput storage for weights and data. | Keeps inference latency low and training/eval throughput high without you hand-tuning pod specs or pre-warming nodes. |
| Elastic GPU capacity | The ability to scale GPU count up and down automatically across clouds with no long-lived reservations or quotas to manage. | Lets you handle massive spikes—evals, RL rollouts, batch jobs—without overprovisioning or getting rate-limited by fixed cluster capacity. |
How It Works (Step-by-Step)
At a high level, the “do I end up managing Kubernetes anyway?” question comes down to where you spend your time: app code vs cluster mechanics. Let’s walk through the same workload—an LLM inference endpoint—on a Kubernetes-centric platform like Northflank vs a Python-first, AI-native platform like Modal.
1. Defining the environment
Northflank-style flow
- You build and push a Docker image, or configure a build pipeline from your repo.
- You specify:
  - Base image
  - Build context and Dockerfile
  - Environment variables
  - CPU/memory limits, maybe GPU resources
- Under the hood, this becomes a Kubernetes Deployment/Service, with workloads mapped to pods and containers.
You’re not editing raw YAML, but you are thinking in terms of containers and cluster resources. As soon as you want something non-trivial—sidecars, node selectors, GPU node pools—you’re effectively back to Kubernetes concepts.
Modal flow
On Modal, you define the environment as a Python Image:
```python
import modal

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch",
        "transformers",
    )
)
```
You keep everything in code; no separate Dockerfile is required unless you want one. The image definition is part of the same module as your app logic, versioned together.
2. Specifying GPU hardware and scaling
Northflank-style GPU setup
- Configure a GPU-enabled node pool or select GPU instance types.
- Set GPU resource requests/limits per pod (e.g., `nvidia.com/gpu: 1`).
- Manage cluster capacity so you have enough GPUs “idling” to handle peaks.
- When you need 50× GPUs for an eval burst, you either:
  - Hope autoscaling kicks in fast enough (and wait for nodes to provision), or
  - Pre-provision and pay for idle capacity.
Even with a good UI, you’re budgeting GPU capacity, planning quotas, and thinking about nodes and pods.
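To make the pre-provisioning trade-off concrete, here is a rough back-of-the-envelope sketch. The GPU count, burst schedule, and hourly rate are hypothetical numbers chosen for illustration, not quotes from any provider, and the helper `monthly_gpu_cost` is invented for this example.

```python
def monthly_gpu_cost(gpus: int, busy_hours: float, hourly_rate: float,
                     always_on: bool, hours_in_month: float = 730.0) -> float:
    """Cost of a GPU fleet billed either around the clock or only while busy."""
    billed_hours = hours_in_month if always_on else busy_hours
    return gpus * billed_hours * hourly_rate


# Hypothetical workload: 50 GPUs for a 6-hour eval burst, four times a month,
# at an assumed rate of 2.50 per GPU-hour.
burst_hours = 6 * 4
reserved = monthly_gpu_cost(50, burst_hours, 2.50, always_on=True)
on_demand = monthly_gpu_cost(50, burst_hours, 2.50, always_on=False)
utilization = burst_hours / 730.0  # share of the month the reserved fleet is busy

print(f"reserved: {reserved:.0f}, pay-per-use: {on_demand:.0f}, "
      f"utilization: {utilization:.1%}")
```

In practice the pay-per-use hourly rate is often higher than a reserved rate, so the gap is usually smaller than this sketch suggests; the point is only that a fleet sized for rare bursts sits idle most of the time.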
Modal GPU setup
On Modal, you declare GPUs per function in Python:
```python
app = modal.App("llm-inference")


@app.function(
    image=image,
    gpu="A100:1",  # or "H100", "A10G", etc.
    timeout=600,
)
def generate(prompt: str) -> str:
    # load model weights in container init (see @app.cls below)
    ...
```
For a stateful model server that loads weights once per container:
```python
@app.cls(
    gpu="A100:1",
    image=image,
)
class LLMServer:
    @modal.enter()
    def load(self):
        # Runs once per container, so weights are loaded once and reused.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained("...")
        self.model = AutoModelForCausalLM.from_pretrained("...").cuda()

    @modal.method()
    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
```
Modal’s scheduler allocates GPUs from its multi-cloud pool. You don’t touch node pools, autoscaler configs, or reservations. Scale-up/down is driven by calls to .remote(), .spawn(), .map(), and endpoint traffic.
3. Exposing a production endpoint
Northflank-style endpoint
- Configure an HTTP service:
  - Port, path, protocol
  - Health checks / liveness probes
  - Scaling rules (min/max replicas, CPU utilization targets)
- Decide on ingress, domain, and TLS.
- Under the hood, this is a Kubernetes Service + Ingress + HPA.
You now own all the failure modes of a misconfigured probe, a wrong path, or an undersized HPA target. Any non-trivial behavior (rate limits, auth) pushes you deeper into Kubernetes or platform-specific configurations.
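To see why an HPA target is easy to get wrong, it helps to look at the core scaling rule the Horizontal Pod Autoscaler applies: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The sketch below implements only that rule; it ignores the tolerance band and stabilization windows, and the function name and sample numbers are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Core HPA rule: ceil(current * metric / target), clamped to [min, max]."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% CPU against a 60% target: scale out to 6.
scale_out = desired_replicas(4, 90, 60, min_replicas=1, max_replicas=10)
# Same load, but the ceiling is 5: the service stays under-provisioned.
capped = desired_replicas(4, 90, 60, min_replicas=1, max_replicas=5)
# Low load: the rule asks for 1, but the floor keeps 2 running.
floored = desired_replicas(4, 10, 60, min_replicas=2, max_replicas=10)
```

Note that the metric here is CPU utilization. A GPU-bound model server can be saturated while its CPU sits well under target, in which case this rule never scales out at all.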
Modal endpoint
On Modal, endpoints are established with a decorator and a single deploy command:
```python
from fastapi import FastAPI
from pydantic import BaseModel

web_app = FastAPI()


class Request(BaseModel):
    prompt: str


@app.cls(gpu="A100:1", image=image)
class LLMServer:
    @modal.enter()
    def load(self):
        ...

    @modal.method()
    def generate(self, prompt: str) -> str:
        ...


@web_app.post("/generate")
async def generate_route(req: Request):
    # .remote.aio() is the awaitable form of .remote()
    return {"output": await LLMServer().generate.remote.aio(req.prompt)}


# A full FastAPI app is served with @modal.asgi_app(); the image must also
# include FastAPI, e.g. .pip_install("fastapi[standard]").
@app.function(image=image)
@modal.asgi_app()
def fastapi_app():
    return web_app
```
To deploy:
```shell
modal deploy llm_app.py
```
You get:
- An HTTPS endpoint
- Autoscaling around traffic
- Built-in logging visible in the Modal apps page
- Optional protection with Proxy Auth Tokens (`requires_proxy_auth=True`) if you want an internal-only API
No Services, no Ingress, no HPA spec. Just Python.
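Once deployed, the `/generate` route can be called from any HTTP client. Here is a minimal sketch using only the standard library. The URL is a placeholder, the helper name is invented for this example, and the `Modal-Key` / `Modal-Secret` header names for Proxy Auth Tokens are an assumption you should check against Modal's documentation before relying on them.

```python
import json
import urllib.request
from typing import Optional


def build_generate_request(url: str, prompt: str,
                           token_id: Optional[str] = None,
                           token_secret: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /generate route."""
    headers = {"Content-Type": "application/json"}
    if token_id and token_secret:
        # Assumed header names for Proxy Auth Tokens; verify before use.
        headers["Modal-Key"] = token_id
        headers["Modal-Secret"] = token_secret
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder URL: substitute the URL of your deployed endpoint.
req = build_generate_request("https://example.invalid/generate", "Hello",
                             token_id="id", token_secret="secret")

# Sending is left commented out because the placeholder URL does not resolve:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["output"])
```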
4. Autoscaling and job orchestration
Northflank / Kubernetes way
Autoscaling is usually tied to CPU/memory or custom metrics. For GPU workloads:
- GPU autoscaling is harder: you often end up tuning node autoscalers, pod disruption budgets, and a bunch of HPA knobs.
- Background jobs, eval sweeps, or RL rollouts are implemented via:
  - CronJobs / Jobs + message queues
  - Or multiple “worker” deployments pulling from a queue
You’re gluing together services and jobs, trying not to DoS your own cluster.
Modal way
Workloads are orchestrated with function-level primitives:
- `.remote()` – run a single job
- `.map()` – fan out over a dataset
- `.spawn()` – queue work and return immediately, then `FunctionCall.get()` later
- `modal.Period` / `modal.Cron` – scheduled jobs
Example: run 10,000 evals over a model with automatic GPU fan-out:
```python
@app.function(
    image=image,
    gpu="A10G",
    timeout=900,
)
def eval_prompt(prompt: str) -> float:
    # run inference, compute score
    ...


def run_eval_suite(prompts: list[str]) -> list[float]:
    # .map() fans the prompts out across containers and yields the results
    return list(eval_prompt.map(prompts))
```
Modal’s scheduler spreads this across as many GPUs as needed (and available), scaling from zero and back again. You don’t touch queue configs or cluster-wide parallelism limits.
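If the fan-out and fire-and-collect patterns are unfamiliar, the following local analogue may help. It uses a thread pool from the standard library rather than Modal, so it shows only the shape of the calling code, not how Modal schedules work; the scoring function is a made-up stand-in.

```python
from concurrent.futures import ThreadPoolExecutor


def eval_prompt(prompt: str) -> float:
    """Stand-in scorer (hypothetical): fraction of characters that are vowels."""
    return sum(ch in "aeiou" for ch in prompt.lower()) / max(len(prompt), 1)


prompts = ["hello", "sky", "audio"]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Fan-out, like .map(): one call per input, results in input order.
    scores = list(pool.map(eval_prompt, prompts))

    # Fire-and-collect, like .spawn(): submit returns a handle immediately,
    # and the result is fetched later.
    handle = pool.submit(eval_prompt, "queue")
    later = handle.result()
```

The difference on a hosted platform is that the pool is not a fixed `max_workers` on one machine; the number of workers grows and shrinks with the amount of queued work.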
5. Observability and debugging
Northflank / Kubernetes
- Logs per pod/container, often split across multiple components.
- For GPU jobs, you’re checking:
  - Pod scheduling status
  - Node health
  - GPU driver/container compatibility
- When something fails, you end up in `kubectl describe pod` / `kubectl logs` territory, even if the UI is friendly.
Modal
- Each function call, container, and workload shows up in the Modal dashboard.
- Logs and stdout/stderr are grouped per call.
- Timeouts and retries (`modal.Retries`) are expressed at the function level:
```python
@app.function(
    image=image,
    gpu="A10G",
    timeout=600,
    retries=modal.Retries(
        max_retries=3,
        backoff_coefficient=2.0,
    ),
)
def batch_item(item):  # `item` is a placeholder for your job's arguments
    ...
```
The operational surface area is “function calls and containers” instead of “pods, deployments, services, nodes.” You debug your code, not the cluster.
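To get a feel for what a backoff coefficient does, here is a small sketch of a capped exponential backoff schedule. It is a generic illustration, not Modal's implementation: `max_retries` and `backoff_coefficient` mirror the parameter names used above, while the `initial_delay` and `max_delay` defaults are assumptions made for this example.

```python
def retry_delays(max_retries: int, initial_delay: float = 1.0,
                 backoff_coefficient: float = 2.0,
                 max_delay: float = 60.0) -> list[float]:
    """Seconds to wait before each retry under capped exponential backoff."""
    return [
        min(initial_delay * backoff_coefficient ** attempt, max_delay)
        for attempt in range(max_retries)
    ]


three_retries = retry_delays(3)   # delays double on each attempt
eight_retries = retry_delays(8)   # later delays hit the cap
total_wait = sum(three_retries)   # worst-case extra latency from waiting
```

With three retries and a coefficient of 2.0, a call that keeps failing waits a total of 7 seconds between attempts under these assumed defaults, on top of the function's own run time for each attempt.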
Common Mistakes to Avoid
- Assuming “no Kubernetes UI” means “no Kubernetes complexity”: Even if a platform hides manifests behind a web form, you still inherit Kubernetes failure modes: scheduling issues, cluster capacity planning, and HPA tuning. If you don’t want to manage Kubernetes, pick a platform where the core abstraction is application code, not pods.
- Underestimating GPU operational overhead on generic platforms: GPUs aren’t just “add a flag.” Driver versions, node pools, quotas, and cross-region data paths all matter. With a general-purpose container platform, you often end up reinventing GPU-aware scheduling and pre-warming. On Modal, GPU config is a decorator argument and the scheduler is built for model workloads.
Real-World Example
Imagine you’re running an LLM-powered coding agent that:
- Serves low-latency interactive traffic from users.
- Runs huge eval suites overnight across 10,000+ prompts.
- Spawns episodic RL environments that need hundreds of GPUs in parallel.
On a Kubernetes-first platform:
- You define a deployment for the online model server and configure GPU resources.
- You set up a separate job service (or multiple worker deployments) for eval/RL, plus a message queue.
- You tweak autoscaler rules so that RL bursts don’t starve the online service of GPU capacity.
- You manually plan node pool sizes and pre-provision GPUs ahead of big eval runs.
You’re basically running your own cluster ops team.
On Modal, the same system is:
- An `@app.cls`-based model server with `@modal.enter` weight loading, exposed via `@modal.fastapi_endpoint`.
- A fan-out eval runner built on `.map()` over a list of prompts.
- RL rollouts implemented with `.spawn()` to queue millions of episodes, each running on GPU-backed functions.
- Modal’s scheduler spreads everything over its multi-cloud GPU capacity pool, while you limit concurrency at the function level if needed.
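Limiting concurrency matters when a burst workload shares capacity with latency-sensitive traffic. The sketch below shows the general idea with a plain `asyncio.Semaphore`; it is a local illustration with invented names, and it does not use Modal's own settings for capping containers, which you would configure on the function instead.

```python
import asyncio


async def run_bounded(items: list[int], limit: int) -> tuple[list[int], int]:
    """Process items concurrently, never running more than `limit` at once."""
    sem = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def worker(item: int) -> int:
        nonlocal running, peak
        async with sem:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for a GPU-backed episode
            running -= 1
            return item * item

    results = await asyncio.gather(*(worker(i) for i in items))
    return list(results), peak


# Ten queued episodes, at most three in flight at any moment.
results, peak = asyncio.run(run_bounded(list(range(10)), limit=3))
```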
You still have to design a good system, but you’re not also debugging cluster autoscalers. Teams like eval-heavy research orgs use Modal for exactly this pattern: “massive spikes in volume for evals, RL environments, and MCP servers” without breaking the rest of their stack.
Pro Tip: If you’re GPU-bound, treat your infra code like any other performance-critical code: pin dependencies tightly in `Image` definitions, keep model weights close to compute (use Modal’s storage and Volumes for caching), and push as much logic as possible into `@app.cls` servers to amortize model load across many requests.
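The amortization argument is easy to demonstrate locally. The class below is a toy stand-in for a load-once server: the "model" is just `str.upper` and the sleep simulates an expensive weight load. It is not Modal's lifecycle machinery, only the pattern that a load-once server class relies on.

```python
import time


class LocalModelServer:
    """Toy load-once server: pays the load cost on first use, then reuses it."""

    def __init__(self) -> None:
        self.load_count = 0
        self._model = None

    def _ensure_loaded(self) -> None:
        if self._model is None:
            time.sleep(0.05)          # pretend this is an expensive weight load
            self._model = str.upper   # hypothetical "model"
            self.load_count += 1

    def generate(self, prompt: str) -> str:
        self._ensure_loaded()
        return self._model(prompt)


server = LocalModelServer()
outputs = [server.generate(p) for p in ["a", "b", "c"]]
```

Three requests trigger a single load. If each request constructed its own server, the load cost would be paid three times, which is the overhead the tip is steering you away from.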
Summary
When you compare Modal vs a Kubernetes-centric platform like Northflank, the key question isn’t “whose dashboard is nicer?” but “what am I actually managing day-to-day?” With Northflank, you’re still living in a Kubernetes world: containers, node pools, GPU quotas, HPAs, and the subtle ways those can fail under load. With Modal, the dominant abstraction is Python code—functions, classes, and endpoints that the platform scales across a multi-cloud GPU pool with sub-second cold starts.
For GPU workloads specifically—LLM inference, fine-tuning, large eval grids, RL—it’s the difference between treating the cluster as your product vs treating it as a dependency you don’t have to think about. If your bottleneck is iteration speed and your workloads are GPU-heavy, running them on an AI-native runtime like Modal will almost always get you to production faster, with less operational overhead, than managing Kubernetes indirectly.