
Modal pricing: how far do the $30/month free credits go for a small GPU inference prototype?
Most teams looking at Modal pricing want to know a simple thing: can you build and test a real GPU inference prototype on the $30/month free compute, or will you run out of credits after a couple of curl calls? The short answer is that for a small GPU-backed API or internal tool, those credits go a surprisingly long way—especially if you design the prototype the way Modal is meant to be used: batch your work, keep containers warm, and don’t burn GPU time on idle.
Quick Answer: For a small GPU inference prototype, Modal’s $30/month free compute is enough to run thousands to tens of thousands of inferences per month, depending on GPU type, model size, and how you structure your app. If you use a modest GPU (e.g., A10G), batch work with .map(), and keep stateful servers warm with @app.cls, you can iterate on a serious prototype before you ever hit a bill.
Why This Matters
When you’re exploring a new LLM- or vision-powered product, the last thing you want is to spend your week learning GPU quota pages and pricing calculators instead of running experiments. You need to know whether you can actually stand up a GPU endpoint, ship it to a few users, and run evals—without immediately committing to a big infra budget.
Modal is explicitly designed to compress that experiment → deploy loop. The $30/month free compute is there so you can:
- Stand up real GPU workloads (inference, small fine-tunes, batch jobs) with zero infra setup.
- Prove latency, throughput, and unit economics before talking to finance.
- Keep your prototype in “real production shape” (metrics, retries, autoscaling) rather than throwaway notebooks.
Key Benefits:
- Realistic prototyping, not toy demos: You’re using the same primitives (Images, Functions, stateful classes) that Modal customers use in production, just with a small spend ceiling.
- Tight feedback loops: modal run and modal serve plus elastic GPUs let you iterate on containers and code in minutes, not days of cluster plumbing.
- Cost-aware by design: Code-first infrastructure (hardware in decorators, scaling in Python) makes it straightforward to reason about and optimize GPU spend while you’re building.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Free compute (≈$30/month) | Monthly bucket of usage you can spend on CPU/GPU workloads, including inference, batch, and sandboxes. | It’s your “runway” for early experiments—understanding how many calls or hours you can buy is the whole point of this article. |
| Modal Functions & Classes | Python functions (decorated with @app.function) and classes (@app.cls) that Modal turns into scalable, containerized workloads and stateful servers. | How you structure these (e.g., load model once per container vs. per request) changes how far your free credits go. |
| Autoscaling & cold starts | Modal spins up containers on demand and scales to zero when idle, with sub-second cold starts and fast model initialization. | Good autoscaling plus stateful containers dramatically reduces wasted GPU time, stretching your free credits. |
How It Works (Step-by-Step)
Let’s anchor on a concrete scenario: you want to ship a small GPU inference prototype—a Python API that runs a mid-sized open-source LLM or vision model behind an HTTP endpoint.
At a high level:
- You define your environment and model in a Modal Image.
- You wrap inference in a Modal Function or stateful @app.cls server.
- You deploy as an endpoint and let Modal autoscale, while you watch how much of the $30 you actually burn.
1. Package your model in a Modal Image
You start with a Python file like app.py:
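import modal

app = modal.App("small-gpu-inference-prototype")

# Container image with pinned dependencies, so builds are reproducible and cached.
image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch==2.2.1",
        "transformers==4.39.3",
        "accelerate==0.28.0",
        "fastapi[standard]",  # used by the web endpoint in step 3; pin a version you have tested
    )
)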
This defines your container environment—no Dockerfile, no YAML. The image build is cached, so you’re not paying GPU time to reinstall dependencies on every run.
Cost tip: Pin dependencies tightly. Reproducible images mean fewer debugging runs and wasted experiments.
2. Load the model once per container with @app.cls
For GPU efficiency, you want to load the model once per container and reuse it:
# Import heavy dependencies only inside the container, where the image provides them.
with image.imports():
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

GPU = "A10G"  # for example

@app.cls(
    gpu=GPU,
    image=image,
    # Caps the number of containers running at once, which bounds spend.
    # Newer Modal releases name this parameter max_containers.
    concurrency_limit=4,
)
class SmallLlmServer:
    @modal.enter()
    def setup(self):
        # Runs once per container start: load the tokenizer and weights onto the GPU.
        model_name = "mistralai/Mistral-7B-Instruct-v0.2"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="cuda",
        )

    @modal.method()
    def generate(self, prompt: str, max_new_tokens: int = 128) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.inference_mode():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
            )
        # Note: the decoded text includes the prompt followed by the completion.
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
The @modal.enter() hook runs once per container, so you pay the model-loading cost once per container start, not on every request.
Cost tip: Statefulness is your friend. Loading a 7B model means moving roughly 14 GB of float16 weights onto the GPU, which takes real time and I/O; amortizing that across many calls saves both time and money.
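To make the amortization argument concrete, here is a back-of-the-envelope sketch comparing total GPU-seconds for the two loading strategies. The load time, per-request time, and number of cold starts are illustrative assumptions rather than measurements, and the helper names are hypothetical.

```python
def gpu_seconds_load_per_request(n_requests: int, load_s: float, infer_s: float) -> float:
    """GPU-seconds consumed if every request reloads the model before running."""
    return n_requests * (load_s + infer_s)


def gpu_seconds_load_once(
    n_requests: int, load_s: float, infer_s: float, cold_starts: int = 1
) -> float:
    """GPU-seconds consumed if the model is loaded once per container start."""
    return cold_starts * load_s + n_requests * infer_s


if __name__ == "__main__":
    # Assumed numbers for illustration only: 20 s to load, 0.4 s per inference.
    n, load_s, infer_s = 1_000, 20.0, 0.4
    reload_total = gpu_seconds_load_per_request(n, load_s, infer_s)
    once_total = gpu_seconds_load_once(n, load_s, infer_s, cold_starts=5)
    print(f"reload on every request: {reload_total:,.0f} GPU-seconds")
    print(f"load once per container: {once_total:,.0f} GPU-seconds")
```

Swap in your own measured load and inference times; the gap between the two totals is the GPU time that per-request loading would waste.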
3. Expose a web endpoint and deploy
Now add a simple FastAPI endpoint:
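from fastapi import FastAPI
from pydantic import BaseModel

web_app = FastAPI()

class InferenceRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@web_app.post("/generate")
async def generate_endpoint(req: InferenceRequest):
    # .remote.aio() is the awaitable form of .remote(), so the event loop is not blocked.
    completion = await SmallLlmServer().generate.remote.aio(
        req.prompt,
        req.max_new_tokens,
    )
    return {"completion": completion}

# The ASGI app runs in its own CPU-only function; its image must include fastapi.
@app.function(image=image)
@modal.asgi_app()
def fastapi_app():
    return web_app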
Deploy it:
modal deploy app.py
Now you have a production-grade GPU endpoint, autoscaling on demand, spending from that $30 pool only when you’re actually doing work.
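Once the app is deployed, a quick way to exercise the /generate route is a small standard-library client. The sketch below only builds the request; the base URL is a hypothetical placeholder that you would replace with the URL Modal reports for your deployment, and build_generate_request is an illustrative helper, not part of Modal.

```python
import json
import urllib.request


def build_generate_request(
    base_url: str, prompt: str, max_new_tokens: int = 128
) -> urllib.request.Request:
    """Build a JSON POST request matching the InferenceRequest schema."""
    payload = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical placeholder URL: substitute your deployment's URL.
    req = build_generate_request("https://example.invalid", "Say hello.", 32)
    print(req.get_method(), req.full_url)
    # To actually send it (this spends GPU credits), uncomment:
    # with urllib.request.urlopen(req, timeout=120) as resp:
    #     print(json.load(resp)["completion"])
```

Keeping max_new_tokens small in smoke tests like this limits how much GPU time each call consumes.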
So, how far does that $30 go in practice?
We’re not going to fabricate exact per-GPU prices here, but we can reason from the underlying mechanics:
- GPU billing is essentially GPU-seconds × rate.
- Your job is to maximize inferences per GPU-second:
  - Use a modest GPU (A10G or similar) if it fits your latency target.
  - Keep the model loaded via @app.cls.
  - Batch work where it makes sense (.map() or batched tokenization).
If each inference call uses, say, 200–500 ms of actual GPU compute on an A10G, one fully busy GPU-hour (3,600 seconds) covers roughly 7,200 to 18,000 calls. Even with overhead from cold starts and idle time, you’re realistically looking at thousands to tens of thousands of inferences before touching the $30 limit, which is more than enough for iteration loops, internal dogfooding, and early user testing.
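The same reasoning can be written as a small calculator. This is a sketch under stated assumptions: the hourly rate below is a placeholder, not Modal's actual price, and the utilization factor is a rough stand-in for time lost to cold starts and idle containers. Check the current pricing page and plug in your own numbers.

```python
def inferences_for_budget(
    budget_usd: float,
    usd_per_gpu_hour: float,
    gpu_seconds_per_call: float,
    utilization: float = 1.0,
) -> int:
    """Estimate how many calls a budget buys at a given GPU rate.

    utilization is the fraction of billed GPU time spent on real inference
    (the rest goes to model loading and idle time before scale-down).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    billed_gpu_seconds = budget_usd / usd_per_gpu_hour * 3600
    return int(billed_gpu_seconds * utilization / gpu_seconds_per_call)


if __name__ == "__main__":
    PLACEHOLDER_RATE = 1.00  # assumed USD per GPU-hour, for illustration only
    for seconds_per_call in (0.2, 0.5):
        calls = inferences_for_budget(30.0, PLACEHOLDER_RATE, seconds_per_call, 0.25)
        print(f"{seconds_per_call:.1f} s/call -> about {calls:,} calls on $30")
```

The estimate is linear in every input, so halving GPU time per call or doubling utilization doubles the number of calls the credits cover.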
Common Mistakes to Avoid
- Spinning up a new GPU per request: If you call an @app.function(gpu=...) that loads the model on every invocation, you’ll waste your credits on repeated initialization.
  How to avoid it: Use @app.cls with @modal.enter() to load once per container, and reuse the same server for many requests.
- Forgetting about batching and fan-out: Running evals one request at a time is the slowest and often the most expensive way to validate a model.
  How to avoid it: Put your eval workload behind a CPU or modest GPU function and use .map() / .spawn() to fan out over many inputs in parallel. Pay for GPU time in bulk, not in tiny fragments.
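One simple way to prepare an eval set for fan-out is to group inputs into fixed-size batches first. The helper below is a plain-Python sketch (batched is a hypothetical name, and the batch size is an assumption you would tune to your model and GPU memory).

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    prompts = [f"Question {i}" for i in range(10)]
    for batch in batched(prompts, batch_size=4):
        print(len(batch), batch[0])
```

Each batch can then be passed as a single input to a GPU function that accepts a list of prompts, so that fanning out with .map() sends a few large requests instead of many tiny ones.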
Real-World Example
Imagine you’re building a small RAG-backed support bot for your SaaS:
- You use Modal Volumes to store your embeddings and index.
- You run an embedding model on CPU or a small GPU.
- You host a 7B instruction model on an A10G with @app.cls as above.
- The bot is only exposed to your internal team for the first month.
Your usage pattern might look like:
- A few hundred queries/day during work hours.
- Occasional load spikes when you run regression tests or eval suites.
- Idle nights and weekends.
On Modal, that usage pattern is almost ideal for stretching free credits:
- Autoscaling means no GPU cost when nobody’s hitting the endpoint.
- Containers spin up quickly enough that you don’t need to keep a large, always-on pool.
- Batch runs (evals, backfill jobs) can be run via modal run with .map() over thousands of items, using GPUs intensely for a short time, then scaling back to zero.
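To see why this pattern is cheap, it helps to estimate how long a container is actually alive under scale-to-zero. The sketch below uses a deliberately simplified model that is an assumption, not Modal's billing logic: one container, each request keeps it up for its busy time plus an idle window, overlapping windows merge, and cold-start time is ignored.

```python
def alive_seconds(request_times: list[float], busy_s: float, idle_window_s: float) -> float:
    """Total seconds one container stays up, merging overlapping keep-alive windows."""
    total = 0.0
    window_start = None
    window_end = None
    for t in sorted(request_times):
        end = t + busy_s + idle_window_s
        if window_end is None or t > window_end:
            # Previous window (if any) has closed: the container scaled to zero.
            if window_end is not None:
                total += window_end - window_start
            window_start, window_end = t, end
        else:
            # Request arrived while the container was still up: extend the window.
            window_end = max(window_end, end)
    if window_end is not None:
        total += window_end - window_start
    return total


if __name__ == "__main__":
    # Assumed workload: 300 queries spread evenly over an 8-hour workday.
    times = [i * (8 * 3600 / 300) for i in range(300)]
    up = alive_seconds(times, busy_s=0.5, idle_window_s=60.0)
    print(f"scale-to-zero: {up / 3600:.1f} h alive vs 24.0 h for an always-on GPU")
```

Shortening the idle window reduces alive time but causes more cold starts, so the right setting depends on how much first-request latency your users tolerate.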
By the time you’ve burned through the $30, you’ve likely:
- Profiled latency and throughput on different GPU SKUs.
- Figured out whether you need a 7B, 13B, or something smaller.
- Collected enough usage to estimate per-request cost and margins.
You’re not guessing anymore—you’ve measured.
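Measuring can be as simple as timing repeated calls and summarizing the samples. Here is a minimal standard-library sketch; measure_latency is a hypothetical helper, and the stand-in workload in the demo would be replaced by a real call to your endpoint.

```python
import statistics
import time
from typing import Callable


def measure_latency(fn: Callable[[], object], runs: int = 20) -> dict[str, float]:
    """Call fn repeatedly and return median, 95th percentile, and mean latency in seconds."""
    if runs < 2:
        raise ValueError("need at least 2 runs to compute percentiles")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
    cuts = statistics.quantiles(samples, n=20, method="inclusive")
    return {
        "p50_s": statistics.median(samples),
        "p95_s": cuts[18],
        "mean_s": statistics.fmean(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; swap in a function that calls your deployed endpoint.
    stats = measure_latency(lambda: time.sleep(0.01), runs=10)
    print({name: round(value, 4) for name, value in stats.items()})
```

Multiplying the mean latency by your GPU's per-second rate gives a first estimate of per-request cost, provided the measured time is dominated by GPU work.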
Pro Tip: Start your prototype on a modest GPU (e.g., A10G) with tight timeouts and clear metrics. If you later find you’re bottlenecked on latency, scale up to a bigger GPU, but keep the same code structure: just change the gpu=... parameter and redeploy.
Summary
Modal’s $30/month free compute is not meant for running a production LLM API at scale; it’s meant to get you from idea to working GPU-backed prototype without touching Terraform, without begging for quotas, and without surprise infra bills.
If you:
- Use stateful servers (@app.cls + @modal.enter) to load models once per container,
- Pick an appropriate GPU size instead of overprovisioning,
- Batch work where possible (.map(), .spawn()),
you can ship a small GPU inference prototype, run thousands of real calls, and iterate on the architecture before you outgrow the free tier.