
Modal vs Replicate: which is better for deploying Hugging Face models as a production API (custom code, not just a model card)?
Most teams hit the same wall when they try to turn a Hugging Face model into a real production API: the “hello world” demo is easy, but everything around it—custom pre/post-processing, latency budgets, GPUs, autoscaling, observability—gets messy fast. Both Modal and Replicate try to smooth that path, but they make very different bets about who controls the stack and how much you can customize.
Quick Answer: For deploying Hugging Face models as a production API with custom Python code, Modal is usually the better fit. Replicate is great for simple “model-card-as-a-service” use cases, but Modal gives you full Python-defined infrastructure, more control over GPUs, scaling and endpoints, and a workflow that feels like running local code—just backed by thousands of containers.
Why This Matters
If you’re exposing a Hugging Face model to real users or other services, you’re no longer just “running a model”—you’re operating a backend. That means you care about p95 latency, cold starts, GPU utilization, retries, request queuing, auth, and how fast you can ship changes without breaking everything. Choosing between Modal and Replicate changes what’s possible:
- Can you run arbitrary Python around the model (RAG, feature stores, business logic)?
- Can you tune autoscaling for spiky workloads (evals, batch inference, agents)?
- Can you inspect, debug, and evolve the system like any other backend?
The answer decides whether you’re stuck in a model-hosting sandbox, or you end up with an AI-native runtime that behaves like real infrastructure.
Key Benefits:
- Modal for full custom apps: Treat your Hugging Face model as part of a larger Python service—RAG, routing, agents—deployed via decorators and `modal deploy`.
- Replicate for simple model APIs: Ship a quick API for an off-the-shelf model card with minimal setup, as long as you don't need deep custom logic or infra control.
- Production ergonomics vs. presets: Modal optimizes for programmable infrastructure (Images, Functions, Classes); Replicate optimizes for "call a hosted model with JSON."
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Python-defined infrastructure | Expressing environment, hardware, scaling, and endpoints directly in Python code (e.g., Modal Images + decorators) | Lets you version control infra alongside app code and evolve it like any other module—critical when your app is more than just a model card |
| Custom inference pipeline | Pre/post-processing, routing, RAG, and business logic wrapped around a base model | Real production APIs almost always need this; platforms that only expose “model inference” primitives become limiting |
| Elastic GPU scaling | Automatically adding/removing GPU containers based on traffic while keeping cold start/latency low | Makes it feasible to handle eval spikes, batch jobs, and traffic bursts without overprovisioning or hitting quota walls |
How It Works (Step-by-Step)
At a high level, both platforms want the same outcome: you send an HTTP request, a Hugging Face model runs on a GPU somewhere, and you get a response. The path to get there is very different.
Below I’ll zoom in on the Modal side first, because that’s where you get full control over the Python app. Then I’ll contrast how you’d do the equivalent on Replicate.
1. Express your environment and model in code (Modal)
On Modal, you define everything in Python: the container image, Python deps, GPU type, and the function or class that becomes your API.
Let’s say you want to serve a Hugging Face text generation model with custom tokenization and logging:
```python
import modal

app = modal.App("hf-prod-api")

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch",
        "transformers",
        "accelerate",
        "huggingface_hub",
        "loguru",
    )
)

GPU = "A10G"  # e.g. "A10G", "A100:2", "H100"


@app.cls(
    image=image,
    gpu=GPU,
    concurrency_limit=8,  # max number of containers Modal will scale out to
)
class HFModelServer:
    def __init__(self):
        self.logger = None
        self.pipe = None

    @modal.enter()
    def load_model(self):
        from loguru import logger
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.logger = logger
        model_id = "gpt2"
        self.logger.info(f"Loading model {model_id}...")
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        # device_map="auto" (via accelerate) places the weights on the GPU
        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
        self.pipe = (tokenizer, model)
        self.logger.info("Model loaded")

    @modal.method()
    def infer(self, prompt: str, max_new_tokens: int = 128) -> str:
        tokenizer, model = self.pipe
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
You now have:
- A stateful server (`@app.cls`) that loads the Hugging Face model once per container via `@modal.enter`
- Explicit GPU selection
- Concurrency control per replica
Next, expose this as an HTTP API:
```python
from fastapi import FastAPI
from pydantic import BaseModel

api = FastAPI()


class InferenceRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128


@app.function()
@modal.asgi_app()
def fastapi_app():
    return api


@api.post("/generate")
async def generate(req: InferenceRequest):
    # .remote.aio() is the awaitable form of .remote()
    result = await HFModelServer().infer.remote.aio(
        req.prompt,
        max_new_tokens=req.max_new_tokens,
    )
    return {"output": result}
```
Deploy it:
```shell
modal deploy hf_prod_api.py
```
You get:
- A production FastAPI endpoint hosted on Modal
- Autoscaling containers with your HF model
- Logs + metrics visible in the Modal dashboard
- The ability to iterate with `modal serve hf_prod_api.py` and automatic reloads while you edit the code
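From the caller's side, the deployed app is just an HTTP endpoint. A minimal standard-library sketch of a client for the `/generate` route above (the URL is a placeholder; Modal prints your real endpoint URL when `modal deploy` finishes):

```python
import json
import urllib.request

# Placeholder URL: Modal prints the real one after `modal deploy`
ENDPOINT = "https://your-workspace--hf-prod-api-fastapi-app.modal.run/generate"


def build_request(prompt: str, max_new_tokens: int = 128) -> urllib.request.Request:
    """Build the POST request the /generate route expects."""
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens}).encode()
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("Write a haiku about GPUs", max_new_tokens=64)
# To actually send it: urllib.request.urlopen(req).read()
```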
2. Scale with traffic and workloads (Modal)
Modal is built around “functions as entrypoints”:
- For online inference, the FastAPI endpoint calls `.remote()` on your class method
- For parallel batch jobs, you can call `.map()` across thousands of inputs
- For job queues, you can use `.spawn()` and `FunctionCall.get()` to poll results
Example: running batched evaluation on a separate worker GPU during training:
```python
@app.function(gpu=GPU, timeout=60 * 60)
def eval_model(batch_inputs: list[str]) -> list[str]:
    # Reuse HFModelServer's load logic; .local() runs inference
    # in this worker's own GPU container instead of fanning out
    server = HFModelServer()
    return [server.infer.local(x) for x in batch_inputs]
```
And then from your training loop (locally or inside Modal):
```python
from modal import FunctionCall

calls: list[FunctionCall] = []
for batch in eval_batches:
    calls.append(eval_model.spawn(batch))

# Collect results later
results = []
for call in calls:
    results.extend(call.get())
```
Operational knobs you get out of the box:
- `timeout` up to 24 hours for long jobs
- `modal.Retries` for transient failures
- Volumes for caching models or storing checkpoints
- Team controls and Proxy Auth Tokens (`requires_proxy_auth=True`) for protected endpoints
- SOC 2 and HIPAA compliance, plus gVisor isolation for untrusted code/sandboxes
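As a sketch of how those knobs compose in one function definition (the function name, volume name, and mount path are illustrative; `modal.Retries` and `modal.Volume` are Modal's documented primitives):

```python
import modal

app = modal.App("hf-batch")

# Named Volume persists across runs; useful for HF weight caches
weights = modal.Volume.from_name("hf-weights", create_if_missing=True)


@app.function(
    gpu="A10G",
    timeout=60 * 60 * 4,                   # long-running batch work
    retries=modal.Retries(max_retries=3),  # retry transient failures
    volumes={"/cache": weights},           # cache weights/checkpoints
)
def nightly_job(doc_ids: list[str]) -> list[str]:
    ...
```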
3. How the equivalent flow looks on Replicate
Replicate takes a different approach: you package your code and environment as a "model" (typically with their Cog tooling), and Replicate exposes that as a prediction API.

For a Hugging Face model with custom logic you'd:

- Create a Replicate model with a Dockerfile or config (e.g. Cog's `cog.yaml`) specifying deps
- Implement a predict file (e.g. `predict.py`) that defines the prediction logic
- Push/build it, then call it via the Replicate HTTP API
Roughly, your predict function would look like:
```python
# Pseudo-example, not the exact Replicate/Cog API
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")


def predict(prompt: str, max_new_tokens: int = 128) -> str:
    # No @modal.enter here; one-time loading happens at module import
    # (as above) or in the platform's setup hook
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
Then you call it from your app using their client, passing input JSON.
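The client side is then a thin JSON call. A hedged sketch (the model name and version are placeholders; the actual call is commented out since it needs the `replicate` package and an API token):

```python
# Input payload mirroring predict()'s signature above
payload = {"prompt": "Explain transformers in one line", "max_new_tokens": 64}

# With the `replicate` client installed and REPLICATE_API_TOKEN set,
# the call would look roughly like:
#   import replicate
#   output = replicate.run("your-org/hf-gpt2:<version>", input=payload)
#   print(output)
```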
Where it feels different:
- You’re working inside a “model hosting” abstraction, not building a general backend service
- Custom web frameworks/routers (FastAPI, custom auth flows, multi-endpoint apps) are less of a first-class story
- You don’t get the same blueprint of “Images + Functions + Classes + Volumes + web endpoints” all defined in Python
For some teams, this is totally fine: they just want an HTTP endpoint for a model card with minimal setup. But as soon as you need a real app around the model, these constraints start to show.
Common Mistakes to Avoid
- Treating a production app like a one-off model demo: If you assume your Hugging Face endpoint will forever be “just call generate() on this model,” you’ll feel fine on any hosting platform—until you need RAG, routing, feature gating, or business logic. Plan for the fact that your code around the model will grow.
- Ignoring latency and scaling upfront: Not all “serverless” is created equal. If you don’t look at cold start behavior, GPU spin‑up, and p95 latency characteristics early, you’ll just rediscover the same problems later when traffic hits or your eval pipeline fans out to thousands of calls.
Real-World Example
Imagine you’re shipping a production summarization API built on a Hugging Face model, but you also need:
- Custom input normalization (strip boilerplate, redact PII)
- Different summarization styles (“legal,” “news,” “casual”) routed by headers
- RAG against your own documentation for “context-aware” summaries
- A batch endpoint for nightly summarization jobs over 100k documents
- Auth via Proxy Tokens for internal services
On Modal you’d:
- Put the HF model in an `@app.cls` with `@modal.enter` to load once per GPU container
- Expose a FastAPI app via `@modal.asgi_app()` with multiple routes:
  - `/summarize` for online inference
  - `/batch-summarize` that calls a Modal Function with `.map()` across many docs
- Use a Volume to cache the model weights and any RAG index
- Add Proxy Auth to the endpoints and tune `concurrency_limit` + GPU type
You’re not fighting the platform; this is exactly the use case Modal is built around: “a backend that happens to be running AI workloads.”
On Replicate, you’d either:
- Squeeze all of that behavior into a single “model” entrypoint API, or
- Split it into multiple model definitions and stitch them together from another backend you host elsewhere
You can absolutely make it work, but your production topology gets more fragmented and less introspectable.
Pro Tip: When you design your Hugging Face "deployment," write down everything that needs to happen before and after the model call. If that list includes routing, RAG, batching, custom auth, or nontrivial business logic, treat this as a backend service and choose a platform (like Modal) that lets you express all of it in Python—not just the `model.generate()` call.
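That before-and-after list can be prototyped platform-free before you commit to either vendor. A minimal sketch with a stub standing in for the real model (all names here are illustrative):

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and tidy input before inference."""
    return re.sub(r"\s+", " ", text).strip()


def postprocess(text: str, max_chars: int = 200) -> str:
    """Truncate and tidy model output before returning it to callers."""
    return text[:max_chars].strip()


def summarize(text: str, model_fn) -> str:
    # model_fn stands in for the actual model.generate() call
    return postprocess(model_fn(normalize(text)))


def stub_model(prompt: str) -> str:
    return f"SUMMARY: {prompt[:40]}"


result = summarize("  Modal   vs   Replicate \n comparison  ", stub_model)
# result -> "SUMMARY: Modal vs Replicate comparison"
```

The point is that `normalize` and `postprocess` are ordinary backend code: on Modal they live alongside the model in the same Python service, while on a model-centric platform they have to fit inside the single prediction entrypoint.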
Summary
If your goal is to deploy Hugging Face models as a production API with real custom code—not just a thin wrapper around a model card—the platform choice matters:
- Modal is a Python-first, serverless AI runtime where you define infrastructure as code: Images, Functions, Classes, GPUs, and HTTP endpoints all live in your Python repo. You get sub‑second cold starts, elastic GPU scaling, stateful model servers, batch and queue primitives, and production controls (retries, timeouts, Volumes, Proxy Auth).
- Replicate is strongest when you want “host this model as an API” with minimal fuss and your custom logic remains simple. Once you start building a full inference pipeline or a multi-endpoint application around the model, you end up working around its model-centric abstraction.
For most teams building serious Hugging Face–backed services—RAG APIs, eval pipelines, agents, multi-model backends—Modal is the better long-term fit because it treats AI workloads as backend engineering, not just managed model hosting.