
How do I fine-tune a Hugging Face model on Modal and save checkpoints to persistent storage?
Most teams discover the hard way that fine-tuning a Hugging Face model in the cloud is the easy part; keeping that training durable, resumable, and fast under real workloads is where things fall apart. On Modal, you get Python-first, elastic GPU infrastructure plus persistent Volumes, so you can fine-tune, checkpoint, resume, and scale without rewriting your training loop or bolting on extra storage glue.
Quick Answer: Fine-tune a Hugging Face model on Modal by wrapping your existing training loop in a Modal `@app.function` that runs on GPUs, and mount a Modal `Volume` to store checkpoints and model weights. On each run, your function checks the Volume for the latest checkpoint, resumes if it exists, and periodically writes new checkpoints so training remains durable and resumable across runs and preemptions.
Why This Matters
When you’re fine-tuning large models, you don’t want to restart from scratch every time a job times out, a GPU gets preempted, or you change code. Durable checkpoints and persistent storage turn fine-tuning from “hope this run doesn’t die” into “resume from last good step,” which is critical when runs last hours, not minutes.
On Modal, you define that durability in Python: a Volume acts as a distributed filesystem for checkpoints, and a GPU-backed `@app.function` runs your existing Hugging Face training code with little more than an output-path change. You keep your feedback loop tight, your GPU bill sane, and your training jobs resilient.
Key Benefits:
- Durable, resumable training: Save checkpoints to a Modal Volume so you can resume from the last completed step instead of restarting entire runs.
- Elastic GPU fine-tuning: Run on A10G/A100/H100 (single or multi-GPU) with Modal’s autoscaling, no quota juggling or manual cluster orchestration.
- Python-defined infra: Express environment, hardware, scaling, and storage in code—no YAML, no brittle bash scripts—using the same patterns you deploy to production.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Modal Volume | A distributed filesystem you mount into Modal Functions and Classes, accessible like a local directory. | Stores training data, checkpoints, and final weights so they persist across runs and can be shared by multiple functions. |
| Checkpointing | Periodically saving model state (weights, optimizer, scheduler, etc.) to disk during training. | Allows you to resume training after interruptions, scale experiments across runs, and keep long jobs safe from preemptions. |
| GPU-backed `@app.function` | A Modal function decorated with resource specs (like `gpu="A10G"`) that runs in a container on Modal’s GPU pool. | Turns your Hugging Face training script into a scalable, production-grade job with logging, retries, and clear timeouts. |
How It Works (Step-by-Step)
At a high level, you’ll:
- Define a Modal Image that installs your Hugging Face dependencies.
- Create a Modal Volume to store training data and checkpoints.
- Implement a training function that:
  - Checks the Volume for the latest checkpoint.
  - Resumes from that checkpoint if present.
  - Periodically saves new checkpoints to the Volume.
- Run the training via `modal run` or deploy a reusable training job.
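Before the Modal-specific setup, the check, resume, and save steps above can be sketched with only the standard library. This is a toy illustration, not the Hugging Face implementation: the `train` and `latest_step` helpers, the `state.json` file, and the `stop_at` argument (which simulates an interruption) are all hypothetical names chosen for this sketch.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def latest_step(ckpt_root: Path) -> Optional[int]:
    """Return the highest saved step number, or None if nothing is saved."""
    steps = [
        int(p.name.split("-")[-1])
        for p in ckpt_root.glob("checkpoint-*")
        if p.is_dir()
    ]
    return max(steps) if steps else None


def train(
    ckpt_root: Path,
    total_steps: int,
    save_every: int,
    stop_at: Optional[int] = None,  # hypothetical knob to simulate an interruption
) -> Tuple[int, int]:
    """Toy loop that returns (step it started from, step it stopped at)."""
    ckpt_root.mkdir(parents=True, exist_ok=True)

    # 1. Check storage for the latest checkpoint.
    start = latest_step(ckpt_root)
    if start is None:
        state = {"running_total": 0.0}
        start = 0
    else:
        # 2. Resume from it if present.
        state_file = ckpt_root / f"checkpoint-{start}" / "state.json"
        state = json.loads(state_file.read_text())

    step = start
    while step < total_steps:
        step += 1
        state["running_total"] += 1.0 / step  # stand-in for a real training step

        # 3. Periodically save a new checkpoint.
        if step % save_every == 0:
            ckpt_dir = ckpt_root / f"checkpoint-{step}"
            ckpt_dir.mkdir(exist_ok=True)
            (ckpt_dir / "state.json").write_text(json.dumps(state))

        if stop_at is not None and step == stop_at:
            break  # simulated interruption: work since the last save is lost

    return start, step


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "experiment"
    first_run = train(root, total_steps=300, save_every=100, stop_at=250)
    saved_after_first = latest_step(root)
    second_run = train(root, total_steps=300, save_every=100)
```

The first run stops at step 250 with checkpoints at 100 and 200, so the second run starts from step 200 rather than from zero. The Hugging Face `Trainer` follows the same shape, with the temporary directory replaced by a path on the mounted Volume.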
Let’s walk through a concrete setup.
1. Set up the Modal app, Image, and Volume
First, define the environment: base image, dependencies, and a Volume for checkpoints.
```python
# train_hf_finetune.py
from pathlib import Path

import modal

app = modal.App("hf-finetune-checkpoint-demo")

# A Volume to store checkpoints and final models.
# Create it once from the CLI if needed:
#   modal volume create hf-checkpoints
CHECKPOINT_VOLUME_NAME = "hf-checkpoints"
volume = modal.Volume.from_name(CHECKPOINT_VOLUME_NAME, create_if_missing=True)

image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch",  # or "torch==2.2.0" pinned
        "transformers",
        "datasets",
        "accelerate",
    )
)
```
This Image is your “virtual environment snapshot.” Modal will rebuild it when dependencies or base image change and reuse it for fast cold starts.
2. Implement the training function with checkpointing
Here’s the pattern: on start, inspect the Volume; choose a checkpoint if present; train; periodically write checkpoints back to the Volume.
For simplicity, we’ll fine-tune a text classification model using the Trainer API.
```python
from typing import Optional

# Import the training libraries only where the Image is available, so this
# file still loads on a machine that does not have them installed.
with image.imports():
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )


def get_latest_checkpoint_dir(experiment_dir: Path) -> Optional[Path]:
    """Return the latest checkpoint directory, or None if none exist."""
    if not experiment_dir.exists():
        return None
    checkpoint_dirs = sorted(
        [p for p in experiment_dir.iterdir() if p.is_dir() and p.name.startswith("checkpoint-")],
        # Sort by the numeric step suffix, not by name
        key=lambda p: int(p.name.split("-")[-1]),
    )
    return checkpoint_dirs[-1] if checkpoint_dirs else None


def create_datasets(tokenizer, dataset_name="imdb", max_length=256):
    dataset = load_dataset(dataset_name)

    def tokenize(batch):
        return tokenizer(
            batch["text"],
            truncation=True,
            padding="max_length",
            max_length=max_length,
        )

    tokenized = dataset.map(tokenize, batched=True)
    tokenized = tokenized.remove_columns(["text"])
    tokenized.set_format("torch")
    return tokenized["train"], tokenized["test"]
```
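The numeric sort key in the checkpoint helper matters more than it looks. A small standard-library sketch, using made-up step numbers, shows what happens if the directories are sorted by name instead:

```python
import tempfile
from pathlib import Path


def checkpoint_step(path: Path) -> int:
    """Extract the step number from a 'checkpoint-<step>' directory name."""
    return int(path.name.split("-")[-1])


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in (500, 1000, 1500, 10000):  # illustrative step numbers
        (root / f"checkpoint-{step}").mkdir()

    dirs = [p for p in root.iterdir() if p.is_dir() and p.name.startswith("checkpoint-")]

    # Lexicographic order compares character by character, so "500" sorts last.
    latest_by_name = sorted(dirs, key=lambda p: p.name)[-1].name
    # Numeric order picks the checkpoint with the highest step.
    latest_by_step = sorted(dirs, key=checkpoint_step)[-1].name

print(latest_by_name)  # checkpoint-500
print(latest_by_step)  # checkpoint-10000
```

Sorting by name would resume from step 500 and silently discard most of the run, which is why the helper converts the suffix to an integer first.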
Now the Modal function:
```python
@app.function(
    image=image,
    gpu="A10G",  # or "A100", "H100", or "A100:2" for multi-GPU
    volumes={"/data": volume},  # mount Volume at /data
    timeout=60 * 60 * 4,  # up to 4 hours; max is 24h
)
def train_hf_model():
    # Inside the container, /data is backed by the Modal Volume
    experiment_dir = Path("/data/experiments/imdb-bert")
    experiment_dir.mkdir(parents=True, exist_ok=True)

    model_name = "bert-base-uncased"
    num_labels = 2

    print("🔍 Looking for existing checkpoints...")
    last_checkpoint_dir = get_latest_checkpoint_dir(experiment_dir)
    if last_checkpoint_dir:
        print(f"⚡️ Resuming training from checkpoint: {last_checkpoint_dir}")
        model_path = str(last_checkpoint_dir)
    else:
        print("⚡️ Starting training from scratch")
        model_path = model_name

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    train_dataset, eval_dataset = create_datasets(tokenizer)

    model = AutoModelForSequenceClassification.from_pretrained(
        model_path,
        num_labels=num_labels,
    )

    output_dir = str(experiment_dir)
    training_args = TrainingArguments(
        output_dir=output_dir,
        # load_best_model_at_end requires the eval and save strategies to match,
        # so evaluate on the same step schedule as checkpointing.
        eval_strategy="steps",  # named evaluation_strategy before transformers 4.41
        eval_steps=500,
        save_strategy="steps",
        save_steps=500,  # frequency of checkpoints
        logging_steps=100,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        save_total_limit=3,  # keep last 3 checkpoints to control Volume size
        report_to=[],  # disable WandB etc. for minimal example
        fp16=True,  # enable if your GPU supports it
    )

    def compute_metrics(eval_pred):
        import numpy as np  # installed as a dependency of transformers

        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        # Plain NumPy accuracy, so the Image does not need scikit-learn
        return {"accuracy": float((preds == labels).mean())}

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,  # newer transformers releases name this processing_class
        compute_metrics=compute_metrics,
    )

    # Resume training if we found a checkpoint
    trainer.train(resume_from_checkpoint=str(last_checkpoint_dir) if last_checkpoint_dir else None)

    print("✅ Training finished. Saving final model to Volume...")
    final_dir = experiment_dir / "final"
    final_dir.mkdir(parents=True, exist_ok=True)
    trainer.save_model(str(final_dir))
    tokenizer.save_pretrained(str(final_dir))
    # Explicit commit so the final model is persisted right away rather than at container exit
    volume.commit()
    print(f"📦 Final model saved to {final_dir}")
```
This is plain Hugging Face code wrapped in a Modal function. The only Modal-specific pieces are:
- `@app.function(...)` to define infra.
- `volumes={"/data": volume}` to persist checkpoints.
- Using `/data/...` paths (the mounted Volume) instead of a local `./checkpoints` directory.
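To get a feel for what `save_steps=500` and `save_total_limit=3` mean for this run, the arithmetic can be worked through directly. This is a back-of-the-envelope sketch that assumes a single GPU, no gradient accumulation, and the 25,000-example IMDB training split. It counts only the periodic step-based saves, not any final save the Trainer may add.

```python
import math

train_examples = 25_000  # size of the IMDB training split
batch_size = 16          # per_device_train_batch_size, single GPU assumed
num_epochs = 3
save_steps = 500
save_total_limit = 3

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Periodic checkpoints written over the whole run (step 500, 1000, ...).
periodic_saves = total_steps // save_steps

# How many of those stay on the Volume once old ones are rotated out.
kept_on_volume = min(periodic_saves, save_total_limit)

# Worst case: interrupted one step before the next save.
max_steps_lost = save_steps - 1

print(steps_per_epoch, total_steps, periodic_saves, kept_on_volume, max_steps_lost)
# 1563 4689 9 3 499
```

Lowering `save_steps` shrinks the worst-case loss but writes to the Volume more often, so the right value depends on how long a step takes and how large each checkpoint is.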
3. Run and iterate with Modal
Try this locally first, then ship it to Modal.
From your project directory:
```bash
# One-time: create the Volume (optional here, since the code passes create_if_missing=True)
modal volume create hf-checkpoints

# Run training on Modal
modal run train_hf_finetune.py::train_hf_model
```
Watch logs in your terminal or the Modal dashboard’s apps page. You’ll see checkpoint directories appear under /data/experiments/imdb-bert/checkpoint-XXXX.
If the job is interrupted—for any reason—you can rerun the same command and it will resume from the latest checkpoint automatically.
4. Optional: parameterize experiments
You can expose model name, dataset, or hyperparameters as function arguments and call the function with `.remote()` or `.spawn()` from another script or notebook:
```python
@app.function(
    image=image,
    gpu="A10G",
    volumes={"/data": volume},
    timeout=60 * 60 * 4,
)
def train_hf_model_param(
    model_name: str = "bert-base-uncased",
    dataset_name: str = "imdb",
    experiment_name: str = "imdb-bert",
    num_train_epochs: int = 3,
):
    # Each experiment gets its own directory, so checkpoints never mix
    experiment_dir = Path(f"/data/experiments/{experiment_name}")
    experiment_dir.mkdir(parents=True, exist_ok=True)

    last_checkpoint_dir = get_latest_checkpoint_dir(experiment_dir)
    if last_checkpoint_dir:
        print(f"⚡️ Resuming from {last_checkpoint_dir}")
        model_path = str(last_checkpoint_dir)
    else:
        print("⚡️ Training from scratch")
        model_path = model_name

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    train_dataset, eval_dataset = create_datasets(tokenizer, dataset_name=dataset_name)

    model = AutoModelForSequenceClassification.from_pretrained(
        model_path,
        num_labels=2,
    )

    training_args = TrainingArguments(
        output_dir=str(experiment_dir),
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        save_steps=500,
        save_total_limit=3,
        eval_strategy="epoch",  # named evaluation_strategy before transformers 4.41
        save_strategy="steps",
        report_to=[],
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

    trainer.train(resume_from_checkpoint=str(last_checkpoint_dir) if last_checkpoint_dir else None)
    trainer.save_model(str(experiment_dir / "final"))
```
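Then deploy the app once with `modal deploy train_hf_finetune.py` and, from another script, look up the deployed function by name:

```python
import modal

# Look up the deployed function; both names match train_hf_finetune.py
train_hf_model_param = modal.Function.from_name(
    "hf-finetune-checkpoint-demo", "train_hf_model_param"
)

# Launch an experiment asynchronously
call = train_hf_model_param.spawn(
    model_name="distilbert-base-uncased",
    experiment_name="imdb-distilbert",
    num_train_epochs=5,
)

# Later:
result = call.get()  # blocks until training finishes; None here, since the function returns nothing
```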
Common Mistakes to Avoid
- Saving checkpoints to the container filesystem instead of a Volume: Containers are ephemeral; anything written outside a Volume disappears when the job ends. Always write checkpoints and final weights into a mounted Volume path (e.g., `/data/...`).
- Not resuming from the latest checkpoint on restart: It’s easy to save checkpoints but forget to read them back. Implement a small helper (like `get_latest_checkpoint_dir`) to scan the Volume directory and pass that path to `Trainer.train(resume_from_checkpoint=...)`.
- Letting checkpoint directories grow unbounded: If you save every 100 steps for a long run without `save_total_limit`, your Volume will fill up and get slow to scan. Use `save_total_limit` and/or periodically clean old checkpoints from the Volume.
- Hard-coding local paths or secrets: Use Modal Volumes for data and checkpoints, and Modal Secrets for auth tokens (e.g., Hugging Face). Don’t bake keys into your Image or code.
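For the cleanup case, a minimal pruning helper might look like the sketch below. It assumes the `checkpoint-<step>` directory naming used by the Trainer; `prune_checkpoints` and `keep_last` are hypothetical names for this illustration. The demo runs against a temporary directory, whereas on Modal it would run inside a function with the Volume mounted.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def prune_checkpoints(experiment_dir: Path, keep_last: int = 3) -> List[str]:
    """Delete all but the newest `keep_last` checkpoint directories.

    Returns the names of the directories that were removed.
    """
    checkpoints = sorted(
        (p for p in experiment_dir.glob("checkpoint-*") if p.is_dir()),
        key=lambda p: int(p.name.split("-")[-1]),  # numeric step order
    )
    # Slicing with [:-0] would select nothing, so handle keep_last == 0 explicitly.
    stale = checkpoints[:-keep_last] if keep_last > 0 else checkpoints
    for path in stale:
        shutil.rmtree(path)
    return [p.name for p in stale]


with tempfile.TemporaryDirectory() as tmp:
    exp = Path(tmp)
    for step in (100, 200, 300, 400, 500):
        (exp / f"checkpoint-{step}").mkdir()
    (exp / "final").mkdir()  # non-checkpoint directories are left alone

    removed = prune_checkpoints(exp, keep_last=2)
    remaining = sorted(p.name for p in exp.iterdir())
```

Only directories matching the `checkpoint-` prefix are considered, so a `final` model directory or a config file stored alongside them survives the cleanup.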
Real-World Example
Imagine you’re fine-tuning a 7B parameter LLM with PEFT on a 4×A100 node. You’re running experiments with slightly different prompts and datasets, and each run takes 6–10 hours. On a normal cloud setup, an interruption at hour 9 means you restart everything, re-download weights, and burn another GPU-day.
On Modal, you:
- Mount a Volume at `/checkpoints` and save a LoRA checkpoint every 1,000 steps.
- Use `@app.function(gpu="A100:4", timeout=60 * 60 * 24)` to run in a single container with four A100s.
- Implement the exact same “find latest checkpoint and resume” helper as in the example above.
Halfway through, the job gets preempted. The next `modal run` looks at `/checkpoints/your-experiment`, finds `checkpoint-18000`, and resumes from there. You lose at most the steps since the last save (fewer than 1,000 here) instead of an entire day, and your training code, the part that actually defines your model, stays identical to your local script aside from the Volume path.
Pro Tip: Treat your Volume as the “source of truth” for experiments: store checkpoints, final models, and a small JSON/YAML config file per run (learning rate, seed, dataset snapshot). That makes it trivial to reproduce a good run later or spin up a new Modal Function that only serves a particular checkpoint.
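One way to follow this tip is to write a small JSON file next to the checkpoints at the start of each run. The sketch below is an assumption about how that could look, not a Modal or Hugging Face feature: `write_run_config`, `read_run_config`, the `run_config.json` filename, and the config keys are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_run_config(experiment_dir: Path, config: dict) -> Path:
    """Store the run's settings alongside its checkpoints."""
    experiment_dir.mkdir(parents=True, exist_ok=True)
    path = experiment_dir / "run_config.json"
    path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return path


def read_run_config(experiment_dir: Path) -> dict:
    """Load the settings back, e.g. to reproduce or serve a run."""
    return json.loads((experiment_dir / "run_config.json").read_text())


# Illustrative values; on Modal, experiment_dir would live under the Volume mount.
config = {
    "model_name": "bert-base-uncased",
    "dataset_name": "imdb",
    "learning_rate": 2e-5,
    "seed": 42,
    "num_train_epochs": 3,
}

with tempfile.TemporaryDirectory() as tmp:
    experiment_dir = Path(tmp) / "experiments" / "imdb-bert"
    config_path = write_run_config(experiment_dir, config)
    loaded = read_run_config(experiment_dir)
```

Because the file sits in the same directory as the checkpoints, anything that can mount the Volume can tell which settings produced a given set of weights.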
Summary
Fine-tuning a Hugging Face model on Modal with persistent checkpoints is mostly about two things: mounting a Volume and adopting a simple checkpointing pattern. You keep your normal HF Trainer or custom loop, but instead of writing to ./checkpoints, you write to a Volume path and resume from the latest directory on every run.
Because Modal is Python-first, you express the entire stack—Image, GPU type, Volumes, and training function—in code. That gives you a repeatable, debuggable pipeline where you can start small on a single A10G, scale up to multi-GPU A100/H100, and never lose progress when jobs get interrupted.