
How do I fine-tune a Hugging Face model on Modal and save checkpoints to persistent storage?
Quick Answer: You fine-tune a Hugging Face model on Modal by defining your training code as a Python `@app.function` with a GPU spec, mounting a Modal Volume for datasets and checkpoints, and writing checkpoints to that Volume during training. On subsequent runs, your function inspects the Volume for the latest checkpoint and resumes training from there, with no framework-specific Modal changes required.
Why This Matters
Fine-tuning Hugging Face models is compute-heavy and usually comes with two big headaches: getting reliable GPU capacity when you need it, and not losing training progress when jobs are interrupted or preempted. Running your training loop on Modal gives you elastic GPUs, fast cold starts, and a persistent, distributed filesystem (Volumes) for checkpoints, so you can safely run long jobs, pause and resume experiments, and scale out without rewriting your ML code for bespoke infrastructure.
Key Benefits:
- Elastic GPU training without infra glue: Request exactly the GPU you need (`A10G`, `A100:2`, etc.) and let Modal handle autoscaling, retries, and cold starts.
- Durable checkpoints on a shared Volume: Save and load Hugging Face checkpoints from Modal Volumes so you don't lose progress if a container dies or you redeploy.
- Same training code locally and in the cloud: Keep your fine-tuning loop unchanged—just wire up the dataset path and checkpoint directory to Modal’s storage primitives.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Modal Volume | A distributed filesystem that behaves like a local directory but is shared across all Modal Functions. | Stores datasets and training checkpoints durably so jobs can resume after preemption or new deployments. |
| Image & GPU spec | A Modal Image defines your environment (Python, PyTorch, Transformers), paired with a GPU spec like `gpu="A10G"`. | Gives you reproducible training environments and access to elastic GPU capacity without manual provisioning. |
| Checkpointing pattern | Training code that periodically writes model/optimizer state to a persistent path, and on startup looks for the latest checkpoint to resume from. | Prevents wasted GPU hours by avoiding full restarts after failures and lets you run long experiments confidently. |
How It Works (Step-by-Step)
At a high level, you’ll:
- Define a Modal Volume to hold datasets and Hugging Face checkpoints.
- Build a Modal Image with PyTorch, `transformers`, and any other dependencies.
- Implement a fine-tuning function that:
- Loads data/model configs from Hugging Face.
- Saves checkpoints into the Volume every N steps/epochs.
- On startup, looks for the latest checkpoint and resumes if present.
- Run and deploy your training job via `modal run` / `modal deploy`.
Let’s walk through the pattern in code.
1. Set up the Modal app, Volume, and environment
We’ll assume you’re fine-tuning a text classification model, but this applies to any Hugging Face pipeline (seq2seq, diffusion, etc.).
```python
# train_hf_modal.py
from pathlib import Path

import modal

app = modal.App("hf-finetune-with-checkpoints")

# Volume for datasets + checkpoints
CHECKPOINT_VOLUME_NAME = "hf-training-volume"
hf_volume = modal.Volume.from_name(CHECKPOINT_VOLUME_NAME, create_if_missing=True)

# Base image with PyTorch + Hugging Face
image = (
    modal.Image.debian_slim()
    .pip_install(
        "torch",
        "transformers",
        "datasets",
        "accelerate",  # optional but recommended
    )
)
```
This Volume is your persistent storage: it will hold checkpoints like `checkpoint-1000/`, `checkpoint-2000/`, etc., and can be accessed from any function in your app.
2. Define a GPU-backed training function
Pick a GPU spec appropriate for your model (e.g., A10G for mid-sized transformers, A100 for larger models).
```python
@app.function(
    image=image,
    gpu="A10G",  # or "A100", "T4", etc.
    timeout=24 * 60 * 60,  # max 24 hours per run
    volumes={"/data": hf_volume},
)
def train_hf_model(
    model_name: str = "distilbert-base-uncased",
    dataset_name: str = "imdb",
    output_dir: str = "/data/experiments/imdb-run-1",
    checkpoint_interval_steps: int = 1000,
):
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # All persistent data lives under /data (the Volume mount)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # 1) Detect the last checkpoint in the Volume
    last_checkpoint = _get_last_checkpoint(output_path)

    # 2) Prepare dataset + model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset(dataset_name)

    def tokenize_fn(batch):
        return tokenizer(
            batch["text"],
            padding="max_length",
            truncation=True,
            max_length=256,
        )

    tokenized = dataset.map(tokenize_fn, batched=True)
    tokenized = tokenized.rename_column("label", "labels")
    tokenized.set_format("torch")

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
    )

    # 3) Training arguments: write checkpoints into /data
    training_args = TrainingArguments(
        output_dir=str(output_path),
        evaluation_strategy="steps",
        eval_steps=checkpoint_interval_steps,
        save_steps=checkpoint_interval_steps,
        save_total_limit=3,  # keep only the most recent N checkpoints
        logging_steps=100,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        load_best_model_at_end=False,
        report_to=[],  # disable HF logging integrations
    )

    # 4) Trainer with the usual Hugging Face API
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        tokenizer=tokenizer,
    )

    if last_checkpoint:
        print(f"resuming training from the latest checkpoint: {last_checkpoint}")
        trainer.train(resume_from_checkpoint=str(last_checkpoint))
    else:
        print("starting training from scratch")
        trainer.train()

    trainer.save_model(str(output_path / "final_model"))
    print("training finished successfully")
```
The key detail: `output_dir` points into `/data`, which is the Modal Volume mount. Hugging Face's `Trainer` writes checkpoints there just like a local disk, but now they're durable and shared across runs.
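`save_total_limit=3` tells the `Trainer` to prune older checkpoints so the Volume doesn't fill up with stale state. The pruning amounts to something like the sketch below (a simplified illustration of the idea, not the actual `transformers` implementation):

```python
import shutil
import tempfile
from pathlib import Path


def prune_checkpoints(experiment_dir: Path, keep: int = 3) -> None:
    """Delete all but the `keep` most recent checkpoint-<step> directories."""
    checkpoints = sorted(
        (p for p in experiment_dir.iterdir()
         if p.is_dir() and p.name.startswith("checkpoint-")),
        key=lambda p: int(p.name.split("-")[-1]),  # sort by step number
    )
    for stale in checkpoints[:-keep]:
        shutil.rmtree(stale)


# Example: six checkpoints on disk, keep only the three newest
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in (500, 1000, 1500, 2000, 2500, 3000):
        (root / f"checkpoint-{step}").mkdir()
    prune_checkpoints(root, keep=3)
    print(sorted(p.name for p in root.iterdir()))
    # ['checkpoint-2000', 'checkpoint-2500', 'checkpoint-3000']
```

Because deletions on a Volume free durable storage, this keeps long experiments from accumulating tens of gigabytes of old optimizer state.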
3. Implement checkpoint discovery in the Volume
The only Modal-specific logic you need is “find the latest checkpoint directory if it exists.”
```python
def _get_last_checkpoint(experiment_dir: Path) -> Path | None:
    if not experiment_dir.exists():
        return None
    checkpoints = [
        p
        for p in experiment_dir.iterdir()
        if p.is_dir() and p.name.startswith("checkpoint-")
    ]
    if not checkpoints:
        return None
    # Hugging Face checkpoint directories are named checkpoint-<step>
    checkpoints.sort(key=lambda p: int(p.name.split("-")[-1]))
    return checkpoints[-1]
```
This is the same pattern you’d use locally. Modal’s Volume behaves like a normal filesystem, so the logic is unchanged.
4. Run the fine-tuning job
Run this from your terminal:
```bash
modal run train_hf_modal.py::train_hf_model
```
You can override parameters on the CLI:
```bash
modal run train_hf_modal.py::train_hf_model \
  --model-name bert-base-uncased \
  --dataset-name ag_news \
  --output-dir /data/experiments/ag-news-run-1 \
  --checkpoint-interval-steps 500
```
Watch logs in the Modal dashboard under the "Apps" page. You'll see the Trainer printing `Saving model checkpoint to /data/experiments/.../checkpoint-500`, etc.
If the job is interrupted (preemption, deploy change, manual stop), re-running the same call with the same `output_dir` will detect the last checkpoint in the Volume and resume from there.
Common Mistakes to Avoid
- Saving checkpoints to ephemeral `/tmp` or working dirs: If you write to non-mounted paths (like `/tmp` or a default `output_dir` outside `/data`), your checkpoints vanish when the container exits. Always point `output_dir` at a Volume mount like `/data/...`.
- Changing `output_dir` between runs: If every run uses a different `output_dir`, your `_get_last_checkpoint` logic will never find anything to resume from. Use a stable path per experiment (`/data/experiments/imdb-run-1`) and only change it when you intentionally want a fresh run.
Real-World Example
Imagine you’re fine-tuning a Llama-2-style instruction model for your internal support chat using a proprietary dataset. You mount a Modal Volume at /data, sync your dataset once into /data/datasets/support-conversations/, and then kick off a training run on A100:2 GPUs.
Your training function writes checkpoints every 500 steps to `/data/experiments/support-llama-v1/`. Halfway through, you update your training script to fix a preprocessing bug and redeploy. With the checkpointing pattern above, your next `modal run` picks up the latest checkpoint directory from the Volume, resumes from it on the updated code, and continues training: no manual copy from S3, no fighting with GPU reservations.
Pro Tip: Treat Modal Volumes as your "single source of truth" for experiments. Write a small experiment record into each run's directory (e.g., `config.json`, `metrics.json`), and use stable `output_dir` names so you can rerun or extend training without touching your Hugging Face training code.
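A minimal version of that experiment-record pattern might look like this (the field names are hypothetical; track whatever your team needs):

```python
import json
import tempfile
from pathlib import Path


def write_run_metadata(experiment_dir: Path, **config) -> Path:
    """Record the run's config next to its checkpoints so the Volume
    is self-describing (e.g., /data/experiments/imdb-run-1/config.json)."""
    experiment_dir.mkdir(parents=True, exist_ok=True)
    path = experiment_dir / "config.json"
    path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return path


with tempfile.TemporaryDirectory() as tmp:
    exp = Path(tmp) / "experiments" / "imdb-run-1"
    meta = write_run_metadata(
        exp,
        experiment_id="imdb-run-1",           # hypothetical fields
        model_name="distilbert-base-uncased",
        checkpoint_interval_steps=1000,
    )
    print(json.loads(meta.read_text())["experiment_id"])  # imdb-run-1
```

Calling this at the top of `train_hf_model` would cost one small file per run and make every experiment directory on the Volume self-explanatory months later.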
Summary
Fine-tuning a Hugging Face model on Modal is just your normal training loop, wrapped in a Modal function that mounts a Volume and requests a GPU. Hugging Face's `Trainer` writes checkpoints to a path like `/data/experiments/…`, and a tiny bit of Python discovers the latest checkpoint and passes it to `trainer.train(resume_from_checkpoint=...)`. Because Modal Volumes are distributed and durable, you can safely run long jobs, retry after preemption, and iterate on your training code without sacrificing progress.