How do I fine-tune a Hugging Face model on Modal and save checkpoints to persistent storage?

Quick Answer: You fine-tune a Hugging Face model on Modal by defining your training code as a Python @app.function with a GPU spec, mounting a Modal Volume for datasets and checkpoints, and writing checkpoints to that Volume during training. On subsequent runs, your function inspects the Volume for the latest checkpoint and resumes training from there, with no Modal-specific changes to your training framework required.

Why This Matters

Fine-tuning Hugging Face models is compute-intensive and usually comes with two big headaches: getting reliable GPU capacity when you need it, and not losing training progress when jobs are interrupted or preempted. Running your training loop on Modal gives you elastic GPUs, sub-second cold starts, and a persistent, distributed filesystem (Volumes) for checkpoints, so you can safely run long jobs, pause and resume experiments, and scale out without rewriting your ML code for bespoke infrastructure.

Key Benefits:

  • Elastic GPU training without infra glue: Request exactly the GPU you need (A10G, A100:2, etc.) and let Modal handle autoscaling, retries, and cold starts.
  • Durable checkpoints on a shared Volume: Save and load Hugging Face checkpoints from Modal Volumes so you don’t lose progress if a container dies or you redeploy.
  • Same training code locally and in the cloud: Keep your fine-tuning loop unchanged—just wire up the dataset path and checkpoint directory to Modal’s storage primitives.

Core Concepts & Key Points

  • Modal Volume: A distributed filesystem that behaves like a local directory but is shared across all Modal Functions. Why it matters: it stores datasets and training checkpoints durably, so jobs can resume after preemption or new deployments.
  • Image & GPU spec: A Modal Image defines your environment (Python, PyTorch, Transformers) and is paired with a GPU spec like gpu="A10G". Why it matters: you get reproducible training environments and elastic GPU capacity without manual provisioning.
  • Checkpointing pattern: Training code that periodically writes model/optimizer state to a persistent path and, on startup, looks for the latest checkpoint to resume from. Why it matters: it prevents wasted GPU hours by avoiding full restarts after failures and lets you run long experiments confidently.

How It Works (Step-by-Step)

At a high level, you’ll:

  1. Define a Modal Volume to hold datasets and Hugging Face checkpoints.
  2. Build a Modal Image with PyTorch, transformers, and any other dependencies.
  3. Implement a fine-tuning function that:
    • Loads data/model configs from Hugging Face.
    • Saves checkpoints into the Volume every N steps/epochs.
    • On startup, looks for the latest checkpoint and resumes if present.
  4. Run and deploy your training job via modal run / modal deploy.

Let’s walk through the pattern in code.

1. Set up the Modal app, Volume, and environment

We’ll assume you’re fine-tuning a text classification model, but this applies to any Hugging Face pipeline (seq2seq, diffusion, etc.).

# train_hf_modal.py
import os
from pathlib import Path

import modal

app = modal.App("hf-finetune-with-checkpoints")

# Volume for datasets + checkpoints
CHECKPOINT_VOLUME_NAME = "hf-training-volume"
hf_volume = modal.Volume.from_name(CHECKPOINT_VOLUME_NAME, create_if_missing=True)

# Base image with PyTorch + Hugging Face
image = (
    modal.Image.debian_slim()
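    # Consider pinning versions (e.g. "transformers==4.44.0") for reproducible builds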
    .pip_install(
        "torch",
        "transformers",
        "datasets",
        "accelerate",   # optional but recommended
    )
)

This Volume is your persistent storage: it will hold checkpoints like checkpoint-1000/, checkpoint-2000/, etc., and can be accessed from any function in your app.
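
You can also inspect the Volume from your terminal. Assuming the Volume name above (paths are relative to the Volume root, which the function mounts at /data), Modal's volume CLI lets you list and download files:

modal volume ls hf-training-volume experiments/imdb-run-1
modal volume get hf-training-volume experiments/imdb-run-1/final_model ./final_model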

2. Define a GPU-backed training function

Pick a GPU spec appropriate for your model (e.g., A10G for mid-sized transformers, A100 for larger models).

@app.function(
    image=image,
    gpu="A10G",  # or "A100", "T4", "A100:2" for two GPUs, etc.
    timeout=24 * 60 * 60,  # Modal's maximum of 24 hours per run
    volumes={"/data": hf_volume},
)
def train_hf_model(
    model_name: str = "distilbert-base-uncased",
    dataset_name: str = "imdb",
    output_dir: str = "/data/experiments/imdb-run-1",
    checkpoint_interval_steps: int = 1000,
):
    import torch
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        AutoModelForSequenceClassification,
        TrainingArguments,
        Trainer,
    )

    # All persistent data lives under /data (the Volume mount)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # 1) Detect last checkpoint in the Volume
    last_checkpoint = _get_last_checkpoint(output_path)

    # 2) Prepare dataset + model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset(dataset_name)

    def tokenize_fn(batch):
        return tokenizer(
            batch["text"],
            padding="max_length",
            truncation=True,
            max_length=256,
        )

    tokenized = dataset.map(tokenize_fn, batched=True)
    tokenized = tokenized.rename_column("label", "labels")
    tokenized.set_format("torch")

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
    )

    # 3) Training arguments: write checkpoints into /data
    training_args = TrainingArguments(
        output_dir=str(output_path),
        eval_strategy="steps",  # named evaluation_strategy in transformers < 4.41
        eval_steps=checkpoint_interval_steps,
        save_steps=checkpoint_interval_steps,
        save_total_limit=3,  # keep only most recent N checkpoints
        logging_steps=100,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        load_best_model_at_end=False,
        report_to=[],  # disable HF logging integrations
    )

    # 4) Trainer with the usual Hugging Face API
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        tokenizer=tokenizer,
    )

    if last_checkpoint:
        print(f"⚡️ resuming training from the latest checkpoint: {last_checkpoint}")
        trainer.train(resume_from_checkpoint=str(last_checkpoint))
    else:
        print("⚡️ starting training from scratch")
        trainer.train()

    trainer.save_model(str(output_path / "final_model"))
    print("⚡️ training finished successfully")

The key detail: output_dir points into /data, which is the Modal Volume mount. Hugging Face’s Trainer writes checkpoints there just like a local disk, but now they’re durable and shared across runs.
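
Because the Volume is shared across every function in the app, other functions can read those checkpoints too. As a sketch, here is a CPU-only helper (the name list_experiments is illustrative) that lists what each run has saved so far:

@app.function(volumes={"/data": hf_volume})
def list_experiments():
    # Pick up changes committed by other containers since this one started
    hf_volume.reload()

    root = Path("/data/experiments")
    if not root.exists():
        print("no experiments yet")
        return
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        checkpoints = sorted(
            p.name for p in run_dir.iterdir() if p.name.startswith("checkpoint-")
        )
        print(f"{run_dir.name}: {checkpoints or 'no checkpoints yet'}")

Run it with modal run train_hf_modal.py::list_experiments.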

3. Implement checkpoint discovery in the Volume

The only Modal-specific logic you need is “find the latest checkpoint directory if it exists.”

def _get_last_checkpoint(experiment_dir: Path) -> Path | None:
    if not experiment_dir.exists():
        return None

    checkpoints = [
        p
        for p in experiment_dir.iterdir()
        if p.is_dir() and p.name.startswith("checkpoint-")
    ]
    if not checkpoints:
        return None

    # Hugging Face checkpoint directories are usually named checkpoint-<step>
    checkpoints.sort(
        key=lambda p: int(p.name.split("-")[-1]),
    )
    return checkpoints[-1]

This is the same pattern you’d use locally. Modal’s Volume behaves like a normal filesystem, so the logic is unchanged.
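
If you'd rather not maintain this helper, transformers ships an equivalent utility that performs the same checkpoint-<step> scan, so you could use it instead:

from transformers.trainer_utils import get_last_checkpoint

# Returns the newest checkpoint-<step> directory as a str, or None if absent
last_checkpoint = get_last_checkpoint(str(output_path))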

4. Run the fine-tuning job

Run this from your terminal:

modal run train_hf_modal.py::train_hf_model

You can override parameters on the CLI:

modal run train_hf_modal.py::train_hf_model \
  --model-name bert-base-uncased \
  --dataset-name ag_news \
  --output-dir /data/experiments/ag-news-run-1 \
  --checkpoint-interval-steps 500

Watch logs in the Modal dashboard under the “Apps” page. You’ll see the Trainer print lines like Saving model checkpoint to /data/experiments/.../checkpoint-500.

If the job is interrupted (preemption, deploy change, manual stop), re-running the same call with the same output_dir will detect the last checkpoint in the Volume and resume from there.
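
If you prefer launching runs from Python instead of the ::function CLI syntax, you can add a local entrypoint to the same file. A minimal sketch (the argument values are illustrative):

@app.local_entrypoint()
def main(model_name: str = "distilbert-base-uncased"):
    # .remote() runs the function on Modal's cloud GPUs and streams logs back
    train_hf_model.remote(
        model_name=model_name,
        output_dir="/data/experiments/imdb-run-1",
    )

With this in place, a bare modal run train_hf_modal.py invokes the entrypoint.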


Common Mistakes to Avoid

  • Saving checkpoints to ephemeral /tmp or working dirs:
    If you write to non-mounted paths (like /tmp or the default output_dir outside /data), your checkpoints vanish when the container exits. Always point output_dir at a Volume mount like /data/... (a runtime guard for this is sketched after this list).

  • Changing output_dir between runs:
    If every run uses a different output_dir, your _get_last_checkpoint logic will never find anything to resume from. Use a stable path per experiment (/data/experiments/imdb-run-1) and only change it when you intentionally want a fresh run.
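
One cheap way to enforce the first rule is a fail-fast check at the top of the training function. A minimal sketch, assuming the Volume is mounted at /data as in the examples above:

VOLUME_MOUNT = "/data"  # must match the key in volumes={...}

# Fail fast if checkpoints would land on ephemeral container disk
if not output_dir.startswith(VOLUME_MOUNT):
    raise ValueError(
        f"output_dir {output_dir!r} is not under the Volume mount {VOLUME_MOUNT!r}; "
        "checkpoints written there would be lost when the container exits"
    )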


Real-World Example

Imagine you’re fine-tuning a Llama-2-style instruction model for your internal support chat using a proprietary dataset. You mount a Modal Volume at /data, sync your dataset once into /data/datasets/support-conversations/, and then kick off a training run on A100:2 GPUs.

Your training function writes checkpoints every 500 steps to /data/experiments/support-llama-v1/. Halfway through, you bump your training script to fix a preprocessing bug and redeploy. With the checkpointing pattern above, your next modal run picks up the latest checkpoint directory from the Volume, resumes at step N+1 on the updated code, and continues training—no manual copy from S3, no fighting with GPU reservations.

Pro Tip: Treat Modal Volumes as your “single source of truth” for experiments—write a small experiment ID into each run’s directory (e.g., config.json, metrics.json), and use stable output_dir names so you can rerun or extend training without touching your Hugging Face training code.
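
As a sketch of that pattern, a few lines near the top of the training function can stamp each run directory with its configuration (the field names here are arbitrary):

import json

# Record the run's configuration next to its checkpoints on the Volume
(output_path / "config.json").write_text(
    json.dumps(
        {
            "model_name": model_name,
            "dataset_name": dataset_name,
            "checkpoint_interval_steps": checkpoint_interval_steps,
        },
        indent=2,
    )
)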


Summary

Fine-tuning a Hugging Face model on Modal is just your normal training loop, wrapped in a Modal function that mounts a Volume and requests a GPU. Hugging Face’s Trainer writes checkpoints to a path like /data/experiments/…, and a tiny bit of Python discovers the latest checkpoint and passes it to trainer.train(resume_from_checkpoint=...). Because Modal Volumes are distributed and durable, you can safely run long jobs, retry after preemption, and iterate on your training code without sacrificing progress.

