How do I fine-tune a Hugging Face model on Modal and save checkpoints to persistent storage?

Quick Answer: You fine-tune a Hugging Face model on Modal by defining your training code as a Python @app.function with a GPU spec, mounting a Modal Volume for datasets and checkpoints, and writing checkpoints to that Volume during training. On subsequent runs, your function inspects the Volume for the latest checkpoint and resumes training from there, with no Modal-specific changes to your training framework required.

Why This Matters

Fine-tuning Hugging Face models is compute-intensive and usually comes with two big headaches: getting reliable GPU capacity when you need it, and not losing training progress when jobs are interrupted or preempted. Running your training loop on Modal gives you elastic GPUs, sub-second cold starts, and a persistent, distributed filesystem (Volumes) for checkpoints, so you can safely run long jobs, pause and resume experiments, and scale out without rewriting your ML code for bespoke infrastructure.

Key Benefits:

  • Elastic GPU training without infra glue: Request exactly the GPU you need (A10G, A100:2, etc.) and let Modal handle autoscaling, retries, and cold starts.
  • Durable checkpoints on a shared Volume: Save and load Hugging Face checkpoints from Modal Volumes so you don’t lose progress if a container dies or you redeploy.
  • Same training code locally and in the cloud: Keep your fine-tuning loop unchanged—just wire up the dataset path and checkpoint directory to Modal’s storage primitives.

Core Concepts & Key Points

  • Modal Volume: A distributed filesystem that behaves like a local directory but is shared across all Modal Functions. Why it matters: it stores datasets and training checkpoints durably, so jobs can resume after preemption or new deployments.
  • Image & GPU spec: A Modal Image defines your environment (Python, PyTorch, Transformers) and is paired with a GPU spec like gpu="A10G". Why it matters: you get reproducible training environments and elastic GPU capacity without manual provisioning.
  • Checkpointing pattern: Training code that periodically writes model/optimizer state to a persistent path and, on startup, looks for the latest checkpoint to resume from. Why it matters: it prevents wasted GPU hours by avoiding full restarts after failures and lets you run long experiments confidently.

How It Works (Step-by-Step)

At a high level, you’ll:

  1. Define a Modal Volume to hold datasets and Hugging Face checkpoints.
  2. Build a Modal Image with PyTorch, transformers, and any other dependencies.
  3. Implement a fine-tuning function that:
    • Loads data/model configs from Hugging Face.
    • Saves checkpoints into the Volume every N steps/epochs.
    • On startup, looks for the latest checkpoint and resumes if present.
  4. Run and deploy your training job via modal run / modal deploy.

Let’s walk through the pattern in code.

1. Set up the Modal app, Volume, and environment

We’ll assume you’re fine-tuning a text classification model, but this applies to any Hugging Face pipeline (seq2seq, diffusion, etc.).

# train_hf_modal.py
import os
from pathlib import Path

import modal

app = modal.App("hf-finetune-with-checkpoints")

# Volume for datasets + checkpoints
CHECKPOINT_VOLUME_NAME = "hf-training-volume"
hf_volume = modal.Volume.from_name(CHECKPOINT_VOLUME_NAME, create_if_missing=True)

# Base image with PyTorch + Hugging Face
image = (
    modal.Image.debian_slim()
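    # Consider pinning versions (e.g. "transformers==4.44.0") for reproducible builds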
    .pip_install(
        "torch",
        "transformers",
        "datasets",
        "accelerate",   # optional but recommended
    )
)

This Volume is your persistent storage: it will hold checkpoints like checkpoint-1000/, checkpoint-2000/, etc., and can be accessed from any function in your app.
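
You can also inspect the Volume from your terminal. Assuming the Volume name above (paths are relative to the Volume root, which the function mounts at /data), Modal's volume CLI lets you list and download files:

modal volume ls hf-training-volume experiments/imdb-run-1
modal volume get hf-training-volume experiments/imdb-run-1/final_model ./final_model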

2. Define a GPU-backed training function

Pick a GPU spec appropriate for your model (e.g., A10G for mid-sized transformers, A100 for larger models).

@app.function(
    image=image,
    gpu="A10G",  # or "A100", "T4", "A100:2" for two GPUs, etc.
    timeout=24 * 60 * 60,  # Modal's maximum of 24 hours per run
    volumes={"/data": hf_volume},
)
def train_hf_model(
    model_name: str = "distilbert-base-uncased",
    dataset_name: str = "imdb",
    output_dir: str = "/data/experiments/imdb-run-1",
    checkpoint_interval_steps: int = 1000,
):
    import torch
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        AutoModelForSequenceClassification,
        TrainingArguments,
        Trainer,
    )

    # All persistent data lives under /data (the Volume mount)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # 1) Detect last checkpoint in the Volume
    last_checkpoint = _get_last_checkpoint(output_path)

    # 2) Prepare dataset + model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset(dataset_name)

    def tokenize_fn(batch):
        return tokenizer(
            batch["text"],
            padding="max_length",
            truncation=True,
            max_length=256,
        )

    tokenized = dataset.map(tokenize_fn, batched=True)
    tokenized = tokenized.rename_column("label", "labels")
    tokenized.set_format("torch")

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
    )

    # 3) Training arguments: write checkpoints into /data
    training_args = TrainingArguments(
        output_dir=str(output_path),
        eval_strategy="steps",  # named evaluation_strategy in transformers < 4.41
        eval_steps=checkpoint_interval_steps,
        save_steps=checkpoint_interval_steps,
        save_total_limit=3,  # keep only most recent N checkpoints
        logging_steps=100,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        load_best_model_at_end=False,
        report_to=[],  # disable HF logging integrations
    )

    # 4) Trainer with the usual Hugging Face API
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        tokenizer=tokenizer,
    )

    if last_checkpoint:
        print(f"⚡️ resuming training from the latest checkpoint: {last_checkpoint}")
        trainer.train(resume_from_checkpoint=str(last_checkpoint))
    else:
        print("⚡️ starting training from scratch")
        trainer.train()

    trainer.save_model(str(output_path / "final_model"))
    print("⚡️ training finished successfully")

The key detail: output_dir points into /data, which is the Modal Volume mount. Hugging Face’s Trainer writes checkpoints there just like a local disk, but now they’re durable and shared across runs.
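
Because the Volume is shared across every function in the app, other functions can read those checkpoints too. As a sketch, here is a CPU-only helper (the name list_experiments is illustrative) that lists what each run has saved so far:

@app.function(volumes={"/data": hf_volume})
def list_experiments():
    # Pick up changes committed by other containers since this one started
    hf_volume.reload()

    root = Path("/data/experiments")
    if not root.exists():
        print("no experiments yet")
        return
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        checkpoints = sorted(
            p.name for p in run_dir.iterdir() if p.name.startswith("checkpoint-")
        )
        print(f"{run_dir.name}: {checkpoints or 'no checkpoints yet'}")

Run it with modal run train_hf_modal.py::list_experiments.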

3. Implement checkpoint discovery in the Volume

The only Modal-specific logic you need is “find the latest checkpoint directory if it exists.”

def _get_last_checkpoint(experiment_dir: Path) -> Path | None:
    if not experiment_dir.exists():
        return None

    checkpoints = [
        p
        for p in experiment_dir.iterdir()
        if p.is_dir() and p.name.startswith("checkpoint-")
    ]
    if not checkpoints:
        return None

    # Hugging Face checkpoint directories are usually named checkpoint-<step>
    checkpoints.sort(
        key=lambda p: int(p.name.split("-")[-1]),
    )
    return checkpoints[-1]

This is the same pattern you’d use locally. Modal’s Volume behaves like a normal filesystem, so the logic is unchanged.
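
If you'd rather not maintain this helper, transformers ships an equivalent utility that performs the same checkpoint-<step> scan, so you could use it instead:

from transformers.trainer_utils import get_last_checkpoint

# Returns the newest checkpoint-<step> directory as a str, or None if absent
last_checkpoint = get_last_checkpoint(str(output_path))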

4. Run the fine-tuning job

Run this from your terminal:

modal run train_hf_modal.py::train_hf_model

You can override parameters on the CLI:

modal run train_hf_modal.py::train_hf_model \
  --model-name bert-base-uncased \
  --dataset-name ag_news \
  --output-dir /data/experiments/ag-news-run-1 \
  --checkpoint-interval-steps 500

Watch logs in the Modal dashboard under the “Apps” page. You’ll see the Trainer print lines like Saving model checkpoint to /data/experiments/.../checkpoint-500.

If the job is interrupted (preemption, deploy change, manual stop), re-running the same call with the same output_dir will detect the last checkpoint in the Volume and resume from there.
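
If you prefer launching runs from Python instead of the ::function CLI syntax, you can add a local entrypoint to the same file. A minimal sketch (the argument values are illustrative):

@app.local_entrypoint()
def main(model_name: str = "distilbert-base-uncased"):
    # .remote() runs the function on Modal's cloud GPUs and streams logs back
    train_hf_model.remote(
        model_name=model_name,
        output_dir="/data/experiments/imdb-run-1",
    )

With this in place, a bare modal run train_hf_modal.py invokes the entrypoint.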


Common Mistakes to Avoid

  • Saving checkpoints to ephemeral /tmp or working dirs:
    If you write to non-mounted paths (like /tmp or the default output_dir outside /data), your checkpoints vanish when the container exits. Always point output_dir at a Volume mount like /data/... (a runtime guard for this is sketched after this list).

  • Changing output_dir between runs:
    If every run uses a different output_dir, your _get_last_checkpoint logic will never find anything to resume from. Use a stable path per experiment (/data/experiments/imdb-run-1) and only change it when you intentionally want a fresh run.
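
One cheap way to enforce the first rule is a fail-fast check at the top of the training function. A minimal sketch, assuming the Volume is mounted at /data as in the examples above:

VOLUME_MOUNT = "/data"  # must match the key in volumes={...}

# Fail fast if checkpoints would land on ephemeral container disk
if not output_dir.startswith(VOLUME_MOUNT):
    raise ValueError(
        f"output_dir {output_dir!r} is not under the Volume mount {VOLUME_MOUNT!r}; "
        "checkpoints written there would be lost when the container exits"
    )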


Real-World Example

Imagine you’re fine-tuning a Llama-2-style instruction model for your internal support chat using a proprietary dataset. You mount a Modal Volume at /data, sync your dataset once into /data/datasets/support-conversations/, and then kick off a training run on A100:2 GPUs.

Your training function writes checkpoints every 500 steps to /data/experiments/support-llama-v1/. Halfway through, you bump your training script to fix a preprocessing bug and redeploy. With the checkpointing pattern above, your next modal run picks up the latest checkpoint directory from the Volume, resumes at step N+1 on the updated code, and continues training—no manual copy from S3, no fighting with GPU reservations.

Pro Tip: Treat Modal Volumes as your “single source of truth” for experiments—write a small experiment ID into each run’s directory (e.g., config.json, metrics.json), and use stable output_dir names so you can rerun or extend training without touching your Hugging Face training code.
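
As a sketch of that pattern, a few lines near the top of the training function can stamp each run directory with its configuration (the field names here are arbitrary):

import json

# Record the run's configuration next to its checkpoints on the Volume
(output_path / "config.json").write_text(
    json.dumps(
        {
            "model_name": model_name,
            "dataset_name": dataset_name,
            "checkpoint_interval_steps": checkpoint_interval_steps,
        },
        indent=2,
    )
)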


Summary

Fine-tuning a Hugging Face model on Modal is just your normal training loop, wrapped in a Modal function that mounts a Volume and requests a GPU. Hugging Face’s Trainer writes checkpoints to a path like /data/experiments/…, and a tiny bit of Python discovers the latest checkpoint and passes it to trainer.train(resume_from_checkpoint=...). Because Modal Volumes are distributed and durable, you can safely run long jobs, retry after preemption, and iterate on your training code without sacrificing progress.

