How do I fine-tune on together.ai (SFT vs DPO, LoRA vs full) and estimate token-based training cost before I run it?

Fine-tuning on together.ai comes down to two choices you need to get right early: (1) how you shape behavior (SFT vs DPO) and (2) how much of the model you actually train (LoRA vs full). Once those are clear, estimating token-based training cost before you launch a job is straightforward—you can do it from the UI or CLI so there are no surprises.

Quick Answer: together.ai lets you fine-tune open‑source models using Supervised Fine-Tuning (SFT) or preference‑based methods like DPO, with either lightweight LoRA adapters or full‑model training. You can estimate token-based training cost in advance from the console or CLI by specifying your dataset size and training schedule before any GPUs are allocated.


The Quick Overview

  • What It Is: A fine-tuning and deployment stack on the AI Native Cloud that lets you train SFT or DPO variants of open‑source models using LoRA or full fine‑tuning, then deploy them to Serverless Inference, Dedicated Model Inference, Dedicated Container Inference, or GPU Clusters.
  • Who It Is For: Teams building production AI features—chat, agents, RAG, code, vision, or multimodal—who want better accuracy, fewer hallucinations, and tighter control than off‑the‑shelf base models, without managing training infrastructure.
  • Core Problem Solved: You get production‑grade model shaping (SFT/DPO, LoRA/full) with predictable cost, strong latency/throughput, and full ownership of your weights and data, without touching Kubernetes, CUDA, or training pipelines.

How It Works

Under the hood, together.ai’s fine-tuning platform is a training and serving layer built on the same research that powers its inference stack (Together Kernel Collection, FlashAttention variants, runtime optimizations like ATLAS and CPD). You bring a dataset and a base model; the platform handles distributed training, monitoring, checkpoints, and deployment onto high‑performance inference endpoints.

At a high level:

  1. Choose strategy (SFT vs DPO):
    Decide whether you’re doing supervised learning from labeled input–output pairs (SFT) or preference‑based optimization (e.g., DPO) from ranked responses or “chosen vs rejected” pairs.

  2. Choose capacity (LoRA vs full fine‑tuning):
    For small/medium datasets and faster iteration, attach LoRA adapters. For large datasets or deep behavior shifts, train all model weights on dedicated infrastructure.

  3. Estimate cost and launch:
    From the UI or CLI, you specify base model, tokens, batch size/epochs, and fine‑tuning method. The platform computes an estimated token‑based cost before the job starts so you can adjust configuration to hit your budget.
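The estimate itself is simple arithmetic: tokens per epoch times epochs times a per-token rate. A minimal sketch of that calculation, where the per-million-token rate is a hypothetical placeholder, not a published together.ai price:

```python
def estimate_training_cost(dataset_tokens: int, epochs: int,
                           price_per_million_tokens: float) -> float:
    """Estimated cost = tokens per epoch x epochs x per-million-token rate."""
    total_tokens = dataset_tokens * epochs
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical rate for illustration only -- check the pricing page for real numbers.
cost = estimate_training_cost(dataset_tokens=2_000_000, epochs=3,
                              price_per_million_tokens=0.50)
print(f"~{2_000_000 * 3:,} training tokens, estimated ${cost:.2f}")
```

The same formula is what you iterate against in the UI: any lever that reduces total tokens (fewer epochs, smaller dataset) reduces the estimate proportionally.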

Once a job completes, you get a versioned fine‑tuned model that you can deploy to:

  • Serverless Inference for variable or bursty real‑time traffic.
  • Dedicated Model Inference for steady, latency‑sensitive workloads.
  • Dedicated Container Inference when you need custom runtimes or extra dependencies.
  • GPU Clusters when you want full cluster control (Kubernetes/Slurm) for custom training loops.

All of this runs with tenant‑level isolation, encryption in transit/at rest, SOC 2 Type II controls, and a clear ownership boundary: your data and models remain fully under your ownership.


SFT vs DPO on together.ai

When to use SFT (Supervised Fine-Tuning)

SFT is the default starting point: you train a base model on (input → target output) pairs.

Best for:

  • Instruction following (“Given X, respond in format Y”)
  • Domain adaptation (e.g., support logs, codebase, financial text)
  • Style/voice alignment (brand tone, safety guidelines)
  • Early production models where you can label “correct” outputs

Why it works well on together.ai:

  • Optimized pipelines for large token throughput
  • Training infrastructure built on the same research as the inference stack (Together Kernel Collection, FlashAttention variants) for high throughput per GPU
  • Easy iteration: update datasets, rerun SFT, roll back to previous checkpoints

You should pick SFT when:

  • You can define ground truth outputs for most examples.
  • You want faster time‑to‑production with straightforward data preparation.
  • You’re building the first “v1” of a product model and will refine with preferences later.
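Concretely, SFT data is typically a JSONL file of input–output pairs. A sketch of one record in a chat-messages layout; the field names follow a common convention and should be checked against together.ai's data-format docs before uploading:

```python
import json

# One SFT example: the model learns to map the user turn to the assistant turn.
# Field names follow a common chat-messages convention; confirm the exact
# schema against together.ai's dataset documentation before uploading.
record = {
    "messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my API key?"},
        {"role": "assistant", "content": "Go to Settings > API Keys and click Regenerate."},
    ]
}

# Each training example becomes one line of the JSONL file.
with open("sft_train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```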

When to use DPO (Direct Preference Optimization) or preference training

DPO and related preference‑based methods optimize the model to prefer “chosen” responses over “rejected” ones using pairwise preferences instead of absolute labels.

Best for:

  • “A vs B” quality tuning (e.g., human rater chose response A over B)
  • Safety/harms reduction via curated preference data
  • Resolving ambiguous tasks where many answers can be correct but some are clearly better

Why it works well on together.ai:

  • You can run preference training on top of an SFT‑tuned base, stacking SFT → DPO.
  • Same training infrastructure: you still get high throughput and cost visibility.
  • Easy to iterate: collect more preference pairs, rerun DPO, and keep SFT weights fixed.

You should pick DPO when:

  • You already have a decent SFT model and want to fine‑tune “taste” (helpfulness/safety/style).
  • You’re running human eval pipelines (or synthetic judges) that output ranked responses.
  • You care about patching edge‑case behavior more than broad knowledge coverage.
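Preference data has a different shape: a shared prompt plus a preferred and a rejected response. A sketch of one record; "chosen"/"rejected" is the conventional pairing, but verify field names against together.ai's preference-data docs:

```python
import json

# One DPO example: the model is optimized to prefer "chosen" over "rejected"
# for the same prompt. Field names are conventional, not an official schema.
pair = {
    "prompt": "Summarize this outage report for an executive audience.",
    "chosen": "Brief, plain-language summary with impact and next steps.",
    "rejected": "A raw dump of stack traces with no summary.",
}
print(json.dumps(pair))
```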

Pattern that works in practice:

  1. Start with SFT on curated, high‑quality supervised data.
  2. Layer DPO on top using human or model preferences.
  3. Deploy the DPO variant for traffic; keep SFT as a stable fallback.

LoRA vs Full Fine-Tuning on together.ai

LoRA fine-tuning

LoRA fine-tuning on together.ai adds low‑rank adapter weights while freezing the base model.

Best for:

  • Small to medium datasets (e.g., 50k–5M tokens)
  • Cost‑sensitive or early‑stage production workloads
  • Rapid iteration with many model variants per team or per customer

Operational benefits:

  • Fast training & deployment: shorter jobs, faster experiments.
  • Lower cost: fewer trainable parameters → fewer GPU hours.
  • Easy to update or roll back: adapters can be swapped without touching the base weights.

Pick LoRA when:

  • You need many specialized models (per tenant, per vertical) with similar base.
  • You’re still exploring prompts, specs, and data quality.
  • You want to keep infrastructure spend low while learning what works.
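The cost gap comes from trainable-parameter counts. A rough back-of-envelope sketch, assuming rank-16 adapters on the four attention projection matrices of a hypothetical 7B model with hidden size 4096 and 32 layers (illustrative numbers, not a specific together.ai configuration):

```python
def lora_trainable_params(hidden: int, layers: int, rank: int,
                          matrices_per_layer: int = 4) -> int:
    """Each adapted hidden x hidden matrix gains two low-rank factors:
    hidden x rank and rank x hidden, so 2 * hidden * rank new weights."""
    return layers * matrices_per_layer * 2 * hidden * rank

full = 7_000_000_000                      # every weight trains in full fine-tuning
lora = lora_trainable_params(hidden=4096, layers=32, rank=16)
print(f"LoRA trains {lora:,} params vs {full:,} "
      f"({lora / full:.3%} of the model)")
```

A fraction of a percent of trainable weights is why LoRA jobs finish faster and cost less, and why adapters are cheap to store and swap per tenant.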

Full fine-tuning

Full fine-tuning trains all model parameters on your dataset using dedicated infrastructure.

Best for:

  • Large or complex datasets (millions to tens of millions of tokens and beyond)
  • Deeper behavior changes (e.g., domain‑specific reasoning, code generation across an entire monorepo)
  • Teams that want maximal control and are converging on a long‑term foundation model

Operational benefits:

  • Maximum control and quality: the model internalizes your domain, not just surface patterns.
  • Better for extensive distribution shifts: e.g., non‑English, proprietary notation, specialized scientific/financial text.
  • Dedicated infrastructure: consistent throughput, predictable SLOs.

Pick full fine-tuning when:

  • You’ve validated the direction with LoRA and want to “lock in” gains at base weights.
  • You have strong datasets and clear evaluation metrics.
  • You’re building a long‑lived foundation for many downstream products.

Features & Benefits Breakdown

  • SFT & DPO training options: lets you choose supervised or preference‑based fine‑tuning pipelines, so you can align models to your product's instructions and your users' preferences.
  • LoRA & full fine-tuning modes: supports lightweight adapters as well as full‑weight training on dedicated infrastructure, so you can match cost versus control to your dataset size and how critical the behavior is in production.
  • Upfront cost estimation: estimates token-based training costs from the UI or CLI before a job launches, eliminating budget surprises and letting you tune epochs, tokens, and batch size before spending.

How to Fine-Tune on together.ai (SFT vs DPO, LoRA vs Full)

1. Pick a base model and deployment target

  • Choose an open‑source base from together.ai (e.g., Llama, Qwen, Code models, multimodal where supported).
  • Decide where this model will run after training:
    • Serverless Inference: variable or unpredictable traffic.
    • Dedicated Model Inference: steady, latency‑sensitive workloads.
    • Dedicated Container Inference / GPU Clusters: if you need custom runtimes or additional libraries.

Your deployment target informs how aggressive you can be with model size vs latency/cost.

2. Choose SFT vs DPO based on your data

  • If your data is input + target output, start with SFT.
  • If you have chosen vs rejected responses or ranked outputs, configure a DPO/preference run, typically on top of a prior SFT model.

For many teams, the first pass is:

  1. SFT on ~100k–5M tokens of carefully curated data.
  2. Evaluate on held‑out test sets and red‑team prompts.
  3. Add DPO once you have preference labels from humans or a judge model.
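The data-shape rule above can be encoded as a small helper that routes each record by its fields, assuming the conventional messages and chosen/rejected field names (a hypothetical sketch, not an official together.ai validator):

```python
def training_method_for(record: dict) -> str:
    """Classify a dataset record as SFT or DPO material by its fields."""
    if "chosen" in record and "rejected" in record:
        return "dpo"   # pairwise preference -> DPO / preference training
    if "messages" in record or ("prompt" in record and "completion" in record):
        return "sft"   # input -> target output -> supervised fine-tuning
    raise ValueError(f"Unrecognized record shape: {sorted(record)}")

print(training_method_for({"prompt": "Hi", "chosen": "A", "rejected": "B"}))
print(training_method_for({"messages": [{"role": "user", "content": "Hi"}]}))
```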

3. Choose LoRA vs full fine-tuning

Use LoRA when:

  • Dataset is small/medium.
  • You need multiple variants (A/B experiments, per‑customer models).
  • Budget constraints are tight and you want to iterate quickly.

Use full fine-tuning when:

  • Dataset is large or highly specialized.
  • You want one or a few “canonical” models for the organization.
  • You’re optimizing for long‑term quality, not just experimentation speed.

You can also combine them over time: LoRA for exploration → full fine‑tune once you converge.

4. Prepare your dataset and tokenize

You’ll typically:

  • Normalize your instruction/response or chosen/rejected pair format.
  • Apply any safety filters or redactions required by your org.
  • Upload to together.ai storage or point to a cloud bucket.

Tokenization is handled by the platform using the base model's tokenizer, so cost is driven by token count, not raw characters.
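A sketch of that prep step: normalize records into JSONL and get a rough token count for budgeting before upload. The 4-characters-per-token ratio is a common rule of thumb, not the base model's actual tokenizer; the platform computes exact counts after upload:

```python
import json

def to_jsonl(records, path):
    """Write normalized records as one JSON object per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def approx_tokens(records) -> int:
    """Rough pre-upload estimate: ~4 characters per token (rule of thumb)."""
    chars = sum(len(json.dumps(rec)) for rec in records)
    return chars // 4

records = [
    {"prompt": "Classify the ticket severity.", "completion": "P2"},
    {"prompt": "Draft a refund reply.", "completion": "Sure -- here is a draft..."},
]
to_jsonl(records, "train.jsonl")
print(f"~{approx_tokens(records)} tokens before epochs are applied")
```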

Estimating Token-Based Training Cost Before Running

together.ai is designed so you can estimate training costs up front—no hidden GPU surprises.

From the UI

In the fine-tuning UI:

  1. Select base model, fine-tuning type (SFT or preference/DPO), and mode (LoRA or full).
  2. Provide dataset size (or upload and let the system compute token counts).
  3. Configure training schedule:
    • Number of epochs or target tokens
    • Batch size/sequence length
    • Optional evaluation cadence/checkpoints

The UI will then:

  • Calculate an estimated number of training tokens (e.g., tokens per epoch × epochs).
  • Map that to an estimated cost based on your chosen configuration and model size.
  • Let you adjust hyperparameters before launching to hit your budget target.

You can iterate: reduce epochs, shrink the dataset, or move from full to LoRA until the estimate fits your cost envelope.
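That iteration can be scripted: keep trimming epochs until the estimate fits the envelope. A minimal sketch, where the per-million-token rate is again a hypothetical placeholder:

```python
def fit_to_budget(dataset_tokens: int, max_epochs: int,
                  price_per_million: float, budget: float) -> int:
    """Return the largest epoch count whose estimated cost fits the budget."""
    for epochs in range(max_epochs, 0, -1):
        cost = dataset_tokens * epochs / 1_000_000 * price_per_million
        if cost <= budget:
            return epochs
    raise ValueError("Even one epoch exceeds the budget; shrink the dataset "
                     "or switch from full fine-tuning to LoRA.")

# Hypothetical rate: 5M-token dataset at $0.80/M tokens under a $10 cap -> 2 epochs.
print(fit_to_budget(dataset_tokens=5_000_000, max_epochs=4,
                    price_per_million=0.80, budget=10.0))
```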

From the CLI

From the CLI, the flow mirrors the UI:

  1. Specify your job config (model, mode, dataset, epochs, etc.).
  2. Use the cost estimation command (documented in the fine‑tuning CLI docs) to:
    • Compute token counts.
    • Get an estimated training cost before submission.

This is ideal for CI/CD pipelines: you can enforce “budget guardrails” that fail a job preview if estimated cost exceeds a threshold.
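A budget guardrail in CI can be as simple as computing the estimate and returning a nonzero exit code when it breaches the cap. In this sketch the estimate is local arithmetic standing in for the CLI's cost-estimation output; the actual command name and flags live in the fine-tuning CLI docs and aren't reproduced here:

```python
BUDGET_USD = 50.0

def preview_cost(dataset_tokens: int, epochs: int, price_per_million: float) -> float:
    """Local stand-in for the CLI cost preview: tokens x epochs x rate."""
    return dataset_tokens * epochs / 1_000_000 * price_per_million

def budget_gate(estimated: float, budget: float = BUDGET_USD) -> int:
    """Return a shell-style exit code: 0 under budget, 1 over."""
    if estimated > budget:
        print(f"FAIL: estimated ${estimated:.2f} exceeds ${budget:.2f} cap")
        return 1
    print(f"OK: estimated ${estimated:.2f} within ${budget:.2f} cap")
    return 0

# Hypothetical job: 20M tokens x 3 epochs at $0.80/M = $48, under the $50 cap.
exit_code = budget_gate(preview_cost(dataset_tokens=20_000_000, epochs=3,
                                     price_per_million=0.80))
```

In a pipeline you would feed `exit_code` to the shell (e.g., `sys.exit(exit_code)`) so an over-budget preview fails the build before any job is submitted.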

Why estimates are credible

Costs are estimated off:

  • Token throughput achievable with together.ai’s fine-tuning infrastructure, built on:
    • UPipe and custom kernels from the Together Kernel Collection.
    • Memory optimizations such as FlashAttention variants and efficient activation handling.
  • Model size and configuration (e.g., LoRA rank, full‑model parameter count).

The infrastructure is optimized for scale and production performance, so estimates align closely with actual GPU utilization in practice.


Ideal Use Cases

  • Best for domain-aligned assistants (SFT + LoRA):
    Because you can train on small to medium datasets from your docs, tickets, or codebase with LoRA adapters, you get fast iteration, lower cost, and controlled behavior without retraining the whole model.

  • Best for production “house models” (SFT → DPO + full):
    Because you can start with SFT on large proprietary corpora, then layer DPO with human preferences and train full weights on dedicated infrastructure, you can build deeply aligned, high‑quality models that underpin multiple products.


Limitations & Considerations

  • Data quality is still the main bottleneck:
    together.ai removes infrastructure overhead, but SFT/DPO outcomes are only as good as your labeling and curation. Plan for eval sets and continuous data cleanup.

  • Not every workload needs full fine-tuning:
    For some tasks, prompt engineering plus RAG on Serverless Inference is cheaper and simpler. Use fine-tuning when you need consistent behavior, lower tokens per request, or you’ve hit the limits of prompting/RAG.


Pricing & Plans

Fine-tuning costs on together.ai are primarily a function of:

  • Model size and type (7B vs 70B, text vs multimodal).
  • Fine-tuning mode (LoRA vs full).
  • Total training tokens (dataset size × epochs).

The platform lets you estimate training costs before launching any job from the UI or CLI so you can evaluate resource requirements upfront and eliminate budget surprises.

You can then choose how to host your fine‑tuned model:

  • Serverless Inference: Best for teams with variable or bursty traffic needing no‑ops deployments and “pay for what you use” economics.
  • Dedicated Model Inference / Dedicated Container Inference: Best for teams with steady, high‑throughput workloads that need lower p95 latency, reserved capacity, and strong cost control.

Frequently Asked Questions

How do I decide between SFT and DPO for my first fine-tune on together.ai?

Short Answer: Start with SFT on high‑quality supervised data; add DPO later if you have preference labels and need to tune behavior beyond simple instruction following.

Details:
SFT is more data‑efficient when you can define target outputs. It’s ideal for the first production version of a model: you get deterministic behavior improvements from relatively small datasets. Once you deploy and collect real user interactions or human ratings, you’ll see where the SFT model falls short. That’s when DPO shines: you keep the SFT weights and “reshape” behavior on top using chosen‑vs‑rejected examples, especially for subtle quality and safety preferences. together.ai supports both, and they’re complementary, not mutually exclusive.


How accurate are the token-based training cost estimates, and can I keep jobs within a fixed budget?

Short Answer: Estimates are designed to be close to actual costs, and you can tune epochs, tokens, and configuration until the estimate fits your budget before launching.

Details:
When you configure a fine-tuning job, together.ai calculates expected tokens processed and maps that to an estimated cost. Because the platform’s training stack is built on performance research (e.g., UPipe and related throughput optimizers), token‑to‑cost conversion is stable across runs. In practice, you use this loop:

  1. Upload or point to your dataset.
  2. Let the system compute token counts.
  3. Adjust epochs, batch size, or fine-tuning mode (LoRA vs full) until the estimate is acceptable.
  4. Optionally script this in the CLI to enforce hard budget caps in CI/CD.

This way, you never “discover” cost ex post; you commit to a configuration only once you’re comfortable with the estimate.


Summary

Fine-tuning on together.ai is structured around two decisions that heavily influence cost and quality: SFT vs DPO for how you shape behavior, and LoRA vs full fine‑tuning for how much of the model you train. SFT gives you fast, supervised alignment; DPO refines preferences on top. LoRA gives low‑cost, fast iteration; full fine‑tuning gives maximal control on dedicated infrastructure.

The key difference from rolling your own stack is that you get research‑grade training infrastructure (UPipe, optimized kernels) plus upfront token‑based cost estimation directly in the UI and CLI. You can predict spend before any GPUs spin up, then deploy the resulting model onto serverless or dedicated inference with strong latency, throughput, and ownership guarantees.


Next Step

Get Started