How do I fine-tune on together.ai (SFT vs DPO, LoRA vs full) and estimate token-based training cost before I run it?

Most teams hit the same wall once they get past prototypes: the base model is “almost right,” but you need tighter control over behavior, fewer hallucinations, and predictable unit economics. On together.ai, that is what fine-tuning and model shaping are for, and you can estimate token-based costs before you launch a job.

Quick Answer: Together’s fine-tuning stack lets you run supervised fine-tuning (SFT) and preference-based methods like DPO, using either LoRA adapters or full-model training. You can estimate training cost from the UI or CLI before a run by plugging in dataset tokens, epochs, and training config, so there are no surprises.


The Quick Overview

  • What It Is: A production-grade fine-tuning platform for open-source and partner models on the AI Native Cloud, with options for SFT vs DPO and LoRA vs full-parameter training.
  • Who It Is For: AI product teams that need models shaped to their domain, with better accuracy and behavior control, without running their own GPU training infrastructure.
  • Core Problem Solved: Eliminates the complexity of training pipelines, hardware management, and cost-guessing, so you can focus on data and evaluation instead of orchestration.

How Together’s Fine-Tuning Pipeline Works

At a high level, every fine-tuning run on together.ai follows the same pattern: you pick a base model, choose a training method (SFT or DPO), decide between LoRA and full fine-tuning, upload or reference your dataset, and then estimate and launch the job. The underlying stack (UPipe, FFT Optimizer, and throughput-aware schedulers) takes care of memory efficiency, throughput, and scaling.

  1. Define Objective (SFT vs DPO):

    • For “make the model follow my instructions on this data,” use SFT.
    • For “make the model prefer these outputs over those,” use DPO or other preference optimization.
  2. Choose Capacity (LoRA vs Full):

    • LoRA: cheaper, faster iteration, and easy rollback—ideal for small to medium datasets or frequent updates.
    • Full fine-tuning: more control for large/complex datasets or deeply customized assistants.
  3. Estimate & Launch:

    • Provide dataset size (in tokens), training steps/epochs, and configuration.
    • Use the UI or CLI to get a cost estimate before running the job.
    • Once training completes, you get a deployable model that can be hosted on serverless, Dedicated Model Inference, or Dedicated Container Inference.

Step 1: SFT vs DPO on together.ai

When to use Supervised Fine-Tuning (SFT)

Use SFT when you have input → output pairs and you want the model to imitate your ground-truth responses.

Best for:

  • Domain-specific chat agents (e.g., support, legal, medical triage with carefully reviewed answers).
  • Code assistants tuned to your repositories and style guides.
  • Structured generation (contracts, data schemas, templates) where “gold” completions exist.

How SFT fits the pipeline:

  • Data format: typically {prompt, response} or {input, output} pairs in JSONL or similar.
  • Objective: maximize likelihood of your reference outputs.
  • Behavior: strongest effect on style, formatting, and domain knowledge recall.

If you’re new to fine-tuning on together.ai, SFT is the first lever to pull.
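As a minimal sketch of the pair format described above, the snippet below writes two records as JSON Lines and checks that each one has non-empty fields. The records are made up, and the `prompt` / `response` key names are an assumption taken from the format listed here; confirm the exact schema the platform expects before uploading.

```python
import json

# Hypothetical example records; key names follow the {prompt, response}
# pair format described above and may differ from the platform's schema.
records = [
    {"prompt": "Summarize our refund policy.",
     "response": "Refunds are issued within 14 days of purchase."},
    {"prompt": "What is the support email?",
     "response": "Contact support@example.com."},
]


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n"


def validate_sft_jsonl(text, required=("prompt", "response")):
    """Return 1-based line numbers of records missing a non-empty string field."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        if any(not isinstance(row.get(k), str) or not row[k].strip() for k in required):
            bad.append(lineno)
    return bad


jsonl_text = to_jsonl(records)
print(validate_sft_jsonl(jsonl_text))  # prints [] because both records are complete
```

Running a check like this locally before upload catches empty or malformed rows early, which is cheaper than discovering them after a job has started.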

When to use DPO (Direct Preference Optimization) or Similar Preference Methods

Use DPO when you care more about relative quality than exact ground truth, and you can label “good vs bad” responses or train from pairwise comparisons.

Best for:

  • Aligning tone and safety (“prefer helpful, harmless, honest” outputs).
  • RLHF-like refinement after an SFT base run.
  • Search / rerank or summarization quality where ranking is easier than labeling a single gold answer.

How DPO fits the pipeline:

  • Data format: (prompt, chosen_response, rejected_response) triplets or a close equivalent.
  • Objective: increase the model’s likelihood of the chosen output relative to the rejected one, measured against a frozen reference model.
  • Behavior: more impact on subtle preference and safety characteristics than on core knowledge.
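To make the objective concrete, here is a small sketch of the standard DPO loss for a single preference pair, computed from summed log-probabilities. It illustrates the published formula, not Together's training code; the sample record, the log-probability values, and `beta=0.1` are illustrative assumptions.

```python
import math

# Hypothetical preference record in the triplet shape described above.
pair = {
    "prompt": "Explain our SLA in one sentence.",
    "chosen_response": "We guarantee 99.9% monthly uptime.",
    "rejected_response": "Uptime is probably fine.",
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected),
    where each argument is the summed log-probability of a full response.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931

# Policy now favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2))  # True
```

The loss only falls when the policy widens the gap between chosen and rejected responses relative to the reference model, which is why DPO shifts preferences more than it adds new knowledge.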

Practical playbook:

  • Start with SFT on curated instruction data.
  • Layer DPO on top for nuance: style, harmlessness, and preference ordering.
  • Use together.ai’s inference endpoints to A/B real traffic before committing.
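One common way to A/B a fine-tuned model against the base model is a deterministic, hash-based traffic split, sketched below. This is a generic routing pattern, not a together.ai feature; the variant labels and the 10% treatment share are hypothetical.

```python
import hashlib


def assign_variant(user_id, treatment_share=0.1):
    """Deterministically route a user to the 'base' or 'fine_tuned' variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "fine_tuned" if bucket < treatment_share else "base"


# The same user always lands in the same variant across requests.
print(assign_variant("user-123") == assign_variant("user-123"))  # True
```

Hashing the user ID rather than sampling per request keeps each user's experience consistent, which makes quality comparisons between the two models cleaner.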

Step 2: LoRA vs Full Fine-Tuning

Together’s platform supports different training modes so you can match cost and control to your data.

LoRA Fine-Tuning

Lightweight low-rank adapters added to the base model; only a small fraction of parameters are trained.

Best for:

  • Small to medium datasets.
  • Cost-sensitive or early-stage production workloads.
  • Fast iteration cycles where you expect to update the model frequently.

Characteristics:

  • Fast training & deployment: Shorter wall-clock time, faster time-to-first-eval.
  • Lower cost: Far fewer trainable parameters → fewer GPU hours.
  • Easy rollback: Remove or swap adapters to revert behavior.

Combine LoRA with SFT for most early-stage deployments and A/B experiments. You can maintain multiple LoRA adapters for different domains if needed.
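To see why LoRA trains only a small fraction of parameters, the sketch below counts the trainable weights for a single linear layer. A LoRA adapter replaces the update to a `d_out × d_in` weight matrix with two low-rank factors of rank `r`. The hidden size of 4096 and rank of 16 are illustrative assumptions, not settings of any particular model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters updated when the full weight matrix is trained."""
    return d_in * d_out


d = 4096   # hypothetical hidden size
r = 16     # hypothetical LoRA rank

lora = lora_trainable_params(d, d, r)
full = full_params(d, d)
print(lora, full, f"{lora / full:.2%}")  # prints: 131072 16777216 0.78%
```

For this layer the adapter trains under 1% of the weights that full fine-tuning would update, which is where the memory and cost savings come from.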

Full Fine-Tuning

Updates all model weights; requires more compute but gives maximum control.

Best for:

  • Large or complex datasets.
  • Deep behavior changes where you’re effectively creating a new variant (e.g., heavy domain specialization).
  • High-traffic production workloads where you want a single unified model (no adapter stitching).

Characteristics:

  • Maximum control and quality (given good data).
  • Dedicated infrastructure: Usually paired with Dedicated Model Inference for predictable latency and throughput.
  • Higher training cost but better “baked-in” behavior for long-term, stable workloads.

Migration pattern I recommend:

  1. Start with LoRA + SFT to prove value and refine data.
  2. Optionally add LoRA + DPO for preference alignment.
  3. Once the dataset and behavior stabilize and traffic is high, move to full fine-tuning for better performance and simpler runtime.

Step 3: Estimating Token-Based Training Cost Before You Run

Together’s fine-tuning UI and CLI are designed to avoid budget surprises. You can get a cost estimate before starting any job by providing token and configuration details.

What drives training cost?

At a high level:

  • Total training tokens = (Number of tokens in your dataset) × (Number of epochs).
  • Model size (e.g., 7B vs 70B) and context length.
  • Training mode: LoRA vs full fine-tuning.
  • Throughput optimizations: UPipe, FFT Optimizer, and other stack elements maximize throughput (tokens/sec), which reduces GPU time and cost.

Internally, the training stack uses research-backed systems like:

  • UPipe: Data and compute pipeline that reduces memory usage by up to 82.5% vs other SOTA approaches, enabling larger batches or longer sequences on the same hardware.
  • FFT Optimizer and other schedulers (FPDT, ALST): Improve throughput (TPS) and GPU utilization at scale.

You don’t need to tune these directly—they’re part of the platform—but they directly affect your cost per 1M tokens.
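The token arithmetic above can be turned into a back-of-the-envelope estimate. The sketch below assumes a flat price per 1M training tokens; the two rates are HYPOTHETICAL placeholders, not together.ai prices, so substitute the figures from the current pricing page.

```python
def training_cost_usd(dataset_tokens, epochs, usd_per_million_tokens):
    """Estimate cost as (dataset tokens x epochs) priced per 1M tokens."""
    total_tokens = dataset_tokens * epochs
    return total_tokens / 1_000_000 * usd_per_million_tokens


# HYPOTHETICAL rates in USD per 1M training tokens, for illustration only.
HYPOTHETICAL_RATES = {"lora": 0.50, "full": 1.50}

for mode, rate in HYPOTHETICAL_RATES.items():
    cost = training_cost_usd(dataset_tokens=50_000_000, epochs=3,
                             usd_per_million_tokens=rate)
    print(f"{mode}: ${cost:.2f}")
# prints:
# lora: $75.00
# full: $225.00
```

Because cost scales linearly with both dataset tokens and epochs, halving either one halves the estimate. The platform's own estimate should be treated as authoritative.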

How to estimate cost in the UI

  1. Select a base model

    • From the Together console, pick an open-source or partner model (e.g., Llama-family, Qwen, etc.) you want to fine-tune.
  2. Choose fine-tuning type

    • Select LoRA or Full fine-tuning.
    • Pick SFT or Preference-based (DPO) objective type if surfaced in the workflow.
  3. Attach your dataset

    • Upload a file or point to storage; the platform counts tokens or lets you specify them.
    • Confirm dataset size (tokens) in the summary pane.
  4. Set training config

    • Epochs or total training steps.
    • Optional: batch size, learning rate, max sequence length.
  5. View cost estimate

    • The UI displays a projected cost range before you click “Start.”
    • You can iterate on epochs, batch size, or even LoRA vs full to see how the estimate changes.
  6. Confirm and launch only once the budget aligns with your constraints.
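Before uploading, it can help to have a rough local token count to compare against the number shown in the summary pane. The sketch below uses the common rule of thumb of about four characters per token for English text. That ratio is an assumption: real counts depend on the model's tokenizer, and the field names are the hypothetical `prompt` / `response` keys.

```python
import json


def approx_tokens(text, chars_per_token=4.0):
    """Rough token estimate; real counts depend on the model's tokenizer."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def approx_dataset_tokens(jsonl_text, fields=("prompt", "response")):
    """Sum the rough token estimates over the chosen fields of every record."""
    total = 0
    for line in jsonl_text.splitlines():
        if line.strip():
            row = json.loads(line)
            total += sum(approx_tokens(str(row.get(f, ""))) for f in fields)
    return total


sample = '{"prompt": "abcdefgh", "response": "abcd"}\n'
print(approx_dataset_tokens(sample))  # prints 3 (8 chars -> 2 tokens, 4 chars -> 1)
```

If the platform's count differs sharply from a local estimate like this, that is usually a sign of a formatting problem in the file rather than a pricing issue.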

How to estimate cost via CLI

The same logic is accessible programmatically:

  1. Prepare a config JSON/YAML with:

    • model: base model name.
    • training_type: lora or full.
    • objective: sft or dpo.
    • dataset_tokens: token count (or dataset path so the system can compute this).
    • epochs: integer.
    • Other hyperparameters as needed.
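  2. Call a cost-estimate command or endpoint. The command and method names below illustrate the pattern only; check the current CLI and SDK reference for the exact names:

    # Illustrative command name, not a verified CLI subcommand
    together fine-tune estimate \
      --config config.json
    

    or via API:

    import together
    
    # In real code, read the key from an environment variable instead
    together.api_key = "YOUR_API_KEY"
    
    # Illustrative call: shows the shape of the request, not a verified SDK method
    estimate = together.FineTune.estimate(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        training_type="lora",
        objective="sft",
        dataset_tokens=50_000_000,  # total tokens in the dataset
        epochs=3,                   # full passes over the dataset
    )
    print(estimate)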
    
  3. Read the returned estimate

    • Expect fields like estimated_tokens, estimated_gpu_hours, and estimated_cost.
    • Adjust dataset size or epochs to hit your target budget, then launch the real job.

The key practice: treat the estimate call as a mandatory pre-flight check. Don’t launch a large full fine-tuning job without running it first.
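As a local complement to that pre-flight check, the sketch below validates a config with the fields listed above and works out how many epochs fit inside a budget. The field names mirror the list in step 1; the per-token rate and the budget are HYPOTHETICAL values, and none of this calls the Together API.

```python
ALLOWED = {"training_type": {"lora", "full"}, "objective": {"sft", "dpo"}}
REQUIRED = ("model", "training_type", "objective", "dataset_tokens", "epochs")


def validate_config(cfg):
    """Return a list of problems found in a fine-tuning config dict."""
    errors = [f"missing field: {key}" for key in REQUIRED if key not in cfg]
    for key, allowed in ALLOWED.items():
        if key in cfg and cfg[key] not in allowed:
            errors.append(f"{key} must be one of {sorted(allowed)}")
    for key in ("dataset_tokens", "epochs"):
        value = cfg.get(key)
        if key in cfg and (isinstance(value, bool) or not isinstance(value, int) or value <= 0):
            errors.append(f"{key} must be a positive integer")
    return errors


def max_epochs_within_budget(dataset_tokens, usd_per_million_tokens, budget_usd):
    """Largest whole number of epochs whose estimated cost stays within budget."""
    cost_per_epoch = dataset_tokens / 1_000_000 * usd_per_million_tokens
    return int(budget_usd // cost_per_epoch)


config = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "training_type": "lora",
    "objective": "sft",
    "dataset_tokens": 50_000_000,
    "epochs": 3,
}

print(validate_config(config))  # prints []
# HYPOTHETICAL rate of $0.50 per 1M tokens and a $100 budget:
print(max_epochs_within_budget(config["dataset_tokens"], 0.50, 100.0))  # prints 4
```

Catching a typo such as `training_type: "lora "` locally is faster than waiting for a rejected job, and the epoch calculation gives you a starting point to confirm against the platform's estimate.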


Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| LoRA & Full Fine-Tuning | Lets you choose between adapter-based and full-parameter training | Match cost and control to dataset and workload |
| SFT & Preference Training (DPO) | Supports instruction-following and preference-based objectives | Better alignment: accuracy + behavior shaping |
| Pre-Run Cost Estimation | Estimates training cost from tokens, epochs, and config via UI/CLI | No surprises; predictable unit economics |
| Research-Optimized Stack (UPipe, FFT, etc.) | Maximizes memory efficiency and throughput during training | Larger jobs, up to 82.5% less memory vs SOTA |
| Seamless Deployment on AI Native Cloud | Deploys fine-tuned models to Serverless, Batch, Dedicated Model or Container | Faster path from experiment to production |

Ideal Use Cases

  • Best for domain-specific copilots: Because SFT with LoRA on your instruction data gives you a fast, low-cost path to a domain-tuned assistant you can deploy to Serverless Inference and iterate on weekly.
  • Best for high-traffic, latency-sensitive AI products: Because full fine-tuning plus Dedicated Model Inference lets you bake behavior into the base weights, then serve with predictable latency and best price-performance.
  • Best for safety & preference alignment: Because DPO-style training on preference datasets lets you systematically favor safe, high-quality responses without rebuilding your SFT pipeline.

Limitations & Considerations

  • Data quality is the real bottleneck: together.ai handles infrastructure, but low-quality or noisy datasets will still yield poor models. Plan for multiple curation cycles with small LoRA runs first.
  • Not every task needs fine-tuning: For some workloads (e.g., retrieval-augmented QA), prompt engineering + RAG + serverless inference may be cheaper and more flexible than training. Use fine-tuning where it clearly improves your latency, quality, or cost.

From a deployment standpoint:

  • SFT + LoRA fits best when you expect frequent updates and moderate traffic.
  • Full fine-tuning makes most sense when you have stable requirements and high traffic that justify the training investment.

Pricing & Plans Mindset

Fine-tuning jobs are billed based on compute and tokens processed rather than a separate “plan,” and you can mix them with any inference mode:

  • Serverless Inference: Best for variable or unpredictable traffic and early-stage products consuming your fine-tuned model with no infrastructure to manage.
  • Batch Inference: Best for large offline jobs (classifying or summarizing big corpora, generating synthetic data) consuming your fine-tuned model at up to 50% less cost, scaling to 30 billion tokens.
  • Dedicated Model Inference: Best for steady, latency-sensitive workloads where you want an isolated endpoint backed by reserved compute and the Together inference engine.
  • Dedicated Container Inference / GPU Clusters: Best for teams that need maximum control over the runtime environment or want to mix training + inference on their own cluster while still leveraging Together’s kernels and runtime.

For detailed per-token training and inference rates, check the current together.ai pricing page or contact sales, then plug your training config into the estimate tools (UI/CLI).


Frequently Asked Questions

How do I choose between SFT and DPO for my first fine-tune?

Short Answer: Start with SFT on clean instruction data; add DPO later if you need nuanced preference or safety alignment.

Details:
SFT is more forgiving and works directly with input → output pairs, making it ideal for the first iteration of a domain-specific assistant or generator. Only once you’ve stabilized that behavior and collected “preferred vs rejected” outputs from humans or logs does DPO become the right tool, fine-tuning the model to rank good completions higher than bad ones. On together.ai, you can chain these: SFT to get a good baseline, then DPO to refine.


When should I use LoRA instead of full fine-tuning?

Short Answer: Use LoRA for small–medium datasets, frequent updates, and budget-sensitive experiments; use full fine-tuning when you have large datasets, stable requirements, and high-traffic workloads.

Details:
LoRA dramatically reduces trainable parameters, which cuts memory and GPU hours—perfect for iteration and early-stage products. You can test multiple variants quickly and roll back by disabling adapters. Once you know your data and behavior are correct and your traffic warrants the investment, full fine-tuning unlocks deeper behavior changes and a simpler serving path (no adapter routing) with best-possible performance. Together’s cost-estimation tools make it easy to sanity check the budget of both modes before running them.


Summary

Fine-tuning on together.ai is built for production: you choose SFT or DPO based on your objective, pick LoRA or full fine-tuning based on dataset size and economics, and then estimate your token-based training cost before you commit. Under the hood, research-grade systems like UPipe (up to 82.5% memory savings versus other SOTA approaches), FFT Optimizer, and throughput-aware schedulers handle memory efficiency and throughput, and the resulting models can be deployed across Serverless, Batch, or Dedicated Inference with the same OpenAI-compatible API.

If latency is a product feature and cost per 1M tokens is your moat, this workflow is designed to help you make those tradeoffs explicitly instead of guessing.
