How do I create a together.ai Instant GPU Cluster, pick reserved vs on-demand billing, and set guardrails to avoid surprise charges?

Most teams don’t lose money on GPUs because the hardware is slow; they lose it because clusters sit idle, billing modes don’t match workloads, and there are no guardrails when experiments spike. together.ai’s Instant GPU Clusters are designed to fix that: bare‑metal performance you can spin up in minutes, with clear choices between on‑demand and reserved capacity and predictable billing behavior.

The Quick Overview

  • What It Is: Instant, self‑serve GPU Clusters on the together.ai AI Native Cloud, with InfiniBand networking, managed orchestration, and flexible on‑demand or reserved pricing.
  • Who It Is For: AI research and product teams training, fine‑tuning, or running large‑scale inference jobs that need performance and scale (8–4,000+ GPUs) without running their own GPU infrastructure.
  • Core Problem Solved: Getting reliable, cost‑efficient GPU capacity without DevOps overhead or surprise cloud bills — especially for long‑running or large‑batch workloads.

How It Works

Instant GPU Clusters give you bare‑metal NVIDIA HGX hardware (H100, B200, GB200 NVL72) wired with InfiniBand, plus managed orchestration. You pick your hardware, GPU count, and pricing model (on‑demand vs reserved), then connect via your preferred scheduler (Kubernetes, Slurm, or custom). together.ai handles provisioning, recovery, and scaling so you can focus on training loops, inference pipelines, and model iteration.

  1. Cluster Creation: From the Together console, you choose the GPU type (e.g., H100 SXM), cluster size (8–4,000+ GPUs), region, and orchestration options, then create the cluster in minutes.
  2. Billing Mode Selection: For each cluster, you decide between on‑demand (maximum flexibility, pay by the hour) and reserved (capacity locked in for a term, lower hourly rate, better economics at scale).
  3. Guardrails & Controls: You set limits around cluster size, runtime, and usage patterns, and you monitor cost via billing visibility and alerts so clusters don’t run longer or larger than intended.

Step‑by‑Step: Creating an Instant GPU Cluster

Below is the typical flow an engineering team would follow. Interfaces evolve, but the control points and trade‑offs remain the same.

1. Sign in and navigate to GPU Clusters

  • Log into your together.ai account at https://www.together.ai.
  • In the console sidebar, select GPU Clusters.
  • Click Create cluster for on‑demand capacity, or Reserve capacity if you already know you want a reserved block.

2. Choose your hardware

Together GPU Clusters are built on AI‑optimized NVIDIA systems:

  • NVIDIA HGX H100 SXM (80GB)

    • On‑Demand: $2.99/hr per GPU
    • Reserved: Starting at $1.75/hr per GPU
    • Scale: 8 to 256 GPUs (self‑serve); up to 4,000+ with reservations
  • NVIDIA HGX B200

    • On‑Demand: $5.50/hr per GPU
    • Reserved: Starting at $4.00/hr per GPU
    • Scale: 256 to 1,000+ GPUs
  • NVIDIA GB200 NVL72

    • On‑Demand: Contact together.ai for pricing
    • Reserved: Contact together.ai for pricing
    • Scale: 512 to 1,000+ GPUs

Choose hardware based on:

  • Model size and memory demands (e.g., 70B+ parameter models, long‑context training).
  • Interconnect needs (multi‑node training often benefits from InfiniBand and larger node counts).
  • Budget and utilization patterns (H100 vs B200 vs GB200 trade‑offs).

3. Select cluster size and topology

  • Set the number of GPUs (e.g., 8, 16, 64, 256, or more).
  • Confirm the node layout (e.g., 8 GPUs per node) if exposed in the UI.
  • Ensure the cluster size maps to your training/inference configuration (data‑parallel, tensor‑parallel, pipeline‑parallel).

Keep in mind:

  • On‑demand clusters typically scale up to 256 GPUs based on real‑time availability.
  • Reserved clusters can scale from 256 to 4,000+ GPUs with capacity locked for your term.
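Before creating the cluster, it helps to sanity‑check that the requested GPU count actually decomposes into your parallelism layout. A minimal sketch, assuming 8 GPUs per node (as on HGX H100 SXM systems); the checks are generic heuristics, not a Together API:

```python
# Sketch: sanity-check that a requested GPU count matches a parallelism
# layout before creating the cluster. Assumes 8 GPUs per node, as with
# HGX H100 SXM systems; layout names (dp/tp/pp) are generic conventions.

def check_cluster_layout(total_gpus: int, dp: int, tp: int, pp: int,
                         gpus_per_node: int = 8) -> list[str]:
    """Return a list of problems with the requested layout (empty = OK)."""
    problems = []
    if total_gpus != dp * tp * pp:
        problems.append(
            f"layout mismatch: dp*tp*pp = {dp * tp * pp}, cluster has {total_gpus}")
    if total_gpus % gpus_per_node != 0:
        problems.append(
            f"{total_gpus} GPUs is not a multiple of {gpus_per_node} per node")
    if tp > gpus_per_node:
        problems.append(
            f"tensor-parallel degree {tp} spans nodes; keep tp <= {gpus_per_node} "
            "so tensor-parallel traffic stays on intra-node NVLink")
    return problems

# 64 GPUs as 8-way data parallel x 8-way tensor parallel, no pipeline stages
print(check_cluster_layout(64, dp=8, tp=8, pp=1))  # []
```

Running the check as part of a job-launcher script catches "I asked for 48 GPUs but my config assumes 64" mistakes before they burn paid hours.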

4. Pick your pricing model: on‑demand vs reserved

At cluster creation, you choose between:

  • On‑Demand

    • Commitment: None — pay hourly, terminate anytime.
    • Best for: Starting with flexibility, bursty or unpredictable workloads, short experiments.
    • Capacity: Based on real‑time availability.
    • Scale: Up to 256 GPUs via self‑serve.
  • Reserved

    • Commitment: Up to 6 months, paid upfront.
    • Best for: Guaranteed access with better economics for known, sustained workloads.
    • Capacity: Locked in for the duration.
    • Scale: Up to 4,000+ GPUs.

Before you create the cluster, you’ll see an estimated hourly cost (the per‑GPU rate for your chosen hardware and pricing mode, multiplied by GPU count). Treat that estimate as your first guardrail against unexpected spend.
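You can reproduce that estimate locally before anyone clicks Create. A sketch using the snapshot rates quoted in this article (they will drift; confirm current pricing in the console):

```python
# Sketch: reproduce the console's pre-creation cost estimate locally.
# Rates are the snapshot quoted in this article and will drift; confirm
# current pricing in the Together console before relying on them.

HOURLY_RATE = {  # $/hr per GPU
    ("H100", "on_demand"): 2.99,
    ("H100", "reserved"): 1.75,   # "starting at" rate
    ("B200", "on_demand"): 5.50,
    ("B200", "reserved"): 4.00,   # "starting at" rate
}

def estimate_cost(gpu_type: str, gpu_count: int, mode: str, hours: float) -> float:
    """Estimated spend = per-GPU rate x GPU count x hours."""
    return HOURLY_RATE[(gpu_type, mode)] * gpu_count * hours

# A 64-GPU H100 on-demand cluster running for a 72-hour experiment:
print(f"${estimate_cost('H100', 64, 'on_demand', 72):,.2f}")  # $13,777.92
```

Putting this in a pre-launch script (or a Slack bot) makes the dollar figure visible at the moment someone requests capacity, not at month end.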

5. Configure orchestration and access

Depending on options available in the console:

  • Choose your orchestration style (e.g., Kubernetes or Slurm).
  • Retrieve connection details and credentials.
  • Set up your CI/CD or job submission pipeline to target this cluster (training scripts, fine‑tuning, batch inference).

The cluster exposes bare‑metal performance with InfiniBand and managed orchestration, so you get high throughput and low latency without managing underlying infrastructure.

6. Launch, monitor, and terminate

  • Start your training, fine‑tuning, or inference workloads.
  • Monitor:
    • GPU utilization (to avoid under‑utilizing an expensive cluster),
    • Job completion times,
    • Cost trends against your budget.
  • Terminate on‑demand clusters as soon as workloads complete to stop billing.
  • For reserved capacity, you can still shut down clusters, but your underlying reservation remains active for the term you selected.

On‑Demand vs Reserved: How to Choose for Your Workload

When on‑demand is the right choice

On‑demand clusters are typically best for:

  • Exploratory research & sandboxing

    • You’re testing hypotheses or iterating on training code.
    • You don’t know how many GPUs you’ll need next week.
    • You want the option to spin up a cluster for an afternoon, then shut it down.
  • Bursty or irregular workloads

    • Example: A new model launch requires a week of intense fine‑tuning, then quiets down.
    • You want capacity when you need it, no commitments when you don’t.
  • Teams just starting with together.ai

    • You want to validate performance and integration.
    • You’re not ready to commit to a 6‑month reservation.

Trade‑offs:

  • Higher effective hourly price vs reserved.
  • Capacity depends on current availability.
  • Perfect for testing, less ideal as a long‑term home for steady workloads.

When reserved capacity is the right choice

Reserved clusters are designed for:

  • Production training or recurring workloads

    • You train/fine‑tune large models every week or month.
    • You run large batch inference jobs on a schedule.
    • You can reasonably predict GPU needs for the next 3–6 months.
  • Sustained high‑scale experiments

    • 256–4,000+ GPUs for multi‑week runs.
    • You want to lock in capacity and price.
  • Budget‑sensitive teams with stable demand

    • You’re optimizing cost per 1M tokens or per training run.
    • You prefer upfront commitment for lower hourly rates.

Trade‑offs:

  • Upfront commitment (up to 6 months).
  • Less flexibility if your workload drops sharply.
  • Best economics when utilization is high.
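The choice between the two modes ultimately reduces to utilization: a reservation bills for every hour of the term, while on‑demand bills only for hours you actually use. A short sketch of the resulting break‑even point, using this article's snapshot rates:

```python
# Sketch: the utilization level at which a reserved block beats on-demand.
# A reservation bills for every hour of the term; on-demand bills only for
# hours actually used. Reserved wins once expected utilization exceeds
# reserved_rate / on_demand_rate. Rates are this article's snapshot.

def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of the term you must actually use GPUs for reserved to win."""
    return reserved_rate / on_demand_rate

u = break_even_utilization(on_demand_rate=2.99, reserved_rate=1.75)
print(f"H100 break-even utilization: {u:.1%}")  # 58.5%

u_b200 = break_even_utilization(on_demand_rate=5.50, reserved_rate=4.00)
print(f"B200 break-even utilization: {u_b200:.1%}")  # 72.7%
```

In other words, at these rates an H100 reservation only pays off if you expect to keep the block busy more than roughly 58% of the term; below that, the on‑demand premium is cheaper than paying for idle reserved hours.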

Features & Benefits Breakdown

  • Self‑Serve GPU Clusters
    • What it does: Spin up AI‑ready GPU clusters in minutes with InfiniBand and managed control.
    • Primary benefit: Go from zero to training or inference quickly.
  • Flexible On‑Demand & Reserved
    • What it does: Offers hourly on‑demand pricing and discounted reserved capacity.
    • Primary benefit: Match cost structure to workload patterns.
  • Scale from 8 to 4,000+ GPUs
    • What it does: Supports both small experiments and very large distributed jobs.
    • Primary benefit: Run everything from POC to production mega‑runs.

Ideal Use Cases

  • Best for training and fine‑tuning large models: Because you can spin up H100/B200/GB200 clusters with InfiniBand, scale to thousands of GPUs, and use reserved pricing to keep multi‑week runs predictable.
  • Best for large‑scale batch inference or evaluation: Because batch jobs can run on dedicated clusters with no interference from other tenants, and on‑demand clusters can be torn down immediately after the processing window.

Guardrails: How to Avoid Surprise Charges

Instant compute is powerful, but it’s only safe when paired with control. Here’s a practical set of guardrails to implement around together.ai GPU Clusters.

1. Align cluster type to workload duration

  • Short, exploratory jobs: Use smaller on‑demand clusters (8–32 GPUs), and terminate as soon as tests complete.
  • Long, predictable runs: Move to reserved clusters when you know your GPU needs and job cadence.

This avoids running high‑priced on‑demand clusters for what should be steady, reserved workloads.

2. Set internal limits on cluster size and count

Within your team:

  • Define a maximum cluster size engineers can create without approval (e.g., <64 GPUs).
  • Require review for large or long‑lived clusters (e.g., >128 GPUs or >7 days).
  • Establish environment‑level guardrails (dev vs prod) with different limits and approval paths.

Practically, this looks like:

  • A lightweight RFC or ticket before creating big reserved clusters.
  • A policy that “experiments run on ≤32 GPUs; anything larger must be scheduled and reviewed.”
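Such a policy is easy to encode as a pre‑flight gate in whatever tooling wraps your cluster‑create calls. A sketch using the example thresholds from this section (this is internal policy code, not a Together API):

```python
# Sketch: a pre-flight policy gate to run before any cluster-create call.
# Thresholds mirror the examples in this section; tune them for your team.
# Internal policy code, not a Together API.

from dataclasses import dataclass

@dataclass
class ClusterRequest:
    gpu_count: int
    max_days: float
    environment: str  # "dev" or "prod"

def needs_approval(req: ClusterRequest) -> list[str]:
    """Return the reasons a request must be reviewed (empty = auto-approve)."""
    reasons = []
    if req.gpu_count > 64:
        reasons.append(f"{req.gpu_count} GPUs exceeds the 64-GPU self-serve limit")
    if req.max_days > 7:
        reasons.append(f"{req.max_days}-day runtime exceeds the 7-day limit")
    if req.environment == "dev" and req.gpu_count > 32:
        reasons.append("dev experiments are capped at 32 GPUs")
    return reasons

print(needs_approval(ClusterRequest(gpu_count=16, max_days=2, environment="dev")))
print(needs_approval(ClusterRequest(gpu_count=256, max_days=14, environment="prod")))
```

Wiring a check like this into CI or a CLI wrapper turns the written policy into something that actually stops an oversized request instead of relying on everyone remembering the rule.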

3. Treat clusters as ephemeral by default

  • For on‑demand:

    • Build your orchestration so that job completion triggers cluster teardown or at least sends a strong alert.
    • Avoid “just leave it running” patterns; every idle hour burns budget.
  • For reserved:

    • Even though capacity is prepaid, aggressively shut down workloads that aren’t doing useful work.
    • Use autoscaling patterns inside the cluster to match actual load.
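The "teardown on completion" pattern can be enforced in code rather than memory. A sketch in which `terminate_cluster` is a placeholder for whatever your stack uses (a console API call, a Terraform destroy, an ops ticket); it is not a documented Together client call:

```python
# Sketch: wrap a training job so the cluster is torn down (or an alert
# fires) the moment the job exits, instead of relying on someone to
# remember. `terminate_cluster` is a placeholder callback, NOT a
# documented Together client call.

from contextlib import contextmanager

@contextmanager
def ephemeral_cluster(cluster_id: str, terminate_cluster, alert):
    """Guarantee teardown runs after the job body, even on failure."""
    try:
        yield cluster_id
    finally:
        try:
            terminate_cluster(cluster_id)
        except Exception as exc:  # teardown failed: escalate loudly
            alert(f"cluster {cluster_id} may still be running: {exc}")

events = []
with ephemeral_cluster("clus-123", events.append, print):
    pass  # run the training job here
print(events)  # ['clus-123'] -> teardown ran on exit
```

Because the teardown lives in a `finally` block, it runs whether the job succeeds or crashes, which is exactly the failure mode that leaves idle clusters billing overnight.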

4. Monitor usage and cost trends

Use together.ai’s billing and monitoring views (and your own observability stack) to:

  • Track GPU‑hours per project or team.
  • Compare actual utilization vs booked capacity for reserved clusters.
  • Watch for unexpected spikes in GPU count or runtime, and set alerts.

A simple rule: if a cluster’s average GPU utilization is consistently below a threshold (e.g., 50%), re‑evaluate cluster size or whether that workload should be on a smaller or different cluster.
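That rule is trivial to automate against whatever metrics you already collect (DCGM exporters, Prometheus, or the console's views). A sketch:

```python
# Sketch: flag clusters whose average GPU utilization has stayed under a
# threshold, per the 50% rule of thumb above. Feed it samples from your
# own metrics stack; the data source here is assumed, not prescribed.

def flag_underutilized(samples: list[float], threshold: float = 0.50) -> bool:
    """True when average GPU utilization over the window is below threshold."""
    return bool(samples) and sum(samples) / len(samples) < threshold

hourly_util = [0.35, 0.42, 0.40, 0.38]  # e.g., hourly averages over a shift
print(flag_underutilized(hourly_util))  # True -> re-evaluate cluster size
```

Run it on a schedule and route a `True` result to the channel where cluster owners actually look; an unread dashboard is not a guardrail.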

5. Start on‑demand, then graduate to reserved

As a pattern:

  1. Prototype on on‑demand clusters at smaller scale to nail down:

    • Model architecture,
    • Training hyperparameters,
    • Throughput and token‑generation benchmarks.
  2. Once stable:

    • Calculate the GPU‑hours required for your full run (e.g., 256 GPUs × N hours).
    • If this becomes a recurring pattern, move that workload to reserved capacity to lock in better economics.

This protects you from committing to a large reservation before you understand your actual runtime and utilization.
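Once on‑demand runs have given you real numbers, sizing the reservation is arithmetic. A sketch using this article's H100 snapshot rates and an assumed ~730 hours per month:

```python
# Sketch: size a reservation from measured on-demand runs. Compares the
# work you expect over the term against the cost of the reserved block.
# Rates are this article's H100 snapshot; ~730 hours/month is an
# assumption baked in below.

def reserved_vs_on_demand(gpus: int, run_hours: float, runs_per_month: int,
                          term_months: int = 6,
                          on_demand_rate: float = 2.99,
                          reserved_rate: float = 1.75) -> dict:
    term_hours = term_months * 730            # ~730 hours per month
    used_gpu_hours = gpus * run_hours * runs_per_month * term_months
    return {
        "utilization": used_gpu_hours / (gpus * term_hours),
        "on_demand_cost": used_gpu_hours * on_demand_rate,
        "reserved_cost": gpus * term_hours * reserved_rate,
    }

# Example: 256 GPUs, one 96-hour run per week (~4/month), 6-month term
r = reserved_vs_on_demand(gpus=256, run_hours=96, runs_per_month=4)
print(f"utilization {r['utilization']:.0%}, "
      f"on-demand ${r['on_demand_cost']:,.0f} vs reserved ${r['reserved_cost']:,.0f}")
```

With these inputs the block would sit at roughly 53% utilization over the term, so on‑demand still comes out cheaper; utilization has to clear reserved_rate / on_demand_rate (about 58.5% at these rates) before the reservation pays off.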


Limitations & Considerations

  • On‑demand capacity is subject to real‑time availability: For critical launches or training runs where you can’t risk queueing, use reserved clusters to guarantee access.
  • Reserved capacity requires upfront commitment: You get better hourly economics, but you should only reserve the capacity you can realistically use over the term (up to 6 months).

Pricing & Plans

Together GPU Clusters use transparent, self‑serve pricing with two main models:

  • On‑Demand

    • Standard hourly rate per GPU.
    • No commitment: Pay hourly, terminate anytime.
    • Best for: Flexibility, unpredictable workloads, initial experimentation.
    • Capacity: Based on real‑time availability.
    • Scale: Up to 256 GPUs.
  • Reserved

    • Lower hourly rate per GPU, paid upfront.
    • Commitment: Up to 6 months.
    • Best for: Guaranteed access and best economics for sustained workloads.
    • Capacity: Locked in for your duration.
    • Scale: 256 to 4,000+ GPUs.

Example pricing snapshot (subject to change; always confirm in the console or with sales):

  • H100 SXM (80GB):

    • On‑Demand: $2.99/hr per GPU
    • Reserved: Starting at $1.75/hr per GPU
    • Scale: 8–256 GPUs (self‑serve; up to 4,000+ reserved)
  • B200:

    • On‑Demand: $5.50/hr per GPU
    • Reserved: Starting at $4.00/hr per GPU
    • Scale: 256–1,000+ GPUs
  • GB200 NVL72:

    • On‑Demand: Contact together.ai
    • Reserved: Contact together.ai
    • Scale: 512–1,000+ GPUs

In practice, teams often:

  • Start with on‑demand H100 clusters for early training and fine‑tuning.
  • Move production workloads and recurring training jobs to reserved H100 or B200 clusters once requirements are clear.

Frequently Asked Questions

How do I decide between on‑demand and reserved GPU Clusters for my first project?

Short Answer: Start on‑demand to learn your workload, then move to reserved once the pattern is stable and you’re confident you’ll use the capacity.

Details: For a new project, you rarely know the exact GPU count, training schedule, or how many experiments you’ll run. On‑demand lets you:

  • Iterate on model design and training code,
  • Measure tokens/sec, time‑to‑convergence, and GPU utilization,
  • Get a real cost profile (GPU‑hours per run).

Once you see that you’re consistently using, say, 256 H100 GPUs for multi‑day runs every month, a 6‑month reserved block at a lower hourly rate will usually yield better overall economics and guaranteed capacity. The key is to treat on‑demand as your learning and tuning phase, and reserved as your production phase.

What specific steps should my team take to avoid surprise GPU charges?

Short Answer: Enforce internal limits on cluster size/runtime, treat on‑demand clusters as ephemeral, and actively monitor GPU‑hours and utilization.

Details: Practically, this looks like:

  • Policy guardrails:

    • Require approvals for clusters above a threshold (e.g., >64 GPUs or >3‑day runs).
    • Separate “experimental” vs “production” clusters with different limits.
  • Lifecycle management:

    • Tie cluster creation and termination into your CI/CD or job orchestration, not manual clicks.
    • Use alerts when a cluster has been idle or under‑utilized for a defined period.
  • Budget visibility:

    • Track GPU‑hours per team and per project.
    • Regularly compare actual spend to planned spend and adjust cluster sizes and pricing model (on‑demand vs reserved) accordingly.

Using these practices on top of together.ai’s transparent per‑GPU hourly pricing and capacity controls is usually enough to prevent surprise bills, even for large‑scale jobs.


Summary

Instant GPU Clusters on together.ai give your team the ability to go from zero to large‑scale training, fine‑tuning, and batch inference in minutes, with bare‑metal performance, InfiniBand networking, and self‑serve orchestration. On‑demand pricing gives you maximum flexibility for exploration; reserved capacity gives you the best economics and guaranteed access once workloads are predictable. By combining the right billing model with lightweight guardrails — cluster size limits, time‑bounded runs, and active monitoring of GPU‑hours — you can run serious GPU workloads without surprise charges.
