
How do I create a together.ai Instant GPU Cluster, pick reserved vs on-demand billing, and set guardrails to avoid surprise charges?
Most teams don’t lose money on GPUs because the hardware is slow; they lose it because clusters stay idle, billing models don’t match workload patterns, and there are no hard guardrails. together.ai’s Instant GPU Clusters are designed to fix that: you get self-serve access to NVIDIA H100/B200/GB200 class hardware with bare‑metal performance, flexible on‑demand vs reserved pricing, and simple controls to keep spend predictable.
Quick Answer: You create a together.ai Instant GPU Cluster from the GPU Clusters page, choose your hardware (e.g., H100 SXM), set size and orchestration, then select either on‑demand (no commitment, up to 256 GPUs) or reserved (better price, up to 4,000+ GPUs, fixed term). To avoid surprise charges, you enforce time limits and teardown policies, start small, and use reserved capacity only for steady, always‑on workloads.
The Quick Overview
- What It Is: together.ai Instant GPU Clusters are self‑serve, AI‑ready GPU clusters with InfiniBand networking and managed orchestration, giving you bare‑metal performance without running your own GPU infrastructure.
- Who It Is For: AI research and product teams that need to train, fine‑tune, or run large‑scale inference on open models, and want to go from zero to production in minutes instead of building and maintaining their own GPU fleet.
- Core Problem Solved: They eliminate the operational overhead and cost risk of GPU management by combining fast provisioning, flexible pricing (on‑demand vs reserved), and clear controls over runtime and capacity.
How It Works
Under the hood, together.ai GPU Clusters are built as dedicated, bare‑metal GPU pools with InfiniBand networking and managed orchestration. You choose the hardware family (H100 SXM, B200, GB200), number of GPUs, and your orchestration stack (e.g., Kubernetes/Slurm‑style patterns). together.ai provisions the cluster in minutes, wires up networking and storage, and hands you a clean environment to run your jobs.
From there, you:
- Configure & Launch: Pick hardware, cluster size, and pricing model (on‑demand or reserved), then create the cluster with a few clicks.
- Connect & Run Workloads: Use your preferred tools (e.g., training scripts, inference servers, fine‑tuning pipelines) on top of the managed orchestration—no need to set up drivers, CUDA, or networking.
- Scale, Monitor & Shut Down: Scale from 8 GPUs to thousands, monitor utilization and costs, and shut clusters down (or let autoscaling/time‑based policies do it) to avoid idle charges. Use reserved capacity only where steady utilization justifies it.
Step‑by‑Step: Creating an Instant GPU Cluster
The specifics of your UI clicks may vary slightly over time, but the flow is conceptually stable: choose hardware → choose scale → choose billing → launch.
1. Navigate to GPU Clusters
- Sign in to your together.ai account.
- Go to GPU Clusters in the console.
- Click Create cluster for on‑demand, or Reserve capacity if you already know you want reserved capacity for a long‑running workload.
2. Choose Your Hardware
Within Choose your cluster configuration you’ll see options similar to:
- H100 SXM
  - Hardware: NVIDIA HGX H100 SXM (80GB)
  - On‑demand: $2.99/hr per GPU
  - Reserved: Starting at $1.75/hr per GPU
  - Scale: 8 to 256 GPUs
- B200
  - Hardware: NVIDIA HGX B200
  - On‑demand: $5.50/hr per GPU
  - Reserved: Starting at $4.00/hr per GPU
  - Scale: 256 to 1,000 GPUs
- GB200 NVL72
  - Hardware: NVIDIA GB200 NVL72
  - On‑demand: Contact for pricing
  - Reserved: Contact for pricing
  - Scale: 512 to 1,000+ GPUs
Pick based on:
- Model size & batch: Larger models / multi‑node training benefit from B200/GB200; many fine‑tuning jobs and high‑throughput inference run well on H100 SXM.
- Scale needs: If you only need tens of GPUs, H100 SXM is usually the sweet spot. If you’re targeting hundreds to thousands of GPUs (e.g., LLM pretraining, massive fine‑tunes), B200/GB200 options become relevant.
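To make the trade‑off concrete, the listed rates can be turned into an hourly burn figure. The following is a minimal sketch, not a together.ai API: `HardwareOption`, `CATALOG`, and `hourly_burn` are hypothetical names, and the rates and GPU ranges are the indicative figures from the list above (GB200 NVL72 is left out because its pricing is not listed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareOption:
    """One row of the hardware list (hypothetical structure, indicative numbers)."""
    name: str
    on_demand_rate: float      # USD per GPU-hour
    reserved_rate_from: float  # USD per GPU-hour, "starting at"
    min_gpus: int
    max_gpus: int


# Figures copied from the list above; always confirm them in the console.
CATALOG = {
    "h100-sxm": HardwareOption("H100 SXM", 2.99, 1.75, 8, 256),
    "b200": HardwareOption("B200", 5.50, 4.00, 256, 1000),
}


def hourly_burn(key: str, gpus: int, reserved: bool = False) -> float:
    """Return the cluster's cost per hour in USD, rejecting out-of-range sizes."""
    option = CATALOG[key]
    if not option.min_gpus <= gpus <= option.max_gpus:
        raise ValueError(
            f"{option.name} supports {option.min_gpus}-{option.max_gpus} GPUs, got {gpus}"
        )
    rate = option.reserved_rate_from if reserved else option.on_demand_rate
    return round(gpus * rate, 2)
```

For example, `hourly_burn("h100-sxm", 64)` returns 191.36, so a 64‑GPU on‑demand H100 cluster costs roughly $191 for every hour it exists, and `hourly_burn("h100-sxm", 64, reserved=True)` returns 112.0 at the "starting at" reserved rate.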
3. Set Cluster Size and Orchestration
For your selected hardware:
- Select GPU count within the allowed range (e.g., 8–256 for H100 SXM).
- Choose an orchestration mode, typically one of:
- Kubernetes‑like for microservice/inference‑oriented deployments.
- Slurm‑like / batch for large training or fine‑tuning jobs.
together.ai handles:
- Node provisioning and health checks.
- InfiniBand networking setup.
- Drivers, CUDA, and core runtime environment.
You focus on getting your training/inference stack (PyTorch, vLLM, etc.) running.
4. Pick On‑Demand vs Reserved Billing
This is where you align unit economics to workload pattern.
You’ll see two primary options:
On‑Demand
- Pricing: Standard hourly rate.
- Example: H100 SXM at $2.99/hr per GPU.
- Commitment: None — pay hourly, terminate anytime.
- Capacity: Based on real‑time availability.
- Scale: Up to 256 GPUs self‑serve.
- Best for:
- Experiments and POCs.
- Bursty workloads (e.g., short fine‑tunes, occasional batch runs).
- Early stages of a production rollout where traffic is uncertain.
Reserved
- Pricing: Lower hourly rate.
- Examples: H100 SXM starting at $1.75/hr per GPU; B200 starting at $4.00/hr per GPU.
- Commitment: Up to 6 months, pay upfront.
- Capacity: Locked in for your duration.
- Scale: Up to 4,000+ GPUs.
- Best for:
- Always‑on training / continuous fine‑tuning pipelines.
- Stable, long‑running inference fleets with predictable utilization.
- Large projects that need guaranteed access to hundreds/thousands of GPUs.
Choose one:
- For on‑demand, click Create cluster.
- For reserved, click Reserve capacity, confirm term and scale, and finalize.
5. Launch and Connect
Once created:
- Wait for the cluster status to reach “Ready” (this typically takes minutes).
- Retrieve connection details from the console:
- Endpoint(s)
- Credentials / tokens
- Any bootstrap configuration (e.g., images, startup scripts)
- Deploy your stack:
- Training: torchrun/DeepSpeed/FSDP jobs across nodes.
- Inference: Deploy your model servers (e.g., vLLM, TensorRT‑LLM) to run your workloads.
- Integrate with the rest of the AI Native Cloud:
- Use Batch Inference for large offline jobs.
- Use Dedicated Container Inference for your own images.
- Use Model Shaping to fine‑tune open models, then serve them on dedicated endpoints.
Reserved vs On‑Demand: How to Decide
The most common mistake I see is reserving GPUs too early. Use on‑demand until your utilization and traffic pattern are well understood.
When to Use On‑Demand
Best for:
- Research & experimentation: Spin up a cluster in minutes, test your hypothesis, shut it down when done. On‑demand pricing ensures scratchpad experiments don’t carry production‑level costs.
- Short‑lived projects or unknown horizons: Hackathons, temporary migrations, or model benchmarking campaigns.
- Burst workloads that don’t need guaranteed capacity: Occasional re‑indexing, embeddings generation, or episodic batch inference.
Why it works:
- No upfront commitment.
- You’re billed only while the cluster exists.
- You can scale up to 256 GPUs and terminate as soon as jobs finish.
When to Use Reserved
Best for:
- Always‑on training / fine‑tuning loops: E.g., nightly continual training or continuous retraining pipelines.
- High‑throughput inference where capacity must always be available: Production endpoints with strict SLOs, where losing capacity is not an option.
- Large, long‑duration projects: Pretraining or multi‑month production workloads where under‑utilization risk is low.
Why it works:
- Lower hourly rate (e.g., H100 SXM from $2.99 → starting at $1.75 per GPU/hr).
- Guaranteed access to the GPUs you reserve for the entire term.
- Scales to 4,000+ GPUs for large training runs or multi‑tenant internal use.
Rule of thumb:
- If you can realistically keep a cluster >50–60% utilized over months, reserved capacity generally wins on total cost.
- If usage is spiky or uncertain, stay on on‑demand and only reserve the “always‑on” baseline you’re sure you’ll use.
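The rule of thumb follows from the ratio of the two rates. As a minimal sketch, assume an on‑demand cluster is torn down whenever it is idle (so you pay only for the hours you use) while a reserved cluster is billed for every hour of the term; reserved then wins once utilization exceeds `reserved_rate / on_demand_rate`. The function names are illustrative, and the reserved rates are the "starting at" figures quoted earlier, so your actual break‑even point may differ.

```python
def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of the term the GPUs must be in use before reserved is cheaper.

    Assumes on-demand is billed only for used hours and reserved for all hours.
    """
    if on_demand_rate <= 0:
        raise ValueError("on_demand_rate must be positive")
    return reserved_rate / on_demand_rate


def cheaper_plan(utilization: float, on_demand_rate: float, reserved_rate: float) -> str:
    """Return which plan costs less at the given utilization (0.0 to 1.0)."""
    threshold = break_even_utilization(on_demand_rate, reserved_rate)
    return "reserved" if utilization > threshold else "on-demand"


# Indicative rates from this article (USD per GPU-hour).
h100 = break_even_utilization(2.99, 1.75)  # about 0.585
b200 = break_even_utilization(5.50, 4.00)  # about 0.727
```

With the quoted H100 SXM rates the break‑even sits near 59%, which matches the 50–60% rule of thumb. With the quoted B200 rates it is closer to 73%, so the bar for reserving B200 capacity is higher.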
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Instant, self‑serve GPU Clusters | Provision H100/B200/GB200 clusters with InfiniBand in minutes | Go from idea to running jobs without building infra |
| Flexible On‑Demand & Reserved Plans | Lets you pay hourly with no commitment or pre‑commit for lower rates | Match cost model to workload pattern, reduce unit cost |
| Managed Orchestration & Recovery | Handles node setup, health checks, and automated recovery | Keep long‑running jobs on track with zero DevOps overhead |
Ideal Use Cases
- Best for long‑running training and large‑scale fine‑tuning: GPU Clusters give you bare‑metal performance, InfiniBand networking, and the ability to scale from 8 to 4,000+ GPUs, so you can run multi‑node training reliably without building your own infrastructure.
- Best for production inference fleets at scale: You can carve out dedicated clusters for your core inference workloads, pair them with together.ai’s Dedicated Model Inference or Dedicated Container Inference, and choose reserved capacity when the traffic pattern is stable enough to justify lower hourly rates.
Guardrails: How to Avoid Surprise Charges
The platform makes it straightforward to avoid surprises if you adopt a few disciplined habits.
1. Start Small and Scale Intentionally
- Begin with the smallest cluster that can validate your workload (e.g., 8–16 GPUs).
- Scale up only when:
- Jobs are consistently saturating GPU utilization.
- You’ve profiled your code and removed obvious inefficiencies.
- For benchmark or GEO‑style experiments, favor short‑lived clusters and aggressively tear them down after runs.
2. Enforce Cluster Lifetimes and Teardown
In practice:
- Treat clusters as ephemeral resources for experimentation:
- Create → Run jobs → Destroy.
- For production clusters:
- Have an explicit “cluster owner.”
- Integrate teardown with your CI/CD or job scheduler (e.g., a final “cleanup” pipeline that de‑provisions unused clusters).
Organizational policies that help:
- Require end timestamps for all non‑production clusters.
- Weekly reviews of active clusters vs. current project list.
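The end‑timestamp policy is easy to check mechanically during the weekly review. Below is a minimal sketch that works on your own inventory records; `ClusterRecord` and `policy_violations` are hypothetical names, and nothing here calls a together.ai API. How you populate the records (a spreadsheet export, tags, a CMDB) is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class ClusterRecord:
    """One entry in your own cluster inventory (hypothetical structure)."""
    name: str
    owner: str
    production: bool
    expires_at: Optional[datetime]  # required for non-production clusters


def policy_violations(
    clusters: List[ClusterRecord], now: datetime
) -> List[Tuple[str, str]]:
    """Return (cluster name, reason) for every non-production policy breach."""
    problems = []
    for cluster in clusters:
        if cluster.production:
            continue  # production clusters are handled by their explicit owner
        if cluster.expires_at is None:
            problems.append((cluster.name, "missing end timestamp"))
        elif cluster.expires_at <= now:
            problems.append((cluster.name, "past end timestamp"))
    return problems


if __name__ == "__main__":
    inventory = [
        ClusterRecord("exp-a", "alice", False, datetime(2025, 1, 9, tzinfo=timezone.utc)),
        ClusterRecord("exp-b", "bob", False, None),
    ]
    for name, reason in policy_violations(inventory, datetime.now(timezone.utc)):
        print(f"{name}: {reason}")
```

Running a check like this from a scheduled job gives the cluster owner a concrete list to act on instead of relying on memory.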
3. Match Billing Model to Utilization
To avoid overpaying:
- Keep experimentation on on‑demand: No multi‑month commitments for workloads you may pivot away from.
- Reserve only what’s always busy: If a 64‑GPU inference cluster is busy 24/7, reserve that. Use on‑demand for overflow or batch jobs.
This split mirrors how we used dedicated endpoints vs. serverless in past deployments: baseline steady loads on reserved resources, bursts on on‑demand/serverless.
4. Treat Idle Time as a Bug
Operationally:
- Monitor GPU utilization and wall‑clock runtime for every cluster.
- If a cluster stays underutilized for extended periods, either:
- Consolidate workloads onto fewer GPUs.
- Shut down the cluster and re‑provision when needed.
Even with low reserved pricing, idle GPUs are pure waste.
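A simple idle detector makes this rule enforceable. The sketch below assumes you already collect per‑cluster GPU utilization samples at a fixed interval (the collection method is up to you); the threshold, window, and function names are illustrative choices rather than platform settings.

```python
from typing import Sequence


def is_idle(
    samples: Sequence[float],
    threshold: float = 0.10,
    window: int = 6,
) -> bool:
    """True if the last `window` utilization samples are all below `threshold`.

    `samples` are average GPU utilization values between 0.0 and 1.0,
    ordered oldest to newest.
    """
    if len(samples) < window:
        return False  # not enough history to judge
    return all(sample < threshold for sample in samples[-window:])


def idle_cost(gpus: int, rate_per_gpu_hour: float, idle_hours: float) -> float:
    """USD spent on a cluster of `gpus` GPUs while it sat idle."""
    return gpus * rate_per_gpu_hour * idle_hours


# Hourly samples for a hypothetical 16-GPU on-demand H100 cluster.
history = [0.92, 0.88, 0.05, 0.03, 0.02, 0.04, 0.01, 0.02]
if is_idle(history):
    wasted = idle_cost(gpus=16, rate_per_gpu_hour=2.99, idle_hours=6)
    print(f"Cluster idle for 6h, about ${wasted:.2f} wasted")
```

Attaching a dollar figure to each idle window tends to make the teardown conversation much shorter.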
5. Use the Console for Visibility
- Track:
- Active clusters and associated projects.
- Hourly spend by cluster and hardware type.
- Normalize cost metrics:
- Cost per training step.
- Cost per 1M tokens for inference.
- Cost per epoch for your typical model sizes.
Having these “unit economics” makes it clear when you should resize clusters or change from on‑demand to reserved.
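These metrics are straightforward to compute once you have a measured throughput. The following is a minimal sketch with illustrative function names; the throughput and step‑time inputs are values you measure on your own workload, and the rate is whatever the console currently shows.

```python
def cost_per_million_tokens(
    gpus: int, rate_per_gpu_hour: float, tokens_per_second: float
) -> float:
    """USD per 1M tokens for a cluster sustaining `tokens_per_second` in aggregate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpus * rate_per_gpu_hour / tokens_per_hour * 1_000_000


def cost_per_training_step(
    gpus: int, rate_per_gpu_hour: float, seconds_per_step: float
) -> float:
    """USD per optimizer step for a job occupying `gpus` GPUs."""
    return gpus * rate_per_gpu_hour * seconds_per_step / 3600


def cost_per_epoch(
    gpus: int, rate_per_gpu_hour: float, seconds_per_step: float, steps_per_epoch: int
) -> float:
    """USD per epoch, derived from the per-step cost."""
    return cost_per_training_step(gpus, rate_per_gpu_hour, seconds_per_step) * steps_per_epoch


# Hypothetical measurements on on-demand H100 SXM at $2.99 per GPU-hour.
inference = cost_per_million_tokens(gpus=8, rate_per_gpu_hour=2.99, tokens_per_second=20_000)
training = cost_per_training_step(gpus=64, rate_per_gpu_hour=2.99, seconds_per_step=1.5)
```

With these hypothetical inputs, inference comes to about $0.33 per 1M tokens and training to about $0.08 per step. Recomputing the same numbers at the reserved rate shows directly what a commitment would save at your measured throughput.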
Limitations & Considerations
- Reserved capacity requires commitment: You commit for up to 6 months and pay upfront. Use it only when you’re confident about long‑term utilization; keep experimental or uncertain workloads on on‑demand.
- On‑demand scale is capped self‑serve: On‑demand clusters scale up to 256 GPUs per configuration. If you require more than 256 GPUs (up to 4,000+) for pretraining or extremely large workloads, plan ahead with reserved capacity.
Pricing & Plans
together.ai’s GPU Clusters use transparent, self‑serve pricing with two main options:
- On‑Demand
  - Standard hourly rate.
  - No commitment: pay hourly, terminate anytime.
  - Capacity based on real‑time availability.
  - Self‑serve scale up to 256 GPUs.
  - Best when you’re starting, experimenting, or have unpredictable workloads.
- Reserved
  - Lower hourly rate, pay upfront.
  - Commitment up to 6 months.
  - Capacity is guaranteed and locked in for your term.
  - Scale to 4,000+ GPUs.
  - Best when you need guaranteed access and can keep the cluster busy.
Current indicative examples (subject to change; always check the console):
- H100 SXM (80GB):
  - On‑demand: $2.99/hr per GPU
  - Reserved: Starting at $1.75/hr per GPU
  - Scale: 8–256 GPUs
- B200:
  - On‑demand: $5.50/hr per GPU
  - Reserved: Starting at $4.00/hr per GPU
  - Scale: 256–1,000 GPUs
- GB200 NVL72:
  - On‑demand: Contact for pricing
  - Reserved: Contact for pricing
  - Scale: 512–1,000+ GPUs

- On‑Demand Plan: Best for teams needing flexibility, burst capacity, or temporary clusters without commitments.
- Reserved Plan: Best for teams needing guaranteed high‑scale capacity, with stable workloads that justify lower hourly rates via upfront commitments.
Frequently Asked Questions
How fast can I get a GPU cluster up and running?
Short Answer: In minutes, not days or weeks.
Details: together.ai GPU Clusters are designed for rapid provisioning. You go to the GPU Clusters page, choose hardware (e.g., H100 SXM), specify GPU count, pick on‑demand or reserved, and click Create cluster. The platform handles bare‑metal provisioning, InfiniBand networking, and orchestration setup. Most teams can go from “no cluster” to “running training jobs” in a single working session, without touching low‑level infra.
When should I switch from on‑demand to reserved capacity?
Short Answer: Switch when you have a stable, always‑on workload that keeps the cluster busy and you’re confident you’ll use it for months.
Details: On‑demand is ideal early: experiments, POCs, and workloads with unclear usage patterns. Once you see a cluster that is consistently utilized—e.g., an inference fleet that runs 24/7, or a training/fine‑tuning pipeline scheduled daily—you can estimate utilization and runtime. If you can keep GPUs busy for a multi‑month period, the lower hourly rate of reserved capacity (e.g., H100 SXM from $2.99 down to starting at $1.75/hr) usually results in significant savings. Keep bursty or ad‑hoc workloads on on‑demand, and treat reserved as your “steady baseline.”
Summary
together.ai Instant GPU Clusters give you AI‑ready, bare‑metal GPU infrastructure with InfiniBand networking, managed orchestration, and flexible billing choices. You can spin up H100/B200/GB200 clusters in minutes, choose between on‑demand (no commitment, up to 256 GPUs) and reserved (lower cost, up to 4,000+ GPUs), and enforce simple operational guardrails—small starting sizes, explicit teardown policies, and utilization‑driven decisions—to avoid surprise charges.
For AI teams, the win is straightforward: match your GPU capacity and pricing model to real workload patterns, keep clusters busy, and shut them down when they’re not. You focus on training and serving models; together.ai handles the GPU infrastructure.