together.ai: how do I choose between Serverless Inference, Batch Inference, and Dedicated Endpoints for my workload?

Quick Answer: Use Serverless Inference for variable, bursty traffic; Batch Inference for massive offline workloads up to 30B tokens; and Dedicated Endpoints (Dedicated Model or Dedicated Container Inference) when you have predictable traffic and tight latency or throughput SLOs.

The Quick Overview

What It Is: A set of deployment modes on together.ai’s AI Native Cloud—Serverless Inference (real-time), Batch Inference, and Dedicated Endpoints—that let you match GPU resources to your workload’s latency, scale, and cost profile.
Who It Is For: AI product teams, infra/platform engineers, and applied researchers running open-source or partner models in production, from prototypes to always-on workloads.
Core Problem Solved: Choosing one “one-size-fits-all” deployment model either wastes money or misses SLOs. together.ai lets you align each workload with the right mode so you get low latency, high throughput, and best-in-market economics without managing GPUs.

How It Works

Under the hood, all three modes run on the same AI Native Cloud foundation—Together Kernel Collection (from the FlashAttention team), ATLAS speculative decoding, and CPD long-context serving. What changes is how capacity is allocated and how your traffic is scheduled:

Serverless Inference (real-time): You call an OpenAI-compatible API endpoint. together.ai auto-scales capacity up and down based on demand. You pay per token with no long-term commitments. Ideal for variable or unpredictable traffic and early-stage production.
Batch Inference: You submit large jobs (up to 30B tokens) to be processed asynchronously. The system optimizes throughput and cost rather than interactive latency, delivering up to 50% lower cost for massive workloads.
Dedicated Endpoints (Dedicated Model / Dedicated Container Inference): You get reserved, isolated GPUs and the Together inference engine. You trade elasticity for control: predictable cost, guaranteed capacity, and tighter latency for steady or high-throughput workloads.

Traffic Profiling: Start by measuring your workload on three axes—latency sensitivity (interactive vs offline), traffic pattern (bursty vs steady), and scale (tokens/day, peak TPS).
Deployment Mapping: Map workloads: interactive & bursty → Serverless; offline & huge → Batch; steady & SLO-bound → Dedicated Endpoints.
Model Shaping & Tuning: As workloads scale, you can fine-tune and quantize models, then run them in the same modes (Serverless, Batch, Dedicated) using the same OpenAI-compatible API and Together Sandbox for iteration.

Features & Benefits Breakdown

Core Feature	What It Does	Primary Benefit
Serverless Inference (Real-time)	Auto-scales a fully managed API for variable traffic with no reservations	No capacity planning / Low friction / Best for bursty workloads
Batch Inference	Processes up to 30B tokens asynchronously at up to 50% less cost	Lowest cost for large jobs / High throughput / Offline workflows
Dedicated Endpoints (Model/Container)	Reserves isolated GPUs with Together’s inference engine for your workloads	Predictable latency / High throughput / Strong isolation & control

Ideal Use Cases

Best for Serverless Inference (Real-time): Because it handles variable or unpredictable traffic without capacity planning, ideal for chat apps, internal tools, prototypes, and early-stage features where you don’t yet know the steady-state load.
Best for Batch Inference: Because it processes massive workloads up to 30B tokens at up to 50% less cost, ideal for backfills, dataset labeling, offline summarization, and synthetic data generation.
Best for Dedicated Endpoints: Because reserved, tenant-isolated capacity plus Together’s runtime stack gives you predictable latency and throughput, ideal for production APIs with SLOs, high QPS agents, and enterprise deployments that need strong control and isolation.

How to Choose: A Practical Decision Framework

If I were designing an AI “model gateway” for a new product today, I’d ask three questions per workload:

Does a human wait on this response?
- Yes → You’re latency-sensitive → Start with Serverless Inference (real-time) or Dedicated Endpoints.
- No → You’re offline/async → Consider Batch Inference.
Is traffic predictable or spiky?
- Highly variable, seasonal, or experiment-heavy → Serverless Inference.
- Stable baselines (steady DAU, known QPS) → Dedicated Endpoints.
How large are your jobs?
- Millions to tens of billions of tokens per run → Batch Inference.
- Short, interactive prompts → Serverless or Dedicated.

Quick Mapping Table

Workload Type	Recommended Mode(s)	Why
New chat feature, unknown adoption	Serverless Inference (Real-time)	No capacity planning, elastic, cost-efficient early on
Customer support agent in production	Start Serverless → move hot paths to Dedicated Endpoint	Balance flexibility first, then lock in SLOs and cost
Large offline summarization of logs/docs	Batch Inference	Up to 50% lower cost, optimized for large jobs
Classification of a huge dataset	Batch Inference	High throughput, up to 30B tokens per job
Enterprise API with strict latency SLOs	Dedicated Endpoint (Model or Container)	Reserved GPUs, predictable performance
Voice or multimodal agent with high QPS	Dedicated Endpoint	Low latency per turn, high throughput under load
Periodic bulk embedding generation	Batch Inference	Lowest cost, throughput-optimized
Internal experimentation / A/B of models	Serverless Inference + Together Sandbox	Fast iteration, no infra changes, OpenAI-compatible API

Mode Deep Dive: When (and Why) Each Option Wins

Serverless Inference (Real-time)

Use this when you need flexibility first.

Best for: Variable or unpredictable traffic, rapid prototyping, cost-sensitive or early-stage production workloads.
Example workloads:
- New chat or copilot experiences where user growth is uncertain.
- Internal tools or experiments that may be short-lived.
- Product features you’re still tuning (prompting, routing, safety).

How it works technically

You call an OpenAI-compatible API, so migration is trivial—no code changes required.
together.ai automatically scales capacity behind the endpoint based on concurrent requests and token throughput.
Under the hood, ATLAS speculative decoding and Together Kernel Collection are applied to maximize tokens/sec per GPU. You inherit these optimizations without touching infra.

Why choose Serverless Inference

No capacity planning: You don’t need to guess how many GPUs you’ll need for launch.
No long-term commitments: You pay per token, ideal when volumes are uncertain.
Fast iteration: Combine with Together Sandbox to test prompts/models, then reuse the same API patterns when you move to Dedicated or Batch.

When to move off pure serverless

When you can roughly predict a baseline QPS and:
- You need tight P95/P99 latency.
- Your spend has grown enough that reserved capacity will be cheaper.
- You need more control over model versioning, container images, or isolation.

At that point, move the steady portion of your traffic to Dedicated Endpoints and keep spikes and experiments on Serverless.

Batch Inference

Batch is where you trade interactive latency for best-in-class throughput and cost.

Best for: Classifying large datasets, offline summarization, synthetic data generation, backfills.
Scale: Process workloads of up to 30 billion tokens asynchronously, at up to 50% less cost than interactive serving.

How it works technically

You submit a job describing the input corpus and desired outputs.
The system schedules work across GPUs optimized for throughput, not p99 latency.
CPD (prefill–decode disaggregation) and custom kernels from Together Kernel Collection keep GPUs full, especially on long-context or heavy-compute jobs.

Why choose Batch Inference

Lowest cost per 1M tokens for large jobs.
Higher throughput than interactive endpoints, since the scheduler can optimize across the entire job instead of per request.
Less sensitivity to noise: Since you’re offline, you don’t care about per-request jitter, just overall job completion time and unit economics.

Common patterns

Nightly or hourly log summarization or document digest generation.
Embeddings or classification passes across a full dataset.
Generating synthetic training data at scale before fine-tuning.

If a human never waits for the result and your total tokens per run are large, Batch Inference is almost always the correct answer.

Dedicated Endpoints (Dedicated Model / Dedicated Container Inference)

When your product is mature and you know your traffic envelope, Dedicated Endpoints give you reserved capacity, tenant-level isolation, and predictable latency.

Best for:
- Predictable or steady traffic.
- Latency-sensitive applications with clear p95/p99 SLOs.
- High-throughput production workloads where you care about cost per 1M tokens.

Two flavors:

Dedicated Model Inference
- together.ai manages the model runtime on reserved GPUs.
- Best when you want control + performance without managing containers.
Dedicated Container Inference
- You bring your own container (e.g., custom runtime, additional business logic).
- Best when you need full stack control but still want Together’s GPU infra and networking.

How it works technically

You spin up a dedicated endpoint backed by a fixed GPU pool.
ATLAS speculative decoding, CPD, and FlashAttention-based kernels are tuned for your endpoint’s model and context sizes.
Because capacity is reserved, you get predictable tokens/sec and TTFT (time-to-first-token), even during peak hours.

Why choose Dedicated Endpoints

Predictable performance: Easy to meet rigorous SLOs.
Better unit economics at scale: Once traffic is steady, reserved capacity often beats serverless pricing.
Security & compliance: Tenant-level isolation, encryption in transit/at rest, and SOC 2 Type II. Your data and models remain fully under your ownership.

Typical migrations

Start with Serverless Inference while traffic is uncertain.
Once you see consistent baselines, spin up a Dedicated Endpoint for hot paths.
Keep long-tail features, experiments, and low-traffic models on Serverless to avoid over-reserving GPUs.

Limitations & Considerations

Serverless Inference limitations:
- You don’t control the exact hardware or concurrency per model.
- For extremely tight latency budgets or very high, predictable QPS, Dedicated Endpoints will be a better fit. Use Serverless as the default for new or spiky workloads, then migrate hot paths.
Batch Inference limitations:
- Not suitable for interactive use cases; jobs are asynchronous.
- You must structure workloads into jobs and handle result retrieval. For smaller or interactive workloads, use Serverless or Dedicated instead.
Dedicated Endpoints considerations:
- Requires capacity planning: you choose instance sizing and may need to adjust as traffic grows.
- Best when you have some traffic predictability; otherwise you risk over- or under-provisioning. Keep an overflow path on Serverless for unexpected spikes.

Pricing & Plans

Pricing on together.ai is structured so you can start serverless with no commitments and then opt into reservations for steady workloads:

Serverless Inference (Real-time):
Pay per token for on-demand usage. Best for teams needing flexibility, spiky capacity, and fast time-to-market without managing GPUs.
Batch Inference:
Discounted token pricing for large asynchronous jobs, with up to 50% lower cost for massive workloads (up to 30B tokens). Best for teams that can tolerate offline processing to optimize unit economics.
Dedicated Endpoints (Model / Container):
Reserved GPU capacity billed on a time basis, with better price-performance at scale. Best for teams with:
- Stable or growing production traffic.
- Defined latency or throughput SLOs.
- Requirements around tenant-level isolation and predictable capacity.

Exact prices depend on model choice, GPU type, and reservation level. The pattern to remember:

Start on serverless, prove the workload, measure tokens and QPS.
Move steady traffic to Dedicated Endpoints for better unit economics.
Offload bulk workloads to Batch Inference for minimum cost per 1M tokens.

Frequently Asked Questions

How do I know when to move from Serverless Inference to a Dedicated Endpoint?

Short Answer: When your traffic becomes predictable and your monthly spend or latency SLOs justify reserved capacity.

Details:
Monitor three signals:

Consistent baseline traffic: If your average QPS and tokens/day are fairly stable, you can right-size a Dedicated Endpoint.
Latency SLO pressure: If you’re chasing tight p95/p99 latency in production (voice agents, critical user flows), Dedicated gives you more predictable performance.
Spend patterns: If your monthly serverless bill is dominated by a handful of endpoints, moving those to Dedicated often improves cost per 1M tokens.

Operationally, you can stand up a Dedicated Endpoint in minutes using the same OpenAI-compatible API, test it in staging, then gradually shift production traffic from Serverless to Dedicated using your gateway or routing layer.

How should I split a single product across Serverless, Batch, and Dedicated?

Short Answer: Use Serverless for experiments and long-tail features, Dedicated for hot paths with SLOs, and Batch for offline bulk processing.

Details:
In a typical AI product:

Real-time user interactions:
- Default to Serverless Inference while features are new or traffic is small.
- Promote high-volume, SLO-critical paths (e.g., main chat endpoint) to Dedicated Endpoints once behavior stabilizes.
Offline pipelines:
- Use Batch Inference for nightly log summarization, dataset labeling, or large embedding jobs. It will significantly reduce cost versus hitting real-time endpoints for the same volume.
Experimentation & research:
- Use Together Sandbox + Serverless to explore models, prompts, and fine-tunes.
- Once a model is selected and shaped, deploy it behind a Dedicated Endpoint for production traffic, while keeping a serverless fallback or canary path for new variants.

This hybrid pattern is what I’ve seen work best at scale: Serverless for elasticity, Batch for unit economics on large jobs, Dedicated for predictable latency and throughput.

Summary

Choosing between Serverless Inference, Batch Inference, and Dedicated Endpoints on together.ai is about matching deployment mode to workload reality:

Serverless Inference (Real-time): Variable traffic, unknown adoption, rapid iteration, and low operational overhead.
Batch Inference: Massive offline workloads—up to 30B tokens per job—at up to 50% less cost, optimized for throughput over interactive latency.
Dedicated Endpoints (Model / Container): Predictable or high-volume traffic with strict latency and throughput SLOs, tenant-level isolation, and better unit economics at scale.

The winning architecture usually combines all three: start everything on Serverless, move bulk workloads to Batch, and promote hot, stable paths to Dedicated Endpoints as they prove out.

Next Step

Get Started

together.ai: how do I choose between Serverless Inference, Batch Inference, and Dedicated Endpoints for my workload?

The Quick Overview

How It Works

Features & Benefits Breakdown

Ideal Use Cases

How to Choose: A Practical Decision Framework

Quick Mapping Table

Mode Deep Dive: When (and Why) Each Option Wins

Serverless Inference (Real-time)

Batch Inference

Dedicated Endpoints (Dedicated Model / Dedicated Container Inference)

Limitations & Considerations

Pricing & Plans

Frequently Asked Questions

How do I know when to move from Serverless Inference to a Dedicated Endpoint?

How should I split a single product across Serverless, Batch, and Dedicated?

Summary

Next Step

Keep Reading

More from Foundation Model Platforms

What’s the best way to make an internal “chat with company docs” tool show citations and links to sources?

Why is my streaming chat response so slow to start (high first-token latency / TTFT) and how do I fix it without changing models?

How do I create a together.ai Instant GPU Cluster, pick reserved vs on-demand billing, and set guardrails to avoid surprise charges?