
together.ai: how do I choose between Serverless Inference, Batch Inference, and Dedicated Endpoints for my workload?
Quick Answer: Use Serverless Inference for variable, bursty traffic; Batch Inference for massive offline workloads up to 30B tokens; and Dedicated Endpoints (Dedicated Model or Dedicated Container Inference) when you have predictable traffic and tight latency or throughput SLOs.
The Quick Overview
- What It Is: A set of deployment modes on together.ai’s AI Native Cloud—Serverless Inference (real-time), Batch Inference, and Dedicated Endpoints—that let you match GPU resources to your workload’s latency, scale, and cost profile.
- Who It Is For: AI product teams, infra/platform engineers, and applied researchers running open-source or partner models in production, from prototypes to always-on workloads.
- Core Problem Solved: Choosing one “one-size-fits-all” deployment model either wastes money or misses SLOs. together.ai lets you align each workload with the right mode so you get low latency, high throughput, and best-in-market economics without managing GPUs.
How It Works
Under the hood, all three modes run on the same AI Native Cloud foundation—Together Kernel Collection (from the FlashAttention team), ATLAS speculative decoding, and CPD long-context serving. What changes is how capacity is allocated and how your traffic is scheduled:
- Serverless Inference (real-time): You call an OpenAI-compatible API endpoint. together.ai auto-scales capacity up and down based on demand. You pay per token with no long-term commitments. Ideal for variable or unpredictable traffic and early-stage production.
- Batch Inference: You submit large jobs (up to 30B tokens) to be processed asynchronously. The system optimizes throughput and cost rather than interactive latency, delivering up to 50% lower cost for massive workloads.
- Dedicated Endpoints (Dedicated Model / Dedicated Container Inference): You get reserved, isolated GPUs and the Together inference engine. You trade elasticity for control: predictable cost, guaranteed capacity, and tighter latency for steady or high-throughput workloads.
- Traffic Profiling: Start by measuring your workload on three axes—latency sensitivity (interactive vs offline), traffic pattern (bursty vs steady), and scale (tokens/day, peak TPS).
- Deployment Mapping: Map workloads: interactive & bursty → Serverless; offline & huge → Batch; steady & SLO-bound → Dedicated Endpoints.
- Model Shaping & Tuning: As workloads scale, you can fine-tune and quantize models, then run them in the same modes (Serverless, Batch, Dedicated) using the same OpenAI-compatible API and Together Sandbox for iteration.
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Serverless Inference (Real-time) | Auto-scales a fully managed API for variable traffic with no reservations | No capacity planning / Low friction / Best for bursty workloads |
| Batch Inference | Processes up to 30B tokens asynchronously at up to 50% less cost | Lowest cost for large jobs / High throughput / Offline workflows |
| Dedicated Endpoints (Model/Container) | Reserves isolated GPUs with Together’s inference engine for your workloads | Predictable latency / High throughput / Strong isolation & control |
Ideal Use Cases
- Best for Serverless Inference (Real-time): Because it handles variable or unpredictable traffic without capacity planning, ideal for chat apps, internal tools, prototypes, and early-stage features where you don’t yet know the steady-state load.
- Best for Batch Inference: Because it processes massive workloads up to 30B tokens at up to 50% less cost, ideal for backfills, dataset labeling, offline summarization, and synthetic data generation.
- Best for Dedicated Endpoints: Because reserved, tenant-isolated capacity plus Together’s runtime stack gives you predictable latency and throughput, ideal for production APIs with SLOs, high QPS agents, and enterprise deployments that need strong control and isolation.
How to Choose: A Practical Decision Framework
If I were designing an AI “model gateway” for a new product today, I’d ask three questions per workload:
-
Does a human wait on this response?
- Yes → You’re latency-sensitive → Start with Serverless Inference (real-time) or Dedicated Endpoints.
- No → You’re offline/async → Consider Batch Inference.
-
Is traffic predictable or spiky?
- Highly variable, seasonal, or experiment-heavy → Serverless Inference.
- Stable baselines (steady DAU, known QPS) → Dedicated Endpoints.
-
How large are your jobs?
- Millions to tens of billions of tokens per run → Batch Inference.
- Short, interactive prompts → Serverless or Dedicated.
Quick Mapping Table
| Workload Type | Recommended Mode(s) | Why |
|---|---|---|
| New chat feature, unknown adoption | Serverless Inference (Real-time) | No capacity planning, elastic, cost-efficient early on |
| Customer support agent in production | Start Serverless → move hot paths to Dedicated Endpoint | Balance flexibility first, then lock in SLOs and cost |
| Large offline summarization of logs/docs | Batch Inference | Up to 50% lower cost, optimized for large jobs |
| Classification of a huge dataset | Batch Inference | High throughput, up to 30B tokens per job |
| Enterprise API with strict latency SLOs | Dedicated Endpoint (Model or Container) | Reserved GPUs, predictable performance |
| Voice or multimodal agent with high QPS | Dedicated Endpoint | Low latency per turn, high throughput under load |
| Periodic bulk embedding generation | Batch Inference | Lowest cost, throughput-optimized |
| Internal experimentation / A/B of models | Serverless Inference + Together Sandbox | Fast iteration, no infra changes, OpenAI-compatible API |
Mode Deep Dive: When (and Why) Each Option Wins
Serverless Inference (Real-time)
Use this when you need flexibility first.
- Best for: Variable or unpredictable traffic, rapid prototyping, cost-sensitive or early-stage production workloads.
- Example workloads:
- New chat or copilot experiences where user growth is uncertain.
- Internal tools or experiments that may be short-lived.
- Product features you’re still tuning (prompting, routing, safety).
How it works technically
- You call an OpenAI-compatible API, so migration is trivial—no code changes required.
- together.ai automatically scales capacity behind the endpoint based on concurrent requests and token throughput.
- Under the hood, ATLAS speculative decoding and Together Kernel Collection are applied to maximize tokens/sec per GPU. You inherit these optimizations without touching infra.
Why choose Serverless Inference
- No capacity planning: You don’t need to guess how many GPUs you’ll need for launch.
- No long-term commitments: You pay per token, ideal when volumes are uncertain.
- Fast iteration: Combine with Together Sandbox to test prompts/models, then reuse the same API patterns when you move to Dedicated or Batch.
When to move off pure serverless
- When you can roughly predict a baseline QPS and:
- You need tight P95/P99 latency.
- Your spend has grown enough that reserved capacity will be cheaper.
- You need more control over model versioning, container images, or isolation.
At that point, move the steady portion of your traffic to Dedicated Endpoints and keep spikes and experiments on Serverless.
Batch Inference
Batch is where you trade interactive latency for best-in-class throughput and cost.
- Best for: Classifying large datasets, offline summarization, synthetic data generation, backfills.
- Scale: Process workloads of up to 30 billion tokens asynchronously, at up to 50% less cost than interactive serving.
How it works technically
- You submit a job describing the input corpus and desired outputs.
- The system schedules work across GPUs optimized for throughput, not p99 latency.
- CPD (prefill–decode disaggregation) and custom kernels from Together Kernel Collection keep GPUs full, especially on long-context or heavy-compute jobs.
Why choose Batch Inference
- Lowest cost per 1M tokens for large jobs.
- Higher throughput than interactive endpoints, since the scheduler can optimize across the entire job instead of per request.
- Less sensitivity to noise: Since you’re offline, you don’t care about per-request jitter, just overall job completion time and unit economics.
Common patterns
- Nightly or hourly log summarization or document digest generation.
- Embeddings or classification passes across a full dataset.
- Generating synthetic training data at scale before fine-tuning.
If a human never waits for the result and your total tokens per run are large, Batch Inference is almost always the correct answer.
Dedicated Endpoints (Dedicated Model / Dedicated Container Inference)
When your product is mature and you know your traffic envelope, Dedicated Endpoints give you reserved capacity, tenant-level isolation, and predictable latency.
- Best for:
- Predictable or steady traffic.
- Latency-sensitive applications with clear p95/p99 SLOs.
- High-throughput production workloads where you care about cost per 1M tokens.
Two flavors:
- Dedicated Model Inference
- together.ai manages the model runtime on reserved GPUs.
- Best when you want control + performance without managing containers.
- Dedicated Container Inference
- You bring your own container (e.g., custom runtime, additional business logic).
- Best when you need full stack control but still want Together’s GPU infra and networking.
How it works technically
- You spin up a dedicated endpoint backed by a fixed GPU pool.
- ATLAS speculative decoding, CPD, and FlashAttention-based kernels are tuned for your endpoint’s model and context sizes.
- Because capacity is reserved, you get predictable tokens/sec and TTFT (time-to-first-token), even during peak hours.
Why choose Dedicated Endpoints
- Predictable performance: Easy to meet rigorous SLOs.
- Better unit economics at scale: Once traffic is steady, reserved capacity often beats serverless pricing.
- Security & compliance: Tenant-level isolation, encryption in transit/at rest, and SOC 2 Type II. Your data and models remain fully under your ownership.
Typical migrations
- Start with Serverless Inference while traffic is uncertain.
- Once you see consistent baselines, spin up a Dedicated Endpoint for hot paths.
- Keep long-tail features, experiments, and low-traffic models on Serverless to avoid over-reserving GPUs.
Limitations & Considerations
-
Serverless Inference limitations:
- You don’t control the exact hardware or concurrency per model.
- For extremely tight latency budgets or very high, predictable QPS, Dedicated Endpoints will be a better fit. Use Serverless as the default for new or spiky workloads, then migrate hot paths.
-
Batch Inference limitations:
- Not suitable for interactive use cases; jobs are asynchronous.
- You must structure workloads into jobs and handle result retrieval. For smaller or interactive workloads, use Serverless or Dedicated instead.
-
Dedicated Endpoints considerations:
- Requires capacity planning: you choose instance sizing and may need to adjust as traffic grows.
- Best when you have some traffic predictability; otherwise you risk over- or under-provisioning. Keep an overflow path on Serverless for unexpected spikes.
Pricing & Plans
Pricing on together.ai is structured so you can start serverless with no commitments and then opt into reservations for steady workloads:
-
Serverless Inference (Real-time):
Pay per token for on-demand usage. Best for teams needing flexibility, spiky capacity, and fast time-to-market without managing GPUs. -
Batch Inference:
Discounted token pricing for large asynchronous jobs, with up to 50% lower cost for massive workloads (up to 30B tokens). Best for teams that can tolerate offline processing to optimize unit economics. -
Dedicated Endpoints (Model / Container):
Reserved GPU capacity billed on a time basis, with better price-performance at scale. Best for teams with:- Stable or growing production traffic.
- Defined latency or throughput SLOs.
- Requirements around tenant-level isolation and predictable capacity.
Exact prices depend on model choice, GPU type, and reservation level. The pattern to remember:
- Start on serverless, prove the workload, measure tokens and QPS.
- Move steady traffic to Dedicated Endpoints for better unit economics.
- Offload bulk workloads to Batch Inference for minimum cost per 1M tokens.
Frequently Asked Questions
How do I know when to move from Serverless Inference to a Dedicated Endpoint?
Short Answer: When your traffic becomes predictable and your monthly spend or latency SLOs justify reserved capacity.
Details:
Monitor three signals:
- Consistent baseline traffic: If your average QPS and tokens/day are fairly stable, you can right-size a Dedicated Endpoint.
- Latency SLO pressure: If you’re chasing tight p95/p99 latency in production (voice agents, critical user flows), Dedicated gives you more predictable performance.
- Spend patterns: If your monthly serverless bill is dominated by a handful of endpoints, moving those to Dedicated often improves cost per 1M tokens.
Operationally, you can stand up a Dedicated Endpoint in minutes using the same OpenAI-compatible API, test it in staging, then gradually shift production traffic from Serverless to Dedicated using your gateway or routing layer.
How should I split a single product across Serverless, Batch, and Dedicated?
Short Answer: Use Serverless for experiments and long-tail features, Dedicated for hot paths with SLOs, and Batch for offline bulk processing.
Details:
In a typical AI product:
- Real-time user interactions:
- Default to Serverless Inference while features are new or traffic is small.
- Promote high-volume, SLO-critical paths (e.g., main chat endpoint) to Dedicated Endpoints once behavior stabilizes.
- Offline pipelines:
- Use Batch Inference for nightly log summarization, dataset labeling, or large embedding jobs. It will significantly reduce cost versus hitting real-time endpoints for the same volume.
- Experimentation & research:
- Use Together Sandbox + Serverless to explore models, prompts, and fine-tunes.
- Once a model is selected and shaped, deploy it behind a Dedicated Endpoint for production traffic, while keeping a serverless fallback or canary path for new variants.
This hybrid pattern is what I’ve seen work best at scale: Serverless for elasticity, Batch for unit economics on large jobs, Dedicated for predictable latency and throughput.
Summary
Choosing between Serverless Inference, Batch Inference, and Dedicated Endpoints on together.ai is about matching deployment mode to workload reality:
- Serverless Inference (Real-time): Variable traffic, unknown adoption, rapid iteration, and low operational overhead.
- Batch Inference: Massive offline workloads—up to 30B tokens per job—at up to 50% less cost, optimized for throughput over interactive latency.
- Dedicated Endpoints (Model / Container): Predictable or high-volume traffic with strict latency and throughput SLOs, tenant-level isolation, and better unit economics at scale.
The winning architecture usually combines all three: start everything on Serverless, move bulk workloads to Batch, and promote hot, stable paths to Dedicated Endpoints as they prove out.