together.ai: how do I choose between Serverless Inference, Batch Inference, and Dedicated Endpoints for my workload?



Choosing between Serverless Inference, Batch Inference, and Dedicated Endpoints is ultimately a unit-economics and SLO decision: you’re trading off latency, throughput, and control based on how your traffic actually behaves. The good news is that on together.ai’s AI Native Cloud, you can mix all three behind the same OpenAI-compatible API, so you don’t have to pick just one.

Quick Answer: Use Serverless Inference for variable or early workloads, Batch Inference for massive offline jobs up to 30B tokens at up to 50% less cost, and Dedicated Endpoints (Dedicated Model Inference or Dedicated Container Inference) for predictable, latency-sensitive production traffic where you care about every millisecond and every dollar per 1M tokens.


The Quick Overview

  • What It Is: A decision framework for mapping your workload to together.ai’s deployment modes: Serverless Inference (real-time), Batch Inference, and Dedicated Endpoints (Dedicated Model Inference / Dedicated Container Inference / GPU Clusters).
  • Who It Is For: Engineering and ML teams building AI products that need to go from prototype to high-throughput production while controlling latency, reliability, and cost.
  • Core Problem Solved: You avoid overpaying for idle GPUs or under-provisioning for spikes by matching each workload to the right inference mode—without rewriting your application.

How It Works

At together.ai, all of these options live on the same AI Native Cloud and share a common OpenAI-compatible interface. The difference is how compute is allocated and managed under the hood:

  1. Serverless Inference (Real-time):

    • Fully managed, auto-scaling API.
    • Best for variable or unpredictable traffic, rapid prototyping, and early-stage production.
    • You pay per token, with no reservations or infrastructure to manage.
  2. Batch Inference:

    • Asynchronous processing of massive jobs (up to 30 billion tokens per job) at up to 50% less cost.
    • Ideal for large datasets, offline summarization, and synthetic data generation.
    • You trade interactivity for throughput and cost efficiency.
  3. Dedicated Endpoints (Dedicated Model / Container / GPU Clusters):

    • Reserved, isolated compute with the Together inference engine and kernel stack.
    • Best for predictable, steady traffic and latency-sensitive applications.
    • You get consistent P95 latency, higher tokens/sec, and predictable costs.

The mechanism behind the performance is the same research-to-production stack: Together Kernel Collection (from the FlashAttention team), ATLAS for speculative decoding, and CPD for long-context serving. You choose the mode; the platform applies these optimizations to meet your SLOs.
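Because every mode shares the same OpenAI-compatible interface, the request body you send is identical across Serverless and Dedicated; only the model or endpoint identifier changes. A minimal sketch of that shared request shape (the model name below is a placeholder for illustration, not a guaranteed catalog entry):

```python
import json

def build_chat_request(model: str, user_message: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload.

    The same body works whether `model` points at a serverless model
    or a dedicated endpoint; only the identifier changes.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

# Hypothetical model identifier, for illustration only.
payload = build_chat_request("example-org/example-chat-model", "Summarize this ticket.")
print(json.dumps(payload, indent=2))
```

Migrating a workload between modes then becomes a configuration change, not a client rewrite.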


A Practical Decision Flow

If you want a fast rule of thumb:

  • Is your workload interactive (users waiting on responses)?

    • Yes + traffic variable/unpredictable → Start with Serverless Inference (Real-time).
    • Yes + traffic steady and large → Move to Dedicated Model Inference or Dedicated Container Inference.
  • Is your workload offline / asynchronous?

    • Needs to chew through big datasets, not user-facing → Use Batch Inference.
  • Do you need custom runtimes, multi-model orchestration, or deep infra control?

    • Yes → Consider Dedicated Container Inference or GPU Clusters.

You can—and usually should—combine them: e.g., real-time agents on Serverless, nightly summarization on Batch, and high-volume chat workloads on Dedicated.
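The rule of thumb above can be sketched as a small routing function. This is a simplification of the decision flow, not an official API; the returned labels are illustrative:

```python
def choose_mode(interactive: bool, traffic_steady: bool,
                needs_custom_runtime: bool = False) -> str:
    """Map workload traits to an inference mode, per the rule of thumb above."""
    if needs_custom_runtime:
        # Custom runtimes or deep infra control point to containers/clusters.
        return "dedicated-container-or-gpu-cluster"
    if not interactive:
        # Offline, dataset-scale work belongs on Batch.
        return "batch"
    # Interactive: steady traffic justifies a reservation, spiky does not.
    return "dedicated" if traffic_steady else "serverless"

print(choose_mode(interactive=True, traffic_steady=False))  # → serverless
print(choose_mode(interactive=False, traffic_steady=True))  # → batch
```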


Mode-by-Mode Breakdown

Serverless Inference (Real-time)

Best when you don’t want to think about GPUs at all yet.

Use this when:

  • Traffic is spiky or unpredictable (campaigns, early product-market fit).
  • You’re actively iterating on prompts, models, or product flows.
  • You want no commitments, no capacity planning, and only pay for what you use.

How it behaves:

  • Together’s infrastructure automatically adds capacity behind your endpoint as traffic increases.
  • You get production-grade latency with no reservations. For most apps, this is enough until you cross a fairly high, steady QPS threshold.


Batch Inference

Batch is a different beast: it’s optimized for throughput and cost, not interactivity.

Use this when:

  • You need to process large datasets: logs, documents, customer interactions.
  • Workloads are offline:
    • Classifying or tagging millions of records
    • Offline summarization (e.g., nightly report generation)
    • Synthetic data generation or data augmentation

Key properties:

  • Handles jobs up to 30 billion tokens.
  • Costs up to 50% less than equivalent real-time processing.
  • You submit a job, the system parallelizes it, and you read results once complete.
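The submit-and-retrieve workflow typically starts with a JSONL input file, one request per line. The sketch below assumes an OpenAI-style batch file schema (`custom_id` plus a request body); check Together's Batch API documentation for the exact field names:

```python
import json

def build_batch_lines(model: str, prompts: list[str]) -> str:
    """Serialize prompts into a JSONL batch input, one request per line.

    Schema is an assumption modeled on OpenAI-style batch files;
    verify field names against Together's Batch API docs.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",  # lets you match results back to inputs
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)

jsonl = build_batch_lines("example-org/example-chat-model",
                          ["Summarize doc A.", "Summarize doc B."])
print(jsonl)
```

You would upload this file, submit the job, and poll for completion; the `custom_id` on each line is what lets you rejoin results with your source records.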

Dedicated Endpoints

This covers:

  • Dedicated Model Inference: Together’s inference engine on reserved GPUs, per model.
  • Dedicated Container Inference: Your own container/runtime deployed as a managed endpoint.
  • GPU Clusters: Full-cluster control via Kubernetes or Slurm.

Use these when:

  • Traffic is predictable or steadily high.
  • You need tight latency SLOs (e.g., sub-second responses), even under heavy load.
  • You want stronger isolation, consistent performance, and predictable cost per 1M tokens.

By reserving capacity, you leverage Together’s low-level optimizations (TKC, ATLAS, CPD, custom CUDA) with direct control over how the capacity is used.


Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| Serverless Inference (Real-time) | Auto-scales a fully managed, OpenAI-compatible API for text, code, and multimodal workloads | No infrastructure to manage / Ideal for variable traffic / Fast path from prototype to production |
| Batch Inference | Processes up to 30B tokens asynchronously at up to 50% lower cost | High-throughput processing / Lower unit cost / Perfect for offline & dataset-scale workloads |
| Dedicated Model & Container Inference | Reserves isolated GPU resources with Together's inference engine or your own container | Consistent low latency / High tokens/sec / Predictable economics for steady traffic |

Ideal Use Cases

  • Best for New Products and PoCs (Serverless Inference):
    Because it lets you integrate via an OpenAI-compatible API, test model variants, and absorb traffic spikes without thinking about GPU reservations or capacity planning.

  • Best for Data Pipelines and Offline Jobs (Batch Inference):
    Because it can process up to 30 billion tokens per job at up to 50% less cost, making it ideal for classification, summarization, and synthetic data generation pipelines.

  • Best for High-Volume, Latency-Critical Apps (Dedicated Endpoints):
    Because reserved, isolated compute plus together.ai’s kernel stack (TKC, ATLAS, CPD) delivers consistent latency and better tokens/sec for workloads like voice agents, complex chat, and high-QPS API products.


Limitations & Considerations

  • Serverless Inference Limits:

    • At very high, steady QPS, per-token costs can be higher than with Dedicated Endpoints.
    • You don’t control underlying GPU topology or scheduling.
      Workaround: Start on Serverless, then migrate your hot paths or largest tenants to Dedicated Endpoints once usage patterns stabilize.
  • Batch Inference Tradeoffs:

    • Not suitable for interactive workloads—users can’t wait for batch jobs.
    • You need to design your pipeline for asynchronous processing and result retrieval.
      Workaround: Use Batch for the heavy lifting (e.g., embeddings or summarization) and keep only the final user interactions on Serverless or Dedicated.
  • Dedicated Endpoint Planning:

    • Requires some capacity planning to avoid over/under-provisioning.
    • You take on more control; in return, you get better per-unit economics and consistent latency.
      Workaround: Start with conservative reservations and monitor utilization; adjust using Together’s observability and scaling options.
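That capacity-planning step can start as back-of-envelope arithmetic: peak token throughput divided by sustainable per-replica throughput. All numbers below are illustrative assumptions; measure your own model/GPU combination before reserving:

```python
import math

def replicas_needed(peak_qps: float, avg_tokens_per_req: float,
                    tokens_per_sec_per_replica: float, headroom: float = 0.7) -> int:
    """Rough replica count for a dedicated endpoint.

    Targets running each replica at `headroom` utilization (e.g., 70%)
    so bursts don't blow the latency SLO. Throughput figures here are
    placeholders, not measured Together numbers.
    """
    required = peak_qps * avg_tokens_per_req          # tokens/sec at peak
    usable = tokens_per_sec_per_replica * headroom    # sustainable per replica
    return math.ceil(required / usable)

# e.g., 20 QPS * 400 output tokens = 8,000 tok/s; at a hypothetical
# 3,000 tok/s per replica with 70% headroom (2,100 usable),
# that's ceil(8000 / 2100) = 4 replicas.
print(replicas_needed(20, 400, 3000))  # → 4
```

Start from an estimate like this, reserve conservatively, then let observed utilization drive adjustments.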

Pricing & Plans

Pricing is structured to align with how “elastic” or “reserved” your workload is:

  • Serverless Inference:

    • Pay-per-token with no commitments.
    • Best when you’re still discovering your usage pattern or expect highly bursty traffic.
    • Ideal for teams who want “no infrastructure to manage, no long-term commitments.”
  • Batch Inference:

    • Optimized pricing for large, asynchronous jobs—up to 50% less cost for long-running, high-volume jobs compared to equivalent real-time workloads.
    • Best when throughput and cost matter more than latency.
  • Dedicated Endpoints (Model / Container / GPU Clusters):

    • Reserved capacity with predictable spend and best price-performance at scale.
    • Best for teams with predictable or steady traffic and defined latency SLOs (e.g., 99.9% uptime, sub-second responses).
    • You can go from shared to dedicated endpoints “in minutes,” and GPU clusters can scale from 8 GPUs to 4,000+ as your workload grows.

Within that structure, you can think in “plan style”:

  • Elastic Mode (Serverless + Batch): Best for teams needing maximum flexibility, experimentation, and cost control on variable or offline workloads.
  • Reserved Mode (Dedicated Endpoints / GPU Clusters): Best for teams needing guaranteed performance and lowest unit cost on steady, production workloads.

For precise pricing details, model-specific rates, and cluster options, contact Together’s team.


Frequently Asked Questions

How do I know when to move from Serverless Inference to a Dedicated Endpoint?

Short Answer: When your traffic is steady, latency-critical, and your monthly spend is dominated by a few hot workloads, it’s time to consider Dedicated.

Details:
Start on Serverless Inference; it’s the lowest-friction way to validate your product. As you grow, monitor:

  • Steady QPS: If you’re consistently busy (e.g., daytime hours) rather than spiky, reserved capacity starts to pay off.
  • Latency SLOs: If you’re chasing sub-second P95 or strict SLOs for voice, trading, or agentic workflows, Dedicated’s isolated compute helps.
  • Unit economics: When a small number of endpoints, tenants, or workflows dominate your token volume, moving those to Dedicated Endpoints can reduce your effective cost per 1M tokens while providing more predictable performance.
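The unit-economics check is simple division: at what monthly volume does a fixed reservation beat pay-per-token? The prices below are hypothetical placeholders, not Together's rates; substitute your actual quotes:

```python
def effective_cost_per_million(monthly_tokens: float,
                               serverless_price_per_million: float,
                               dedicated_monthly_cost: float) -> tuple[float, float]:
    """Compare effective $/1M tokens for pay-per-token vs. a fixed reservation.

    All prices are illustrative placeholders; plug in your real rates.
    """
    serverless = serverless_price_per_million
    dedicated = dedicated_monthly_cost / (monthly_tokens / 1e6)
    return serverless, dedicated

# e.g., 5B tokens/month at a hypothetical $0.90/1M serverless vs. a
# hypothetical $3,000/month reservation: dedicated works out to $0.60/1M.
s, d = effective_cost_per_million(5e9, 0.90, 3000)
print(s, d)  # → 0.9 0.6
```

Once the dedicated figure drops below the serverless one at your sustained volume, the reservation pays for itself.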

You won’t have to change your client code much; the OpenAI-compatible API and consistent interface make migration straightforward.


Can I mix Serverless, Batch, and Dedicated in one application?

Short Answer: Yes. Most mature deployments use all three.

Details:
A common pattern on together.ai:

  • Serverless Inference: For interactive UX, experimentation, and long-tail traffic.
  • Dedicated Model or Container Inference: For the core, high-volume endpoints that must hit tight latency and reliability SLOs.
  • Batch Inference: For background jobs, dataset-level processing, nightly summarization, and synthetic data generation.

Because all of this lives on the same AI Native Cloud with an OpenAI-compatible API, you keep consistent authentication, logging, and monitoring while optimizing each path separately. Your data and models remain fully under your ownership, with tenant-level isolation and encryption in transit and at rest (SOC 2 Type II).
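In practice, mixing modes can reduce to a routing table: each application path keeps the same client and differs only in its model/endpoint identifier and mode. The identifiers below are hypothetical, for illustration only:

```python
# Hypothetical routing table; every path shares one OpenAI-compatible
# client, differing only in mode and model/endpoint id.
ROUTES = {
    "interactive_chat":  {"mode": "serverless", "model": "example-org/example-chat-model"},
    "core_api":          {"mode": "dedicated",  "model": "my-dedicated-endpoint"},
    "nightly_summaries": {"mode": "batch",      "model": "example-org/example-chat-model"},
}

def model_for(route: str) -> str:
    """Pick the model/endpoint identifier for a given application path."""
    return ROUTES[route]["model"]

print(model_for("core_api"))  # → my-dedicated-endpoint
```

Promoting a path from Serverless to Dedicated is then a one-line change to the table, not a code migration.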


Summary

Choosing between Serverless Inference, Batch Inference, and Dedicated Endpoints on together.ai is about aligning your workload shape with the right infrastructure:

  • Serverless Inference for variable or unpredictable traffic and fast iteration.
  • Batch Inference for massive offline workloads, up to 30B tokens at up to 50% less cost.
  • Dedicated Endpoints (Model / Container / GPU Clusters) for predictable, latency-sensitive, high-throughput production.

Under the hood, you’re always riding on the same research-driven kernel stack—FlashAttention-derived Together Kernel Collection, ATLAS, and CPD—so the choice is not about raw capability, but about how you want to trade off elasticity, control, and unit economics.


Next Step

Get Started