together.ai pricing: how do input vs output token rates work, and how do I estimate monthly cost per model?
Foundation Model Platforms


12 min read

Most teams discover that the hardest part of “how much will this cost?” is not the GPU math; it’s understanding how input and output token pricing actually plays out per model, per workload, per month. On together.ai, that’s deliberate: you get transparent per‑million token rates, the same OpenAI‑compatible request patterns you already use, and clear levers to control spend.

Quick Answer: together.ai charges separately for input tokens (the prompt you send) and output tokens (what the model generates). Your monthly cost per model is essentially:

Monthly Cost ≈ (Input Tokens × Input Rate) + (Output Tokens × Output Rate)

…summed across all requests, with rates that vary by deployment mode (Serverless, Batch, Dedicated Inference, etc.).


The Quick Overview

  • What It Is: A token‑based pricing model where each model exposes a per‑million rate for input tokens and a separate per‑million rate for output tokens. Different deployment modes (Serverless Inference, Batch Inference, Dedicated Model Inference, Dedicated Container Inference, GPU Clusters) give you control over latency vs. cost.
  • Who It Is For: Product teams, infra engineers, and data scientists who need to budget and compare models—across open‑source and partner models—without guessing at GPU hours.
  • Core Problem Solved: You can’t optimize unit economics if pricing is opaque. together.ai’s token‑centric, model‑level pricing lets you project cost per feature, per tenant, and per month using simple, auditable formulas.

How It Works

At together.ai, every text or multimodal call ultimately resolves to tokens. Two buckets matter:

  • Input tokens: Everything you send in the request (system prompt, user messages, tool definitions, images converted to tokens, etc.).
  • Output tokens: Everything the model generates back (assistant messages, tool calls, JSON responses).

Each model has:

  • An input price per 1M tokens (e.g., $0.90 / 1M input tokens).
  • An output price per 1M tokens (e.g., $3.30 / 1M output tokens).

If a request uses T_in input tokens and T_out output tokens, and the model’s rates are P_in and P_out, the cost of that single request is:

Request Cost = (T_in / 1,000,000) × P_in  +  (T_out / 1,000,000) × P_out

From there, monthly cost is just:

Monthly Cost = Σ (Request Cost) over all requests in the month
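Those two formulas are a few lines of code. A minimal sketch (the function names are mine, not from any SDK), using the example $0.90 / $3.30 per-1M rates that appear later in this article:

```python
def request_cost(t_in: int, t_out: int, p_in: float, p_out: float) -> float:
    """Cost of one request, given per-1M-token input and output rates."""
    return (t_in / 1_000_000) * p_in + (t_out / 1_000_000) * p_out

def monthly_cost(requests, p_in: float, p_out: float) -> float:
    """Sum request costs over all (t_in, t_out) pairs observed in a month."""
    return sum(request_cost(t_in, t_out, p_in, p_out) for t_in, t_out in requests)

# Example: a 3K-input / 1K-output request at $0.90 / $3.30 per 1M tokens.
cost = request_cost(3_000, 1_000, p_in=0.90, p_out=3.30)  # ≈ $0.0060
```

Everything else in this article is variations on these two functions: different rates per model, different token counts per workload, and mode-specific discounts applied to the rates.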

The only wrinkle is deployment mode:

  • Serverless Inference: Pay per token, elastic capacity. Best for variable traffic and prototyping.
  • Batch Inference: Async jobs at up to 50% less cost per token, at massive scale (up to 30B enqueued tokens per model per user); jobs often finish within hours, with a <24h SLA.
  • Dedicated Model Inference / Dedicated Container Inference: Reserved, isolated compute with predictable latency and a capacity‑based pricing model; effective unit cost depends on utilization.
  • GPU Clusters: You provision raw GPUs (Kubernetes/Slurm style) and run your own stack; cost is per GPU‑hour, not per token.

In practice, most teams:

  1. Start on Serverless with per‑million token pricing.
  2. Move heavy offline workloads (classification, synthetic data, log summarization) to Batch for up to 50% savings.
  3. Shift steady, latency‑sensitive workloads to Dedicated Inference once traffic stabilizes.

1. Understanding Input Tokens

Input tokens are all the context you send to the model:

  • System + developer instructions
  • Chat history / few‑shot examples
  • Tool / function schemas
  • Documents or images converted to tokens for multimodal models

If a model has:

  • Context length: 128K tokens
  • Input price: $0.90 / 1M tokens (example from a reasoning model like arcee-ai/maestro-reasoning)

…and you send a 4K‑token context, the input side of a single call costs:

Input Cost = (4,000 / 1,000,000) × $0.90 ≈ $0.0036

2. Understanding Output Tokens

Output tokens are everything the model emits:

  • Natural language responses
  • Code completions
  • JSON output (including closing braces, commas, etc.)
  • Tool / function call payloads

If the same model has:

  • Output price: $3.30 / 1M tokens

…and you generate 1K tokens in one response:

Output Cost = (1,000 / 1,000,000) × $3.30 ≈ $0.0033

Total request cost:

Total = $0.0036 (input) + $0.0033 (output) = $0.0069

How To Estimate Monthly Cost Per Model

You don’t need perfect precision to make good decisions. You need a defensible estimate. Here’s the pragmatic approach I recommend when advising teams migrating to together.ai.

Step 1: Pick the Model and Note Its Rates

For each model you care about, record:

  • P_in = Input price per 1M tokens
  • P_out = Output price per 1M tokens

Example (illustrative numbers, not a full catalog):

  • arcee-ai/maestro-reasoning: Context 128K; Input $0.90 / 1M; Output $3.30 / 1M; Modes*: Serverless, On‑Demand Dedicated, Reserved
  • Typical chat model (FP16, multimodal): Context 128K; Input $0.18 / 1M; Output $0.59 / 1M; Modes*: Serverless, Batch, Dedicated

*Exact modes vary by model; check the model page in the dashboard.
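If you script your estimates, those rates can live in a small lookup keyed by model ID (numbers copied from the illustrative table above; the second ID is a placeholder, not a real model):

```python
# Per-1M-token rates; illustrative only, check the model page for current prices.
RATES = {
    "arcee-ai/maestro-reasoning": {"in": 0.90, "out": 3.30},
    "example/typical-chat-model": {"in": 0.18, "out": 0.59},  # hypothetical ID
}

def lookup(model: str) -> tuple[float, float]:
    """Return (input rate, output rate) in $/1M tokens for a model."""
    r = RATES[model]
    return r["in"], r["out"]
```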

Step 2: Estimate Average Tokens Per Request

You can instrument this in production, but for planning:

  1. Estimate input tokens per call:

    • System prompt: 200–1,000 tokens
    • User message: 100–800 tokens
    • History / examples: 500–8,000 tokens depending on how “chatty” your app is

    Suppose you target 3K input tokens per request for a long‑context reasoning agent.

  2. Estimate output tokens per call:

    • Short chat reply: 50–150
    • Analytical answer: 300–800
    • Multi‑step reasoning / code: 800–2,000+

    Suppose you target 1K output tokens per request.

We’ll use:
T_in = 3,000 tokens, T_out = 1,000 tokens.
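If you have representative prompts on hand, a crude character-count heuristic (roughly 4 characters per English token; a common rule of thumb, not a real tokenizer) is enough for planning:

```python
def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Planning-grade estimate only; real counts come from the API's
    usage metadata (prompt_tokens / completion_tokens)."""
    return max(1, round(len(text) / chars_per_token))

prompt = "Summarize the following support ticket thread for an agent handoff. " * 40
print(rough_token_estimate(prompt))  # 680 under this heuristic
```

Once real traffic exists, replace the heuristic with measured averages from the usage fields in responses.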

Step 3: Calculate Cost Per Request

Use the formula:

Request Cost = (T_in / 1,000,000) × P_in  +  (T_out / 1,000,000) × P_out

For arcee-ai/maestro-reasoning:

  • P_in = $0.90
  • P_out = $3.30
  • T_in = 3,000
  • T_out = 1,000
Input Cost  = (3,000 / 1,000,000) × 0.90  ≈ $0.0027
Output Cost = (1,000 / 1,000,000) × 3.30 ≈ $0.0033
Total / Request ≈ $0.0060

So each call is about 0.6 cents.

Step 4: Multiply by Monthly Volume

If you expect 2M requests per month:

Monthly Cost = 2,000,000 × $0.0060 ≈ $12,000

Now you can break this down by:

  • Tenant / account
  • Feature (e.g., “AI Assistant”, “Code Copilot”)
  • Environment (staging vs production)
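A sketch of that rollup, with hypothetical per-feature volumes that sum to the 2M requests above (rates are the same worked example):

```python
RATE_IN, RATE_OUT = 0.90, 3.30  # example $/1M token rates

# (requests/month, avg input tokens, avg output tokens) per feature; made-up split
features = {
    "ai_assistant": (1_500_000, 3_000, 1_000),
    "code_copilot": (500_000, 3_000, 1_000),
}

def feature_cost(n_requests: int, t_in: int, t_out: int) -> float:
    """Monthly cost for one feature at the configured rates."""
    return n_requests * ((t_in / 1e6) * RATE_IN + (t_out / 1e6) * RATE_OUT)

total = sum(feature_cost(*v) for v in features.values())
print(f"${total:,.0f}/month")  # ≈ $12,000/month
```

The same pattern works for tenants or environments: key the dict by whatever dimension you need to report on.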

Step 5: Incorporate Deployment Mode

Up to here, we’ve assumed Serverless Inference. That’s the right default for:

  • Early or variable traffic
  • Feature flags and experiments
  • Bursty workloads (seasonal, time‑of‑day spikes)

But for heavy, predictable workloads, you can improve unit economics:

5.1 Batch Inference Cost Estimates

Batch jobs process tokens asynchronously at up to 50% less cost than real‑time serverless for most models. The math is identical; the rates are lower.

If the same model on Batch is effectively ~50% cheaper, and your workload is offline (e.g., summarizing ticket histories), you might see:

  • Serverless cost: $12,000 / month
  • Batch cost (same total tokens): ~$6,000 / month

Batch parameters from together.ai:

  • Up to 30B enqueued tokens per model per user
  • Jobs often complete within hours, with a <24h processing‑time SLA

That’s ideal when latency isn’t user‑visible.

5.2 Dedicated Model / Container Inference

For workloads that are:

  • Predictable or steady traffic
  • Latency‑sensitive applications

Dedicated endpoints are backed by reserved, isolated compute and the together.ai inference engine (FlashAttention kernels, ATLAS speculative decoding, CPD long‑context serving).

You typically agree on:

  • A certain capacity (e.g., a model on N GPUs)
  • An SLO (latency, availability)
  • A pricing structure (monthly minimum, possibly plus overage)

Your effective cost per 1M tokens becomes a function of utilization:

Effective Unit Cost ($ / 1M tokens) ≈ Monthly Dedicated Spend / (Total Tokens Served / 1,000,000)

As a rule of thumb:

  • If you’re pegging the model at high utilization, Dedicated often beats serverless on $/1M tokens.
  • If traffic is spiky or low‑volume, Serverless is usually cheaper overall—even if its raw per‑token rate is higher—because you’re not paying for idle GPUs.
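One way to make that rule of thumb concrete is a break-even volume (all figures hypothetical; real dedicated quotes come from sales):

```python
def breakeven_tokens_per_month(dedicated_spend: float,
                               serverless_rate_per_1m: float) -> float:
    """Tokens/month above which a dedicated reservation beats serverless,
    assuming a blended serverless $/1M rate across input and output."""
    return dedicated_spend / serverless_rate_per_1m * 1_000_000

# e.g. a $20,000/month reservation vs. a $2.00 blended serverless $/1M rate:
tokens = breakeven_tokens_per_month(20_000, 2.00)
print(f"{tokens / 1e9:.0f}B tokens/month")  # 10B tokens/month to break even
```

Below that volume, idle reserved capacity makes the dedicated $/1M worse than serverless; above it, the reservation wins.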

5.3 GPU Clusters

Here you provision GPUs and run your own stack. Cost is:

Monthly Cost ≈ Σ (GPU-Hours × Rate_per_GPU-Hour) + Storage / Networking

To compare with serverless:

Effective Unit Cost ($ / 1M tokens) ≈ Total Spend / (Total Tokens Generated / 1,000,000)

GPU Clusters become compelling when:

  • You have very large, consistent workloads across many models.
  • You want full control of kernels, quantization, and serving stack.

You can still use the same token accounting, but you’re in charge of turning GPU hours into tokens/sec via engineering choices.
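A back-of-the-envelope version of that back-solve, with made-up GPU rates and throughput (your real tokens/sec depends on model, batch size, and serving stack):

```python
def gpu_cluster_unit_cost(n_gpus: int, hours: float, rate_per_gpu_hour: float,
                          tokens_per_sec_per_gpu: float) -> float:
    """Effective $/1M tokens for a self-managed cluster.
    Ignores storage/networking and assumes sustained throughput."""
    spend = n_gpus * hours * rate_per_gpu_hour
    tokens = n_gpus * hours * 3_600 * tokens_per_sec_per_gpu
    return spend / (tokens / 1_000_000)

# 8 GPUs for a 720-hour month at $2.50/GPU-hour, 1,000 tokens/sec/GPU (all hypothetical):
print(f"${gpu_cluster_unit_cost(8, 720, 2.50, 1_000):.3f} / 1M tokens")
```

Note how sensitive the result is to throughput: halving tokens/sec doubles the effective $/1M, which is why the engineering choices matter as much as the GPU-hour rate.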


Features & Benefits Breakdown

  • Per‑million token pricing: Exposes explicit $ / 1M input tokens and $ / 1M output tokens per model and mode. Benefit: predictable, auditable cost estimates per app and feature.
  • Mode‑specific economics (Serverless/Batch/Dedicated): Lets you choose between per‑token, discounted async, or reserved‑capacity pricing. Benefit: match latency and traffic patterns to the best price‑performance.
  • OpenAI‑compatible API: Reuses your existing request/response patterns and token accounting. Benefit: no code changes required to migrate and compare costs.
  • Long‑context and multimodal support: Context lengths up to 128K and multiple modalities, priced the same way (per token). Benefit: easy to reason about the cost of RAG, agents, and media flows.

Ideal Use Cases

  • Best for variable or unpredictable traffic: Because Serverless Inference gives you per‑token pricing with no capacity planning. You only pay for what you use, and you still get production‑grade performance.
  • Best for massive offline jobs (classification, summarization, synthetic data): Because Batch Inference can run up to 30B tokens per model per user with up to 50% less cost, making large‑scale processing economic without GPU ops overhead.
  • Best for stable, latency‑sensitive workloads (chatbots, copilots): Because Dedicated Model Inference gives you isolated compute, SLO‑driven latency, and the potential for better effective $ / 1M tokens at high utilization.

Limitations & Considerations

  • Token estimation error: Early in a project, your guess of T_in and T_out per request may be off. Mitigation: instrument token usage in staging and adjust your model; together.ai’s metrics help you converge quickly.
  • Mode‑dependent pricing: Dedicated endpoints and GPU Clusters don’t expose a simple “published per‑token rate” because economics depend on GPU type and utilization. Mitigation: work with together.ai sales to derive effective $ / 1M tokens from your expected traffic and SLOs.

Pricing & Plans

together.ai doesn’t force a one‑plan‑fits‑all model; instead, you choose deployment modes and mix them per model and workload.

  • Serverless Inference (On‑Demand): Best for teams needing variable capacity, rapid prototyping, and cost‑sensitive early‑stage production. You pay per token using published input and output rates per model.
  • Batch Inference: Best for teams needing to classify large datasets, run offline summarization, or do synthetic data generation at scale. You still pay per token, but at up to 50% less cost vs real‑time serverless for most models.
  • Dedicated Model Inference / Dedicated Container Inference: Best for teams with predictable or steady traffic and latency‑sensitive applications. Pricing is typically capacity‑based, and you can back‑solve effective token economics from your projected utilization.
  • GPU Clusters: Best for teams who want full stack control (custom runtimes, training, advanced model shaping) and have workloads large enough to keep clusters highly utilized.

To get precise numbers for your models and volumes, you’ll typically:

  1. Start with the published Serverless rates for input/output tokens per model.
  2. Estimate volume and cost as shown above.
  3. Talk to sales to benchmark Batch and Dedicated options for your specific traffic pattern.

Frequently Asked Questions

How do I know how many tokens my prompts and responses use?

Short Answer: Use the Together Sandbox and API response metadata to inspect token counts directly, then average them across representative requests.

Details:
When you call models via the OpenAI‑compatible API, you can:

  • Inspect usage fields in responses (e.g., prompt_tokens, completion_tokens) to see actual input and output token counts.
  • Use Together Sandbox to iterate on prompts and watch token usage at the same time.
  • Export metrics to your logging/observability stack to track tokens per request by endpoint, tenant, and feature.

Once you measure typical T_in and T_out, plug them into:

Cost / Request = (T_in / 1,000,000) × P_in + (T_out / 1,000,000) × P_out

Multiply by monthly request volume to estimate spend per model.
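For example, pricing a single measured response might look like this (the payload is mocked, but real OpenAI-compatible responses expose the same usage shape):

```python
# Mocked OpenAI-compatible response; live calls return the same `usage` fields.
response = {
    "usage": {"prompt_tokens": 3_021, "completion_tokens": 987, "total_tokens": 4_008}
}

P_IN, P_OUT = 0.90, 3.30  # example $/1M rates from earlier in this article
u = response["usage"]
cost = (u["prompt_tokens"] / 1e6) * P_IN + (u["completion_tokens"] / 1e6) * P_OUT
print(f"${cost:.6f}")  # $0.005976
```

Averaging these measured counts over a representative sample gives you far better T_in / T_out values than any up-front guess.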


How do I compare serverless vs batch vs dedicated costs for the same model?

Short Answer: Treat serverless as your baseline, apply “up to 50% less” for Batch on the same token volume, and compute effective $ / 1M tokens for Dedicated based on projected utilization.

Details:

  1. Serverless baseline:

    • Use published P_in and P_out rates.
    • Compute monthly cost from your measured token usage.
  2. Batch Inference:

    • Same total tokens, but priced at up to 50% less.
    • If your serverless cost for an offline workload is $10,000, expect Batch to land closer to $5,000 for the same tokens, subject to model‑specific rates.
  3. Dedicated Inference:

    • Suppose you provision capacity that costs $X / month and serves Y total tokens / month.
    • Effective unit cost: $X / (Y / 1,000,000) = $ / 1M tokens.
    • If that number is lower than your serverless effective rate (and you hit your latency SLO), Dedicated is economically superior for that workload.

This is exactly how I guide teams: implement with Serverless, measure real token usage and traffic patterns, then move high‑volume offline work to Batch and steady, latency‑sensitive flows to Dedicated once you can justify the reservation.
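Those three steps can be scripted for a given month's traffic (the 50% batch discount is an upper bound, and the dedicated spend is a placeholder, not a quote):

```python
def compare_modes(tokens_in: float, tokens_out: float, p_in: float, p_out: float,
                  batch_discount: float = 0.50, dedicated_spend: float = None) -> dict:
    """Monthly cost per mode for one model's traffic.
    Dedicated is reported as effective $/1M tokens for direct comparison."""
    serverless = (tokens_in / 1e6) * p_in + (tokens_out / 1e6) * p_out
    out = {"serverless": serverless, "batch": serverless * (1 - batch_discount)}
    if dedicated_spend is not None:
        total_millions = (tokens_in + tokens_out) / 1e6
        out["dedicated_per_1m"] = dedicated_spend / total_millions
    return out

# 6B input + 2B output tokens/month at $0.90/$3.30, vs. a $10k/month reservation:
print(compare_modes(6e9, 2e9, 0.90, 3.30, dedicated_spend=10_000))
```

In this hypothetical, serverless lands at $12,000/month, batch at up to half that, and the reservation works out to $1.25 per 1M tokens served, which you can compare against your blended serverless rate.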


Summary

together.ai’s pricing model is intentionally simple at the surface—input tokens × input rate + output tokens × output rate—and flexible under the hood. By keeping everything expressed in tokens and exposing per‑million rates per model, you can:

  • Estimate cost per request and cost per feature in a few lines of math.
  • Compare serverless vs batch vs dedicated vs GPU clusters in clear unit economics.
  • Make long‑context, multimodal, and reasoning workloads predictable from a budget standpoint.

Layer on the AI Native Cloud’s research‑to‑production engine—FlashAttention‑driven kernels, ATLAS speculative decoding, CPD for long‑context—and you’re not just paying for tokens; you’re buying better latency and throughput at a given spend.


Next Step

Get Started