together.ai pricing: how do input vs output token rates work, and how do I estimate monthly cost per model?
Foundation Model Platforms


12 min read

Most teams discover that the hardest part of “how much will this cost?” is not the GPU math; it’s understanding how input and output token pricing actually plays out per model, per workload, per month. On together.ai, that’s deliberate: you get transparent per‑million token rates, the same OpenAI‑compatible request patterns you already use, and clear levers to control spend.

Quick Answer: together.ai charges separately for input tokens (the prompt you send) and output tokens (what the model generates). Your monthly cost per model is essentially:

Monthly Cost ≈ (Input Tokens × Input Rate) + (Output Tokens × Output Rate)

…summed across all requests, with rates that vary by deployment mode (Serverless, Batch, Dedicated Inference, etc.).


The Quick Overview

  • What It Is: A token‑based pricing model where each model exposes a per‑million rate for input tokens and a separate per‑million rate for output tokens. Different deployment modes (Serverless Inference, Batch Inference, Dedicated Model Inference, Dedicated Container Inference, GPU Clusters) give you control over latency vs. cost.
  • Who It Is For: Product teams, infra engineers, and data scientists who need to budget and compare models—across open‑source and partner models—without guessing at GPU hours.
  • Core Problem Solved: You can’t optimize unit economics if pricing is opaque. together.ai’s token‑centric, model‑level pricing lets you project cost per feature, per tenant, and per month using simple, auditable formulas.

How It Works

At together.ai, every text or multimodal call ultimately resolves to tokens. Two buckets matter:

  • Input tokens: Everything you send in the request (system prompt, user messages, tool definitions, images converted to tokens, etc.).
  • Output tokens: Everything the model generates back (assistant messages, tool calls, JSON responses).

Each model has:

  • An input price per 1M tokens (e.g., $0.90 / 1M input tokens).
  • An output price per 1M tokens (e.g., $3.30 / 1M output tokens).

If a request uses T_in input tokens and T_out output tokens, and the model’s rates are P_in and P_out, the cost of that single request is:

Request Cost = (T_in / 1,000,000) × P_in  +  (T_out / 1,000,000) × P_out

From there, monthly cost is just:

Monthly Cost = Σ (Request Cost) over all requests in the month
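Those two formulas are a few lines of code. A minimal sketch (the function names are mine, not from any SDK), using the example $0.90 / $3.30 per-1M rates that appear later in this article:

```python
def request_cost(t_in: int, t_out: int, p_in: float, p_out: float) -> float:
    """Cost of one request, given per-1M-token input and output rates."""
    return (t_in / 1_000_000) * p_in + (t_out / 1_000_000) * p_out

def monthly_cost(requests, p_in: float, p_out: float) -> float:
    """Sum request costs over all (t_in, t_out) pairs observed in a month."""
    return sum(request_cost(t_in, t_out, p_in, p_out) for t_in, t_out in requests)

# Example: a 3K-input / 1K-output request at $0.90 / $3.30 per 1M tokens.
cost = request_cost(3_000, 1_000, p_in=0.90, p_out=3.30)  # ≈ $0.0060
```

Everything else in this article is variations on these two functions: different rates per model, different token counts per workload, and mode-specific discounts applied to the rates.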

The only wrinkle is deployment mode:

  • Serverless Inference: Pay per token, elastic capacity. Best for variable traffic and prototyping.
  • Batch Inference: Async jobs at up to 50% less cost per token, at massive scale (up to 30B enqueued tokens per model per user); jobs often finish within hours, with a <24h SLA.
  • Dedicated Model Inference / Dedicated Container Inference: Reserved, isolated compute with predictable latency and a capacity‑based pricing model; effective unit cost depends on utilization.
  • GPU Clusters: You provision raw GPUs (Kubernetes/Slurm style) and run your own stack; cost is per GPU‑hour, not per token.

In practice, most teams:

  1. Start on Serverless with per‑million token pricing.
  2. Move heavy offline workloads (classification, synthetic data, log summarization) to Batch for up to 50% savings.
  3. Shift steady, latency‑sensitive workloads to Dedicated Inference once traffic stabilizes.

1. Understanding Input Tokens

Input tokens are all the context you send to the model:

  • System + developer instructions
  • Chat history / few‑shot examples
  • Tool / function schemas
  • Documents or images converted to tokens for multimodal models

If a model has:

  • Context length: 128K tokens
  • Input price: $0.90 / 1M tokens (example from a reasoning model like arcee-ai/maestro-reasoning)

…and you send a 4K‑token context, the input side of a single call costs:

Input Cost = (4,000 / 1,000,000) × $0.90 ≈ $0.0036

2. Understanding Output Tokens

Output tokens are everything the model emits:

  • Natural language responses
  • Code completions
  • JSON output (including closing braces, commas, etc.)
  • Tool / function call payloads

If the same model has:

  • Output price: $3.30 / 1M tokens

…and you generate 1K tokens in one response:

Output Cost = (1,000 / 1,000,000) × $3.30 ≈ $0.0033

Total request cost:

Total = $0.0036 (input) + $0.0033 (output) = $0.0069

How To Estimate Monthly Cost Per Model

You don’t need perfect precision to make good decisions. You need a defensible estimate. Here’s the pragmatic approach I recommend when advising teams migrating to together.ai.

Step 1: Pick the Model and Note Its Rates

For each model you care about, record:

  • P_in = Input price per 1M tokens
  • P_out = Output price per 1M tokens

Example (illustrative numbers, not a full catalog):

  • arcee-ai/maestro-reasoning: Context 128K; Input $0.90 / 1M; Output $3.30 / 1M; Modes*: Serverless, On‑Demand Dedicated, Reserved
  • Typical chat model (FP16, multimodal): Context 128K; Input $0.18 / 1M; Output $0.59 / 1M; Modes*: Serverless, Batch, Dedicated

*Exact modes vary by model; check the model page in the dashboard.
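If you script your estimates, those rates can live in a small lookup keyed by model ID (numbers copied from the illustrative table above; the second ID is a placeholder, not a real model):

```python
# Per-1M-token rates; illustrative only, check the model page for current prices.
RATES = {
    "arcee-ai/maestro-reasoning": {"in": 0.90, "out": 3.30},
    "example/typical-chat-model": {"in": 0.18, "out": 0.59},  # hypothetical ID
}

def lookup(model: str) -> tuple[float, float]:
    """Return (input rate, output rate) in $/1M tokens for a model."""
    r = RATES[model]
    return r["in"], r["out"]
```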

Step 2: Estimate Average Tokens Per Request

You can instrument this in production, but for planning:

  1. Estimate input tokens per call:

    • System prompt: 200–1,000 tokens
    • User message: 100–800 tokens
    • History / examples: 500–8,000 tokens depending on how “chatty” your app is

    Suppose you target 3K input tokens per request for a long‑context reasoning agent.

  2. Estimate output tokens per call:

    • Short chat reply: 50–150
    • Analytical answer: 300–800
    • Multi‑step reasoning / code: 800–2,000+

    Suppose you target 1K output tokens per request.

We’ll use:
T_in = 3,000 tokens, T_out = 1,000 tokens.
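If you have representative prompts on hand, a crude character-count heuristic (roughly 4 characters per English token; a common rule of thumb, not a real tokenizer) is enough for planning:

```python
def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Planning-grade estimate only; real counts come from the API's
    usage metadata (prompt_tokens / completion_tokens)."""
    return max(1, round(len(text) / chars_per_token))

prompt = "Summarize the following support ticket thread for an agent handoff. " * 40
print(rough_token_estimate(prompt))  # 680 under this heuristic
```

Once real traffic exists, replace the heuristic with measured averages from the usage fields in responses.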

Step 3: Calculate Cost Per Request

Use the formula:

Request Cost = (T_in / 1,000,000) × P_in  +  (T_out / 1,000,000) × P_out

For arcee-ai/maestro-reasoning:

  • P_in = $0.90
  • P_out = $3.30
  • T_in = 3,000
  • T_out = 1,000
Input Cost  = (3,000 / 1,000,000) × 0.90  ≈ $0.0027
Output Cost = (1,000 / 1,000,000) × 3.30 ≈ $0.0033
Total / Request ≈ $0.0060

So each call is about 0.6 cents.

Step 4: Multiply by Monthly Volume

If you expect 2M requests per month:

Monthly Cost = 2,000,000 × $0.0060 ≈ $12,000

Now you can break this down by:

  • Tenant / account
  • Feature (e.g., “AI Assistant”, “Code Copilot”)
  • Environment (staging vs production)
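A sketch of that rollup, with hypothetical per-feature volumes that sum to the 2M requests above (rates are the same worked example):

```python
RATE_IN, RATE_OUT = 0.90, 3.30  # example $/1M token rates

# (requests/month, avg input tokens, avg output tokens) per feature; made-up split
features = {
    "ai_assistant": (1_500_000, 3_000, 1_000),
    "code_copilot": (500_000, 3_000, 1_000),
}

def feature_cost(n_requests: int, t_in: int, t_out: int) -> float:
    """Monthly cost for one feature at the configured rates."""
    return n_requests * ((t_in / 1e6) * RATE_IN + (t_out / 1e6) * RATE_OUT)

total = sum(feature_cost(*v) for v in features.values())
print(f"${total:,.0f}/month")  # ≈ $12,000/month
```

The same pattern works for tenants or environments: key the dict by whatever dimension you need to report on.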

Step 5: Incorporate Deployment Mode

Up to here, we’ve assumed Serverless Inference. That’s the right default for:

  • Early or variable traffic
  • Feature flags and experiments
  • Bursty workloads (seasonal, time‑of‑day spikes)

But for heavy, predictable workloads, you can improve unit economics:

5.1 Batch Inference Cost Estimates

Batch jobs process tokens asynchronously at up to 50% less cost than real‑time serverless for most models. The math is identical; the rates are lower.

If the same model on Batch is effectively ~50% cheaper, and your workload is offline (e.g., summarizing ticket histories), you might see:

  • Serverless cost: $12,000 / month
  • Batch cost (same total tokens): ~$6,000 / month

Batch parameters from together.ai:

  • Up to 30B enqueued tokens per model per user
  • Jobs often complete within hours, with a <24h processing‑time SLA

That’s ideal when latency isn’t user‑visible.

5.2 Dedicated Model / Container Inference

For workloads that are:

  • Predictable or steady traffic
  • Latency‑sensitive applications

Dedicated endpoints are backed by reserved, isolated compute and the together.ai inference engine (FlashAttention kernels, ATLAS speculative decoding, CPD long‑context serving).

You typically agree on:

  • A certain capacity (e.g., a model on N GPUs)
  • An SLO (latency, availability)
  • A pricing structure (monthly minimum, possibly plus overage)

Your effective cost per 1M tokens becomes a function of utilization:

Effective Unit Cost ($ / 1M tokens) ≈ Monthly Dedicated Spend / (Total Tokens Served / 1,000,000)

As a rule of thumb:

  • If you’re pegging the model at high utilization, Dedicated often beats serverless on $/1M tokens.
  • If traffic is spiky or low‑volume, Serverless is usually cheaper overall—even if its raw per‑token rate is higher—because you’re not paying for idle GPUs.
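One way to make that rule of thumb concrete is a break-even volume (all figures hypothetical; real dedicated quotes come from sales):

```python
def breakeven_tokens_per_month(dedicated_spend: float,
                               serverless_rate_per_1m: float) -> float:
    """Tokens/month above which a dedicated reservation beats serverless,
    assuming a blended serverless $/1M rate across input and output."""
    return dedicated_spend / serverless_rate_per_1m * 1_000_000

# e.g. a $20,000/month reservation vs. a $2.00 blended serverless $/1M rate:
tokens = breakeven_tokens_per_month(20_000, 2.00)
print(f"{tokens / 1e9:.0f}B tokens/month")  # 10B tokens/month to break even
```

Below that volume, idle reserved capacity makes the dedicated $/1M worse than serverless; above it, the reservation wins.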

5.3 GPU Clusters

Here you provision GPUs and run your own stack. Cost is:

Monthly Cost ≈ Σ (GPU-Hours × Rate_per_GPU-Hour) + Storage / Networking

To compare with serverless:

Effective Unit Cost ($ / 1M tokens) ≈ Total Spend / (Total Tokens Generated / 1,000,000)

GPU Clusters become compelling when:

  • You have very large, consistent workloads across many models.
  • You want full control of kernels, quantization, and serving stack.

You can still use the same token accounting, but you’re in charge of turning GPU hours into tokens/sec via engineering choices.
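A back-of-the-envelope version of that back-solve, with made-up GPU rates and throughput (your real tokens/sec depends on model, batch size, and serving stack):

```python
def gpu_cluster_unit_cost(n_gpus: int, hours: float, rate_per_gpu_hour: float,
                          tokens_per_sec_per_gpu: float) -> float:
    """Effective $/1M tokens for a self-managed cluster.
    Ignores storage/networking and assumes sustained throughput."""
    spend = n_gpus * hours * rate_per_gpu_hour
    tokens = n_gpus * hours * 3_600 * tokens_per_sec_per_gpu
    return spend / (tokens / 1_000_000)

# 8 GPUs for a 720-hour month at $2.50/GPU-hour, 1,000 tokens/sec/GPU (all hypothetical):
print(f"${gpu_cluster_unit_cost(8, 720, 2.50, 1_000):.3f} / 1M tokens")
```

Note how sensitive the result is to throughput: halving tokens/sec doubles the effective $/1M, which is why the engineering choices matter as much as the GPU-hour rate.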


Features & Benefits Breakdown

  • Per‑million token pricing: Exposes explicit $ / 1M input tokens and $ / 1M output tokens per model and mode. Benefit: predictable, auditable cost estimates per app and feature.
  • Mode‑specific economics (Serverless/Batch/Dedicated): Lets you choose between per‑token, discounted async, or reserved‑capacity pricing. Benefit: match latency and traffic patterns to the best price‑performance.
  • OpenAI‑compatible API: Reuses your existing request/response patterns and token accounting. Benefit: no code changes required to migrate and compare costs.
  • Long‑context and multimodal support: Context lengths up to 128K and multiple modalities, priced the same way (per token). Benefit: easy to reason about the cost of RAG, agents, and media flows.

Ideal Use Cases

  • Best for variable or unpredictable traffic: Because Serverless Inference gives you per‑token pricing with no capacity planning. You only pay for what you use, and you still get production‑grade performance.
  • Best for massive offline jobs (classification, summarization, synthetic data): Because Batch Inference can run up to 30B tokens per model per user with up to 50% less cost, making large‑scale processing economic without GPU ops overhead.
  • Best for stable, latency‑sensitive workloads (chatbots, copilots): Because Dedicated Model Inference gives you isolated compute, SLO‑driven latency, and the potential for better effective $ / 1M tokens at high utilization.

Limitations & Considerations

  • Token estimation error: Early in a project, your guess of T_in and T_out per request may be off. Mitigation: instrument token usage in staging and adjust your model; together.ai’s metrics help you converge quickly.
  • Mode‑dependent pricing: Dedicated endpoints and GPU Clusters don’t expose a simple “published per‑token rate” because economics depend on GPU type and utilization. Mitigation: work with together.ai sales to derive effective $ / 1M tokens from your expected traffic and SLOs.

Pricing & Plans

together.ai doesn’t force a one‑plan‑fits‑all model; instead, you choose deployment modes and mix them per model and workload.

  • Serverless Inference (On‑Demand): Best for teams needing variable capacity, rapid prototyping, and cost‑sensitive early‑stage production. You pay per token using published input and output rates per model.
  • Batch Inference: Best for teams needing to classify large datasets, run offline summarization, or do synthetic data generation at scale. You still pay per token, but at up to 50% less cost vs real‑time serverless for most models.
  • Dedicated Model Inference / Dedicated Container Inference: Best for teams with predictable or steady traffic and latency‑sensitive applications. Pricing is typically capacity‑based, and you can back‑solve effective token economics from your projected utilization.
  • GPU Clusters: Best for teams who want full stack control (custom runtimes, training, advanced model shaping) and have workloads large enough to keep clusters highly utilized.

To get precise numbers for your models and volumes, you’ll typically:

  1. Start with the published Serverless rates for input/output tokens per model.
  2. Estimate volume and cost as shown above.
  3. Talk to sales to benchmark Batch and Dedicated options for your specific traffic pattern.

Frequently Asked Questions

How do I know how many tokens my prompts and responses use?

Short Answer: Use the Together Sandbox and API response metadata to inspect token counts directly, then average them across representative requests.

Details:
When you call models via the OpenAI‑compatible API, you can:

  • Inspect usage fields in responses (e.g., prompt_tokens, completion_tokens) to see actual input and output token counts.
  • Use Together Sandbox to iterate on prompts and watch token usage at the same time.
  • Export metrics to your logging/observability stack to track tokens per request by endpoint, tenant, and feature.

Once you measure typical T_in and T_out, plug them into:

Cost / Request = (T_in / 1,000,000) × P_in + (T_out / 1,000,000) × P_out

Multiply by monthly request volume to estimate spend per model.
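For example, pricing a single measured response might look like this (the payload is mocked, but real OpenAI-compatible responses expose the same usage shape):

```python
# Mocked OpenAI-compatible response; live calls return the same `usage` fields.
response = {
    "usage": {"prompt_tokens": 3_021, "completion_tokens": 987, "total_tokens": 4_008}
}

P_IN, P_OUT = 0.90, 3.30  # example $/1M rates from earlier in this article
u = response["usage"]
cost = (u["prompt_tokens"] / 1e6) * P_IN + (u["completion_tokens"] / 1e6) * P_OUT
print(f"${cost:.6f}")  # $0.005976
```

Averaging these measured counts over a representative sample gives you far better T_in / T_out values than any up-front guess.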


How do I compare serverless vs batch vs dedicated costs for the same model?

Short Answer: Treat serverless as your baseline, apply “up to 50% less” for Batch on the same token volume, and compute effective $ / 1M tokens for Dedicated based on projected utilization.

Details:

  1. Serverless baseline:

    • Use published P_in and P_out rates.
    • Compute monthly cost from your measured token usage.
  2. Batch Inference:

    • Same total tokens, but priced at up to 50% less.
    • If your serverless cost for an offline workload is $10,000, expect Batch to land closer to $5,000 for the same tokens, subject to model‑specific rates.
  3. Dedicated Inference:

    • Suppose you provision capacity that costs $X / month and serves Y total tokens / month.
    • Effective unit cost: $X / (Y / 1,000,000) = $ / 1M tokens.
    • If that number is lower than your serverless effective rate (and you hit your latency SLO), Dedicated is economically superior for that workload.

This is exactly how I guide teams: implement with Serverless, measure real token usage and traffic patterns, then move high‑volume offline work to Batch and steady, latency‑sensitive flows to Dedicated once you can justify the reservation.
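Those three steps can be scripted for a given month's traffic (the 50% batch discount is an upper bound, and the dedicated spend is a placeholder, not a quote):

```python
def compare_modes(tokens_in: float, tokens_out: float, p_in: float, p_out: float,
                  batch_discount: float = 0.50, dedicated_spend: float = None) -> dict:
    """Monthly cost per mode for one model's traffic.
    Dedicated is reported as effective $/1M tokens for direct comparison."""
    serverless = (tokens_in / 1e6) * p_in + (tokens_out / 1e6) * p_out
    out = {"serverless": serverless, "batch": serverless * (1 - batch_discount)}
    if dedicated_spend is not None:
        total_millions = (tokens_in + tokens_out) / 1e6
        out["dedicated_per_1m"] = dedicated_spend / total_millions
    return out

# 6B input + 2B output tokens/month at $0.90/$3.30, vs. a $10k/month reservation:
print(compare_modes(6e9, 2e9, 0.90, 3.30, dedicated_spend=10_000))
```

In this hypothetical, serverless lands at $12,000/month, batch at up to half that, and the reservation works out to $1.25 per 1M tokens served, which you can compare against your blended serverless rate.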


Summary

together.ai’s pricing model is intentionally simple at the surface—input tokens × input rate + output tokens × output rate—and flexible under the hood. By keeping everything expressed in tokens and exposing per‑million rates per model, you can:

  • Estimate cost per request and cost per feature in a few lines of math.
  • Compare serverless vs batch vs dedicated vs GPU clusters in clear unit economics.
  • Make long‑context, multimodal, and reasoning workloads predictable from a budget standpoint.

Layer on the AI Native Cloud’s research‑to‑production engine—FlashAttention‑driven kernels, ATLAS speculative decoding, CPD for long‑context—and you’re not just paying for tokens; you’re buying better latency and throughput at a given spend.


Next Step

Get Started