
SambaNova Cloud pricing: where can I see per-model input/output $ per million tokens?
Most teams evaluating SambaNova Cloud want a clear, per-model view of what they’ll actually pay in input and output dollars per million tokens before they move workloads. This guide walks through where to find those numbers, how to interpret them, and what to consider when comparing costs to your current provider.
Quick Answer: Per‑model input/output pricing on SambaNova Cloud is published in the SambaNova Cloud console and pricing docs. Sign in to see $ per 1M input tokens and $ per 1M output tokens for each supported model, along with region and plan details.
The Quick Overview
- What It Is: A usage‑based pricing model for SambaNova Cloud that publishes per‑model costs in dollars per million input and output tokens.
- Who It Is For: Platform teams, infra leads, and developers who need to compare SambaNova Cloud to existing OpenAI‑compatible or GPU‑based deployments.
- Core Problem Solved: Eliminates guesswork around total serving cost by showing transparent, model‑level token pricing tied to the workloads you actually run.
How SambaNova Cloud pricing is structured
SambaNova Cloud is built for production inference—agentic workflows, multi‑model pipelines, and sovereign deployments—so pricing is aligned to tokens and infrastructure, not just “API calls.”
At a high level:
- You pay per million tokens, split into:
- Input tokens (prompt, system messages, tool calls, retrieved context).
- Output tokens (model completions and intermediate agent responses).
- Different models (DeepSeek‑R1, Llama variants, OpenAI gpt‑oss‑120b, etc.) have different price points per 1M input and per 1M output tokens.
- In higher‑commitment plans, you can combine per‑token pricing with reserved capacity (e.g., dedicated SambaRack) for predictable spend.
The per‑model input/output rates are visible once you’re in the SambaNova Cloud environment.
Where to see per‑model $ per million tokens
You can view per‑model pricing in a few places, depending on whether you’re already a customer.
1. Inside the SambaNova Cloud console
Once you have access:
- Sign in to SambaNova Cloud.
- Navigate to Billing or Usage & Pricing in the console sidebar.
- Open the Models or Pricing tab.
For each model, you’ll see something like:
- Model name (e.g.,
DeepSeek-R1,Llama 3.1 70B,gpt-oss-120b). - Input price:
$X.XX per 1M input tokens - Output price:
$Y.YY per 1M output tokens - Supported endpoints (e.g.,
/chat/completionsvia OpenAI‑compatible APIs). - Any relevant region or plan‑specific notes.
This is the most up‑to‑date source of truth, since prices can differ by region, contract, or volume tier.
2. Public pricing documentation
If you’re still evaluating:
- Go to the main site:
https://sambanova.ai. - Look for SambaNova Cloud → Pricing or Documentation.
- Open the API pricing or Usage pricing section.
You’ll typically find:
- A pricing table with per‑model token rates.
- Separate input and output pricing per 1M tokens.
- Notes on free tiers, evaluation quotas, or minimum commitments (if available).
If you don’t see the model you care about, a short form or “Contact Sales” link will get you a quote and per‑model pricing sheet.
3. Custom or enterprise deployments
For dedicated racks, sovereign inference, or bring‑your‑own‑checkpoint setups, the pricing model is usually:
- Base infrastructure (e.g., SambaRack SN40L‑16 or SambaRack SN50) +
- Token‑based or capacity‑based pricing negotiated per deployment.
Your account team will provide a pricing schedule that includes:
- Effective $ per million tokens at your committed volume.
- Any discount tiers beyond a given monthly token threshold.
- Additional line items for SambaOrchestrator features if applicable.
How per‑model token pricing works in practice
Per‑model $/1M token pricing on SambaNova Cloud is designed to be straightforward if you’re coming from another OpenAI‑compatible provider.
1. You choose a model per workload
You can build with:
- DeepSeek‑R1 for code, reasoning, and math – up to 200 tokens/second on SambaNova RDU (independently measured by Artificial Analysis).
- Llama series (e.g., Llama 3.1 8B, 70B, 405B, and Llama 4 series as they’re supported) for general agentic and open‑source‑friendly workloads.
- OpenAI gpt‑oss‑120b with over 600 tokens/second in throughput on SambaNova infrastructure.
Each of these has its own input and output pricing per 1M tokens, listed in the console and pricing docs.
2. You send traffic via OpenAI‑compatible APIs
Because SambaNova Cloud uses OpenAI‑compatible endpoints:
- Existing apps can be ported in minutes by swapping the base URL and API key.
- All traffic is metered at the token level, just like you’re used to.
Behind the scenes, SambaStack uses model bundling and a three‑tier memory architecture on RDUs to keep models and prompts hot in memory, which is how it delivers high tokens/sec at efficient tokens per watt. You don’t pay extra for “model switching” inside an agent; it’s all just token usage.
3. Your bill breaks down by model and direction
Monthly or periodic usage reports typically show:
- Per model:
- Total input tokens and cost.
- Total output tokens and cost.
- Across all models:
- Total tokens.
- Effective blended $ per 1M tokens.
This is useful if you run multi‑model agent workflows, because you can see exactly where spend is concentrated—long‑context retrieval models, reasoning models like DeepSeek‑R1, or high‑volume utility models.
Features & benefits of transparent per‑model pricing
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Per‑model input/output token pricing | Publishes $ per 1M input and $ per 1M output tokens for each supported model | Lets teams forecast costs at the workload level, not just per‑call |
| OpenAI‑compatible metering | Meters tokens using familiar semantics on /chat/completions and similar | Enables fast comparison to current providers with minimal rework |
| High‑throughput RDUs & SambaStack | Uses custom dataflow + three‑tier memory for efficient agentic inference | Lower cost per token at scale due to higher tokens per watt |
Ideal use cases for SambaNova Cloud per‑model pricing
-
Best for production agentic inference:
Because you can see exactly what complex, multi‑step workflows cost when they hit multiple models—reasoning (DeepSeek‑R1), planning (Llama), and tools—without guessing at blended rates. -
Best for teams migrating from OpenAI‑style APIs:
Because OpenAI‑compatible endpoints and clear per‑model $/1M token tables make it straightforward to compare costs and port applications in minutes.
Limitations & considerations
-
Pricing visibility may require an account:
Some detailed per‑model pricing tables are only visible once you have SambaNova Cloud access or a sales contact. If you don’t see public tables, request access or a quote to get the exact $/1M numbers. -
Rates can vary by region and contract:
Enterprise and sovereign deployments can have different per‑token prices than public, shared‑tenant cloud usage. Always rely on your console or contract as the source of truth for planning.
Pricing & plans: how per‑token costs fit in
SambaNova offers flexible consumption models that range from pay‑as‑you‑go inference to dedicated racks for fully private deployments. Specific plan names and price points will be visible in your region, but conceptually you’ll see something like:
-
On‑Demand / Shared Cloud:
Best for teams needing fast evaluation or variable workloads. You pay per million tokens per model, using the published input/output pricing. -
Committed / Dedicated Capacity (SambaRack + SambaOrchestrator):
Best for enterprises with steady or high‑volume production traffic, especially agentic workflows with large models. You get negotiated rates per 1M tokens and/or capacity‑based pricing, plus operational control—Auto Scaling | Load Balancing | Monitoring | Model Management.
In both cases, the core unit is still tokens, which means migrating from a GPU‑based or other LLM provider is mostly about matching throughput and $/1M token numbers.
Frequently Asked Questions
Where exactly can I see the $ per 1M input and output tokens for each model?
Short Answer: In the SambaNova Cloud console under billing/pricing, and in the official SambaNova Cloud pricing documentation or quote you receive from the sales team.
Details:
Once you sign in to SambaNova Cloud, navigate to the billing or pricing section to see a per‑model breakdown. Each supported model lists the input price per 1M tokens and output price per 1M tokens, plus any plan or region qualifiers. If you’re in the process of evaluating, your SambaNova contact can share a current pricing sheet with the same per‑model breakdown.
How do I compare SambaNova Cloud pricing to my current LLM provider?
Short Answer: Use your historical token usage per model, apply SambaNova’s per‑model $/1M input and output token rates, and factor in the higher tokens/sec you can get from RDUs.
Details:
Export recent usage from your current provider (input and output tokens per model, per month). Then:
- Map models (e.g., compare your current reasoning model to DeepSeek‑R1, your general model to Llama or gpt‑oss‑120b).
- Apply SambaNova’s published $/1M input and $/1M output for those models.
- Adjust assumptions for throughput—SambaNova RDUs can deliver over 600 tokens/sec on gpt‑oss‑120b and up to 200 tokens/sec on DeepSeek‑R1 (independently measured), which can reduce concurrency overhead and infrastructure cost elsewhere.
- Include savings from consolidating agentic workflows onto fewer nodes via model bundling, which reduces operational overhead compared to “one-model-per-node” setups.
Your SambaNova team can also walk through this comparison using your real logs if you prefer.
Summary
Per‑model, per‑direction token pricing on SambaNova Cloud is designed for teams who care about real production economics, not just headline demo costs. You can see $ per 1M input and $ per 1M output tokens for each major model—DeepSeek‑R1, Llama variants, gpt‑oss‑120b, and more—directly in the console or pricing docs, and plug those numbers into your existing dashboards.
Because SambaNova exposes OpenAI‑compatible APIs and delivers high tokens/sec using RDUs, SambaStack, and tiered memory, you get a clean comparison to your current provider and a clear path to lower cost per token for agentic workloads.