
Our LLM feature costs are exploding—how do teams measure and reduce cost per million tokens in production?
Most teams only realize their LLM feature costs are exploding when the monthly invoice shows up—by then, it’s too late to see which agents, models, or prompts are driving the bill. The right move is to treat “cost per million tokens” as a first-class SRE metric for your AI stack, not an afterthought in finance.
This walkthrough is written from the perspective of someone who has operated production LLM serving at rack scale. I’ll focus on how to measure, attribute, and then systematically reduce cost per million tokens in production—especially for agentic, multi-model workflows—using the kinds of levers that actually move the bill: architecture, model selection, and infrastructure efficiency.
Quick Answer: You reduce cost per million tokens by (1) making it a traceable metric at the app, model, and workflow level, (2) eliminating “one-model-per-node” waste and cross-endpoint hops, and (3) running inference on infrastructure that delivers more tokens per watt with efficient model switching—like SambaNova’s RDU-based stack.
The Quick Overview
- What It Is: A practical model for measuring and lowering cost per million tokens across production LLM workloads, from simple completions to complex agent loops.
- Who It Is For: Platform, infra, and AI engineering teams responsible for LLM feature P&L, latency SLOs, and data center efficiency—not just prototype builders.
- Core Problem Solved: You can’t control what you can’t see. Most orgs lack clear per-workflow token accounting and run on architectures that make agentic inference far more expensive than it needs to be.
How Cost per Million Tokens Really Works
At a high level, your effective cost per million tokens in production is:
Effective Cost / 1M tokens = (All infra + provider costs) / (All tokens served)
The nuance is in the numerator and denominator:
- Numerator (cost):
- Public API usage (per-token pricing, surcharges for higher-throughput tiers)
- Cloud GPU instances or on-prem systems (CapEx amortized + OpEx: power, cooling, space)
- Overhead from orchestration (idle GPU memory, one-model-per-node inefficiencies, overprovisioning for peak)
- Denominator (usage):
- Input + output tokens per request
- Number of calls per feature (especially in agent loops and tools)
- Prompt growth over time (history, retrieval context, system prompts)
You’ve got two levers:
- Use fewer tokens for the same or better outcome.
- Generate more tokens per dollar (or per watt) of infrastructure.
The first is mostly application and prompt design. The second is about architecture and hardware–software co-design, where most teams still have untapped headroom.
Step 1: Instrument Cost per Million Tokens as a First-Class Metric
Before reducing cost, you need visibility that matches how your features actually behave.
1. Capture token usage per request
At the request level, log:
- Model / version
- Input tokens
- Output tokens
- Total tokens
- Latency
- Request type / endpoint (chat, embeddings, tools, etc.)
- Feature or service name (search, support assistant, internal agent)
If you’re using an OpenAI-compatible API (including SambaCloud), you already get token counts from the response. Persist those into your logging/metrics system.
2. Attribute cost to features and workflows
Define a shared schema so you can aggregate:
- By feature: “support_chat”, “code_assistant”, “RAG_search”
- By workflow/agent: “multi-hop_research_agent”, “QA + summarization pipeline”
- By tenant / customer: For cost sharing and internal chargeback
Then compute:
- Cost per 1M tokens per feature
- Tokens per request and requests per user
- Effective $ / user / month and $ / workflow execution
This is often where the first surprises appear: an agent that seems “cheap” per call might loop 15–20 times, quietly dominating your spend.
3. Break out model-level economics
Next, slice by model:
- Model name (e.g., “gpt-oss-120b”, “Llama 3.1 70B”, “DeepSeek-R1”)
- Provider (public API vs SambaCloud vs your own racks)
- Tokens sec / model (throughput)
- Latency distribution per model
This gives you per-model cost per million tokens and shows where you can safely downgrade or switch models without hurting outcomes.
Step 2: Identify the Real Cost Drivers
Once you have clean metrics, the usual culprits for exploding cost per million tokens look like this.
1. Agent loops and tool use
Multi-step agents often:
- Call multiple models across different providers
- Expand prompts with intermediate state on each hop
- Run sequentially instead of in parallel
Symptoms:
- High tokens per user flow
- Long-tail latency spikes
- “Invisible” spend in intermediate prompts
2. One-model-per-node infrastructure
On typical GPU-based stacks, each node is pinned to a single large model. For agentic workflows, that causes:
- Cross-endpoint hops: Each model call hits a different node or service.
- Underutilized memory: VRAM full of static weights that spend much of their time idle.
- Operational sprawl: More nodes to manage for a given aggregate token volume.
That manifests as higher CapEx and OpEx per million tokens, plus higher latency due to network hops and cold starts.
3. Prompt bloat
Over time:
- System prompts grow with policies, tools, and formatting instructions.
- Conversation histories are appended instead of windowed.
- RAG context is over-sized “just to be safe.”
All of this inflates input tokens per call—your denominator goes up in the wrong way.
Step 3: Concrete Strategies to Reduce Cost per Million Tokens
Strategy 1: Optimize prompts and workflows, not just models
-
Put hard caps on context windows per use case.
- Enforce max history length or token budgets for RAG context.
- Implement summarization of older turns instead of replaying full histories.
-
Right-size models by step.
- Use smaller models for:
- Classification, routing, light rewriting
- Metadata extraction
- Reserve frontier-scale models for:
- Reasoning-intensive hops (planning, coding, complex QA)
- On SambaCloud or SambaRack, “model bundling” lets multiple models coexist on the same RDU node, so you don’t pay a latency penalty for mixing them in one workflow.
- Use smaller models for:
-
Parallelize where possible.
- Run independent tool calls or sub-queries in parallel.
- This doesn’t reduce raw tokens, but it lets you:
- Finish work faster
- Pack more throughput into the same infrastructure footprint
- Lower effective cost because you amortize fixed costs over more tokens/sec
Strategy 2: Switch from “one-model-per-node” to bundled, multi-model inference
A major, often hidden cost driver is architectural: dedicating an entire node to a single large model, then stitching workflows across many endpoints.
With SambaNova’s architecture:
- RDUs + three-tier memory are designed for model bundling:
- Multiple frontier-scale models can stay hot on a single node.
- Tiered memory (high-bandwidth on-chip + near memory + far memory) keeps models and prompts resident, reducing expensive data movement.
- SambaStack is built for “infrastructure flexibility”:
- It switches between multiple models on one node without having to reload weights.
- Complex agentic workflows can execute end-to-end locally instead of hopping between nodes.
Impact on cost per million tokens:
- Lower infrastructure count: Fewer nodes to serve multiple models = less CapEx and less power/cooling.
- Higher utilization: You generate more tokens per watt because you’re not wasting memory and compute on idle, single-model nodes.
- Reduced network overhead: Fewer cross-service hops per workflow = less tail latency and less padding/guardrails overhead.
Strategy 3: Run on infrastructure optimized for tokens per watt
Once your application-level waste is under control, the biggest lever becomes how efficiently your hardware generates tokens.
On SambaNova:
- gpt-oss-120b has been measured at over 600 tokens per second on SambaNova RDUs—well-suited for near real-time agentic AI.
- DeepSeek-R1 (671B) achieves up to 200 tokens / second, as measured independently by Artificial Analysis.
- SambaRack SN40L-16 is optimized for low-power inference (average of 10 kWh), and SN50 is built for fast agentic inference at a fraction of the cost on the largest models.
More tokens/sec at lower power translates directly into:
- Lower $ / million tokens at the rack level.
- Fewer racks for a given global QPS.
- Lower cooling and data center footprint.
Because SambaNova provides a full-stack inference system (RDU chips + SambaRack systems + SambaStack + SambaOrchestrator + SambaCloud APIs), you get:
- Chips-to-model computing: Architecture tuned end-to-end for inference throughput and efficiency.
- Three-tier memory architecture: Minimized data movement, maximized tokens-per-watt.
- Unified orchestration: SambaOrchestrator handles Auto Scaling | Load Balancing | Monitoring | Model Management, so you can focus on tokens and SLOs instead of node micromanagement.
Strategy 4: Make switching infrastructure low-friction
You only realize cost savings if you can move workloads without rewriting everything.
With SambaCloud:
- OpenAI-compatible APIs mean:
- You can “port your application…in minutes.”
- Existing OpenAI SDKs and client libraries typically work with only endpoint and key changes.
- This lowers:
- Migration cost
- Risk of regression in production
- Time-to-value on infrastructure improvements
By moving high-volume, stable workloads to SambaCloud or SambaRack, you can keep experimentation on public APIs while anchoring the bulk of your tokens on a more efficient stack.
How SambaNova’s Stack Helps Lower Cost per Million Tokens
Here’s how the pieces fit together from a cost-control perspective.
-
SambaCloud (managed inference via OpenAI-compatible APIs)
- Production-ready support for:
- DeepSeek models (including DeepSeek-R1)
- Meta Llama family (Llama 3.1 8B/70B/405B, and Llama 4 series as a launch partner)
- OpenAI gpt-oss-120b
- Optimized for high-throughput, low-latency inference.
- Flexible consumption: token-based inference as a service, or dedicated capacity.
- Production-ready support for:
-
SambaRack SN40L-16 and SN50 (rack systems)
- SN40L-16: optimized for low-power inference—ideal when power and cooling are first-order constraints.
- SN50: designed for fast agentic inference on frontier-scale models at a fraction of the cost of competitive chips.
- Both leverage RDUs and the three-tier memory architecture to minimize data movement and maximize tokens-per-watt.
-
SambaStack + SambaOrchestrator (inference stack and control plane)
- Built for multi-model, agentic workflows:
- Model bundling on a single node
- Efficient context and prompt handling
- Control plane supports:
- Auto Scaling | Load Balancing | Monitoring | Model Management
- Designed to keep operational overhead low, so your ops cost per million tokens is controlled alongside raw infrastructure cost.
- Built for multi-model, agentic workflows:
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Model Bundling on RDUs | Keeps multiple frontier-scale models hot on a single node using a three-tier memory architecture. | Reduces node count and cross-endpoint hops; lowers $ / million tokens for multi-model workflows. |
| High Tokens-per-Watt Performance | Custom dataflow technology plus tiered memory minimizes data movement and maximizes throughput (e.g., gpt-oss-120b >600 tokens/sec; DeepSeek-R1 up to 200 tokens/sec). | More tokens out of the same power budget; lower infrastructure and energy cost per million tokens. |
| OpenAI-Compatible SambaCloud APIs | Expose SambaNova’s inference stack through familiar OpenAI-style interfaces. | Lets you port existing apps quickly and shift high-volume workloads to a more efficient stack without major code changes. |
Ideal Use Cases
- Best for agentic workflows with multiple model calls: Because SambaStack can switch between bundled models on a single RDU node, you avoid the cross-endpoint overhead that makes multi-hop agents disproportionately expensive.
- Best for teams constrained by power, cooling, or data residency: Because SambaRack SN40L-16 and SN50 deliver higher efficiency than GPU-based solutions and support sovereign, private deployments, you lower cost per million tokens while meeting operational and regulatory constraints.
Limitations & Considerations
- Not a fit if you only run tiny, low-volume models: If your entire workload is occasional inference on small models, the overhead of adopting a new inference stack may outweigh the savings. For most teams, cost pressure shows up once volume and model size both increase.
- You still need application-level optimization: Infrastructure efficiency can dramatically lower cost per million tokens, but wasteful prompts, unbounded agents, and poor workflow design will still leak budget. The best results come from combining SambaNova’s stack with disciplined prompt and agent engineering.
Pricing & Plans
Specific pricing depends on your deployment model and volume, but the patterns are consistent:
- SambaCloud consumption: Token-based inference as a service with predictable per-token economics and the ability to reserve or scale capacity as workloads grow.
- SambaRack deployments: Rack-scale systems (SN40L-16, SN50) sized for your data center footprint and workload profile, with CapEx amortized against significantly higher tokens-per-watt.
Typical decision split:
- SambaCloud: Best for teams needing fast time-to-production with managed operations, clear token-based pricing, and no hardware management.
- SambaRack (SN40L-16 or SN50): Best for enterprises and sovereign AI deployments that need full control over data, power usage, and long-term cost per million tokens at scale.
For current pricing, discounts at higher volumes, and help modeling TCO versus existing GPU fleets or public APIs, it’s best to talk directly with SambaNova.
Frequently Asked Questions
How do I actually calculate cost per million tokens across multiple providers?
Short Answer: Normalize all spend to dollars and all usage to tokens, then aggregate: (Total monthly LLM spend) ÷ (Total tokens generated) × 1,000,000.
Details:
Pull invoices and usage from each provider (OpenAI, SambaCloud, other clouds) plus your own infra. Convert:
- Public API: sum (tokens × per-token rate).
- Managed inference (e.g., SambaCloud): use the token-based charges, plus any committed capacity.
- On-prem racks: amortize hardware over its useful life (e.g., 3–5 years) and add power, cooling, and ops costs.
On the usage side, aggregate input + output tokens from your logs. Once you have:
- Total monthly cost (USD)
- Total monthly tokens (input + output)
Compute Effective Cost / 1M tokens. Then drill down by feature, model, and workflow to see where to optimize and which workloads should move to higher-efficiency infrastructure like SambaNova’s RDUs.
How much can infrastructure actually move my cost per million tokens vs better prompts?
Short Answer: Both matter; prompt and workflow optimization often yields 1.5–3× improvements, while moving from generic GPU stacks or public APIs to a high-efficiency inference stack can deliver similar or greater savings on the remaining tokens.
Details:
In practice, teams often see:
-
Prompt/workflow optimization:
- 30–60% reduction in tokens per request via context windowing, better routing, and right-sizing models.
- Fewer agent hops and parallelization shaving latency and incidentally reducing duplicate work.
-
Infrastructure optimization (e.g., SambaNova):
- Higher tokens/sec for the same power envelope (e.g., gpt-oss-120b at over 600 tokens/sec; DeepSeek-R1 up to 200 tokens/sec).
- Lower power per node and fewer nodes needed, especially for multi-model workloads thanks to model bundling and three-tier memory.
- Operational simplicity (SambaOrchestrator) that reduces the ops overhead per token.
When combined, these moves compound: you can cut total LLM feature spend by multiples, not just a few percent, while improving latency and reliability.
Summary
Teams struggling with exploding LLM feature costs usually lack two things: clear, workflow-level visibility into cost per million tokens, and infrastructure that’s built for agentic, multi-model inference rather than one-model-per-node serving.
By turning tokens into a first-class metric, tightening prompts and agent design, and then running those workloads on an inference stack optimized for tokens-per-watt and model bundling, you can materially reduce cost per million tokens without sacrificing capability.
SambaNova’s chips-to-model computing approach—RDUs with three-tier memory, SambaRack systems, SambaStack and SambaOrchestrator, and OpenAI-compatible SambaCloud APIs—gives you a direct path to lower, predictable LLM economics at scale.