What does SambaNova SambaStack + SambaOrchestrator include, and how do we evaluate it for autoscaling and multi-model routing?
AI Inference Acceleration

What does SambaNova SambaStack + SambaOrchestrator include, and how do we evaluate it for autoscaling and multi-model routing?

11 min read

Most platform teams don’t fail at building powerful models—they struggle to run agentic, multi-model workloads reliably under real traffic. SambaStack and SambaOrchestrator exist to solve that exact problem: autoscaling frontier-scale inference, routing across multiple models, and keeping latency and cost under control without rewriting your applications.

Quick Answer: SambaStack is SambaNova’s integrated inference stack that runs on RDUs and SambaRack systems, optimized for multi-model, agentic workloads. SambaOrchestrator is the control plane that gives you autoscaling, load balancing, monitoring, and model management across data centers so you can evaluate, harden, and operate autoscaling and multi-model routing in production.

The Quick Overview

  • What It Is:
    SambaStack is the “chips-to-model” inference stack that runs on SambaNova RDUs and SambaRack systems, built for high-throughput, low-power, multi-model inference. SambaOrchestrator is the distributed control plane that handles autoscaling, routing, and lifecycle management of those models and endpoints.

  • Who It Is For:
    Platform, infra, and MLOps teams responsible for production LLM serving—especially those running agentic workflows, multiple models (DeepSeek, Llama, gpt-oss, BYO checkpoints), and needing predictable cost, latency, and capacity planning.

  • Core Problem Solved:
    It eliminates “one-model-per-node” infrastructure, letting you bundle and switch between multiple frontier-scale models on a single node, while SambaOrchestrator provides autoscaling and multi-model routing so you can operate these workloads across clusters and data centers.

How It Works

At a high level, SambaStack runs your inference workloads on SambaNova’s Reconfigurable Dataflow Units (RDUs) and SambaRack systems; SambaOrchestrator sits above that as the control plane, exposing OpenAI-compatible APIs, and managing how traffic flows to models and how capacity scales.

Under the hood, SambaStack leverages custom dataflow technology and a three-tier memory architecture on the SN50 RDU to minimize data movement and maximize tokens-per-watt. That matters for agentic workloads, where prompts grow, context windows expand, and you chain multiple model calls per user request. SambaOrchestrator uses this flexible infrastructure to bundle models together and keep both models and prompts “hot,” then applies standard production primitives—autoscaling, load balancing, monitoring, model management—to keep the system stable under load.

  1. Inference Stack (SambaStack):

    • Runs on SambaRack SN40L-16 (optimized for low-power multi-model inference, ~10 kWh average) and SambaRack SN50 (optimized for fast, frontier-scale agentic inference).
    • Hosts frontier and open models (e.g., DeepSeek-R1, Llama, gpt-oss-120b) and supports BYO checkpoints.
    • Uses model bundling plus three-tier memory on the RDU to switch between multiple models and large prompts on a single node with minimal overhead.
  2. Control Plane (SambaOrchestrator):

    • Provides Auto Scaling | Load Balancing | Monitoring | Model Management | Cloud Create | Server Management.
    • Manages AI workloads across data centers, including on-prem SambaRack and SambaCloud deployments.
    • Exposes OpenAI-compatible APIs, so you can port existing applications in minutes and then test autoscaling and routing behavior without code rewrites.
  3. Developer & Ops Integration:

    • Developers hit OpenAI-compatible endpoints for chat, completion, and tools/agents.
    • Ops teams use SambaOrchestrator to define scaling rules, routing policies, and observability hooks.
    • You evaluate autoscaling and multi-model routing using load tests, synthetic agent workloads, and real production traffic shadowing—backed by metrics in SambaOrchestrator for tokens/sec, latency, and error rates.

Features & Benefits Breakdown

Core FeatureWhat It DoesPrimary Benefit
Model Bundling on SambaStackHosts and switches between multiple frontier-scale models on the same RDU node using three-tier memoryEliminates one-model-per-node constraints, enabling complex agentic workflows to run end-to-end on fewer systems
SambaOrchestrator Autoscaling & RoutingAuto Scaling | Load Balancing | Monitoring | Model Management across racks and data centersKeeps latency and cost stable under changing load; simplifies multi-model routing and failover
OpenAI-Compatible Inference APIsExposes OpenAI-compatible endpoints on SambaCloud and SambaRackLets teams port existing apps in minutes and evaluate SambaNova with minimal integration work

Ideal Use Cases

  • Best for agentic AI and multi-step workflows:
    Because SambaStack can bundle multiple frontier-scale models and keep them hot in tiered memory, agent loops (e.g., planner → tool caller → verifier) can run on one node, with SambaOrchestrator routing traffic and autoscaling based on tokens/sec, not just QPS.

  • Best for sovereign, cost- and power-sensitive inference:
    Because SambaRack SN40L-16 is optimized for low power inference (average of 10 kWh) and SambaOrchestrator can manage workloads across your own data centers, you can run high-throughput inference where data residency, power envelopes, and rack density are first-order constraints.

What SambaStack Includes (Workload View)

From the perspective of an inference operator, you can think of SambaStack as everything that turns RDU silicon and SambaRack hardware into usable, scalable inference capacity:

  • Chips-to-model computing:

    • SN50 RDU with custom dataflow architecture and three-tier memory for high tokens-per-watt at frontier model scale.
    • Integration with SambaRack SN50 for fast agentic inference and SN40L-16 for efficient multi-model inference.
  • Inference runtime and serving layer:

    • Optimized serving for modern LLMs, including long-context and reasoning models.
    • Token streaming with throughput like: gpt-oss-120b at over 600 tokens/second; DeepSeek-R1 up to 200 tokens/second (independently measured by Artificial Analysis).
    • Scheduling tuned for batching, prompt reuse, and multi-model workloads.
  • Model catalog + BYO:

    • Ready-to-run models such as DeepSeek, Llama, and gpt-oss families.
    • Bring Your Own Checkpoints support, so you can run proprietary or fine-tuned models in the same stack.
  • Developer surface area:

    • OpenAI-compatible APIs for chat/completions and agents.
    • Minimal migration friction—point existing clients at SambaNova endpoints, update auth, and begin measurements.

What SambaOrchestrator Includes (Control Plane View)

SambaOrchestrator is where you operate SambaStack at scale across data centers:

  • Core control-plane capabilities:

    • Auto Scaling | Load Balancing | Monitoring | Model Management | Cloud Create | Server Management.
    • Centralized configuration for models, endpoints, and routing policies.
  • Operational automation:

    • Automatically scales model replicas/racks up/down based on CPU/RDU utilization, tokens/sec, or custom load metrics.
    • Load balances intelligently across nodes and models, keeping hot models and prompts in tiered memory.
  • Observability & operations:

    • End-to-end monitoring of model deployments: request latency, throughput, error rates, and capacity utilization.
    • Health checks and alerting hooks to integrate with existing logging and incident systems.
  • Multi-environment / multi-region control:

    • Manage SambaRack deployments in your own data centers and SambaCloud resources from a unified plane.
    • Support for sovereign AI setups where models and data must stay within specific jurisdictions.

How to Evaluate Autoscaling on SambaStack + SambaOrchestrator

When you’re validating autoscaling, you’re really testing three questions:

  1. Does it scale fast enough to meet burst traffic?
  2. Does it scale back down so you’re not overspending?
  3. Does it preserve latency and quality for agentic, multi-call workloads?

Here’s a structured evaluation approach:

  1. Baseline single-model performance.

    • Start with a key model (e.g., gpt-oss-120b).
    • Measure tokens/sec, p95 latency, and utilization at a steady load on a fixed capacity (no autoscaling).
    • This becomes your reference for later comparisons.
  2. Enable autoscaling on a single model.

    • Configure SambaOrchestrator scaling rules (e.g., target utilization or queue depth).
    • Use a load generator that models your real traffic pattern: mix of prompt lengths, streaming vs non-streaming, and agent loops.
    • Evaluate: time-to-scale-up under a burst, time-to-scale-down after load drops, and how stable p95/p99 latency remains during scaling events.
  3. Test burst and noisy-neighbor scenarios.

    • Drive short, sharp bursts (e.g., 10x in a few seconds) to validate vertical reaction time.
    • Run concurrent workloads on other models to check whether SambaOrchestrator respects isolation, keeps your critical endpoints within SLOs, and makes appropriate scaling decisions.
  4. Measure cost and power behavior.

    • On SambaRack SN40L-16, track power draw (average ~10 kWh) and capacity utilization while autoscaling is active.
    • Verify that scale-down behavior actually returns you toward baseline power and cost footprints.

How to Evaluate Multi-Model Routing

Multi-model routing is where SambaStack’s model bundling and SambaOrchestrator’s control plane differentiate from GPU-first stacks that force “one-model-per-node.”

  1. Define your routing policy.
    Typical policies to test:

    • Tiered quality/cost routing: e.g., default to a smaller gpt-oss model; route “hard” requests to a larger model.
    • Latency-aware routing: route latency-sensitive endpoints to specific SambaRack clusters.
    • Agent-specific routing: planner on one model, tool executor on another, verifier on a third—all on the same node via bundling.
  2. Deploy multiple models on shared nodes.

    • Use SambaStack to bundle at least two or three models on the same SN50 or SN40L-16 node.
    • Confirm that SambaOrchestrator treats them as independently addressable endpoints while taking advantage of shared tiered memory for hot prompts.
  3. Run realistic mixed workloads.

    • Generate mixed traffic patterns that hit different models in the bundle according to your routing policy.
    • Evaluate:
      • Switch overhead when routing between models on the same node.
      • Impact on tokens/sec and p95 latency as you increase the number of active models.
      • Behavior when one model becomes a hotspot—does SambaOrchestrator rebalance or rescale effectively?
  4. Failure and degradation testing.

    • Intentionally take a model or rack out of service.
    • Confirm that SambaOrchestrator:
      • Reroutes traffic to alternate models or clusters according to your policy.
      • Maintains reasonable latency and error rates while capacity is reduced.
      • Surfaces clear metrics and events for incident response.
  5. Compare to one-model-per-node baselines.

    • If you have GPU-based baselines, run the same workloads with dedicated nodes per model.
    • Track the difference in utilization, tokens-per-watt, and cost-per-token.
    • SambaStack’s model bundling plus tiered memory will typically show higher effective utilization and fewer stranded resources.

Limitations & Considerations

  • Ecosystem and tooling adaptation:
    While SambaNova exposes OpenAI-compatible APIs, low-level metrics, dashboards, and autoscaling semantics may differ from your existing GPU stack. Plan a short “observability integration” sprint to hook SambaOrchestrator into your logging, metrics, and alerting systems.

  • Model-specific tuning requirements:
    Different models (e.g., DeepSeek-R1 vs smaller Llama variants) have distinct latency/throughput profiles. For robust autoscaling and routing, you’ll want per-model SLOs and tailored scaling thresholds rather than a one-size-fits-all policy.

Pricing & Plans

Specific pricing for SambaStack, SambaOrchestrator, and SambaRack deployments depends on your scale, deployment model (on-prem vs SambaCloud), and workload mix (frontier vs smaller models, agentic intensity, tokens/month). In practice, most teams evaluate via a staged engagement:

  • Proof-of-Concept / Pilot:
    Best for platform teams needing to validate autoscaling, multi-model routing, and OpenAI-compatible integration on representative workloads before committing to rack-scale. Typically includes access to SambaCloud, select models, and guidance on benchmark design.

  • Production Deployment (Rack + Control Plane):
    Best for organizations needing sovereign, high-throughput inference with predictable power envelopes—deploying SambaRack SN40L-16 and/or SN50 in their own data centers, managed via SambaOrchestrator for autoscaling, routing, and monitoring.

Your SambaNova account team can provide concrete cost-per-token and tokens-per-watt comparisons against your current infrastructure.

Frequently Asked Questions

How does SambaOrchestrator’s autoscaling differ from typical GPU-based scaling?

Short Answer: SambaOrchestrator autoscaling is built around SambaStack’s ability to bundle models on RDUs, so it can scale multi-model, agentic workloads on shared nodes instead of adding one model per GPU cluster.

Details:
Traditional GPU clusters often scale by adding more copies of a single model per node group. When you introduce multiple models and agent loops, this leads to fragmentation and low utilization. SambaStack runs multiple frontier-scale models on a single RDU-based node using three-tier memory and dataflow scheduling. SambaOrchestrator’s autoscaling logic takes this into account, scaling capacity for the bundled set of models and routing traffic intelligently across them. The result is higher effective tokens-per-watt and better latency stability for multi-model workflows, even under bursty traffic.

Can I evaluate SambaStack + SambaOrchestrator without rewriting my applications?

Short Answer: Yes. You can port your applications in minutes using OpenAI-compatible APIs and then start measuring autoscaling and routing behavior directly.

Details:
SambaNova exposes OpenAI-compatible APIs for inference on SambaCloud and SambaRack. Most teams simply reconfigure their client libraries with a new base URL and API key, then gradually roll traffic over (via canarying or shadowing). From there, you can:

  • Compare latency and tokens/sec vs your existing stack under identical workloads.
  • Turn on SambaOrchestrator’s autoscaling and monitor how it responds to your real traffic patterns.
  • Introduce multi-model routing progressively—first as an A/B between two models, then as a full agentic workflow spanning several models.
    Because the interface is familiar, your evaluation focuses on infrastructure outcomes (throughput, cost, energy) rather than integration work.

Summary

SambaStack and SambaOrchestrator are built for the workloads that break conventional GPU stacks: multi-model, agentic inference under real-world constraints like power, sovereignty, and strict SLOs. SambaStack provides chips-to-model computing on RDUs and SambaRack systems, with model bundling and tiered memory to maximize tokens-per-watt. SambaOrchestrator adds the production control plane—Auto Scaling | Load Balancing | Monitoring | Model Management—so you can evaluate and run autoscaling and multi-model routing with OpenAI-compatible APIs and enterprise-grade observability.

If you’re currently stitching workflows across multiple endpoints, dedicating nodes per model, or fighting to stay within power envelopes, this combination gives you a path to consolidate, simplify, and scale.

Next Step

Get Started