What does SambaNova SambaStack + SambaOrchestrator include, and how do we evaluate it for autoscaling and multi-model routing?

Most teams hit the same wall when they try to run agentic, multi-model LLM workloads in production: one-model-per-node architectures force you to pin hardware to individual models, bolt together multiple endpoints for every agent step, and then fight autoscaling and routing complexity as usage grows. SambaStack and SambaOrchestrator are built specifically to break that pattern.

Quick Answer: SambaStack is SambaNova’s integrated inference stack that runs on RDUs and SambaRack systems, optimized for agentic, multi-model workloads. SambaOrchestrator is the control plane that adds autoscaling, load balancing, monitoring, model management, and multi-model routing across data centers. You evaluate them by how well they keep complex workflows on a single node, sustain throughput (tokens/sec) under load, and simplify operations compared with one-model-per-node setups.


The Quick Overview

  • What It Is: A full-stack inference environment (SambaStack) plus a production control plane (SambaOrchestrator) that together deliver scalable, multi-model AI inference on SambaNova RDUs and SambaRack systems.
  • Who It Is For: Platform, SRE, and infra teams responsible for running LLMs and agentic workflows at scale, especially where power, latency, and data residency are first-order constraints.
  • Core Problem Solved: Eliminates the one-model-per-node anti-pattern by enabling multiple frontier-scale models and agent steps to run end-to-end on a single node, with autoscaling and routing handled centrally.

How It Works

At a high level, SambaStack runs your models; SambaOrchestrator runs your fleet.

SambaStack sits directly on SambaNova’s Reconfigurable Dataflow Unit (RDU) hardware and SambaRack systems, exposing OpenAI-compatible APIs for inference. Under the hood, it uses custom dataflow technology and a three-tier memory architecture to keep models and prompts “hot,” making it practical to bundle multiple large models on the same node and switch between them inside complex agent workflows.

SambaOrchestrator then takes those nodes and turns them into a managed inference fabric. It handles autoscaling, load balancing, monitoring, model lifecycle, and routing policies so your applications can treat the cluster as a single logical endpoint—even when you’re running multiple models, agents, and workloads across data centers.

  1. Execution Layer (SambaStack on RDUs):
    Models are deployed on RDUs within SambaRack systems (including SN40L-16 and SN50). SambaStack’s dataflow execution and tiered memory reduce data movement, maximize tokens-per-watt, and allow multiple large models to coexist on a node without thrashing. This is where you run Llama, DeepSeek, gpt-oss, and your own checkpoints.

  2. Control Plane (SambaOrchestrator across Racks):
    SambaOrchestrator discovers SambaStack instances, registers deployed models, and exposes a managed interface covering autoscaling, load balancing, monitoring, model management, cloud creation, and server management. It routes requests, scales replicas, and keeps health and performance state in sync across racks and data centers.

  3. Developer & Ops Interface (APIs and Policies):
    Developers connect via OpenAI-compatible APIs; platform teams configure autoscaling, routing, and observability policies via SambaOrchestrator. You can evaluate behavior under real-world loads—multi-step agents, long prompts, and model chaining—without rewriting your application stack.
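To make the developer path above concrete, here is a minimal sketch of the OpenAI-compatible request shape, using only Python's standard library. The base URL and model name are placeholders for illustration, not documented SambaNova endpoints.

```python
import json

def build_chat_request(base_url, api_key, model, messages):
    """Build an OpenAI-compatible chat-completions request.

    Returns (url, headers, body) so the actual call can be made with any
    HTTP client. Endpoint and model names here are hypothetical.
    """
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages})
    return url, headers, body

# Point an existing OpenAI-style client at an orchestrator-managed
# endpoint (hypothetical URL) without changing application logic.
url, headers, body = build_chat_request(
    "https://inference.example.internal",   # hypothetical endpoint
    "YOUR_API_KEY",
    "Meta-Llama-3.1-8B-Instruct",           # example model name
    [{"role": "user", "content": "Summarize the incident report."}],
)
```

Because only the base URL, key, and model name change, the same request builder works against any OpenAI-compatible surface, which is what makes like-for-like evaluation practical.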


What SambaStack Includes

From the perspective of someone running production LLM workloads, SambaStack is the “serve models fast and efficiently” layer.

Key components and capabilities include:

  • RDU-optimized inference runtime

    • Executes models on SambaNova’s Reconfigurable Dataflow Unit architecture.
    • Uses dataflow scheduling and three-tier memory to minimize data movement and increase tokens-per-watt.
    • Enables fast inference on frontier-scale models, including gpt-oss-120b (over 600 tokens/sec) and DeepSeek-R1 (up to 200 tokens/sec, measured independently by Artificial Analysis).
  • Model bundling and multi-model support

    • Multiple large models can be loaded and kept hot on the same node.
    • Tiered memory allows caching of models and prompts, letting agent workflows call several models without jumping between nodes.
    • Designed to avoid the typical “model swap penalty” that kills latency in one-model-per-node setups.
  • Integrated stack on SambaRack systems

    • Runs on SambaRack SN40L-16 (fourth-generation system optimized for low-power inference, averaging roughly 10 kW of power draw) and SambaRack SN50 for fast agentic inference on the largest models.
    • Delivers a chips-to-model computing path: from silicon through system and inference runtime.
  • OpenAI-compatible APIs

    • Chat/completions-style APIs aligned with the OpenAI interface.
    • Lets you port existing applications “in minutes” without rewriting clients.
    • Supports BYO checkpoints so you’re not locked into only pre-curated models.
  • Security and compliance-ready foundation

    • Built to operate in standards-compliant racks and enterprise data centers.
    • Forms the computation layer for sovereign AI deployments where data residency matters.

What SambaOrchestrator Includes

SambaOrchestrator turns SambaStack nodes into a managed, observable, and scalable inference service.

Core surface areas:

  • Auto Scaling

    • Policy-driven scaling of model replicas based on utilization, queue depth, and/or latency.
    • Supports bursty workloads and seasonal traffic without manual capacity intervention.
    • Critical when agents spawn multiple concurrent calls per user session.
  • Load Balancing

    • Request distribution across SambaStack instances to maximize utilization and minimize hotspots.
    • Aware of which nodes host which models so multi-model workloads are sent where prompts and weights are already hot.
    • Reduces cross-node chatter that would otherwise erode latency and cost advantages.
  • Monitoring & Observability

    • Centralized metrics across models, nodes, and racks.
    • Key signals typically include tokens/sec, p95/p99 latency, error rates, and utilization per RDU.
    • Supports proactive operations—for example, detecting when agent loops cause unbounded prompt growth and tightening policies.
  • Model Management

    • Registration, versioning, and lifecycle control for models across SambaStack instances.
    • Controlled rollout and rollback paths (e.g., blue/green or canary) for new checkpoints.
    • Ensures that autoscaling and routing respect model versions and compatibility constraints.
  • Cloud Create & Server Management

    • Tools to create and manage logical clusters spanning data centers.
    • Handles server onboarding, health checks, and decommissioning workflows.
    • Provides an operations console for fleet-level management.

In practice, you can think of SambaOrchestrator as the layer that gives you “cluster semantics” for AI: you stop thinking in terms of individual nodes and start operating against policies and SLAs.
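To make the "cluster semantics" idea concrete, here is a toy model-aware routing policy in Python. It illustrates the hot-model-first heuristic described under Load Balancing; it is not SambaOrchestrator's actual algorithm, and the node and model names are invented.

```python
def route_request(model, nodes):
    """Pick a node where the target model is already hot; fall back to
    any available node.

    'nodes' maps node name -> set of models currently hot on that node.
    A real balancer would also weigh load, queue depth, and health; the
    deterministic min() tie-break here just keeps the sketch testable.
    """
    hot = [name for name, hot_models in nodes.items() if model in hot_models]
    if hot:
        return min(hot)          # prefer a node with warm weights/prompts
    if nodes:
        return min(nodes)        # cold fallback: accept a load penalty
    return None                  # no capacity registered

# Hypothetical fleet state: two racks, each holding different hot models.
fleet = {"rack-a": {"llama-70b"}, "rack-b": {"deepseek-r1"}}
```

The point of the sketch is the decision order: route to warm capacity first, and treat cold placement as an explicit, measurable fallback rather than an accident of round-robin.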


Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| Model bundling on SambaStack | Runs multiple frontier-scale models on a single node with tiered memory | Eliminates one-model-per-node, enabling full agent flows on one node |
| SambaOrchestrator autoscaling | Automatically scales replicas based on demand and performance signals | Maintains latency and cost efficiency under dynamic workloads |
| OpenAI-compatible APIs | Exposes a familiar interface for inference and agents | Lets you port applications in minutes with minimal code changes |
| Multi-model-aware load balancing | Routes requests to nodes where models/prompts are already hot | Reduces cold starts and cross-node hops |
| Centralized monitoring | Aggregates performance/health metrics across models and racks | Gives platform teams full visibility for tuning and incident response |
| Model management lifecycle | Registers, versions, and governs models across the fleet | Enables safe updates, rollbacks, and compliance control |

Ideal Use Cases

  • Best for agentic AI with multi-step workflows:
    Because it keeps multiple models and prompts hot on a single node and lets SambaOrchestrator manage scaling and routing, you can run long-lived agent loops, retrieval steps, and tool calls without stitching together separate endpoints.

  • Best for multi-model internal platforms:
    Because SambaStack supports bundling many models and SambaOrchestrator handles model management and routing, platform teams can expose a single internal “AI endpoint” that serves different models based on teams, use cases, or policies.

  • Best for power- and cost-constrained data centers:
    Because SambaRack SN40L-16 is optimized for low-power inference (averaging roughly 10 kW) and RDUs maximize tokens-per-watt, you can achieve high throughput in facilities where power and cooling are hard limits.

  • Best for sovereign and regulated workloads:
    Because the stack runs in standards-compliant racks on-prem or with sovereign partners, and SambaOrchestrator centralizes control, you can meet data residency and regulatory requirements while still using modern agentic architectures.


How to Evaluate Autoscaling & Multi-Model Routing

When you benchmark SambaStack + SambaOrchestrator, you want to emulate how your agents actually behave, not just single-shot prompts. A practical evaluation path:

1. Design representative workloads

  • Traffic patterns:

    • Mix steady baseline traffic with realistic spikes.
    • Include concurrency patterns where each user session translates into multiple parallel model calls.
  • Agent complexity:

    • Use multi-step flows: planning model → tools → reasoning model → summarization model.
    • Include growing prompts (conversation history, retrieval context, tool results).
  • Model mix:

    • Combine heavy models (e.g., 120B-scale reasoning) with lighter models (e.g., routing, classification, summarization).
    • Deploy them together on the same SambaStack nodes to test model bundling.
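The workload-design steps above can be sketched as a small traffic generator. This is an illustration only: the model names are placeholders, and the point is the shape of multi-step, growing-context agent traffic, not any specific model lineup.

```python
def agent_session_steps(user_query, retrieved_chunks):
    """Expand one user session into the multi-step pattern described
    above: planner -> retrieval -> reasoner -> summarizer.

    Model names are hypothetical. Each step's prompt grows because
    conversation history, retrieval context, and tool results accumulate.
    """
    history = [user_query]
    steps = []
    # Step 1: a light planning model sees only the raw query.
    steps.append({"model": "planner-small", "prompt": "\n".join(history)})
    # Step 2: retrieval results are appended, inflating the context the
    # heavy reasoning model must ingest.
    history.extend(retrieved_chunks)
    steps.append({"model": "reasoner-120b", "prompt": "\n".join(history)})
    # Step 3: the summarizer sees the full accumulated context plus the
    # reasoner's output.
    history.append("<reasoner output placeholder>")
    steps.append({"model": "summarizer-small", "prompt": "\n".join(history)})
    return steps
```

Replaying many such sessions concurrently, with realistic retrieval payload sizes, exercises exactly the growing-prompt, multi-model pattern that one-model-per-node setups handle worst.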

2. Measure autoscaling behavior under load

Key metrics and questions:

  • Scale-up responsiveness:

    • How quickly does SambaOrchestrator add capacity when QPS or queue depth spikes?
    • Does p95/p99 latency stay within your SLO during bursts?
  • Scale-down efficiency:

    • Does the system right-size back down after peak?
    • Are you left with over-provisioned capacity, or does it converge toward steady-state utilization targets?
  • Tokens/sec per watt and per dollar:

    • On SN50 or SN40L-16, what tokens/sec do you sustain for your main models?
    • Compare effective tokens-per-watt and tokens-per-dollar against your current GPU-based deployments.
  • Agent loop stability:

    • Under heavy load with long-lived sessions, do agent loops remain stable?
    • Are there timeouts, retries, or cascading failures when autoscaling kicks in?
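A minimal harness for the latency and throughput signals above might compute percentiles and tokens/sec as follows. This uses the nearest-rank percentile method, which is adequate for quick SLO checks but is not a substitute for a full benchmark pipeline.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: sort and pick the ceil(p% * n)-th value."""
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked))
    return ranked[max(0, k - 1)]

def throughput_tokens_per_sec(total_tokens, wall_seconds):
    """Aggregate generation throughput over a measurement window."""
    return total_tokens / wall_seconds

# Example SLO check over hypothetical per-request latencies (ms) gathered
# during a traffic burst.
latencies_ms = [310, 280, 295, 900, 305, 288, 450, 292, 301, 299]
p95 = percentile(latencies_ms, 95)
within_slo = p95 <= 1000  # assumed SLO: p95 under 1 second during bursts
```

Run the same computation in a sliding window across the burst to see how quickly p95/p99 recover after autoscaling adds capacity, and compare sustained tokens/sec on SN50 or SN40L-16 against your current deployment.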

3. Evaluate multi-model routing and model bundling

Focus on whether the infrastructure actually lets you run multiple large models without fragmentation:

  • Single-node workflow execution:

    • For your most complex agent flows, can SambaStack keep the entire workflow on one node most of the time?
    • Measure cross-node hop frequency and its impact on latency.
  • Cold-start behavior:

    • How often does a request land on a node where the target model is not hot?
    • What is the latency penalty when a new model is pulled into memory, and how often does it happen under your mix?
  • Routing policies:

    • Can SambaOrchestrator route by model, tenant, or workload type without you defining separate clusters?
    • How easily can you add a new model to the bundle and route only certain traffic to it (e.g., A/B tests)?
  • Prompt cache effectiveness:

    • For workflows with repeated system prompts or templates, does the tiered memory architecture keep these hot across calls?
    • Look at time-to-first-token across repeated flows to see the cache impact.
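Prompt-cache effectiveness can be estimated exactly as the last bullet suggests: compare time-to-first-token on the first run of a templated flow against repeated runs. A sketch, using invented sample numbers:

```python
def cache_effect(first_ttft_ms, repeat_ttft_ms):
    """Compare mean time-to-first-token (TTFT) for first vs repeated runs
    of the same templated flow.

    A large speedup suggests prompt/weight caching is doing real work; a
    speedup near 1.0 suggests requests keep landing on cold nodes.
    """
    first = sum(first_ttft_ms) / len(first_ttft_ms)
    repeat = sum(repeat_ttft_ms) / len(repeat_ttft_ms)
    return {"first_ms": first, "repeat_ms": repeat, "speedup": first / repeat}

# Hypothetical measurements: cold first runs vs warm repeats.
result = cache_effect([900, 1100, 1000], [180, 220, 200])
```

Track this ratio per workflow template under your real model mix; a falling speedup over time is an early signal that routing is scattering repeated flows across nodes instead of keeping them on warm capacity.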

4. Ops and observability evaluation

Your operations team should treat this like adopting any critical control plane:

  • Monitoring integration:

    • Can SambaOrchestrator’s metrics feed into your existing observability stack (Prometheus, Grafana, Datadog, etc.)?
    • Do you get per-model and per-tenant views, not just per-node metrics?
  • Incident response:

    • When a model misbehaves or a node degrades, how visible is it?
    • How fast can you cordon a node, roll back a model version, or shift traffic?
  • Configuration ergonomics:

    • Are autoscaling and routing configs declarative and versionable (e.g., via GitOps)?
    • Can platform teams manage policies without constantly involving SambaNova support?
  • Multi-region / sovereign scenarios:

    • How clearly can you see and control which workloads run where?
    • Are data residency boundaries enforced through routing and server management semantics?

Limitations & Considerations

  • Ecosystem familiarity:
    SambaStack and SambaOrchestrator are optimized around RDUs and SambaRack systems, not generic GPU instances. Teams coming from GPU-centric tooling will need a short ramp to understand RDU-oriented metrics and deployment patterns. Mitigation: the OpenAI-compatible APIs substantially reduce developer friction; infra teams can focus their learning on control-plane semantics rather than new SDKs.

  • Model portfolio fit:
    While the stack supports BYO checkpoints and leading open models (e.g., Llama, DeepSeek, gpt-oss), some highly specialized or proprietary formats may require conversion or validation. Mitigation: prioritize evaluating your top 3–5 production-critical models first, ensuring they align with SambaStack’s optimized paths.


Pricing & Plans

SambaStack and SambaOrchestrator are delivered as part of SambaNova’s full-stack offerings rather than as standalone SaaS SKUs. Pricing typically aligns with:

  • SambaRack systems (SN40L-16, SN50):
    Rack-level systems sized to your throughput, latency, and power envelope requirements.

  • SambaCloud / managed options:
    Developer-accessible inference via OpenAI-compatible APIs on SambaNova-managed infrastructure.

  • Enterprise and sovereign deployments:
    Custom arrangements for on-prem, co-location, or sovereign partner data centers, often including SambaOrchestrator as the control plane for multi-rack clusters.

Within that, think in terms of:

  • Throughput and concurrency targets (tokens/sec, users, QPS)
  • Power and space envelopes (e.g., number of racks, kW constraints)
  • Data residency and control-plane requirements (single vs multi-region, sovereign needs)

Example framing:

  • Inference Rack Deployment: Best for enterprises needing dedicated, on-prem or co-lo racks for agentic inference, with full SambaOrchestrator control over autoscaling and routing.
  • Managed SambaCloud Access: Best for teams wanting to start quickly on managed infrastructure, using OpenAI-compatible APIs and then later moving steady-state workloads onto dedicated SambaRack systems as they scale.

For precise pricing and architecture sizing, you’ll work directly with SambaNova to map your workloads to racks, RDUs, and control-plane configurations.


Frequently Asked Questions

How does SambaOrchestrator differ from a generic Kubernetes + HPA stack for LLMs?

Short Answer: SambaOrchestrator is built specifically for SambaStack and RDUs, with autoscaling, routing, and monitoring tuned to multi-model, agentic inference rather than generic microservices.

Details:
While you can run LLMs on Kubernetes with Horizontal Pod Autoscalers, that stack is unaware of RDU-level constraints, tiered memory behavior, and model bundling semantics. You end up hand-crafting autoscaling metrics, dealing with cold starts when swapping models, and routing blindly across nodes that don’t share hot prompts or weights. SambaOrchestrator understands the SambaStack runtime and the three-tier memory architecture; it can scale and route based on model placement, tokens/sec, and prompt cache effectiveness, sustaining higher throughput and lower latency for agent workloads.


Can I really port my existing OpenAI-based applications without major code changes?

Short Answer: Yes. SambaStack exposes OpenAI-compatible APIs, so most applications can switch endpoints and update configuration without rewriting business logic.

Details:
SambaStack is intentionally designed to match the OpenAI API surface for common operations (chat completions, completions, and similar patterns). That means you typically only need to:

  • Change the base URL to point at SambaCloud or your SambaOrchestrator-managed endpoint.
  • Update model names to the corresponding SambaStack-hosted models (e.g., Llama, DeepSeek, gpt-oss).
  • Adjust any provider-specific parameters if you were using advanced OpenAI-only features.

Because the underlying semantics are aligned, your agents, prompt construction, and orchestration logic remain the same. This is especially important when evaluating: you can replicate real production traffic patterns quickly and measure SambaStack + SambaOrchestrator behavior under realistic workloads.
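A sketch of the porting step itself: in most OpenAI-style apps, the only knobs that change are the base URL and the model name. The URLs and model names below are placeholders, not verified endpoints or checkpoint names.

```python
import os

def client_config(use_samba):
    """Return the two settings that typically change when repointing an
    OpenAI-style application: base URL and model name.

    Both Samba values are hypothetical placeholders; in practice they come
    from your SambaCloud or SambaOrchestrator-managed deployment.
    """
    if use_samba:
        return {
            "base_url": os.environ.get(
                "SAMBA_BASE_URL", "https://inference.example.internal/v1"
            ),
            "model": "Meta-Llama-3.1-70B-Instruct",  # example deployed model
        }
    return {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"}
```

Feeding this config into your existing client means the same agent code can drive both backends, which is exactly what you want when replaying production traffic for a side-by-side evaluation.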


Summary

SambaStack and SambaOrchestrator are built for the workloads that stress traditional GPU-based inference the most: agentic, multi-model, long-context workflows that run at scale in power- and cost-constrained environments. SambaStack delivers RDU-optimized inference with model bundling and tiered memory so your workflows can stay on a single node; SambaOrchestrator adds the autoscaling, load balancing, monitoring, and model management you need to run fleets across racks and data centers.

To evaluate them, don’t stop at single prompt benchmarks. Run full agent flows, mix multiple models, push traffic spikes, and watch how autoscaling and routing keep latency, tokens/sec, and cost in check. The more your current stack suffers from one-model-per-node limitations, the more clearly the architecture behind SambaStack and SambaOrchestrator will show its value.


Next Step

Get Started