
How do I configure multi-model routing on SambaNova so an agent can switch between DeepSeek and Llama during a workflow?
Most agent frameworks assume you can cheaply hop between models for different steps—reasoning on one, retrieval or summarization on another. On legacy GPU stacks this usually means stitching calls across endpoints and paying a latency and cost penalty every time you switch. On SambaNova, the goal is different: keep multiple frontier-scale models “hot” on the same RDU-based node and let your agent switch between DeepSeek and Llama in a single workflow with minimal overhead.
This guide walks through how to configure multi-model routing on SambaNova so an agent can move between DeepSeek and Llama during a workflow, using OpenAI-compatible APIs and SambaOrchestrator for control-plane logic.
Quick Answer: You define DeepSeek and Llama as separate model deployments in SambaStack, expose them via OpenAI-compatible routes, then let your agent framework select the model per step using the `model` field or a small router service. SambaOrchestrator handles autoscaling, load balancing, and model management so both models stay available on the same RDU-backed infrastructure.
The Quick Overview
- What It Is: Multi-model routing on SambaNova is the ability to run multiple frontier-scale models—like DeepSeek-R1 and Llama—on SambaStack and dynamically choose which model to call at each step in an agentic workflow, without re-architecting your serving layer.
- Who It Is For: Platform teams and inference operators building agentic AI workflows that mix reasoning, code generation, and domain summarization, and who need to keep latency, tokens-per-watt, and infrastructure cost under control.
- Core Problem Solved: It eliminates the “one-model-per-node” anti-pattern and the overhead of bouncing agent calls across separate GPU clusters, letting agents switch between DeepSeek and Llama inside a single, efficient inference stack.
How It Works
At a high level, you:
- Deploy models on SambaStack (DeepSeek + Llama). Use SambaNova’s full-stack inference infrastructure (RDUs + SambaRack + SambaStack) to host both models. The custom dataflow architecture and three-tier memory keep model weights and prompts hot, so switching doesn’t require cold starts.
- Expose OpenAI-compatible endpoints. SambaCloud and on-prem SambaStack expose models through OpenAI-compatible APIs. Each model gets a logical name (e.g., `deepseek-r1` and `llama-405b`) you can reference directly from your agent; no SDK rewrite is required.
- Implement routing in your agent or a thin router service. Use your agent framework’s routing logic (or a small HTTP service) to select the right model based on task, tool, or policy. SambaOrchestrator handles autoscaling, load balancing, monitoring, and model management behind the scenes so both models scale as the agent’s traffic mix changes.
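As a rough sketch, the three steps above reduce to a per-step model choice against a single endpoint. This is a minimal illustration, not the official client: the endpoint URL, model names, and step labels are assumptions to substitute with your deployment’s values.

```python
# Minimal sketch of per-step model switching against one
# OpenAI-compatible endpoint. URL and model names are assumptions.
import json
import os
import urllib.request

SAMBA_URL = "https://api.sambanova.ai/v1/chat/completions"  # assumed endpoint

# Per-step routing policy: which model serves which kind of workflow step.
STEP_MODEL = {
    "reason": "deepseek-r1",
    "summarize": "llama-4-70b",
}

def build_request(step: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat payload for a workflow step."""
    return {
        "model": STEP_MODEL[step],  # switching models is just this field
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def run_step(step: str, prompt: str) -> str:
    """POST the payload and return the assistant message text."""
    req = urllib.request.Request(
        SAMBA_URL,
        data=json.dumps(build_request(step, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('SAMBA_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because only the `model` field changes between steps, the agent never re-connects to a different serving stack when it hops from DeepSeek to Llama.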
Step 1: Deploy DeepSeek and Llama on SambaNova
From an operator’s point of view, the first step is treating DeepSeek and Llama as separate, managed deployments on the same SambaNova stack:
- Provision inference capacity.
  - For managed: use SambaCloud with RDUs in SambaNova’s environment.
  - For on-prem/colo: deploy SambaRack SN40L-16 (optimized for low-power inference, roughly 10 kW average draw) or SambaRack SN50 (for the fastest agentic inference on large models).
- Enable supported models.
  - DeepSeek: SambaNova supports the 671B-parameter DeepSeek-R1, which excels at coding, reasoning, and mathematics. On SambaNova RDUs, DeepSeek-R1 reaches up to 200 tokens/second (independently measured by Artificial Analysis).
  - Llama: SambaNova is a launch partner for Meta’s Llama 4 series and supports Llama models as first-class workloads.
- Bundle models on the same node where possible. Instead of dedicating nodes to each model, leverage SambaNova’s model bundling and three-tier memory architecture so multiple models can coexist on the same RDU node. This allows:
  - lower-latency model switching for agents,
  - higher tokens-per-watt because less data is moved off-chip,
  - better utilization versus siloed GPU clusters.
From the operator side, this typically appears as multiple model deployments managed by SambaOrchestrator, each addressable by a model ID or name.
Step 2: Expose OpenAI-Compatible Routes
SambaNova deliberately uses OpenAI-compatible APIs so you don’t need to refactor your agent code just to change infrastructure.
You’ll configure:
- One logical model name for DeepSeek, e.g. `deepseek-r1`.
- One logical model name for Llama, e.g. `llama-4-70b` or `llama-4-405b`.
Your endpoint will look similar to:
```
POST https://api.sambanova.ai/v1/chat/completions
Authorization: Bearer $SAMBA_API_KEY
Content-Type: application/json
```
Switching models is just a matter of changing the model field in the JSON payload.
DeepSeek example:
```json
{
  "model": "deepseek-r1",
  "messages": [
    {"role": "system", "content": "You are an expert reasoning agent."},
    {"role": "user", "content": "Prove that the sum of two even numbers is even."}
  ],
  "max_tokens": 512
}
```
Llama example:
```json
{
  "model": "llama-4-70b",
  "messages": [
    {"role": "system", "content": "You are a precise business summarization assistant."},
    {"role": "user", "content": "Summarize the following transcript for executives..."}
  ],
  "max_tokens": 256
}
```
Because SambaNova’s APIs are OpenAI-compatible, existing agents built on the `openai` SDK, LangChain, or similar libraries typically only need:
- a base URL change, and
- swapping `model` names to the SambaNova equivalents.
Porting an application often takes minutes, not days.
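As an illustration, the port can be isolated to a single configuration change. This is a sketch, not official setup code: the base URL is an assumed managed endpoint, and the usage comment assumes the standard `openai` Python SDK (v1+), which accepts a custom `base_url`.

```python
# Sketch: isolate the SambaNova-specific settings in one place so the
# rest of the agent code is unchanged. URL and key name are assumptions.
import os

SAMBA_BASE_URL = "https://api.sambanova.ai/v1"  # was https://api.openai.com/v1

def samba_client_config() -> dict:
    """Constructor kwargs for an OpenAI-compatible client."""
    return {
        "base_url": SAMBA_BASE_URL,
        "api_key": os.environ.get("SAMBA_API_KEY", ""),
    }

# Usage (requires `pip install openai`):
#   from openai import OpenAI
#   client = OpenAI(**samba_client_config())
#   resp = client.chat.completions.create(
#       model="deepseek-r1",  # was e.g. "gpt-4o"; swap per step as needed
#       messages=[{"role": "user", "content": "Hello"}],
#   )
```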
Step 3: Add Routing Logic for Agents
With both models deployed and addressable, routing is just a policy decision in your agent:
- Inline routing inside the agent. Use decision logic (by tool, step index, or content type) to choose `model` at each call.
- Centralized router service. Build a minimal service that:
  - inspects the request (task, metadata, risk level),
  - chooses DeepSeek or Llama,
  - forwards the call to SambaNova’s OpenAI-compatible endpoint with the selected model.
- Orchestrator-driven routing. At larger scale, SambaOrchestrator provides autoscaling, load balancing, monitoring, and model management, so your router doesn’t need to reason about capacity, only business logic.
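The centralized-router option can be sketched as a pure policy function that a thin HTTP wrapper would call before forwarding the request. The task types, risk levels, and model names below are illustrative assumptions, not a fixed SambaNova schema:

```python
# Sketch of a centralized routing policy: map a generic task request
# to a concrete model name. All labels and names are illustrative.
from dataclasses import dataclass

@dataclass
class TaskRequest:
    task_type: str          # e.g. "code", "reasoning", "summarize", "chat"
    risk_level: str = "low" # e.g. "low" or "high"

# Policy table: which model serves which kind of agent step.
ROUTES = {
    "code": "deepseek-r1",
    "reasoning": "deepseek-r1",
    "summarize": "llama-4-70b",
    "chat": "llama-4-70b",
}
DEFAULT_MODEL = "llama-4-70b"

def choose_model(req: TaskRequest) -> str:
    """Return the model name to put in the outgoing `model` field."""
    # High-risk steps always get the heavier reasoning model.
    if req.risk_level == "high":
        return "deepseek-r1"
    return ROUTES.get(req.task_type, DEFAULT_MODEL)
```

Keeping the policy table separate from the forwarding code means routing rules can change without touching capacity concerns, which stay with SambaOrchestrator.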
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Model Bundling on RDUs | Keeps multiple frontier-scale models (DeepSeek, Llama) resident on the same RDU-based node. | Enables low-latency agentic workflows without “one-model-per-node” sprawl or cross-cluster hops. |
| OpenAI-Compatible APIs | Exposes DeepSeek and Llama via familiar /v1/chat/completions and /v1/completions interfaces. | Lets you port existing agent code in minutes—no new SDK, no agent refactor. |
| SambaOrchestrator Control Plane | Manages autoscaling, load balancing, and model management across data centers. | Keeps multi-model routing stable and elastic as traffic patterns shift between DeepSeek and Llama. |
| Three-Tier Memory Architecture | Caches models and prompts close to compute on the SN50 RDU. | Maximizes tokens-per-watt and reduces latency when agents repeatedly switch between models. |
| High-Throughput Inference | Delivers DeepSeek-R1 at up to 200 tokens/second (independent measurement) and >600 tps on gpt-oss. | Supports near real-time agent loops without over-provisioning clusters. |
| Sovereign and Hybrid Deployment Options | Lets you deploy SambaRack + SambaOrchestrator in your own data centers or with sovereign partners. | Keeps data residency and compliance under your control while still enabling centralized multi-model routing. |
Ideal Use Cases
- Best for multi-step agents with specialized tasks: Because you can route complex reasoning and coding to DeepSeek-R1 while using Llama for summarization, classification, or lower-latency steps, all on the same inference stack.
- Best for production workloads under power and cost pressure: Because SambaNova’s custom dataflow RDUs and three-tier memory architecture reduce data movement, improving tokens-per-watt and letting you retire “GPU islands” dedicated to individual models.
Limitations & Considerations
- Model compatibility and behavior differences: DeepSeek and Llama have different strengths (e.g., DeepSeek-R1 for coding/reasoning, Llama for general language tasks). You’ll need explicit routing policies and prompt templates per model so your agent doesn’t treat them as interchangeable.
- Hot model footprint and capacity planning: While SambaNova’s tiered memory and model bundling keep multiple models hot, very large fleets of models may still require careful capacity planning. Use SambaOrchestrator’s monitoring to understand tokens/sec, concurrency, and memory pressure before adding more bundled models.
Pricing & Plans
Exact pricing depends on whether you’re using managed SambaCloud access or deploying SambaRack systems in your own data center, but the planning typically breaks down as:
- SambaCloud / Managed Inference:
  - Consumption-based pricing tied to tokens and model class.
  - Best for teams who want to “start building in minutes” and evaluate DeepSeek + Llama routing without hardware procurement.
- SambaRack + SambaOrchestrator (On-Prem / Colo):
  - Capital + support model for SN40L-16 or SN50 racks, with SambaOrchestrator as the control plane.
  - Best for enterprises needing sovereign AI, strict data residency, or tight integration with existing data center operations.
Example alignment:
- Developer / Pilot Plan: Best for product teams and platform engineers needing a low-friction way to test DeepSeek vs Llama routing and benchmark agent workflows.
- Enterprise / Sovereign Plan: Best for organizations needing full control of data, power budgets, and SLAs, consolidating multi-model agent workloads into a small number of highly efficient SambaRacks.
For specific pricing, deployment sizing, and throughput targets, you should talk directly with the SambaNova team.
Frequently Asked Questions
How do I tell SambaNova which calls should go to DeepSeek vs Llama?
Short Answer: You select the model per request using the model field in the OpenAI-compatible API, or route via a small service that sets this field based on your own rules.
Details:
Once DeepSeek-R1 and Llama are enabled as models on your SambaNova deployment, each is given a model name (for example, `deepseek-r1` and `llama-4-70b`). Your agent simply picks the appropriate name per step:
- For reasoning-heavy or coding steps: `model: "deepseek-r1"`
- For summarization, dialog, or lower-cost steps: `model: "llama-4-70b"`
If you want to centralize policy, create a router service that:
- Accepts a generic task request from your agent (e.g., with a `task_type` or `risk_level` field).
- Translates that to a specific SambaNova model name.
- Calls the SambaNova endpoint with the correct `model`.
SambaOrchestrator ensures that each model deployment is scaled and balanced appropriately, so you don’t have to encode capacity rules in the router itself.
Can I use the same agent code I wrote for OpenAI without major changes?
Short Answer: Yes. SambaNova exposes OpenAI-compatible APIs, so you mainly change the base URL and model names.
Details:
SambaNova’s inference APIs are intentionally OpenAI-compatible. Typical migration steps:
- Change the base URL from `https://api.openai.com/v1/...` to your SambaNova endpoint (e.g., `https://api.sambanova.ai/v1/...` or your on-prem gateway).
- Update the API key to your SambaNova token.
- Swap model IDs to SambaNova’s DeepSeek and Llama names (e.g., `deepseek-r1`, `llama-4-70b`).
- Optionally tune prompts per model to account for differences in style and capabilities.
Because the request/response schema is compatible, LangChain, LlamaIndex, and custom agents that already target OpenAI can be redirected to SambaNova in minutes. From there, you can introduce deeper routing logic and model-specific prompts incrementally, without a full rewrite.
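The last migration step, per-model prompt tuning, can start as a simple template table so the agent stops treating the two models as interchangeable. The system prompts and model names below are illustrative assumptions:

```python
# Sketch: per-model system prompts so routing also adapts the prompt,
# not just the model name. Prompt text here is purely illustrative.
SYSTEM_PROMPTS = {
    "deepseek-r1": "Think step by step and show your reasoning.",
    "llama-4-70b": "Answer concisely in plain business language.",
}

def messages_for(model: str, user_prompt: str) -> list:
    """Build the messages array with the template for the chosen model."""
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[model]},
        {"role": "user", "content": user_prompt},
    ]
```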
Summary
Configuring multi-model routing on SambaNova so an agent can switch between DeepSeek and Llama is primarily an architectural choice, not a code rewrite. You deploy both models on SambaStack, rely on SambaNova’s RDUs and three-tier memory to keep them hot on the same node, expose them via OpenAI-compatible endpoints, and then give your agent routing logic to choose the right model per step. SambaOrchestrator provides the control plane—autoscaling, load balancing, monitoring, and model management—so as your traffic shifts between DeepSeek and Llama, the system adapts without you juggling separate clusters or dealing with “one-model-per-node” constraints.
The result is an agentic workflow that can use the best model for each task while maximizing tokens-per-watt, minimizing latency, and keeping your routing logic straightforward and observable.