How do I configure multi-model routing on SambaNova so an agent can switch between DeepSeek and Llama during a workflow?

Most agent frameworks assume “one model per endpoint,” which quickly breaks when you want a single agent to tap DeepSeek for reasoning and Llama for fast, general-purpose generation. On SambaNova, the stack is built to run multiple frontier-scale models side by side and switch between them in a single workflow—without you stitching together separate clusters or reworking your code for a new API.

Quick Answer: You configure multi-model routing on SambaNova by deploying both DeepSeek and Llama behind OpenAI-compatible endpoints on SambaCloud, then using routing logic in your agent (or a small gateway service) to choose the right model per step. SambaStack and the RDU architecture handle the “model bundling” and switching on a single node, while SambaOrchestrator manages autoscaling, load balancing, and monitoring across data centers.

The Quick Overview

- What It Is: A production pattern for routing a single agent’s workflow between DeepSeek and Llama on SambaNova, using OpenAI-compatible APIs and SambaOrchestrator to manage model switching, scaling, and observability.
- Who It Is For: Platform and infra teams running agentic inference at scale who need DeepSeek’s high‑end reasoning and Llama’s speed in the same pipeline, without dedicating a node per model.
- Core Problem Solved: Avoids the “one-model-per-node” anti‑pattern by letting multiple frontier-scale models share the same RDU-backed infrastructure, while your code sees a simple, familiar API surface.

How It Works

On SambaNova, multi-model routing is an overlay on top of an inference stack designed for agentic AI and multiple large models running together. You expose DeepSeek and Llama as named models via OpenAI-compatible endpoints, then implement routing logic that selects the model per call based on task type, cost/latency targets, or agent state.

Under the hood:

- SambaStack runs on RDUs with custom dataflow and a three‑tier memory architecture, so multiple models and prompts can stay “hot” on one node.
- SambaOrchestrator provides the control plane (Auto Scaling, Load Balancing, Monitoring, Model Management) across racks and data centers.
- SambaCloud exposes it all via OpenAI-compatible APIs so you can port an existing agent to SambaNova in minutes, then add routing logic with minimal code changes.

A typical setup has three phases:

1. Provision & register models (DeepSeek + Llama):
   - Deploy DeepSeek (e.g., DeepSeek‑R1) and your chosen Llama model onto SambaRack SN50 or SN40L‑16 via SambaStack.
   - Confirm each shows up as an addressable model in SambaOrchestrator and SambaCloud. The IDs used throughout this guide (deepseek-r1, llama-4-405b) are illustrative; substitute the exact IDs your deployment lists.
2. Expose OpenAI-compatible endpoints:
   - Use SambaCloud’s OpenAI-compatible APIs so each model is callable with a model field, with no new SDKs required.
   - Optionally create logical aliases (e.g., reasoning-model, general-model) if you want a layer of indirection over the physical models.
3. Implement multi-model routing in your agent or gateway:
   - Add routing rules: send coding/math/reasoning-heavy steps to DeepSeek; send chat, rewriting, or low-latency steps to Llama.
   - Use SambaOrchestrator metrics to refine rules (e.g., shift long‑running tasks to DeepSeek when you care more about quality than latency).
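The logical aliases mentioned above can live in your own code rather than in the platform. The following is a minimal sketch under that assumption: `DEFAULT_ALIASES`, `resolve_model`, and the `MODEL_ALIAS_*` environment variables are hypothetical names invented for illustration, and the model IDs are the placeholder IDs used in this guide.

```python
import os
from typing import Mapping, Optional

# Hypothetical alias table: logical name -> illustrative model ID.
DEFAULT_ALIASES = {
    "reasoning-model": "deepseek-r1",
    "general-model": "llama-4-405b",
}


def resolve_model(alias: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the physical model ID behind a logical alias.

    An environment variable such as MODEL_ALIAS_REASONING_MODEL (an assumed
    naming scheme) overrides the default, so a model can be swapped without
    a code change.
    """
    env = os.environ if env is None else env
    override_key = "MODEL_ALIAS_" + alias.upper().replace("-", "_")
    if override_key in env:
        return env[override_key]
    if alias not in DEFAULT_ALIASES:
        raise ValueError(f"unknown model alias: {alias!r}")
    return DEFAULT_ALIASES[alias]
```

Agent code then asks for `resolve_model("reasoning-model")` and never hard-codes a physical model ID, which keeps version upgrades to a one-line configuration change.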

Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| OpenAI-compatible multi-model endpoints | Exposes DeepSeek, Llama, and other models via standard OpenAI-style APIs | Port agents in minutes and add routing via a simple `model` parameter |
| Model bundling on RDUs | Runs multiple frontier-scale models on the same RDU-based node with a three-tier memory architecture that keeps models and prompts hot | Avoids “one-model-per-node,” improves tokens-per-watt, and reduces latency when switching models |
| SambaOrchestrator control plane | Provides Auto Scaling \| Load Balancing \| Monitoring \| Model Management across racks and data centers | Keeps multi-model workflows stable at scale, with observability for routing decisions and capacity planning |

Step-by-Step: Configuring Multi-Model Routing

1. Decide your DeepSeek–Llama split

Before touching config, define how your agent should use each model:

DeepSeek (e.g., DeepSeek-R1):
- Strengths: advanced reasoning, coding, math; often preferred for complex tool‑use and long chains of thought.
- Performance: up to 200 tokens/second on SambaNova RDUs, measured independently by Artificial Analysis.
- Typical roles: “planner” or “expert” steps in an agent loop.
Llama (e.g., Llama 4 series):
- Strengths: fast, general-purpose chat, summarization, rewriting, high throughput for user-facing responses.
- Typical roles: conversational front-end, summarizer, paraphraser, retrieval answerer.

Define simple rules like:

Use DeepSeek when:
- The task is tagged reasoning, code, or math.
- The context window is large and you need reliable stepwise reasoning.
Use Llama when:
- Responding to user chat.
- Summarizing documents or search results.
- Doing low-cost, high-frequency operations.

You’ll encode these rules in your routing logic later.
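One way these rules might be written down before any API call is involved is as a small first-match rule table. This is a sketch, not a prescribed format: the `Step` fields, the `LARGE_CONTEXT_TOKENS` threshold, and the function name are assumptions made for illustration, and the model IDs are the placeholders used in this guide.

```python
from dataclasses import dataclass, field

REASONING_TAGS = {"reasoning", "code", "math"}
# Assumed cut-off for a "large" context; tune it against your own traces.
LARGE_CONTEXT_TOKENS = 16_000
DEFAULT_MODEL = "llama-4-405b"


@dataclass
class Step:
    """Hypothetical description of one agent step."""
    tags: set = field(default_factory=set)
    context_tokens: int = 0
    needs_stepwise_reasoning: bool = False


# Rules are checked in order; the first predicate that matches wins.
RULES = [
    (lambda s: bool(s.tags & REASONING_TAGS), "deepseek-r1"),
    (lambda s: s.context_tokens >= LARGE_CONTEXT_TOKENS
        and s.needs_stepwise_reasoning, "deepseek-r1"),
]


def choose_model(step: Step) -> str:
    """Return the model for a step, falling back to Llama for everything else."""
    for predicate, model in RULES:
        if predicate(step):
            return model
    return DEFAULT_MODEL
```

Keeping the rules in a list makes it easy to review them with the team and to add or reorder a rule without touching the calling code.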

2. Deploy DeepSeek and Llama on SambaNova

Assuming you’re using SambaRack + SambaOrchestrator:

Allocate RDUs and racks
- Choose the system:
  - SambaRack SN50 for “fast agentic inference at a fraction of the cost” on large models and complex agents.
  - SambaRack SN40L‑16 for low-power inference (average draw of ~10 kW) when energy and density constraints dominate.
- Plan capacity so DeepSeek and Llama can be bundled: both run on the same racks, and SambaStack handles partitioning and parallelism.
Register models with SambaStack
- Load the checkpoints for DeepSeek and Llama. SambaNova supports bringing your own checkpoints in addition to curated model offerings.
- Use SambaFlow (the compilation layer) to compile each model for the available RDUs. SambaFlow handles multi-chip data‑parallel and model‑parallel strategies automatically.
Verify model availability in SambaOrchestrator
- Each model should show up as a managed entity with health, capacity, and autoscaling configuration.
- Tag models with metadata like family=deepseek or family=llama, tier=reasoning / tier=chat to support future routing policies.

3. Expose OpenAI-compatible APIs

SambaCloud provides OpenAI-compatible APIs so your agent just needs a base URL and an API key.

Configure API endpoints
- For example, you might end up with:
```
POST https://api.sambanova.ai/v1/chat/completions
```
- Models available might include:
```
["deepseek-r1", "llama-4-405b"]
```

Test direct calls to each model

```bash
curl https://api.sambanova.ai/v1/chat/completions \
  -H "Authorization: Bearer $SN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1",
    "messages": [
      {"role": "user", "content": "Solve this math problem step by step: ∫ x^2 dx"}
    ]
  }'
```

And for Llama:

```bash
curl https://api.sambanova.ai/v1/chat/completions \
  -H "Authorization: Bearer $SN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-4-405b",
    "messages": [
      {"role": "user", "content": "Summarize the main points of this article in 3 bullet points."}
    ]
  }'
```

If you already integrate with OpenAI today, the only changes are the base URL, the API key, and the model names; everything else can remain the same.
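The same per-model check can be scripted so it runs in CI or after a deployment. The sketch below uses only the Python standard library and assumes the response follows the OpenAI chat completion shape (`choices[0].message.content`); `build_chat_request`, `extract_text`, and `smoke_test` are hypothetical helper names, and the model IDs are the placeholders used in this guide.

```python
import json
import urllib.request

BASE_URL = "https://api.sambanova.ai/v1"


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for one model."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_text(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-style completion body."""
    return completion["choices"][0]["message"]["content"]


def smoke_test(api_key: str, models: list) -> dict:
    """Send one short prompt to each model and return the replies by model ID."""
    replies = {}
    for model in models:
        request = build_chat_request(api_key, model, "Reply with the single word: ready")
        with urllib.request.urlopen(request, timeout=60) as response:
            replies[model] = extract_text(json.load(response))
    return replies
```

Calling `smoke_test(os.environ["SN_API_KEY"], ["deepseek-r1", "llama-4-405b"])` makes two network calls; an HTTP error for either model raises, which is usually what you want from a deployment check.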

4. Implement routing logic in your agent

There are two pragmatic patterns:

Pattern A: Routing inside the agent code

You decide per call which model to use, by mapping “task types” to models.

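Example (Python, using the OpenAI SDK against the OpenAI-compatible endpoint):

```python
import os

from openai import OpenAI  # pip install openai

# The OpenAI SDK works unchanged; only the base URL and API key differ.
client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key=os.environ["SN_API_KEY"],
)

REASONING_MODEL = "deepseek-r1"
GENERAL_MODEL = "llama-4-405b"


def route_model(task_type: str) -> str:
    """Map a task type to a model ID; anything unrecognized goes to Llama."""
    if task_type in {"reasoning", "math", "code"}:
        return REASONING_MODEL
    return GENERAL_MODEL


def call_agent(task_type: str, messages: list[dict]) -> str:
    """Send one agent step to whichever model the routing rule selects."""
    model_name = route_model(task_type)
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        # Lower temperature for reasoning steps, higher for conversational ones.
        temperature=0.2 if model_name == REASONING_MODEL else 0.7,
    )
    # content can be None (for example on a tool call), so fall back to "".
    return response.choices[0].message.content or ""
```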

Here, SambaStack + RDUs handle the model bundling; you just choose the model string.
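To make the mid-workflow switch concrete, here is a sketch of a two-step flow: DeepSeek solves, then Llama rewrites the result for the user. The `complete` callable is injected so the flow can be exercised without a network connection; `run_workflow`, `strip_reasoning`, and the prompts are hypothetical. Stripping `<think>...</think>` blocks assumes the reasoning model returns its chain of thought inline in that form, which you should confirm against real responses.

```python
import re
from typing import Callable

# (model, messages) -> assistant text; wrap your real client call in this shape.
Complete = Callable[[str, list], str]

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Drop inline <think> blocks so only the final answer is passed on."""
    return THINK_BLOCK.sub("", text).strip()


def run_workflow(question: str, complete: Complete) -> str:
    """Step 1: DeepSeek works the problem. Step 2: Llama phrases the reply."""
    solution = complete("deepseek-r1", [
        {"role": "user", "content": f"Work out a step-by-step solution: {question}"},
    ])
    return complete("llama-4-405b", [
        {"role": "system", "content": "Rewrite the solution below as a short, friendly reply."},
        {"role": "user", "content": strip_reasoning(solution)},
    ])
```

With the OpenAI-compatible `client` from Pattern A, `complete` can be a thin wrapper such as `lambda model, messages: client.chat.completions.create(model=model, messages=messages).choices[0].message.content`.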

Pattern B: Routing via a gateway service

If you don’t want your application code to know about specific models, add a small gateway:

1. App calls POST /agent-chat.
2. Gateway inspects the request (e.g., task_type, user, or cost tier).
3. Gateway forwards the request to SambaCloud with the appropriate model.

This lets you swap DeepSeek or Llama versions without touching application code—just change routing rules in the gateway or in SambaOrchestrator.
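The core of such a gateway is a pure translation from the app-level request to a chat-completions body. The sketch below shows only that translation and leaves the HTTP server to whatever framework you already use. The request fields (`task_type`, `cost_tier`, `messages`), the "economy" tier rule, and the function name are assumptions for illustration.

```python
# Hypothetical routing table owned by the gateway, not by application code.
ROUTES = {
    "reasoning": "deepseek-r1",
    "code": "deepseek-r1",
    "math": "deepseek-r1",
}
DEFAULT_MODEL = "llama-4-405b"


def to_upstream_payload(app_request: dict) -> dict:
    """Translate a POST /agent-chat body into a chat-completions body.

    The app never names a model: any "model" field it sends is ignored and
    the gateway decides from task_type and cost_tier.
    """
    if "messages" not in app_request:
        raise ValueError("request must include 'messages'")
    if app_request.get("cost_tier") == "economy":
        # Assumed policy: economy callers always get the general model.
        model = DEFAULT_MODEL
    else:
        model = ROUTES.get(app_request.get("task_type", ""), DEFAULT_MODEL)
    return {"model": model, "messages": app_request["messages"]}
```

Because the application only ever sends `task_type`, upgrading a model version means editing `ROUTES` in one place.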

5. Use SambaOrchestrator to scale and observe

Multi-model routing only works in production if you can see and manage traffic patterns.

SambaOrchestrator lets you:

Auto Scale
- Configure independent autoscaling for DeepSeek and Llama pools based on QPS, queue depth, or latency.
- Example policy: scale DeepSeek more conservatively, but keep a warm baseline, while Llama scales aggressively for bursty chat traffic.
Load Balancing
- Route calls across multiple SambaRack nodes, preserving model locality where possible to keep prompts and weights hot in RDU memory.
Monitoring
- Track per-model metrics: token throughput, latency, error rates, tokens-per-watt, and utilization.
- Use these metrics to refine routing—for example, send more analytic workloads to DeepSeek when Llama nodes are saturated.
Model Management
- Safely roll out new DeepSeek or Llama versions via canary or traffic-splitting rules at the routing layer (e.g., 10% of reasoning tasks to a hypothetical deepseek-r1.1).
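A canary split at the routing layer can be done in your own code with a deterministic hash, so the same request or session always lands on the same version. This is a sketch of that general technique, not a SambaOrchestrator feature; `pick_version` is a hypothetical helper, and the model IDs in the usage note are illustrative.

```python
import hashlib


def pick_version(request_id: str, stable: str, canary: str, canary_percent: int) -> str:
    """Send roughly canary_percent of IDs to the canary model, the rest to stable.

    Hashing the ID (rather than drawing a random number) keeps the choice
    stable across retries for the same request or session.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable
```

For example, `pick_version(session_id, "deepseek-r1", "deepseek-r1.1", 10)` would route about one in ten sessions to the newer version; comparing per-model latency and error metrics for the two IDs then tells you whether to raise the percentage.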

Because the underlying architecture is optimized for inference—dataflow RDUs, tiered memory, and model bundling—you avoid the overhead of bouncing between separate GPU clusters for each model.

Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| OpenAI-compatible APIs | Expose DeepSeek, Llama, and other models via the same Chat Completions interface you already use | Port existing agents with minimal changes; routing is just changing the `model` |
| Multi-model bundling on RDUs | Hosts multiple large models on the same RDU-backed node, leveraging three-tier memory to keep weights and prompts in fast-access tiers | Fast switching between DeepSeek and Llama, higher tokens-per-watt, better hardware utilization |
| SambaOrchestrator routing observability | Provides per-model metrics and scaling policies across racks/data centers | Keeps multi-model workflows reliable at scale and supports data-backed routing decisions |

Ideal Use Cases

Best for agentic workflows with specialized roles:
Because it lets a single agent use DeepSeek as a high‑precision “planner” or “solver” and Llama as a fast “communicator,” all on shared infrastructure without manual model shuffling.
Best for enterprises consolidating inference clusters:
Because it replaces multiple single‑model GPU clusters with a unified RDU-based deployment where DeepSeek, Llama, and other models run together, simplifying ops while hitting cost and power budgets.

Limitations & Considerations

Per-model capacity planning:
You still need to size DeepSeek and Llama capacity separately based on workload mix. Use SambaOrchestrator metrics to iterate; early on, over‑provision DeepSeek if you expect heavy reasoning traffic.
Routing policy complexity:
Sophisticated routing (dynamic quality/cost tradeoffs, user-level tiers) adds complexity. Start with simple, tag-based rules (reasoning vs. chat) and evolve toward more advanced policies as you gather traces and metrics.
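For a first capacity estimate, the workload mix translates into per-model token demand with simple arithmetic. The sketch below is a back-of-the-envelope model under stated assumptions: every input number is a placeholder you supply, both function names are hypothetical, and real sizing should come from measured metrics.

```python
import math


def required_tokens_per_second(
    requests_per_second: float,
    deepseek_share: float,
    deepseek_tokens_per_request: float,
    llama_tokens_per_request: float,
) -> dict:
    """Split total request rate by routing share and convert to output tokens/s."""
    if not 0.0 <= deepseek_share <= 1.0:
        raise ValueError("deepseek_share must be between 0 and 1")
    return {
        "deepseek": requests_per_second * deepseek_share * deepseek_tokens_per_request,
        "llama": requests_per_second * (1 - deepseek_share) * llama_tokens_per_request,
    }


def concurrent_streams(demand_tokens_per_second: float, per_stream_tokens_per_second: float) -> int:
    """Streams that must run in parallel to meet demand at a given per-stream speed."""
    return math.ceil(demand_tokens_per_second / per_stream_tokens_per_second)
```

As a worked example, 20 requests/s with 25% routed to DeepSeek at 1,200 output tokens each and the rest to Llama at 300 tokens each gives 6,000 tokens/s for DeepSeek and 4,500 tokens/s for Llama. At an assumed 200 tokens/s per stream, DeepSeek would need about 30 concurrent streams.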

Pricing & Plans

SambaNova offers flexible options aligned to how you deploy multi-model workloads:

SambaCloud (managed): Best for teams who want to start building in minutes using OpenAI-compatible endpoints for DeepSeek, Llama, and other models, without managing racks or data centers.
SambaRack + SambaOrchestrator (self-hosted / sovereign): Best for organizations needing on-prem or sovereign AI deployments, fine-grained control over power/cooling, and rack-level optimization of multi-model agentic inference.

Pricing and throughput expectations (tokens/sec, tokens-per-watt) depend on your chosen models, racks (SN40L‑16 vs. SN50), and deployment footprint. A SambaNova solutions team can model this against your workload mix (e.g., percentage of DeepSeek vs. Llama tokens, average context length).

Frequently Asked Questions

Do I need separate clusters for DeepSeek and Llama, or can they run on the same SambaNova infrastructure?

Short Answer: They can run on the same RDU-based infrastructure and even share nodes.

Details: SambaStack and the RDU architecture are designed for model bundling. Multiple large models—like DeepSeek-R1 and a Llama 4 series model—can be compiled and deployed to the same SambaRack nodes. The three-tier memory architecture keeps models and prompts hot, so switching between them is efficient. You use SambaOrchestrator to allocate capacity and SambaCloud’s OpenAI-compatible APIs to select the model per call; you don’t need to maintain separate GPU clusters for each model.

How do I switch an existing OpenAI-based agent to use DeepSeek and Llama on SambaNova?

Short Answer: Point your client at SambaNova’s OpenAI-compatible endpoint, update the API key, and set the model field to DeepSeek or Llama.

Details: SambaCloud provides OpenAI-compatible APIs, so existing code using chat.completions can be ported in minutes. The migration steps are:

1. Change the base URL to SambaNova’s endpoint.
2. Replace the API key with your SambaNova key.
3. Update the model names to the DeepSeek and Llama identifiers your SambaNova deployment lists (deepseek-r1 and llama-4-405b are used as illustrative IDs in this guide).
4. Add a small routing function or gateway to choose the model per task.

Because the interface is the same, you don’t need to rewrite your agent logic or adopt a new SDK. You can then leverage SambaOrchestrator for autoscaling and monitoring across your multi-model deployment.

Summary

Configuring multi-model routing on SambaNova is about combining two ideas: keep the agent’s interface simple and familiar (OpenAI-compatible APIs) while letting the infrastructure do the heavy lifting of bundling DeepSeek, Llama, and other models on the same RDU-based stack. You deploy both models via SambaStack, expose them as OpenAI-style endpoints, and implement routing in your agent or a lightweight gateway. SambaOrchestrator then ensures DeepSeek and Llama scale, stay healthy, and deliver the throughput and cost profile you expect—even for complex agentic workflows.

When you’re ready to design your DeepSeek–Llama mix or benchmark tokens-per-second and tokens-per-watt for your specific workloads, the next step is to talk directly with the SambaNova team.

