
Multi-model serving without one-model-per-node: what platforms support fast model switching and routing?
Most production LLM teams discover the same bottleneck the moment they move from single-model chat to real agentic workflows: one-model-per-node serving can’t keep up. As soon as you’re chaining reasoning models, embeddings, rerankers, tools, and safety filters, dedicating a node per model explodes your cost, blows out tail latency, and wastes a lot of power on idle capacity.
This article looks at what “multi-model serving without one-model-per-node” really means, what architectural features you should look for, and where SambaNova’s stack fits alongside other options when you care about fast model switching and routing at scale.
The Quick Overview
- What It Is: Multi-model serving lets you run and switch between multiple LLMs and AI models on shared infrastructure, routing requests dynamically instead of pinning a node to a single model.
- Who It Is For: Platform teams, infra engineers, and inference operators who need to serve agentic workflows, multi-step pipelines, or many tenants without overprovisioning GPUs.
- Core Problem Solved: Eliminates the one-model-per-node anti-pattern that causes underutilization, routing overhead, and high latency in complex, multi-model AI systems.
How It Works
At a high level, multi-model serving without one-model-per-node requires three capabilities working together:
- A compute architecture that can hold multiple frontier-scale models “hot” and switch between them without thrashing memory.
- An inference stack that supports model bundling on one node and efficiently schedules/tokenizes requests across those models.
- A control plane that handles routing, autoscaling, and observability for hundreds or thousands of concurrent model variants.
On most GPU-first stacks, the main constraint is memory: a large model tends to consume an entire device, forcing you into dedicating GPUs or nodes per model. The result: to chain three models in an agent loop, you’re frequently crossing network boundaries and paying for cold starts and data movement at each hop.
By contrast, platforms designed for multi-model serving focus on:
- Model bundling on a single node so several large models can reside and be scheduled together.
- Efficient tiered memory to reduce data movement when switching models and prompts.
- Shared serving APIs (ideally OpenAI-compatible) so developers can route between models without changing app code.
Below, I break this down into phases and then map it to specific platforms, with a focus on SambaNova’s approach.
End-to-End Flow: From Request to Multi-Model Response
-
Ingress & Routing:
- A request hits your gateway using a standard schema (often the OpenAI Chat Completions API).
- A router or policy engine selects:
- Which model or model bundle to use (e.g., DeepSeek-R1 for complex reasoning, a smaller Llama model for simple Q&A, an embeddings model for retrieval).
- Which node or cluster can serve it with minimal latency given current load.
-
On-Node Model Switching & Execution:
- On a multi-model-capable node:
- Several models are already resident in memory (or in a tiered memory hierarchy).
- The inference runtime schedules tokens across models, avoiding full reloads.
- For agentic workflows, the same node can execute sequential calls across models, keeping prompts and intermediate context “hot.”
- On a multi-model-capable node:
-
Orchestration, Scaling & Observability:
- A control plane monitors:
- Per-model throughput, latency, and tokens/watt.
- Hotness of models and prompts (which bundles should stay on-node).
- It auto-scales model bundles, adjusts routing, and helps you run multi-model inference across racks and data centers.
- A control plane monitors:
Platforms That Support Multi-Model Serving and Fast Switching
There are many “multi-model” marketing claims, but fewer platforms let you effectively escape one-model-per-node at frontier scale. Architecturally, they fall into three broad buckets:
- Inference-first full stacks with custom silicon (SambaNova).
- GPU-native serving platforms with multi-model scheduling (vLLM, TGI, Triton + custom routers, cloud LLM services).
- Hybrid / gateway-centric solutions that sit in front of one-model-per-node backends (LangServe, custom gateways, API aggregators).
1. SambaNova: Chips-to-Model Computing for Multi-Model Agentic Workflows
SambaNova is purpose-built for scalable AI inference and multi-model workflows. Instead of wrapping GPU servers with a generic framework, it builds from the silicon up for model bundling and fast switching.
Key components:
-
SN50 RDU (Reconfigurable Dataflow Unit):
- Custom dataflow architecture with a three-tier memory design.
- Co-founder and Chief Technologist Kunle Olukotun emphasizes that tiered memory lets agents keep models and prompts in cache, so agents can access them quickly.
- Designed to reduce unnecessary data movement—one of the main killers of multi-model throughput.
- Engineered for tokens-per-watt, not just raw FLOPs, which matters when you’re running agents that keep models hot for long sessions.
-
SambaRack SN50 and SN40L-16:
- Rack-level systems optimized for inference efficiency.
- SN40L-16: optimized for low-power inference with an average of 10 kWh, ideal when power and cooling are your gating constraints.
- SambaRack SN50: optimized for fast agentic inference at a fraction of the cost on large frontier models.
-
SambaStack (Inference Stack by Design):
- Built explicitly for infrastructure flexibility and model bundling.
- SambaStack switches between multiple frontier-scale models, enabling complex agentic AI workflows to execute end-to-end on one node.
- Supports terabyte-scale models and automated data/model-parallel mapping, so scaling from single-device dev to rack-scale production uses the same programming model—no tedious hand-tuned cluster programming.
-
SambaOrchestrator (Control Plane):
- Production-grade orchestration with:
- Auto Scaling | Load Balancing | Monitoring | Model Management
- Manages model bundles across data centers so you can:
- Run multiple LLMs and tools per node.
- Balance traffic dynamically.
- Keep hot models resident while cold ones migrate or spin down.
- Production-grade orchestration with:
-
SambaCloud (Developer Access via OpenAI-Compatible APIs):
- OpenAI-compatible APIs so teams can port applications in minutes, keeping their existing client libraries and routing logic.
- Production-grade inference for models like:
- gpt-oss-120b running over 600 tokens/sec.
- DeepSeek-R1 delivering up to 200 tokens/sec, measured independently by Artificial Analysis.
- The same multi-model, model-bundling benefits of SambaStack apply here, but as a managed service.
Why SambaNova is strong for multi-model serving:
- Not one-model-per-node: The architecture is fundamentally designed to host multiple frontier-scale models per node and switch between them.
- Tiered memory + dataflow: Reduces data movement when switching models and prompts—the core problem in multi-model agent loops.
- End-to-end on one node: With SambaStack and SN50, complex workflows can execute without bouncing between disjoint nodes for each model call.
- Tokens-per-watt focus: Multi-model workloads are long-lived; energy efficiency directly affects Opex and density.
If you’re looking to get away from per-model GPU silos and run multi-model, agentic inference as your default, SambaNova is one of the few stacks built around that requirement.
2. GPU-Native Multi-Model Serving Platforms
If you’re staying in the GPU ecosystem, you can still get partial relief from one-model-per-node through careful use of shared memory, sharding, and schedulers. Some notable options:
a. vLLM
- What it is: An open-source high-throughput serving engine for LLMs with advanced KV cache management.
- Multi-model story:
- Supports multiple models on a single GPU or node if they fit within available memory.
- Uses efficient KV cache management for batched inference, reducing overhead for many concurrent requests.
- Fast context handling and paged attention helps when agents keep prompts growing over long sessions.
- Limitations:
- For large models (70B+), you often still end up with one model per GPU or GPU group.
- Cross-model agent flows still typically bounce across nodes unless you architect deliberately for co-location.
b. Hugging Face TGI (Text Generation Inference)
- What it is: Production-oriented inference server optimized for Transformer models.
- Multi-model story:
- Can host multiple models per server and serve them over a shared HTTP API.
- Integrates well with model registries and orchestration layers (Kubernetes, HF Inference Endpoints).
- Limitations:
- Tends towards “one big model per large GPU group” for frontier-scale models.
- Multi-model workflows often rely on external orchestration or separate deployments.
c. NVIDIA Triton Inference Server
- What it is: A general-purpose inference server that can host multiple models (LLMs, CV, ASR, etc.) on GPUs.
- Multi-model story:
- Strong multi-model support: you can load many models and versions, with scheduling and dynamic batching.
- Good for mixed workloads where LLMs are just one component among others.
- Limitations:
- You still pay the memory cost of keeping multiple large LLMs loaded.
- For 70B+ models, you’re usually mapping one model per sizable GPU set or MIG partition.
d. Cloud LLM Services (AWS Bedrock, Azure AI Studio, Google Vertex AI, OpenAI)
- Multi-model story:
- They effectively provide a logical multi-model layer: you can route between hosted models via API (e.g., GPT-4, Claude, Llama).
- Some offer model routing, A/B testing, and basic orchestration.
- Limitations:
- Under the hood, these are still one-model-per-node at the infrastructure level; you don’t control model bundling or physical colocation.
- You pay for network hops and provider-specific latency across services when your agents chain multiple calls.
- Data residency and sovereign inference constraints can limit use in regulated environments.
In practice, these GPU-centric platforms can deliver multi-model serving at moderate scale, but once you push into agent-heavy, frontier-model workloads, you’re managing around the hardware’s per-model memory footprint.
3. Gateways and Orchestration Layers on Top of One-Model-Per-Node
There’s also a growing ecosystem of layers that sit above your backends and make multi-model routing easier, without changing the underlying hardware reality:
-
LangChain / LangServe / LlamaIndex + custom routers:
- Provide application-level routing, fallback, and orchestration logic.
- Useful for quickly building agents that use multiple models and tools.
- Still rely on whatever model serving stack you run underneath (GPU pods, cloud APIs, etc.).
-
API Gateways / Aggregators (e.g., custom NGINX/Envoy setups, model routers, API aggregators):
- Let you configure routing rules (per-tenant, per-use case, per-SLA).
- Don’t solve the underlying model bundling or memory movement problem.
These are important pieces of a multi-model architecture, but they don’t replace the need for a serving layer that can actually host multiple large models efficiently on the same node.
What “Fast Model Switching and Routing” Actually Requires
If your goal is genuinely fast switching and routing for agentic AI—not just “we can call multiple APIs”—you should look for the following characteristics in a platform:
-
Model Bundling on a Single Node
- Ability to host multiple frontier-scale models simultaneously.
- Mechanism for keeping multiple models hot in memory, not reloading them on each call.
-
Tiered Memory or Equivalent
- Explicit three-tier memory or similar architecture to keep:
- Model weights in close, high-bandwidth memory.
- Prompts and KV caches accessible without constant paging.
- Reduced data movement across PCIe and networks, which otherwise dominates latency.
- Explicit three-tier memory or similar architecture to keep:
-
Dataflow- or Cache-Aware Execution
- Dataflow or cache-aware scheduling so work flows through the hardware with minimal idle time.
- Ability to handle both dense and sparse computations efficiently.
-
Integrated Control Plane
- Auto Scaling | Load Balancing | Monitoring | Model Management all aware of multi-model reality:
- Which bundles are hot.
- Which models can co-reside safely.
- How to avoid fragmentation and underutilization.
- Auto Scaling | Load Balancing | Monitoring | Model Management all aware of multi-model reality:
-
Developer-Friendly, OpenAI-Compatible APIs
- So you don’t have to rebuild your apps just to take advantage of new infrastructure.
- Ability to route traffic by simply changing model identifiers, not SDKs or payload formats.
SambaNova is explicitly designed around these points—custom dataflow RDUs, three-tier memory for models and prompts, SambaStack for model bundling, SambaOrchestrator for control, and SambaCloud with OpenAI-compatible APIs for quick adoption.
Where SambaNova Fits vs Other Options
If you’re evaluating platforms against a move away from one-model-per-node thinking, the trade-offs look roughly like:
-
Stay on commodity GPUs with vLLM / TGI / Triton:
- Pros: Familiar ecosystem; reuse existing GPU investments.
- Cons: Frontier-scale models still tend toward one-model-per-node; network hops and memory constraints remain the bottleneck for multi-model agents.
-
Use managed cloud APIs only:
- Pros: Fast to start; many hosted models.
- Cons: Little control over colocation, tokens-per-watt, or data residency; multi-model workflows suffer from cross-service latency and opaque caching.
-
Adopt a chips-to-model inference stack like SambaNova:
- Pros:
- Multi-model bundling by design, not as an afterthought.
- Tiered memory reduces data movement and improves tokens-per-watt.
- End-to-end agentic workflows can run on one node, not as a series of cross-node RPCs.
- OpenAI-compatible APIs minimize migration friction.
- Cons:
- Requires adopting a new full-stack solution (RDUs, SambaRack, SambaStack, SambaOrchestrator), which is an intentional infra choice rather than a minor configuration change.
- Pros:
If your workload is moving toward agentic inference with multiple LLM calls, tools, and safety passes per user request, and your constraints are power, latency, and operational overhead, the architectural bets SambaNova has made around model bundling and tiered memory align directly with your pain points.
Ideal Use Cases for Multi-Model Serving Platforms
- Best for agentic workflows and tool-using LLMs: Because they require fast, repeated switching between models and tools—keeping both models and prompts hot is the only way to maintain latency and cost.
- Best for multi-tenant, multi-use-case platforms: Because you’ll have a wide portfolio of models (by size, vendor, and domain) and need shared capacity instead of fragmented, per-model clusters.
Limitations & Considerations
- Hardware lock-in vs performance: Moving to a chips-to-model stack like SambaNova means committing to RDUs instead of standard GPUs; the trade is higher multi-model efficiency and tokens-per-watt for reduced hardware interchangeability.
- Operational mindset shift: Escaping one-model-per-node means rethinking capacity planning, model packaging, and routing policies; you’ll need to treat model bundles and agent workflows as first-class deployment units.
Summary
Multi-model serving without one-model-per-node is no longer a nice-to-have—it’s the enabling pattern for production-grade, agentic AI. GPU-centric stacks can approximate it at smaller scales or with careful tuning, but they’re fighting underlying hardware constraints around memory and data movement.
Platforms like SambaNova start from a different premise: custom dataflow RDUs, a three-tier memory architecture, and an inference stack (SambaStack + SambaOrchestrator + SambaCloud) designed so multiple frontier-scale models can live and execute on the same node. That’s what allows complex, multi-model workflows to run end-to-end without the latency and cost penalties of stitching endpoints across a fleet of one-model-per-node servers.
If you’re feeling the pain of routing overhead, exploding GPU counts, or power and cooling ceilings as your agents get more complex, it’s probably time to evaluate infrastructure that treats model bundling and fast switching as foundational, not optional.