
How do I switch my app from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint?
Most teams discover that their real blocker isn’t the LLM API—it’s the infrastructure behind it. If your app already talks to OpenAI’s /v1/chat/completions, moving to SambaNova Cloud is intentionally low-friction: you keep the same interface, but you run on chips-to-model infrastructure designed for fast, efficient agentic inference.
Quick Answer: You switch by updating your API base URL, swapping in a SambaNova API key, and selecting a SambaNova-hosted model (like deepseek-r1 or gpt-oss-120b) while keeping the same /v1/chat/completions request shape your app already uses.
The Quick Overview
- What It Is: An OpenAI-compatible /v1/chat/completions endpoint on SambaNova Cloud that lets you port existing OpenAI-based apps with minimal code changes.
- Who It Is For: Developers and platform teams running LLM apps (chatbots, agents, copilots) who want higher throughput, better tokens-per-watt, or sovereign deployment options without rewriting their code.
- Core Problem Solved: Eliminates “one-model-per-node” and vendor lock-in constraints by letting you keep your OpenAI integration while shifting inference to SambaNova’s RDU-powered, model-bundling infrastructure.
How It Works
At a protocol level, SambaNova Cloud looks like OpenAI’s chat completions API: same HTTP method, similar headers, similar JSON body, and equivalent response structure. Under the hood, SambaStack routes those calls onto RDUs with custom dataflow processing and a three-tier memory architecture so multiple frontier-scale models can stay hot on the same node.
From an app perspective, migration typically involves three steps:
1. Switch the endpoint and auth:
   - Update your base_url from https://api.openai.com to the SambaNova Cloud base URL.
   - Replace your OpenAI key with a SambaNova API key in the Authorization: Bearer header.
2. Select a SambaNova model:
   - Swap your model field (e.g., from gpt-4.x or gpt-4o-mini) to a SambaNova-hosted model such as deepseek-r1, llama-3.1-70b, llama-3.1-405b, or gpt-oss-120b.
   - Keep the rest of the request body (messages, temperature, tools) largely the same.
3. Tune for performance and costs:
   - Adjust parameters (e.g., max_tokens, temperature, top_p) based on new model behavior.
   - For agentic workloads, leverage SambaNova’s high tokens-per-second throughput to safely increase context size or the number of tool calls per loop.
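As a rough illustration of the three steps above, the sketch below builds the same chat request for two providers and shows that only the base URL, key, and model differ. The `PROVIDERS` table, the `build_chat_request` helper, and the placeholder keys are assumptions made for this example. The SambaNova base URL and model name are the ones used elsewhere in this article, so confirm both against your account before relying on them.

```python
import json

# Hypothetical provider table: only these three values change between providers.
PROVIDERS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key": "OPENAI_API_KEY",  # placeholder, not a real key
        "model": "gpt-4o-mini",
    },
    "sambanova": {
        "base_url": "https://api.sambanova.ai/v1",  # example; use your actual base URL
        "api_key": "SAMBA_API_KEY",  # placeholder, not a real key
        "model": "deepseek-r1",  # model ID as written in this article
    },
}


def build_chat_request(provider: str, messages: list[dict]) -> dict:
    """Return the URL, headers, and JSON body for a chat completions call."""
    cfg = PROVIDERS[provider]
    return {
        "url": f"{cfg['base_url']}/chat/completions",
        "headers": {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {cfg['api_key']}",
        },
        "body": json.dumps({"model": cfg["model"], "messages": messages}),
    }


messages = [{"role": "user", "content": "Explain agentic AI."}]
before = build_chat_request("openai", messages)
after = build_chat_request("sambanova", messages)
print(after["url"])
```

Keeping the three provider-specific values in one table makes the switch a data change rather than a code change, which is the point the steps above are making.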
Step‑by‑Step: Switching /v1/chat/completions From OpenAI to SambaNova Cloud
1. Replace the Base URL and API Key
If your current OpenAI client is set up like this:
from openai import OpenAI  # requires openai>=1.0

client = OpenAI(
    api_key="OPENAI_API_KEY",
    base_url="https://api.openai.com/v1",
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain agentic AI."},
    ],
)
You adapt it for SambaNova Cloud by:
- Pointing base_url to the SambaNova Cloud endpoint (example placeholder below).
- Supplying your SambaNova API key.
- Swapping the model name.
from openai import OpenAI  # reuse the OpenAI client with a different base_url

client = OpenAI(
    api_key="SAMBA_API_KEY",
    base_url="https://api.sambanova.ai/v1",  # example; use your actual base URL
)
response = client.chat.completions.create(
    # or llama-3.1-70b, gpt-oss-120b, etc.; check the exact model IDs for your account
    model="deepseek-r1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain agentic AI."},
    ],
)
print(response.choices[0].message.content)
If you’re using raw HTTP instead of a client, the change is equally small:
curl https://api.sambanova.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer SAMBA_API_KEY" \
  -d '{
    "model": "deepseek-r1",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain agentic AI." }
    ]
  }'
The Authorization header format remains Bearer <key>, matching OpenAI.
2. Map Your Model Choices
SambaNova Cloud highlights best-in-class open-source models, including:
- DeepSeek-R1 – Strong for complex reasoning and agentic loops (Artificial Analysis measured up to 200 tokens/second on SambaNova RDUs).
- Llama 3.1 (8B, 70B, 405B) – Supported with fast inference; SambaNova was the first to support all three 3.1 variants.
- OpenAI gpt-oss-120b – Runs at over 600 tokens per second on SambaNova hardware, according to independent measurements.
Typical mappings from OpenAI to SambaNova might look like:
- Lightweight GPT → llama-3.1-8b
- General-purpose GPT-4-class → llama-3.1-70b or gpt-oss-120b
- High-complexity, multi-step agents → deepseek-r1 or llama-3.1-405b
The rest of your /v1/chat/completions payload—messages, temperature, max_tokens, top_p, stream, etc.—stays structurally the same.
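One way to apply such a mapping in code is a small lookup that rewrites only the model field and leaves the rest of the payload untouched. This is a minimal sketch: the specific pairings in `MODEL_MAP`, the `DEFAULT_MODEL` fallback, and the helper names are assumptions for illustration, and the SambaNova model IDs are copied from this article rather than from a verified model list.

```python
# Hypothetical mapping from OpenAI model names to the SambaNova model IDs
# mentioned in this article; verify the exact IDs against your account.
MODEL_MAP = {
    "gpt-4o-mini": "llama-3.1-8b",
    "gpt-4o": "llama-3.1-70b",
}
DEFAULT_MODEL = "gpt-oss-120b"  # assumed fallback for unmapped names


def map_model(openai_model: str) -> str:
    """Translate an OpenAI model name, falling back to DEFAULT_MODEL."""
    return MODEL_MAP.get(openai_model, DEFAULT_MODEL)


def port_payload(payload: dict) -> dict:
    """Copy a chat completions payload, changing only the model field."""
    ported = dict(payload)
    ported["model"] = map_model(payload["model"])
    return ported


original = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Explain agentic AI."}],
    "temperature": 0.3,
}
ported = port_payload(original)
print(ported["model"])
```

Because `port_payload` returns a shallow copy, the original payload is left unchanged and can still be sent to the previous provider during side-by-side testing.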
3. Keep the Chat Request Shape
You can reuse your existing messages format directly:
{
  "model": "gpt-oss-120b",
  "messages": [
    { "role": "system", "content": "You are a senior infra engineer helping with AI rollout." },
    { "role": "user", "content": "Summarize pros and cons of one-model-per-node." }
  ],
  "temperature": 0.3,
  "max_tokens": 512,
  "stream": false
}
SambaNova’s /v1/chat/completions endpoint follows the OpenAI structure, so:
- You keep role values (system, user, assistant, tool).
- Multi-turn history is passed in the same array.
- Responses include familiar fields (choices, usage, etc.).
This is what “port your application…in minutes” looks like in practice—no SDK rewrite, just configuration changes.
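To make the "familiar fields" concrete, here is a small sketch of reading `choices` and `usage` from a response body. The `sample_response` dictionary is hand-written in the OpenAI chat completions shape purely for illustration, not captured from any provider, and `extract_reply` is a hypothetical helper name.

```python
# Hand-written sample in the OpenAI chat completions response shape;
# it is illustrative only and not real output from any provider.
sample_response = {
    "id": "example-id",
    "object": "chat.completion",
    "model": "gpt-oss-120b",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Example answer."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 3, "total_tokens": 28},
}


def extract_reply(response: dict) -> tuple[str, int]:
    """Return the first choice's text and the total token count."""
    text = response["choices"][0]["message"]["content"]
    # Treat usage as optional so a missing block does not raise.
    total = response.get("usage", {}).get("total_tokens", 0)
    return text, total


text, total = extract_reply(sample_response)
print(text, total)
```

If your existing code already reads these fields from OpenAI responses, the same accessors should carry over, subject to the compatibility caveats in the limitations section.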
4. Streaming Responses (If You Use stream: true)
If your OpenAI integration uses server-sent events (SSE) with stream: true, you can preserve that pattern:
from openai import OpenAI  # requires openai>=1.0

client = OpenAI(
    api_key="SAMBA_API_KEY",
    base_url="https://api.sambanova.ai/v1",  # example; use your actual base URL
)
stream = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Walk through a memory-bound inference example."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:  # some chunks (e.g., usage-only) carry no choices
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta:
        print(delta, end="", flush=True)
SambaNova’s high tokens-per-second throughput on RDUs means streamed responses arrive quickly, which is particularly noticeable in interactive UIs and IDE copilots.
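If you consume the stream over raw HTTP rather than through a client library, the OpenAI streaming format sends `data: <json>` lines and ends with a `data: [DONE]` sentinel. The sketch below parses that format with the standard library only. It assumes SambaNova's stream follows the same convention, and the sample lines are hand-written rather than captured from a real response.

```python
import json
from typing import Iterable, Iterator


def iter_content(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel used by the OpenAI format
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g., a chunk that carries only usage data
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Hand-written sample lines, not captured from a real stream.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_content(sample)))
```

In a real integration the lines would come from your HTTP client's line iterator instead of a list; the parsing logic stays the same.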
5. Agentic and Multi‑Model Workflows
If your app chains multiple /v1/chat/completions calls across different models—reasoning, retrieval, tool orchestration—SambaStack’s model bundling and three-tier memory architecture are designed to run that entire workflow on a single node.
Practical implications when you switch:
- Lower routing overhead: Multiple models can stay hot on the same SambaRack node instead of hopping between GPU pools.
- Higher throughput for loops: With DeepSeek-R1 and gpt-oss-120b running at hundreds of tokens per second on RDUs, you can sustain more tool calls and longer prompts without latency spikes.
- Better tokens-per-watt: The custom dataflow architecture reduces excess data movement, which is where many agentic systems run into power and cooling barriers.
From your app’s perspective, this still looks like multiple /v1/chat/completions calls—you’re just targeting different model values that run efficiently on the same underlying stack.
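A minimal sketch of such a workflow is shown below: one model drafts a plan and a second model answers using it. The `plan_then_answer` function and the choice of which model plays which role are assumptions for illustration. The completion function is injected so the control flow can be read and run without network access, and `fake_complete` stands in for a real /v1/chat/completions call.

```python
from typing import Callable

# A completion function takes (model, messages) and returns the reply text.
# In a real app it would wrap a /v1/chat/completions call.
CompleteFn = Callable[[str, list[dict]], str]


def plan_then_answer(complete: CompleteFn, question: str) -> str:
    """Two-step workflow: a reasoning model plans, a second model answers."""
    plan = complete(
        "deepseek-r1",  # model ID as written in this article
        [{"role": "user", "content": f"Outline steps to answer: {question}"}],
    )
    return complete(
        "gpt-oss-120b",
        [
            {"role": "system", "content": f"Follow this plan:\n{plan}"},
            {"role": "user", "content": question},
        ],
    )


def fake_complete(model: str, messages: list[dict]) -> str:
    """Stand-in for a real API call; reports which model was asked."""
    return f"[{model}] saw {len(messages)} message(s)"


result = plan_then_answer(fake_complete, "What is model bundling?")
print(result)
```

Only the model string changes between the two calls, so the same client, base URL, and key can serve every step of the chain.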
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| OpenAI-compatible /v1/chat/completions | Mirrors OpenAI’s chat completions API, including messages format and streaming. | Port your existing OpenAI-based app in minutes without rewriting integrations or clients. |
| Chips-to-model inference on RDUs | Runs models on SambaNova’s Reconfigurable Dataflow Units with three-tier memory architecture. | Higher throughput and better tokens-per-watt for chat and agentic workloads compared to generic GPU stacks. |
| Model bundling & infrastructure flexibility | Lets multiple frontier-scale models share a node and stay hot in memory. | Efficient multi-model, multi-step workflows without “one model per node” constraints or routing overhead. |
Ideal Use Cases
- Best for production agents and copilots: SambaNova’s stack is purpose-built for agentic inference, with fast reasoning models like DeepSeek-R1, model bundling on RDUs, and a control plane (SambaOrchestrator) for autoscaling and monitoring across data centers.
- Best for sovereign or regulated deployments: You can start with SambaCloud’s OpenAI-compatible endpoint, then move to sovereign data center partners or on-prem SambaRack systems while keeping your API pattern stable.
Limitations & Considerations
- Model behavior and tuning differences: Even with compatible APIs, different models (DeepSeek-R1 vs. your current GPT) may respond differently. Plan for a validation phase where you A/B test prompts, temperature, and max_tokens before flipping all production traffic.
- Endpoint-specific features: While the core /v1/chat/completions behavior is OpenAI-compatible, certain OpenAI-specific beta features or nonstandard parameters may not map 1:1. Review SambaNova Cloud docs for any gaps and adjust usage accordingly.
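One defensive pattern for the second point is to check outgoing payloads against an allow-list and flag anything else for review before it reaches production. The sketch below is only an illustration: `ASSUMED_SUPPORTED` contains the parameters this article mentions, not a verified list from SambaNova's documentation, and `split_supported` is a hypothetical helper.

```python
# Assumed allow-list for illustration; replace it with the parameters the
# SambaNova Cloud documentation actually lists as supported.
ASSUMED_SUPPORTED = {
    "model", "messages", "temperature", "top_p", "max_tokens", "stream", "tools",
}


def split_supported(payload: dict) -> tuple[dict, list[str]]:
    """Return (payload restricted to allow-listed keys, sorted dropped keys)."""
    kept = {k: v for k, v in payload.items() if k in ASSUMED_SUPPORTED}
    dropped = sorted(k for k in payload if k not in ASSUMED_SUPPORTED)
    return kept, dropped


payload = {
    "model": "deepseek-r1",
    "messages": [{"role": "user", "content": "Hi"}],
    "temperature": 0.3,
    "logit_bias": {"50256": -100},  # example of a parameter to double-check
}
kept, dropped = split_supported(payload)
print(dropped)
```

Logging the dropped keys during the validation phase gives you a concrete list of features to look up in the provider documentation.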
Pricing & Plans
SambaNova offers flexible ways to consume inference capacity, typically aligned to:
- SambaCloud (managed, OpenAI-compatible APIs): For teams that want to “start building in minutes” with minimal operational overhead. You pay for API usage while SambaNova manages SN40L-16 or SN50-backed infrastructure.
- SambaRack + SambaOrchestrator (in your data center or sovereign partner): For infrastructure buyers who need rack-level control, power and cooling planning, and sovereign deployment. You buy or contract for rack capacity and run inference under your own operational model.
Specific pricing depends on usage, models (e.g., DeepSeek-R1 vs. Llama 3.1 variants vs. gpt-oss-120b), and deployment model (SambaCloud vs. sovereign/on-prem).
- SambaCloud API usage: Best for developers and product teams needing rapid iteration and simple, per-call economics.
- Rack-level deployments (SN40L-16, SN50): Best for platform and infra teams needing predictable throughput, tokens-per-watt efficiency, and integration into existing data center operations.
For detailed pricing and sizing guidance, contact SambaNova directly.
Frequently Asked Questions
Do I have to change my OpenAI SDK or client library to use SambaNova Cloud?
Short Answer: In most cases, no—you can reuse the OpenAI client by pointing it at SambaNova’s base URL and using a SambaNova API key.
Details: Since SambaNova Cloud exposes an OpenAI-compatible /v1/chat/completions endpoint, many users simply:
- Set openai.base_url (or equivalent) to the SambaNova Cloud URL.
- Replace the OpenAI API key with a SambaNova key.
- Change the model name to one of the supported SambaNova models.
If you’ve abstracted your LLM provider behind an internal interface, the change is typically limited to configuration. If you’re using custom HTTP clients, it’s a straightforward URL and header update.
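As a sketch of what "limited to configuration" can look like, the example below reads provider settings from environment variables so that switching providers does not touch application code. The `LLM_BASE_URL`, `LLM_API_KEY`, and `LLM_MODEL` variable names and the `LLMSettings` class are hypothetical, chosen for this example only.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str


def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read provider settings from hypothetical LLM_* environment variables."""
    return LLMSettings(
        base_url=env.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        api_key=env["LLM_API_KEY"],  # required; raises KeyError if missing
        model=env.get("LLM_MODEL", "gpt-4o-mini"),
    )


# Switching providers then means changing deployment config, not code.
settings = load_settings({
    "LLM_BASE_URL": "https://api.sambanova.ai/v1",  # example; use your actual URL
    "LLM_API_KEY": "SAMBA_API_KEY",  # placeholder, not a real key
    "LLM_MODEL": "deepseek-r1",  # model ID as written in this article
})
print(settings.base_url, settings.model)
```

The resulting values can be passed straight to whatever client constructor or HTTP wrapper your internal interface already uses.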
Will switching to SambaNova Cloud break my existing prompts or agent workflows?
Short Answer: Your request shape remains the same, but you should plan to revalidate prompts because different models have different behaviors.
Details: The messages array, roles, and parameters like temperature, max_tokens, and top_p work as expected on SambaNova’s /v1/chat/completions. However:
- DeepSeek-R1, Llama 3.1, and gpt-oss-120b have their own strengths and response styles compared to commercial GPT models.
- For critical workflows (RAG, agents with tools, code generation), run a calibration phase:
  - Replay representative logs against SambaNova models.
  - Compare quality metrics (accuracy, hallucinations, completion length).
  - Adjust prompts and parameters as needed.
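The calibration steps above could be sketched as a small replay harness. Everything here is illustrative: `replay`, the stand-in model functions, and the `non_empty` grader are hypothetical, and a real harness would call each provider and use a grader that reflects your own quality criteria.

```python
from statistics import mean
from typing import Callable

CompleteFn = Callable[[str], str]  # prompt -> completion text


def replay(prompts: list[str], complete: CompleteFn,
           is_acceptable: Callable[[str, str], bool]) -> dict:
    """Replay prompts through one model and collect simple quality metrics."""
    outputs = [complete(p) for p in prompts]
    passed = [is_acceptable(p, o) for p, o in zip(prompts, outputs)]
    return {
        "pass_rate": sum(passed) / len(prompts),
        "mean_length": mean(len(o.split()) for o in outputs),  # in words
    }


# Stand-ins for real model calls and a real grader, for illustration only.
def fake_baseline(prompt: str) -> str:
    return "short answer"


def fake_candidate(prompt: str) -> str:
    return "a somewhat longer answer"


def non_empty(prompt: str, output: str) -> bool:
    return bool(output.strip())


logged_prompts = ["Explain agentic AI.", "Summarize one-model-per-node."]
report = {
    "baseline": replay(logged_prompts, fake_baseline, non_empty),
    "candidate": replay(logged_prompts, fake_candidate, non_empty),
}
print(report)
```

Comparing the two entries in `report` side by side shows where prompts or parameters such as max_tokens need adjusting before you move production traffic.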
Because SambaNova can deliver high tokens-per-second throughput and favorable tokens-per-watt, you may choose to increase context window usage or the number of steps in your agent loop without breaching latency and cost targets.
Summary
Switching your app from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint is primarily a configuration change: update the base URL, supply a SambaNova API key, and select a model like DeepSeek-R1, Llama 3.1, or gpt-oss-120b. Underneath that familiar API, you gain a chips-to-model inference stack—RDUs with dataflow processing and three-tier memory—that’s purpose-built for fast, efficient agentic workloads, supports model bundling on a single node, and can extend from SambaCloud to sovereign or on-prem deployments without rewriting your integration.