
How do I switch my app from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint?
Most OpenAI-based applications can be moved to SambaNova Cloud in minutes because SambaNova exposes an OpenAI-compatible /v1/chat/completions endpoint. In practice, you change the base URL, update the API key, map models, and keep the rest of your payload nearly identical.
Quick Answer: To switch from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint, you update your API base URL and key to SambaNova, select a supported model (e.g., Llama, DeepSeek, or gpt-oss), and keep your existing request schema (messages, temperature, streaming, etc.) the same, with only minor model-name and configuration tweaks as needed.
The Quick Overview
- What It Is: A drop-in migration path from OpenAI’s /v1/chat/completions to SambaNova Cloud’s OpenAI-compatible endpoint, powered by SambaNova’s chips-to-model inference stack.
- Who It Is For: Teams already in production on OpenAI (or prototyping against its API) who want higher throughput, lower cost-per-token, or sovereign deployments without rewriting application logic.
- Core Problem Solved: Eliminates “one-model-per-node” infrastructure lock-in and opaque SaaS dependency by letting you redirect traffic to SambaNova’s RDU-powered inference while preserving your existing integration pattern.
How It Works
SambaNova Cloud exposes OpenAI-compatible APIs on top of SambaStack, which runs on SambaNova RDUs (Reconfigurable Dataflow Units). For you as an application developer, the migration path is:
- Swap the base URL to SambaNova Cloud.
- Replace your OpenAI key with a SambaNova API key.
- Choose a SambaNova-supported model (Llama, DeepSeek, gpt-oss, etc.).
- Keep using the same /v1/chat/completions schema (messages, tools, temperature, streaming flags) with minimal or no code changes.
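The four steps above can be isolated in one small helper, as a minimal sketch. The function name, endpoint URLs, and model identifiers below are illustrative assumptions; confirm the exact model IDs available in your SambaNova account.

```python
import os

# Hypothetical helper: only the base URL, API key, and model name change
# between providers; the request body schema stays OpenAI-compatible.
def build_chat_request(provider: str, user_prompt: str) -> dict:
    providers = {
        "openai": {
            "url": "https://api.openai.com/v1/chat/completions",
            "key_env": "OPENAI_API_KEY",
            "model": "gpt-4o-mini",
        },
        "sambanova": {
            "url": "https://api.sambanova.ai/v1/chat/completions",
            "key_env": "SAMBANOVA_API_KEY",
            "model": "llama-3.1-70b-instruct",  # illustrative; check your console
        },
    }
    cfg = providers[provider]
    return {
        "url": cfg["url"],
        "headers": {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get(cfg['key_env'], '')}",
        },
        "body": {
            "model": cfg["model"],
            "messages": [{"role": "user", "content": user_prompt}],
            "temperature": 0.2,
        },
    }

req = build_chat_request("sambanova", "Explain chips-to-model computing.")
```

Switching providers is then a one-word change at the call site, which keeps the migration reviewable in a single diff.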
Behind the scenes, SambaNova’s custom dataflow architecture and three-tier memory system keep models and prompts “hot” in memory, so when your agentic workloads fan out across multiple calls—or multiple bundled models—you still get high tokens-per-second and strong tokens-per-watt efficiency.
1. Update the API Base URL
In a typical OpenAI client, you specify the base URL. To switch to SambaNova Cloud, you:
- Change https://api.openai.com/v1 to SambaNova’s Cloud endpoint (e.g., https://api.sambanova.ai/v1, or your tenant-specific URL as documented in your SambaNova account).
Example (Node.js / TypeScript using fetch):
```ts
const response = await fetch("https://api.sambanova.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.SAMBANOVA_API_KEY}`,
  },
  body: JSON.stringify({
    model: "llama-3.1-70b-instruct", // SambaNova-supported model
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Explain chips-to-model computing." },
    ],
    temperature: 0.2,
    stream: false,
  }),
});
```
2. Replace the API Key
Wherever you currently inject OPENAI_API_KEY, you’ll:
- Add a new secret, e.g., SAMBANOVA_API_KEY.
- Pull it from your environment or secret manager.
- Use it in the Authorization: Bearer header.
Example (Python):
```python
import os
import requests

API_KEY = os.environ["SAMBANOVA_API_KEY"]

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}
```
You do not need a new auth flow or SDK to start—basic HTTP with a Bearer token is sufficient.
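To illustrate that plain HTTP really is enough, here is a dependency-free sketch using only the standard library; the model name is an illustrative assumption, so confirm the exact identifier in your account before use.

```python
import json
import os
import urllib.request

# Stdlib-only sketch: a Bearer token over plain HTTP is sufficient.
# The model ID below is illustrative; check your SambaNova console.
def build_request(prompt: str) -> urllib.request.Request:
    payload = {
        "model": "llama-3.1-70b-instruct",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://api.sambanova.ai/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('SAMBANOVA_API_KEY', '')}",
        },
        method="POST",
    )

req = build_request("Say hello.")
# Send with urllib.request.urlopen(req) once a valid key is configured.
```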
3. Map and Select Models
SambaNova Cloud supports leading open-source and open-weight models including:
- Llama series (Meta) — SambaNova is a launch partner for Llama 4, and SambaCloud supports all three Llama 3.1 variants (8B, 70B, 405B) with fast inference.
- DeepSeek — reasoning and coding-oriented models (e.g., DeepSeek-R1) with high tokens/sec on SambaNova RDUs.
- OpenAI gpt-oss-120b — open-weight OSS model family optimized for inference.
- Additional models advertised in the console or documentation.
Where you previously used something like:
"model": "gpt-4o-mini"
you’ll now set:
"model": "llama-3.1-70b-instruct"
or another SambaNova-supported model that fits your use case (general chat, coding, reasoning, etc.). Model names will be listed in your SambaNova Cloud account and documentation.
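A one-time model mapping can live in a small lookup table. The pairings below are illustrative examples, not official equivalences; choose replacements based on your own quality and latency testing.

```python
# Illustrative one-time mapping from OpenAI model names to
# SambaNova-supported models. Right-hand identifiers are examples;
# confirm exact model IDs in your SambaNova Cloud console.
MODEL_MAP = {
    "gpt-4o-mini": "llama-3.1-70b-instruct",
    "gpt-4o": "llama-3.1-405b-instruct",
    "o1-mini": "DeepSeek-R1",  # reasoning-oriented workloads
}

def map_model(openai_name: str) -> str:
    """Fall back to a general-purpose default for unmapped names."""
    return MODEL_MAP.get(openai_name, "llama-3.1-70b-instruct")
```

Centralizing the mapping in one function means the rest of your code can keep passing OpenAI names unchanged during the transition.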
4. Keep the /v1/chat/completions Schema
The goal is “port your application…in minutes,” so SambaNova adheres to the OpenAI chat schema:
- model
- messages: [{ role, content }]
- temperature, top_p, max_tokens
- stream (with SSE streaming)
- logprobs / top_logprobs (where supported)
- System/user/assistant roles
Typical request body (compatible with your existing OpenAI logic):
```json
{
  "model": "llama-3.1-70b-instruct",
  "messages": [
    { "role": "system", "content": "You are a senior software engineer." },
    { "role": "user", "content": "Help me design an agentic workflow for code review." }
  ],
  "temperature": 0.3,
  "max_tokens": 1024,
  "stream": true
}
```
Your client-side parsing for choices, message, and streaming deltas should continue to work with SambaNova’s endpoint.
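As a sanity check, non-streaming response parsing should look exactly like your OpenAI code today. The response body below is a mocked example, not actual API output.

```python
# Mocked non-streaming response in the OpenAI-compatible shape;
# existing parsing code should work unchanged against SambaNova.
response_json = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Done."},
            "finish_reason": "stop",
        }
    ]
}

content = response_json["choices"][0]["message"]["content"]
```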
5. Streaming & SSE
If you already stream responses from OpenAI (Server-Sent Events):
- Keep stream: true.
- Continue reading the data: lines with delta payloads.
- Terminate when you receive [DONE].
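The loop above can be sketched as a small parser. The frames shown are mocked examples of the OpenAI-style SSE wire format, not captured SambaNova output.

```python
import json

# Minimal SSE-parsing sketch: the wire format mirrors OpenAI's,
# so existing delta-handling code should carry over unchanged.
def collect_stream(lines):
    """Accumulate content deltas from `data:` lines until [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip keep-alives and blank lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

# Mocked frames as they might arrive over SSE:
frames = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
```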
Because SambaNova’s RDUs and tiered memory are tuned for high-throughput inference, you should see competitive or improved tokens/sec—especially on large models like gpt-oss-120b and DeepSeek-R1 (with independent measurements showing DeepSeek-R1 at up to 200 tokens/sec on SambaNova hardware).
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| OpenAI-compatible /v1/chat/completions | Mirrors OpenAI’s chat schema and behavior. | Switch infrastructure without rewriting your app. |
| Chips-to-model inference stack | Runs models on SambaNova RDUs + SambaRack + SambaStack. | High tokens/sec and tokens-per-watt for agentic workloads. |
| Model bundling & flexibility | Lets multiple frontier-scale models run and switch on a single node. | Multi-model agents without one-model-per-node fragmentation. |
Ideal Use Cases
- Best for agentic, multi-step workflows: Because SambaNova’s RDUs and tiered memory minimize data movement and keep models/prompts hot, your agents can make multiple sequential calls (or across several models) without incurring the usual latency and cost penalties.
- Best for teams seeking portability & sovereign inference: Because SambaNova combines OpenAI-compatible APIs with on-prem or sovereign AI data center partners, you can run the same /v1/chat/completions workload in your chosen region or within national borders.
Limitations & Considerations
- Model-name differences: SambaNova Cloud uses its own model identifiers (e.g., specific Llama, DeepSeek, gpt-oss variants). You’ll need a one-time mapping from your current OpenAI model names to SambaNova’s models.
- Feature parity nuances: While the core /v1/chat/completions schema is compatible, some advanced behaviors (e.g., particular tools/functions formats, system-level safety toggles, or beta parameters) may differ. Validate any edge-case usage against SambaNova’s docs and adjust payloads accordingly.
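For example, before relying on tool calls in production, you might assemble an OpenAI-format tools payload like the sketch below and test it against SambaNova's endpoint. The get_weather function, its schema, and the model name are hypothetical placeholders.

```python
# Hypothetical tool definition in the OpenAI tools format, used as a
# probe request to verify feature parity before cutting over traffic.
tools_payload = {
    "model": "llama-3.1-70b-instruct",  # illustrative model ID
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Get current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
```

Sending this once per model you plan to use, and inspecting whether the response contains well-formed tool calls, is a cheap way to surface format differences early.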
Pricing & Plans
SambaNova positions its inference stack as fast agentic inference at a fraction of the cost on the largest models, via:
- SambaCloud: Managed, OpenAI-compatible APIs for Llama, DeepSeek, gpt-oss and more. Ideal if you want to cut over quickly from OpenAI with minimal ops burden.
- SambaRack (SN40L-16 and SN50) + SambaOrchestrator: Rack-level systems with orchestration for teams that need on-prem, co-lo, or sovereign inference. SN40L-16 is optimized for low-power inference (roughly 10 kW average power draw), and SN50 is tuned for fast agentic inference on frontier-scale models.
Pricing specifics (per-token or per-rack) will depend on your contract and deployment model; your SambaNova account team can provide detailed numbers.
- SambaCloud API: Best for developers and product teams needing fast migration from OpenAI, simple usage-based pricing, and “start building in minutes” access via OpenAI-compatible endpoints.
- SambaRack + SambaOrchestrator: Best for infrastructure, platform, and sovereign AI teams needing dedicated racks, tight power/cooling control, and a control plane for auto scaling, load balancing, monitoring, and model management.
Frequently Asked Questions
Do I need to rewrite my OpenAI integration to use SambaNova Cloud?
Short Answer: No. You mainly change the base URL, API key, and model name; the /v1/chat/completions schema stays the same.
Details: SambaNova Cloud intentionally exposes OpenAI-compatible APIs so you can port an application “in minutes.” Your core logic—building a messages array, setting temperature, handling streaming events, reading choices[0].message.content—should work as-is. You only need to:
- Update the base URL to SambaNova’s.
- Swap in a SambaNova API key.
- Choose a supported model (e.g., Llama, DeepSeek, gpt-oss).
- Adjust any advanced or beta parameters that may differ between providers.
Can I keep using streaming and tools with the SambaNova /v1/chat/completions endpoint?
Short Answer: You can keep using streaming; tools support depends on the specific feature and model, so check SambaNova’s docs.
Details: Streaming (stream: true and SSE handling) is supported and designed to take advantage of SambaNova’s high tokens/sec throughput on RDUs. For tools/functions, the core pattern is aligned with OpenAI’s format, but exact support (e.g., JSON schema details, tool-call formats) may vary by model and release. Validate your existing tool payloads against SambaNova’s documentation and run a few integration tests—especially for complex agent frameworks—before re-pointing full production traffic.
Summary
Switching your app from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint is a low-friction path to better inference economics and higher throughput. You keep your existing chat payloads, streaming behavior, and client logic, while SambaNova’s chips-to-model computing stack—RDUs, SambaRack, SambaStack, and SambaOrchestrator—handles the heavy lifting: model bundling, tiered memory for hot prompts, and efficient multi-model agentic workflows. For platform teams trying to escape one-model-per-node limitations and data-center constraints, it’s a pragmatic way to upgrade infrastructure without rebuilding your application.