How do I switch my app from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint?

Most teams discover that their real blocker isn’t the LLM API—it’s the infrastructure behind it. If your app already talks to OpenAI’s /v1/chat/completions, moving to SambaNova Cloud is intentionally low-friction: you keep the same interface, but you run on chips-to-model infrastructure designed for fast, efficient agentic inference.

Quick Answer: You switch by updating your API base URL, swapping in a SambaNova API key, and selecting a SambaNova-hosted model (like deepseek-r1 or gpt-oss-120b) while keeping the same /v1/chat/completions request shape your app already uses.


The Quick Overview

  • What It Is: An OpenAI-compatible /v1/chat/completions endpoint on SambaNova Cloud that lets you port existing OpenAI-based apps with minimal code changes.
  • Who It Is For: Developers and platform teams running LLM apps (chatbots, agents, copilots) who want higher throughput, better tokens-per-watt, or sovereign deployment options without rewriting their code.
  • Core Problem Solved: Eliminates “one-model-per-node” and vendor lock-in constraints by letting you keep your OpenAI integration while shifting inference to SambaNova’s RDU-powered, model-bundling infrastructure.

How It Works

At a protocol level, SambaNova Cloud looks like OpenAI’s chat completions API: same HTTP method, similar headers, similar JSON body, and equivalent response structure. Under the hood, SambaStack routes those calls onto RDUs with custom dataflow processing and a three-tier memory architecture so multiple frontier-scale models can stay hot on the same node.

From an app perspective, migration typically involves three steps:

  1. Switch the endpoint and auth:

    • Update your base_url from https://api.openai.com/v1 to the SambaNova Cloud base URL.
    • Replace your OpenAI key with a SambaNova API key in the Authorization: Bearer header.
  2. Select a SambaNova model:

    • Swap your model field (e.g., from gpt-4.x or gpt-4o-mini) to a SambaNova-hosted model such as:
      • deepseek-r1
      • llama-3.1-70b
      • llama-3.1-405b
      • gpt-oss-120b
    • Keep the rest of the request body (messages, temperature, tools) largely the same.
  3. Tune for performance and costs:

    • Adjust parameters (e.g., max_tokens, temperature, top_p) based on new model behavior.
    • For agentic workloads, leverage SambaNova’s high tokens-per-second throughput to safely increase context size or the number of tool calls per loop.
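
The three steps above amount to swapping a handful of configuration values. A minimal sketch of that idea, assuming hypothetical environment variable names (LLM_BASE_URL, LLM_API_KEY, LLM_MODEL) that are not part of either provider's tooling, and using the example base URL that appears later in this guide:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str
    api_key: str
    model: str


def load_provider_config(env=None):
    """Read provider settings from the environment (variable names are hypothetical)."""
    env = os.environ if env is None else env
    return ProviderConfig(
        base_url=env.get("LLM_BASE_URL", "https://api.openai.com/v1").rstrip("/"),
        api_key=env.get("LLM_API_KEY", ""),
        model=env.get("LLM_MODEL", "gpt-4o-mini"),
    )


# Switching providers then means changing three values, not application code.
samba_env = {
    "LLM_BASE_URL": "https://api.sambanova.ai/v1",  # example; use your actual base URL
    "LLM_API_KEY": "SAMBA_API_KEY",
    "LLM_MODEL": "deepseek-r1",
}
config = load_provider_config(samba_env)
print(config.base_url, config.model)
```

Keeping these values outside the code also makes it easy to roll back to the previous provider during validation.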

Step‑by‑Step: Switching /v1/chat/completions From OpenAI to SambaNova Cloud

1. Replace the Base URL and API Key

If your current OpenAI client is set up like this:

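from openai import OpenAI  # openai>=1.0 client interface

# The client carries the key and base URL, so switching providers
# later only touches these two arguments.
client = OpenAI(
    api_key="OPENAI_API_KEY",
    base_url="https://api.openai.com/v1",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain agentic AI."},
    ],
)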

You adapt it for SambaNova Cloud by:

  • Pointing base_url to the SambaNova Cloud endpoint (example placeholder below).
  • Supplying your SambaNova API key.
  • Swapping the model name.

from openai import OpenAI  # Reuse the OpenAI client with a different base_url

client = OpenAI(
    api_key="SAMBA_API_KEY",
    base_url="https://api.sambanova.ai/v1",  # Example; use your actual base URL
)

response = client.chat.completions.create(
    model="deepseek-r1",  # or llama-3.1-70b, gpt-oss-120b, etc.
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain agentic AI."},
    ],
)

If you’re using raw HTTP instead of a client, the change is equally small:

curl https://api.sambanova.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer SAMBA_API_KEY" \
  -d '{
    "model": "deepseek-r1",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain agentic AI." }
    ]
  }'

The Authorization header format remains Bearer <key>, matching OpenAI.
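
For teams that call the endpoint from Python without an SDK, the same request can be assembled with the standard library. This is a sketch only: build_chat_request is a hypothetical helper, the base URL is the example placeholder used above, and the request is built but not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages):
    """Build (but do not send) a POST to an OpenAI-style chat completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_chat_request(
    "https://api.sambanova.ai/v1",  # example; use your actual base URL
    "SAMBA_API_KEY",
    "deepseek-r1",
    [{"role": "user", "content": "Explain agentic AI."}],
)
print(request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```

Because only the URL and the bearer token differ between providers, the helper takes both as arguments rather than hard-coding either one.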

2. Map Your Model Choices

SambaNova Cloud highlights best-in-class open-source models, including:

  • DeepSeek-R1 – Strong for complex reasoning and agentic loops (Artificial Analysis measured up to 200 tokens/second on SambaNova RDUs).
  • Llama 3.1 (8B, 70B, 405B) – Supported with fast inference; SambaNova was the first to support all three 3.1 variants.
  • OpenAI gpt-oss-120b – Independent benchmarks report gpt-oss-120b running at over 600 tokens per second on SambaNova hardware.

Typical mappings from OpenAI to SambaNova might look like:

  • Lightweight GPT → llama-3.1-8b
  • General-purpose GPT-4-class → llama-3.1-70b or gpt-oss-120b
  • High-complexity, multi-step agents → deepseek-r1 or llama-3.1-405b

The rest of your /v1/chat/completions payload—messages, temperature, max_tokens, top_p, stream, etc.—stays structurally the same.
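
One way to apply these mappings is to replace only the model field and pass everything else through untouched. In the sketch below the tier names and the retarget helper are hypothetical labels for illustration; the model IDs are taken from the mappings above, and the right pairing for your workload depends on your own evaluation.

```python
# Tier names are hypothetical labels; the model ids come from the mappings above.
TIER_TO_MODEL = {
    "lightweight": "llama-3.1-8b",
    "general": "llama-3.1-70b",
    "agentic": "deepseek-r1",
}


def retarget(payload, tier):
    """Return a copy of an existing chat payload with only `model` replaced."""
    if tier not in TIER_TO_MODEL:
        raise ValueError(f"unknown tier: {tier!r}")
    return {**payload, "model": TIER_TO_MODEL[tier]}


original = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Explain agentic AI."}],
    "temperature": 0.3,
}
migrated = retarget(original, "lightweight")
print(migrated["model"])
```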

3. Keep the Chat Request Shape

You can reuse your existing messages format directly:

{
  "model": "gpt-oss-120b",
  "messages": [
    { "role": "system", "content": "You are a senior infra engineer helping with AI rollout." },
    { "role": "user", "content": "Summarize pros and cons of one-model-per-node." }
  ],
  "temperature": 0.3,
  "max_tokens": 512,
  "stream": false
}

SambaNova’s /v1/chat/completions endpoint follows the OpenAI structure, so:

  • You keep role values (system, user, assistant, tool).
  • Multi-turn history is passed in the same array.
  • Responses include familiar fields (choices, usage, etc.).

This is what “port your application…in minutes” looks like in practice—no SDK rewrite, just configuration changes.
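
To illustrate how multi-turn history and the familiar response fields fit together, the sketch below appends an assistant reply to the messages array. The response dictionary is a hand-written sample in the OpenAI response shape with made-up values, not output captured from SambaNova Cloud, and append_assistant_turn is a hypothetical helper.

```python
def append_assistant_turn(messages, response):
    """Return a new history with the assistant reply from `response` appended."""
    reply = response["choices"][0]["message"]
    return messages + [{"role": reply["role"], "content": reply["content"]}]


# Hand-written sample in the OpenAI response shape; values are made up.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Pros: isolation. Cons: idle capacity."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 30, "completion_tokens": 9, "total_tokens": 39},
}

history = [{"role": "user", "content": "Summarize pros and cons of one-model-per-node."}]
history = append_assistant_turn(history, sample_response)
history.append({"role": "user", "content": "Which matters more for agents?"})
print(len(history), sample_response["usage"]["total_tokens"])
```

The next request simply sends the updated history as its messages array.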

4. Streaming Responses (If You Use stream: true)

If your OpenAI integration uses server-sent events (SSE) with stream: true, you can preserve that pattern:

from openai import OpenAI  # openai>=1.0 client interface

client = OpenAI(
    api_key="SAMBA_API_KEY",
    base_url="https://api.sambanova.ai/v1",  # Example; use your actual base URL
)

stream = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Walk through a memory-bound inference example."}],
    stream=True,
)

for chunk in stream:
    # Some chunks may carry no choices or no text (e.g., role-only deltas).
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

SambaNova’s high tokens-per-second throughput on RDUs means streamed responses arrive quickly, which is particularly noticeable in interactive UIs and IDE copilots.
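
If you consume the stream over raw HTTP rather than through a client, each event arrives as a "data:" line. The parser below is a sketch that assumes the endpoint uses the same framing as OpenAI's stream (JSON chunks followed by a "data: [DONE]" sentinel); verify this against the SambaNova Cloud docs. The sample lines are hand-written, not captured from a real response.

```python
import json


def iter_stream_text(lines):
    """Yield content fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines, not captured from a real response.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Memory-"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "bound"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_text(sample_lines))
print(text)
```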

5. Agentic and Multi‑Model Workflows

If your app chains multiple /v1/chat/completions calls across different models—reasoning, retrieval, tool orchestration—SambaStack’s model bundling and three-tier memory architecture are designed to run that entire workflow on a single node.

Practical implications when you switch:

  • Lower routing overhead: Multiple models can stay hot on the same SambaRack node instead of hopping between GPU pools.
  • Higher throughput for loops: With DeepSeek-R1 and gpt-oss-120b running at hundreds of tokens per second on RDUs, you can sustain more tool calls and longer prompts without latency spikes.
  • Better tokens-per-watt: The custom dataflow architecture reduces excess data movement, which is where many agentic systems run into power and cooling barriers.

From your app’s perspective, this still looks like multiple /v1/chat/completions calls—you’re just targeting different model values that run efficiently on the same underlying stack.
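
A minimal sketch of such a chain, where the two steps differ only in their model value. The run_pipeline function and the send callable are hypothetical: send stands in for whatever transport your app already uses, and a fake version is supplied here so the example runs without network access.

```python
def run_pipeline(question, send):
    """Chain two chat calls that differ only in their `model` value.

    `send` is a caller-supplied function (hypothetical) that takes a request
    payload and returns the assistant's text.
    """
    plan = send({
        "model": "deepseek-r1",  # reasoning step
        "messages": [{"role": "user", "content": f"Plan the steps to answer: {question}"}],
    })
    answer = send({
        "model": "gpt-oss-120b",  # drafting step
        "messages": [
            {"role": "system", "content": f"Follow this plan:\n{plan}"},
            {"role": "user", "content": question},
        ],
    })
    return answer


# Stand-in transport so the sketch runs without network access.
calls = []


def fake_send(payload):
    calls.append(payload["model"])
    return f"[{payload['model']} output]"


result = run_pipeline("How do we size a rack?", fake_send)
print(calls, result)
```

Injecting the transport keeps the orchestration logic independent of the provider, so the same pipeline can be tested offline and pointed at either endpoint.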


Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| OpenAI-compatible /v1/chat/completions | Mirrors OpenAI’s chat completions API, including messages format and streaming. | Port your existing OpenAI-based app in minutes without rewriting integrations or clients. |
| Chips-to-model inference on RDUs | Runs models on SambaNova’s Reconfigurable Dataflow Units with three-tier memory architecture. | Higher throughput and better tokens-per-watt for chat and agentic workloads compared to generic GPU stacks. |
| Model bundling & infrastructure flexibility | Lets multiple frontier-scale models share a node and stay hot in memory. | Efficient multi-model, multi-step workflows without “one model per node” constraints or routing overhead. |

Ideal Use Cases

  • Best for production agents and copilots: SambaNova’s stack is purpose-built for agentic inference, combining fast reasoning models like DeepSeek-R1, model bundling on RDUs, and a control plane (SambaOrchestrator) for autoscaling and monitoring across data centers.
  • Best for sovereign or regulated deployments: You can start with SambaCloud’s OpenAI-compatible endpoint, then move to sovereign data center partners or on-prem SambaRack systems while keeping your API pattern stable.

Limitations & Considerations

  • Model behavior and tuning differences: Even with compatible APIs, different models (DeepSeek-R1 vs. your current GPT) may respond differently. Plan for a validation phase where you A/B test prompts, temperature, and max_tokens before flipping all production traffic.
  • Endpoint-specific features: While the core /v1/chat/completions behavior is OpenAI-compatible, certain OpenAI-specific beta features or nonstandard parameters may not map 1:1. Review SambaNova Cloud docs for any gaps and adjust usage accordingly.
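
One defensive pattern during migration is to check outgoing payloads against the parameters you have confirmed work, and log anything else for review. The sketch below is illustrative: the allowlist contains only the parameters this guide mentions and is an assumption, not SambaNova's documented set, and split_supported is a hypothetical helper.

```python
# Assumed allowlist: the parameters this guide mentions. Confirm the real
# supported set in the SambaNova Cloud docs before relying on it.
SUPPORTED_PARAMS = {
    "model", "messages", "temperature", "max_tokens", "top_p", "stream", "tools",
}


def split_supported(payload, supported=SUPPORTED_PARAMS):
    """Split a payload into (kept, dropped) by parameter name."""
    kept = {k: v for k, v in payload.items() if k in supported}
    dropped = sorted(k for k in payload if k not in supported)
    return kept, dropped


payload = {
    "model": "deepseek-r1",
    "messages": [{"role": "user", "content": "Explain agentic AI."}],
    "temperature": 0.3,
    "logit_bias": {"50256": -100},  # example of a parameter to double-check
}
kept, dropped = split_supported(payload)
print(dropped)
```

Logging the dropped names during a trial run gives you a concrete list of parameters to look up in the docs.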

Pricing & Plans

SambaNova offers flexible ways to consume inference capacity, typically aligned to:

  • SambaCloud (managed, OpenAI-compatible APIs): For teams that want to “start building in minutes” with minimal operational overhead. You pay for API usage while SambaNova manages SN40L-16 or SN50-backed infrastructure.
  • SambaRack + SambaOrchestrator (in your data center or sovereign partner): For infrastructure buyers who need rack-level control, power and cooling planning, and sovereign deployment. You buy or contract for rack capacity and run inference under your own operational model.

Specific pricing depends on usage, models (e.g., DeepSeek-R1 vs. Llama 3.1 variants vs. gpt-oss-120b), and deployment model (SambaCloud vs. sovereign/on-prem).

  • SambaCloud API usage: Best for developers and product teams needing rapid iteration and simple, per-call economics.
  • Rack-level deployments (SN40L-16, SN50): Best for platform and infra teams needing predictable throughput, tokens-per-watt efficiency, and integration into existing data center operations.

For detailed pricing and sizing guidance, contact SambaNova directly.


Frequently Asked Questions

Do I have to change my OpenAI SDK or client library to use SambaNova Cloud?

Short Answer: In most cases, no—you can reuse the OpenAI client by pointing it at SambaNova’s base URL and using a SambaNova API key.

Details: Since SambaNova Cloud exposes an OpenAI-compatible /v1/chat/completions endpoint, many users simply:

  • Set base_url on the OpenAI client (or the equivalent setting in your library) to the SambaNova Cloud URL.
  • Replace the OpenAI API key with a SambaNova key.
  • Change the model name to one of the supported SambaNova models.

If you’ve abstracted your LLM provider behind an internal interface, the change is typically limited to configuration. If you’re using custom HTTP clients, it’s a straightforward URL and header update.


Will switching to SambaNova Cloud break my existing prompts or agent workflows?

Short Answer: Your request shape remains the same, but you should plan to revalidate prompts because different models have different behaviors.

Details: The messages array, roles, and parameters like temperature, max_tokens, and top_p work as expected on SambaNova’s /v1/chat/completions. However:

  • DeepSeek-R1, Llama 3.1, and gpt-oss-120b have their own strengths and response styles compared to commercial GPT models.
  • For critical workflows (RAG, agents with tools, code generation), run a calibration phase:
    • Replay representative logs against SambaNova models.
    • Compare quality metrics (accuracy, hallucinations, completion length).
    • Adjust prompts and parameters as needed.

Because SambaNova can deliver high tokens-per-second throughput and favorable tokens-per-watt, you may choose to increase context window usage or the number of steps in your agent loop without breaching latency and cost targets.
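
The calibration phase described above can be sketched as a small replay harness. Everything here is illustrative: replay and send are hypothetical names, the logs and the fake transport are stand-ins so the example runs offline, and word count is only a placeholder for the quality metrics you actually care about.

```python
from statistics import mean


def replay(logged_requests, send, model):
    """Re-run logged chat requests against `model` and collect simple metrics."""
    lengths = []
    for request in logged_requests:
        reply = send({**request, "model": model})
        lengths.append(len(reply.split()))
    return {
        "model": model,
        "n": len(lengths),
        "mean_words": mean(lengths) if lengths else 0.0,
    }


# Stand-in transport and logs so the sketch runs offline.
def fake_send(payload):
    if payload["model"] == "llama-3.1-8b":
        return "short answer"
    return "a somewhat longer answer"


logs = [
    {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Q1"}]},
    {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Q2"}]},
]
for candidate in ("llama-3.1-8b", "deepseek-r1"):
    print(replay(logs, fake_send, candidate))
```

In practice you would replace fake_send with a real call and add accuracy or hallucination checks alongside completion length.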


Summary

Switching your app from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint is primarily a configuration change: update the base URL, supply a SambaNova API key, and select a model like DeepSeek-R1, Llama 3.1, or gpt-oss-120b. Underneath that familiar API, you gain a chips-to-model inference stack—RDUs with dataflow processing and three-tier memory—that’s purpose-built for fast, efficient agentic workloads, supports model bundling on a single node, and can extend from SambaCloud to sovereign or on-prem deployments without rewriting your integration.

