How do I switch my app from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint?

Most OpenAI-based applications can be moved to SambaNova Cloud in minutes because SambaNova exposes an OpenAI-compatible /v1/chat/completions endpoint. In practice, you change the base URL, update the API key, map models, and keep the rest of your payload nearly identical.

Quick Answer: To switch from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint, you update your API base URL and key to SambaNova, select a supported model (e.g., Llama, DeepSeek, or gpt-oss), and keep your existing request schema (messages, temperature, streaming, etc.) the same—with only minor model-name and configuration tweaks as needed.


The Quick Overview

  • What It Is: A drop-in migration path from OpenAI’s /v1/chat/completions to SambaNova Cloud’s OpenAI-compatible endpoint, powered by SambaNova’s chips-to-model inference stack.
  • Who It Is For: Teams already in production on OpenAI (or prototyping against its API) who want higher throughput, lower cost-per-token, or sovereign deployments without rewriting application logic.
  • Core Problem Solved: Eliminates “one-model-per-node” infrastructure lock-in and opaque SaaS dependency by letting you redirect traffic to SambaNova’s RDU-powered inference while preserving your existing integration pattern.

How It Works

SambaNova Cloud exposes OpenAI-compatible APIs on top of SambaStack, which runs on SambaNova RDUs (Reconfigurable Dataflow Units). For you as an application developer, the migration path is:

  1. Swap the base URL to SambaNova Cloud.
  2. Replace your OpenAI key with a SambaNova API key.
  3. Choose a SambaNova-supported model (Llama, DeepSeek, gpt-oss, etc.).
  4. Keep using the same /v1/chat/completions schema (messages, tools, temperature, streaming flags) with minimal or no code changes.
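Taken together, the four steps above amount to a configuration swap. A minimal sketch in Python (the env-var names and request-builder helper are illustrative; confirm the base URL and model name in your SambaNova account):

```python
import os

# Provider configurations; only these three values change between providers.
OPENAI_CONFIG = {
    "base_url": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY",
    "model": "gpt-4o-mini",
}
SAMBANOVA_CONFIG = {
    "base_url": "https://api.sambanova.ai/v1",   # confirm in your account
    "api_key_env": "SAMBANOVA_API_KEY",
    "model": "llama-3.1-70b-instruct",           # confirm in your account
}

def build_request(config, messages):
    """Assemble an OpenAI-style chat-completions request for either provider."""
    return {
        "url": config["base_url"] + "/chat/completions",
        "headers": {
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ.get(config["api_key_env"], ""),
        },
        "body": {"model": config["model"], "messages": messages},
    }
```

Switching providers is then a matter of passing SAMBANOVA_CONFIG instead of OPENAI_CONFIG; the messages array and the rest of the body schema are untouched.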

Behind the scenes, SambaNova’s custom dataflow architecture and three-tier memory system keep models and prompts “hot” in memory, so when your agentic workloads fan out across multiple calls—or multiple bundled models—you still get high tokens-per-second and strong tokens-per-watt efficiency.

1. Update the API Base URL

In a typical OpenAI client, you specify the base URL. To switch to SambaNova Cloud, you:

  • From: https://api.openai.com/v1
  • To: SambaNova’s Cloud endpoint (e.g., https://api.sambanova.ai/v1, or your tenant-specific URL as documented in your SambaNova account).

Example (Node.js / TypeScript using fetch):

const response = await fetch("https://api.sambanova.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.SAMBANOVA_API_KEY}`,
  },
  body: JSON.stringify({
    model: "llama-3.1-70b-instruct", // SambaNova-supported model
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Explain chips-to-model computing." },
    ],
    temperature: 0.2,
    stream: false,
  }),
});

2. Replace the API Key

Wherever you currently inject OPENAI_API_KEY, you’ll:

  • Add a new secret, e.g., SAMBANOVA_API_KEY.
  • Pull it from your environment or secret manager.
  • Use it in the Authorization: Bearer header.

Example (Python):

import os
import requests

API_KEY = os.environ["SAMBANOVA_API_KEY"]

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

response = requests.post(
    "https://api.sambanova.ai/v1/chat/completions",
    headers=headers,
    json={
        "model": "llama-3.1-70b-instruct",  # SambaNova-supported model
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)

You do not need a new auth flow or SDK to start—basic HTTP with a Bearer token is sufficient.

3. Map and Select Models

SambaNova Cloud supports leading open-source and open-weight models including:

  • Llama series (Meta) — SambaNova is a launch partner for Llama 4, and SambaCloud supports all three Llama 3.1 variants (8B, 70B, 405B) with fast inference.
  • DeepSeek — reasoning and coding-oriented models (e.g., DeepSeek-R1) with high tokens/sec on SambaNova RDUs.
  • OpenAI gpt-oss-120b — open-weight OSS model family optimized for inference.
  • Additional models advertised in the console or documentation.

Where you previously used something like:

"model": "gpt-4o-mini"

you’ll now set:

"model": "llama-3.1-70b-instruct"

or another SambaNova-supported model that fits your use case (general chat, coding, reasoning, etc.). Model names will be listed in your SambaNova Cloud account and documentation.
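A one-time mapping table keeps the change localized. The pairings below are illustrative, not official equivalences; verify the exact identifiers against the model list in your SambaNova Cloud console:

```python
# Illustrative mapping from OpenAI model names to SambaNova-supported models.
# Verify exact identifiers and pick models that fit your quality/cost needs.
MODEL_MAP = {
    "gpt-4o-mini": "llama-3.1-70b-instruct",  # general chat
    "gpt-4o": "llama-3.1-405b-instruct",      # highest-capability general use
    "o1-mini": "deepseek-r1",                 # reasoning-heavy workloads
}

def map_model(openai_model):
    """Translate an OpenAI model name, failing loudly if unmapped."""
    try:
        return MODEL_MAP[openai_model]
    except KeyError:
        raise ValueError(f"No SambaNova mapping for {openai_model!r}")
```

Failing loudly on unmapped names is deliberate: it surfaces any forgotten call sites during testing rather than silently routing them to a default model.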

4. Keep the /v1/chat/completions Schema

The goal is “port your application…in minutes,” so SambaNova adheres to the OpenAI chat schema:

  • model
  • messages: [{ role, content }]
  • temperature, top_p, max_tokens
  • stream (with SSE streaming)
  • logprobs / top_logprobs (where supported)
  • System/user/assistant roles

Typical request body (compatible with your existing OpenAI logic):

{
  "model": "llama-3.1-70b-instruct",
  "messages": [
    { "role": "system", "content": "You are a senior software engineer." },
    { "role": "user", "content": "Help me design an agentic workflow for code review." }
  ],
  "temperature": 0.3,
  "max_tokens": 1024,
  "stream": true
}

Your client-side parsing for choices, message, and streaming deltas should continue to work with SambaNova’s endpoint.
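Because the response shape is also OpenAI-compatible, existing parsing code carries over. A minimal sketch against an illustrative (non-streaming) response body:

```python
import json

def extract_reply(response_json):
    """Pull the assistant message out of an OpenAI-style chat completion."""
    data = json.loads(response_json)
    return data["choices"][0]["message"]["content"]

# Illustrative response body in the OpenAI-compatible shape.
sample = json.dumps({
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Hello!"},
        "finish_reason": "stop",
    }]
})
```

Calling extract_reply(sample) yields the assistant text exactly as it would from an OpenAI response.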

5. Streaming & SSE

If you already stream responses from OpenAI (Server-Sent Events):

  • Keep stream: true.
  • Continue reading the data: lines with delta payloads.
  • Terminate when you receive [DONE].
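The three rules above can be sketched as a small parser, assuming each SSE event follows OpenAI's "data: {...}" / "data: [DONE]" convention:

```python
import json

def collect_stream(lines):
    """Accumulate content deltas from OpenAI-style SSE 'data:' lines."""
    chunks = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            chunks.append(delta["content"])
    return "".join(chunks)
```

In production you would feed this from the HTTP response body incrementally; the parsing logic is the same.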

Because SambaNova’s RDUs and tiered memory are tuned for high-throughput inference, you should see competitive or improved tokens/sec—especially on large models like gpt-oss-120b and DeepSeek-R1 (with independent measurements showing DeepSeek-R1 at up to 200 tokens/sec on SambaNova hardware).


Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| OpenAI-compatible /v1/chat/completions | Mirrors OpenAI’s chat schema and behavior. | Switch infrastructure without rewriting your app. |
| Chips-to-model inference stack | Runs models on SambaNova RDUs + SambaRack + SambaStack. | High tokens/sec and tokens-per-watt for agentic workloads. |
| Model bundling & flexibility | Lets multiple frontier-scale models run and switch on a single node. | Multi-model agents without one-model-per-node fragmentation. |

Ideal Use Cases

  • Best for agentic, multi-step workflows: Because SambaNova’s RDUs and tiered memory minimize data movement and keep models/prompts hot, your agents can make multiple sequential calls (or across several models) without incurring the usual latency and cost penalties.
  • Best for teams seeking portability & sovereign inference: Because SambaNova combines OpenAI-compatible APIs with on-prem or sovereign AI data center partners, you can run the same /v1/chat/completions workload in your chosen region or within national borders.

Limitations & Considerations

  • Model-name differences: SambaNova Cloud uses its own model identifiers (e.g., specific Llama, DeepSeek, gpt-oss variants). You’ll need a one-time mapping from your current OpenAI model names to SambaNova’s models.
  • Feature parity nuances: While the core /v1/chat/completions schema is compatible, some advanced behaviors (e.g., particular tools/functions formats, system-level safety toggles, or beta parameters) may differ. Validate any edge-case usage against SambaNova’s docs and adjust payloads accordingly.

Pricing & Plans

SambaNova positions its inference stack as fast agentic inference at a fraction of the cost on the largest models, via:

  • SambaCloud: Managed, OpenAI-compatible APIs for Llama, DeepSeek, gpt-oss and more. Ideal if you want to cut over quickly from OpenAI with minimal ops burden.
  • SambaRack (SN40L-16 and SN50) + SambaOrchestrator: Rack-level systems with orchestration for teams that need on-prem, co-lo, or sovereign inference. SN40L-16 is optimized for low-power inference (roughly 10 kW average power draw), and SN50 is tuned for fast agentic inference on frontier-scale models.

Pricing specifics (per-token or per-rack) will depend on your contract and deployment model; your SambaNova account team can provide detailed numbers.

  • SambaCloud API: Best for developers and product teams needing fast migration from OpenAI, simple usage-based pricing, and “start building in minutes” access via OpenAI-compatible endpoints.
  • SambaRack + SambaOrchestrator: Best for infrastructure, platform, and sovereign AI teams needing dedicated racks, tight power/cooling control, and a control plane for autoscaling, load balancing, monitoring, and model management.

Frequently Asked Questions

Do I need to rewrite my OpenAI integration to use SambaNova Cloud?

Short Answer: No. You mainly change the base URL, API key, and model name; the /v1/chat/completions schema stays the same.

Details: SambaNova Cloud intentionally exposes OpenAI-compatible APIs so you can port an application “in minutes.” Your core logic—building a messages array, setting temperature, handling streaming events, reading choices[0].message.content—should work as-is. You only need to:

  • Update the base URL to SambaNova’s.
  • Swap in a SambaNova API key.
  • Choose a supported model (e.g., Llama, DeepSeek, gpt-oss).
  • Adjust any advanced or beta parameters that may differ between providers.

Can I keep using streaming and tools with the SambaNova /v1/chat/completions endpoint?

Short Answer: You can keep using streaming; tools support depends on the specific feature and model, so check SambaNova’s docs.

Details: Streaming (stream: true and SSE handling) is supported and designed to take advantage of SambaNova’s high tokens/sec throughput on RDUs. For tools/functions, the core pattern is aligned with OpenAI’s format, but exact support (e.g., JSON schema details, tool-call formats) may vary by model and release. Validate your existing tool payloads against SambaNova’s documentation and run a few integration tests—especially for complex agent frameworks—before re-pointing full production traffic.
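Before re-pointing production traffic, a quick smoke test is to send your existing OpenAI-format tools array unchanged and inspect the tool calls that come back. The get_weather tool below is a hypothetical example for testing, not a SambaNova API:

```python
# Hypothetical tool definition in the standard OpenAI tools format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Attach it to an otherwise-unchanged chat-completions request body.
request_body = {
    "model": "llama-3.1-70b-instruct",  # illustrative SambaNova model name
    "messages": [{"role": "user", "content": "Weather in Tokyo?"}],
    "tools": tools,
}
```

If the model returns a well-formed tool call for this payload, your agent framework's tool plumbing is likely to port cleanly; if not, compare the response against SambaNova's documented tool-call format.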


Summary

Switching your app from OpenAI to SambaNova Cloud using the OpenAI-compatible /v1/chat/completions endpoint is a low-friction path to better inference economics and higher throughput. You keep your existing chat payloads, streaming behavior, and client logic, while SambaNova’s chips-to-model computing stack—RDUs, SambaRack, SambaStack, and SambaOrchestrator—handles the heavy lifting: model bundling, tiered memory for hot prompts, and efficient multi-model agentic workflows. For platform teams trying to escape one-model-per-node limitations and data-center constraints, it’s a pragmatic way to upgrade infrastructure without rebuilding your application.


Next Step

Get Started