
How do I point my existing OpenAI SDK to together.ai (base URL, API key) without rewriting my app?
Most teams can point their existing OpenAI SDKs at together.ai in a few lines of configuration: change the base URL, swap the API key, optionally update the model name, and keep the rest of your code exactly the same. together.ai exposes an OpenAI-compatible API, so you do not need to rewrite your app, your agents, or your middleware.
Quick Answer: Configure the OpenAI client to use https://api.together.xyz/v1 as the base_url (or baseURL) and set your TOGETHER_API_KEY. Your existing calls (client.chat.completions.create, openai.ChatCompletion.create, etc.) continue to work with minimal or no code changes.
The Quick Overview
- What It Is: A drop-in OpenAI-compatible endpoint that lets you run top open-source and partner models on together.ai’s AI Native Cloud with the same SDKs you already use.
- Who It Is For: Teams already using OpenAI SDKs (Node, Python, etc.) that want better price-performance, long-context options, and control over infrastructure without changing their app logic.
- Core Problem Solved: Move to faster, cheaper, more flexible inference (serverless or dedicated) without a risky “full rewrite” of your codebase.
How It Works
You keep your existing OpenAI client and method calls, and only change the configuration that points the client to together.ai:
- Swap the Base URL: Replace the OpenAI endpoint with https://api.together.xyz/v1, which exposes an OpenAI-compatible API surface.
- Set the Together API Key: Configure TOGETHER_API_KEY and pass it where you previously used OPENAI_API_KEY.
- Update Model Names (If Needed): Choose a model available on together.ai (e.g., Mixtral, Llama, Qwen, vision models) and start sending traffic through serverless, batch, or dedicated inference.
Under the hood, your requests hit together.ai’s AI Native Cloud: FlashAttention-based kernels, ATLAS speculative decoding, and CPD long-context serving give you up to 2.75x faster inference and better economics, with SOC 2 Type II assurances and tenant-level isolation.
Step-by-Step: Pointing Your OpenAI SDK to together.ai
1. Get Your Together API Key
- Register or sign in at together.ai.
- Go to your account dashboard and create an API key.
- Store it as an environment variable, for example:
# On macOS/Linux (bash, zsh)
export TOGETHER_API_KEY="sk-..."
# On Windows (PowerShell)
$env:TOGETHER_API_KEY="sk-..."
You’ll use TOGETHER_API_KEY instead of OPENAI_API_KEY.
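A missing or empty key tends to surface later as an authentication error from the API, so some teams check for it at startup. The helper below is a minimal sketch, not part of any SDK: the function name load_together_key is hypothetical, and it assumes only the TOGETHER_API_KEY variable described above.

```python
import os


def load_together_key(env=None):
    """Return the Together API key from the environment, or raise early.

    `load_together_key` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get("TOGETHER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TOGETHER_API_KEY is not set; export it before starting the app."
        )
    return key
```

Calling it once during startup turns a confusing runtime failure into an immediate, readable error.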
2. Update the Client Configuration by Language
Below are minimal changes for common OpenAI SDK setups.
Python (New openai SDK / OpenAI client)
If you’re using the new openai client:
import os

from openai import OpenAI

# Point the standard OpenAI client at together.ai's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[
        {"role": "user", "content": "Explain CPD for long-context serving in 2 sentences."}
    ],
)
print(resp.choices[0].message.content)
Key changes:
- base_url="https://api.together.xyz/v1"
- api_key=os.environ["TOGETHER_API_KEY"]
- Use any together.ai-supported model name.
Python (Legacy openai.ChatCompletion.create style)
If you’re still on the legacy pattern (openai Python package versions before 1.0):
import os

import openai

# In the legacy SDK the endpoint override is api_base, not base_url.
openai.api_key = os.environ["TOGETHER_API_KEY"]
openai.api_base = "https://api.together.xyz/v1"

resp = openai.ChatCompletion.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[
        {"role": "user", "content": "Summarize ATLAS in one paragraph."}
    ],
)
print(resp["choices"][0]["message"]["content"])
Only the configuration lines change; your method calls stay the same.
Node.js / TypeScript (New openai client)
npm install openai

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.together.xyz/v1",
  apiKey: process.env.TOGETHER_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Give 3 bullets on CPD vs naive long-context serving." }],
});
console.log(resp.choices[0].message.content);
Again, only baseURL, apiKey, and the model name change.
Node.js / TypeScript (Legacy Configuration + OpenAIApi)
import { Configuration, OpenAIApi } from "openai";

const configuration = new Configuration({
  apiKey: process.env.TOGETHER_API_KEY,
  basePath: "https://api.together.xyz/v1",
});
const client = new OpenAIApi(configuration);

const resp = await client.createChatCompletion({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "What is FlashAttention-4 and why does it matter?" }],
});
console.log(resp.data.choices[0].message?.content);
In the legacy Configuration + OpenAIApi style, the endpoint option is basePath rather than baseURL.
cURL
If you already have scripts using api.openai.com, you can adapt them:
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    "messages": [{"role": "user", "content": "Outline a batch inference pipeline on Together."}]
  }'
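For readers who want to see what the SDKs send on the wire, the same request can be assembled with Python's standard library. This is a sketch that only builds the request object and does not send it. The name build_chat_request is hypothetical; the endpoint, header names, and body fields are taken from the cURL command above.

```python
import json
import urllib.request

TOGETHER_BASE_URL = "https://api.together.xyz/v1"


def build_chat_request(api_key, model, prompt):
    """Build (but do not send) a chat completions request. Hypothetical helper."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{TOGETHER_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    api_key="sk-placeholder",  # placeholder value, not a real key
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    prompt="Outline a batch inference pipeline on Together.",
)
print(req.get_method(), req.full_url)
# prints: POST https://api.together.xyz/v1/chat/completions
```

Passing req to urllib.request.urlopen would actually send it; the sketch stops short of that so it can be run without a key or network access.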
3. Choosing Models on together.ai
Because together.ai is model-agnostic and open, you may want to:
- Swap from a proprietary model to a top open model (e.g., Mixtral, Llama 3.1, Qwen).
- Move to a long-context model for RAG or document workflows.
- Use multimodal models (vision, OCR, image understanding) through the same OpenAI-compatible API.
Model names follow the pattern:
provider/model-name
Examples:
- meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
- mistralai/Mixtral-8x7B-Instruct-v0.1
- Vision/OCR models for multimodal workflows.
You can usually drop in a new model name without changing request structure (messages, temperature, max_tokens, etc.).
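Because a mistyped or leftover model ID is an easy mistake during migration, a small format check can catch it before a request is sent. This sketch assumes only the provider/model-name pattern described above. The helper split_model_id is hypothetical, and it does not verify that a model actually exists on together.ai.

```python
def split_model_id(model_id):
    """Split 'provider/model-name' into its two parts. Hypothetical helper."""
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    return provider, name


print(split_model_id("mistralai/Mixtral-8x7B-Instruct-v0.1"))
# prints: ('mistralai', 'Mixtral-8x7B-Instruct-v0.1')
```

An ID without a provider prefix, such as a leftover gpt-* name, raises ValueError instead of reaching the API.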
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| OpenAI-compatible API | Uses the same request/response schema and SDK methods | No code rewrite; switch providers by changing config |
| AI Native Cloud performance | Uses FlashAttention kernels, ATLAS, and CPD on modern GPUs | Up to 2.75x faster inference and better price-performance |
| Flexible deployment modes | Serverless, Batch, Dedicated Model, Dedicated Container | Match infra to traffic: latency, throughput, or control |
| Model breadth (open + partner) | Access hundreds of open-source and partner models | Swap models without re-architecting your stack |
| Strong privacy & control | SOC 2 Type II, tenant-level isolation, data ownership | Ship production workloads with compliance and assurances |
Ideal Use Cases
- Best for production apps already on OpenAI: Because you can redirect traffic to together.ai with a base URL + key change, then iterate on models and deployment modes (e.g., move hot paths to Dedicated Model Inference) without touching most of your application code.
- Best for teams optimizing unit economics: Because you can test together.ai’s 2x+ faster serverless and up to 50% cheaper batch processing while keeping your existing gateways, agents, and orchestrators compatible via the OpenAI-style interface.
Deployment Mode Considerations After You Switch
Once your SDK points to together.ai, the next decision is how you want inference to run:
- Serverless Inference (default for most OpenAI-style calls)
  - Best for: Variable or unpredictable traffic, prototypes, internal tools.
  - Behavior: Auto-scales; you pay per token; no infrastructure to manage.
  - Benefit: Quickest way to test new models and benchmark latency vs your existing provider.
- Batch Inference
  - Best for: Offline jobs, large dataset processing (e.g., 30B tokens), log analysis, backfills.
  - Behavior: Submit large jobs; together.ai schedules them on GPU clusters for throughput.
  - Benefit: Up to 50% less cost for high-volume workloads.
- Dedicated Model Inference
  - Best for: Steady traffic and latency-sensitive production workloads.
  - Behavior: Your own reserved model endpoint on dedicated GPUs.
  - Benefit: More predictable latency, better tokens/sec, and strong cost control.
- Dedicated Container Inference & GPU Clusters
  - Best for: Custom runtimes, bespoke serving stacks, or full control over kernels.
  - Behavior: Bring your own container or run full workloads on GPU clusters (8–4,000+ GPUs).
  - Benefit: Maximum flexibility while still benefiting from together.ai infra and research.
Your integration code (OpenAI SDK calls) can stay the same across these modes; you only change how/where the model is deployed on the backend.
Limitations & Considerations
- Model name differences: together.ai does not expose proprietary model IDs from other vendors. You’ll need to pick a compatible open or partner model (e.g., a Llama 3.1 or Mixtral variant) instead of gpt-*.
  Workaround: Create a simple mapping layer in your app, e.g., MY_DEFAULT_MODEL -> meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo.
- Feature parity nuances: The API is OpenAI-compatible, but not every provider-specific feature or beta flag (e.g., some vendor-only tools) will be identical.
  Workaround: Start with baseline chat/completions and tools that are documented to work; then enable advanced features incrementally, testing behavior per model.
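The mapping-layer workaround can be as small as a dictionary lookup. The sketch below is one possible shape. The alias names, the choice of target models, and resolve_model are all illustrative assumptions, and which together.ai model suits a given workload is something to benchmark for yourself.

```python
# Hypothetical app-level aliases -> together.ai model IDs.
# The targets are the example IDs from this guide, not recommendations.
MODEL_ALIASES = {
    "MY_DEFAULT_MODEL": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    "MY_FALLBACK_MODEL": "mistralai/Mixtral-8x7B-Instruct-v0.1",
}


def resolve_model(alias):
    """Translate an app-level alias into a concrete model ID."""
    try:
        return MODEL_ALIASES[alias]
    except KeyError:
        raise KeyError(f"no together.ai model mapped for alias {alias!r}") from None
```

Application code then passes model=resolve_model("MY_DEFAULT_MODEL"), so a later model swap touches only the table.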
Pricing & Plans
together.ai is designed for best-in-market price-performance, with multiple ways to align cost to workload shape:
- Serverless pay-as-you-go: Ideal when you’re just pointing your existing OpenAI SDK to together.ai and want to see latency and cost improvements with no commitments. You pay per token and can experiment across many models.
- Reserved / Dedicated capacity: Ideal once you’ve stabilized on a few models and want lower unit costs and predictable SLOs. Dedicated Model Inference and Dedicated Container Inference give you reserved GPUs, better tokens/sec, and clearer cost per 1M tokens.
For exact per-model pricing and volume options, contact sales or check the pricing page, then choose between:
- On-Demand Serverless: Best for teams needing flexibility, burst handling, and no long-term commitments.
- Reserved / Dedicated: Best for teams with steady or high-throughput workloads that need strict latency SLOs and predictable spend.
Frequently Asked Questions
Do I have to change all my openai method calls to use together.ai?
Short Answer: No. You typically only change the base URL, API key, and model name.
Details:
Because together.ai exposes an OpenAI-compatible API, your existing call patterns like:
- client.chat.completions.create(...) (new SDKs)
- openai.ChatCompletion.create(...) (legacy SDKs)
- client.images.generate(...) or client.audio.transcriptions.create(...)
can remain as-is. The critical changes are:
- Configure base_url / baseURL (or api_base / basePath in the legacy SDKs) to https://api.together.xyz/v1
- Set api_key to TOGETHER_API_KEY
- Use a model ID available on together.ai
If you’ve abstracted your model IDs behind config, the migration is often a one-line change plus updating an environment variable.
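If endpoints and model IDs already live in configuration, the switch can be driven by a single environment variable. This is a minimal sketch, assuming a hypothetical LLM_PROVIDER variable and a hypothetical client_settings helper. The OpenAI base URL shown is that provider's public default, and the model names are examples only.

```python
import os

# Per-provider settings; LLM_PROVIDER and this table are illustrative assumptions.
PROVIDERS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "key_env": "OPENAI_API_KEY",
        "model": "gpt-4o-mini",  # example only
    },
    "together": {
        "base_url": "https://api.together.xyz/v1",
        "key_env": "TOGETHER_API_KEY",
        "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    },
}


def client_settings(env=None):
    """Return (client kwargs, model ID) for the provider named in LLM_PROVIDER."""
    env = os.environ if env is None else env
    provider = PROVIDERS[env.get("LLM_PROVIDER", "openai")]
    kwargs = {"base_url": provider["base_url"], "api_key": env[provider["key_env"]]}
    return kwargs, provider["model"]
```

With the new Python SDK, the first element can be passed as OpenAI(**kwargs) and the second as the model argument, so flipping LLM_PROVIDER is the only change at deploy time.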
Will switching to together.ai break my existing agents, tools, or middleware?
Short Answer: In most cases, no — as long as they rely on the OpenAI API shape and not vendor-specific features.
Details:
Agent frameworks, orchestration layers, and gateways that speak the OpenAI API generally work out-of-the-box when you:
- Point their base_url to https://api.together.xyz/v1
- Swap the API key
- Map their default model name to an equivalent together.ai model
For advanced features like tool calling, reasoning, or vision, together.ai’s Model Shaping and expanded fine-tuning/tool support are designed to work with the same interface. If you use highly vendor-specific functionality, test the behavior in a staging environment first, then gradually cut over production traffic.
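One way to cut traffic over gradually is to route a fixed percentage of requests to the new endpoint, keyed on a stable identifier so that a given user always lands on the same provider. This is a generic rollout sketch rather than a together.ai feature. The function choose_provider and its together_percent parameter are hypothetical.

```python
import hashlib


def choose_provider(user_id, together_percent):
    """Deterministically route `together_percent`% of users to together.ai."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "together" if bucket < together_percent else "openai"
```

Raising together_percent from 0 toward 100 only ever moves users onto together.ai, never back, because each user's bucket is fixed by the hash.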
Summary
Pointing your existing OpenAI SDK to together.ai is a configuration change, not a rewrite. By updating the base URL to https://api.together.xyz/v1, swapping in TOGETHER_API_KEY, and selecting a together.ai model, you get access to top open-source and partner models, up to 2.75x faster inference, and better unit economics — all while keeping your current app, agent framework, and middleware intact. From there, you can choose the right deployment mode (Serverless, Batch, Dedicated) to align latency and cost with your workload.