
How do I point my existing OpenAI SDK to together.ai (base URL, API key) without rewriting my app?
Most teams can point an existing OpenAI-based app at together.ai in a few lines of config: swap the base URL, change the API key, and keep the rest of your code exactly the same. together.ai exposes an OpenAI-compatible API, so your models, parameters, and client calls typically don’t need to change.
Quick Answer: Update your OpenAI SDK client to use `https://api.together.xyz/v1` as the `base_url` and your `TOGETHER_API_KEY` as the `api_key`. Because together.ai provides an OpenAI-compatible API, you can reuse your existing OpenAI SDK and request structure without rewriting your app.
The Quick Overview
- What It Is: A drop-in OpenAI-compatible endpoint from together.ai that lets you reuse your existing OpenAI SDK calls (chat, completions, etc.) against faster, lower-cost open-source and partner models.
- Who It Is For: Engineering teams already using OpenAI SDKs in Python, Node, or other languages who want better price-performance, more control, or access to open models without changing their app logic.
- Core Problem Solved: You avoid a risky, time-consuming rewrite. Instead, you just redirect traffic to together.ai’s AI Native Cloud by changing the base URL and API key.
How It Works
together.ai implements the same high-level API surface area you’re already calling with OpenAI (including an OpenAI-compatible SDK interface). Under the hood, the traffic is served on together.ai’s AI Native Cloud — optimized kernels (Together Kernel Collection, FlashAttention), runtime-learning accelerators (ATLAS), and long-context architecture (CPD) — but your application only “sees” the OpenAI-compatible API.
At a high level:
- Swap the Endpoint: Point your OpenAI SDK to `https://api.together.xyz/v1` instead of the OpenAI base URL.
- Use a Together API Key: Set `TOGETHER_API_KEY` in your environment and pass it to the OpenAI client.
- Select Together Models: Use together.ai model IDs in your existing calls (e.g., `model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"`), while keeping the rest of your parameters and logic unchanged.
1. Configure the Base URL
In most languages, you’ll create the OpenAI client with a configurable base URL. For together.ai, that becomes:
```python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)
```
Node.js example:
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.together.xyz/v1",
  apiKey: process.env.TOGETHER_API_KEY,
});
```
Once this is set, all your existing `client.chat.completions.create(...)` or `client.responses.create(...)` calls are automatically routed to together.ai.
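If you want to keep both providers switchable during migration, the endpoint swap can be reduced to a single config value. A minimal sketch (the `PROVIDERS` table and `client_settings` helper are illustrative, not part of either SDK):

```python
import os

# Illustrative provider table: each entry holds the base URL and the name of
# the environment variable that stores that provider's API key.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "together": {"base_url": "https://api.together.xyz/v1", "key_env": "TOGETHER_API_KEY"},
}

def client_settings(provider: str, env=os.environ) -> dict:
    """Return kwargs suitable for OpenAI(**client_settings(provider))."""
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"], "api_key": env[cfg["key_env"]]}
```

With this in place, `OpenAI(**client_settings("together"))` and `OpenAI(**client_settings("openai"))` differ only in configuration, which makes A/B comparisons and rollback trivial.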
2. Configure the API Key
Create an account on together.ai, grab your API key from the dashboard, and set it as an environment variable:
```bash
export TOGETHER_API_KEY="your_together_key_here"
```
You don’t need to change any headers manually; the OpenAI SDK handles auth given the `api_key`.
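A common migration pitfall is a client that is constructed with a missing or empty key and only fails on the first request. A small fail-fast guard, sketched here as a hypothetical helper (`require_together_key` is not part of any SDK):

```python
import os

def require_together_key(env=os.environ) -> str:
    """Fail fast with a clear message if TOGETHER_API_KEY is unset or empty."""
    key = env.get("TOGETHER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TOGETHER_API_KEY is not set; export it before creating the client."
        )
    return key
```

You would then pass `api_key=require_together_key()` when constructing the client, so misconfiguration surfaces at startup rather than mid-request.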
3. Use Together Model Names in Existing Calls
You’ll typically just change the model string:
```python
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Refactor this function for readability..."},
    ],
    temperature=0.2,
)
```
Your application’s usage of `messages`, `temperature`, `max_tokens`, and streaming flags remains the same; only the model ID and endpoint are different.
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| OpenAI-Compatible API | Lets you use the same OpenAI SDK clients and request formats with together.ai. | No code changes; just a new `base_url` + API key. |
| Optimized Inference Runtime | Uses ATLAS, CPD, and Together Kernel Collection on next-gen GPUs. | Up to 2.75x faster inference with lower latency and cost. |
| Breadth of Open Models & Modalities | Exposes text, image, video, code, and voice models behind one endpoint. | One API for multimodal apps; no need to stitch providers. |
Ideal Use Cases
- Best for “lift and shift” from OpenAI: Because it lets you move an existing OpenAI-based app over by changing only configuration. No refactor of API calls, tools logic, or message formats.
- Best for price-performance–sensitive workloads: Because together.ai’s AI Native Cloud can deliver up to 2.75x faster inference and up to 50% cost reductions in batch scenarios, while you keep your familiar OpenAI SDK.
Detailed Language Examples
Below are minimal diffs you’d apply in common languages to point your existing OpenAI SDK to together.ai.
Python (OpenAI SDK ≥ 1.0)
Before:
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
After (together.ai):
```python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Only the `base_url`, `api_key`, and `model` changed.
Node.js / TypeScript
Before:
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Summarize this text…" }],
});
```
After (together.ai):
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.together.xyz/v1",
  apiKey: process.env.TOGETHER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Summarize this text…" }],
});
```
Again, same client, same method, same payload pattern.
Curl / Raw HTTP
If you have scripts or services using raw HTTP, you can keep everything and only swap URL + header:
```bash
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    "messages": [{"role": "user", "content": "Hello from together.ai"}]
  }'
```
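The same URL-plus-header swap applies in Python services that talk to the API without the SDK. A standard-library sketch of the curl call above (`build_together_request` is a hypothetical helper; it only constructs the request, and the commented `urlopen` line would actually send it, which requires a real key):

```python
import json
import os
import urllib.request

def build_together_request(model: str, messages: list) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        "https://api.together.xyz/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ.get('TOGETHER_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it (requires TOGETHER_API_KEY to be set):
# with urllib.request.urlopen(build_together_request("meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
#                                                    [{"role": "user", "content": "Hello"}])) as resp:
#     print(json.load(resp))
```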
Limitations & Considerations
- Model Name Differences: You cannot use OpenAI proprietary model IDs on together.ai. You’ll need to map to supported open-source and partner models (e.g., Llama, Qwen, etc.). The rest of the request shape stays the same.
- Feature Surface Variance: While the API is OpenAI-compatible, some newer OpenAI features or model-specific behaviors may not exist or may behave differently. For advanced cases (e.g., structured outputs, some tool-calling nuances), test your flows and consult the together.ai docs for exact support.
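The model-mapping step can live in one small table so the rest of the codebase never hardcodes provider-specific IDs. A sketch (the right-hand values are examples only; consult together.ai's model list for the IDs and capabilities that actually fit your workload):

```python
# Illustrative mapping from OpenAI model IDs to together.ai model IDs.
# These target IDs are examples, not recommendations.
MODEL_MAP = {
    "gpt-4o-mini": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    "gpt-4o": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
}

def to_together_model(openai_model: str) -> str:
    """Translate an OpenAI model ID, failing loudly on unmapped models."""
    try:
        return MODEL_MAP[openai_model]
    except KeyError:
        raise ValueError(f"No together.ai mapping defined for {openai_model!r}") from None
```

Failing loudly on unmapped models is deliberate: a silent fallback would mask the exact place where your migration coverage has a gap.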
Additional operational notes:
- Security & Ownership: together.ai provides tenant-level isolation, encryption in transit and at rest, SOC 2 Type II compliance, and your data and models remain fully under your ownership.
- Deployment Modes: If you outgrow pure serverless, you can move specific workloads to Dedicated Model Inference, Dedicated Container Inference, or GPU Clusters without changing the core API pattern.
Pricing & Plans
together.ai is designed to offer the best economics in the market across serverless and dedicated modes:
- Serverless Inference: Pay per token with no commitments. Best for variable or unpredictable traffic, prototyping, and bursty workloads that benefit from an on-demand fleet.
- Dedicated Inference (Model or Container) & GPU Clusters: Best for steady, high-throughput production workloads where you want predictable latency, guaranteed capacity, and fine-grained control over runtime (quantization, custom kernels, or even your own containers).
You can start with serverless via the OpenAI-compatible API, then graduate hot paths to dedicated endpoints or GPU Clusters as you scale — all while keeping the same base URL pattern and API semantics.
- Serverless Plan (On-Demand): Best for teams needing to “just switch the base URL” and immediately test performance and cost, without provisioning or reservations.
- Dedicated / GPU Cluster Plans: Best for teams with strict latency SLOs or large, always-on workloads that need predictable performance, tenant-level isolation, and the ability to run custom models or containers.
Frequently Asked Questions
Do I have to change my OpenAI SDK or can I keep it?
Short Answer: You can keep your existing OpenAI SDK.
Details: together.ai exposes an OpenAI-compatible API, so your existing Python openai client, Node openai client, and other language SDKs can remain in place. You only need to:
- Update `base_url` / `baseURL` to `https://api.together.xyz/v1`.
- Use your `TOGETHER_API_KEY` as the `api_key`.
- Swap your `model` name to a model available on together.ai.
Your usage of message formats, `temperature`, `max_tokens`, and streaming is unchanged in most cases.
What if I’m using tools / function calling or long-context models?
Short Answer: The same pattern applies, but verify model and feature support.
Details: For tool calling, assistants, or long-context use cases:
- The top-level API structure remains OpenAI-compatible, but you must choose a model on together.ai that supports the capability you need.
- together.ai’s AI Native Cloud is particularly strong for long-context workloads thanks to CPD (cache-aware prefill–decode disaggregation) and ATLAS for speculative decoding, which can significantly improve latency and throughput at large context windows.
- For critical production flows, benchmark with your actual prompts and tools to validate latency, tokens/sec, and cost/1M tokens before migrating 100% of traffic.
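The benchmarking advice above can be sketched as a small harness. Here the callable and its token accounting are up to you: it should perform one request and return the number of completion tokens produced (on chat completions, `response.usage.completion_tokens` is the usual field):

```python
import time

def benchmark(call, runs: int = 3) -> dict:
    """Time a chat-completion callable and report latency and throughput.

    `call` performs one request and returns its completion-token count.
    """
    latencies, tokens = [], 0
    for _ in range(runs):
        start = time.perf_counter()
        tokens += call()
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    return {
        "avg_latency_s": total / runs,
        "tokens_per_s": tokens / total if total else 0.0,
    }
```

Running the same harness against your current OpenAI endpoint and the together.ai endpoint with identical prompts gives you a like-for-like latency and tokens/sec comparison before you shift traffic.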
Summary
You can point an existing OpenAI SDK–based application to together.ai by changing just three things: the base URL (`https://api.together.xyz/v1`), the API key (`TOGETHER_API_KEY`), and the model name. Everything else — your OpenAI SDK, request structure, and business logic — stays the same. In return, you get access to together.ai’s AI Native Cloud: up to 2.75x faster inference, better unit economics, long-context and multimodal support, and a path from serverless experiments to dedicated endpoints and GPU Clusters without re-architecting your app.