How do I point my existing OpenAI SDK to together.ai (base URL, API key) without rewriting my app?


Most teams can point their existing OpenAI SDKs at together.ai in a few lines of configuration: change the base URL, swap the API key, optionally update the model name, and keep the rest of your code exactly the same. together.ai exposes an OpenAI-compatible API, so you do not need to rewrite your app, your agents, or your middleware.

Quick Answer: Configure the OpenAI client to use https://api.together.xyz/v1 as the base_url (baseURL in Node.js; api_base or basePath in the legacy SDKs) and pass your TOGETHER_API_KEY as the API key. Your existing calls (client.chat.completions.create, openai.ChatCompletion.create, etc.) continue to work with minimal or no code changes.


The Quick Overview

  • What It Is: A drop-in OpenAI-compatible endpoint that lets you run top open-source and partner models on together.ai’s AI Native Cloud with the same SDKs you already use.
  • Who It Is For: Teams already using OpenAI SDKs (Node, Python, etc.) that want better price-performance, long-context options, and control over infrastructure without changing their app logic.
  • Core Problem Solved: Move to faster, cheaper, more flexible inference (serverless or dedicated) without a risky “full rewrite” of your codebase.

How It Works

You keep your existing OpenAI client and method calls, and only change the configuration that points the client to together.ai:

  1. Swap the Base URL: Replace the OpenAI endpoint with https://api.together.xyz/v1, which exposes an OpenAI-compatible API surface.
  2. Set the Together API Key: Configure TOGETHER_API_KEY and pass it where you previously used OPENAI_API_KEY.
  3. Update Model Names (If Needed): Choose a model available on together.ai (e.g., Mixtral, Llama, Qwen, vision models) and start sending traffic — serverless, batch, or dedicated.
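
The three steps above amount to a small configuration switch. The following is a minimal sketch rather than an official integration: `client_settings`, `BASE_URLS`, and `KEY_VARS` are hypothetical names introduced here, and the returned dict is meant to be passed as keyword arguments to the `OpenAI(...)` constructor.

```python
import os

# Base URL per provider. The together.ai URL comes from this article;
# the OpenAI URL is the SDK's usual default.
BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "together": "https://api.together.xyz/v1",
}

# Environment variable that holds each provider's API key.
KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "together": "TOGETHER_API_KEY",
}


def client_settings(provider, env=os.environ):
    """Return keyword arguments for OpenAI(...) for the chosen provider."""
    if provider not in BASE_URLS:
        raise ValueError(f"unknown provider: {provider!r}")
    return {
        "base_url": BASE_URLS[provider],
        "api_key": env[KEY_VARS[provider]],
    }


if __name__ == "__main__":
    # A fake key is injected so the demo runs without a real account.
    demo = client_settings("together", env={"TOGETHER_API_KEY": "example-key"})
    print(demo["base_url"])
```

With the openai package installed, `client = OpenAI(**client_settings("together"))` would then replace the hard-coded constructor arguments, so switching back is a one-word change.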

Under the hood, your requests hit together.ai’s AI Native Cloud: FlashAttention-based kernels, ATLAS speculative decoding, and CPD long-context serving give you up to 2.75x faster inference and better economics, with SOC 2 Type II assurances and tenant-level isolation.


Step-by-Step: Pointing Your OpenAI SDK to together.ai

1. Get Your Together API Key

  1. Register or sign in at together.ai.
  2. Go to your account dashboard and create an API key.
  3. Store it as an environment variable, for example:
# On macOS/Linux (bash/zsh)
export TOGETHER_API_KEY="sk-..."

# On Windows (PowerShell)
$env:TOGETHER_API_KEY="sk-..."

You’ll use TOGETHER_API_KEY instead of OPENAI_API_KEY.
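
A missing or empty key otherwise tends to surface only as an authentication error on the first request, so it can help to check the variable at startup. A minimal sketch, where `load_api_key` is a hypothetical helper name and not part of any SDK:

```python
import os


def load_api_key(var_name="TOGETHER_API_KEY", env=os.environ):
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before starting the app."
        )
    return key
```

Calling `load_api_key()` once during application startup fails fast with a readable message instead of a late HTTP error.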


2. Update the Client Configuration by Language

Below are minimal changes for common OpenAI SDK setups.

Python (New openai SDK / OpenAI client)

If you’re using the new openai client:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[
        {"role": "user", "content": "Explain CPD for long-context serving in 2 sentences."}
    ],
)
print(resp.choices[0].message.content)

Key changes:

  • base_url="https://api.together.xyz/v1"
  • api_key=os.environ["TOGETHER_API_KEY"]
  • Use any together.ai-supported model name.

Python (Legacy openai.ChatCompletion.create style)

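If you’re still on the legacy pattern (openai package versions before 1.0):

import openai
import os

# The pre-1.0 SDK reads the endpoint from api_base, not base_url.
openai.api_key = os.environ["TOGETHER_API_KEY"]
openai.api_base = "https://api.together.xyz/v1"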

resp = openai.ChatCompletion.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[
        {"role": "user", "content": "Summarize ATLAS in one paragraph."}
    ],
)
print(resp["choices"][0]["message"]["content"])

Only the configuration lines change; your method calls stay the same.


Node.js / TypeScript (New openai client)

Install the SDK:

npm install openai

Then configure the client:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.together.xyz/v1",
  apiKey: process.env.TOGETHER_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Give 3 bullets on CPD vs naive long-context serving." }],
});

console.log(resp.choices[0].message.content);

Again, only baseURL, apiKey, and the model name change.


Node.js / TypeScript (Legacy Configuration + OpenAIApi)

import { Configuration, OpenAIApi } from "openai";

const configuration = new Configuration({
  apiKey: process.env.TOGETHER_API_KEY,
  basePath: "https://api.together.xyz/v1",
});

const client = new OpenAIApi(configuration);

const resp = await client.createChatCompletion({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "What is FlashAttention-4 and why does it matter?" }],
});

console.log(resp.data.choices[0].message?.content);

In this legacy (v3) SDK, the endpoint option is basePath rather than baseURL.


cURL

If you already have scripts using api.openai.com, you can adapt them:

curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        "messages": [{"role": "user", "content": "Outline a batch inference pipeline on Together."}]
      }'
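
For scripts that should not depend on cURL or an SDK, the same request can be assembled with Python's standard library. This is a sketch that only builds the request and does not send it; `build_chat_request` is a hypothetical helper name, and the payload mirrors the cURL example above.

```python
import json
import urllib.request


def build_chat_request(api_key, model, prompt,
                       base_url="https://api.together.xyz/v1"):
    """Build (but do not send) a POST request to /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request(
        "example-key",
        "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        "Outline a batch inference pipeline on Together.",
    )
    print(req.get_method(), req.full_url)
```

To actually send it, pass the object to `urllib.request.urlopen(req)` with a real key and decode the JSON response body.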

3. Choosing Models on together.ai

Because together.ai is model-agnostic and open, you may want to:

  • Swap from a proprietary model to a top open model (e.g., Mixtral, Llama 3.1, Qwen).
  • Move to a long-context model for RAG or document workflows.
  • Use multimodal models (vision, OCR, image understanding) through the same OpenAI-compatible API.

Model names follow the pattern:

provider/model-name

Examples:

  • meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
  • mistralai/Mixtral-8x7B-Instruct-v0.1
  • Vision/OCR models for multimodal workflows.

You can usually drop in a new model name without changing request structure (messages, temperature, max_tokens, etc.).
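
To illustrate that only the model field changes, the sketch below builds the keyword arguments for `client.chat.completions.create(...)` from a model ID. The helper names and the default `temperature` and `max_tokens` values are illustrative assumptions, not recommendations from together.ai.

```python
def split_model_id(model_id):
    """Split 'provider/model-name' into (provider, model_name)."""
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    return provider, name


def chat_params(model_id, prompt, temperature=0.7, max_tokens=256):
    """Build keyword arguments for client.chat.completions.create(...)."""
    split_model_id(model_id)  # validate the ID shape before sending
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    llama = chat_params("meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo", "Hi")
    mixtral = chat_params("mistralai/Mixtral-8x7B-Instruct-v0.1", "Hi")
    # Only the "model" entry differs between the two requests.
    print([k for k in llama if llama[k] != mixtral[k]])
```

A call would then look like `client.chat.completions.create(**chat_params(model_id, prompt))`, with the model ID coming from configuration.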


Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| OpenAI-compatible API | Uses the same request/response schema and SDK methods | No code rewrite; switch providers by changing config |
| AI Native Cloud performance | Uses FlashAttention kernels, ATLAS, and CPD on modern GPUs | Up to 2.75x faster inference and better price-performance |
| Flexible deployment modes | Serverless, Batch, Dedicated Model, Dedicated Container | Match infra to traffic: latency, throughput, or control |
| Model breadth (open + partner) | Access hundreds of open-source and partner models | Swap models without re-architecting your stack |
| Strong privacy & control | SOC 2 Type II, tenant-level isolation, data ownership | Ship production workloads with compliance and assurances |

Ideal Use Cases

  • Best for production apps already on OpenAI: You can redirect traffic to together.ai with a base URL and key change, then iterate on models and deployment modes (e.g., move hot paths to Dedicated Model Inference) without touching most of your application code.
  • Best for teams optimizing unit economics: You can test together.ai’s 2x+ faster serverless inference and up to 50% cheaper batch processing while your existing gateways, agents, and orchestrators stay compatible through the OpenAI-style interface.

Deployment Mode Considerations After You Switch

Once your SDK points to together.ai, the next decision is how you want inference to run:

  • Serverless Inference (default for most OpenAI-style calls)

    • Best for: Variable or unpredictable traffic, prototypes, internal tools.
    • Behavior: Auto-scales; you pay per token; no infrastructure to manage.
    • Benefit: Quickest way to test new models and benchmark latency vs your existing provider.
  • Batch Inference

    • Best for: Offline jobs, large dataset processing (e.g., 30B tokens), log analysis, backfills.
    • Behavior: Submit large jobs; together.ai schedules them on GPU clusters for throughput.
    • Benefit: Up to 50% less cost for high-volume workloads.
  • Dedicated Model Inference

    • Best for: Steady traffic and latency-sensitive production workloads.
    • Behavior: Your own reserved model endpoint on dedicated GPUs.
    • Benefit: More predictable latency, better tokens/sec, and strong cost control.
  • Dedicated Container Inference & GPU Clusters

    • Best for: Custom runtimes, bespoke serving stacks, or full control over kernels.
    • Behavior: Bring your own container or run full workloads on GPU clusters (8–4,000+ GPUs).
    • Benefit: Maximum flexibility while still benefiting from together.ai infra and research.

Your integration code (OpenAI SDK calls) can stay the same across these modes; you only change how/where the model is deployed on the backend.
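
Because the calling code is identical across providers and deployment modes, one timing harness can compare them, as the serverless bullet above suggests. A minimal sketch using only the standard library; `measure_latency` is a hypothetical helper, and the `time.sleep` call is a stand-in for a real request.

```python
import statistics
import time


def measure_latency(call, runs=5):
    """Time `call()` `runs` times; return (median, max) in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), max(timings)


if __name__ == "__main__":
    # Stand-in workload. In practice, pass a function that performs
    # one chat completion request against the client under test.
    median_s, worst_s = measure_latency(lambda: time.sleep(0.01), runs=3)
    print(f"median={median_s:.3f}s worst={worst_s:.3f}s")
```

Running the same function against each client with the same prompt gives a rough comparison; a real benchmark would also control for prompt length, output length, and warm-up.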


Limitations & Considerations

  • Model name differences:
    together.ai does not expose proprietary model IDs from other vendors. You’ll need to pick a compatible open or partner model (e.g., a Llama 3.1 or Mixtral variant) instead of gpt-*.
    Workaround: Create a simple mapping layer in your app, e.g., MY_DEFAULT_MODEL -> meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo.

  • Feature parity nuances:
    The API is OpenAI-compatible, but not every provider-specific feature or beta flag (e.g., some vendor-only tools) will be identical.
    Workaround: Start with baseline chat/completions and tools that are documented to work; then enable advanced features incrementally, testing behavior per model.
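
The mapping layer from the first workaround can be as small as a dictionary lookup. A minimal sketch: `MODEL_ALIASES` and `resolve_model` are hypothetical names, and rejecting unmapped gpt-* names is one possible policy, not a together.ai requirement.

```python
MODEL_ALIASES = {
    # Alias and target taken from the workaround above.
    "MY_DEFAULT_MODEL": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
}


def resolve_model(name, aliases=MODEL_ALIASES):
    """Map an app-level alias to a provider model ID."""
    if name in aliases:
        return aliases[name]
    if name.startswith("gpt-"):
        # Fail early rather than sending an ID the endpoint will reject.
        raise ValueError(f"{name!r} has no mapping; add an alias for it")
    return name  # assume it is already a provider/model-name ID


if __name__ == "__main__":
    print(resolve_model("MY_DEFAULT_MODEL"))
```

Application code then refers only to aliases, so changing the underlying model is an edit to one dictionary entry.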


Pricing & Plans

together.ai is designed for best-in-market price-performance, with multiple ways to align cost to workload shape:

  • Serverless pay-as-you-go:
    Ideal when you’re just pointing your existing OpenAI SDK to together.ai and want to see latency and cost improvements with no commitments. You pay per token and can experiment across many models.

  • Reserved / Dedicated capacity:
    Ideal once you’ve stabilized on a few models and want lower unit costs and predictable SLOs. Dedicated Model Inference and Dedicated Container Inference give you reserved GPUs, better tokens/sec, and clearer cost per 1M tokens.

For exact per-model pricing and volume options, contact sales or check the pricing page, then choose between:

  • On-Demand Serverless: Best for teams needing flexibility, burst handling, and no long-term commitments.
  • Reserved / Dedicated: Best for teams with steady or high-throughput workloads that need strict latency SLOs and predictable spend.
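
Per-token pricing makes cost estimates simple arithmetic. The sketch below uses the 30B-token batch example and the "up to 50%" batch figure from this article; the price of 1.00 USD per 1M tokens is a placeholder chosen for easy arithmetic, not a real together.ai price.

```python
def cost_usd(tokens, price_per_million_usd):
    """Cost of processing `tokens` at a given price per 1M tokens."""
    return tokens / 1_000_000 * price_per_million_usd


def discounted(cost, discount_fraction):
    """Apply a fractional discount, e.g. 0.5 for 50% off."""
    return cost * (1 - discount_fraction)


if __name__ == "__main__":
    placeholder_price = 1.00  # hypothetical USD per 1M tokens, not a quote
    base = cost_usd(30_000_000_000, placeholder_price)
    print(base, discounted(base, 0.5))  # prints: 30000.0 15000.0
```

Substituting the actual per-model prices from the pricing page gives a first-order comparison between serverless and batch for a given job size.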

Frequently Asked Questions

Do I have to change all my openai method calls to use together.ai?

Short Answer: No. You typically only change the base URL, API key, and model name.

Details:
Because together.ai exposes an OpenAI-compatible API, your existing call patterns like:

  • client.chat.completions.create(...) (new SDKs)
  • openai.ChatCompletion.create(...) (legacy SDKs)
  • client.images.generate(...) or client.audio.transcriptions.create(...)

can remain as-is. The critical changes are:

  • Configure base_url/baseURL (or api_base/basePath in the legacy SDKs) to https://api.together.xyz/v1
  • Set api_key to TOGETHER_API_KEY
  • Use a model ID available on together.ai

If you’ve abstracted your model IDs behind config, the migration is often a one-line change plus updating an environment variable.


Will switching to together.ai break my existing agents, tools, or middleware?

Short Answer: In most cases, no — as long as they rely on the OpenAI API shape and not vendor-specific features.

Details:
Agent frameworks, orchestration layers, and gateways that speak the OpenAI API generally work out-of-the-box when you:

  • Point their base_url to https://api.together.xyz/v1
  • Swap the API key
  • Map their default model name to an equivalent together.ai model

For advanced features like tool calling, reasoning, or vision, together.ai’s Model Shaping and expanded fine-tuning/tool support are designed to work with the same interface. If you use highly vendor-specific functionality, test the behavior in a staging environment first, then gradually cut over production traffic.
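
A gradual cutover can be sketched as deterministic percentage routing: each request ID hashes to a stable bucket, and the rollout percentage decides which provider serves it. This is one common approach and an assumption of this example, not something together.ai prescribes; `bucket` and `pick_provider` are hypothetical names.

```python
import hashlib


def bucket(request_id):
    """Map a request ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_provider(request_id, together_percent):
    """Route roughly `together_percent`% of request IDs to together.ai."""
    if not 0 <= together_percent <= 100:
        raise ValueError("together_percent must be between 0 and 100")
    return "together" if bucket(request_id) < together_percent else "openai"


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(1000)]
    share = sum(pick_provider(i, 10) == "together" for i in ids) / len(ids)
    print(f"share routed to together.ai: {share:.1%}")
```

Because the bucket depends only on the ID, a request that moved to together.ai at 10% stays there as the percentage is raised, which keeps behavior stable for a given user or session key.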


Summary

Pointing your existing OpenAI SDK to together.ai is a configuration change, not a rewrite. By updating the base URL to https://api.together.xyz/v1, swapping in TOGETHER_API_KEY, and selecting a together.ai model, you get access to top open-source and partner models, up to 2.75x faster inference, and better unit economics, all while keeping your current app, agent framework, and middleware intact. From there, you can choose the right deployment mode (Serverless, Batch, Dedicated) to align latency and cost with your workload.

