
How do I set up a together.ai Dedicated Endpoint for steady traffic and lower p95 latency?
Most teams hit a ceiling with pure serverless once their AI traffic stops being “bursty” and starts looking like a real production workload. That’s where together.ai Dedicated Endpoints come in: reserved GPU-backed capacity, tuned for your model and tokens/sec targets, so you can push down p95 latency and keep costs predictable.
Quick Answer: A together.ai Dedicated Endpoint is a reserved, isolated inference endpoint for a specific model or container, deployed on the AI Native Cloud. You pick the model, context length, and throughput target; together.ai provisions dedicated GPUs with ATLAS/CPD/TKC optimizations so you get lower p95 latency, steadier performance, and better unit economics for steady traffic.
The Quick Overview
- What It Is: A Dedicated Endpoint is an always-on inference endpoint backed by reserved GPUs and the together.ai inference engine (Dedicated Model Inference) or your own stack (Dedicated Container Inference).
- Who It Is For: Teams with predictable or steadily growing traffic that care about p95 latency, throughput SLOs, and strict isolation for production workloads.
- Core Problem Solved: You avoid noisy neighbor effects and cold-start spikes common in pure serverless, while getting tighter control over latency, concurrency, and cost-per-1M-tokens.
How It Works
At a high level, you:
- Define your workload: model, context length, target tps (tokens/sec), and latency SLO.
- Choose a Dedicated mode: Dedicated Model Inference for together.ai–managed engines, or Dedicated Container Inference for your own runtime.
- Provision & integrate: together.ai brings up reserved GPUs with the optimized inference stack, exposes an OpenAI-compatible endpoint, and you route steady traffic there while keeping burst traffic on Serverless.
Under the hood, together.ai runs your Dedicated Endpoint on the AI Native Cloud:
- Kernel-level speedups: Together Kernel Collection (from the FlashAttention team) and ThunderKittens-based kernels for attention, KV cache, and tensor ops.
- Runtime accelerators: ATLAS (AdapTive-LeArning Speculator System) for speculative decoding, increasing tokens/sec without quality loss.
- Long-context architecture: CPD (cache-aware Prefill–Decode Disaggregation) so large prompts and long chats don’t blow up latency.
The result: up to 2.75x faster inference on open-source models, p95s that stay flat as you scale, and better price-performance than stitching together multiple generic providers.
Step‑by‑Step: Setting Up a Dedicated Endpoint for Steady Traffic
Below is the practical setup flow I recommend for real deployments that care about the “how do I get lower p95 latency” part, not just the marketing diagram.
1. Profile Your Current Traffic
Before you request a Dedicated Endpoint, you need baselines:
- Measure current p50/p95 latency on your existing provider or together.ai Serverless:
- Time-to-first-token (TTFT)
- Time-to-last-token (end-to-end)
- Characterize load:
- Requests/sec during peak and off‑peak
- Average / p95 prompt tokens and completion tokens
- Modality mix (text, code, image, multimodal)
- Define hard SLOs:
- Example: “p95 TTFT < 700ms for 4K-token prompts, 150 output tokens, at 50 RPS.”
This informs the GPU type, count, and model configuration you’ll need.
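As a quick sanity check for this step, the p50/p95 baselines can be computed from raw latency samples with a simple nearest-rank percentile. This is a minimal sketch with made-up TTFT samples, not data from any real deployment:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    s = sorted(samples)
    # nearest-rank: the ceil(p/100 * n)-th smallest sample
    k = max(1, math.ceil(p / 100 * len(s)))
    return s[k - 1]

# Hypothetical TTFT measurements in milliseconds
ttft_ms = [180, 190, 198, 205, 210, 220, 230, 400, 950, 1100]
print("p50 TTFT:", percentile(ttft_ms, 50), "ms")  # 210 ms
print("p95 TTFT:", percentile(ttft_ms, 95), "ms")  # 1100 ms
```

The gap between p50 and p95 here is exactly what a Dedicated Endpoint is meant to close: the tail, not the median.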
2. Choose the Right Dedicated Mode
together.ai offers two Dedicated Inference modes:
- Dedicated Model Inference
- together.ai manages the inference engine and model.
- Best if:
- You’re running popular OSS or partner models (Llama, Qwen, Mixtral, etc.).
- You want OpenAI-compatible APIs with no engine maintenance.
- You care primarily about p95s, throughput, and price-performance, not custom runtimes.
- Dedicated Container Inference
- You bring your own image (model + runtime), together.ai manages GPUs and orchestration.
- Best if:
- You’re using custom or generative media models.
- You need non-standard runtimes (custom CUDA, specialized decoders, bespoke preprocessing).
- You’re migrating from an in-house stack and want managed GPUs without re‑platforming your engine.
For “steady traffic and lower p95 latency” on language or multimodal models, Dedicated Model Inference is typically the first choice.
3. Define Your Endpoint Requirements
When talking to together.ai (or configuring via console when available), be explicit:
- Model & context:
- e.g., Llama-3.1-70B-Instruct, 8K vs 128K context.
- Long-context use cases benefit from CPD; note if you’ll push past 8K consistently.
- Precision & quantization:
- FP16 vs 8‑bit/4‑bit quantization.
- For most production apps, mixed-precision with ATLAS/CPD is a good balance of latency and quality.
- Throughput targets:
- Peak RPS (requests/sec).
- Target tokens/sec (prefill and decode), derived from:
- avg prompt tokens × RPS
- avg completion tokens × RPS
- Latency SLOs:
- p95 TTFT and p95 end-to-end latency.
- Specify by scenario: “chat under 2K tokens,” “batch summarization up to 16K,” etc.
- Isolation & routing:
- Tenancy requirements: tenant-level isolation, separate endpoints per region/environment (prod/stage).
- Multi-region needs for geo-constrained data or DR.
together.ai uses this to size your Dedicated Endpoint: GPU class, number of replicas, and any needed sharding for long-context.
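The throughput targets above are simple arithmetic. As a sketch with hypothetical workload numbers (substitute your own measurements from step 1):

```python
# Hypothetical workload numbers for sizing a dedicated endpoint
peak_rps = 50                 # peak requests/sec
avg_prompt_tokens = 2000      # average prompt (prefill) tokens per request
avg_completion_tokens = 150   # average completion (decode) tokens per request

# Tokens/sec the endpoint must ingest and generate at peak
prefill_tps = peak_rps * avg_prompt_tokens
decode_tps = peak_rps * avg_completion_tokens

print(f"prefill target: {prefill_tps:,} tokens/sec")  # 100,000
print(f"decode target:  {decode_tps:,} tokens/sec")   # 7,500

# Upper bound on monthly volume if peak were sustained 24/7
monthly_tokens = (prefill_tps + decode_tps) * 3600 * 24 * 30
print(f"~{monthly_tokens / 1e9:.0f}B tokens/month at sustained peak")
```

Decode tokens/sec is usually the binding constraint for p95, since prefill is highly parallel while decode is sequential per request.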
4. Provision the Dedicated Endpoint
Once requirements are known, together.ai:
- Reserves GPUs on the AI Native Cloud, with:
- Tenant-level isolation.
- Encryption in transit and at rest.
- Deploys the model with the together.ai inference engine (for Dedicated Model Inference):
- FlashAttention/TKC-optimized kernel path.
- ATLAS speculators tuned to your model and context.
- CPD enabled for long-context where applicable.
- Exposes an endpoint:
- OpenAI-compatible paths (e.g., /v1/chat/completions, /v1/completions, /v1/responses).
- Dedicated API key / auth controls.
- Observability hooks for latency, tokens, errors.
From your side, you’ll receive:
- Endpoint URL
- Model identifier (e.g., together_completions, together_chat, or a specific model slug)
- API key / credentials
- Region / deployment metadata
Provisioning is usually minutes, not days, which is key when you’re migrating a live workload.
5. Integrate via OpenAI‑Compatible API
If you’re already speaking OpenAI APIs, integration requires no code changes beyond the base URL, model name, and API key.
Node.js example (chat):
```js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.TOGETHER_API_KEY,
  baseURL: "https://api.together.xyz/v1", // dedicated endpoint uses the same interface
});

async function run() {
  const completion = await client.chat.completions.create({
    model: "your-dedicated-model-id",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Summarize this report in 3 bullet points." },
    ],
    max_tokens: 256,
    temperature: 0.3,
    stream: false,
  });
  console.log(completion.choices[0].message);
}

run();
```
Python example (streaming):
```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

def chat_stream(prompt: str):
    stream = client.chat.completions.create(
        model="your-dedicated-model-id",
        messages=[
            {"role": "system", "content": "You are a fast, concise assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        # delta is a pydantic object in current SDKs; content may be None
        delta = chunk.choices[0].delta.content or ""
        if delta:
            print(delta, end="", flush=True)

chat_stream("Explain CPD for long-context inference in simple terms.")
```
Streaming is important for p95 perceived latency: even if your long completion takes a few seconds, getting first token under 500–900ms changes UX dramatically.
6. Route Steady vs Bursty Traffic
To get both lower p95 latency and healthy cost curves, split workloads:
- Steady, predictable traffic → Dedicated Endpoint
- Core conversational flows.
- Internal tools with constant usage.
- High-volume customer interactions.
- Bursty, unpredictable traffic → Serverless Inference
- Launch spikes.
- Marketing campaigns.
- Ad-hoc experimentation and long-tail features.
Implementation pattern:
- Use Dedicated as the default path in production.
- Keep Serverless as overflow:
- If Dedicated is near capacity (backpressure / queue depth high).
- If a sudden burst would cause p95 to slip, let overflow spill to Serverless.
Most teams implement this in their “model gateway” layer with simple routing rules based on current RPS and queue metrics.
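The gateway-layer routing rule can be as simple as a threshold check. This is an illustrative sketch: the thresholds and the queue-depth/RPS signals are assumptions you’d wire up from your own metrics, not a together.ai API.

```python
# Assumed capacity limits from your sizing exercise (illustrative values)
MAX_QUEUE_DEPTH = 32   # backpressure threshold for the dedicated pool
MAX_RPS = 50           # sustained capacity ceiling of the dedicated endpoint

def pick_backend(current_rps: float, queue_depth: int) -> str:
    """Route steady traffic to Dedicated; spill bursts to Serverless."""
    if queue_depth >= MAX_QUEUE_DEPTH or current_rps >= MAX_RPS:
        return "serverless"  # overflow path: same OpenAI-compatible interface
    return "dedicated"       # default production path

print(pick_backend(current_rps=20, queue_depth=4))   # dedicated
print(pick_backend(current_rps=80, queue_depth=4))   # serverless
```

Because both backends speak the same OpenAI-compatible API, the only thing the gateway changes per request is the target endpoint and credentials.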
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Reserved, Isolated GPUs | Allocates dedicated compute for your endpoint only | Predictable p95 latency, no noisy neighbors |
| ATLAS & CPD Runtime | Speeds up decoding and long-context prefill | Up to 2.75x faster inference and stable long-context p95s |
| OpenAI-Compatible API | Exposes standard /chat/completions & /completions interfaces | No code changes, fast migration from existing providers |
| Tenant-Level Isolation | Separates data and traffic, with encryption in transit/at rest | Production-ready security posture and compliance support |
| Dedicated Model or Container | Lets you choose managed engines or bring-your-own runtime | Full control for advanced teams, simplicity for most |
| Scalable GPU Clusters | Enables vertical and horizontal scaling as demand grows | Handle growth without redesigning your architecture |
Ideal Use Cases
- Best for predictable chat or agent traffic: Because Dedicated Model Inference gives you reserved GPUs and an optimized runtime, so your p95 TTFT stays low even during peak hours.
- Best for long-context workloads (docs, RAG, code review): Because CPD separates prefill and decode, your 16K–128K token prompts don’t cause latency spikes or unpredictable SLO violations.
- Best for high-throughput internal tools: Because you can align Dedicated capacity with known daily usage, improving cost-per-1M tokens versus overpaying for serverless on steady workloads.
- Best for custom media or non-standard stacks (Container): Because Dedicated Container Inference lets you run your own engine and model architecture while offloading GPU orchestration and scaling.
Practical Tuning for Lower p95 Latency
Once the endpoint is live, tune it like you’d tune a performance-critical service.
1. Align Batch Size and Concurrency
- Small to moderate batch size for latency-sensitive chat (e.g., 1–4).
- Larger batch sizes for background jobs where p95 matters less.
- Coordinate client-side parallelism (RPS) with backend concurrency to avoid queue buildup.
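One way to keep client-side parallelism aligned with backend concurrency is a semaphore-bounded dispatcher. A minimal sketch, where MAX_IN_FLIGHT is an assumed value you would tune against your endpoint’s configured concurrency, and fake_model stands in for the real API call:

```python
import asyncio

MAX_IN_FLIGHT = 4  # assumed cap; tune to your endpoint's concurrency

async def bounded_call(sem: asyncio.Semaphore, call_model, prompt: str):
    """Run one model call, waiting if too many are already in flight."""
    async with sem:
        return await call_model(prompt)

async def main():
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def fake_model(prompt):  # stand-in for the real client call
        await asyncio.sleep(0.01)
        return f"echo: {prompt}"

    # 10 requests, but never more than MAX_IN_FLIGHT concurrently
    return await asyncio.gather(
        *(bounded_call(sem, fake_model, f"req {i}") for i in range(10))
    )

results = asyncio.run(main())
print(len(results))  # 10
```

Capping in-flight requests at the client means excess load queues in your process, where you can shed or reroute it, instead of building a queue at the endpoint and inflating p95.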
2. Use Streaming for UX p95
Even if backend p95 is ~1.5–2s end-to-end, streaming can put perceived p95 under a second.
- Default to stream: true for chat and voice agents.
- Use max_tokens caps appropriate to each endpoint to prevent outlier long generations.
3. Match Endpoint to Traffic Class
Instead of “one endpoint for everything,” split:
- Low-latency chat endpoint: smaller max context, lower batch size, aggressive SLOs.
- Heavy summarization endpoint: larger context, larger batch, relaxed p95 if offline.
- Evaluation / experimentation endpoint: separate from prod so testing doesn’t disturb p95.
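In code, the split often amounts to a per-traffic-class config table in the gateway. The model IDs and limits below are placeholders, not real together.ai identifiers:

```python
# Hypothetical per-class endpoint configs; model IDs are placeholders
ENDPOINTS = {
    "chat": {
        "model": "your-chat-endpoint-id",
        "max_tokens": 256,
        "stream": True,       # streaming keeps perceived p95 low
    },
    "summarize": {
        "model": "your-summarization-endpoint-id",
        "max_tokens": 1024,
        "stream": False,      # offline job, relaxed p95
    },
}

def params_for(traffic_class: str) -> dict:
    """Look up request defaults for a traffic class."""
    return dict(ENDPOINTS[traffic_class])

print(params_for("chat")["model"])
```

Keeping these defaults in one table makes it easy to point a traffic class at a new endpoint without touching call sites.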
4. Monitor & Iterate
Instrument:
- TTFT and total latency p50/p95.
- Prefill tokens/sec and decode tokens/sec.
- Error rates and rate-limit signals (if applicable).
Then work with together.ai to:
- Adjust GPU type or count if you’re consistently running at high utilization.
- Tune ATLAS speculators and decoding hyperparams where appropriate.
- Evaluate quantization options if you need more throughput per GPU without hurting quality.
Limitations & Considerations
- Minimum traffic expectations: Dedicated Endpoints shine when you have predictable or steady traffic; for low or sporadic workloads, Serverless Inference is more cost-efficient.
- Capacity planning overhead: You’ll need to periodically re-evaluate capacity as traffic scales; together.ai can help with sizing, but you should monitor usage and update requirements.
- Model rigidity per endpoint: Each Dedicated Model Endpoint is typically bound to a specific model/config; you may need multiple endpoints if you rely on several models or drastically different configs.
Pricing & Plans
Dedicated Endpoints are priced based on:
- Underlying GPU capacity (type and count).
- Model class and context length.
- Usage profile (steady-state throughput and SLO requirements).
In practice:
- Dedicated Model Inference: Best for teams that want the best economics in the market on popular open-source/partner models with no engine maintenance.
- Dedicated Container Inference: Best for teams that have their own runtimes or generative media models and want fully managed infrastructure and guaranteed capacity without re-writing their stack.
For exact pricing and right-sizing by workload (e.g., 30B tokens/month, target p95 < 800ms), reach out to Together sales.
Frequently Asked Questions
Do I have to change my application code to use a together.ai Dedicated Endpoint?
Short Answer: Typically no; you can keep your existing OpenAI-style client and just change the base URL, model name, and API key.
Details: together.ai’s APIs are OpenAI-compatible: /v1/chat/completions and /v1/completions work with standard SDKs in Python, Node.js, Go, etc. For most applications, you only need to:
- Update the baseURL to https://api.together.xyz/v1.
- Use the Dedicated Endpoint’s model identifier instead of your previous model.
- Rotate to the new API key.
- Rotate to the new API key.
If you’re moving from together.ai Serverless to Dedicated, the interface is identical; you can even run both behind a single gateway and toggle via config.
How much can a Dedicated Endpoint actually improve my p95 latency?
Short Answer: Expect meaningfully lower and more stable p95s vs shared serverless, especially at higher RPS and longer contexts—often in the range of 2x improvement in real deployments.
Details: With Dedicated Inference, you remove noisy neighbor issues and cold starts, and you get:
- Reserved GPUs tuned for your workload.
- ATLAS and CPD for faster decoding and long-context handling.
- Together Kernel Collection (FlashAttention lineage) for optimized attention and KV cache.
In practice, customers have seen:
- Up to 2.75x faster inference on open-source models.
- 2x reduction in latency and roughly one-third cost savings, as reported by Salesforce AI Research on steady workloads.
Your exact gains depend on model choice, context length, and traffic pattern, but Dedicated is consistently better on p95 than general-purpose serverless at the same load.
Summary
If your AI traffic is no longer “toy” and you’re watching p95 latency the way you watch error rates, a together.ai Dedicated Endpoint is the right next step. You get reserved, isolated GPUs; an inference stack built from research-grade components (FlashAttention, ATLAS, CPD, Together Kernel Collection); and an OpenAI-compatible API that lets you migrate without rewrites.
Use Dedicated Model Inference for steady, latency-sensitive production workloads, keep Serverless Inference for bursty traffic and experimentation, and lean on together.ai’s AI Native Cloud to give you better p95s, higher throughput, and stronger economics than trying to self-manage GPUs.