
How do I set up a together.ai Dedicated Endpoint for steady traffic and lower p95 latency?
For steady, predictable workloads, Dedicated Endpoints on together.ai let you trade “shared pool” variability for reserved, isolated capacity—so your p95 latency becomes a product feature you can actually control.
Quick Answer: A together.ai Dedicated Endpoint (Dedicated Model Inference) is a reserved, isolated inference endpoint backed by fixed GPU capacity and the Together inference engine. It’s ideal when you have steady or latency-sensitive traffic and want lower, more stable p95 latency than typical serverless pools can guarantee.
The Quick Overview
- What It Is: A Dedicated Endpoint is a Together AI “Dedicated Model Inference” deployment: a single-model endpoint backed by reserved GPUs and the Together inference stack (ATLAS, CPD, Together Kernel Collection).
- Who It Is For: Teams with predictable or steady traffic, latency-sensitive apps (chat, agents, realtime tools), and high-throughput production workloads that need stable p95/p99 latency and tight unit economics.
- Core Problem Solved: Removes the noisy-neighbor effects and capacity variance of shared serverless pools, so you can meet SLOs (e.g., p95 TTFT < 700 ms, p95 full-completion latency < 2 s for 512–1k tokens) while keeping cost per 1M tokens low.
How It Works
Dedicated Model Inference gives you your own model endpoint running on reserved, tenant-isolated GPU instances managed by the Together AI Native Cloud. You keep the OpenAI-compatible API, but traffic no longer competes with other tenants on a shared pool—which is where most p95/p99 latency outliers come from.
At a high level, you:
- Profile Your Workload: Estimate QPS, input/output token sizes, and latency targets (p50/p95).
- Provision Dedicated Capacity: Work with Together (or use the console) to spin up a Dedicated Model Inference endpoint with right-sized GPUs and model config.
- Integrate the Endpoint: Swap your serverless base URL for the dedicated endpoint URL in your existing OpenAI-compatible client and tune parameters (e.g., max_tokens, temperature, batching) for latency.
1. Workload profiling: know what you’re optimizing p95 against
To get meaningful p95 improvements, you need to pin down four numbers:
- Steady QPS: Typical and peak QPS over 1–5 minutes (e.g., 5 QPS baseline, 15 QPS peak).
- Token profile:
  - Avg input tokens per request (prompt + history)
  - Avg / max output tokens per request
- Latency target:
  - Time-to-first-token (TTFT) p95
  - Full-completion latency p95
- Concurrency shape: How spiky your traffic is over a minute, hour, or day.
These shape:
- How many GPUs you reserve
- Whether you batch for throughput or prioritize per-request latency
- Whether you need long-context optimizations (CPD) for 8k, 32k, 100k+ token sequences
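As a rough illustration of how these four numbers turn into sizing inputs, the sketch below applies Little's law (in-flight requests = arrival rate × time in system) and simple token arithmetic. The function name `estimateLoad` and the sample token counts are assumptions made for this example, not Together tooling or measured data; the 15 QPS peak and 2 s p95 target reuse the figures mentioned above.

```javascript
// Rough capacity inputs derived from a workload profile.
// All numbers are illustrative; substitute values from your own logs.
function estimateLoad({ peakQps, avgInputTokens, avgOutputTokens, p95LatencySec }) {
  // Little's law: average in-flight requests = arrival rate x time in system.
  // Plugging in the p95 target instead of the mean gives a conservative figure.
  const concurrentRequests = peakQps * p95LatencySec;
  // Tokens the endpoint must ingest (prefill) and generate (decode) per second.
  const prefillTokensPerSec = peakQps * avgInputTokens;
  const decodeTokensPerSec = peakQps * avgOutputTokens;
  return { concurrentRequests, prefillTokensPerSec, decodeTokensPerSec };
}

const load = estimateLoad({
  peakQps: 15,           // peak QPS from the example above
  avgInputTokens: 1200,  // hypothetical
  avgOutputTokens: 300,  // hypothetical
  p95LatencySec: 2,      // p95 full-completion target
});
console.log(load);
// { concurrentRequests: 30, prefillTokensPerSec: 18000, decodeTokensPerSec: 4500 }
```

These figures are a starting point for the sizing conversation, not a GPU count: actual throughput per GPU depends on the model, context length, and batching configuration.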
2. Provisioning a Dedicated Endpoint on together.ai
Together’s Dedicated Model Inference is:
“An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine. Best for predictable or steady traffic, latency-sensitive applications, and high-throughput production workloads.”
The typical setup flow:
- Choose model + deployment mode
  - Pick a model from Together’s catalog (e.g., Llama, Mixtral, Qwen, Code models, etc.).
  - Decide: Dedicated Model Inference vs Dedicated Container Inference
    - Use Dedicated Model Inference if you want Together’s optimized runtime and minimal ops.
    - Use Dedicated Container Inference if you have your own engine or non-standard runtime.
- Size capacity for your SLO
  - Based on your workload profile, Together will recommend:
    - GPU type & count (e.g., 1–N GPUs per endpoint)
    - Concurrency limits
    - Batching configuration for best latency/throughput tradeoff
  - You can start small and scale up as you validate p95.
- Get endpoint URL and API key
  - Together provisions your dedicated endpoint with:
    - A unique base URL (e.g., https://api.together.ai/v1/dedicated/<your-endpoint-id>)
    - The same OpenAI-compatible API semantics you already use on serverless.
- Configure security & isolation
  - Tenant-level isolation on reserved compute
  - Encryption in transit and at rest
  - SOC 2 Type II control environment
  - Your data and models remain fully under your ownership
- Go live and iterate
  - Start sending a subset of traffic
  - Measure p50/p95/p99, tokens/sec, cost per 1M tokens
  - Tune model parameters and batching strategy
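One way to "start sending a subset of traffic" is a deterministic percentage rollout keyed on a stable identifier, so a given user always lands on the same endpoint while you compare latency. The sketch below is an assumption about how you might do this in your own client code; `hashToBucket`, `pickBaseUrl`, and the serverless placeholder URL are hypothetical names, and the dedicated URL simply follows the example format shown in this guide.

```javascript
// Deterministically route a fixed percentage of users to the dedicated endpoint.
// FNV-1a-style 32-bit hash over UTF-16 code units: stable across requests and processes.
function hashToBucket(key) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned 32-bit
  }
  return hash % 100; // bucket in [0, 99]
}

// Users whose bucket falls below dedicatedPercent go to the dedicated endpoint.
function pickBaseUrl(userId, dedicatedPercent, urls) {
  return hashToBucket(userId) < dedicatedPercent ? urls.dedicated : urls.serverless;
}

const urls = {
  dedicated: "https://api.together.ai/v1/dedicated/YOUR_ENDPOINT_ID", // example format from this guide
  serverless: "YOUR_SERVERLESS_BASE_URL", // placeholder: whatever base URL you use today
};

// Send roughly 10% of users to the dedicated endpoint during validation.
console.log(pickBaseUrl("user-123", 10, urls));
```

Raising `dedicatedPercent` step by step (for example 10, 50, 100) lets you validate p95 on real traffic before a full cutover.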
3. Integrate the Dedicated Endpoint and tune for p95 latency
Once your Dedicated Model Inference endpoint is ready, integration is usually a one-line change in your client.
Example: OpenAI-compatible client (Node.js)
```javascript
import OpenAI from "openai";

// Point the OpenAI-compatible client at the dedicated endpoint.
const client = new OpenAI({
  apiKey: process.env.TOGETHER_API_KEY,
  baseURL: "https://api.together.ai/v1/dedicated/YOUR_ENDPOINT_ID", // dedicated base URL
});

async function runPrompt() {
  const completion = await client.chat.completions.create({
    model: "meta-llama/Meta-Llama-3-70B-Instruct", // or the model you deployed
    messages: [
      { role: "system", content: "You are a concise, reliable assistant." },
      { role: "user", content: "Summarize this 300-word document in 3 bullet points." },
    ],
    max_tokens: 256, // cap output length for lower latency
    temperature: 0.2,
    stream: false, // set to true for lower TTFT (the response then arrives as chunks to iterate over)
  });
  console.log(completion.choices[0].message.content);
}

// Run the example and surface any API or network error.
runPrompt().catch(console.error);
```
Example: cURL
```bash
curl https://api.together.ai/v1/dedicated/YOUR_ENDPOINT_ID/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a concise, reliable assistant."},
      {"role": "user", "content": "Help me debug this function..."}
    ],
    "max_tokens": 256,
    "temperature": 0.1,
    "stream": true
  }'
```
From there, you tune:
- max_tokens and prompt length to cut unnecessary work
- Streaming vs non-streaming depending on whether you care more about TTFT or full completion
- Batch size and concurrency limits (configured during setup) to keep GPUs busy without hurting p95
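To judge the streaming tradeoff on your own traffic, it helps to record TTFT and total latency per request. The helper below is a minimal sketch, assuming the stream yields OpenAI-style chunks where text arrives in `choices[0].delta.content`; `measureStream` and its injectable `now` clock are hypothetical names introduced for this example.

```javascript
// Measure time-to-first-token (TTFT) and total latency for a streamed response.
// startStream: function that sends the request and resolves to an async iterable
//              of OpenAI-style chunks.
// now: clock in milliseconds, injectable so the helper can be tested offline.
async function measureStream(startStream, now = () => performance.now()) {
  const start = now();                // timer starts before the request is sent
  const stream = await startStream();
  let firstTokenAt = null;
  let text = "";
  for await (const chunk of stream) {
    const piece = chunk.choices?.[0]?.delta?.content ?? "";
    // The first chunk often carries only the role, so wait for real content.
    if (piece && firstTokenAt === null) firstTokenAt = now();
    text += piece;
  }
  const end = now();
  return {
    ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
    totalMs: end - start,
    text,
  };
}
```

With the OpenAI Node client, a call such as `measureStream(() => client.chat.completions.create({ ...params, stream: true }))` should work, since a streaming request resolves to an async iterable of chunks. Feeding `ttftMs` and `totalMs` into your metrics gives you the p95 numbers to tune against.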
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Reserved, Isolated Compute | Allocates dedicated GPUs and runtime for your endpoint only | Stable, predictable p95/p99 latency |
| Together Inference Engine | Runs models on ATLAS, CPD, Together Kernel Collection, FlashAttention, etc. | Up to 2.75x faster inference at lower unit cost |
| OpenAI-Compatible API | Keeps the same API shape as OpenAI / serverless Together | Minimal code changes (typically a base URL swap); simple cutover from serverless |
| Production-Grade Security | Tenant-level isolation; encryption in transit/at rest; SOC 2 Type II | Safe for regulated workloads and sensitive data |
| Latency & Throughput Tuning | Right-size GPUs, batching, and model config to your workload | Meet strict SLOs while optimizing cost per 1M tokens |
| Support for Batch & Real-time | Combines real-time dedicated endpoints with batch jobs | Process up to 30B tokens asynchronously at up to 50% less cost |
Ideal Use Cases
- Best for steady, interactive traffic: Because Dedicated Model Inference removes noisy neighbors, you get more stable p95 for chatbots, copilots, and agentic workflows that need sub-second TTFT.
- Best for high-throughput pipelines: Because you own reserved capacity, you can batch aggressively for throughput while still controlling p95—for content generation, rerank, or complex multi-step workflows.
If you have huge offline workloads (e.g., labeling or summarizing a data lake), combine Dedicated Inference (for your always-on product) with Batch Inference for your big jobs:
- Batch can process up to 30 billion tokens asynchronously at up to 50% less cost.
- Dedicated stays focused on meeting tight latency SLOs for your end-user traffic.
Limitations & Considerations
- Requires traffic predictability: Dedicated endpoints shine when you have reasonably steady baseline load. If your traffic is extremely spiky with long idle periods, keep a serverless endpoint in the mix for burst handling, or start small on dedicated and scale up gradually.
- Capacity planning matters: Under-sizing GPUs will hurt p95 during peaks; over-sizing wastes budget. Plan around real QPS and token profiles, and iterate with Together’s team using your actual logs.
- One model per endpoint: Dedicated Model Inference is optimized around a specific model. If you need many different models/runtimes, use a mix of Dedicated Model Inference (for critical paths) and either serverless or Dedicated Container Inference for the rest.
Pricing & Plans
Pricing depends on:
- Model type and size (e.g., 8B vs 70B parameters, text vs multimodal)
- GPU class and count reserved for your endpoint
- Expected utilization (steady QPS and average tokens/request)
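Because reserved capacity is paid for whether or not it is busy, utilization drives the effective cost per 1M tokens. The arithmetic can be sketched as follows; the hourly GPU price and sustained throughput are made-up inputs for illustration, not Together pricing or benchmarks, and `costPerMillionTokens` is a hypothetical helper.

```javascript
// Effective cost per 1M tokens for reserved capacity.
// gpuHourlyUsd and sustainedTokensPerSec are hypothetical inputs, not real prices.
function costPerMillionTokens({ gpuCount, gpuHourlyUsd, sustainedTokensPerSec, utilization }) {
  const hourlyCostUsd = gpuCount * gpuHourlyUsd;
  // Tokens actually served per hour: peak throughput scaled by average utilization (0..1).
  const tokensPerHour = sustainedTokensPerSec * utilization * 3600;
  return (hourlyCostUsd / tokensPerHour) * 1e6;
}

const base = { gpuCount: 2, gpuHourlyUsd: 3, sustainedTokensPerSec: 5000 }; // illustrative only
console.log(costPerMillionTokens({ ...base, utilization: 1.0 }).toFixed(2)); // "0.33"
console.log(costPerMillionTokens({ ...base, utilization: 0.5 }).toFixed(2)); // "0.67"
```

Halving utilization doubles the effective cost per token, which is why steady traffic is the case where dedicated capacity pays off.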
Typical patterns:
- Steady QPS + aggressive SLOs → Dedicated Model Inference
  - You reserve capacity, tune it for your SLO, and fully amortize GPU cost over steady load.
- Burst-heavy workloads → Serverless + Dedicated Hybrid
  - Use Dedicated as the primary path, and fall back to Serverless Inference when demand exceeds dedicated capacity.
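The hybrid pattern can be implemented client-side as a primary call with a fallback. The sketch below assumes that "demand exceeds dedicated capacity" surfaces to your client as an error or a slow response; that signal, the `withFallback` name, and the timeout value are assumptions for illustration rather than documented Together behavior.

```javascript
// Try the dedicated endpoint first; fall back to serverless on error or timeout.
// primary / fallback: functions returning a promise for the completion.
async function withFallback(primary, fallback, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("primary timed out")), timeoutMs);
  });
  try {
    const result = await Promise.race([primary(), timeout]);
    return { source: "dedicated", result };
  } catch (err) {
    // Note: a timed-out primary request is not cancelled here; it is only ignored.
    const result = await fallback();
    return { source: "serverless", result };
  } finally {
    clearTimeout(timer);
  }
}
```

In use, `primary` and `fallback` would wrap the same `chat.completions.create` call on two clients that differ only in `baseURL`. Logging `source` shows how often you spill over, which is a useful signal for resizing the dedicated endpoint.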
To get an exact quote and capacity plan, share your workload profile with Together. The two typical options are:
- Standard Dedicated Endpoint: Best for product teams with known traffic patterns needing stable p95 and the best price-performance on a specific model.
- Enterprise Dedicated + GPU Clusters: Best for teams with multiple models, fine-tuning needs, or very high throughput, who want per-tenant GPU clusters plus Dedicated Container Inference.
Frequently Asked Questions
How much p95 latency improvement can I expect from a together.ai Dedicated Endpoint?
Short Answer: Many teams see p95 latency drop by 30–50% versus shared serverless pools, with far fewer p99 spikes, especially for steady, mid-to-high utilization workloads.
Details:
On serverless, your request competes for pooled capacity and may encounter cold starts, queueing, or contention during global peaks. With Dedicated Model Inference:
- You have reserved GPUs sized to your QPS and token profile.
- Together’s runtime (ATLAS for speculative decoding, CPD for long-context prefill–decode split, Together Kernel Collection for optimized kernels) is tuned for your model and GPU type.
- There’s no noisy neighbor effect, so variance shrinks, which is what actually improves p95/p99—even when p50 stays similar.
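The point about variance is easy to see with a nearest-rank percentile over two latency samples that share a median but differ in the tail. The samples below are synthetic and invented purely to illustrate the arithmetic; they are not measurements of any Together endpoint.

```javascript
// Nearest-rank percentile over latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Synthetic samples: identical except for the two slowest requests.
const sharedPool = [400, 410, 420, 430, 440, 450, 460, 470, 900, 1800];
const dedicated = [400, 410, 420, 430, 440, 450, 460, 470, 480, 520];

console.log(percentile(sharedPool, 50), percentile(sharedPool, 95)); // 440 1800
console.log(percentile(dedicated, 50), percentile(dedicated, 95));   // 440 520
```

Both samples have the same p50, yet the p95 differs by more than 3x because only the tail changed. That is the effect to look for when comparing your own serverless and dedicated logs.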
Measured outcomes vary by model and prompt shape, but it’s common to see:
- Up to 2.75x faster inference on supported models compared to unoptimized stacks.
- Consistent sub-second TTFT p95 for mid-length prompts with streaming.
- More predictable p99, which is critical for user-facing products.
How do I decide between Dedicated Model Inference and Dedicated Container Inference?
Short Answer: Use Dedicated Model Inference if you want Together’s optimized engine and minimal ops; use Dedicated Container Inference when you must run your own model runtime or custom pipeline.
Details:
- Dedicated Model Inference
  - You pick a model from Together’s catalog.
  - Together manages the entire runtime (FlashAttention, ATLAS, CPD, quantization, etc.).
  - Best for: predictable traffic, low p95, high-throughput text/code/embedding workloads with minimal operational overhead.
- Dedicated Container Inference
  - You bring your own container with a custom engine/model graph.
  - Together runs it on fully-managed, scalable infrastructure.
  - Best for:
    - Generative media models (image, video, audio with custom runtimes)
    - Non-standard runtimes (e.g., specific CUDA/cuDNN stacks)
    - Custom inference pipelines (multi-model graphs, pre/post processing in-container)
If your goal is simply “steady traffic, lower p95 latency, minimal ops,” start with Dedicated Model Inference. Move to Dedicated Container Inference only when your runtime requirements go beyond what Together’s inference engine provides.
Summary
Setting up a together.ai Dedicated Endpoint (Dedicated Model Inference) is the most direct way to turn unpredictable p95 latency into a controllable SLO for steady workloads. You reserve isolated GPUs, run on Together’s research-grade inference engine (ATLAS, CPD, Together Kernel Collection), and keep the familiar OpenAI-compatible API—so the migration is typically a base URL change plus small tuning around tokens and streaming.
For latency-sensitive, high-throughput applications, the pattern is clear:
- Serverless Inference for variable or unpredictable traffic.
- Dedicated Model Inference for steady, latency-sensitive workloads.
- Batch Inference for massive offline jobs.
Together, they let you build on the AI Native Cloud with the best economics and latency profile for each workload.