
How do I run a large offline job using together.ai Batch Inference (async), and track status + spend?
Most teams reach for together.ai Batch Inference the moment a “single request per user” pattern breaks down — big offline summarization runs, dataset-wide classification, or synthetic data generation where latency doesn’t matter but cost and throughput do. The good news: you can push up to 30 billion tokens per model, at up to 50% lower cost than real-time serverless, and still keep tight visibility into job status and spend.
This guide walks through how to run a large offline job with together.ai Batch Inference (async), and how to track job progress, failures, and token-level cost as it runs.
The Quick Overview
- What It Is: together.ai Batch Inference is an asynchronous API for processing massive workloads (up to 30B tokens per model) against any serverless model or dedicated deployment.
- Who It Is For: AI product and data teams running large offline jobs — dataset classification, offline summarization, synthetic data generation, and backfills — where you care about cost per 1M tokens more than per-request latency.
- Core Problem Solved: Instead of orchestrating your own GPU queues, retries, and cost tracking, Batch Inference lets you submit a single job spec, scale out across the AI Native Cloud, and get predictable cost and completion times (<24h SLA) with detailed usage metrics.
How Batch Inference Works (Async Model)
At a high level, Batch Inference is “submit once, process at scale, retrieve results later.” You send a JSONL file or structured job spec describing thousands to millions of prompts. together.ai schedules them across its inference engine (Serverless Inference or Dedicated Model Inference), and you periodically query job status or configure callbacks to consume results when ready.
Under the hood, the same research-to-production stack that powers real-time inference also drives batch:
- Throughput from systems research: Together Kernel Collection, FlashAttention-4, and ThunderKittens kernels boost tokens/sec.
- Cost efficiency from ATLAS: The AdapTive-LeArning Speculator System accelerates decoding for popular open-source models.
- Scalability & isolation: Jobs are distributed across tenant-isolated infrastructure with encryption in transit and at rest; your data and models remain fully under your ownership.
The async flow breaks down into three main phases:
- Define and submit the batch job: Describe your workload (model, request parameters such as temperature and max_tokens, and the list of items to process, often as JSONL), then send it to the Batch Inference endpoint to receive a job_id.
- Track job status and intermediate progress: Poll the Batch API using the job_id to see whether the job is queued, running, partially completed, completed, or failed. Jobs are targeted to finish within 24 hours, so the pattern is “submit and forget” rather than constant supervision.
- Retrieve results and analyze spend: Once complete, download outputs (e.g., classifications, summaries, generations) and inspect token metrics. Use together.ai’s usage and billing views to understand tokens processed, per-model spend, and cost per 1M tokens.
Step 1: Design Your Offline Job for Batch Inference
Before touching the API, shape your workload for async processing.
Choose the right model and deployment mode
Batch Inference can run against:
- Any serverless model. Best when:
  - You have variable or infrequent offline workloads.
  - You want no-ops infrastructure and pay-as-you-go pricing.
  - You’re experimenting with models for a one-off backfill.
- Dedicated Model Inference (private endpoints). Best when:
  - You have predictable or recurring offline jobs.
  - You need tight SLOs on throughput and start times.
  - You want to combine real-time and batch traffic on the same dedicated deployment.
In both cases, model access is “universal”: any serverless model or private deployment can be targeted by Batch Inference — no special “batch-only” models required.
Pick the right task shape
Batch Inference is optimized for:
- Classifying large datasets: e.g., tagging millions of documents, routing tickets, labeling customer interactions.
- Offline summarization: e.g., compressing long PDFs, call transcripts, or log streams where latency is irrelevant but cost is critical.
- Synthetic data generation: e.g., generating instruction-tuning pairs, QA datasets, or augmentations for model shaping.
If your workload is highly interactive (chat-like agents, live tools, or streaming outputs), stick with Serverless Inference or Dedicated Inference instead.
Prepare input data (JSONL)
Most large jobs are easiest to manage with JSON Lines (JSONL), where each line is one request:
{"custom_id": "doc_0001", "input": "Summarize the following support ticket...", "metadata": {"user_segment": "SMB"}}
{"custom_id": "doc_0002", "input": "Summarize the following support ticket...", "metadata": {"user_segment": "Enterprise"}}
Best practices:
- Include a stable custom_id so you can map outputs back to your internal IDs.
- Optionally attach metadata (e.g., cohort, source system), which helps with slice-wise spend and quality analysis.
- Keep per-item inputs within your chosen model’s context window; for very long-context workloads, pair Batch with models and runtimes that benefit from CPD (prefill–decode disaggregation) to keep throughput high.
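As a rough sketch of the practices above, the following Python helper turns in-memory records into JSONL text with one request per line and rejects duplicate custom_id values. The record keys (id, text, metadata) and the helper name to_jsonl are hypothetical; the output fields simply mirror the illustrative lines shown earlier, so adjust them to whatever the current Batch docs require.

```python
import json


def to_jsonl(records):
    """Serialize records to JSONL text, one request per line.

    Each record is a dict with "id", "text" and optional "metadata" keys
    (hypothetical names for this sketch). Raises ValueError on duplicate ids.
    """
    seen = set()
    lines = []
    for rec in records:
        custom_id = str(rec["id"])
        if custom_id in seen:
            raise ValueError(f"duplicate custom_id: {custom_id}")
        seen.add(custom_id)
        item = {
            "custom_id": custom_id,
            "input": rec["text"],
            "metadata": rec.get("metadata", {}),
        }
        # json.dumps escapes newlines inside strings, so each item stays on one line.
        lines.append(json.dumps(item))
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file gives you the upload artifact, and the duplicate check catches ID collisions before they make result joins ambiguous.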
Step 2: Submit a Batch Job (Async)
The exact endpoint and schema can evolve, so check the latest Batch docs and treat the request below as illustrative. The flow follows an OpenAI-compatible pattern.
Example: Creating a batch job
curl https://api.together.xyz/v1/batches \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "input_file": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_params": {
      "temperature": 0.2,
      "max_tokens": 512
    }
  }'
A typical request includes:
- model: any serverless model name or your dedicated deployment identifier.
- input_file: reference to an uploaded JSONL file containing all your items.
- endpoint: the API path to run against (e.g., chat completions).
- completion_params: shared parameters applied to all items (temperature, max_tokens, etc.).
The response will include an id (your batch_id / job_id) and a status:
{
  "id": "batch_job_123",
  "status": "queued",
  "model": "meta-llama/Meta-Llama-3-70B-Instruct",
  "created_at": 1712345678
}
Persist this ID in your tracking system; you’ll need it to monitor the job and retrieve results.
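For teams that prefer Python over curl, here is a minimal sketch of the same request using only the standard library. It builds the request without sending it, and it assumes the illustrative payload fields and response shape shown above; build_batch_request and extract_job_id are hypothetical helper names, not part of any SDK.

```python
import json
import urllib.request

# URL taken from the curl example above; verify it against the current docs.
BATCH_URL = "https://api.together.xyz/v1/batches"


def build_batch_request(api_key, model, input_file, max_tokens=512, temperature=0.2):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "input_file": input_file,
        "endpoint": "/v1/chat/completions",
        "completion_params": {"temperature": temperature, "max_tokens": max_tokens},
    }
    return urllib.request.Request(
        BATCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_job_id(response_body):
    """Pull the id out of a create response shaped like the example above."""
    data = json.loads(response_body)
    if "id" not in data:
        raise KeyError("response has no 'id' field")
    return data["id"]
```

Sending is then a matter of passing the request to urllib.request.urlopen and handing the response body to extract_job_id, after confirming the field names against the current API reference.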
Step 3: Track Batch Status and Progress
Once the job is created, use the async lifecycle to see where it is in the pipeline.
Poll job status
curl https://api.together.xyz/v1/batches/batch_job_123 \
  -H "Authorization: Bearer $TOGETHER_API_KEY"
Typical status values:
- queued: job accepted, waiting for resources.
- running: items are actively being processed.
- completed: all items processed (check for per-item errors in results).
- failed: job-level failure; inspect error details.
- cancelling / cancelled: if you explicitly cancel a job.
Alongside status, the API typically returns counters like:
- total_count: total number of items.
- completed_count: items processed successfully.
- failed_count: items that hit per-item errors (e.g., input too long).
These fields let you build a simple monitoring loop:
- Submit a job.
- Every N minutes, poll status and counts.
- Trigger notifications (Slack, email, PagerDuty) if:
  - status == "failed", or
  - status == "completed" but failed_count > 0.
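The monitoring loop above can be sketched in Python as follows. This is a minimal illustration, assuming the status values and counter fields listed earlier; fetch_status and notify are caller-supplied callables (hypothetical here) that would wrap the status request and your alerting channel.

```python
import time

# Terminal states, assuming the status values listed above.
TERMINAL = {"completed", "failed", "cancelled"}


def needs_alert(job):
    """True for a job-level failure or a completed job with per-item errors."""
    status = job.get("status")
    if status == "failed":
        return True
    return status == "completed" and job.get("failed_count", 0) > 0


def progress(job):
    """Fraction of items finished (successfully or not), from 0.0 to 1.0."""
    total = job.get("total_count") or 0
    if total == 0:
        return 0.0
    done = job.get("completed_count", 0) + job.get("failed_count", 0)
    return done / total


def monitor(fetch_status, notify, interval_s=300, max_polls=288, sleep=time.sleep):
    """Poll until the job reaches a terminal state, alerting when needed.

    The defaults (288 polls, 5 minutes apart) cover roughly 24 hours.
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job.get("status") in TERMINAL:
            if needs_alert(job):
                notify(job)
            return job
        sleep(interval_s)
    raise TimeoutError("job did not reach a terminal status within the polling budget")
```

Injecting sleep and the two callables keeps the loop testable without network access and makes it easy to swap in a scheduler or webhook later.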
Time-to-completion expectations
together.ai targets a <24-hour processing-time SLA for large jobs, with many finishing within a few hours depending on:
- Number of items and tokens per item.
- Model choice and quantization.
- Whether you’re on serverless or a dedicated deployment tuned for batch.
If your business needs a tighter SLO, consider:
- Dedicated Model Inference just for batch workloads.
- Coordinating batch windows during off-peak hours to leverage more capacity.
Step 4: Retrieve Outputs from the Batch Job
Once your job’s status is completed, you can download the results.
Download the result file
curl https://api.together.xyz/v1/batches/batch_job_123/results \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -o batch_job_123_results.jsonl
Each line typically contains:
- Your original custom_id and input/metadata.
- The model’s output (e.g., completion, classification, summary).
- Token usage for that item (input tokens, output tokens).
Example output line (shape depends on endpoint):
{
  "custom_id": "doc_0001",
  "input": "Summarize the following support ticket...",
  "output": {
    "summary": "Customer reports intermittent login failures..."
  },
  "usage": {
    "prompt_tokens": 452,
    "completion_tokens": 128,
    "total_tokens": 580
  },
  "status": "success"
}
Use this to:
- Join back to your source tables using custom_id.
- Validate coverage (every custom_id should appear exactly once).
- Compute per-item and per-cohort token usage.
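The coverage check in particular is easy to automate. The sketch below assumes each result line is a JSON object carrying the custom_id you submitted, as in the example output line; check_coverage is a hypothetical helper name.

```python
import json
from collections import Counter


def check_coverage(expected_ids, result_lines):
    """Compare submitted ids with the custom_id values found in the results."""
    counts = Counter()
    for line in result_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the downloaded file
        counts[json.loads(line)["custom_id"]] += 1
    expected = set(expected_ids)
    found = set(counts)
    return {
        "missing": sorted(expected - found),        # submitted but never returned
        "duplicated": sorted(cid for cid, n in counts.items() if n > 1),
        "unexpected": sorted(found - expected),     # returned but never submitted
    }
```

An empty list under all three keys means every custom_id appeared exactly once, which is the condition to verify before joining results back to source tables.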
Handling per-item errors
If an item fails (e.g., input too long, malformed JSON), its status might be error with an error payload.
Typical workflow:
- Filter result lines with status != "success".
- Log or store them in a “re-run” table.
- Optionally correct inputs (truncate, clean text) and resubmit a small follow-up batch.
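A minimal sketch of that workflow, assuming result lines shaped like the example output (a status field plus the echoed input): split_results and build_rerun_items are hypothetical helpers, and the character cap is a crude stand-in for real token-based truncation.

```python
import json


def split_results(result_lines):
    """Separate successful rows from rows that need attention."""
    ok, failed = [], []
    for line in result_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        (ok if row.get("status") == "success" else failed).append(row)
    return ok, failed


def build_rerun_items(failed_rows, max_chars=8000):
    """Clean and truncate failed inputs for a small follow-up batch.

    max_chars is an illustrative limit; a tokenizer-based limit matched to
    the model's context window would be more accurate.
    """
    items = []
    for row in failed_rows:
        text = row.get("input")
        if not isinstance(text, str) or not text.strip():
            continue  # nothing usable to resubmit; leave for manual review
        items.append({"custom_id": row["custom_id"], "input": text.strip()[:max_chars]})
    return items
```

Rows dropped by build_rerun_items still live in the failed list, so they can be written to the re-run table for inspection.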
Step 5: Track Spend and Token Usage
Batch Inference is designed for price-performance: up to 50% cost savings versus running the same workload via the real-time serverless API for most models. To manage that spend in practice, you’ll want visibility at multiple levels: job, model, and project.
Understand cost drivers
Your total batch cost is essentially:
Total tokens (prompt + completion) × price per 1M tokens (batch rate)
Key levers:
- Model choice and size: Larger models cost more per token but may offer higher quality. For many classification or summarization workloads, smaller or fine-tuned open-source models on Batch Inference hit the best cost–quality balance.
- Prompt design: Shorter, more focused prompts and templates reduce prompt_tokens without sacrificing accuracy.
- max_tokens: Constrain completions to realistic bounds for your task. Summaries usually don’t need 2,000-token caps.
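To make the cost formula concrete, here is a small worked example in Python. The rates are placeholders chosen only to illustrate a 50% difference, not real together.ai prices, and the per-item token counts are borrowed from the example result line.

```python
def batch_cost(prompt_tokens, completion_tokens, price_per_million):
    """Total tokens (prompt + completion) x price per 1M tokens."""
    if min(prompt_tokens, completion_tokens, price_per_million) < 0:
        raise ValueError("token counts and price must be non-negative")
    return (prompt_tokens + completion_tokens) * price_per_million / 1_000_000


# Placeholder numbers, NOT real prices: one million items, each with
# 452 prompt tokens and 128 completion tokens.
ITEMS = 1_000_000
REALTIME_RATE = 0.90  # hypothetical USD per 1M tokens
BATCH_RATE = 0.45     # hypothetical USD per 1M tokens (50% lower)

realtime = batch_cost(452 * ITEMS, 128 * ITEMS, REALTIME_RATE)
batch = batch_cost(452 * ITEMS, 128 * ITEMS, BATCH_RATE)
# Same job with completions half as long, e.g. via a tighter max_tokens.
shorter = batch_cost(452 * ITEMS, 64 * ITEMS, BATCH_RATE)

print(f"real-time ~${realtime:,.2f}, batch ~${batch:,.2f}, shorter outputs ~${shorter:,.2f}")
```

Substituting your model's actual batch rate and measured token counts turns this into a quick pre-flight budget estimate.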
Use together.ai’s usage and billing views
In the together.ai console:
- Navigate to Usage to see:
  - Tokens by model and deployment mode (serverless vs batch vs dedicated).
  - Tokens and cost attributed to the batch API.
  - Time-series metrics to understand spikes (e.g., monthly backfills).
- Use tags/metadata or project-level segregation (when available) to:
  - Attribute batch jobs to teams or cost centers.
  - Separate R&D synthetic data runs from production-quality summarization.
If your batch jobs run against a Dedicated Model Inference deployment, the economics are a combination of:
- Fixed or reserved GPU capacity cost.
- Token volume processed during the billing window.
In that scenario, running large, well-packed batch jobs on dedicated endpoints often gives the best unit economics for steady workloads, because the fixed capacity cost is spread across more tokens.
Build your own internal cost reports
Because each batch result line can include usage, it’s straightforward to push cost analytics into your own data warehouse:
- Ingest JSONL results into your lake/warehouse (e.g., via Airbyte, custom ingestion).
- Extend schema with:
  - prompt_tokens, completion_tokens, total_tokens.
  - Derived estimated_cost using your current batch rate for that model.
- Aggregate:
  - Cost per source system, dataset, or team.
  - Cost per label type or summarization target.
  - Cost per time window (e.g., monthly synthetic data budget).
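Before wiring up a warehouse pipeline, the same aggregation can be prototyped in a few lines of Python. This sketch assumes result lines carry a usage object and echo back the metadata you submitted; if they do not, join on custom_id first. The function name cost_by_group and the user_segment key are illustrative, and the rate is whatever batch price applies to your model.

```python
import json
from collections import defaultdict


def cost_by_group(result_lines, price_per_million, group_key="user_segment"):
    """Sum tokens per metadata group and derive an estimated cost."""
    totals = defaultdict(lambda: {"items": 0, "total_tokens": 0})
    for line in result_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        group = (row.get("metadata") or {}).get(group_key, "unknown")
        usage = row.get("usage") or {}
        tokens = usage.get("total_tokens")
        if tokens is None:
            # Fall back to summing the two components when no total is given.
            tokens = usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)
        totals[group]["items"] += 1
        totals[group]["total_tokens"] += tokens
    return {
        group: {**t, "estimated_cost": t["total_tokens"] * price_per_million / 1_000_000}
        for group, t in totals.items()
    }
```

The output maps each group to its item count, token total, and estimated cost, which is the same shape you would later materialize as a warehouse table.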
This approach keeps finance, data, and infra teams aligned on how Batch Inference is being used and what the payoffs are.
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Scale to 30B tokens | Processes up to 30 billion tokens per model asynchronously in one job. | Run massive offline workloads without managing GPU fleets |
| Up to 50% lower cost vs real-time | Uses batch-optimized pricing and runtime scheduling for most serverless models. | Cut cost per 1M tokens for large jobs by up to half |
| Universal model access | Runs against any serverless model or Dedicated Model Inference endpoint. | Use the best model for each task without switching stacks |
Ideal Use Cases
- Best for offline summarization backfills: Batch Inference can chew through millions of long documents within hours, at up to 50% less cost than equivalent real-time calls, while keeping your interactive endpoints free for user traffic.
- Best for large-scale classification and labeling: You can push entire datasets (support tickets, events, content) through a single async job, track per-item success, and get clean token usage for model shaping and downstream training.
Limitations & Considerations
- Not for interactive or streaming flows: Batch Inference is async with job-level latency measured in minutes/hours. For chatbots, voice agents, or tool-calling that require sub-second responses, use Serverless Inference or Dedicated Model Inference instead.
- Job size and context constraints: While you can process up to 30B tokens per model, each individual item still must respect the model’s context window and token limits. For ultra-long-context needs, choose models and deployments that leverage long-context optimizations like CPD.
Pricing & Plans
Batch Inference is priced to reward large, offline workloads:
- Up to 50% cost savings compared to real-time serverless API calls for most serverless models.
- No long-term commitments required for serverless Batch — ideal for periodic backfills.
When you combine Batch with Dedicated Model Inference, you can further optimize:
- Run steady, high-volume workloads on dedicated endpoints with predictable capacity.
- Burst one-off or experimental runs on serverless Batch without overprovisioning.
Example positioning:
- Serverless Batch: Best for teams with periodic or exploratory offline jobs that don’t justify dedicated capacity but still need 30B-token scale and strong economics.
- Batch on Dedicated Inference: Best for teams with recurring or mission-critical offline workloads that want the best unit economics and tighter SLOs on completion time.
(Refer to together.ai’s pricing page and your sales contact for specific per-model batch rates.)
Frequently Asked Questions
How do I decide between Batch Inference and real-time Serverless Inference?
Short Answer: Use Batch Inference for large offline jobs where you can wait minutes or hours for results; use real-time serverless for user-facing flows that require second-level latency.
Details:
Batch Inference is optimized around throughput and cost. It shines when you want to process millions of items asynchronously — dataset labeling, synthetic data generation, offline summarization. You’ll typically see up to 50% lower cost versus issuing the same number of calls via the real-time serverless API, with a <24-hour SLA for completion.
Serverless Inference focuses on low-latency, on-demand calls. It’s the right choice when:
- A user is waiting on a response.
- You need streaming or interactive behavior.
- Requests arrive unpredictably and you don’t want to manage dedicated capacity.
Many teams do both: Dedicated/Serverless for production user flows, Batch for nightly or monthly backfills on the same models.
Can I run Batch Inference against my own dedicated models and still track per-job spend?
Short Answer: Yes. Batch Inference can target Dedicated Model Inference endpoints, and you can track token usage and derived costs per job.
Details:
When you deploy a model via Dedicated Model Inference, you get a private, tenant-isolated endpoint. Batch Inference can be configured to run against this endpoint, so your offline jobs benefit from:
- The same kernels and optimizations (TKC, FlashAttention-4, ATLAS).
- Predictable throughput tuned to your SLOs.
- Full data ownership: your data and models stay under your control, with encryption in transit and at rest.
For spend tracking:
- Together’s usage views will show token volumes per dedicated endpoint and per API (including batch).
- You can multiply token counts by your effective rate (reserved capacity plus overages) to get per-job cost.
- The per-item usage fields in batch results make it easy to attribute cost at a job, project, or team level in your own systems.
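One way to turn "effective rate" into a number is to spread the dedicated endpoint's cost for a billing window across all tokens processed in that window, then charge each job its share. The sketch below uses that proportional-by-tokens convention, which is an assumption rather than how together.ai bills; the function names and figures are hypothetical.

```python
def effective_rate_per_million(window_cost, window_tokens):
    """Dedicated cost for a billing window divided by tokens served, per 1M tokens."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return window_cost / window_tokens * 1_000_000


def job_cost_on_dedicated(job_tokens, window_cost, window_tokens):
    """Attribute window cost to one job in proportion to its token share."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return window_cost * job_tokens / window_tokens


# Illustrative figures only: a window costing 1,000 (reserved capacity plus
# overages) that served 2B tokens, of which one batch job used 500M.
rate = effective_rate_per_million(1000.0, 2_000_000_000)
job = job_cost_on_dedicated(500_000_000, 1000.0, 2_000_000_000)
print(f"effective rate ~{rate:.2f} per 1M tokens, job share ~{job:.2f}")
```

Other allocation rules (for example by GPU-hours or by wall-clock time) are equally defensible; the point is to pick one convention and apply it consistently across jobs.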
Summary
Running large offline jobs on together.ai Batch Inference (async) lets you:
- Process up to 30B tokens per model without managing GPU queues.
- Achieve up to 50% lower cost than real-time serverless for most models.
- Maintain clear visibility into job status, per-item success, and token-level spend across serverless and dedicated deployments.
By structuring your workload as a batch job, targeting the right deployment mode (serverless vs dedicated), and instrumenting token usage in both the together.ai console and your own warehouse, you turn “massive backfill” into a deterministic, budgetable operation instead of an infrastructure fire drill.