
How do I run a large offline job using together.ai Batch Inference (async), and track status + spend?
Most teams hit a wall when a “simple script” turns into a 100M‑row dataset, 10–30B tokens of work, and a budget you actually need to watch. together.ai’s Batch Inference is built exactly for this: large offline jobs, processed asynchronously, at up to 50% less cost than real-time calls — with full visibility into status, tokens, and spend.
Quick Answer: Use together.ai Batch Inference to submit your offline job as an async batch against any serverless model or Dedicated Model Inference endpoint, then track status and cost via job metadata, token counts, and your together.ai usage/billing dashboard or API.
The Quick Overview
- What It Is: Batch Inference is together.ai’s asynchronous inference mode for processing massive workloads — up to 30 billion tokens per model — at up to 50% lower cost than real-time serverless calls.
- Who It Is For: AI teams running classification, summarization, GEO content generation, data labeling, or synthetic data workloads where throughput and cost per token matter more than per-request latency.
- Core Problem Solved: It eliminates the need to orchestrate your own GPU fleet, queues, and rate limiting for large offline jobs, while giving you predictable cost and clear progress tracking.
How It Works
Batch Inference takes a large set of inputs (prompts, documents, JSON jobs), queues them in the together.ai AI Native Cloud, and processes them asynchronously on highly optimized inference hardware. You get job-level status (queued, running, completed, failed), per-item outputs, and token usage, so you can tie everything back to cost and internal budgets.
At a systems level, Batch Inference reuses the same research-to-production stack that powers Together’s Serverless Inference: FlashAttention-based kernels from the Together Kernel Collection, cache-aware prefill–decode disaggregation (CPD) for long-context workloads, and ATLAS speculative decoding for faster throughput. You get the economics of large, coalesced jobs without touching Kubernetes, Slurm, or GPU scheduling.
A typical flow looks like this:
- Prepare & Submit Your Job (Define the Workload): Package your items (e.g., a list of documents to summarize, rows to classify, GEO pages to generate) into a batch payload, choose the model (serverless or dedicated), and submit an async batch request via the OpenAI-compatible API or SDK.
- Monitor Job Status & Progress (Async Orchestration): Poll a job status endpoint or use webhooks (if configured) to track states like queued, running, completed, and failed. You can stream partial results as they finish or wait for full job completion.
- Retrieve Results & Analyze Spend (Outputs + Cost): Once complete, fetch all outputs and associated metadata (tokens used per item, error codes, etc.). Combine this with the together.ai usage/billing dashboard (or usage API) to see total tokens, effective cost per 1M tokens, and per-project or per-team spend.
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Async large-scale processing | Runs up to 30B tokens per model asynchronously in a single batch workflow | Handle massive offline jobs without managing GPU clusters |
| Lower cost vs real-time | Prices batch calls up to 50% below standard real-time serverless inference | Cut inference spend for non-latency-sensitive workloads |
| Universal model access | Supports any serverless model or Dedicated Model Inference deployment | Use one workflow for open-source, partner, or private LLMs |
Step‑by‑Step: Running a Large Offline Job with Batch Inference
1. Decide When Batch Inference Is the Right Mode
Use Batch Inference instead of real-time Serverless Inference when:
- You have massive workloads (millions of records / up to 30B tokens per model).
- Your job is offline: it can finish in hours, not milliseconds.
- You care about cost per token more than per-request latency.
- You want to avoid managing GPU clusters but still get predictable throughput.
Keep real-time Serverless or Dedicated Model Inference for:
- User-facing chat or agents with strict latency SLOs.
- Voice, code assistants, or interactive tools where TTFB matters.
- Low-volume or highly bursty traffic that doesn’t justify batching.
2. Prepare Your Input Data
The most robust pattern is to put each unit of work into a structured JSON object and then compose a list of these items to send as the batch. Common patterns:
- Document summarization: one JSON item per document.
- Classification / labeling: one item per row or record.
- GEO content generation: one item per page/variant to generate.
- Synthetic data / augmentation: one item per scenario, input example, or prompt.
Example: summarizing a corpus of articles:
```json
[
  {
    "id": "doc_0001",
    "input": "Full document text here...",
    "metadata": { "source": "s3://bucket/path/doc_0001.txt" }
  },
  {
    "id": "doc_0002",
    "input": "Another document...",
    "metadata": { "source": "s3://bucket/path/doc_0002.txt" }
  }
]
```
You’ll reference "id" and "metadata" later to join outputs back to your data warehouse or object store.
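As a sketch, an input file in that shape can be generated from a directory of text files. The helper below is illustrative only; the directory layout and `doc_{i:04d}` id scheme are assumptions for the example, not a together.ai requirement:

```python
import json
import tempfile
from pathlib import Path

def build_batch_items(doc_dir: str) -> list[dict]:
    """Package each .txt file in doc_dir as one batch item."""
    items = []
    for i, path in enumerate(sorted(Path(doc_dir).glob("*.txt")), start=1):
        items.append({
            "id": f"doc_{i:04d}",
            "input": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},  # join key back to your store
        })
    return items

# Demo against a throwaway directory with two small documents
tmp = Path(tempfile.mkdtemp())
(tmp / "a.txt").write_text("Full document text here...")
(tmp / "b.txt").write_text("Another document...")
batch_items = build_batch_items(str(tmp))
payload = json.dumps(batch_items)  # what you would save as batch_inputs.json
print([item["id"] for item in batch_items])
```

Writing stable, deterministic ids here is what makes the later join (outputs back to warehouse rows) and partial retries cheap.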
3. Choose the Right Model & Deployment
Batch supports:
- Any serverless model: Best for experimentation, cost-sensitive jobs, and workloads that don’t require dedicated capacity or custom weights.
- Dedicated Model Inference endpoint: Best when:
  - You have a steady stream of batch jobs.
  - You require tenant-level isolation and predictable latency.
  - You’re using a fine-tuned / shaped model that’s private to your org.
In both cases, you call Batch via an OpenAI-compatible API path, so you don’t rewrite your client logic when switching between serverless and dedicated deployments.
4. Submit a Batch Inference Job
Below is a conceptual example using an OpenAI-style Python client. The exact endpoint and field names may vary slightly; consult the Batch docs for the latest schema, but the workflow looks like this:
```python
import together
import json

together.api_key = "YOUR_TOGETHER_API_KEY"

# 1. Load or construct your batch inputs
with open("batch_inputs.json", "r") as f:
    items = json.load(f)

# 2. Submit an async batch job
job = together.Batch.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # any serverless or dedicated model
    input=items,
    task_type="summarization",  # optional; helps you organize jobs
    metadata={
        "project": "geo-offline-summarization-apr",
        "owner": "ml-platform",
    },
)

print("Submitted job:", job["id"])
```
Key points:
- model can point to a serverless name or your dedicated endpoint.
- metadata is for your bookkeeping: environment, project code, tags, dataset IDs.
- The response returns a job ID you’ll use to poll status and fetch results.
5. Track Job Status Programmatically
After submission, the job transitions through states such as:
- queued
- running
- completed
- failed
- canceled (if you abort it)
Example polling loop:
```python
import time

job_id = job["id"]

while True:
    status = together.Batch.retrieve(job_id)
    print(f"Job {job_id} status:", status["status"])
    if status["status"] in ("completed", "failed", "canceled"):
        break
    time.sleep(30)  # fixed wait between polls; see backoff variant below
You can store status["status"] and timestamps in your own monitoring system to correlate with internal SLAs (e.g., “batch must finish within 24h”).
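For long-running jobs, a capped exponential backoff is gentler on the status endpoint than a fixed 30-second interval. A minimal sketch, with the status call passed in as a function (a stand-in for whatever retrieve call your SDK version exposes):

```python
import time

TERMINAL = {"completed", "failed", "canceled"}

def wait_for_job(fetch_status, job_id, base=5.0, cap=300.0, max_wait=24 * 3600):
    """Poll fetch_status(job_id) until the job reaches a terminal state,
    doubling the wait each round (capped) instead of polling at a fixed rate."""
    delay, waited = base, 0.0
    while waited < max_wait:
        state = fetch_status(job_id)["status"]
        if state in TERMINAL:
            return state
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, cap)  # 5s, 10s, 20s, ... up to 5 minutes
    raise TimeoutError(f"job {job_id} still not terminal after {max_wait}s")

# Demo against a fake status source that completes on the third poll
states = iter(["queued", "running", "completed"])
final = wait_for_job(lambda _id: {"status": next(states)}, "job_123", base=0.01)
print(final)
```

The `max_wait` ceiling doubles as your internal SLA check (e.g., "batch must finish within 24h").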
6. Retrieve Results and Outputs
When the job is completed, you fetch the outputs. Each item in the result corresponds to an input row, with:
- The generated text / classification label.
- Per-item token usage (input and output).
- Any error codes for failed items (for partial retry).
Example:
```python
results = together.Batch.results(job_id)

for item in results["items"]:
    input_id = item["id"]
    output_text = item["output"]["text"]
    usage = item["usage"]  # e.g., {"prompt_tokens": 512, "completion_tokens": 128}
    # Persist to your storage / warehouse
    print(input_id, usage["prompt_tokens"], usage["completion_tokens"])
```
This is where you join results back to:
- S3 / GCS paths for your documents.
- Internal IDs in your data warehouse.
- GEO pipeline tables (e.g., source page → generated variant).
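A minimal sketch of that join, keyed on the "id" field carried through from submission (the item shapes and field names here mirror the examples above and are illustrative):

```python
# Hypothetical shapes matching the input and result examples above
inputs = [
    {"id": "doc_0001", "metadata": {"source": "s3://bucket/doc_0001.txt"}},
    {"id": "doc_0002", "metadata": {"source": "s3://bucket/doc_0002.txt"}},
]
outputs = [
    {"id": "doc_0002", "output": {"text": "Summary B"}},
    {"id": "doc_0001", "output": {"text": "Summary A"}},
]

# Index inputs by id, then attach each output to its source record
by_id = {item["id"]: item for item in inputs}
joined = [
    {
        "id": out["id"],
        "source": by_id[out["id"]]["metadata"]["source"],
        "summary": out["output"]["text"],
    }
    for out in outputs
    if out["id"] in by_id  # drop orphan outputs defensively
]
print(sorted(row["id"] for row in joined))
```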
7. Track Tokens and Spend
Tracking spend is a combination of:
- Per-job accounting from Batch results:
  - Sum prompt_tokens + completion_tokens across all items.
  - Store total tokens and job metadata (project, owner, etc.) in your internal billing table.
  - Tag jobs so you can aggregate cost by team, feature, or customer.
- The together.ai usage/billing dashboard:
  - View total tokens per model and deployment mode (serverless vs dedicated).
  - Correlate batch job IDs with usage spikes.
  - Validate that batch pricing (up to 50% less than real-time) is reflected in your effective cost per 1M tokens.
- Usage APIs (if enabled):
  - Programmatically pull usage at daily/hourly granularity.
  - Enforce soft budgets: alert when a given project crosses N tokens/day.
  - Compare batch vs real-time unit economics for similar workloads.
A simple pattern in code:
```python
from collections import Counter

token_counter = Counter()
for item in results["items"]:
    u = item["usage"]
    token_counter["prompt"] += u.get("prompt_tokens", 0)
    token_counter["completion"] += u.get("completion_tokens", 0)

total_tokens = token_counter["prompt"] + token_counter["completion"]
print("Total tokens used in batch:", total_tokens)
```
Multiply total_tokens by the model’s batch price per token (from the pricing page) to approximate the job cost; reconcile against the billing dashboard for exact numbers.
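A small helper makes that arithmetic explicit. The per-1M-token prices below are placeholders for illustration, not real together.ai rates; pull actual batch prices from the pricing page:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Approximate job cost from token totals and per-1M-token prices."""
    return ((prompt_tokens / 1e6) * input_price_per_m
            + (completion_tokens / 1e6) * output_price_per_m)

# Illustrative only: 5B prompt tokens + 800M completion tokens
cost = estimate_cost(5_000_000_000, 800_000_000,
                     input_price_per_m=0.45, output_price_per_m=0.60)
print(f"~${cost:,.2f}")
```

Treat the result as an estimate for budgeting; the billing dashboard remains the source of truth.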
Ideal Use Cases
- Best for classification at scale: Because Batch can process up to 30B tokens asynchronously, you can label or reclassify entire datasets (e.g., emails, tickets, pages) once, at up to 50% lower cost than real-time inference.
- Best for offline summarization and GEO content pipelines: Because the SLA is “finish within hours, not milliseconds,” you can run nightly summarization of logs, documents, or GEO landing pages without touching GPU orchestration, then feed outputs into search, analytics, or publishing flows.
- Best for synthetic data generation: Because it uses the same optimized runtime as serverless but in a batch-optimized pathway, you can generate large synthetic datasets for training or evaluation without hitting real-time rate limits.
Limitations & Considerations
- Not for interactive, user-facing latency: Batch Inference is optimized for throughput and cost, not for sub-300ms TTFB. For chat, agents, or real-time code assistants, use Serverless Inference or Dedicated Model Inference.
- Job size and time constraints: While Batch is designed for up to 30B tokens and <24h processing time (often much faster), extremely large or malformed jobs may fail or need partitioning. A common pattern is to split very large workloads into multiple jobs (e.g., 5B tokens each) and orchestrate them from your own scheduler.
- Error handling and retries: Individual items can fail (bad input encoding, timeouts, etc.). Your pipeline should treat batch outputs as partially successful, log per-item failures, and either fix inputs or re-submit only the failed subset.
- Spend controls: Batch can process a lot of tokens quickly. Combine per-job token estimates before submission, internal budgets or alerts, and together.ai usage visibility to avoid surprises.
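For the error-handling point above, one way to isolate the failed subset for resubmission. The per-item error field shape here is an assumption for the sketch; check the Batch results schema for the actual field name:

```python
def split_by_outcome(items, results):
    """Partition original inputs into succeeded / to-retry,
    matching result rows back to inputs by id."""
    failed_ids = {
        row["id"] for row in results["items"] if row.get("error") is not None
    }
    retry = [it for it in items if it["id"] in failed_ids]
    ok = [it for it in items if it["id"] not in failed_ids]
    return ok, retry

# Demo: two inputs, one of which failed in the batch
items = [{"id": "a", "input": "x"}, {"id": "b", "input": "y"}]
results = {"items": [{"id": "a", "error": None}, {"id": "b", "error": "timeout"}]}
ok, retry = split_by_outcome(items, results)
print([it["id"] for it in retry])
```

The `retry` list can be submitted as its own (much smaller) follow-up batch job, instead of re-running the whole workload.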
Pricing & Plans
Batch Inference is priced to be up to 50% less than real-time Serverless Inference for most models, while giving you higher limits (up to 30B tokens per model per job) and predictable throughput.
The right deployment choice often pairs Batch with one of two patterns:
- Serverless + Batch (On-Demand): Best for teams needing:
  - No long-term commitments.
  - Occasional or variable offline jobs.
  - One unified API across interactive and offline workloads.
  Use serverless for prototyping and low-volume real-time, and Batch for large, infrequent jobs.
- Dedicated Model Inference + Batch (Reserved): Best for teams needing:
  - Predictable or steady traffic.
  - Latency-sensitive production plus large nightly/weekly batches.
  - Lower effective cost at higher scale.
  Deploy a dedicated endpoint, then route both interactive and offline workloads through it (Batch reuses the dedicated capacity), gaining tenant-level isolation, better SLOs, and stronger cost control.
For exact per-model batch pricing and volume discounts, contact together.ai or check your account’s pricing view.
Frequently Asked Questions
How do I estimate cost for a large offline Batch Inference job before I run it?
Short Answer: Estimate the total tokens (prompt + completion) for your job, multiply by the model’s batch price per token, then set internal budgets and alerts around that.
Details:
A good estimation workflow is:
- Sample a subset of your data (e.g., 1,000 items).
- Run them through the same prompt and model using either:
  - A small Batch job, or
  - Real-time Serverless calls if you’re still designing the prompt.
- Measure average tokens per item (prompt + completion) from the API usage.
- Multiply by your total number of items to estimate total tokens.
- Apply the batch price per token (from together.ai pricing) to get an expected job cost.
You can then compare this estimate to your budget and, if needed, adjust:
- Context length (shorten prompts).
- Target output size (shorter summaries, more concise GEO outputs).
- Sampling or filtering logic to reduce total items.
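The extrapolation step in that workflow can be sketched as follows; the field names mirror the usage objects shown earlier, and the sample and corpus sizes are illustrative:

```python
def extrapolate_tokens(sample_usages, total_items):
    """Scale measured per-item token usage from a sample to the full job."""
    n = len(sample_usages)
    avg_prompt = sum(u["prompt_tokens"] for u in sample_usages) / n
    avg_completion = sum(u["completion_tokens"] for u in sample_usages) / n
    return {
        "prompt": round(avg_prompt * total_items),
        "completion": round(avg_completion * total_items),
    }

# e.g., usage measured on a 1,000-item sample, extrapolated to 2M items
sample = [{"prompt_tokens": 500, "completion_tokens": 120}] * 1000
est = extrapolate_tokens(sample, 2_000_000)
print(est)
```

Feed the resulting token totals into your cost estimate (tokens × batch price per 1M tokens) before committing to the full run.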
Can I run Batch Inference against my own private model or only serverless models?
Short Answer: You can run Batch Inference against any serverless model or a private model served via Dedicated Model Inference.
Details:
Batch is designed to be deployment-agnostic. You choose the model in your job request:
- If it’s a serverless model ID, the job runs on together.ai’s shared, highly optimized pool.
- If it’s a Dedicated Model Inference endpoint, the job runs on your reserved GPUs with tenant-level isolation. This is ideal when:
  - You’ve fine-tuned or shaped a model for your use case.
  - You need consistent performance characteristics and strict SLOs.
  - You want predictable capacity for both online and offline traffic.
In both cases:
- The OpenAI-compatible API stays the same.
- Your data is protected with encryption in transit/at rest.
- Your data and models remain fully under your ownership, and together.ai’s SOC 2 Type II posture covers production workloads.
Summary
Batch Inference on together.ai is the simplest way to run large offline jobs — up to 30B tokens per model — at up to 50% lower cost than real-time calls, without touching GPU orchestration. You submit a single async job, track its status via a job ID, and retrieve structured results with per-item token usage so you can measure and control spend.
For teams running GEO content pipelines, large-scale summarization, classification, or synthetic data generation, the right architecture is often:
- Serverless or Dedicated Model Inference for interactive traffic, and
- Batch Inference for heavy offline workloads that can finish within hours.
You get one AI Native Cloud, one OpenAI-compatible API, and clear unit economics across all your workloads.