
How do we break down cost and latency by step (LLM call vs retrieval vs tool call) to stop runaway spend?
Runaway LLM spend almost never comes from a single bad prompt. It comes from stacked complexity: a chain that fans out across retrieval, multiple tool calls, and nested LLM calls where no one can see which step is slow or expensive. To control it, you have to stop thinking in terms of “the request” and start thinking in terms of “every step in the trace.”
Quick Answer: You break down cost and latency by step by tracing every LLM call, retrieval, and tool invocation as distinct runs, then aggregating token usage, latency, and error rates per step type. In LangSmith, those traces become timelines and dashboards that show exactly which part of your agent workflow is driving cost and slowness—so you can fix the worst offenders first and stop runaway spend.
The Quick Overview
- What It Is: A trace-first way to attribute cost and latency to each step in an AI agent workflow—LLM calls, retrieval, and tool calls—using LangSmith’s observability and evaluation stack.
- Who It Is For: Teams running non-trivial LLM apps or agents (RAG, multi-tool agents, workflow engines) who need to keep costs predictable and UX fast while their systems scale.
- Core Problem Solved: You can’t control what you can’t see. Without step-level cost and latency breakdowns, agents silently get more expensive and slower, and you only notice after the cloud bill or user complaints arrive.
How It Works
At a high level, the solution is: instrument everything, then slice by step type.
LangSmith treats every operation in your agent—LLM calls, retrievers, vector DB queries, custom tools, external APIs—as a run inside a trace (the full end-to-end request/response). Each run records:
- Input and output tokens
- Latency (start/end timestamps)
- Errors (exceptions, timeouts, failed tool calls)
- Metadata (run type, tool name, model, tags)
From there you can:
- View a structured timeline for each user request
- Aggregate metrics by run type and tool name in dashboards
- Identify the exact steps that blow up latency or cost
The basic lifecycle looks like this:
- Instrument & Trace: Add LangSmith to your agent stack (using LangChain, LangGraph, SDKs, or OpenTelemetry) so that every LLM call, retrieval, and tool call emits a run.
- Classify & Tag Steps: Ensure runs are labeled as llm, retriever, tool, etc., and optionally tag them with workflow names ("order_tracking", "contract_summarization").
- Analyze & Optimize: Use LangSmith’s run timelines and metrics (latency distributions, token usage per trace, error rates) to find hotspots, then optimize prompts, reduce fan-out, or refactor tools, and verify improvements with before/after traces and evals.
Phase-by-Phase: From Black Box to Step-Level Cost & Latency
1. Instrument & Trace Every Step
You can’t break down cost and latency by step if you only log “request started / request ended.”
With LangSmith, you:
- Install the SDK (Python, TypeScript, Go, Java) or enable native LangChain/LangGraph tracing.
- Configure environment variables (e.g., LANGCHAIN_TRACING_V2=true, LANGSMITH_API_KEY).
- Wrap your agent or app entrypoints so each user request becomes a parent run.
- Ensure each internal operation becomes a child run:
  - LLM calls (run.type = "llm")
  - Retrieval/candidate generation (run.type = "retriever", tool = "vector_db")
  - Domain tools (e.g., tool = "get_shipping_quote")
  - External services (e.g., tool = "salesforce_api")
Each child run records:
- Start/end timestamps → latency per step
- Input/output payloads → tokens per step where applicable
- Status/error → error rate per tool/LLM
This is where LangSmith’s trace-first approach matters: the trace is the ground truth of what the agent actually did, not what you hoped the code would do.
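A minimal sketch of the parent/child structure described above, assuming the LangSmith Python SDK's traceable decorator. The function names and return values are hypothetical stand-ins for a real retriever, tool, and model call, and the try/except fallback exists only so the sketch still runs when the SDK is not installed.

```python
try:
    # Decorator from the LangSmith Python SDK, if it is installed.
    from langsmith import traceable
except ImportError:
    # Fallback for illustration only: a no-op decorator with the same call shapes.
    def traceable(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn


@traceable(run_type="retriever", name="vector_db")
def retrieve(query: str) -> list[str]:
    # Hypothetical stand-in for a vector DB query.
    return [f"doc about {query}"]


@traceable(run_type="tool", name="get_shipping_quote")
def get_shipping_quote(order_id: str) -> dict:
    # Hypothetical stand-in for a domain tool or external API.
    return {"order_id": order_id, "quote_usd": 12.5}


@traceable(run_type="llm", name="answer_llm")
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a model call.
    return f"answer based on: {prompt[:40]}"


@traceable(run_type="chain", name="order_tracking")
def handle_request(question: str, order_id: str) -> str:
    # Parent run: with tracing enabled, each call below is recorded as a child run.
    docs = retrieve(question)
    quote = get_shipping_quote(order_id)
    return call_llm(f"{question} | {docs} | {quote}")
```

With tracing disabled the decorated functions behave like ordinary functions, so the same code path can run in tests and in production.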
2. Classify, Tag, and Normalize Run Types
Once everything is traced, you need a consistent way to aggregate.
In practice:
- Use run types:
  - llm for model calls
  - retriever for document retrieval
  - tool for business logic/APIs
  - chain/graph for orchestrators
- Use tags and metadata:
  - workflow: "support_assistant"
  - tenant: "enterprise" vs tenant: "self_serve"
  - region: "us" vs region: "eu"
- Standardize tool names:
  - tool = "query_orders_db" instead of "OrdersDB" in one place and "orders_db_query" in another
This lets you answer questions like:
- “What’s the P95 latency of all retriever runs vs all llm runs?”
- “Which specific tool is driving 40% of our total cost?”
- “Is spend growing because of more calls, heavier prompts, or a noisy retrieval strategy?”
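To make the first two questions concrete, here is a small standard-library sketch of the aggregation itself. It assumes runs have already been exported into a simple record; the Run dataclass and its field names are hypothetical and are not the LangSmith schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Run:
    run_type: str      # "llm", "retriever", "tool", ...
    name: str          # normalized tool or model name
    latency_s: float
    cost_usd: float


def p95(values: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    if len(values) < 2:
        return values[0]
    return quantiles(values, n=20, method="inclusive")[-1]


def latency_p95_by_type(runs: list[Run]) -> dict[str, float]:
    # Group latencies by run type, then take the p95 of each group.
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.run_type].append(run.latency_s)
    return {run_type: p95(vals) for run_type, vals in grouped.items()}


def cost_share_by_name(runs: list[Run]) -> dict[str, float]:
    # Fraction of total cost attributed to each tool or model name.
    total = sum(run.cost_usd for run in runs)
    if total == 0:
        return {}
    shares = defaultdict(float)
    for run in runs:
        shares[run.name] += run.cost_usd
    return {name: cost / total for name, cost in shares.items()}
```

Grouping by name only works if tool names are standardized first, which is why the naming convention above matters.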
3. Analyze Traces, Dashboards, and Metrics
With good traces and consistent labeling, LangSmith can show you:
- Tool call latency and response times
  - Latency distribution: p50, p95, p99
  - Helps you decide if the system is fast enough for your UX requirements
- Input and output tokens per trace
  - Token usage and cost attribution per step (LLM vs retrieval vs tools)
  - Lets you see exactly where your spend is going and which steps drive costs
- Error rates
  - Exceptions, timeouts, failed tool calls
  - Lets you see whether the system is stable enough to trust in production
Concretely, you can:
- Open a run timeline and see:
  - LLM call took 3.2s, 6K tokens
  - Retrieval call fanned out to 10 shards and took 2.8s
  - Single slow third-party API call added 1.5s
- Use filters to slice metrics:
  - run.type = "llm" to see average tokens and latency per model
  - run.type = "tool" and tool = "pricing_api" to see that one integration’s p99 latency
- Use Insights (beta) to surface patterns across millions of runs:
  - Detect workflows where cost/latency are trending up
  - Identify tools causing outlier delays or failures
This is where teams usually discover the real causes of runaway spend: too much fan-out, unbounded context windows, extra LLM calls hidden in “helper” tools, or a few workflows that are 10x more expensive than the rest.
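The arithmetic behind per-step cost attribution is simple enough to sketch. The example below is an illustration only: the price table, model names, and LlmRun fields are hypothetical, and it does not describe how LangSmith computes cost internally.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens as (input, output); use your provider's rates.
PRICES_PER_MTOK = {
    "model-small": (0.50, 1.50),
    "model-large": (5.00, 15.00),
}


@dataclass
class LlmRun:
    step: str            # e.g. "planner", "summarize_helper"
    model: str
    input_tokens: int
    output_tokens: int


def llm_cost_usd(run: LlmRun) -> float:
    # Cost of one LLM run from its token counts and the per-model price.
    in_price, out_price = PRICES_PER_MTOK[run.model]
    return (run.input_tokens * in_price + run.output_tokens * out_price) / 1_000_000


def cost_by_step(runs: list[LlmRun]) -> list[tuple[str, float]]:
    # Sum cost per step name across all runs in a trace.
    totals: dict[str, float] = {}
    for run in runs:
        totals[run.step] = totals.get(run.step, 0.0) + llm_cost_usd(run)
    # Most expensive step first, so hidden helper calls stand out.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Sorting by total rather than per-call cost is deliberate: a cheap call repeated many times through fan-out can outweigh one large call.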
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Step-Level Tracing | Captures each LLM call, retrieval, and tool invocation as a separate run within a trace. | You see exactly which step is slow or expensive instead of guessing at the whole request. |
| Token & Cost Attribution | Tracks input/output tokens per trace and per step; attributes cost by model and tool usage. | You know where spend is going (model vs retrieval vs tools) and can fix the worst offenders first. |
| Latency & Reliability Metrics | Computes p50/p95/p99 latency and error rates for each step type and tool. | You can enforce UX SLOs and identify flaky or slow dependencies before customers do. |
Ideal Use Cases
- Best for complex RAG and multi-tool agents: Because it separates LLM, retrieval, and tool costs/latency so you can tune context size, fan-out, and tool strategy without breaking the whole system.
- Best for teams scaling to production and enterprise SLAs: Because you can set budgets and SLOs grounded in real trace data, then monitor regressions and roll back changes when cost or latency jumps.
Limitations & Considerations
- Doesn’t reduce cost by itself: Tracing and metrics don’t lower your bill on their own; they give you the visibility to make targeted changes (prompt compression, caching, fewer calls, better tools) and verify the impact.
- Step-level visibility needs instrumentation: To get clean LLM vs retrieval vs tool breakdowns, you must instrument your stack consistently. If you treat everything as a generic chain run, you’ll lose the granularity that makes this approach so useful.
Pricing & Plans
LangSmith Observability & Evaluation is designed to be accessible for teams of any size while scaling to enterprise volumes.
- Plans include a base amount of traced runs; all Fleet runs (from Fleet/Agent Builder) are traced by default and count toward that total.
- Additional Fleet runs are billed usage-based (e.g., $0.05 per Fleet run).
- LLM usage is billed separately by your model provider; LangSmith focuses on the observability and evaluation layer.
- You can bring your own models and frameworks and still get full trace visibility via SDKs and OpenTelemetry.
Typical fit by plan:
- Team / Pro: Best for product teams and startups needing full observability, evals, and reasonable retention to keep costs aligned with growth.
- Enterprise / Custom: Best for larger orgs needing long retention (e.g., 400+ days), hybrid or self-hosted deployment, US/EU residency, SSO/SAML, SCIM, RBAC/ABAC, audit logs, and the ability to keep data inside a VPC.
For exact pricing and plan details, the easiest path is to connect with our team.
Frequently Asked Questions
How do we separate LLM cost from retrieval and tool cost in practice?
Short Answer: Instrument each LLM call, retrieval, and tool invocation as a separate run, then use LangSmith’s metrics to filter and aggregate by run type and tool name.
Details:
You start by enabling tracing in your agent stack. In LangChain/LangGraph, this is mostly configuration; with other stacks, you use the SDK or OpenTelemetry. Each time you call the model, your vector DB, or a tool, you create a child run with a distinct type (llm, retriever, tool) and a name/tool field. LangSmith automatically tracks tokens for LLM runs and latency for every run. In the UI and API, you can then:
- Filter for run.type = "llm" to see token usage and cost by model and prompt.
- Filter for run.type = "retriever" to see how many docs you pull and how long retrieval takes.
- Filter for run.type = "tool" and group by tool name to see which integration is driving latency or failure.
That’s how you get a clean breakdown instead of a single blob of “AI cost.”
How does this help actually stop runaway spend instead of just measuring it?
Short Answer: By making cost and latency changes debuggable—you see which step changed, compare traces before/after, and use evals to ensure optimizations don’t hurt quality.
Details:
Runaway spend happens when you ship changes without seeing their impact on step-level behavior. With LangSmith:
- You treat traces as the source of truth. Any change (prompt, model, tool strategy) translates into different traces you can replay and compare.
- You turn production traces into datasets, then run offline evals to test new versions of your agent against real workloads before deploying.
- You calibrate evaluators (LLM-as-judge) with human feedback using Align Evals, so you trust that lower cost doesn’t mean worse answers.
- You ship via LangSmith Deployment or your own stack and watch live metrics (cost per trace, latency, error rates) for regressions.
- If cost or latency spikes, you drill into traces, find the changed step, and roll back.
This closed loop—trace → dataset → evals → deploy → monitor → rollback—is what turns visibility into actual cost control instead of another dashboard you ignore.
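The monitor and rollback end of that loop can be as simple as comparing a per-step metric between a baseline and a candidate version. This is a rough sketch under assumed inputs: the function name, the dict shape (step name to average cost or latency), and the 10% threshold are all hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_increase: float = 0.10,
) -> dict[str, float]:
    """Return steps whose metric grew by more than max_increase (a fraction)."""
    regressions: dict[str, float] = {}
    for step, new_value in candidate.items():
        old_value = baseline.get(step, 0.0)
        if old_value == 0:
            # Step is new (or was free before): flag it if it now costs anything.
            if new_value > 0:
                regressions[step] = float("inf")
            continue
        growth = (new_value - old_value) / old_value
        if growth > max_increase:
            regressions[step] = growth
    return regressions
```

Flagging brand-new steps is a design choice here: an extra LLM call hidden inside a helper tool shows up as a step with no baseline, which is exactly the kind of change worth reviewing before deploy.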
Summary
To stop runaway LLM spend, you need to stop treating your agent as a black box and start treating each call—LLM, retrieval, tool—as a first-class step with its own cost and latency. LangSmith gives you that step-level view: tokens and cost per trace, latency distributions per tool, and error rates across your whole agent graph.
Once you can see which steps are slow and expensive, you can iterate with confidence: compress prompts, tune retrieval, reduce tool fan-out, and cache results—backed by traces and evals that tell you when you’ve actually improved things, and when you’ve just shifted cost somewhere else.