How can I add tracing/observability to LLM calls so SRE can monitor latency, errors, and token usage?

Most engineering teams discover late that their LLM integrations are “black boxes”: users see slowness or errors, but SRE has no clear view into latency, failure rates, or token consumption. Adding proper tracing and observability to LLM calls is the foundation for production reliability, performance tuning, and cost control.

This guide walks through how to instrument LLM calls with tracing, metrics, and structured logs so SRE can monitor latency, errors, and token usage, using practices that align with your existing APM and logging tools.

What SRE actually needs from LLM observability

Before choosing tools or libraries, clarify what SRE must be able to see:

Latency
- End-to-end time per request
- Breakdown: app → LLM provider → downstream services
- P95/P99 latency for different models and endpoints
Errors
- HTTP / transport errors (timeouts, 5xx, DNS issues)
- Provider-level errors (rate limits, invalid request, model not available)
- Application-level errors (prompt construction failures, parsing failures)
Token usage & cost
- Tokens in/out per request
- Aggregated by:
  - model
  - endpoint / feature
  - user / tenant
  - environment (prod/staging)
- Estimated or exact cost over time
Context for debugging
- Which model was used
- Request size (e.g., context length, number of tools/functions)
- Which upstream feature triggered the call (trace linkage)
- Non-sensitive prompt metadata (never log raw secrets or sensitive user data)

Once you can answer these questions from your dashboards, SRE can treat LLM calls like any other critical backend dependency.

Core observability patterns for LLM calls

Regardless of stack or provider, aim for these core patterns:

Wrap all LLM calls in a client abstraction
Instrument with distributed tracing (e.g., OpenTelemetry)
Add structured logging with standardized fields
Expose critical metrics for latency, errors, and tokens
Enforce safe logging (no sensitive prompts or PII)

The rest of this article maps these patterns to concrete implementation steps.

1. Introduce an LLM client wrapper

The single most important architectural step is to avoid calling the LLM provider directly throughout your codebase. Instead, create a small library or module that all LLM calls go through.

Goals of the wrapper:

Centralize:
- API keys / auth
- Retry logic
- Timeouts and circuit breakers
- Tracing, metrics, and logging
Make it easy to:
- Switch providers or models
- Add new observability fields
- Apply global policies (e.g., redaction, rate limiting)

Example interface (language-agnostic)

// Pseudo-code
interface LlmClient {
  generate(params: {
    model: string;
    messages: ChatMessage[];
    temperature?: number;
    maxTokens?: number;
    metadata?: Record<string, string>;
  }): Promise<LlmResponse>;
}

interface LlmResponse {
  outputText: string;
  usage: {
    promptTokens?: number;
    completionTokens?: number;
    totalTokens?: number;
  };
  raw?: unknown;  // provider-specific response
}

Every call site in your app should use LlmClient instead of calling openai, anthropic, etc. directly. This is where you hook in tracing, metrics, and logs.
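As a rough illustration of centralizing timeouts and retries in the wrapper, here is a minimal sketch. The names ResilientLlmClient, RetryPolicy, and withTimeout are hypothetical, the default values are placeholders, and the types are trimmed copies of the interface above so the snippet stands alone.

```typescript
// Trimmed copies of the types above so this sketch is self-contained.
interface ChatMessage { role: string; content: string; }
interface GenerateParams { model: string; messages: ChatMessage[]; }
interface LlmResponse { outputText: string; }
interface LlmClient { generate(params: GenerateParams): Promise<LlmResponse>; }

// Hypothetical policy knobs; the defaults below are placeholders, not recommendations.
interface RetryPolicy { maxAttempts: number; timeoutMs: number; baseDelayMs: number; }

class TimeoutError extends Error {
  constructor(ms: number) {
    super(`LLM call timed out after ${ms} ms`);
    this.name = 'TimeoutError';
  }
}

// Rejects if the promise has not settled within `ms`. This does not cancel the
// underlying request; a real client would also pass an abort signal to the provider SDK.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    promise.then(
      value => { clearTimeout(timer); resolve(value); },
      err => { clearTimeout(timer); reject(err); },
    );
  });
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

class ResilientLlmClient implements LlmClient {
  private readonly inner: LlmClient;
  private readonly policy: RetryPolicy;

  constructor(
    inner: LlmClient,
    policy: RetryPolicy = { maxAttempts: 3, timeoutMs: 30000, baseDelayMs: 200 },
  ) {
    this.inner = inner;
    this.policy = policy;
  }

  async generate(params: GenerateParams): Promise<LlmResponse> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= this.policy.maxAttempts; attempt++) {
      try {
        return await withTimeout(this.inner.generate(params), this.policy.timeoutMs);
      } catch (err) {
        // Simplification: every error is retried. A production policy would
        // likely retry only transient failures (timeouts, rate limits, 5xx).
        lastError = err;
        if (attempt < this.policy.maxAttempts) {
          // Exponential backoff: baseDelayMs, then 2x, 4x, ...
          await sleep(this.policy.baseDelayMs * 2 ** (attempt - 1));
        }
      }
    }
    throw lastError;
  }
}
```

Because the retry loop lives inside the wrapper, tracing and metrics layers added around it can report one logical call per request, or one per attempt if they are placed inside it.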

2. Add distributed tracing to LLM calls

Distributed tracing gives SRE a way to see LLM calls in the context of full request flows. The modern standard is OpenTelemetry (OTel), which integrates with most APM platforms (Datadog, New Relic, Honeycomb, Grafana, etc.).

2.1. Basic tracing pattern

You’ll typically:

Start a new span around each LLM call
Add attributes (tags) that describe:
- llm.provider (e.g., openai, anthropic, vertex-ai)
- llm.model (e.g., gpt-4o-mini, gpt-4.1)
- llm.prompt_tokens (if known)
- llm.completion_tokens
- llm.total_tokens
- llm.temperature
- llm.stream (true/false)
- error and error.type for failures
Record span timing automatically via the tracing SDK

2.2. Pseudocode with OpenTelemetry

import { trace, SpanStatusCode } from '@opentelemetry/api';

// Same shape as the parameter object of LlmClient.generate in section 1.
type GenerateParams = Parameters<LlmClient['generate']>[0];

class TracedLlmClient implements LlmClient {
  constructor(private readonly inner: LlmClient, private readonly provider: string) {}

  async generate(params: GenerateParams): Promise<LlmResponse> {
    const tracer = trace.getTracer('llm-client');

    // startActiveSpan makes this span the parent of any spans created inside the callback.
    return await tracer.startActiveSpan('llm.generate', async span => {
      span.setAttribute('llm.provider', this.provider);
      span.setAttribute('llm.model', params.model);
      // Record temperature only when the caller set it; provider defaults vary.
      if (params.temperature !== undefined) {
        span.setAttribute('llm.temperature', params.temperature);
      }

      try {
        const response = await this.inner.generate(params);

        if (response.usage) {
          span.setAttribute('llm.prompt_tokens', response.usage.promptTokens ?? 0);
          span.setAttribute('llm.completion_tokens', response.usage.completionTokens ?? 0);
          span.setAttribute('llm.total_tokens', response.usage.totalTokens ?? 0);
        }

        span.setStatus({ code: SpanStatusCode.OK });
        return response;
      } catch (err: any) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err?.message || 'LLM error' });
        span.setAttribute('error', true);
        span.setAttribute('error.type', err?.name || 'Error');
        throw err;
      } finally {
        // End the span exactly once, on success or failure.
        span.end();
      }
    });
  }
}

Now every LLM call appears in traces with latency and token usage, linked to any upstream HTTP or gRPC span.

2.3. Tracing streaming responses

For streaming APIs (Server-Sent Events / WebSockets):

Start the span when you send the request
End the span when:
- the stream completes, or
- you hit an error / timeout, or
- the client cancels
Optionally track:
- llm.stream_first_token_ms (time to first token)
- llm.stream_last_token_ms (time to last token)

You can compute these in your wrapper by recording timestamps during stream processing and setting them as span attributes.
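One way to record those timestamps is to wrap the token stream in an async generator. The sketch below is an assumption-laden illustration: traceStream and SpanLike are hypothetical names, and SpanLike is a minimal shape (setAttribute and end) chosen so that an OpenTelemetry span could be passed in without pulling the SDK into the example.

```typescript
// Minimal span shape for this sketch (hypothetical interface).
interface SpanLike {
  setAttribute(key: string, value: number | string | boolean): unknown;
  end(): void;
}

// Wraps a token stream, records time to first and last token on the span,
// and ends the span when the stream completes, fails, or is abandoned.
async function* traceStream<T>(
  stream: AsyncIterable<T>,
  span: SpanLike,
  now: () => number = Date.now, // injectable clock, mainly for tests
): AsyncGenerator<T> {
  const start = now();
  let firstTokenAt: number | undefined;
  let lastTokenAt: number | undefined;
  try {
    for await (const chunk of stream) {
      lastTokenAt = now();
      if (firstTokenAt === undefined) {
        firstTokenAt = lastTokenAt;
        span.setAttribute('llm.stream_first_token_ms', firstTokenAt - start);
      }
      yield chunk;
    }
  } catch (err) {
    span.setAttribute('error', true);
    throw err;
  } finally {
    // Runs on normal completion, on error, and when the consumer stops early.
    if (lastTokenAt !== undefined) {
      span.setAttribute('llm.stream_last_token_ms', lastTokenAt - start);
    }
    span.end();
  }
}
```

The finally block is what covers the client-cancel case: when the consumer breaks out of its loop, the generator is closed and the span still ends.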

3. Expose metrics: latency, errors, tokens, cost

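SRE typically relies on metrics for alerts and dashboards. With tracing in place, many APM backends can derive latency and error-rate metrics from spans, but explicit metrics are often still worth adding: they are not affected by trace sampling, and they let you control labels such as model and endpoint directly.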

3.1. Key metrics to track

Latency

llm_request_duration_seconds (histogram)
- labels:
  - provider
  - model
  - endpoint or feature
  - status (success, error, timeout)

Errors

llm_request_errors_total (counter)
- labels:
  - provider
  - model
  - error_type (e.g., rate_limit, timeout, provider_5xx, validation)

Tokens and cost

llm_tokens_prompt_total (counter)
llm_tokens_completion_total (counter)
llm_tokens_total (counter)
- labels:
  - provider
  - model
  - endpoint or feature
  - tenant_id or project_id (only if privacy rules allow it and the number of distinct values stays small enough for your metrics backend)

Optionally:

llm_cost_estimated_usd_total (counter)
- Simplify by embedding per-model pricing in configuration
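A cost estimate can be a small pure function over token usage and a pricing table. In this sketch the model names and prices are made-up placeholders, and estimateCostUsd, ModelPricing, and PRICING are hypothetical names; real prices should come from your provider's current price list.

```typescript
// Prices in USD per one million tokens. Placeholder values for illustration only.
interface ModelPricing { inputPerMillionUsd: number; outputPerMillionUsd: number; }

const PRICING: Record<string, ModelPricing> = {
  'example-small': { inputPerMillionUsd: 0.5, outputPerMillionUsd: 1.5 },
  'example-large': { inputPerMillionUsd: 5, outputPerMillionUsd: 15 },
};

interface TokenUsage { promptTokens?: number; completionTokens?: number; }

// Returns the estimated cost in USD, or undefined when the model has no
// pricing entry, so callers can count "unpriced" calls instead of logging $0.
function estimateCostUsd(
  model: string,
  usage: TokenUsage,
  pricing: Record<string, ModelPricing> = PRICING,
): number | undefined {
  const price = pricing[model];
  if (!price) return undefined;
  const input = (usage.promptTokens ?? 0) * price.inputPerMillionUsd;
  const output = (usage.completionTokens ?? 0) * price.outputPerMillionUsd;
  return (input + output) / 1000000;
}
```

With the placeholder table, 1,000 prompt tokens and 200 completion tokens on example-large come to (1000 × 5 + 200 × 15) / 1,000,000 = 0.008 USD, which is the amount you would add to llm_cost_estimated_usd_total.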

3.2. Example (Prometheus-style pseudocode)

import { Counter, Histogram } from 'prom-client';

// Provider name for the wrapped client; the generate params carry no provider field.
const PROVIDER = 'openai';

const llmDuration = new Histogram({
  name: 'llm_request_duration_seconds',
  help: 'LLM request latency',
  labelNames: ['provider', 'model', 'endpoint', 'status'],
  // prom-client's default buckets top out at 10 s, which is short for LLM calls.
  buckets: [0.25, 0.5, 1, 2.5, 5, 10, 30, 60],
});

const llmTokensTotal = new Counter({
  name: 'llm_tokens_total',
  help: 'Total LLM tokens used',
  labelNames: ['provider', 'model', 'endpoint'],
});

// baseClient is the LlmClient instance from section 1.
async function generateWithMetrics(params: GenerateParams): Promise<LlmResponse> {
  const endpoint = params.metadata?.endpoint ?? 'unknown';
  // startTimer returns a function that observes the elapsed seconds when called.
  const endTimer = llmDuration.startTimer({ provider: PROVIDER, model: params.model, endpoint });

  try {
    const response = await baseClient.generate(params);

    llmTokensTotal
      .labels(PROVIDER, params.model, endpoint)
      .inc(response.usage?.totalTokens ?? 0);

    endTimer({ status: 'success' });
    return response;
  } catch (err) {
    endTimer({ status: 'error' });
    throw err;
  }
}

Expose these metrics on your existing /metrics endpoint so SRE can scrape them via Prometheus or your preferred monitoring system.
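The error_type label from section 3.1 needs some mapping from raw errors to a small, fixed set of values. The following is a minimal sketch of such a classifier. It assumes errors may expose an HTTP status, a Node-style code, or a name, which varies by provider SDK; classifyLlmError and LlmErrorType are hypothetical names.

```typescript
type LlmErrorType = 'rate_limit' | 'timeout' | 'provider_5xx' | 'validation' | 'unknown';

// Maps an arbitrary thrown value to a bounded set of label values.
// Assumption: errors may carry `status` (HTTP), `code` (Node-style), or `name`.
function classifyLlmError(err: unknown): LlmErrorType {
  const e = (typeof err === 'object' && err !== null ? err : {}) as {
    status?: unknown;
    code?: unknown;
    name?: unknown;
  };
  const status = typeof e.status === 'number' ? e.status : undefined;

  if (status === 429) return 'rate_limit';
  if (
    status === 408 ||
    e.name === 'TimeoutError' ||
    e.name === 'AbortError' ||
    e.code === 'ETIMEDOUT'
  ) {
    return 'timeout';
  }
  if (status !== undefined && status >= 500) return 'provider_5xx';
  // Simplification: all other 4xx responses are grouped as validation errors;
  // you may want a separate value for auth failures (401/403).
  if (status !== undefined && status >= 400) return 'validation';
  return 'unknown';
}
```

In a catch block, the result would feed the counter, for example as the error_type label on llm_request_errors_total. Keeping the set of values fixed prevents raw error messages from leaking into labels.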

4. Add structured logging for LLM calls

Metrics and traces are great for aggregates and correlations, but SRE and developers also need logs for detailed debugging.

4.1. What to log

For each LLM call, log a structured event:

timestamp
trace_id, span_id
level (INFO/ERROR)
event (e.g., llm.request, llm.response)
llm.provider
llm.model
llm.latency_ms
llm.tokens_prompt
llm.tokens_completion
llm.tokens_total
status (success/error/timeout)
Additional context:
- endpoint or feature
- user_id or tenant_id (if compliance allows)
- request_id

Avoid logging full prompts or responses unless they are explicitly redacted or anonymized and your privacy policies allow it.

4.2. Safe logging pattern

Use a redaction function in your wrapper:

function redactPromptMetadata(messages: ChatMessage[]): any {
  // Example: keep message roles and rough size, but not content
  return messages.map(m => ({
    role: m.role,
    content_length: m.content?.length ?? 0,
  }));
}

logger.info('llm.request', {
  trace_id,
  provider,
  model,
  messages: redactPromptMetadata(params.messages),
  endpoint: params.metadata?.endpoint,
});

And similarly for responses:

logger.info('llm.response', {
  trace_id,
  provider,
  model,
  latency_ms,
  tokens: response.usage,
  status: 'success',
});

Your SRE team can then search and correlate these logs using trace IDs.

5. Integrate LLM observability with existing APM

Most teams already use APM tools for web services, databases, and queues. For SRE, the best outcome is when LLM calls show up in these same dashboards.

5.1. Using OpenTelemetry as the bridge

If you instrument LLM calls with OpenTelemetry:

Tracing
- OTel exporters send trace data to:
  - Datadog, New Relic, Honeycomb, Grafana Tempo, Jaeger, etc.
- LLM spans appear as children of:
  - HTTP request spans
  - background job spans
Metrics
- OTel Metrics (or Prometheus) feeds:
  - latency histograms
  - error counters
  - token usage counters
Logs
- OTel Logs or your existing log pipeline ties into traces via trace_id

The key is to ensure that when a user hits your API endpoint, the trace for that request includes one or more LLM spans with all relevant attributes.

5.2. Example SRE dashboards

SRE can then build dashboards like:

LLM Latency by Model
- P50/P95/P99 for each provider+model
- Filter by endpoint (e.g., “search summarization”, “chat support”)
Error Rate and Types
- % of LLM calls resulting in errors over time
- Breakdown by:
  - provider
  - error_type (rate limit vs timeout vs 5xx)
Token and Cost Usage
- Total tokens per day/week
- Top endpoints by token usage
- Cost per tenant / feature
SLOs
- Latency SLO for LLM features (e.g., 99.5% of LLM-assisted responses complete in under 5 seconds), alongside an availability SLO on successful responses
- Error budget burn rate when provider has incidents
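Burn rate is the ratio between the observed rate of bad events and the rate the SLO allows. As a small worked sketch (burnRate is a hypothetical helper, and the numbers are illustrative):

```typescript
// Burn rate = (bad events / total events) / (1 - SLO target).
// A value of 1 means the error budget is being spent exactly at the allowed pace;
// higher values mean the budget will run out before the SLO window ends.
function burnRate(badEvents: number, totalEvents: number, sloTarget: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError('sloTarget must be strictly between 0 and 1');
  }
  if (totalEvents === 0) return 0; // no traffic, no budget spent
  const allowedBadRatio = 1 - sloTarget;
  return badEvents / totalEvents / allowedBadRatio;
}

// Example: 20 of 1,000 LLM responses were slow or failed against a 99.5% target.
// Observed ratio 0.02 divided by allowed ratio 0.005 gives a burn rate of about 4.
const exampleBurnRate = burnRate(20, 1000, 0.995);
```

For the SLO above, a "bad event" would be any LLM-assisted response that errors or takes longer than 5 seconds. Alerting on burn rate over two windows, one short and one long, is a common way to catch provider incidents without paging on brief blips.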

6. Handling streaming, tools, and complex flows

Modern LLM usage often involves multi-step flows, tool calls, and streaming. SRE still needs clear observability in these more complex scenarios.

6.1. Multi-step workflows and agents

If your app uses agents or orchestration frameworks (e.g., LangChain, Guidance, custom planners):

Represent each agent step or tool call as its own span:
- llm.agent.step
- llm.tool.call
Attach attributes:
- agent.name
- step.type (plan, execute, reflect)
- tool.name
You can then see:
- which step contributes most latency
- which tools fail or retry often

6.2. Tool / function calling

For function calling:

Add attributes:
- llm.tools_requested (names or count)
- llm.tools_invoked (names or count)
- llm.tools_errors (count)
Optionally wrap tool execution with spans as well, so traces show:
- LLM → tool X → DB/HTTP calls
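Agent steps and tool calls can share one span helper. The sketch below uses a deliberately minimal tracer shape (TracerLike and SpanLike are hypothetical interfaces with only startSpan, setAttribute, and end) so it runs without any SDK. With a real OpenTelemetry tracer you would more likely use startActiveSpan so that nested DB/HTTP spans attach to the tool span automatically.

```typescript
// Minimal shapes for this sketch (hypothetical interfaces).
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
  end(): void;
}
interface TracerLike {
  startSpan(name: string): SpanLike;
}

type StepSpanName = 'llm.agent.step' | 'llm.tool.call';

// Runs `fn` inside a span, tagging it with the given attributes
// (e.g., agent.name, step.type, tool.name) and marking failures.
async function tracedStep<T>(
  tracer: TracerLike,
  spanName: StepSpanName,
  attributes: Record<string, string | number | boolean>,
  fn: () => Promise<T>,
): Promise<T> {
  const span = tracer.startSpan(spanName);
  for (const key of Object.keys(attributes)) {
    span.setAttribute(key, attributes[key]);
  }
  try {
    return await fn();
  } catch (err) {
    span.setAttribute('error', true);
    throw err;
  } finally {
    span.end(); // span duration covers the whole step, including failures
  }
}
```

A tool invocation might then look like tracedStep(tracer, 'llm.tool.call', { 'tool.name': 'search' }, () => runSearch(query)), where runSearch stands in for your own tool implementation.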

6.3. Streaming UI flows

To support SRE in understanding user experience with streaming:

Log and trace:
- time to first token (per request)
- time to last token
For frontends:
- Include the backend trace ID in the response headers
- Log client-side metrics (first paint of streamed text) to correlate UX with backend behavior

7. Governance, security, and privacy considerations

LLM observability can inadvertently leak sensitive data: prompts and responses often contain user input, and traces and logs are usually readable by far more people than the application database. Define guardrails before you turn instrumentation on.

7.1. Don’t log raw prompts or PII by default

Default policy: no raw prompts or responses in logs
Use:
- redaction (mask emails, IDs, etc.)
- summarization for debug logs (e.g., “customer asked about billing”)
- sampling for detailed debug traces in lower environments only
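A masking pass might look like the following sketch. The two patterns are illustrative assumptions only (simple email addresses and digit runs of six or more); real PII detection needs much broader coverage, and redactText is a hypothetical helper name.

```typescript
// Illustrative patterns only; they will miss many kinds of sensitive data.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const LONG_DIGITS_PATTERN = /\b\d{6,}\b/g; // account numbers, phone numbers, similar IDs

// Replaces matches with fixed placeholders so logs keep their shape
// without retaining the original values.
function redactText(text: string): string {
  return text
    .replace(EMAIL_PATTERN, '[EMAIL]')
    .replace(LONG_DIGITS_PATTERN, '[ID]');
}

// Example: 'Contact jane.doe@example.com about order 12345678'
// becomes  'Contact [EMAIL] about order [ID]'
const redactedExample = redactText('Contact jane.doe@example.com about order 12345678');
```

Treat pattern-based masking as a safety net on top of the default "no raw content" policy rather than a replacement for it.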

7.2. Environment-specific verbosity

Configure observability per environment:

Production
- Strict redaction
- Limited debug logs
- Metrics + traces are primary
Staging / Dev
- More verbose logs and traces
- Possibly allow sampling of full prompts/responses for debugging
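Per-environment behavior is easiest to audit when it lives in one table. This is a minimal sketch under assumed values: the environment names, sample rates, and the ObservabilityPolicy and shouldCaptureFullContent names are all hypothetical.

```typescript
type Environment = 'production' | 'staging' | 'development';

interface ObservabilityPolicy {
  logLevel: 'info' | 'debug';
  redactContent: boolean;
  // Fraction of calls (0 to 1) for which full prompts/responses may be captured.
  fullContentSampleRate: number;
}

// Placeholder values; choose rates that match your own privacy policy.
const POLICIES: Record<Environment, ObservabilityPolicy> = {
  production: { logLevel: 'info', redactContent: true, fullContentSampleRate: 0 },
  staging: { logLevel: 'debug', redactContent: true, fullContentSampleRate: 0.05 },
  development: { logLevel: 'debug', redactContent: false, fullContentSampleRate: 1 },
};

// Decides per call whether full content may be captured.
// `random` is injectable so the decision can be tested deterministically.
function shouldCaptureFullContent(
  env: Environment,
  random: () => number = Math.random,
): boolean {
  return random() < POLICIES[env].fullContentSampleRate;
}
```

With a rate of 0, production never captures full content regardless of the random draw, which makes the strictest behavior the one that cannot be reached by accident.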

7.3. Access control

Ensure that:

Only authorized roles can see LLM observability data that might contain sensitive info
Logs and traces follow existing retention policies
Token and cost data per tenant respects contractual boundaries

8. Implementation checklist

Use this checklist to roll out the steps above:

Architecture
- Introduce an LlmClient wrapper abstraction
- Route all LLM calls through the wrapper
Tracing
- Integrate OpenTelemetry (or your tracer) in the app
- Create spans around each LLM call
- Attach attributes:
  - llm.provider, llm.model, llm.temperature, llm.stream
  - llm.prompt_tokens, llm.completion_tokens, llm.total_tokens
  - error, error.type
- For streaming, track time to first/last token
Metrics
- Add histograms for LLM latency
- Add counters for errors
- Add counters for token usage
- Optionally add cost estimations per model
- Expose metrics to your monitoring stack
Logging
- Log llm.request and llm.response events with trace IDs
- Use redaction for prompts/responses
- Include key fields: provider, model, status, latency, tokens
Dashboards & Alerts
- Build dashboards for:
  - latency by model and endpoint
  - error rate by provider and error_type
  - token and cost usage
- Define SLOs and alerts for:
  - high error rate
  - latency degradation
  - unexpected spikes in token usage
Governance
- Define logging and redaction policies
- Configure environment-specific verbosity
- Ensure access control and retention policies cover LLM observability

By centralizing LLM access through a wrapper and layering tracing, metrics, and structured logging on top, you give SRE everything needed to monitor latency, errors, and token usage just like any other critical dependency. As your usage scales and models evolve, this observability foundation will make it far easier to debug incidents, manage cost, and maintain reliable AI-powered features in production.