
How do we instrument our agent in Galileo so we can see tool calls, branches, and latency/cost per trace?
Most teams don’t discover agent failures from their own telemetry—they hear about them from users. An agent quietly picks the wrong tool, loops on the same branch, or blows through your latency budget because of one bad sub-call, and all you see is a generic “request took 18s.” That’s not observability; that’s flying blind.
Galileo is built to fix that. When you instrument your agent correctly, you get a live tree of every session—tool calls, branches, retries, and guardrails—plus precise latency and cost per trace. That’s what lets you go from “something feels slow/unreliable” to “this exact span, this exact tool, this exact model call is the problem.”
Quick Answer: Instrument your agent with Galileo’s SDK so every session is captured as traces and spans. Wrap each LLM call, tool invocation, and branch decision with span metadata (latency, token usage, cost, outcome), then send these events to Galileo. The platform stitches them into a single, filterable view where you can see full agent graphs, tool behavior, and per-trace performance.
The Quick Overview
- What It Is: A way to wire your agent into Galileo so every interaction is tracked as sessions → traces → spans, exposing tool calls, branches, and latency/cost for each step.
- Who It Is For: Teams running production agents or RAG systems who need to debug tool usage, compare branches, enforce latency budgets, and track cost at the trace and span level.
- Core Problem Solved: Instead of opaque logs and averages, you get structured, agent-specific observability: exactly what your agent did, why it did it, how long it took, and what it cost.
How It Works
At a high level, instrumentation in Galileo follows a simple pattern:
- You initialize the Galileo logger in your app (using SDKs/APIs).
- You wrap your agent logic so each user interaction becomes a session, each high-level flow becomes a trace, and each tool/model step becomes a span.
- You enrich spans with metadata: tool names, branch labels, latency, token usage, cost, and guardrail outcomes.
- Galileo ingests this data and turns it into a live agent graph with metrics and Insights.
Think of it as adding a structured skeleton to your currently messy logs—Galileo’s Agent Observability Platform then uses that structure to power debugging, evals, and guardrails.
1. Session & Trace Setup
You start by defining the boundaries of a user interaction:
- Session: One end-to-end user journey (e.g., “user asked to update billing details and confirm payment”).
- Trace: A specific run of your agent within that session (e.g., “billing-assistant v2 run #123 using model gpt-4o”).
Instrumentation steps:
- Initialize the Galileo logger in your app process (backend, worker, or agent runtime).
- Create a session ID when a new user interaction starts.
- Start a trace for each agent run (you can version by prompt/agent config).
This gives Galileo the top-level structure to group all downstream spans.
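As a rough illustration of these boundaries, here is a minimal in-memory sketch. The `Session` and `Trace` classes, the `start_trace` method, and the ID scheme are hypothetical stand-ins for illustration only; they are not the Galileo SDK's API, which you should consult for the actual calls.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One agent run inside a session (hypothetical structure)."""
    name: str
    metadata: dict
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)


@dataclass
class Session:
    """One end-to-end user journey (hypothetical structure)."""
    description: str
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    traces: list = field(default_factory=list)

    def start_trace(self, name: str, **metadata) -> Trace:
        # Attach version/model metadata so runs can be compared later.
        trace = Trace(name=name, metadata=metadata)
        self.traces.append(trace)
        return trace


# A new user interaction opens a session; each agent run opens a trace.
session = Session("user asked to update billing details and confirm payment")
trace = session.start_trace("billing-assistant", version="v2", model="gpt-4o")
print(session.session_id, trace.trace_id, trace.metadata)
```

Keeping the version and model on the trace, rather than on individual spans, is one way to make runs of different agent configurations comparable.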
2. Span-Level Instrumentation (Tools, Branches, LLM Calls)
Within each trace, you create spans for every meaningful step:
- LLM calls (planning, reasoning, and response generation).
- Tool calls (database queries, CRM updates, search APIs, internal services).
- Branch decisions (e.g., “picked refund_flow branch,” “escalated_to_human”).
- Guardrail checks (e.g., safety filters, hallucination detection via Protect).
For each span, you log:
- Span type: llm_call, tool_call, branch_decision, guardrail.
- Name: meaningful identifier ("search_orders_tool", "vector_search", "select_refund_branch").
- Timing: start and end timestamps (Galileo computes latency).
- Tokens and cost (for LLM calls): prompt tokens, completion tokens, and your cost calculation.
- Inputs/outputs (where safe): sanitized or redacted payloads.
- Status: success/failure, error messages, or guardrail verdicts.
Galileo aggregates this into a real-time, color-coded tree so you can see how your agent executed across branches and tools.
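The span fields listed above can be captured with a small wrapper. The following is a sketch under stated assumptions: the `span` context manager, the `SPAN_TYPES` set, and the plain list used as a sink are all hypothetical, and in a real integration the sink would be replaced by whatever logging call the Galileo SDK provides.

```python
import time
from contextlib import contextmanager

# Span types named in this article; the set itself is an illustrative assumption.
SPAN_TYPES = {"llm_call", "tool_call", "branch_decision", "guardrail"}


@contextmanager
def span(sink: list, span_type: str, name: str, **fields):
    """Record one step: type, name, status, latency, plus any extra fields."""
    if span_type not in SPAN_TYPES:
        raise ValueError(f"unknown span type: {span_type}")
    record = {"type": span_type, "name": name, "status": "success", **fields}
    start = time.perf_counter()
    try:
        yield record  # the caller can attach tokens, cost, or outputs
    except Exception as exc:
        record["status"] = "failure"
        record["error"] = str(exc)
        raise
    finally:
        # Latency is recorded whether the step succeeded or failed.
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        sink.append(record)


spans = []
with span(spans, "tool_call", "search_orders_tool", input={"order_id": "[REDACTED]"}) as s:
    s["output"] = {"matches": 1}

with span(spans, "llm_call", "draft_reply") as s:
    s.update(prompt_tokens=420, completion_tokens=80)

for record in spans:
    print(record["type"], record["name"], record["status"])
```

Recording the span in `finally` means failed steps still show up with their latency and error message, which is usually where the debugging value is.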
3. Metrics, Guardrails, and Insights
Once your spans are flowing into Galileo:
- The Agent Graph view shows sessions → traces → spans with:
  - Tool call sequences and branches.
  - Protect stages and guardrail outcomes.
  - Latency and cost per span and per trace.
- Agent Insights analyze patterns across traces:
  - Slowest branches and tools.
  - Mis-used tools (wrong selection, bad parameters).
  - Common failure paths and cascading errors.
- Guardrails use this same structure to:
  - Intercept risky outputs (hallucinations, PII leaks, policy violations).
  - Block, redact, override, or send webhooks in < 200 ms.
  - Version and roll back rules without redeploying code.
Instrument once, and the same data powers debugging, evaluation, and real-time protection.
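To make the per-trace rollup concrete, here is a small sketch of the kind of aggregation involved. The span records, field names, and the `summarize` function are hypothetical, and summing latencies assumes the spans in a trace ran sequentially; this is not how Galileo computes its metrics internally.

```python
from collections import defaultdict

# Hypothetical span records with made-up numbers, for illustration only.
spans = [
    {"trace_id": "t1", "name": "plan", "latency_ms": 900.0, "cost_usd": 0.004},
    {"trace_id": "t1", "name": "search_orders_tool", "latency_ms": 15200.0, "cost_usd": 0.0},
    {"trace_id": "t1", "name": "draft_reply", "latency_ms": 1900.0, "cost_usd": 0.006},
    {"trace_id": "t2", "name": "plan", "latency_ms": 800.0, "cost_usd": 0.004},
]


def summarize(spans):
    """Roll span records up into per-trace latency, cost, and slowest span."""
    by_trace = defaultdict(list)
    for s in spans:
        by_trace[s["trace_id"]].append(s)
    summary = {}
    for trace_id, items in by_trace.items():
        slowest = max(items, key=lambda s: s["latency_ms"])
        summary[trace_id] = {
            # Assumes sequential spans; parallel spans would need wall-clock bounds.
            "latency_ms": sum(s["latency_ms"] for s in items),
            "cost_usd": round(sum(s["cost_usd"] for s in items), 6),
            "slowest_span": slowest["name"],
        }
    return summary


summary = summarize(spans)
print(summary["t1"])
```

In this made-up data, the 18-second trace is explained almost entirely by one tool call, which is the kind of attribution a single request-level latency number cannot give you.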
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Sessions → Traces → Spans | Models each agent run as a structured tree of steps, including tool calls and branches. | Gives a single, coherent view of complex agent behavior instead of scattered, unstructured logs. |
| Latency & Cost per Trace/Span | Captures timing and token usage for every LLM call and tool interaction. | Lets you enforce latency budgets and control spend with precision, not guesses. |
| Agent-Specific Metrics & Guardrails | Tracks tool selection quality, action completion, and safety/security guardrails on each span. | Helps you catch wrong tool actions, incomplete flows, and policy drift before users do. |
Ideal Use Cases
- Best for production agent debugging: Because it turns each support, sales, or operations session into a trace tree where you can see exactly which tool call or branch caused a failure or slowdown.
- Best for eval-to-guardrail workflows: Because the same span-level instrumentation used in offline evals can be promoted directly into Protect guardrails, enforcing behavior in real time with sub-200 ms latency.
Limitations & Considerations
- Instrumentation requires code changes: You’ll need to wrap your agent logic (LLM calls, tool invocations, decision points) with Galileo spans. Plan for a small engineering task rather than expecting “zero-touch” magic.
- You must design useful span boundaries: If you log everything as one giant span, you’ll lose the value. Take the time to model logical steps: planning, each tool call, each branch decision, each guardrail check.
Pricing & Plans
Galileo is built for teams moving from demos to production-scale agents and RAG systems, with deployment options that match enterprise constraints.
Typical structure:
- Growth / Team Plan: Best for lean teams or business units needing to instrument a few agent workflows, run evals, and get full trace-level observability on a limited volume (e.g., thousands of traces/month).
- Enterprise Plan: Best for organizations with multiple agents, strict latency and compliance requirements, and high-volume traffic (10,000+ requests/min), including VPC or on-prem deployment, SSO, SOC 2 Type II, and HIPAA-ready infrastructure with BAAs.
For detailed pricing and volume tiers, talk to the Galileo team to align a plan with your traffic, latency budget, and evaluator mix.
Frequently Asked Questions
How do I actually start instrumenting my agent with Galileo?
Short Answer: Install the Galileo SDK, initialize it in your app, and wrap your agent’s LLM calls, tool calls, and decision points with spans that send events to Galileo.
Details:
In practice, teams do this in three steps:
- Connect your app to Galileo:
  - Add the Galileo SDK or API client to your backend/agent runtime.
  - Configure credentials and environment (dev, staging, prod).
- Define session and trace boundaries:
  - Create a new Galileo session for each user interaction.
  - Start a trace for each agent run, attaching metadata like agent name, version, and model.
- Instrument spans around key steps:
  - Wrap each LLM call and tool call in a span that records:
    - Start/end times (latency).
    - Tokens and cost (if applicable).
    - Inputs/outputs (with redaction as needed).
    - Branch labels and outcomes.
  - Log guardrail checks as spans too, so you can see when Protect blocked, redacted, or overrode outputs.
Once this is in place, Galileo automatically assembles the session tree. You can open any session and see the live tree of every step, color-coded by guardrail outcome with full latency details.
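Two of the span fields above, cost and redacted inputs/outputs, are values you compute yourself before logging. The helpers below are a minimal sketch: the prices in `PRICE_PER_1K` are placeholder numbers rather than any provider's real rates, and the two regular expressions are deliberately simple examples, not a complete PII filter.

```python
import re

# Placeholder per-1K-token prices (assumption); substitute your provider's rates.
PRICE_PER_1K = {"prompt": 0.0025, "completion": 0.01}

# Simple illustrative patterns; real redaction usually needs broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def llm_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one LLM call from its token counts."""
    cost = (
        prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
        + completion_tokens / 1000 * PRICE_PER_1K["completion"]
    )
    return round(cost, 6)


def redact(text: str) -> str:
    """Mask email addresses and card-like digit runs before logging a payload."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


span_fields = {
    "input": redact("Update card 4111 1111 1111 1111 for jane@example.com"),
    "cost_usd": llm_cost_usd(prompt_tokens=1000, completion_tokens=500),
}
print(span_fields)
```

Redacting before the payload leaves your process keeps sensitive values out of any downstream store, regardless of how the observability backend is configured.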
Can Galileo show tool selection quality and multi-step action completion?
Short Answer: Yes. Galileo is built specifically to evaluate and observe agents, including tool selection quality and whether multi-step actions fully complete user goals.
Details:
Generic observability tools treat everything as logs or spans without understanding agents. Galileo adds agent-specific metrics on top of your instrumentation:
- Tool selection quality: Evaluates whether your agent chose the correct tool(s) with appropriate parameters for a given user request. This is computed from your spans and evaluator results.
- Action completion: Measures if multi-step conversations actually accomplish the user’s goal, not just produce plausible text. This uses session-level context and outcomes.
Because your instrumentation already models the session as traces and spans:
- You can see where a tool was mis-selected (wrong tool, wrong args).
- You can identify flows where the agent partially completes an action and drops the ball.
- You can turn these insights into evaluators and then guardrails:
  - e.g., “If action completion fails twice, escalate to a human via webhook.”
This is the eval-to-guardrail lifecycle in action: what you instrument and evaluate offline can be enforced in real time, on 100% of traffic.
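The example rule above ("if action completion fails twice, escalate to a human via webhook") could be sketched as follows. This assumes "fails twice" means two failed completions in total within a session; `should_escalate`, `apply_rule`, the payload shape, and the `notify` callback (standing in for a webhook sender) are all hypothetical and do not represent how Protect rules are defined.

```python
from typing import Callable, Sequence


def should_escalate(completions: Sequence[bool], max_failures: int = 2) -> bool:
    """True once action completion has failed `max_failures` times in a session."""
    failures = sum(1 for ok in completions if not ok)
    return failures >= max_failures


def apply_rule(completions: Sequence[bool], notify: Callable[[dict], None]) -> bool:
    """Call `notify` (a stand-in for a webhook sender) when the rule fires."""
    if should_escalate(completions):
        notify({"action": "escalate_to_human", "reason": "action_completion_failed_twice"})
        return True
    return False


# Each entry is the action-completion verdict for one attempt in the session.
sent = []
fired = apply_rule([True, False, False], notify=sent.append)
print(fired, sent)
```

Passing the notifier in as a callback keeps the rule testable without a network call; in production it would be swapped for whatever webhook mechanism you use.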
Summary
If you can’t see your agent’s tool calls, branches, and per-trace latency/cost, you don’t have an AI system; you have a demo that might fail at any time. Instrumenting your agent with Galileo turns each user session into a structured tree of traces and spans, with clear visibility into tool usage, decision paths, guardrail outcomes, and performance metrics.
That same instrumentation powers everything else Galileo does: Evaluation Engine for offline testing, Signals for discovering unknown failure patterns, and Protect for real-time guardrails that intercept bad behavior in under 200 ms at ~97% lower cost than heavyweight LLM judges.
Instrument once, and you get debugging, evaluation, and run-time protection from the same trace-level data.