
How do we set up human annotation/review queues in LangChain LangSmith to label good/bad agent behavior?
Most teams only discover bad agent behavior when a user complains. By then, you’re debugging a long, branching trace with no clear way to get expert feedback into the improvement loop. LangSmith’s human annotation queues are designed to fix exactly that: they let subject-matter experts review production runs, label good/bad behavior, and feed that signal straight into your evals and datasets.
This guide walks through how to set up human annotation/review queues in LangSmith to label good and bad agent behavior, connect those labels to evaluation, and operationalize this as part of your agent quality workflow.
Quick Answer: In LangSmith, you (1) capture traces from your agents, (2) flag or sample runs into annotation queues, (3) let experts review and label those runs as good/bad (plus rich feedback), and (4) reuse that labeled data in evaluators, datasets, and regression tests so your agents consistently improve.
The Quick Overview
- What It Is: Human annotation and review queues in LangSmith let you route selected agent runs to experts, who can label outputs (good/bad, policy-violating, low quality, etc.), add comments, and correct or suggest better outputs.
- Who It Is For: AI and agent teams that are serious about production quality—ML engineers, product engineers, data/ML teams, and domain experts responsible for compliance, support, or operations.
- Core Problem Solved: LLM agents fail in subtle ways (partially correct, off-policy, tone issues). Basic logs can’t capture whether a run was “good” in your domain. Annotation queues turn raw traces into labeled data you can trust.
How It Works
LangSmith is trace-first: everything starts from runs/traces your agents send to LangSmith. Annotation queues sit on top of those traces, giving you a structured human-in-the-loop review workflow:
- Instrument & Capture Runs: You integrate LangSmith with your stack (Python/TS/Go/Java SDKs, OpenTelemetry, or native LangChain / LangGraph). Every agent interaction becomes a trace with tool calls, prompts, responses, and metadata.
- Route Runs to Queues: You decide which runs should be reviewed—flagged by users, random samples, runs with low automated scores, or those from a new model version. These go into one or more annotation queues.
- Review, Label, and Correct: Subject-matter experts open queues in the LangSmith UI, see the full timeline (“exactly what happened, in what order, and why”), label each run as good/bad (plus custom fields), add comments, and optionally provide improved reference outputs.
Those labeled runs then feed back into evaluation: you can use them to calibrate LLM-as-judge, train/condition evaluators, create regression suites, and gate new releases.
Step 1: Instrument your agents to send traces
Before you can review anything, you need traces in LangSmith.
At a high level:
- Choose an integration path:
  - Use LangChain / LangGraph with built-in LangSmith tracing.
  - Use SDKs (Python, TypeScript, Go, Java) to wrap your own agent stack.
  - Or use OpenTelemetry to emit traces in a standardized format.
- Set project and environment:
  - Use separate projects for dev/staging/production.
  - Tag runs with versions (`model_version`, `agent_version`), user segment, or feature flags so you can filter them into queues later.
- Capture full context:
  - Ensure tools, prompts, intermediate steps, and final outputs are all traced.
  - Include metadata like user id (or hashed id), use case, and channel.
Once that’s in place, every agent call becomes a replayable run in LangSmith—your raw material for human review.
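These tagging conventions can be sketched in plain Python. The helper below is illustrative, not a LangSmith API: the field names mirror this guide's examples, and in a real integration the resulting dict would be attached as run metadata through whichever SDK you use.

```python
import hashlib


def build_run_metadata(user_id: str, use_case: str, channel: str,
                       model_version: str, agent_version: str) -> dict:
    """Build a consistent metadata dict to attach to every traced run."""
    return {
        # Hash the raw user id so traces never store it directly.
        "user_hash": hashlib.sha256(user_id.encode()).hexdigest()[:16],
        "use_case": use_case,
        "channel": channel,
        "model_version": model_version,
        "agent_version": agent_version,
    }


meta = build_run_metadata("user-42", "support", "web_chat", "gpt-4o", "v3")
```

Because the hash is deterministic, you can still group all runs from one user in queue filters without exposing the raw id.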
Step 2: Decide what “good” and “bad” behavior means
Human annotation only works if reviewers are aligned on what they’re labeling. Before creating queues, define:
- Quality criteria:
  - Correctness vs. ground truth.
  - Faithfulness to context (no hallucinations).
  - Policy adherence (compliance, safety).
  - Tone and style (brand voice, empathy).
  - Latency and interaction pattern (e.g., unnecessary tool calls).
- Label schema:
  - Minimal core: `good` / `bad`.
  - More informative: `good`, `needs minor fix`, `bad`, `unsafe`.
  - Optional secondary tags: `hallucination`, `policy_violation`, `low_recall`, `tone_issue`, `tool_misuse`.
- What reviewers should capture:
  - Binary/graded label (good/bad or 1–5).
  - Short free-text explanation.
  - Optional corrected/better answer (a “reference output”).
You’ll encode this schema in your annotation queue configuration and reviewer instructions.
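One way to make the schema concrete before configuring anything is a small sketch in code. The class and tag names below simply encode the labels listed above; they are not a LangSmith API, just a way to keep your team's schema explicit and validated.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Label(str, Enum):
    GOOD = "good"
    NEEDS_MINOR_FIX = "needs minor fix"
    BAD = "bad"
    UNSAFE = "unsafe"


# Secondary failure-type tags from the schema above.
SECONDARY_TAGS = {"hallucination", "policy_violation",
                  "low_recall", "tone_issue", "tool_misuse"}


@dataclass
class Annotation:
    run_id: str
    label: Label
    explanation: str
    tags: set = field(default_factory=set)
    reference_output: Optional[str] = None  # reviewer's better answer

    def __post_init__(self):
        unknown = self.tags - SECONDARY_TAGS
        if unknown:
            raise ValueError(f"Unknown tags: {unknown}")
```

Keeping the schema this explicit makes it easy to mirror it in queue configuration, reviewer instructions, and downstream analysis scripts.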
Step 3: Create annotation queues for human review
Annotation queues are the workbench for your human reviewers.
At a high level, you’ll:
- Create a queue per workflow:
  - Examples: `prod_support_agent_quality`, `finance_policy_safety_review`, `new_model_ab_test_feedback`.
  - Each queue targets a specific slice of traffic and set of criteria.
- Define queue scope and filters:
  - Filter runs by:
    - Project (e.g., production project only).
    - Tags/metadata (`agent="support_bot"`, `version="v3"`).
    - Time windows (last 24 hours, last week).
  - Decide whether runs are:
    - Manually added (flagged by engineers or PMs).
    - Automatically sampled (e.g., 1% of traffic).
    - Pushed based on evaluator scores (e.g., low LLM-as-judge scores).
- Set reviewer instructions:
  - For each queue, define:
    - What “good” means for this agent.
    - When to mark a run as “bad”.
    - How to treat edge cases (partial correctness, missing details).
    - How long comments should be and when to add reference outputs.
  - Put examples directly in the instructions (good/bad runs with rationale).
- Assign owners and reviewers:
  - Assign domain experts, not just engineers.
  - Example groups:
    - Support leads for customer-support agents.
    - Legal/compliance for policy-sensitive agents.
    - Ops managers for logistics or finance workflows.
Once configured, a queue continuously pulls in runs that match your target criteria and presents them in a review-first UI.
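Conceptually, a queue's scope boils down to a filter over runs. A minimal sketch, assuming runs and queue configs are plain dicts (in LangSmith itself you would configure this in the UI or via the SDK; the field names here are illustrative):

```python
def matches_queue(run: dict, queue_filter: dict) -> bool:
    """Return True if a run's project and metadata match the queue's filter."""
    if run.get("project") != queue_filter["project"]:
        return False
    # Every metadata key the queue filters on must match exactly.
    return all(run.get("metadata", {}).get(k) == v
               for k, v in queue_filter.get("metadata", {}).items())


support_queue = {
    "name": "prod_support_agent_quality",
    "project": "production",
    "metadata": {"agent": "support_bot", "version": "v3"},
}

run = {"project": "production",
       "metadata": {"agent": "support_bot", "version": "v3"}}
matches_queue(run, support_queue)  # True
```

The same predicate idea applies whether the filter runs server-side in LangSmith or in a script that pushes matching runs into a queue.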
Step 4: Add runs into queues (flagging and sampling)
There are several ways to feed runs into annotation queues so human effort is used where it matters most.
1. Manual flagging from the trace view
- In LangSmith, open a specific run that looks off.
- Use the “flag” or “send to annotation” action to push it into a designated queue (e.g., `support_agent_manual_review`).
- Optionally add a note for the reviewer (what looked wrong, suspected policy issue).
This is useful for debugging incidents, support escalations, or one-off edge cases.
2. Programmatic sampling (online evals + queues)
- Configure online evaluation for production traffic.
- Define one or more evaluators (LLM-as-judge, heuristic checks).
- Use their scores to route runs:
- Low score → annotation queue.
- Disagreement between evaluators → annotation queue.
- Random sample (e.g., 1–5% of runs) → annotation queue for spot checks.
This pattern keeps human review focused on likely failures and blind spots.
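The three routing rules above can be sketched as one function. This is illustrative logic, not a LangSmith API; the queue names, thresholds, and sampling rate are assumptions you would tune for your workload.

```python
import random
from typing import Dict, Optional


def route_to_queue(judge_scores: Dict[str, float],
                   sample_rate: float = 0.02,
                   low_threshold: float = 0.5,
                   rng=random) -> Optional[str]:
    """Decide whether a run needs human review, and which queue it belongs in."""
    scores = list(judge_scores.values())
    if any(s < low_threshold for s in scores):
        return "low_score_review"               # likely failure
    if max(scores) - min(scores) > 0.4:
        return "evaluator_disagreement_review"  # evaluators can't agree
    if rng.random() < sample_rate:
        return "random_spot_check"              # blind-spot sampling
    return None                                 # skip human review
```

Passing `rng` in makes the sampling branch testable; in production you would simply use the default `random` module.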
3. User-initiated flags
If your front-end or product surface allows users to flag bad answers:
- Add a mechanism (e.g., “Thumbs down” button).
- When users flag, include that signal in the run’s metadata (`user_flagged=true`, reason code).
- Filter and send these runs into a dedicated queue (`user_flagged_runs`) to reconcile user complaints with expert review.
Step 5: Review and label good/bad agent behavior
Inside the annotation queue, reviewers get a workflow optimized for domain expertise rather than engineering skills.
For each run in the queue, reviewers can:
- Inspect the full trace timeline:
  - See:
    - User input(s) and multi-turn conversation thread.
    - All prompts, tool calls, responses, and intermediate decisions.
    - Metadata (version, environment, user attributes).
  - This makes it obvious why the agent behaved badly: wrong tool, mis-parsed instruction, missing context, or model hallucination.
- Apply labels (good/bad + custom fields):
  - Choose `good` or `bad` (or your custom scale).
  - Optionally:
    - Tag the failure type (`hallucination`, `policy_violation`, `tool_misuse`).
    - Provide a severity (e.g., `minor`, `major`).
- Add narrative feedback:
  - Short explanation of what went wrong/right.
  - Guidance like:
    - “Hallucinated product availability; inventory shows out of stock.”
    - “Correct answer but tone is too casual for enterprise.”
    - “Used the wrong pricing tool; should have called `quote_service`.”
- Provide a better answer (reference output):
  - Where feasible, reviewers write the answer that should have been returned.
  - These references are incredibly valuable for:
    - Aligning LLM-as-judge evaluators.
    - Creating gold-standard test cases.
    - Demonstrating expected tone and structure.
- Submit and move to the next run:
  - Once submitted, the annotation is stored and the queue loads the next item.
  - You get metrics like queue size, throughput, and distribution of good/bad labels over time.
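The label-distribution metric mentioned above is straightforward to compute once annotations are exported. A minimal sketch, assuming each annotation is a dict with a `label` field:

```python
from collections import Counter


def label_distribution(annotations):
    """Share of each label across submitted annotations."""
    counts = Counter(a["label"] for a in annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


anns = [{"label": "good"}, {"label": "good"},
        {"label": "bad"}, {"label": "unsafe"}]
label_distribution(anns)  # {'good': 0.5, 'bad': 0.25, 'unsafe': 0.25}
```

Tracking this distribution per queue and per agent version is a cheap early-warning signal for regressions.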
Step 6: Turn annotation queues into evaluation and datasets
Annotations are only valuable if they change how you ship agents. LangSmith’s evaluation stack is built to connect those dots.
You can use your annotated runs to:
- Calibrate LLM-as-judge evaluators
LLM-as-judge is powerful but imperfect. To trust it, you need to see where it disagrees with humans.
- Take annotated runs from queues (good/bad labels plus comments).
- Use them as a calibration set:
- Run your LLM-as-judge evaluators over these labeled runs.
- Compare evaluator scores vs. human labels.
- Adjust prompts, criteria, and thresholds accordingly.
- Iterate until:
- The evaluator catches most human-identified “bad” runs.
- False positives are acceptable for your workflow.
LangSmith makes it easy to route samples where evaluators disagree with humans so you can continuously refine them.
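The calibration comparison can be sketched as follows, assuming each run carries an LLM-as-judge score and a human label; the threshold and field names are assumptions, and the interesting output is the recall on human-flagged bad runs plus the set of disagreements worth re-reviewing.

```python
def judge_vs_human(runs, judge_threshold=0.5):
    """Compare LLM-as-judge verdicts with human labels on a calibration set.

    Each run: {"judge_score": float, "human_label": "good" | "bad"}.
    Returns (recall on human-labeled bad runs, list of disagreeing runs).
    """
    caught = missed = 0
    disagreements = []
    for r in runs:
        judge_bad = r["judge_score"] < judge_threshold
        human_bad = r["human_label"] == "bad"
        if human_bad and judge_bad:
            caught += 1        # evaluator found the same failure
        elif human_bad:
            missed += 1        # human-only failure: evaluator blind spot
        if judge_bad != human_bad:
            disagreements.append(r)
    recall = caught / (caught + missed) if (caught + missed) else 1.0
    return recall, disagreements
```

Re-running this after each evaluator prompt change gives you a concrete number to iterate against instead of eyeballing agreement.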
- Build evaluation datasets from production traces
Your best eval sets come from reality, not synthetic examples.
- Convert annotated runs into datasets:
- Inputs: user query + context.
- Expected outputs: reviewer-provided reference answer.
- Metadata: labels, failure type, environment, version.
- Run offline evaluations:
- Compare different prompts, tools, or models against this dataset.
- Use LLM-as-judge and/or custom evaluators plus the reference outputs.
This is how you prevent regressions: every bad failure you see in production becomes a test case that new versions must pass.
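The run-to-example mapping can be sketched in plain Python. In LangSmith you would typically create the dataset via the SDK or UI; the field names below are assumptions that mirror the inputs/outputs/metadata split described above.

```python
def to_dataset_examples(annotated_runs):
    """Convert annotated runs into dataset examples (inputs, outputs, metadata)."""
    examples = []
    for r in annotated_runs:
        if not r.get("reference_output"):
            continue  # only keep runs where the reviewer wrote a better answer
        examples.append({
            "inputs": {"query": r["input"], "context": r.get("context", "")},
            "outputs": {"answer": r["reference_output"]},
            "metadata": {"label": r["label"],
                         "failure_type": r.get("failure_type")},
        })
    return examples
```

Filtering on `reference_output` keeps the dataset limited to examples with a human-vetted expected answer, which is what reference-based evaluators need.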
- Drive regression and A/B testing before deploy
Before pushing a new version:
- Run your candidate agent against the datasets derived from annotation queues.
- Use LangSmith’s evaluation framework to:
- Score each variant.
- Compare them side-by-side.
- Check that the new version fixes past failures instead of reintroducing them.
Human labels make these comparisons meaningful; you’re no longer optimizing abstract scores, but real user-impacting behavior.
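A release gate built on those comparisons might look like the sketch below. The pass criteria (fix at least 90% of past failures, introduce no new ones) are illustrative thresholds, not a LangSmith feature.

```python
def passes_regression_gate(candidate_results, min_fixed_fraction=0.9):
    """Gate a new agent version against the annotated regression suite.

    candidate_results: list of {"was_failure": bool, "now_passes": bool},
    one entry per dataset example ("was_failure" = failed in production).
    """
    past_failures = [r for r in candidate_results if r["was_failure"]]
    if not past_failures:
        return True
    fixed = sum(r["now_passes"] for r in past_failures)
    # Any previously-passing example that now fails is a regression.
    regressions = any(not r["now_passes"]
                      for r in candidate_results if not r["was_failure"])
    return (fixed / len(past_failures)) >= min_fixed_fraction and not regressions
```

Wiring a check like this into CI means a candidate version cannot ship until it clears the failures your reviewers already labeled.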
Step 7: Operationalize annotation as an ongoing workflow
The goal is not a one-time review sprint; it’s a continuous quality loop:
- Define targets:
  - E.g., review 1–5% of production runs weekly.
  - Or review all low-scoring runs for high-risk workflows (finance, legal).
- Rotate and train reviewers:
  - Onboard new subject-matter experts with example-laden instructions and prior annotated runs.
  - Periodically sample each reviewer’s decisions for consistency.
- Monitor annotation metrics:
  - % of runs marked bad, per queue.
  - Breakdown of failure types over time.
  - Time-to-review and backlog size.
- Close the loop with engineers:
  - Engineers inspect clusters of similar failures (e.g., all hallucination-labeled runs) in LangSmith.
  - Use this to:
    - Improve prompts or routing logic.
    - Add new tools or guardrail checks.
    - Adjust model choice or temperature.
  - Re-run evals to confirm the fix before redeploying.
This is the same discipline you’d apply to any critical system: instrument, observe, label failures, then iterate.
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Annotation Queues | Route selected traces to a review inbox for subject-matter experts. | Create a structured, repeatable human-in-the-loop process for agent quality. |
| Trace-Centric Review UI | Shows full run timelines, tools, prompts, and messages for each run. | Helps reviewers understand why behavior was good/bad, not just the output. |
| Evaluation & Dataset Integration | Converts annotated runs into eval datasets and calibrates LLM-as-judge. | Turns production failures into guardrails and regression tests. |
Ideal Use Cases
- Best for production agents with silent failures: Because annotation queues identify runs that are “technically valid but wrong for your domain” and turn them into labeled data you can act on.
- Best for regulated or policy-sensitive workflows: Because you can route high-risk runs to compliance or legal reviewers, label policy violations, and enforce improvements before scaling traffic.
Limitations & Considerations
- Human time is limited: You won’t review every run. Use evaluators, sampling, and metadata filters to prioritize which traces go into queues (low scores, high-value customers, risky actions).
- Annotation quality varies by reviewer: Consistency matters. Provide clear guidelines, examples, and periodic calibration sessions where reviewers discuss borderline cases and align labels.
Pricing & Plans
LangSmith is designed for teams of any size, with usage-based pricing on top of seats:
- You pay for:
- Seats for your team (engineers, PMs, and domain experts who use annotation queues).
- Traces/events stored and retained, plus evaluation volume.
- Annotation queues themselves are part of the LangSmith evaluation and feedback tooling; pricing depends on your overall LangSmith plan and usage.
Common patterns:
- Team / Growth plans: Best for product and ML teams rolling out their first serious agents, needing trace visibility, annotation queues, and evals for one or a few core use cases.
- Enterprise plans: Best for organizations running multiple agents across business units, requiring SSO/SAML, SCIM, RBAC/ABAC, audit logs, extended retention (e.g., 400 days), and deployment options (US/EU residency, hybrid, or self-hosted) so data stays within your VPC.
For exact pricing and deployment options, talk to the LangChain team.
Frequently Asked Questions
How reliable is LLM-as-judge, and why do I still need human annotation?
Short Answer: LLM-as-judge is useful but imperfect. You still need human annotation queues to catch domain-specific failures and calibrate evaluators.
Details:
LLM-as-judge evaluators can score outputs against criteria you define (correctness, policy adherence, tone), but they can mis-score nuanced or domain-heavy cases. LangSmith’s annotation queues let subject-matter experts review a subset of runs and flag disagreements with automated scores. You can then use those disagreements to:
- Refine evaluator prompts and thresholds.
- Train better evaluators or heuristics.
- Decide when to require human-in-the-loop approvals for specific actions.
Over time, this mix of human labels and evaluator scores gives you a robust quality signal at scale.
Can non-technical reviewers use annotation queues without knowing how the agent is built?
Short Answer: Yes. Annotation queues are designed for domain experts, not just engineers.
Details:
Reviewers work from a UI that shows:
- User questions and agent responses in plain language.
- Key context and tool results in human-readable form.
- Simple controls for marking runs as good/bad and adding comments.
They don’t need to understand model architectures or underlying code. Engineers, in turn, can inspect the same traces—prompts, tool calls, and branching logic—to implement fixes based on that feedback. This separation is intentional: domain experts judge quality; engineers use traces and labels to improve behavior.
Summary
Human annotation and review queues in LangSmith give you a practical way to label good and bad agent behavior directly from production traces. You instrument your agents once, route selected runs into annotation queues, let subject-matter experts review and label those runs, and then feed that labeled data into evaluation, datasets, and regression tests. Instead of guessing why your agent failed—or finding out from angry users—you build a repeatable, trace-first workflow where every failure becomes a test case and every deployment is backed by observable, measured quality.