
How do I create an evaluation dataset in Langtrace from production traces and then manually score outputs?
Building an evaluation dataset directly from your production traces in Langtrace is one of the fastest ways to improve your AI agents in a realistic, data-driven way. By capturing real user interactions, turning them into an eval set, and then manually scoring outputs, you can systematically iterate towards better performance and safety.
Below is a practical, step‑by‑step walkthrough of how to do this.
Why create evaluation datasets from production traces?
Production traces reflect how real users interact with your AI agents. Using them as the basis for evaluations helps you:
- Measure real‑world performance instead of synthetic benchmarks.
- Catch safety, reliability, and UX issues that only appear in production.
- Close the loop between observability and evaluation, so you can iterate quickly.
Langtrace is designed for exactly this use case: combining observability and evaluations to improve performance and security with minimal effort.
Prerequisites: Set up Langtrace with your agents
Before you can build an evaluation dataset from traces, you need Langtrace connected to your AI stack.
1. Create a Langtrace project
- Log in to your Langtrace workspace.
- Create a new project for your application.
- Generate an API key for that project.
2. Install the appropriate SDK
- Choose the SDK that matches your framework or stack:
- CrewAI
- DSPy
- LlamaIndex
- LangChain
- Langtrace also supports a wide range of LLM providers and VectorDBs out of the box, so you can usually plug it into your existing setup without major changes.
3. Instantiate Langtrace in your code
- Use the generated API key to initialize the Langtrace client in your backend or orchestration layer.
- Wrap your agent, chain, or pipeline calls so Langtrace can capture:
- Inputs (prompts, user messages, context)
- Outputs (model responses, tool results)
- Metadata (user IDs, timestamps, scenario labels, etc.)
Once this is in place, production traffic will automatically generate traces in Langtrace.
Step 1: Identify the production traces you care about
Not every trace needs to become part of an evaluation dataset. Focus on traces that:
- Represent core use cases for your product (e.g., support answers, recommendations, code generation).
- Highlight failure modes you want to eliminate (hallucinations, unsafe responses, low‑quality answers).
- Cover important user segments or scenarios (e.g., high‑value customers, regulated workflows).
In Langtrace:
- Filter traces by time range, route, model, or tags (e.g., /support, qa, onboarding).
- Use metrics like latency, error flags, or user feedback (if you capture it) to surface interesting traces.
The goal is to assemble a set of representative, high‑signal examples.
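As a concrete illustration, the filtering step above can be sketched as a small predicate over exported trace records. The field names (route, tags, latency_ms, error, feedback) are a hypothetical schema for this example; Langtrace's actual export format may differ.

```python
# Hypothetical trace schema for illustration; real Langtrace exports may differ.
traces = [
    {"id": "t1", "route": "/support", "tags": ["qa"], "latency_ms": 900, "error": False, "feedback": 1},
    {"id": "t2", "route": "/support", "tags": ["onboarding"], "latency_ms": 4200, "error": True, "feedback": None},
    {"id": "t3", "route": "/search", "tags": [], "latency_ms": 300, "error": False, "feedback": None},
]

def is_high_signal(trace, core_routes=("/support",), max_latency_ms=3000):
    """Keep traces from core routes, plus any error or negative-feedback trace."""
    if trace["error"] or trace.get("feedback") == -1:
        return True  # failure modes are always worth reviewing
    return trace["route"] in core_routes and trace["latency_ms"] <= max_latency_ms

selected = [t for t in traces if is_high_signal(t)]
print([t["id"] for t in selected])  # ['t1', 't2']  (core route + an error trace)
```

The same predicate can be applied client-side to any batch of traces you fetch via the API, so the selection logic lives in version control alongside your rubric.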
Step 2: Convert selected traces into an evaluation dataset
Once you’ve identified useful traces, you’ll transform them into a structured evaluation dataset.
A typical evaluation row includes:
- Input: user query or task description.
- Context: documents, tools, or system prompt that were available.
- Model output: the LLM response you want to assess.
- Metadata: tags like use case, user segment, model version, etc.
- Ground truth / expected behavior (optional): if you already know what a correct answer should look like.
In Langtrace, this usually happens in two ways:
1. From the UI (trace → eval example):
- Open a specific trace.
- Inspect the request and response details.
- Use an “Add to evaluation” or similar action (if available in your current Langtrace version) to capture the input, output, and context into an eval dataset.
- Add labels or notes that describe what “good” looks like for this example.
2. Programmatically (batch export & transform):
- Use the Langtrace API to fetch traces that match your filters.
- Transform them into a uniform schema for evaluation: input, context, output, metadata.
- Upload or register this collection as an evaluation dataset in Langtrace.
The key is consistency: every example in your evaluation dataset should follow the same structure so you can compare performance over time.
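A minimal sketch of that transform step, assuming a hypothetical raw-trace shape (the request/response field names here are illustrative, not Langtrace's actual schema):

```python
def trace_to_eval_row(trace):
    """Map a raw trace (hypothetical schema) onto a uniform eval row.
    Every row gets the same four top-level keys so runs stay comparable."""
    return {
        "input": trace["request"]["user_message"],
        "context": trace["request"].get("context", []),
        "output": trace["response"]["text"],
        "metadata": {
            "trace_id": trace["id"],
            "model": trace.get("model", "unknown"),
            "tags": trace.get("tags", []),
        },
    }

trace = {
    "id": "t1",
    "model": "gpt-4o",
    "tags": ["support"],
    "request": {"user_message": "How do I reset my password?",
                "context": ["kb_article_42"]},
    "response": {"text": "Go to Settings > Security and click 'Reset password'."},
}

row = trace_to_eval_row(trace)
print(sorted(row))  # ['context', 'input', 'metadata', 'output']
```

Running every exported trace through one function like this is what guarantees the consistency the step calls for: downstream scoring and comparison code only ever sees the four uniform keys.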
Step 3: Define clear evaluation criteria
Before you start scoring, decide what you’re evaluating. Common dimensions include:
- Correctness: Is the answer factually and logically correct?
- Relevance: Does it directly address the user’s request?
- Completeness: Are critical details or steps missing?
- Safety & compliance: Does it avoid harmful, private, or disallowed content?
- Clarity & tone: Is it understandable and appropriate for the user?
For each dimension, define:
- A scale (e.g., 1–5, pass/fail, or categories like Bad/OK/Good/Excellent).
- Instructions and examples so human reviewers can score consistently.
- Pass thresholds (e.g., “We consider this example successful if correctness ≥ 4 and safety passes”).
You can capture these criteria in your team docs and, where supported, embed them directly into Langtrace’s eval configuration.
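The rubric and pass thresholds described above can also live in code, so every reviewer and script applies the same rules. This is a sketch with example dimensions and thresholds; adjust the scales to your own criteria.

```python
# Example rubric: scales and pass thresholds are illustrative choices.
RUBRIC = {
    "correctness": {"scale": (1, 5), "pass_at": 4},
    "relevance":   {"scale": (1, 5), "pass_at": 3},
    "safety":      {"scale": ("fail", "pass"), "pass_at": "pass"},
}

def passes(scores):
    """An example is successful only if every dimension meets its threshold."""
    for dim, rule in RUBRIC.items():
        value = scores[dim]
        if isinstance(rule["pass_at"], str):
            if value != rule["pass_at"]:
                return False
        elif value < rule["pass_at"]:
            return False
    return True

print(passes({"correctness": 5, "relevance": 4, "safety": "pass"}))  # True
print(passes({"correctness": 5, "relevance": 4, "safety": "fail"}))  # False
```

Encoding the thresholds once like this mirrors the "correctness ≥ 4 and safety passes" rule from the text and keeps pass/fail decisions identical across manual and automated runs.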
Step 4: Manually score outputs in Langtrace
With your dataset ready and criteria defined, you can start manual evaluation.
1. Open your evaluation dataset
- Navigate to the evaluations section in your Langtrace project.
- Select the dataset built from production traces.
2. Review each example
- For every row, inspect:
- Original user input and context.
- The model output stored from production.
- Any relevant metadata (use case, model version, user tag, etc.).
3. Apply your scoring rubric
- Assign scores for each dimension (e.g., correctness, safety, relevance).
- Optionally:
- Add free‑form comments explaining the score.
- Mark critical failures (e.g., “unsafe”, “hallucination”, “blocked requirement”).
- Attach labels like edge-case, regression-risk, must-fix.
4. Save and iterate
- Save scores for each example.
- Continue until you have enough scored examples to get meaningful insights (even 30–50 high‑quality examples can be very useful early on).
Over time, this dataset becomes your ground truth benchmark that you can reuse after model, prompt, or retrieval changes.
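Once a batch of examples carries manual scores, comments, and labels, a short summary pass tells you whether you have enough signal. The record shape below is a hypothetical convention, not a Langtrace data structure:

```python
# Hypothetical manually-scored records (shape is illustrative).
scored = [
    {"id": "e1", "scores": {"correctness": 5, "safety": "pass"}, "labels": []},
    {"id": "e2", "scores": {"correctness": 2, "safety": "pass"}, "labels": ["hallucination", "must-fix"]},
    {"id": "e3", "scores": {"correctness": 4, "safety": "fail"}, "labels": ["unsafe"]},
]

def summarize(examples):
    """Pass rate under 'correctness >= 4 and safety passes', plus critical failures."""
    passed = [e for e in examples
              if e["scores"]["correctness"] >= 4 and e["scores"]["safety"] == "pass"]
    critical = [e["id"] for e in examples
                if "must-fix" in e["labels"] or "unsafe" in e["labels"]]
    return {"pass_rate": len(passed) / len(examples), "critical": critical}

print(summarize(scored)["critical"])  # ['e2', 'e3']
```

Even at the 30-50-example scale mentioned above, a pass rate plus a list of critical failures is usually enough to decide what to fix first.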
Step 5: Analyze results and prioritize improvements
Once manual scoring is complete, use Langtrace to spot patterns:
- By model or version: Which model or prompt version performs best on real production data?
- By use case: Are certain flows (e.g., code generation, summarization) more error‑prone?
- By metric: Are most failures due to correctness, safety, or clarity?
This analysis gives you a prioritized list of issues to fix:
- Update prompts or instructions where correctness or completeness is low.
- Strengthen safety rails or filters where unsafe responses appear.
- Improve retrieval or context injection for cases with missing or outdated information.
After making changes, you can re‑run the same evaluation dataset to confirm improvements and avoid regressions.
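The slice-and-compare analysis above amounts to grouping scored results by a few metadata fields. A minimal sketch, assuming each scored result records its use case, model version, and failing dimension (if any):

```python
from collections import Counter

# Hypothetical scored results; field names are illustrative.
results = [
    {"use_case": "support", "model": "v1", "failure": "correctness"},
    {"use_case": "support", "model": "v2", "failure": None},
    {"use_case": "codegen", "model": "v1", "failure": "safety"},
    {"use_case": "codegen", "model": "v1", "failure": "correctness"},
]

# Which dimension fails most often, and which flow is most error-prone?
by_dimension = Counter(r["failure"] for r in results if r["failure"])
by_use_case = Counter(r["use_case"] for r in results if r["failure"])

print(by_dimension.most_common(1))  # [('correctness', 2)]
print(by_use_case.most_common(1))  # [('codegen', 2)]
```

The same two Counters, re-run after a prompt or model change, show whether the dominant failure mode actually moved.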
Step 6: Integrate with continuous evaluation workflows
For long‑term performance and safety, you’ll want to turn one‑off manual scoring into a continuous process.
Here’s how to do that with Langtrace:
1. Regularly refresh your dataset from new production traces
- Periodically pull in new interactions (e.g., weekly) that represent new patterns and edge cases.
- Retire outdated examples and keep your dataset focused and relevant.
2. Combine manual scoring with automated checks
- For certain criteria (e.g., shallow format checks), you can add automated or LLM‑based evaluators.
- Use manual scores as the “gold standard” to validate and calibrate automated grading.
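Calibrating an automated grader against your manual gold standard can be as simple as measuring label agreement. A sketch, using hypothetical pass/fail labels:

```python
def agreement_rate(manual, automated):
    """Fraction of examples where the automated grader matches the human label."""
    assert len(manual) == len(automated), "label lists must be aligned"
    matches = sum(m == a for m, a in zip(manual, automated))
    return matches / len(manual)

# Hypothetical labels for five eval examples.
manual_labels    = ["pass", "fail", "pass", "pass", "fail"]
automated_labels = ["pass", "fail", "fail", "pass", "fail"]

print(agreement_rate(manual_labels, automated_labels))  # 0.8
```

If agreement is low, tighten the automated evaluator's instructions (or its examples) before trusting it on unreviewed traffic; the manually scored set remains the reference point.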
3. Track performance over time
- Compare evaluation results across:
- Different model versions
- Prompt changes
- Retrieval strategies or system settings
- Use these trends to measure the impact of every change on real‑world performance.
How this ties back to performance and security
By creating evaluation datasets from production traces and manually scoring them, you’re effectively combining:
- Observability: Langtrace captures detailed traces of how your AI agents behave in real usage.
- Evaluations: You systematically score that behavior against clear criteria for quality and safety.
This combination is the best way to iteratively improve both performance and security of your AI agents with minimal overhead, leveraging Langtrace as the central platform.
Quick checklist
To summarize, here’s a compact checklist you can follow:
- Create a Langtrace project and generate an API key.
- Install the appropriate SDK and instantiate Langtrace with your API key.
- Start collecting production traces from your AI agents.
- Filter and select high‑value traces (core flows, failures, edge cases).
- Convert selected traces into a structured evaluation dataset.
- Define a clear manual scoring rubric (correctness, safety, etc.).
- Manually score outputs in Langtrace and add comments/labels.
- Analyze results to identify patterns and prioritize fixes.
- Re‑run the eval dataset after changes to validate improvements.
- Refresh and expand your dataset over time for continuous evaluation.
Following this workflow will help you turn real‑world usage into a powerful feedback loop, systematically improving your AI agents with Langtrace.