
How do I create an evaluation dataset in Langtrace from production traces and then manually score outputs?
Creating an evaluation dataset directly from production traces in Langtrace is a powerful way to measure real-world performance and iterate towards safer, higher‑quality AI agents. By turning actual user interactions into reusable test cases, you can run evaluations, compare model versions, and manually score outputs with minimal setup.
Below is a step‑by‑step guide to go from raw traces to a scored evaluation dataset.
Prerequisites: Set up Langtrace and connect your app
Before you can create evaluation datasets from production traces, you need to:
- Create a project and API key
  - Sign in to Langtrace.
  - Create a new project.
  - Generate an API key for that project.
- Install the appropriate SDK
  - Choose the SDK for your framework (for example: CrewAI, DSPy, LlamaIndex, LangChain, or your preferred stack).
  - Install it in your application.
  - Instantiate Langtrace with your project’s API key.
- Enable observability for your agents
  - Wrap or configure your LLM calls, tools, and agents so they send traces to Langtrace.
  - Deploy your updated application so production traffic starts generating traces in Langtrace.
Once this is in place, you’ll see live traces flowing into your Langtrace project—these are the raw material for your evaluation datasets.
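For a Python application, the instantiation step can be sketched as below. This is a minimal sketch, assuming the `langtrace-python-sdk` package and its `langtrace.init(api_key=...)` entry point; the `init_langtrace` helper and the choice to read the key from a `LANGTRACE_API_KEY` environment variable are illustrative assumptions, not something this guide prescribes. Langtrace's own examples place the initialisation before any LLM library imports, so check the SDK instructions for your framework.

```python
import os


def init_langtrace() -> bool:
    """Start Langtrace tracing if an API key is configured.

    Returns True when tracing was initialised, False when no key was found.
    """
    api_key = os.environ.get("LANGTRACE_API_KEY")  # assumed variable name
    if not api_key:
        return False

    # Imported lazily so the app still starts where the SDK is not installed.
    from langtrace_python_sdk import langtrace

    langtrace.init(api_key=api_key)
    return True
```

Calling `init_langtrace()` once at application start-up keeps the key out of source control and makes it easy to disable tracing in environments where no key is set.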
Step 1: Identify the production traces you want to evaluate
Start by finding the subset of production interactions that you want to turn into an evaluation dataset. Common strategies include:
- High‑value or high‑risk flows: for example, billing questions, policy decisions, or code generation where correctness and safety are critical.
- Frequently occurring use cases: typical customer questions that represent the bulk of your traffic.
- Problematic or edge‑case traces: interactions where users complained, where your internal QA flagged issues, or where your agent behaved unexpectedly.
How to filter and select traces
Within Langtrace’s observability view (names may differ slightly depending on UI updates), use filters such as:
- Time range (e.g., “last 7 days”)
- User segment or tenant
- Model name or agent name
- Status or error signals
- Custom metadata you’ve logged (e.g., use_case, language, plan_tier)
Open a few individual traces and confirm that:
- The input (user prompt) is captured.
- The output (LLM or agent response) is visible.
- Any relevant context (tools invoked, retrieved documents, metadata) is logged.
These traces will be the basis for your evaluation dataset.
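If you work with traces outside the UI (for example, from an export), the same filtering and completeness checks can be sketched in plain Python. The trace shape used here (`id`, `timestamp`, `input`, `output`, `metadata`) is a hypothetical one chosen for illustration; Langtrace's actual export format may differ.

```python
from datetime import datetime, timedelta, timezone


def is_complete(trace: dict) -> bool:
    """A trace is usable only if both the input and the output were captured."""
    return bool(trace.get("input")) and bool(trace.get("output"))


def select_traces(traces, since, **metadata):
    """Keep complete traces newer than `since` whose metadata matches every filter."""
    selected = []
    for trace in traces:
        if trace["timestamp"] < since or not is_complete(trace):
            continue
        meta = trace.get("metadata", {})
        if all(meta.get(key) == value for key, value in metadata.items()):
            selected.append(trace)
    return selected


# Hypothetical sample data standing in for exported production traces.
now = datetime(2026, 8, 31, tzinfo=timezone.utc)
traces = [
    {"id": "t1", "timestamp": now - timedelta(days=2),
     "input": "Why was I charged twice?", "output": "I can see two charges...",
     "metadata": {"use_case": "billing", "plan_tier": "pro"}},
    {"id": "t2", "timestamp": now - timedelta(days=30),
     "input": "Cancel my plan", "output": "Done.",
     "metadata": {"use_case": "billing", "plan_tier": "free"}},
    {"id": "t3", "timestamp": now - timedelta(days=1),
     "input": "Refund status?", "output": "",
     "metadata": {"use_case": "billing", "plan_tier": "pro"}},
]

recent_billing = select_traces(traces, since=now - timedelta(days=7), use_case="billing")
print([t["id"] for t in recent_billing])
```

Here `t2` falls outside the seven-day window and `t3` has no captured output, so only `t1` is selected.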
Step 2: Convert selected traces into an evaluation dataset
The goal is to turn production traces into structured test cases: each row (or item) in your dataset should represent one evaluation example derived from a trace.
While the specific UI steps may vary, the typical workflow in Langtrace is:
- Select traces
  - In the traces list, use checkboxes or bulk selection for the traces you want to include.
  - You can select a mix of normal, failure, and edge‑case traces to build a well‑rounded dataset.
- Create an evaluation dataset from selection
  - Use an action such as “Add to dataset” or “Create evaluation dataset from traces.”
  - Choose whether to:
    - Add to an existing dataset, or
    - Create a new dataset (e.g., billing_support_production_eval).
- Map trace fields to dataset fields
  For each evaluation item, Langtrace will typically capture:
  - Input (e.g., user prompt, conversation so far)
  - Model output (the response you want to evaluate)
  - Optional: Context fields (such as retrieved documents, tool outputs, or user metadata)
  Ensure:
  - The user’s question/intent is clearly captured.
  - The response is the exact text you want to grade.
  - If you plan to do advanced evaluations later, preserve context fields in the dataset.
- Save the dataset
  - Name it clearly: include source, timeframe, and purpose, such as:
    - prod_traces_aug_2026_support_eval
    - prod_critical_flows_safety_eval
  - Add a description so future collaborators know it was derived from production traces and what it’s meant to evaluate (accuracy, safety, tone, etc.).
You now have an evaluation dataset in Langtrace constructed entirely from real production interactions.
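The field mapping described in this step can be sketched as a small data structure. This is illustrative only: the `EvalItem` class, the `trace_to_item` helper, and the trace keys are hypothetical names, and the JSON Lines output is just one convenient way to keep a local copy, not Langtrace's storage format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class EvalItem:
    """One evaluation example derived from a single production trace."""
    trace_id: str
    input: str
    output: str
    context: dict = field(default_factory=dict)


def trace_to_item(trace: dict) -> EvalItem:
    """Map hypothetical trace fields onto dataset fields, preserving context."""
    return EvalItem(
        trace_id=trace["id"],
        input=trace["input"],
        output=trace["output"],
        context={
            "retrieved_documents": trace.get("retrieved_documents", []),
            "tool_outputs": trace.get("tool_outputs", []),
            "metadata": trace.get("metadata", {}),
        },
    )


def to_jsonl(items) -> str:
    """Serialise items as JSON Lines, one evaluation example per line."""
    return "\n".join(json.dumps(asdict(item), sort_keys=True) for item in items)


trace = {
    "id": "t1",
    "input": "Why was I charged twice?",
    "output": "I can see two charges on your account...",
    "metadata": {"use_case": "billing"},
}
item = trace_to_item(trace)
print(to_jsonl([item]))
```

Keeping `trace_id` on every item makes it possible to go back from a low score to the original production trace.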
Step 3: Decide on your scoring rubric and labels
Before you manually score outputs, define what “good” looks like so scores are consistent.
Typical scoring dimensions include:
- Correctness / Factual accuracy
- Helpfulness / Completeness
- Safety / Policy compliance
- Tone / Professionalism
- Format / Structure (if your responses must follow templates)
For each dimension, choose:
- A scale: e.g., 1–5, 1–10, or a small set of labels (Pass / Fail, Good / Needs Improvement).
- Clear guidelines for each score on that scale.
Example rubric for correctness (1–5):
- 1 – Completely wrong or misleading.
- 2 – Mostly incorrect, with major gaps.
- 3 – Partially correct but missing important details.
- 4 – Mostly correct with only minor issues.
- 5 – Fully correct and precise.
Document this rubric (in your internal docs or directly in Langtrace notes if supported) so everyone scoring uses the same criteria.
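One way to keep a rubric unambiguous is to write it down as data and reject scores that fall outside it. The sketch below encodes the correctness example above; the `Dimension` class is a hypothetical helper for your own tooling, not part of Langtrace.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    """A scoring dimension with one written guideline per allowed score."""
    name: str
    guidelines: dict  # maps each allowed score to its description

    def validate(self, score: int) -> int:
        """Return the score unchanged, or raise if the rubric does not define it."""
        if score not in self.guidelines:
            allowed = sorted(self.guidelines)
            raise ValueError(f"{self.name}: {score} is not one of {allowed}")
        return score


CORRECTNESS = Dimension(
    name="correctness",
    guidelines={
        1: "Completely wrong or misleading.",
        2: "Mostly incorrect, with major gaps.",
        3: "Partially correct but missing important details.",
        4: "Mostly correct with only minor issues.",
        5: "Fully correct and precise.",
    },
)

score = CORRECTNESS.validate(4)
print(score, "-", CORRECTNESS.guidelines[score])
```

Because every allowed score must have a guideline, a dimension cannot be added without also stating what each value means.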
Step 4: Manually score outputs in Langtrace
With your dataset and rubric ready, you can now manually evaluate each item.
4.1 Open the evaluation dataset
- Navigate to the Evaluations or Datasets section in your Langtrace project.
- Select the dataset you just created from production traces.
You should see each example with at least:
- The input (user’s message or context).
- The captured output (model/agent response).
- Any extra context you included from the trace.
4.2 Review each example
For each row/example in the dataset:
- Read the input and output carefully
  - Check whether the response actually addresses the user’s question.
  - Consider any context that was part of the original production trace.
- Evaluate against your rubric
  - Decide how the response scores on each chosen dimension.
  - Pay special attention to:
    - Factual errors.
    - Policy or safety issues.
    - Missing key information.
    - Over‑confident but incorrect statements.
- Assign manual scores in Langtrace
  Depending on the product’s current capabilities, you will typically:
  - Use a numeric rating field (e.g., a drop‑down or slider) to set a score.
  - Apply labels or tags (e.g., safe, unsafe, hallucination, off_topic).
  - Add free‑text notes or comments explaining why you gave a specific score.
  If the UI allows multiple metrics, you might see fields like:
  - correctness_score
  - helpfulness_score
  - safety_score
  - overall_score
  - notes
  Fill these in consistently according to your rubric.
- Repeat for a representative sample
  You don’t need to score every single trace from production. Instead:
  - Score enough examples to:
    - See patterns.
    - Benchmark your current performance.
    - Compare model versions later.
  - Prioritize:
    - High‑risk or critical flows.
    - Common, high‑volume use cases.
    - Known problem areas.
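Choosing which examples to score first can be sketched as a prioritised, reproducible sample. The `pick_review_sample` helper and the `use_case` field are hypothetical; the idea is simply to take priority flows first and fill the remaining review budget at random from everything else.

```python
import random


def pick_review_sample(items, size, priority_use_cases, seed=0):
    """Take priority items first (up to `size`), then fill up at random from the rest."""
    priority = [i for i in items if i["use_case"] in priority_use_cases]
    rest = [i for i in items if i["use_case"] not in priority_use_cases]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    remaining = max(size - len(priority), 0)
    return priority[:size] + rng.sample(rest, min(remaining, len(rest)))


# Hypothetical dataset: 3 high-risk billing items and 20 general ones.
items = (
    [{"id": f"b{n}", "use_case": "billing"} for n in range(3)]
    + [{"id": f"g{n}", "use_case": "general"} for n in range(20)]
)
sample = pick_review_sample(items, size=8, priority_use_cases={"billing"})
print([i["id"] for i in sample])
```

With a review budget of eight, all three billing items are included and the other five slots are drawn from the general traffic.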
Step 5: Use your scored dataset for evaluations and iteration
Once you’ve manually scored outputs, the dataset becomes a reusable benchmark for measuring and comparing performance.
5.1 Benchmark current performance
By aggregating scores across the dataset, you can answer:
- What is the average correctness of our responses?
- How often do we have safety violations or policy issues?
- Are certain use cases consistently underperforming?
Use Langtrace’s dashboards or exports to visualize:
- Score distributions.
- Outliers.
- Trends by use case, model, or time period.
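If you export the scored items, the three questions above reduce to a few lines of aggregation. This sketch uses only the Python standard library; the row layout (`use_case`, `correctness_score`, `labels`) is a hypothetical export shape chosen to match the field names mentioned earlier.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored rows, as they might look after manual review.
scored = [
    {"use_case": "billing", "correctness_score": 5, "labels": ["safe"]},
    {"use_case": "billing", "correctness_score": 3, "labels": ["safe"]},
    {"use_case": "refunds", "correctness_score": 2, "labels": ["unsafe", "hallucination"]},
    {"use_case": "refunds", "correctness_score": 4, "labels": ["safe"]},
]

# Average correctness across the whole dataset.
average_correctness = mean(row["correctness_score"] for row in scored)

# Share of items that a reviewer tagged as unsafe.
unsafe_rate = sum("unsafe" in row["labels"] for row in scored) / len(scored)

# Mean correctness per use case, to spot consistently weak areas.
by_use_case = defaultdict(list)
for row in scored:
    by_use_case[row["use_case"]].append(row["correctness_score"])
use_case_means = {name: mean(scores) for name, scores in by_use_case.items()}

print(average_correctness, unsafe_rate, use_case_means)
```

On this sample the overall mean hides the fact that the refunds flow scores lower than billing, which is why the per-use-case breakdown is worth computing.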
5.2 Compare models, prompts, or agent configurations
You can reuse the same evaluation dataset to test new:
- LLM providers or model versions.
- Prompt templates or system messages.
- Agent architectures (e.g., switching to CrewAI, DSPy, LlamaIndex, LangChain workflows).
Typical workflow:
- Run your new configuration against the same evaluation dataset.
- Collect outputs via Langtrace.
- Manually (or semi‑automatically) score the new outputs.
- Compare results:
  - Did correctness improve?
  - Did safety scores go up?
  - Did we reduce hallucinations?
Because the dataset comes from production traces, improvements here are more likely to translate into real‑world impact.
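The comparison step can be sketched as a per-item diff between two scored runs. The `compare_runs` helper and the mapping from item id to score are hypothetical; the point is to compare the same items across configurations rather than only comparing averages.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Compare per-item scores for the items that were scored in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = [candidate[item_id] - baseline[item_id] for item_id in shared]
    return {
        "items": len(shared),
        "improved": sum(d > 0 for d in deltas),
        "regressed": sum(d < 0 for d in deltas),
        "mean_delta": sum(deltas) / len(deltas) if deltas else 0.0,
    }


# Hypothetical correctness scores (1-5) keyed by dataset item id.
baseline = {"t1": 3, "t2": 4, "t3": 2, "t4": 5}
candidate = {"t1": 4, "t2": 4, "t3": 4, "t4": 4}

summary = compare_runs(baseline, candidate)
print(summary)
```

Reporting regressions separately matters: in this sample the mean improves even though one item got worse, and that item may belong to a critical flow.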
5.3 Use scores to drive targeted improvements
Your manually scored dataset shows where your system is failing, and the reviewer notes and labels help explain why.
Common follow‑ups include:
- Updating prompts to fix recurring misunderstandings.
- Adding guardrails or policies where safety scores are low.
- Improving retrieval or context construction for cases with missing information.
- Creating more specific sub‑agents for difficult categories.
You can then re‑run evaluations on the same dataset to see whether those changes improved the scores, closing the loop between observability and evaluation.
Best practices when building evaluation datasets from production traces
To get the most out of Langtrace for evaluation and safety:
- Sample from different traffic segments: include examples from free users, enterprise customers, different geos, and different time windows to avoid bias.
- Keep datasets versioned and labeled, for example:
  - prod_eval_v1: baseline dataset.
  - prod_eval_v2_critical_flows: focused on high‑risk traces.
  - prod_eval_v3_multilingual: focused on language diversity.
- Mix normal and problematic traces: don’t only include failures; a mix of “easy” and “hard” cases gives you a realistic performance picture.
- Involve multiple reviewers: for critical systems, have multiple people score a subset of the dataset and compare their scores. This helps refine the rubric and reduce subjectivity.
- Refresh datasets regularly: as your product and users evolve, periodically create new evaluation datasets from more recent production traces to keep your evaluations aligned with reality.
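For the multiple-reviewer practice, one common way to compare two reviewers' scores is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The guide itself does not name a specific statistic, so treat this as one possible choice. The sketch below is the unweighted form, which treats scores as unordered labels; for a 1–5 scale a weighted variant may be more appropriate.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b) -> float:
    """Unweighted Cohen's kappa for two reviewers who scored the same items in order."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both reviewers must score the same non-empty set of items")
    n = len(scores_a)
    # Fraction of items on which the reviewers gave the same label.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Agreement expected by chance, from each reviewer's label frequencies.
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical Pass / Fail labels from two reviewers on the same six items.
reviewer_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
reviewer_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohens_kappa(reviewer_a, reviewer_b)
print(round(kappa, 3))
```

A value near 1 indicates strong agreement; a low value is usually a sign that the rubric guidelines need tightening before more items are scored.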
How Langtrace supports this workflow
Tying it back to the core strengths of Langtrace:
- Observability: you see detailed traces of agent chains, LLM calls, tools, and vector DB interactions from production.
- Evaluations: you turn those traces into evaluation datasets, manually score outputs, and measure performance and safety over time.
Together, these capabilities make it much easier to iterate towards better performance and security for your AI agents using data drawn directly from real user interactions.
If you need more specific guidance (e.g., how this looks in CrewAI, DSPy, LlamaIndex, or LangChain), you can follow the framework‑specific SDK instructions after creating your project and API key, then repeat the same trace‑to‑dataset and manual scoring process described above.