The evaluation library is a core feature of Traceloop, providing comprehensive tools to assess LLM outputs, data quality, and performance across various dimensions. Whether you need automated scoring or human judgment, the evaluation system has you covered.

Why Do We Need Evaluators?

LLM agents are more complex than single-turn completions. They operate across multiple steps, use tools, and depend on context and external systems like memory or APIs. This complexity introduces new failure modes: agents may hallucinate tools, get stuck in loops, or produce final answers that hide earlier mistakes.

Evaluators make these issues visible by checking correctness, relevance, task completion, tool usage, memory retention, safety, and style. They ensure outputs remain consistent even when dependencies shift and provide a structured way to measure reliability.

Evaluation is continuous, extending into production through automated tests, drift detection, quality gates, and online monitoring. In short, evaluators turn outputs into trustworthy systems by providing measurable, repeatable checks that give teams the confidence to deploy at scale.

Evaluator Types

The system supports:
  • Custom evaluators - Create your own evaluation logic tailored to specific needs
  • Built-in evaluators - Pre-configured evaluators provided by Traceloop for common assessment tasks
In the Evaluator Library, you can either define a custom evaluator by clicking New Evaluator or choose one of the prebuilt Made by Traceloop evaluators.
Clicking an existing evaluator shows its input and output schema. You will need this information to execute the evaluator through the SDK, as in the sketch below.
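
The following is a minimal sketch of what an evaluator call can look like once you know its schemas. The run_evaluator helper and its payload keys are illustrative placeholders rather than the official SDK API; consult the Traceloop SDK reference for the actual method names and signatures.

```python
# Minimal sketch, not the official API: `run_evaluator` is a hypothetical
# placeholder for the SDK call that executes an evaluator. The payload keys
# must match the input schema shown in the Evaluator Library, and the return
# value follows the evaluator's output schema.

def run_evaluator(slug: str, payload: dict) -> dict:
    """Placeholder standing in for the real SDK invocation."""
    # In a real setup this would send `payload` to the evaluator identified
    # by `slug` and return its scored output.
    return {"score": 1.0, "reason": "stub result"}

result = run_evaluator(
    "answer-relevancy",  # example evaluator slug
    {
        "question": "What does the evaluation library do?",
        "answer": "It scores LLM outputs with built-in or custom evaluators.",
    },
)
print(result["score"], result["reason"])
```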

Where to Use Evaluators

Evaluators can be used in three main contexts within Traceloop:
  • Playgrounds - Test and iterate on your evaluators interactively, compare different configurations, and validate evaluation logic before deployment
  • Experiments - Run systematic evaluations across datasets programmatically using the SDK, track performance metrics over time, and easily compare experiment results (a sketch of this flow follows this list)
  • Monitors - Continuously evaluate your LLM applications in production with real-time monitoring and alerting on quality degradation
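
As a rough illustration of the experiments flow, the sketch below runs each row of a small in-memory dataset through an application function, scores the answer with an evaluator, and aggregates the results. Both generate_answer and run_evaluator are hypothetical placeholders for your application code and the SDK evaluator call; the real experiment API and dataset handling are documented in the Traceloop SDK reference.

```python
# Hypothetical experiment loop: evaluate every dataset row and aggregate the
# scores. Both helpers below are placeholders, not the official SDK API.

dataset = [
    {"question": "What is an evaluator?", "expected": "A scoring check for LLM outputs."},
    {"question": "Where can evaluators run?", "expected": "Playgrounds, experiments, and monitors."},
]

def generate_answer(question: str) -> str:
    """Placeholder for the LLM application under test."""
    return f"Answer to: {question}"

def run_evaluator(payload: dict) -> dict:
    """Placeholder for the SDK call that executes a configured evaluator."""
    return {"score": 0.9}

scores = []
for row in dataset:
    answer = generate_answer(row["question"])
    result = run_evaluator(
        {"question": row["question"], "answer": answer, "expected": row["expected"]}
    )
    scores.append(result["score"])

# Track this aggregate across experiment runs to compare configurations over time.
print(f"mean score: {sum(scores) / len(scores):.2f}")
```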