
Evaluator Types
Style
Character Count
Track response length in characters to ensure outputs meet specific length requirements.
Character Count Ratio
Measure the ratio of output characters to input characters to assess response proportionality and expansion.
Word Count
Track the total number of words in outputs to ensure an appropriate level of response detail.
Word Count Ratio
Measure the ratio of output words to input words to compare input/output verbosity and expansion patterns (see the sketch below).
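All four style evaluators reduce to simple counting and division. A minimal sketch, assuming plain strings for the input and output; the function and field names here are illustrative, not the evaluator API:

```python
def style_metrics(input_text: str, output_text: str) -> dict:
    """Illustrative length metrics: character/word counts and output-to-input ratios."""
    in_chars, out_chars = len(input_text), len(output_text)
    in_words, out_words = len(input_text.split()), len(output_text.split())
    return {
        "character_count": out_chars,
        "character_count_ratio": out_chars / in_chars if in_chars else None,
        "word_count": out_words,
        "word_count_ratio": out_words / in_words if in_words else None,
    }

print(style_metrics("Summarize the quarterly report.",
                    "The report covers Q3 revenue growth, churn, and hiring."))
```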
Quality & Correctness
Answer Relevancy
Verify responses address the query to ensure AI outputs stay on topic and remain relevant.
Faithfulness
Detect hallucinations and verify facts to maintain accuracy and truthfulness in AI responses.
Answer Correctness
Evaluate factual accuracy by comparing answers against ground truth.
Answer Completeness
Measure how completely responses use the available context to ensure all relevant information is addressed.
Topic Adherence
Validate topic adherence to ensure responses stay focused on the specified subject matter.
Semantic Similarity
Validate semantic similarity between expected and actual responses to measure content alignment.
Instruction Adherence
Measure how well the LLM response follows given instructions to ensure compliance with specified requirements.
Measure Perplexity
Measure text perplexity from logprobs to assess the predictability and coherence of generated text (a sketch follows at the end of this group).
Uncertainty Detector
Generate responses and measure model uncertainty from logprobs to identify when the model is less confident in its outputs.
Conversation Quality
Evaluate conversation quality based on tone, clarity, flow, responsiveness, and transparency.
Context Relevance
Validate that the retrieved context is pertinent to the query.
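For the logprob- and embedding-based checks in this group, the underlying math is standard: perplexity is the exponential of the negative mean token log-probability, and semantic similarity is typically a cosine similarity between embedding vectors. A minimal sketch, assuming you already have token logprobs and embeddings from your model provider (the function names are illustrative):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean token log-probability; lower means more predictable text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, used for semantic similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Token logprobs as returned by most providers are natural-log probabilities.
print(perplexity([-0.2, -0.5, -1.3, -0.1]))
print(cosine_similarity([0.1, 0.8, 0.3], [0.2, 0.7, 0.4]))
```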
Security & Compliance
PII Detection
Identify personal information exposure to protect user privacy and ensure data security compliance.
Profanity Detection
Flag inappropriate language use to maintain content quality standards and professional communication.
Sexism Detection
Detect sexist and discriminatory content.
Prompt Injection
Detect prompt injection attacks in user inputs.
Toxicity Detector
Detect toxic content including personal attacks, mockery, hate, and threats.
Secrets Detection
Monitor for credential and key leaks to prevent accidental exposure of sensitive information (see the sketch below).
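Checks like PII and secrets detection are often prototyped with pattern matching before moving to model-based detectors. A minimal sketch, assuming email/phone patterns stand in for PII and an AWS-style key plus a generic API-key pattern stand in for secrets; the patterns are illustrative, not the evaluators' actual rules:

```python
import re

# Illustrative patterns only; production detectors combine many more rules
# and typically add model-based classification.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"""(?i)api[_-]?key\s*[:=]\s*['"]?[\w-]{16,}"""),
}

def scan(text: str, patterns: dict[str, re.Pattern]) -> dict[str, list[str]]:
    """Return every pattern name that matched, with the matched substrings."""
    hits = {name: p.findall(text) for name, p in patterns.items()}
    return {name: found for name, found in hits.items() if found}

print(scan("Contact me at jane@example.com, api_key = 'sk_live_abcdefgh12345678'",
           {**PII_PATTERNS, **SECRET_PATTERNS}))
```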
Formatting
SQL Validation
Validate SQL queries to ensure proper syntax and structure in database-related AI outputs.
JSON Validation
Validate JSON responses to ensure proper formatting and structure in API-related outputs (a sketch follows below).
Regex Validation
Validate regex patterns to ensure correct regular expression syntax and functionality.
Placeholder Regex
Validate placeholder regex patterns to ensure proper template and variable replacement structures.
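The formatting checks largely reduce to "does this parse?". A minimal sketch for the JSON and regex cases using only the standard library (SQL validation would typically delegate to a dedicated SQL parser and is omitted here); the function names are illustrative:

```python
import json
import re

def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def is_valid_regex(pattern: str) -> bool:
    """True if the pattern compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid_json('{"status": "ok"}'))      # True
print(is_valid_json("{'status': 'ok'}"))      # False: single quotes are not valid JSON
print(is_valid_regex(r"\d{4}-\d{2}-\d{2}"))   # True
print(is_valid_regex("("))                    # False: unbalanced group
```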
Agents
Agent Goal Accuracy
Validate agent goal accuracy to ensure AI systems achieve their intended objectives effectively.
Agent Tool Error Detector
Detect errors or failures during tool execution to monitor agent tool performance.
Agent Flow Quality
Validate agent trajectories against user-defined natural language tests to assess agent decision-making paths.
Agent Efficiency
Evaluate agent efficiency by flagging redundant calls and suboptimal paths to optimize agent performance (see the sketch below).
Agent Goal Completeness
Measure whether the agent successfully accomplished all user goals to verify comprehensive goal achievement.
Intent Change
Detect whether the user’s primary intent or workflow changed significantly during a conversation.
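Several of the agent evaluators operate over a recorded trace of tool calls. A minimal sketch of the error and redundancy checks, assuming a trace is a list of dicts with hypothetical "tool", "args", and "error" fields; the real trace schema will differ:

```python
def analyze_trace(trace: list[dict]) -> dict:
    """Flag tool-call errors and exact-duplicate calls in an agent trace."""
    errors = [step for step in trace if step.get("error")]
    seen, redundant = set(), []
    for step in trace:
        key = (step["tool"], tuple(sorted(step.get("args", {}).items())))
        if key in seen:
            redundant.append(step)
        seen.add(key)
    return {
        "tool_error_count": len(errors),
        "redundant_call_count": len(redundant),
        "efficient": not errors and not redundant,
    }

trace = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "search", "args": {"q": "refund policy"}},              # duplicate call
    {"tool": "lookup_order", "args": {"id": "42"}, "error": "timeout"},
]
print(analyze_trace(trace))
```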


