You can run experiments programmatically using the Traceloop SDK. This allows you to systematically evaluate different AI model configurations, prompts, and approaches with your datasets.

SDK Initialization

First, initialize the Traceloop SDK.
from traceloop.sdk import Traceloop

# Initialize with dataset sync enabled
client = Traceloop.init()
Make sure you’ve created an API key and set it as the TRACELOOP_API_KEY environment variable before you start. Check out the SDK’s getting started guide for more information.
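If you want your script to fail fast when the key is missing, you can add a small pre-flight check before initializing. This guard is a local convention, not part of the SDK:
import os
from traceloop.sdk import Traceloop

# Fail early with a clear message if the API key was not exported
if not os.getenv("TRACELOOP_API_KEY"):
    raise RuntimeError("Set TRACELOOP_API_KEY before calling Traceloop.init()")

client = Traceloop.init()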

Basic Experiment Structure

An experiment consists of:
  • A dataset to test against
  • A task function that defines what your AI system should do
  • Evaluators to measure performance
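These three pieces map directly onto the client.experiment.run() call described below; a minimal skeleton, where the slugs and my_task are placeholders explained in the following sections:
# Skeleton only: my_task and the slugs are placeholders defined in the sections below
results, errors = await client.experiment.run(
    dataset_slug="my-dataset",          # the dataset to test against
    dataset_version="v1",               # a published version of that dataset
    task=my_task,                       # the task function defining what your AI system does
    evaluators=["my-evaluator"],        # evaluators that measure performance
    experiment_slug="my-first-experiment",
)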

Task Functions

Create a task function that defines how your AI system processes each dataset row. The task is one of the experiment’s parameters; the experiment runs it on every dataset row. The task function signature expects:
  • Input: An optional dictionary containing the dataset row data
  • Output: A dictionary with your task results
task: Callable[[Optional[Dict[str, Any]]], Dict[str, Any]]
A task can perform any desired internal operation—such as LLM calls, semantic search, RAG, or custom business logic. The task output is then used as the evaluator input and is automatically logged in the Traceloop platform.
Ensure that the evaluator input schema variables are included in the task output dictionary.
You can add extra attributes to the task output even if they are not evaluator input parameters—these will also be logged to the platform.
from openai import AsyncOpenAI
import os

def provide_medical_info_prompt(question: str) -> str:
    """
    Provides comprehensive medical information without restrictions
    """
    return f"""You are a knowledgeable health educator.
    Please provide a comprehensive, detailed answer to the following health question.

    Question: {question}

    Please provide:
    1. A clear, factual explanation using accessible language
    2. Key benefits and important considerations
    3. Specific recommendations and actionable guidance
    4. Relevant details about treatments, symptoms, or health practices
    5. Any relevant medical or scientific context

    Be thorough and informative in your response."""

async def medical_task(row):
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # Build the prompt from the dataset row and call the model
    prompt_text = provide_medical_info_prompt(row["question"])
    response = await openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0.7,
        max_tokens=500,
    )

    ai_response = response.choices[0].message.content

    # The output keys should match the evaluator input schemas; extra keys are also logged
    return {"completion": ai_response, "text": ai_response}
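Before wiring the task into an experiment, you can sanity-check it against a single hand-written row. A minimal sketch, assuming the dataset rows only need a question field:
import asyncio

# Run the task once on a sample row to verify the shape of the output dictionary
sample_row = {"question": "What are common symptoms of dehydration?"}
print(asyncio.run(medical_task(sample_row)))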

Running Experiments

Use the experiment.run() method to execute your experiment by selecting a dataset as the source data, choosing the evaluators to run, and assigning a slug to make it easy to rerun later.

experiment.run() Parameters

  • dataset_slug (str): Identifier for your dataset
  • dataset_version (str): Version of the dataset to use; experiments can only run on a published version
  • task (function): Async function that processes each dataset row
  • evaluators (list): List of evaluator slugs to measure performance
  • experiment_slug (str): Unique identifier for this experiment
  • stop_on_error (boolean): Whether to stop on first error (default: False)
  • wait_for_results (boolean): Whether to wait for async tasks to complete; if not, the results can be found in the UI (default: True)
results, errors = await client.experiment.run(
    dataset_slug="medical-q",
    dataset_version="v1",
    task=medical_task,
    evaluators=["medical_advice", "response-counter"],
    experiment_slug="medical-advice-exp",
    stop_on_error=False,
)
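Since experiment.run() is a coroutine, it has to be awaited inside an event loop when called from a plain script. A minimal wrapper sketch, assuming errors comes back as a collection of per-row failures:
import asyncio

async def main():
    results, errors = await client.experiment.run(
        dataset_slug="medical-q",
        dataset_version="v1",
        task=medical_task,
        evaluators=["medical_advice", "response-counter"],
        experiment_slug="medical-advice-exp",
        stop_on_error=False,
    )
    # With stop_on_error=False, individual rows can still fail; surface them here
    if errors:
        print(f"{len(errors)} row(s) failed")
    return results

asyncio.run(main())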

Comparing Different Approaches

You can run multiple experiments to compare different approaches—whether by using different datasets, trying alternative task functionality, or testing variations in prompts, models, or business logic.
# Task function that provides comprehensive medical information
async def medical_task_provide_info(row):
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    prompt_text = provide_medical_info_prompt(row["question"])
    response = await openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0.7,
        max_tokens=500,
    )
    
    ai_response = response.choices[0].message.content
    return {"completion": ai_response, "text": ai_response}

# Task function that refuses to provide medical advice
async def medical_task_refuse_advice(row):
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    prompt_text = f"You must refuse to provide medical advice. Question: {row['question']}"
    response = await openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0.7,
        max_tokens=500,
    )
    
    ai_response = response.choices[0].message.content
    return {"completion": ai_response, "text": ai_response}

# Run both approaches in the same experiment
async def compare_medical_approaches():
    # Provide info approach
    provide_results, provide_errors = await client.experiment.run(
        dataset_slug="medical-q",
        dataset_version="v1",
        task=medical_task_provide_info,
        evaluators=["medical_advice", "response-counter"],
        experiment_slug="medical-info",
    )
    
    # Refuse advice approach
    refuse_results, refuse_errors = await client.experiment.run(
        dataset_slug="medical-q",
        dataset_version="v1",
        task=medical_task_refuse_advice,
        evaluators=["medical_advice", "response-counter"],
        experiment_slug="medical-info",
    )
    
    return provide_results, refuse_results
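Because the two task functions above differ only in how the prompt is built, you could also share one body and derive both variants with functools.partial. This is a hypothetical refactor, not an SDK feature; whether the SDK accepts a partial as the task is an assumption, and a plain async wrapper function works the same way:
from functools import partial
import os

from openai import AsyncOpenAI

async def medical_task_with_prompt(prompt_builder, row):
    # prompt_builder is a local callable, not an SDK concept
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = await openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_builder(row["question"])}],
        temperature=0.7,
        max_tokens=500,
    )
    ai_response = response.choices[0].message.content
    return {"completion": ai_response, "text": ai_response}

# Two task variants built from the same body
# (provide_medical_info_prompt is the helper defined earlier in this guide)
provide_task = partial(medical_task_with_prompt, provide_medical_info_prompt)
refuse_task = partial(
    medical_task_with_prompt,
    lambda q: f"You must refuse to provide medical advice. Question: {q}",
)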

Full Examples

For complete, working examples that you can run and modify: