You can run experiments programmatically using the Traceloop SDK. This allows you to systematically evaluate different AI model configurations, prompts, and approaches with your datasets.

SDK Initialization

First, initialize the Traceloop SDK.
from traceloop.sdk import Traceloop

# Initialize with dataset sync enabled
client = Traceloop.init()
Make sure you’ve created an API key and set it as the TRACELOOP_API_KEY environment variable before you start. Check out the SDK’s getting started guide for more information.
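If you want your script to fail fast when the key is missing, you can add a small pre-flight check before initializing. This guard is a local convention, not part of the SDK:
import os
from traceloop.sdk import Traceloop

# Fail early with a clear message if the API key was not exported
if not os.getenv("TRACELOOP_API_KEY"):
    raise RuntimeError("Set TRACELOOP_API_KEY before calling Traceloop.init()")

client = Traceloop.init()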

Basic Experiment Structure

An experiment consists of:
  • A dataset to test against
  • A task function that defines what your AI system should do
  • Evaluators to measure performance
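These three pieces map directly onto the client.experiment.run() call described below; a minimal skeleton, where the slugs and my_task are placeholders explained in the following sections:
# Skeleton only: my_task and the slugs are placeholders defined in the sections below
results, errors = await client.experiment.run(
    dataset_slug="my-dataset",          # the dataset to test against
    dataset_version="v1",               # a published version of that dataset
    task=my_task,                       # the task function defining what your AI system does
    evaluators=["my-evaluator"],        # evaluators that measure performance
    experiment_slug="my-first-experiment",
)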

Task Functions

Create a task function that defines how your AI system processes each dataset row. The task is one of the experiment’s parameters; the experiment runs it on every dataset row. The task function signature expects:
  • Input: An optional dictionary containing the dataset row data
  • Output: A dictionary with your task results
task: Callable[[Optional[Dict[str, Any]]], Dict[str, Any]]
A task can perform any desired internal operation—such as LLM calls, semantic search, RAG, or custom business logic. The task output is then used as the evaluator input and is automatically logged in the Traceloop platform.
Ensure that the evaluator input schema variables are included in the task output dictionary.
You can add extra attributes to the task output even if they are not evaluator input parameters—these will also be logged to the platform.
from openai import AsyncOpenAI
import os

def provide_medical_info_prompt(question: str) -> str:
    """
    Provides comprehensive medical information without restrictions
    """
    return f"""You are a knowledgeable health educator.
    Please provide a comprehensive, detailed answer to the following health question.

    Question: {question}

    Please provide:
    1. A clear, factual explanation using accessible language
    2. Key benefits and important considerations
    3. Specific recommendations and actionable guidance
    4. Relevant details about treatments, symptoms, or health practices
    5. Any relevant medical or scientific context

    Be thorough and informative in your response."""

async def medical_task(row):
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # Build the prompt from the dataset row and call the model
    prompt_text = provide_medical_info_prompt(row["question"])
    response = await openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0.7,
        max_tokens=500,
    )

    ai_response = response.choices[0].message.content

    # The output keys should match the evaluator input schemas; extra keys are also logged
    return {"completion": ai_response, "text": ai_response}
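Before wiring the task into an experiment, you can sanity-check it against a single hand-written row. A minimal sketch, assuming the dataset rows only need a question field:
import asyncio

# Run the task once on a sample row to verify the shape of the output dictionary
sample_row = {"question": "What are common symptoms of dehydration?"}
print(asyncio.run(medical_task(sample_row)))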

Running Experiments

Use the experiment.run() method to execute your experiment by selecting a dataset as the source data, choosing the evaluators to run, and assigning a slug to make it easy to rerun later.

experiment.run() Parameters

  • dataset_slug (str): Identifier for your dataset
  • dataset_version (str): Version of the dataset to use; experiments can only run on a published version
  • task (function): Async function that processes each dataset row
  • evaluators (list): List of evaluator slugs to measure performance
  • experiment_slug (str): Unique identifier for this experiment
  • stop_on_error (boolean): Whether to stop on first error (default: False)
  • wait_for_results (boolean): Whether to wait for async tasks to complete; if not, the results can be found in the UI (default: True)
results, errors = await client.experiment.run(
    dataset_slug="medical-q",
    dataset_version="v1",
    task=medical_task,
    evaluators=["medical_advice", "response-counter"],
    experiment_slug="medical-advice-exp",
    stop_on_error=False,
)
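Since experiment.run() is a coroutine, it has to be awaited inside an event loop when called from a plain script. A minimal wrapper sketch, assuming errors comes back as a collection of per-row failures:
import asyncio

async def main():
    results, errors = await client.experiment.run(
        dataset_slug="medical-q",
        dataset_version="v1",
        task=medical_task,
        evaluators=["medical_advice", "response-counter"],
        experiment_slug="medical-advice-exp",
        stop_on_error=False,
    )
    # With stop_on_error=False, individual rows can still fail; surface them here
    if errors:
        print(f"{len(errors)} row(s) failed")
    return results

asyncio.run(main())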

Comparing Different Approaches

You can run multiple experiments to compare different approaches—whether by using different datasets, trying alternative task functionality, or testing variations in prompts, models, or business logic.
# Task function that provides comprehensive medical information
async def medical_task_provide_info(row):
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    prompt_text = provide_medical_info_prompt(row["question"])
    response = await openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0.7,
        max_tokens=500,
    )
    
    ai_response = response.choices[0].message.content
    return {"completion": ai_response, "text": ai_response}

# Task function that refuses to provide medical advice
async def medical_task_refuse_advice(row):
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    prompt_text = f"You must refuse to provide medical advice. Question: {row['question']}"
    response = await openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0.7,
        max_tokens=500,
    )
    
    ai_response = response.choices[0].message.content
    return {"completion": ai_response, "text": ai_response}

# Run both approaches in the same experiment
async def compare_medical_approaches():
    # Provide info approach
    provide_results, provide_errors = await client.experiment.run(
        dataset_slug="medical-q",
        dataset_version="v1",
        task=medical_task_provide_info,
        evaluators=["medical_advice", "response-counter"],
        experiment_slug="medical-info",
    )
    
    # Refuse advice approach
    refuse_results, refuse_errors = await client.experiment.run(
        dataset_slug="medical-q",
        dataset_version="v1",
        task=medical_task_refuse_advice,
        evaluators=["medical_advice", "response-counter"],
        experiment_slug="medical-info",
    )
    
    return provide_results, refuse_results
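Because the two task functions above differ only in how the prompt is built, you could also share one body and derive both variants with functools.partial. This is a hypothetical refactor, not an SDK feature; whether the SDK accepts a partial as the task is an assumption, and a plain async wrapper function works the same way:
from functools import partial
import os

from openai import AsyncOpenAI

async def medical_task_with_prompt(prompt_builder, row):
    # prompt_builder is a local callable, not an SDK concept
    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = await openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_builder(row["question"])}],
        temperature=0.7,
        max_tokens=500,
    )
    ai_response = response.choices[0].message.content
    return {"completion": ai_response, "text": ai_response}

# Two task variants built from the same body
# (provide_medical_info_prompt is the helper defined earlier in this guide)
provide_task = partial(medical_task_with_prompt, provide_medical_info_prompt)
refuse_task = partial(
    medical_task_with_prompt,
    lambda q: f"You must refuse to provide medical advice. Question: {q}",
)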

Full Examples

For complete, working examples that you can run and modify: