Generating Synthetic Datasets for LLM Evaluators & Agents

Learn different strategies for dataset generation and see how they can be used to run experiments and test evaluators

Synthetic datasets are a powerful way to test and refine your LLM applications, especially when real-world data is limited, sensitive, or hard to collect. By guiding the model to generate structured examples, you can quickly create datasets that cover common scenarios, complex multi-step cases, and edge cases like typos or out-of-scope queries.

In this tutorial, you will learn different strategies for dataset generation and see how they can be used to run experiments and test evaluators. You will:

  • Generate synthetic benchmark datasets to test evaluator accuracy and coverage

  • Use few-shot examples to guide LLM generation for more consistent outputs

  • Create agent-specific datasets that cover happy paths, edge cases, and adversarial scenarios

  • Upload datasets to Phoenix and run experiments to validate your evaluators

⚠️ This tutorial requires an OpenAI API key and a Phoenix Cloud account.


Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or video above.

Strategy 1: Creating Synthetic Benchmark Datasets

Goal: Create a synthetic dataset that allows you to test the accuracy and coverage of your evaluator.

Use Case: Feed the generated dataset into an LLM-as-a-Judge or other evaluator to ensure it correctly labels intent, identifies errors, and handles a variety of query types including edge cases and noisy inputs.

Synthetic data is especially useful when you want to stress-test evaluators such as an LLM-as-a-Judge across a wide range of scenarios. By generating examples systematically, you can cover straightforward cases, tricky edge cases, ambiguous queries, and noisy inputs, ensuring your evaluator captures different angles of behavior.

Generate Customer Support Queries

generate_queries_template = """
Generate 25 synthetic customer support classification examples.
Ensure good coverage across intents (refund, order_status, product_info),
and include both correct and incorrect classifications.
Each entry should follow this JSON schema:

{
  "input": "string (the user query)",
  "output": "refund | order_status | product_info (the predicted intent)",
  "classification": "correct | incorrect"
}
Respond ONLY with a valid JSON array, no code fences, no extra text.
"""

import json

import pandas as pd
from openai import OpenAI

openai_client = OpenAI()  # reads the OPENAI_API_KEY environment variable

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": generate_queries_template}],
)

# Parse the model's JSON array and load it into a DataFrame
support_data = json.loads(resp.choices[0].message.content)
df_support_data = pd.DataFrame(support_data)
df_support_data.head()
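
Models sometimes wrap their output in code fences despite the instructions, which makes the bare json.loads call fail. Below is a minimal, defensive parsing sketch; the helper name and checks are illustrative rather than part of the original notebook.

# Hypothetical helper: strip accidental ``` fences and confirm each record
# carries the keys the rest of the tutorial expects.
REQUIRED_KEYS = {"input", "output", "classification"}

def parse_synthetic_json(raw: str) -> list[dict]:
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`").removeprefix("json").strip()
    records = json.loads(cleaned)
    bad = [record for record in records if not REQUIRED_KEYS <= set(record)]
    if bad:
        raise ValueError(f"{len(bad)} records are missing required keys")
    return records

support_data = parse_synthetic_json(resp.choices[0].message.content)

Swapping this in for the plain json.loads call above turns a malformed generation into a clear error instead of an unexpected DataFrame shape.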

Upload Dataset to Phoenix
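
px.Client() needs to know where your Phoenix Cloud space lives. The snippet below is a sketch of the environment-variable setup, assuming the variable names from the Phoenix Cloud configuration docs; check your space's settings page for the exact endpoint and API key.

import os

# Assumed Phoenix Cloud settings; replace with the values from your own space.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
os.environ["PHOENIX_API_KEY"] = "your-phoenix-api-key"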

import phoenix as px

client = px.Client()

df = client.upload_dataset(
    dataframe=df_support_data,
    dataset_name="customer_support_queries",
    input_keys=["input"],
    output_keys=["output", "classification"],
)

Test LLM Judge Effectiveness

Now let's test how well an LLM-as-a-Judge performs on our synthetic dataset:

from phoenix.evals import OpenAIModel, llm_classify

llm_judge_template = """
You are an evaluator judging whether a model's classification of a customer support query is correct.
The possible classifications are: refund, order_status, product_info

Query: {input}
Model Prediction: {output}

Decide if the model's prediction is correct or incorrect.
Respond ONLY with one of: "correct" or "incorrect".
"""

# Task: ask the judge whether the predicted intent fits the query, and
# return its "correct" / "incorrect" label.
def task_function(input, reference):
    response_classification = llm_classify(
        data=pd.DataFrame([{"input": input["input"], "output": reference["output"]}]),
        template=llm_judge_template,
        model=OpenAIModel(model="gpt-4.1"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    label = response_classification.iloc[0]["label"]
    return label

# Evaluator: score 1 when the judge's label matches the synthetic ground truth.
def evaluate_response(output, reference):
    expected_label = reference["classification"]
    predicted_label = output
    return 1 if expected_label == predicted_label else 0

from phoenix.experiments import run_experiment

initial_experiment = run_experiment(
    df, 
    task=task_function, 
    evaluators=[evaluate_response], 
    experiment_name="evaluator performance"
)
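
Besides the experiment, you can spot-check the judge directly against the synthetic labels using the same template. This is a minimal sketch, assuming df_support_data from the generation step above:

# Judge every synthetic row at once and compare against the generated labels.
judged = llm_classify(
    data=df_support_data,  # contains the {input} and {output} template columns
    template=llm_judge_template,
    model=OpenAIModel(model="gpt-4.1"),
    rails=["correct", "incorrect"],
    provide_explanation=True,
)

agreement = (
    judged["label"].reset_index(drop=True)
    == df_support_data["classification"].reset_index(drop=True)
).mean()
print(f"Judge agreement with synthetic labels: {agreement:.0%}")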

Strategy 2: Using Few-Shot Examples for Dataset Generation

Goal: Guide the LLM to generate synthetic examples that reflect different types of queries and scenarios while maintaining consistent labeling and structure.

Few-shot prompting allows you to guide an LLM by showing a handful of examples, which helps produce more consistent and realistic outputs. This approach is particularly useful for testing evaluators because it ensures the synthetic dataset reflects patterns, labels, and structures the evaluator is expected to handle.

Generate Examples with Few-Shot Prompting

few_shot_prompt = """
Generate synthetic customer support classification examples.
Ensure good coverage across intents (refund, order_status, product_info),
and include both correct and incorrect classifications.
Here are some examples of synthetic customer queries and labels:

Example 1:
{
  "user_query": "Ughhh I bought sneakers that squeak louder than a rubber duck... how do I return these?",
  "intent": "refund",
  "response": "Oh no, squeaky shoes aren’t fun! Let’s get that return started. Could you share your order number?",
  "classification": "correct"
}

Example 2:
{
  "user_query": "My package has been saying 'out for delivery' since last Tuesday… did it decide to take a vacation? Is it actually going to show up?",
  "intent": "refund",
  "response": "Looks like your package is taking its sweet time. Let me check where it’s stuck — can you give me the tracking number?",
  "classification": "incorrect"
}

Example 3:
{
  "user_query": "Thinking about upgrading my blender… does your new model actually crush ice?",
  "intent": "product_info",
  "response": "Haha our blender keeps its promises! It can definitely crush ice. Would you like more details on the specs?",
  "classification": "correct"
}

Now generate 25 new examples in the same format, keeping the responses friendly.
Respond ONLY with a valid JSON array, no code fences, no extra text.
"""

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini", 
    messages=[{"role": "user", "content": few_shot_prompt}]
)

few_shot_data = json.loads(resp.choices[0].message.content)
few_shot_df = pd.DataFrame(few_shot_data)
few_shot_df.head()
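
If you would rather not hand-maintain the examples inside the prompt string, the seed examples can live as Python dicts and be rendered into the prompt with json.dumps. This keeps them in exactly the schema you parse and validate against later. A minimal sketch (the seed content and variable names here are illustrative):

# Keep seed examples as data and render them into the prompt text.
seed_examples = [
    {
        "user_query": "How do I return these squeaky sneakers?",
        "intent": "refund",
        "response": "Happy to help with that return! Could you share your order number?",
        "classification": "correct",
    },
    # ...add seeds covering order_status, product_info, and incorrect labels...
]

examples_block = "\n\n".join(
    f"Example {i + 1}:\n{json.dumps(example, indent=2)}"
    for i, example in enumerate(seed_examples)
)

few_shot_prompt_from_seeds = (
    "Generate synthetic customer support classification examples.\n"
    "Ensure good coverage across intents (refund, order_status, product_info),\n"
    "and include both correct and incorrect classifications.\n"
    "Here are some examples of synthetic customer queries and labels:\n\n"
    f"{examples_block}\n\n"
    "Now generate 25 new examples in the same format, keeping the responses friendly.\n"
    "Respond ONLY with a valid JSON array, no code fences, no extra text."
)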

Upload Few-Shot Dataset

df = client.upload_dataset(
    dataframe=few_shot_df,
    dataset_name="customer_support_queries_few_shot",
    input_keys=["user_query"],
    output_keys=["intent", "response", "classification"],
)

Test LLM Judge Effectiveness

llm_judge_template = """
You are an evaluator judging whether a model's classification of a customer support query is correct.
The possible classifications are: refund, order_status, product_info

Query: {query}
Model Prediction: {intent}

Decide if the model's prediction is correct or incorrect.
Respond ONLY with one of: "correct" or "incorrect".
"""

from phoenix.evals import llm_classify, OpenAIModel

def task_function(input, reference):
    response_classification = llm_classify(
        data=pd.DataFrame([{"query": input["user_query"], "intent": reference["intent"]}]),
        template=llm_judge_template,
        model=OpenAIModel(model="gpt-4.1"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    label = response_classification.iloc[0]["label"]
    return label

# Evaluator: score 1 when the judge's label matches the synthetic ground truth.
def evaluate_response(output, reference):
    expected_label = reference["classification"]
    predicted_label = output
    return 1 if expected_label == predicted_label else 0

from phoenix.experiments import run_experiment

initial_experiment = run_experiment(
    df,
    task=task_function,
    evaluators=[evaluate_response],
    experiment_name="evaluator performance",
)

Strategy 3: Creating Synthetic Datasets for Agents

Goal: Build synthetic test data that captures a wide range of queries to evaluate an agent's reliability and safety.

Use Case: Test how an agent handles in-scope requests, refuses out-of-scope queries, and manages edge cases, adversarial inputs, and noisy data.

When creating synthetic datasets for agents, first define the agent's capabilities and boundaries (tools, in-scope vs. out-of-scope). Then organize queries into categories to ensure balanced coverage:

  1. Happy-path: simple, common requests

  2. Complex: multi-step or reasoning-heavy

  3. Adversarial / refusal: out-of-scope or unsafe

  4. Edge cases: ambiguous or incomplete inputs

  5. Noise: typos, slang, multilingual

Generate Agent Test Dataset

AGENT_DATASET_PROMPT = """
You are helping me create a synthetic test dataset for evaluating an AI agent.
The agent has the following capabilities:
- search products, compare items, track orders, answer shipping questions

The dataset should cover a wide variety of use cases, not just the "happy path."
Generate realistic **user queries**, grouped into categories:

1. **Happy-path**: straightforward, common use cases where the agent should succeed.
2. **Complex / multi-step**: queries requiring reasoning, multiple steps, or tool calls.
3. **Edge cases**: ambiguous requests, incomplete info, or queries with constraints.
4. **Adversarial / refusal**: queries that are out-of-scope or unsafe (where the agent should refuse or fallback).
5. **Noise / robustness**: queries with typos, slang, or in multiple languages.

For each example, return JSON with this schema:
{
  "category": "happy_path | multi_step | edge_case | adversarial | noise",
  "query": "string (the user's input)",
  "expected_action": "string (the tool, behavior, or refusal the agent should take)",
  "expected_outcome": "string (what a correct response would look like at a high level)"
}

Generate **10 examples total**, ensuring at least a few from each category.
The queries should be diverse, realistic, and not repetitive.

Respond ONLY with valid JSON, no code fences, no extra text.
"""

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini", 
    messages=[{"role": "user", "content": AGENT_DATASET_PROMPT}]
)

agent_data = json.loads(resp.choices[0].message.content)
agent_data_df = pd.DataFrame(agent_data)
agent_data_df.head()
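
The prompt only guarantees "at least a few from each category," so it is worth confirming coverage before uploading. A quick check with pandas, assuming the schema above:

# Confirm every category from the prompt is represented and reasonably balanced.
print(agent_data_df["category"].value_counts())

expected_categories = {"happy_path", "multi_step", "edge_case", "adversarial", "noise"}
missing = expected_categories - set(agent_data_df["category"])
if missing:
    print(f"Regenerate or top up the dataset; missing categories: {missing}")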

Upload Agent Dataset

df = client.upload_dataset(
    dataframe=agent_data_df,
    dataset_name="customer_support_agent",
    input_keys=["category", "query"],
    output_keys=["expected_action", "expected_outcome"],
)
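
Strategy 3 stops at the dataset, but the same run_experiment pattern from the earlier strategies applies. The sketch below assumes a hypothetical run_agent(query) entry point that returns the action the agent took; replace the stub with your real agent call.

from phoenix.experiments import run_experiment

def run_agent(query: str) -> str:
    # Stub: replace with your actual agent (e.g., a tool-calling LLM loop).
    return "search_products"

def agent_task(input):
    # Run the agent on the synthetic query.
    return run_agent(input["query"])

def expected_action_evaluator(output, reference):
    # Exact match is a crude proxy; swap in an LLM judge for fuzzier comparisons.
    return 1 if output == reference["expected_action"] else 0

agent_experiment = run_experiment(
    df,
    task=agent_task,
    evaluators=[expected_action_evaluator],
    experiment_name="agent coverage",
)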

Best Practices for Synthetic Dataset Generation

  • Set Clear Goals – Define scenarios, edge cases, and failure modes to test.

  • Structure Prompts – Use JSON schemas, validation rules, and explicit output formats.

  • Ensure Coverage – Mix positive/negative cases, edge conditions, and diverse inputs.

  • Validate Data – Check schema compliance, logical consistency, and realism (see the sketch after this list).

  • Refine Iteratively – Test, find gaps, and improve prompts and datasets.
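
A minimal validation pass over the Strategy 1 dataset might look like the following (it assumes df_support_data from earlier; adapt the checks to your own schema):

allowed_intents = {"refund", "order_status", "product_info"}

# Schema compliance: only the labels the prompt allows should appear.
assert set(df_support_data["output"]) <= allowed_intents, "unexpected intent label"
assert set(df_support_data["classification"]) <= {"correct", "incorrect"}

# Duplicated queries inflate apparent coverage without adding signal.
print("duplicate queries:", df_support_data["input"].duplicated().sum())

# Both correct and incorrect cases should actually be represented.
print(df_support_data["classification"].value_counts())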
