Improving Structured Output Generation with Prompt Learning
In this cookbook we use Prompt Learning to improve the accuracy of GPT-4o-mini on structured output generation.
We will be using the Prompt Learning SDK. See our docs page for the SDK for setup and reference details.
What is Prompt Learning?
Prompt Learning is an algorithm developed by Arize to optimize prompts based on data.
See our detailed blog on Prompt Learning, or read the quick summary of the algorithm below.

The pipeline works as follows:
1. Build a dataset of inputs/queries.
2. Generate outputs with your unoptimized, base prompt.
3. Build LLM evals or human annotations that return natural-language feedback, for example:
   - explanations -> why this output was correct/incorrect (most powerful)
   - confusion reasons -> why the model may have been confused
   - improvement suggestions -> where the prompt should be improved based on this input/output pair
4. Use meta-prompting to optimize the original prompt (a minimal sketch of this step follows the list):
   - feed the prompt + inputs + outputs + evals + annotations to another LLM
   - ask it to generate an optimized prompt!
5. Run and evaluate the new, optimized prompt with another experiment.
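The Prompt Learning SDK (used later in this cookbook) handles step 4 for you via its PromptLearningOptimizer. Purely to illustrate the idea, here is a minimal, hypothetical sketch of the meta-prompting step using the OpenAI client; the helper name and meta-prompt wording are assumptions, not the SDK's internals.

import os

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def meta_optimize(base_prompt: str, examples: list[dict]) -> str:
    # Each example is assumed to hold an input, the model's output, and evaluator feedback.
    formatted = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}\nFeedback: {ex['feedback']}"
        for ex in examples
    )
    meta_prompt = (
        "You are a prompt engineer. Here is a system prompt and examples of how it performed.\n\n"
        f"Current prompt:\n{base_prompt}\n\nExamples:\n{formatted}\n\n"
        "Rewrite the prompt so the model avoids the mistakes described in the feedback. "
        "Return only the new prompt."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": meta_prompt}],
    )
    return response.choices[0].message.content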
Prompt Learning for Structured Output Generation
In this cookbook we use Prompt Learning to improve the accuracy of GPT-4o-mini on JSON webpage generation.
To view and run the notebook, first clone the Prompt Learning repository.
git clone https://github.com/Arize-ai/prompt-learning.git
Then navigate to notebooks/phoenix_support_query_classification.ipynb.
You can also view the notebook on GitHub, but keep in mind that you will have to clone the repository and run the notebook from within the notebooks folder for it to run!
Ruleset
We measure the accuracy of a prompt by whether the JSON outputs it produces follow a pre-defined set of JSON rules. We have 3 benchmarks: 10 rules, 50 rules, and 100 rules. The more rules the outputs have to follow, the stricter the evaluation, because we are placing more constraints on what counts as a correct output.
An output is only considered "correct" if it meets all the rules within the chosen ruleset.
See prompts/JSON_webpage_generation/evaluator-prompt-{num_rules}.txt for the different rulesets.
Dataset Design
The dataset used in this notebook consists of queries that ask a model to generate JSON webpages. Each row includes an input query describing the page that should be built. For example:
Create a webpage with a navigation bar containing 'Home', 'About', 'Services', and 'Contact' links. Below it, display an introduction section with a brief welcome message and a background image of a city skyline. The bottom of the page should have a footer with social media icons for Facebook, Twitter, and LinkedIn.
To make experimentation faster and reproducible, the notebook allows configuration of:
NUM_SAMPLES: how many rows to sample from the full dataset
TRAIN_SPLIT_FRACTION: proportion of rows used for training vs. testing
NUM_RULES: number of evaluator rules loaded (e.g., 10, 50)
NUM_OPTIMIZATION_LOOPS: how many iterations of optimization to run
This design ensures that experiments are controlled, scalable, and easily extendable to larger rule sets or datasets.
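A configuration cell in the notebook might look like the following (the values shown are illustrative defaults, not prescriptions):

NUM_SAMPLES = 100              # rows sampled from the full dataset
TRAIN_SPLIT_FRACTION = 0.8     # 80-20 train/test split
NUM_RULES = 10                 # ruleset to evaluate against (10, 50, or 100)
NUM_OPTIMIZATION_LOOPS = 5     # maximum optimization iterations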
Train/Test Split
To avoid overfitting, we split our dataset into train and test sets. The optimizer runs on the train set, refining the prompt based on the input/output/eval pairs it sees there. We then evaluate each generated prompt on the held-out test set to estimate how well the improvement generalizes.
import pandas as pd
# download dataset
dataset_1000 = pd.read_csv("https://storage.googleapis.com/arize-assets/dev-rel/prompt-learning/queries.csv")
dataset_sample = dataset_1000.sample(NUM_SAMPLES) # 100 rows
# 80-20 split
train_set = dataset_sample.sample(frac=TRAIN_SPLIT_FRACTION, random_state=42)
test_set = dataset_sample.drop(train_set.index)
train_set.to_csv("train.csv", index=False)
test_set.to_csv("test.csv", index=False)
Base Prompt
We begin with a minimal, unoptimized, baseline system prompt:
You are an expert in JSON webpage creation. This is your task: {input}
This initial prompt provides only the most general instruction. Through optimization, the SDK will refine this into more detailed prompts that guide the model toward consistently producing rule-compliant JSON.
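Outputs are generated by filling {input} for each row and sending the prompt to GPT-4o-mini. The notebook wraps this in a helper (used as generate_output in the optimization loop below); a minimal sketch, assuming Phoenix's llm_generate and a dataset with an input column, could look like this (the exact implementation in the notebook may differ):

from phoenix.evals import OpenAIModel, llm_generate

def generate_output(dataset, system_prompt):
    gen_model = OpenAIModel(model="gpt-4o-mini", model_kwargs={"temperature": 0})
    results = llm_generate(
        dataframe=dataset,
        template=system_prompt,  # "{input}" is filled from the matching dataframe column
        model=gen_model,
        output_parser=lambda response, index: {"output": response},
        concurrency=20,
    )
    return results["output"]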
Evaluators
Evaluators are the core of the feedback loop. They assess the quality of model outputs and provide structured feedback to the optimizer, which drives optimization.
Two evaluators are used here:
evaluate_output — checks if the generated JSON is correct and provides an explanation.
rule_checker — identifies which specific rules were violated.
Code
# OpenAIModel and llm_generate come from Phoenix's evals package.
from phoenix.evals import OpenAIModel, llm_generate

def evaluate_output(dataset):
    # Load the LLM-as-judge template for the chosen ruleset (10, 50, or 100 rules).
    with open(f"../prompts/evaluator-prompt-{NUM_RULES}.txt", "r") as file:
        evaluation_template = file.read()

    eval_model = OpenAIModel(
        model="gpt-4.1-2025-04-14",
        model_kwargs={
            "response_format": {"type": "json_object"},
            "temperature": 0,
        },
    )

    # evaluate_output_parser is defined elsewhere in the notebook; it parses the
    # judge's JSON response into "correctness" and "explanation" columns.
    evaluation_results = llm_generate(
        dataframe=dataset,
        template=evaluation_template,
        model=eval_model,
        output_parser=evaluate_output_parser,
        concurrency=40,
        verbose=True,
    )

    dataset = dataset.copy()
    for col in ["correctness", "explanation"]:
        if col in evaluation_results.columns:
            dataset[col] = evaluation_results[col]
    return dataset, ["correctness", "explanation"]


def rule_checker(dataset):
    # Load the rule-checker template for the chosen ruleset.
    with open(f"../prompts/rule-checker-prompt-{NUM_RULES}.txt", "r") as file:
        rule_check_template = file.read()

    eval_model = OpenAIModel(
        model="gpt-4.1-2025-04-14",
        model_kwargs={
            "response_format": {"type": "json_object"},
            "temperature": 0,
        },
    )

    # rule_checker_parser is defined elsewhere in the notebook; it extracts the
    # violated rules into a "rule_violations" column.
    rule_check_results = llm_generate(
        dataframe=dataset,
        template=rule_check_template,
        model=eval_model,
        output_parser=rule_checker_parser,
        concurrency=40,
        verbose=True,
    )

    dataset = dataset.copy()
    if "rule_violations" in rule_check_results.columns:
        dataset["rule_violations"] = rule_check_results["rule_violations"]
    return dataset, ["rule_violations"]
Explanation
Both evaluators use LLM-as-judge: a GPT-4.1 model evaluates outputs instead of humans.
evaluate_output assigns a binary correct/incorrect label and a natural-language explanation.
rule_checker checks compliance against the full rule set and outputs detailed violations.
This dual feedback gives both coarse signals (correctness) and fine-grained guidance (specific rules broken).
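The output parsers referenced above (evaluate_output_parser and rule_checker_parser) are defined in the notebook. A minimal sketch of what they might look like, assuming the judge model returns a JSON object with the corresponding keys:

import json

def evaluate_output_parser(response: str, row_index: int) -> dict:
    # Parse the judge's JSON response into the two feedback columns.
    try:
        parsed = json.loads(response)
        return {
            "correctness": parsed.get("correctness", ""),
            "explanation": parsed.get("explanation", ""),
        }
    except json.JSONDecodeError:
        return {"correctness": "", "explanation": response}

def rule_checker_parser(response: str, row_index: int) -> dict:
    # Extract the list of violated rules from the judge's JSON response.
    try:
        return {"rule_violations": json.loads(response).get("rule_violations", "")}
    except json.JSONDecodeError:
        return {"rule_violations": response}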
Optimization Loop
The optimization loop ties everything together. It repeatedly generates outputs, evaluates them, and updates the system prompt using evaluator feedback.
NOTE: The code in the notebook implements a few more things, like storing prompts and accuracies. Only the most minimal implementation is shown below for simplicity.
Code
import os

# generate_output (sketched earlier), evaluate_output, compute_metric, and
# PromptLearningOptimizer are defined or imported earlier in the notebook;
# PromptLearningOptimizer comes from the Prompt Learning SDK.

def optimize_loop(
    train_set,
    test_set,
    system_prompt,
    evaluators,
    threshold=0.7,
    loops=5,
    scorer="accuracy",
):
    print(f"🚀 Starting prompt optimization with {loops} iterations (scorer: {scorer}, threshold: {threshold})")

    # Initial evaluation: score the base prompt on the test set.
    test_set["output"] = generate_output(test_set, system_prompt)
    test_evals_all = evaluate_output(test_set)[0]
    initial_metric_value = compute_metric(
        ["correct"] * len(test_evals_all),
        test_evals_all["correctness"],
        scorer=scorer,
    )
    print(f"✅ Initial test {scorer}: {initial_metric_value}\n")

    # Iterative optimization
    while loops > 0:
        print("📊 Loop: Optimizing prompt...")

        # 1. Train set evaluation
        optimizer = PromptLearningOptimizer(
            prompt=system_prompt,
            model_choice="gpt-4o",
            openai_api_key=os.getenv("OPENAI_API_KEY"),
        )
        train_set, _ = optimizer.run_evaluators(
            train_set,
            evaluators,
            feedback_columns=["correctness", "explanation", "rule_violations"],
        )

        # 2. Prompt optimization
        system_prompt = optimizer.optimize(
            train_set,
            "output",
            feedback_columns=["correctness", "explanation", "rule_violations"],
            context_size_k=128000,
        )

        # 3. Test set evaluation
        test_set["output"] = generate_output(test_set, system_prompt)
        test_evals_all = evaluate_output(test_set)[0]
        metric_value = compute_metric(
            ["correct"] * len(test_evals_all),
            test_evals_all["correctness"],
            scorer=scorer,
        )
        print(f"✅ Test {scorer}: {metric_value}\n")

        if metric_value >= threshold:
            print("🎉 Threshold reached! Stopping optimization.")
            break
        loops -= 1
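compute_metric is defined elsewhere in the notebook. For reference, a minimal sketch of what it might look like for the accuracy scorer (an assumption, not the notebook's exact implementation):

def compute_metric(expected, actual, scorer="accuracy"):
    # Fraction of rows whose evaluator label matches the expected label ("correct").
    if scorer == "accuracy":
        pairs = list(zip(expected, actual))
        return sum(e == a for e, a in pairs) / len(pairs) if pairs else 0.0
    raise ValueError(f"Unsupported scorer: {scorer}")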
Explanation
Initial Evaluation: Test the base prompt to establish a starting score.
Train Evaluation: Generate outputs on the train set and run evaluators to collect correctness, explanations, and rule violations.
Optimize Prompt: Feed this feedback into the PromptLearningOptimizer to generate a refined system prompt.
Test Evaluation: Validate the new prompt against the test set.
Repeat until performance meets the threshold or the maximum number of loops is exhausted.
This loop automates prompt refinement, ensuring that each iteration incorporates concrete evaluator feedback into the next generation of prompts.
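Putting it together, a call to the loop might look like the following (using the evaluators and splits defined above; the threshold and loop count here are just example choices):

optimize_loop(
    train_set=train_set,
    test_set=test_set,
    system_prompt="You are an expert in JSON webpage creation. This is your task: {input}",
    evaluators=[evaluate_output, rule_checker],
    threshold=0.9,
    loops=NUM_OPTIMIZATION_LOOPS,
    scorer="accuracy",
)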
Results
We started with a base prompt whose generated webpages followed none of the rules.
After just 1 optimization loop, test accuracy jumped to values like 84% or 66%, depending on how many rules were being checked.
After 5 loops, accuracies climbed even higher.
This shows that Prompt Learning lets an LLM learn new rules for a task, even rulesets of 50 or 100 rules, starting from a prompt under which it followed none of them.
