Using Human Annotations for Eval-Driven Development
How to leverage human annotations to build evaluations and experiments that improve your system
In this tutorial, we will explore how to build a custom human annotation interface for Phoenix using Lovable. We will then leverage the annotations collected through that interface to construct experiments and evaluate your application.
The purpose of a custom annotations UI is to make it easy for anyone to provide structured human feedback on traces, capturing essential details directly in Phoenix. Annotations are vital for collecting feedback during human review, enabling iterative improvement of your LLM applications.
By establishing this feedback loop and an evaluation pipeline, you can effectively monitor and enhance your system’s performance.
Notebook Walkthrough
We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or video above.
Generate traces to annotate
We will generate some LLM traces and send them to Phoenix. We will then annotate these traces to add labels, scores, or explanations directly onto specific spans.
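If you are following along outside the notebook, you will also need tracing set up so the completions below actually reach Phoenix. The snippet below is a minimal sketch using OpenInference auto-instrumentation; the project name and environment variable values are placeholders to replace with your own Phoenix Cloud details, and it assumes a recent version of arize-phoenix-otel.
import os

from phoenix.otel import register

# Placeholder Phoenix Cloud credentials; replace with your own values.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
os.environ["PHOENIX_API_KEY"] = "your-phoenix-api-key"

# Register a tracer provider and auto-instrument the OpenAI client so that
# every chat completion below is captured as a span in the named project.
tracer_provider = register(
    project_name="my-annotations-app",  # matches the project queried later in this tutorial
    auto_instrument=True,
)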
questions = [
    "What is the capital of France?",
    "Who wrote 'Pride and Prejudice'?",
    "What is the boiling point of water in Celsius?",
    "What is the largest planet in our solar system?",
    "Who developed the theory of relativity?",
    "What is the chemical symbol for gold?",
    "In which year did the Apollo 11 mission land on the moon?",
    "What language has the most native speakers worldwide?",
    "Which continent has the most countries?",
    "What is the square root of 144?",
    "What is the largest country in the world by land area?",
    "Why is the sky blue?",
    "Who painted the Mona Lisa?",
    "What is the smallest prime number?",
    "What gas do plants absorb from the atmosphere?",
    "Who was the first President of the United States?",
    "What is the currency of Japan?",
    "How many continents are there on Earth?",
    "What is the tallest mountain in the world?",
    "Who is the author of '1984'?",
]
We deliberately use a system prompt that produces some bad or nonsensical responses so that we can demonstrate annotating and experimenting with different types of results.
from openai import OpenAI
openai_client = OpenAI()
# System prompt
system_prompt = """
You are a question-answering assistant. For each user question, randomly choose an option: NONSENSE or RHYME. If you choose RHYME, answer correctly in the form of a rhyme.
If you choose NONSENSE, do not answer the question at all; instead, respond with nonsense words and random numbers that do not rhyme, ignoring the user’s question completely.
When responding with NONSENSE, include at least five nonsense words and at least five random numbers between 0 and 9999 in your response.
Do not explain your choice.
"""
# Run through the dataset and collect spans
for question in questions:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
Launch Custom Annotation UI
Visit our implementation here: https://phoenix-trace-annotator.lovable.app/
How to annotate your traces in Lovable:
Enter your Phoenix Cloud endpoint, API key, and project name. Optionally, also include an identifier to tie annotations to a specific user.
Click Refresh Traces.
Select the traces you want to annotate and click Send to Phoenix.
See your annotations appear instantly in Phoenix.
This tool was built using the Phoenix REST API. For more details on how to build your own custom annotations tool to fit your needs, see here.
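If you want to script annotations instead of using the UI, the sketch below sends a single span annotation through the same REST endpoint the tool uses (POST /v1/span_annotations). The span_id is hypothetical, and the exact payload fields and auth header may vary by Phoenix version, so verify them against the Phoenix REST API reference.
import os

import requests

endpoint = os.environ["PHOENIX_COLLECTOR_ENDPOINT"]  # e.g. your Phoenix Cloud hostname
# Some deployments expect an "api_key" header instead of a bearer token; check your setup.
headers = {"Authorization": f"Bearer {os.environ['PHOENIX_API_KEY']}"}

# Attach a "correctness" annotation to one span (the span_id below is made up).
payload = {
    "data": [
        {
            "span_id": "abc123def4567890",
            "name": "correctness",
            "annotator_kind": "HUMAN",
            "result": {"label": "incorrect", "score": 0, "explanation": "Nonsense response"},
        }
    ]
}

response = requests.post(f"{endpoint}/v1/span_annotations", json=payload, headers=headers)
response.raise_for_status()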
Create a dataset from annotated spans
import pandas as pd
import phoenix as px
from phoenix.client import Client
from phoenix.client.types import spans
client = Client()
# replace "correctness" if you chose to annotate on different criteria
query = spans.SpanQuery().where("annotations['correctness']")
spans_df = client.spans.get_spans_dataframe(query=query, project_identifier="my-annotations-app")
dataset = px.Client().upload_dataset(
    dataframe=spans_df,
    dataset_name="annotated-rhymes",
    input_keys=["attributes.input.value"],
    output_keys=["attributes.llm.output_messages"],
)
Build an Eval based on annotations
Next, you will construct an LLM-as-a-Judge template to evaluate your experiments. This evaluator will mark nonsensical outputs as incorrect. As you experiment, you’ll see evaluation results improve. Once your annotated trace dataset shows consistent improvement, you can confidently apply these changes to your production system.
RHYME_PROMPT_TEMPLATE = """
Examine the assistant’s responses in the conversation and determine whether the assistant used rhyme in any of its responses.
Rhyme means that the assistant’s response contains clear end rhymes within or across lines. This should be applicable to the entire response.
There should be no irrelevant phrases or numbers in the response.
Determine whether the rhyme is high quality or forced in addition to checking for the presence of rhyme.
These are the criteria for determining a well-written rhyme.
If none of the assistant's responses contain rhyme, output that the assistant did not rhyme.
[BEGIN DATA]
************
[Question]: {question}
************
[Response]: {answer}
[END DATA]
Your response must be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word.
"correct" means the response contained a well written rhyme.
"incorrect" means the response did not contain a rhyme.
"""
Experimentation Example: Improving the System Prompt
The next step is to form a hypothesis about why some outputs are failing. In our full walkthrough, we demonstrate the experimentation process by testing out different hypotheses such as swapping out models. However, for demonstration purposes, we will show an experiment that will almost certainly improve your results: modifying the weak system prompt we originally used.
system_prompt = '''
You are a question-answering assistant. For each user question, answer correctly in the form of a rhyme.
'''
import json

# Imports used by the experiment and evaluator code below
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.experiments import run_experiment
from phoenix.experiments.types import Example

def updated_task(example: Example) -> str:
    raw_input_value = example.input["attributes.input.value"]
    data = json.loads(raw_input_value)
    question = data["messages"][1]["content"]
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
def evaluate_response(input, output):
    raw_input_value = input["attributes.input.value"]
    data = json.loads(raw_input_value)
    question = data["messages"][1]["content"]
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": question, "answer": output}]),
        template=RHYME_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4.1"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
    return score
Here, we expect to see improvements in our experiment. With the refined system prompt, the evaluator should flag significantly fewer nonsensical answers.
experiment = run_experiment(
    dataset,
    task=updated_task,
    evaluators=[evaluate_response],
    experiment_name="updated system prompt",
    experiment_description="updated system prompt",
)
Applying Improvements
Now that we’ve completed a successful experimentation cycle and confirmed our improvements on the annotated traces dataset, we can update the application and test the results on the broader dataset. This helps ensure that improvements made during experimentation translate effectively to real-world usage and that your system performs reliably at scale.
system_prompt = """
You are a question-answering assistant. For each user question, answer correctly in the form of a rhyme.
"""
# Task that runs the updated system prompt over the full question dataset
def complete_task(question) -> str:
    question_str = question["Questions"]
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question_str},
        ],
    )
    return response.choices[0].message.content
def evaluate_all_responses(input, output):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": input["Questions"], "answer": output}]),
        template=RHYME_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
    return score
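The run_experiment call below expects a dataset whose inputs contain a Questions column. As a rough sketch, and assuming you simply want to reuse the questions list from the start of this tutorial, you could upload that full question set to Phoenix like this (the dataset name is arbitrary):
import pandas as pd
import phoenix as px

# Upload the full question list as a Phoenix dataset; the "Questions" column
# matches what complete_task and evaluate_all_responses expect.
full_questions_df = pd.DataFrame({"Questions": questions})

dataset = px.Client().upload_dataset(
    dataframe=full_questions_df,
    dataset_name="all-questions",
    input_keys=["Questions"],
)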
experiment = run_experiment(
    dataset=dataset,  # full dataset of questions
    task=complete_task,
    evaluators=[evaluate_all_responses],
    experiment_name="modified-system-prompt-full-dataset",
)
Tips for building your custom annotation UI
A tool like this benefits teams that want to collect human annotation data without requiring annotators to work directly within the Phoenix platform. You can also add features like “thumbs up” and “thumbs down” buttons to streamline filling in annotation fields, and submitted annotations appear in Phoenix immediately.
Below is a sample prompt you can feed into Lovable (or a similar tool) to start building your own LLM trace annotation interface; feel free to adjust it to your needs. Note that you will need to implement the functionality that fetches spans from Phoenix and sends annotations back, and we’ve included a brief explanation of how we approached this in our own implementation.
Prompt for Lovable:
Build a platform for annotating LLM spans and traces:
Connect to Phoenix Cloud by collecting the endpoint, API key, and project name from the user.
Load traces and spans from Phoenix (via REST API or Python SDK).
Display spans grouped by trace_id, with clear visual separation.
Allow annotators to assign a label, score, and explanation to each span or entire trace.
Support sending annotations back to Phoenix and reloading to see updates.
Use a clean, modern design.
Details on how we built our Annotation UI:
✅ Frontend (Lovable):
Built in Lovable for easy UI generation.
Allows loading LLM traces, displaying spans grouped by trace_id, and annotating spans with label, score, explanation.
✅ Backend (Render, FastAPI):
Hosted on Render using FastAPI.
Adds CORS so your Lovable frontend can communicate with it securely.
Uses two key endpoints:
GET /v1/projects/{project_identifier}/spans
POST /v1/span_annotations
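As a rough illustration (not our exact implementation), a backend like this can be a small FastAPI app that enables CORS for the Lovable frontend and proxies the two Phoenix endpoints above. The environment variable names, frontend origin, and auth header are placeholders; adjust them to your deployment.
import os

import httpx
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware

PHOENIX_ENDPOINT = os.environ["PHOENIX_COLLECTOR_ENDPOINT"]  # placeholder env var names
PHOENIX_API_KEY = os.environ["PHOENIX_API_KEY"]

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://phoenix-trace-annotator.lovable.app"],  # your Lovable frontend origin
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)

def phoenix_headers() -> dict:
    # Some Phoenix deployments use an "api_key" header instead; check the REST API docs.
    return {"Authorization": f"Bearer {PHOENIX_API_KEY}"}

@app.get("/v1/projects/{project_identifier}/spans")
async def get_spans(project_identifier: str, request: Request):
    # Forward the span-listing request (including any query params) to Phoenix.
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            f"{PHOENIX_ENDPOINT}/v1/projects/{project_identifier}/spans",
            params=dict(request.query_params),
            headers=phoenix_headers(),
        )
    return resp.json()

@app.post("/v1/span_annotations")
async def post_annotations(request: Request):
    # Forward annotation payloads from the frontend to Phoenix unchanged.
    body = await request.json()
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{PHOENIX_ENDPOINT}/v1/span_annotations",
            json=body,
            headers=phoenix_headers(),
        )
    return resp.json()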