Code Readability Evaluation
Evaluate the readability of code generated by LLM applications using Phoenix's evaluation framework.
This tutorial shows how to classify code as readable or unreadable using benchmark datasets with ground-truth labels.
Key Takeaways:
Download and prepare benchmark datasets for code readability evaluation
Compare different LLMs (GPT-4, GPT-3.5, GPT-4 Turbo) on classification accuracy
Analyze results with confusion matrices and detailed reports
Get explanations for LLM classifications to understand decision-making
Notebook Walkthrough
We will go through the key code snippets on this page. To run the complete tutorial, check out the full notebook.
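The snippets on this page assume the imports below. Exact module paths can vary between Phoenix versions, and the matplotlib, pycm, and scikit-learn imports are inferred from how they are used later, so treat this as a sketch of the setup rather than an exact manifest.
# Assumed setup for the snippets below; module paths may differ by Phoenix version
import matplotlib.pyplot as plt
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

from phoenix.evals import (
    CODE_READABILITY_PROMPT_RAILS_MAP,
    CODE_READABILITY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)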
Download Benchmark Dataset
# Download a benchmark of code samples annotated with ground-truth readability labels
dataset_name = "openai_humaneval_with_readability"
df = download_benchmark_dataset(task="code-readability-classification", dataset_name=dataset_name)
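Before configuring the evaluation, it can help to peek at the downloaded data. The column names below are the ones used later on this page (assuming the dataset loads as a pandas DataFrame):
# Peek at the columns used later in this walkthrough
print(df[["prompt", "solution", "readable"]].head())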
Configure Evaluation
# Sample a small subset of the data for a faster evaluation run
N_EVAL_SAMPLE_SIZE = 10
df = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)

# Rename columns to the names referenced by the readability prompt template
df = df.rename(columns={"prompt": "input", "solution": "output"})
Run Code Readability Classification
Run readability classifications against a subset of the data.
# Use GPT-4 at temperature 0 for more deterministic classifications
model = OpenAIModel(model="gpt-4", temperature=0.0)

# The rails constrain the LLM's output to the allowed labels (readable / unreadable)
rails = list(CODE_READABILITY_PROMPT_RAILS_MAP.values())

readability_classifications = llm_classify(
    dataframe=df,
    template=CODE_READABILITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()
Evaluate Results and Plot Confusion Matrix
Evaluate the predictions against human-labeled ground-truth readability labels.
# Map the ground-truth labels onto the same rail strings the LLM outputs
true_labels = df["readable"].map(CODE_READABILITY_PROMPT_RAILS_MAP).tolist()

# Per-class precision, recall, and F1
print(classification_report(true_labels, readability_classifications, labels=rails))

# Confusion matrix of predicted vs. actual labels
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=readability_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
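If you want a single summary score alongside the per-class report, you can also compute plain accuracy with scikit-learn. This is a small optional addition, not part of the original notebook:
from sklearn.metrics import accuracy_score

# Overall fraction of LLM labels that agree with the human ground truth
accuracy = accuracy_score(true_labels, readability_classifications)
print(f"Accuracy: {accuracy:.2%}")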
Get Explanations
When evaluating a dataset for readability, it can be useful to know why the LLM classified code as readable or not. The following code block runs llm_classify with explanations turned on so that we can inspect why the LLM made each classification. There is a speed tradeoff, since more tokens are generated, but the explanations can be highly informative when troubleshooting.
# Request an explanation alongside each label for a small sample of rows
readability_classifications_df = llm_classify(
    dataframe=df.sample(n=5),
    template=CODE_READABILITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
    concurrency=20,
)
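With provide_explanation=True, the returned dataframe includes an explanation column next to each label, so you can review the model's reasoning directly. A minimal sketch of one way to inspect it:
# Print each predicted label alongside the model's stated reasoning
for _, row in readability_classifications_df.iterrows():
    print(f"label: {row['label']}")
    print(f"explanation: {row['explanation']}")
    print("-" * 40)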
Compare Models
Run the same evaluation with different models:
# GPT-3.5
model_gpt35 = OpenAIModel(model="gpt-3.5-turbo", temperature=0.0)
# GPT-4 Turbo
model_gpt4turbo = OpenAIModel(model="gpt-4-turbo-preview", temperature=0.0)
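The snippet below is one way to run the comparison end to end. The loop structure is illustrative and not part of the original notebook, but it reuses the same llm_classify call and classification_report shown above:
# Illustrative comparison loop (structure assumed, not from the original notebook)
models_to_compare = {
    "gpt-4": model,
    "gpt-3.5-turbo": model_gpt35,
    "gpt-4-turbo-preview": model_gpt4turbo,
}

for name, candidate_model in models_to_compare.items():
    predictions = llm_classify(
        dataframe=df,
        template=CODE_READABILITY_PROMPT_TEMPLATE,
        model=candidate_model,
        rails=rails,
        concurrency=20,
    )["label"].tolist()
    print(f"--- {name} ---")
    print(classification_report(true_labels, predictions, labels=rails))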