CI/CD
Chapter Summary
CI/CD strategies for LLM applications include experiments, automated testing, and integration with workflows like GitHub Actions. The chapter emphasizes tracking and iterating on changes while maintaining consistent evaluation standards to ensure smooth deployment.
Experiments
An experiment allows you to systematically test and validate changes in your LLM applications using a curated dataset. By defining a dataset (a collection of examples), creating tasks to generate outputs, and setting up evaluators to assess those outputs, you can run an experiment to see how well your updated pipeline performs. Whether you’re testing improvements with a golden dataset or troubleshooting issues with a problem dataset, experiments provide a structured way to measure the impact of your changes and ensure your application is on the right track.
Components of experiments:
- Datasets: Datasets are collections of examples that provide the inputs and, optionally, expected reference outputs for evaluating your application. These examples are used in experiments to track improvements to your prompt, LLM, or other parts of your LLM application.
- Tasks: A task is any function or process that produces a JSON-serializable output. Typically, a task replicates the LLM functionality you’re aiming to test. For instance, if you’ve made a prompt change, your task will run the examples through the new prompt to generate an output. The task is used in the experiment to process the dataset, producing outputs that will be evaluated in the next steps.
- Evaluators: An evaluator is any function that takes the output of a task and provides an assessment. It serves as the measure of success for your experiment, helping you determine whether your changes have achieved the desired results. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment.
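To make these three components concrete, here is a minimal illustrative sketch; the example data, the call_llm helper, and the evaluator signature are placeholders rather than a prescribed API:

# Dataset: a collection of examples with inputs and (optionally) expected outputs.
examples = [
    {"input": {"question": "What is the capital of France?"}, "expected_output": "Paris"},
    {"input": {"question": "Who wrote Hamlet?"}, "expected_output": "William Shakespeare"},
]

# Task: any function that turns an example into a JSON-serializable output,
# typically by running the prompt/LLM pipeline you want to test.
def run_task(example):
    prompt = f"Answer concisely: {example['input']['question']}"
    return call_llm(prompt)  # call_llm is a placeholder for your LLM call

# Evaluator: any function that scores a task output, here a simple exact match.
def exact_match(output, example):
    return 1 if output.strip() == example["expected_output"] else 0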
Application/Orchestration Changes
Experiments extend beyond simple prompt or model changes—they also allow AI engineers to test changes in how the LLM application is orchestrated. This could include testing new integrations, API interactions, or external function calls. By experimenting with application or orchestration changes, engineers can optimize the performance of the overall system, not just the LLM output itself.
For example, an application might include task orchestration changes, such as altering how functions are triggered or modifying interaction flows between the LLM and external APIs. Experiments help determine whether these changes improve response times, reduce errors, or enhance overall user experience.
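For instance, an orchestration change can be captured as an alternative task and compared against the current flow on the same dataset; the search_api and call_llm helpers below are hypothetical:

# Current orchestration: answer directly from the user question.
def task_direct(example):
    return call_llm(example["input"]["question"])

# Candidate orchestration: call an external search API first, then answer with context.
def task_with_retrieval(example):
    docs = search_api(example["input"]["question"], top_k=3)
    prompt = f"Context:\n{docs}\n\nQuestion: {example['input']['question']}"
    return call_llm(prompt)

# Running one experiment per task variant on the same dataset and evaluators
# lets you compare error rates and latency between the two flows.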
Tracking New Experiments
Effective experiment tracking is essential for measuring progress and avoiding regressions. By maintaining a record of each experiment—along with its configuration, dataset, evaluators, and results—AI engineers can compare past and present performance. This ensures that each change is assessed in context, and improvements or declines are easily traceable across different experiments.
How to Read Results: It’s Not Black and White
Interpreting experiment results requires nuance. Experiments often yield mixed outcomes—some evaluations may show improvements while others may not. For example, a change in prompt structure might lead to higher coherence scores but could slightly reduce factual accuracy.
Most teams hand-curate annotated examples for an experiment validation set, then compute metrics such as an average score, F1, recall, or precision on top of the evaluation results to assess performance. Statistical metrics are not yet ubiquitous, since many AI engineers come from non-statistics backgrounds, but we see statistical checks emerging as a growing set of best practices.
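For example, a minimal sketch of computing such aggregate metrics from annotated labels and evaluator outputs (the data here is purely illustrative):

from sklearn.metrics import f1_score, precision_score, recall_score

# Human-annotated ground truth vs. evaluator predictions (1 = correct, 0 = incorrect)
ground_truth = [1, 1, 0, 1, 0, 1]
eval_labels = [1, 0, 0, 1, 1, 1]

average_score = sum(eval_labels) / len(eval_labels)
print(f"average score: {average_score:.2f}")
print(f"precision: {precision_score(ground_truth, eval_labels):.2f}")
print(f"recall: {recall_score(ground_truth, eval_labels):.2f}")
print(f"F1: {f1_score(ground_truth, eval_labels):.2f}")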
Understanding that evaluation results are not always “black and white” is key. It’s important to weigh trade-offs and prioritize improvements that align with the specific goals of your application. A robust evaluation process will consider the impact across multiple metrics and tasks, allowing the AI engineer to decide whether the trade-offs make sense for their application.
LLM Evaluators
LLM evaluators utilize LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs.
Here’s an example of an LLM evaluator that checks for hallucinations in the model output:
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments.evaluators.base import Evaluator  # base class for custom evaluators (path may vary by SDK version)
from phoenix.experiments.types import EvaluationResult


class HallucinationEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluating outputs")
        expected_output = dataset_row["attributes.llm.output_messages"]

        # Create a DataFrame with the actual and expected outputs
        df_in = pd.DataFrame(
            {"selected_output": output, "expected_output": expected_output}, index=[0]
        )

        # Run the LLM classification
        expect_df = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),  # OPENAI_API_KEY defined elsewhere
            rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
            provide_explanation=True,
        )

        label = expect_df["label"][0]
        score = 1 if label == "factual" else 0  # score 1 when no hallucination is detected
        explanation = expect_df["explanation"][0]

        # Return the evaluation result
        return EvaluationResult(score=score, label=label, explanation=explanation)
In this example, the HallucinationEvaluator class evaluates whether the output of an experiment contains hallucinations by comparing it to the expected output using an LLM. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.
You can customize LLM evaluators to suit your experiment’s needs, whether you’re checking for hallucinations, function choice, or other criteria where an LLM’s judgment is valuable. Simply update the template with your instructions and the rails with the desired output. You can also have multiple LLM evaluators in a single experiment to assess different aspects of the output simultaneously.
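For instance, here is a hedged sketch of a custom evaluator that checks function choice by passing a custom template and rails to llm_classify; the template text, rail values, and column names are illustrative:

FUNCTION_CHOICE_TEMPLATE = """
You are evaluating whether an assistant chose the correct function to call.
[Question]: {question}
[Function called]: {function_called}
[Expected function]: {expected_function}
Respond with a single word: "correct" or "incorrect".
"""

FUNCTION_CHOICE_RAILS = ["correct", "incorrect"]

eval_df = llm_classify(
    dataframe=df_in,  # must contain the columns referenced in the template
    template=FUNCTION_CHOICE_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),
    rails=FUNCTION_CHOICE_RAILS,
    provide_explanation=True,
)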
Code-Based Eval Experiment
Code evaluators are functions designed to assess the outputs of your experiments. They allow you to define specific criteria for success, which can be as simple or complex as your application requires. Code evaluators are especially useful when you need to apply tailored logic or rules to validate the output of your model.
Creating a custom code evaluator is as simple as writing a Python function. By default, this function will take the output of an experiment run as its single argument. Your custom evaluator can return either a boolean or a numeric value, which will then be recorded as the evaluation score.
For example, let’s say our experiment is testing a task that should output a numeric value between 1 and 100. We can create a simple evaluator function to check if the output falls within this range.
def in_bounds(output):
    return 1 <= output <= 100
By passing the in_bounds function to run_experiment, evaluations will automatically be generated for each experiment run, indicating whether the output is within the allowed range. This allows you to quickly assess the validity of your experiment’s outputs based on custom criteria.
experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[in_bounds],
    experiment_name=experiment_name,
)
Async vs Sync Tasks and Evals
Keep these in mind when choosing between synchronous and asynchronous experiments.
- Synchronous: Slower but easier to debug. While you are building your tests, synchronous runs are inherently easier to debug. Start with synchronous runs, then convert them to asynchronous.
- Asynchronous: Faster. When the timing and speed of your tests matter, make the tasks and/or evals asynchronous and you can 10x the speed of your runs.
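As a minimal sketch, assuming your experiment runner accepts coroutine functions as tasks and that examples follow the input structure used earlier:

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# An async task lets the experiment runner process many dataset examples concurrently.
async def run_task_async(example):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": example["input"]["question"]}],
    )
    return response.choices[0].message.content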
CI/CD for Automated Experimentation
Setting up CI/CD pipelines for LLMs helps you maintain control as your applications evolve. Just like in traditional software, automated testing is crucial to catch issues early. With Arize, you can create experiments that automatically validate changes—whether it’s a tweak to a prompt, model, or function—using a curated dataset and your preferred evaluation method. These tests can be integrated with GitHub Actions, so they run automatically when you push a change, giving you confidence that your updates are solid without the need for manual testing.
GitHub Actions lets you automate workflows directly from your GitHub repository, enabling you to build, test, and deploy your code based on specific events (such as code pushes, pull requests, and more).
Key Concepts of GitHub Actions:
- Workflows: Automated processes that you define in your repository.
- Jobs: A workflow is composed of one or more jobs that can run sequentially or in parallel.
- Steps: Jobs contain steps that run commands in the job’s virtual environment.
- Actions: The individual tasks that you can combine to create jobs and customize your workflow. You can use actions defined in the GitHub marketplace or create your own.
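As an illustrative sketch, a workflow step could run a Python script like the one below on every pull request and fail the job if the experiment scores drop below a threshold; the commit identifier, threshold, and compute_pass_rate helper are placeholders, not a prescribed setup:

import sys

# Run the experiment exactly as you would locally (see run_experiment above).
experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[in_bounds],
    experiment_name=f"ci-{GIT_COMMIT_SHA}",  # tie the run to the commit under test
)

# Hypothetical post-processing: compute the fraction of passing evaluations
# from the experiment results and fail the CI job if it falls below a threshold.
pass_rate = compute_pass_rate(experiment)  # placeholder helper
if pass_rate < 0.9:
    print(f"Experiment pass rate {pass_rate:.2f} is below threshold; failing the build.")
    sys.exit(1)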
Datasets
Datasets are integral to evaluation and experimentation.
They are collections of examples that provide the inputs, outputs, and any other attributes needed for assessing your application. Each example within a dataset represents a single data point, consisting of an inputs dictionary, an optional output dictionary, and an optional metadata dictionary. The optional output dictionary often contains the expected LLM application output for the given input.
Datasets allow you to collect data from production, staging, evaluations, and even manually. The examples collected are then used to run experiments and evaluations to track improvements.
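For illustration, each example can be thought of as a simple record like the following; the exact upload API depends on the SDK you use, so this only sketches the shape of the data:

import pandas as pd

examples = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "output": {"answer": "Go to Settings > Account > Reset Password."},
        "metadata": {"source": "production", "feedback": "positive"},
    },
    {
        "inputs": {"question": "Can I export my data as CSV?"},
        "output": {"answer": "Yes, use the Export button on the dashboard."},
        "metadata": {"source": "manual", "feedback": None},
    },
]

# Flatten into a DataFrame for upload to your dataset store of choice.
df = pd.DataFrame(examples)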
Use datasets to:
- Store evaluation test cases for your eval script instead of managing large JSONL or CSV files
- Capture generations to assess quality manually or using LLM-graded evals
- Store user reviewed generations to find new test cases
Creating Datasets
There are various ways to get started with datasets:
Manually Curated Examples
This is how we recommend you start. From building your application, you probably have an idea of what types of inputs you expect your application to be able to handle, and what “good” responses look like. You probably want to cover a few different common edge cases or situations you can imagine. Even 20 high quality, manually curated examples can go a long way.
Historical Logs
Once you ship an application, you start gleaning valuable information: how users are actually using it. This information can be valuable to capture and store in datasets. This allows you to test against specific use cases as you iterate on your application.
If your application is going well, you will likely get a lot of usage. How can you determine which datapoints are valuable to add? There are a few heuristics you can follow. If possible, try to collect end user feedback. You can then see which datapoints got negative feedback. That is super valuable! These are spots where your application did not perform well. You should add these to your dataset to test against in the future. You can also use other heuristics to identify interesting datapoints – for example, runs that took a long time to complete could be interesting to analyze and add to a dataset.
Synthetic Data
Once you have a few examples, you can try to artificially generate examples to get a lot of datapoints quickly. It’s generally advised to have a few good handcrafted examples before this step, as the synthetic data will often resemble the source examples in some way.
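A rough sketch of one way to do this, using an LLM to paraphrase hand-curated seed questions; the prompt wording and model choice are illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_questions = [
    "How do I reset my password?",
    "Can I export my data as CSV?",
]

synthetic_questions = []
for question in seed_questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write 3 realistic variations of this user question, one per line:\n{question}",
        }],
    )
    synthetic_questions.extend(response.choices[0].message.content.splitlines())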
Promoting Changes
Promoting changes in LLM applications requires rigorous evaluation-driven testing to ensure new updates are stable, accurate, and aligned with the application’s goals. Given the non-deterministic nature of LLMs, promoting changes based on evaluation metrics needs to be done thoughtfully, balancing between ensuring reliability and avoiding unnecessary roadblocks.
Writing Eval-Driven Development Tests
To promote changes effectively, it’s crucial to integrate GitHub Actions and evaluation-driven tests into your development workflow. Since LLMs are non-deterministic, running a test once isn’t enough. The key is passing evaluation tests consistently. AI engineers need to run evaluations multiple times to ensure that results are consistent across runs. Changes should pass a certain number of test runs to guarantee stability and avoid promoting changes based on one-off good results.
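For example, here is a minimal sketch of a pytest-style check that requires an experiment to clear a threshold on several consecutive runs before a change is considered stable; run_ci_experiment, compute_pass_rate, and the 0.9 threshold are hypothetical:

# test_prompt_change.py -- run by GitHub Actions on every pull request
def test_prompt_change_is_consistently_good():
    scores = []
    for run in range(3):  # repeat because LLM outputs are non-deterministic
        experiment = run_ci_experiment(name=f"pr-check-run-{run}")  # hypothetical helper
        scores.append(compute_pass_rate(experiment))  # hypothetical helper

    # Require every run, not just the average, to clear the bar.
    assert all(score >= 0.9 for score in scores), f"Unstable results across runs: {scores}"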
What Experiments Should You Run in CI/CD?
When running experiments in your CI/CD pipeline, the goal is to simulate production as closely as possible before deploying changes. Some experiments to include:
- Ground Truth Evals: These should be run regardless of what’s being tested. If the application fails on ground truth comparisons, something is fundamentally broken, and changes should not be promoted.
- Threshold-Based Experiments: Set up experiments to detect if certain evaluation metrics (e.g., hallucinations or correctness) pass a defined threshold. For instance, if hallucination rates spike by 50% compared to a baseline, it may indicate a significant issue that needs addressing before promoting the change.
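A minimal sketch of such a threshold check against a stored baseline; the baseline value mirrors the 50% spike example above, and the compute_hallucination_rate helper is hypothetical:

baseline_hallucination_rate = 0.04  # e.g., loaded from the last promoted experiment
# `experiment` is the result of a CI experiment run, as in the earlier examples.
current_hallucination_rate = compute_hallucination_rate(experiment)  # hypothetical helper

# Flag the change if the hallucination rate spikes by 50% or more over the baseline.
if current_hallucination_rate >= 1.5 * baseline_hallucination_rate:
    raise RuntimeError(
        f"Hallucination rate {current_hallucination_rate:.2%} exceeds "
        f"baseline {baseline_hallucination_rate:.2%} by 50% or more."
    )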
Should You Block PRs on Failures?
Blocking PRs on evaluation failures should be handled similarly to traditional unit tests. Some evaluation failures, like major issues in ground truth comparisons or significant metric spikes, should block PRs. However, not all evaluation tests should block a PR—some are better suited for post-deployment monitoring to avoid unnecessary bottlenecks.
This is where the paradigm shift comes in: LLM evaluations are your new unit tests. While teams may not block changes for every failure, they represent real tests that detect critical issues in your application, ensuring that only high-quality updates are promoted.
Paradigm Shift: Detaching Experiments from CI/CD
In traditional software development, Continuous Integration and Continuous Deployment (CI/CD) pipelines are designed to catch issues early, ensuring that code changes don’t introduce new bugs or degrade performance. With LLM applications, however, this approach requires a significant shift. Unlike traditional applications, LLM applications are affected not only by code changes but also by model updates and input drift in production. As a result, experiments need to be run regularly to detect performance shifts, even when there aren’t associated pull requests or active development changes.