Evaluate multi-agent systems using Arize Phoenix, Google Gen AI Evals, and CrewAI
This guide demonstrates how to evaluate multi-agent systems using Arize Phoenix, Google Gen AI Evals, and CrewAI. It shows how to:
Set up a multi-agent system using CrewAI for collaborative AI agents
Instrument the agents with Phoenix for tracing and monitoring
Evaluate agent performance and interactions using Google Gen AI Evals
Analyze the results using Arize Phoenix's observability platform
CrewAI: For orchestrating multi-agent systems
Arize Phoenix: For observability and tracing
Google Cloud Vertex AI: For model hosting and execution
OpenAI: For agent LLM capabilities
We will walk through the key steps in the documentation below. Check out the full tutorial here:
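Before building the crew, instrument CrewAI so that agent, task, and LLM calls are traced into Phoenix. Below is a minimal sketch, assuming the OpenInference CrewAI instrumentor and a locally running Phoenix instance; the endpoint and project name are placeholders to adjust for your deployment.

import os

from openinference.instrumentation.crewai import CrewAIInstrumentor
from phoenix.otel import register

# Point the tracer at your Phoenix collector (placeholder endpoint)
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"

# Register a tracer provider and instrument CrewAI so agent and task spans are exported to Phoenix
tracer_provider = register(project_name="crewai-multi-agent-eval")
CrewAIInstrumentor().instrument(tracer_provider=tracer_provider)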
This crew consists of specialized agents working together to analyze and report on a given topic.
from crewai import Agent, Crew, Process, Task


def create_research_crew(topic: str) -> Crew:
    # Define the researcher, fact_checker, and writer agents here (see full tutorial)

    # Create tasks for your agents with explicit context
    conduct_analysis_task = Task(
        description=f"""Conduct a comprehensive analysis of the latest developments in {topic}.
        Identify key trends, breakthrough technologies, and potential industry impacts.
        Focus on both research breakthroughs and commercial applications.""",
        expected_output="Full analysis report in bullet points with citations to sources",
        agent=researcher,
        context=[],  # Explicitly set empty context
    )

    fact_checking_task = Task(
        description=f"""Review the research findings and verify the accuracy of claims about {topic}.
        Identify any potential ethical concerns or societal implications.
        Highlight areas where hype may exceed reality and provide a balanced assessment.
        Suggest frameworks that should be considered for each major advancement.""",
        expected_output="Fact-checking report with verification status for each major claim",
        agent=fact_checker,
        context=[conduct_analysis_task],  # Set context to the previous task
    )

    # The writer agent's task (writer_task) is defined in the full tutorial

    # Instantiate your crew with a sequential process
    crew = Crew(
        agents=[researcher, fact_checker, writer],
        tasks=[conduct_analysis_task, fact_checking_task, writer_task],
        verbose=False,
        process=Process.sequential,
    )

    return crew
Next, you'll build an experiment to test your CrewAI Crew with Phoenix and Google Gen AI evals.
When run, an Experiment will send each row of your dataset through your task, then apply each of your evaluators to the result.
All traces and metrics will then be stored in Phoenix for reference and comparison.
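The test dataset pairs each topic with a reference trajectory to score the crew against. A hypothetical sketch of the DataFrame df uploaded below, assuming Vertex AI's tool_name/tool_input trajectory format; the real topics and trajectories come from the full tutorial.

import pandas as pd

# Hypothetical test rows; each reference trajectory is a list of tool_name/tool_input steps
df = pd.DataFrame(
    {
        "topic": ["quantum computing"],
        "reference_trajectory": [
            [
                {"tool_name": "conduct_analysis", "tool_input": {"topic": "quantum computing"}},
                {"tool_name": "fact_checking", "tool_input": {"topic": "quantum computing"}},
            ]
        ],
    }
)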
import phoenix as px

phoenix_client = px.Client()

try:
    dataset = phoenix_client.get_dataset(name="crewai-researcher-test-topics")
except ValueError:
    dataset = phoenix_client.upload_dataset(
        dataframe=df,  # DataFrame of test topics and reference trajectories
        dataset_name="crewai-researcher-test-topics",
        input_keys=["topic"],
        output_keys=["reference_trajectory"],
    )
This task function will be run on each row of your test-case dataset:
def call_crew_with_topic(input):
    crew = create_research_crew(topic=input.get("topic"))
    result = crew.kickoff()
    return result
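For example, invoking it directly with a hypothetical input row kicks off the crew and returns its final output:

call_crew_with_topic({"topic": "quantum computing"})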
Define as many evaluators as you need to evaluate your agent. In this case, you'll use Google Gen AI's evaluation library to score the crew's trajectory.
import pandas as pd

from vertexai.preview.evaluation import EvalTask


def eval_trajectory_with_google_gen_ai(
    output, expected, metric_name="trajectory_exact_match"
) -> float:
    # create_trajectory_from_response converts the crew output into a trajectory (see full tutorial)
    eval_dataset = pd.DataFrame(
        {
            "predicted_trajectory": [create_trajectory_from_response(output)],
            "reference_trajectory": [expected.get("reference_trajectory")],
        }
    )
    eval_task = EvalTask(
        dataset=eval_dataset,
        metrics=[metric_name],
    )
    eval_result = eval_task.evaluate()
    metric_value = eval_result.summary_metrics.get(f"{metric_name}/mean")
    if metric_value is None:
        return 0.0
    return metric_value


def trajectory_exact_match(output, expected):
    return eval_trajectory_with_google_gen_ai(
        output, expected, metric_name="trajectory_exact_match"
    )


def trajectory_precision(output, expected):
    return eval_trajectory_with_google_gen_ai(
        output, expected, metric_name="trajectory_precision"
    )
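The other trajectory evaluators passed to the experiment below follow the same pattern, swapping in a different Vertex AI metric name; agent_names_match appears to be a custom check defined in the full tutorial. A sketch of the remaining two, assuming these metric names are available in your version of the Vertex AI evaluation service:

def trajectory_in_order_match(output, expected):
    return eval_trajectory_with_google_gen_ai(
        output, expected, metric_name="trajectory_in_order_match"
    )


def trajectory_any_order_match(output, expected):
    return eval_trajectory_with_google_gen_ai(
        output, expected, metric_name="trajectory_any_order_match"
    )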
import nest_asyncio

from phoenix.experiments import run_experiment

nest_asyncio.apply()

experiment = run_experiment(
    dataset,
    call_crew_with_topic,
    experiment_name="agent-experiment",
    evaluators=[
        trajectory_exact_match,
        trajectory_precision,
        trajectory_in_order_match,
        trajectory_any_order_match,
        agent_names_match,  # defined in the full tutorial
    ],
)
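Once the experiment completes, each run's traces and evaluator scores are stored in Phoenix, where you can inspect individual crew runs and compare experiments side by side.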