# Ragas

[Ragas](https://docs.ragas.io/en/stable/) is a library of evaluation metrics for LLM applications. When integrated with Arize, it adds metrics such as agent goal accuracy and tool call accuracy to your experiments, so you can compare runs and track improvements over time.

This guide will walk you through the process of creating and evaluating agents using Ragas and Arize. We'll cover the following steps:

* Build a math-solving agent with the OpenAI Agents SDK
* Trace agent activity to monitor interactions
* Generate a benchmark dataset for performance analysis
* Evaluate agent performance using Ragas

We will walk through the key steps in the documentation below. Check out the full tutorial here:

{% embed url="https://colab.research.google.com/drive/1t9Htr6vFPBRqizwL1uTM8GN-E9Y7_mHF?usp=sharing" %}

## Creating the Agent <a href="#creating-the-agent" id="creating-the-agent"></a>

Here we've set up a basic agent that solves math problems. A function tool evaluates math expressions, and the agent is configured to call that tool. Later, we'll use the `Runner` class to run the agent and get the final output.

```python
from agents import Runner, function_tool

@function_tool
def solve_equation(equation: str) -> str:
    """Use python to evaluate the math equation, instead of thinking about it yourself.

    Args:
        equation: string to pass into eval() in python
    """
    # eval() executes arbitrary Python: fine for a demo, unsafe for untrusted input.
    return str(eval(equation))
```
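Because the tool passes a model-generated string straight to `eval()`, it can execute arbitrary Python. The block below is a minimal sketch of one alternative, not part of the original tutorial: it parses the expression with the standard-library `ast` module and only evaluates numeric literals and a small set of arithmetic operators. The name `safe_eval_arithmetic` and the operator whitelist are assumptions made for illustration.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

# Arithmetic operators this sketch allows; any other syntax is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval_arithmetic(equation: str) -> Number:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _evaluate(node: ast.AST) -> Number:
        # Numeric literals (bool is excluded even though it subclasses int).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
        # Names, calls, attribute access, and everything else land here.
        raise ValueError(f"Unsupported expression: {equation!r}")

    tree = ast.parse(equation, mode="eval")
    return _evaluate(tree.body)


print(safe_eval_arithmetic("2 + 3 * 4"))  # → 14
```

Swapping this in for `eval(equation)` inside `solve_equation` would keep the tool's signature unchanged. Expressions with very large exponents can still be slow, so a production version would likely also bound operand sizes.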

```python
from agents import Agent

agent = Agent(
    name="Math Solver",
    instructions="You solve math problems by evaluating them with python and returning the result",
    tools=[solve_equation],
)
```

## Evaluating the Agent

Agents can go awry for a variety of reasons. We can use Ragas to evaluate whether the agent responded correctly. Two Ragas measurements help with this:

1. **Tool Call Accuracy** - Did our agent choose the right tool with the right arguments?
2. **Agent Goal Accuracy** - Did our agent accomplish the stated goal and get to the right outcome?

We'll import both metrics from Ragas and call `multi_turn_ascore(sample)` on each to get a score. The `AgentGoalAccuracyWithReference` metric compares the final output to the reference to see if the goal was accomplished. The `ToolCallAccuracy` metric compares the tool calls the agent made to the reference tool calls to see if the right tool was called with the right arguments.

In the notebook, we also define the helper function `conversation_to_ragas_sample`, which converts the agent messages into a format that Ragas can use.
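To make the idea behind tool call accuracy concrete, here is a deliberately simplified sketch. It is not how Ragas computes `ToolCallAccuracy`; it only illustrates comparing the calls an agent made against reference calls, in order, by tool name and arguments. `ToolCallRecord` and `simple_tool_call_match` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolCallRecord:
    """Hypothetical record of one tool call: the tool name plus its arguments."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


def simple_tool_call_match(
    actual: List[ToolCallRecord], reference: List[ToolCallRecord]
) -> float:
    """Fraction of reference calls matched position-by-position (name and args)."""
    if not reference:
        # With nothing expected, only an empty call list counts as correct.
        return 1.0 if not actual else 0.0
    matched = sum(1 for got, want in zip(actual, reference) if got == want)
    return matched / len(reference)


reference = [ToolCallRecord("solve_equation", {"equation": "2 + 2"})]
right_call = [ToolCallRecord("solve_equation", {"equation": "2 + 2"})]
wrong_args = [ToolCallRecord("solve_equation", {"equation": "2 * 2"})]

print(simple_tool_call_match(right_call, reference))  # → 1.0
print(simple_tool_call_match(wrong_args, reference))  # → 0.0
```

An exact-match comparison like this is stricter than most real metrics would be; it treats `"2 + 2"` and `"2+2"` as different arguments, for example.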

The following code snippets define our task function and evaluators.

```python
import asyncio

from agents import Runner

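async def solve_math_problem(input):
    # The input may arrive as a dict; if so, take its first value as the question.
    if isinstance(input, dict):
        input = next(iter(input.values()))
    result = await Runner.run(agent, input)
    # Return the answer and the full message list so the evaluators can inspect tool calls.
    return {"final_output": result.final_output, "messages": result.to_input_list()}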
```
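The task function unwraps dict inputs with `next(iter(input.values()))`, which returns the first value in insertion order. The sketch below isolates that step so its behavior is easy to check; `extract_question` is a hypothetical helper name, not something defined in the notebook.

```python
def extract_question(task_input):
    """Mirror the unwrapping step: a dict yields its first value, anything else passes through."""
    if isinstance(task_input, dict):
        return next(iter(task_input.values()))
    return task_input


print(extract_question({"question": "2 + 2"}))                 # → 2 + 2
print(extract_question("3 * 3"))                               # → 3 * 3
print(extract_question({"id": "id_0", "question": "2 + 2"}))   # → id_0
```

The last call shows the caveat: if the dict carries more than one key, the first value may not be the question, so looking up an explicit key is safer when the input shape is known.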

```python
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AgentGoalAccuracyWithReference, ToolCallAccuracy

# Evaluators: each builds a Ragas sample from the run and scores it with one metric.
async def tool_call_evaluator(input, output):
    if isinstance(output, dict):
        output = output.get("messages")
    sample = conversation_to_ragas_sample(output, reference_equation=input)
    tool_call_accuracy = ToolCallAccuracy()
    score = await tool_call_accuracy.multi_turn_ascore(sample)
    return score


async def goal_evaluator(input, output):
    sample = conversation_to_ragas_sample(
        output.get("messages"), reference_answer=output.get("final_output")
    )
    # The goal accuracy metric is LLM-judged, so it needs an evaluator LLM.
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
    goal_accuracy = AgentGoalAccuracyWithReference(llm=evaluator_llm)
    score = await goal_accuracy.multi_turn_ascore(sample)
    return score
```
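Both evaluators are coroutines, so calling them outside the experiment runner requires an event loop. The following sketch shows that calling pattern offline, using a hypothetical `StubMetric` that mimics only the `multi_turn_ascore` coroutine name and a plain dict in place of a Ragas sample. It makes no claim about how the real metrics score anything.

```python
import asyncio


class StubMetric:
    """Hypothetical stand-in that exposes the same coroutine name as a Ragas metric."""

    async def multi_turn_ascore(self, sample: dict) -> float:
        # Toy rule for illustration: score 1.0 if any tool call was recorded.
        return 1.0 if sample.get("tool_calls") else 0.0


async def stub_evaluator(output: dict) -> float:
    """Same shape as the evaluators in this guide: build a sample, await the score."""
    metric = StubMetric()
    return await metric.multi_turn_ascore(output)


# In a plain script, asyncio.run drives the coroutine to completion.
score = asyncio.run(stub_evaluator({"tool_calls": ["solve_equation"]}))
print(score)  # → 1.0
```

In a notebook, where an event loop is already running, `await stub_evaluator(...)` in a cell is the usual replacement for `asyncio.run(...)`.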

## Run the Experiment

Once we've generated a dataset of questions, we can use our experiments feature to track changes across models, prompts, and parameters for the agent.

```python
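import os

import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# `developer_key` and `conversations` are assumed to be defined earlier (see the full notebook).
client = ArizeDatasetsClient(api_key=os.environ.get("ARIZE_API_KEY"), developer_key=developer_key)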

dataset_df = pd.DataFrame({
    "id": [f"id_{i}" for i in range(len(conversations))],
    "question": [conv["question"] for conv in conversations],
    "attributes.input.value": [conv["question"] for conv in conversations],
    "attributes.output.value": [conv["final_output"] for conv in conversations],
})

dataset = client.create_dataset(
    space_id=os.environ.get("SPACE_ID"),
    dataset_name="math-questions",
    data=dataset_df,
    dataset_type=GENERATIVE,
)
```
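The dataset code reads from a `conversations` list that is built in the notebook and not shown here. As a sketch of the shape it appears to expect, the example below assumes each entry is a dict with `question` and `final_output` keys; the sample entries and the helper name `build_dataset_columns` are made up for illustration.

```python
from typing import Dict, List

# Hypothetical sample data: one dict per agent run.
conversations = [
    {"question": "What is 15 + 28?", "final_output": "43"},
    {"question": "What is 6 * 7?", "final_output": "42"},
]


def build_dataset_columns(conversations: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Build the same four columns used for the dataset, as a plain dict of lists."""
    questions = [conv["question"] for conv in conversations]
    return {
        "id": [f"id_{i}" for i in range(len(conversations))],
        "question": questions,
        "attributes.input.value": list(questions),
        "attributes.output.value": [conv["final_output"] for conv in conversations],
    }


columns = build_dataset_columns(conversations)
print(columns["id"])  # → ['id_0', 'id_1']
```

Passing `columns` to `pd.DataFrame(...)` yields a frame with the same column names as `dataset_df`.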

Finally, we run our experiment and view the results in Arize.

```python
experiment_id, experiment_df = client.run_experiment(
    space_id=os.environ.get("SPACE_ID"),
    dataset_id=dataset,
    task=solve_math_problem,
    evaluators=[goal_evaluator, tool_call_evaluator],
    experiment_name="ragas-agent",
    exit_on_error = True,
)
```

{% embed url="https://storage.googleapis.com/arize-phoenix-assets/assets/gifs/ragas_results_arize.gif" %}
