# Ragas

> Use Ragas evaluators to grade Arize AX traces and as evaluators in Arize AX experiments.

The [Ragas](https://docs.ragas.io/en/stable/) library ships LLM-as-judge evaluators (faithfulness, answer relevancy, context recall, and many more) designed for RAG and agent workloads. This guide shows two ways to wire Ragas into Arize AX:

* **Flow 1** grades existing Arize AX traces with a Ragas evaluator and writes the scores back via `client.spans.update_evaluations(...)`.
* **Flow 2** uploads a small dataset, runs an Arize AX experiment with a Ragas-backed evaluator function, and surfaces the scores in Datasets + Experiments.

Both flows share the same setup. Run the code blocks below in order inside a single Python session — each block builds on imports and variables from earlier ones.

## Prerequisites

* Python 3.11+
* An `ARIZE_SPACE_ID` and `ARIZE_API_KEY` from your Arize AX space settings
* An `OPENAI_API_KEY` from [OpenAI Platform](https://platform.openai.com/api-keys) (used as both the model under trace and Ragas's judge LLM)

## Launch Arize AX

If you don't already have an Arize AX account, sign up at [arize.com](https://arize.com/) and grab your `ARIZE_SPACE_ID` and `ARIZE_API_KEY` from Settings → Space Settings.

## Install

```bash theme={null}
pip install ragas 'arize>=8.0.0' openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc pandas
```

## Configure credentials

```bash theme={null}
export ARIZE_SPACE_ID="<your-space-id>"
export ARIZE_API_KEY="<your-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
```
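Before running the Python blocks, it can help to confirm that all three variables are visible to the interpreter. The sketch below is a hypothetical helper (`missing_env_vars` is not part of the Arize or Ragas SDKs) that reports which names are unset or blank, so a typo in the shell export shows up before any API call.

```python
import os

# The three variable names exported in the block above.
REQUIRED_VARS = ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or blank in `env` (defaults to os.environ).

    Hypothetical helper for illustration; not part of any SDK.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.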

## Define evaluators

Both flows share three pieces of setup: a Ragas Faithfulness evaluator backed by GPT-5 (via an `AsyncOpenAI` client, because Ragas's new collections API requires async), the canonical 2-row hallucination dataset that both flows score, and an Arize SDK client.

```python theme={null}
# combined.py
import os
import time
from datetime import datetime, timedelta, timezone

import pandas as pd
from arize import ArizeClient
from openai import AsyncOpenAI, OpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness

SPACE_ID = os.environ["ARIZE_SPACE_ID"]
API_KEY = os.environ["ARIZE_API_KEY"]
TIMESTAMP = int(time.time())

# Ragas Faithfulness measures how well a response is grounded in the
# retrieved context. The new collections API requires an async client.
async_oai = AsyncOpenAI()
ragas_llm = llm_factory("gpt-5", client=async_oai)
faithfulness = Faithfulness(llm=ragas_llm)

# Canonical 2-row dataset — row 0 is factual (answer matches the reference),
# row 1 is hallucinated. Both flows grade these same rows.
ROWS = [
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
]

arize = ArizeClient(api_key=API_KEY)
```

## Flow 1 — Evaluate existing traces

### Source the spans

Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.

```python theme={null}
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

PROJECT_NAME = f"ragas-tracing-example-{TIMESTAMP}"

resource = Resource.create(
    {
        "service.name":                PROJECT_NAME,
        "openinference.project.name":  PROJECT_NAME,
        "model_id":                    PROJECT_NAME,
    }
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://otlp.arize.com:443",
            headers={
                "authorization":     API_KEY,
                "arize-space-id":    SPACE_ID,
                "arize-interface":   "python",
            },
        )
    )
)
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument(tracer_provider=provider)

sync_oai = OpenAI()
for row in ROWS:
    sync_oai.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a fact-recall assistant. The user states the "
                    "exact answer to use; reply with that verbatim."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {row['input']}\n"
                    f"Answer (reply verbatim): {row['output']}"
                ),
            },
        ],
    )

provider.force_flush(timeout_millis=10_000)
print(f"Project: {PROJECT_NAME}")

# Spans take ~5–15s to be queryable after flush. Poll defensively: Arize's
# OTLP ingest and Flight export use different catalogs and the new project
# can briefly appear "unauthorized" to the export endpoint while still
# accepting span writes via OTLP, so swallow transient errors and retry.
start = datetime.now(timezone.utc) - timedelta(minutes=5)
end = datetime.now(timezone.utc) + timedelta(minutes=1)
spans_df = None
last_err: Exception | None = None
for _ in range(12):
    time.sleep(5)
    try:
        spans_df = arize.spans.export_to_df(
            space_id=SPACE_ID,
            project_name=PROJECT_NAME,
            start_time=start,
            end_time=end,
        )
    except Exception as e:
        last_err = e
        continue
    if spans_df is not None and len(spans_df) >= len(ROWS):
        break
else:
    raise RuntimeError(
        f"Spans never appeared after 60s (last error: {last_err})"
    )

spans_df = spans_df.sort_values("start_time").reset_index(drop=True)
```
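The polling loop above follows a general retry pattern: wait, fetch, swallow transient errors, and stop once enough rows are present. A minimal sketch of that pattern as a reusable function is shown below. `poll_until` and `fake_fetch` are hypothetical names introduced for illustration, and the fake fetch stands in for the real export call so the block runs without network access.

```python
import time


def poll_until(fetch, is_ready, attempts=12, delay_s=5.0, sleep=time.sleep):
    """Call fetch() until is_ready(result) is true, retrying on any exception.

    Hypothetical helper. Raises RuntimeError if no attempt produced a ready result.
    """
    last_err = None
    for _ in range(attempts):
        sleep(delay_s)
        try:
            result = fetch()
        except Exception as err:  # treat every error as transient and retry
            last_err = err
            continue
        if is_ready(result):
            return result
    raise RuntimeError(
        f"Not ready after {attempts} attempts (last error: {last_err})"
    )


# Fake data source: fails once, then returns too few rows, then enough rows.
responses = iter([ConnectionError("not yet"), [1], [1, 2]])


def fake_fetch():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


# delay_s=0 and a no-op sleep keep the demonstration instant.
rows = poll_until(
    fake_fetch,
    lambda r: len(r) >= 2,
    attempts=5,
    delay_s=0,
    sleep=lambda s: None,
)
print(rows)
```

In this guide's setting, `fetch` would wrap the same `arize.spans.export_to_df(...)` call shown above and `is_ready` would compare the row count against `len(ROWS)`.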

### Run the evaluators

`Faithfulness.score(...)` is the sync entry point; use it when you're not already inside an `asyncio` loop (Flow 2 below switches to `ascore(...)` because experiments evaluate inside one).

Faithfulness returns a continuous score in `[0.0, 1.0]` that can wobble between runs (the Berlin row might score `0.0` on one run and `0.25` on the next, depending on how the judge counts partially supported statements). This guide binarizes the score with a `0.5` threshold so the printed `score` column stays stable across runs. If you want the raw fractional value, drop the `1.0 if … else 0.0` and assign `result.value` directly.

```python theme={null}
scores = []
labels = []
for i, row in spans_df.iterrows():
    result = faithfulness.score(
        user_input=ROWS[i]["input"],
        response=row["output"],
        retrieved_contexts=[ROWS[i]["reference"]],
    )
    is_faithful = float(result.value) >= 0.5
    scores.append(1.0 if is_faithful else 0.0)
    labels.append("factual" if is_faithful else "hallucinated")
```
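The thresholding step can be isolated into a small pure function, which makes the cutoff easy to adjust and test. This is a minimal sketch; `binarize` is a hypothetical helper name, and the `0.5` default and the two labels are taken from the loop above.

```python
def binarize(value, threshold=0.5):
    """Map a continuous faithfulness score to a (score, label) pair.

    Scores at or above the threshold count as faithful, matching the
    `>= 0.5` comparison used in this guide.
    """
    is_faithful = float(value) >= threshold
    score = 1.0 if is_faithful else 0.0
    label = "factual" if is_faithful else "hallucinated"
    return score, label


# Raw values that wobble below the threshold all collapse to the same result.
for raw in (0.0, 0.25, 0.5, 1.0):
    print(raw, binarize(raw))
```

A raw score of `0.25` and a raw score of `0.0` both map to `(0.0, "hallucinated")`, which is why the printed column stays stable between runs.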

### Log evaluations to Arize AX

`update_evaluations(...)` requires a `context.span_id` column (which `export_to_df` already provides) plus the reserved `eval.<name>.{score,label,explanation}` columns. Each Ragas score becomes one row in this DataFrame.

```python theme={null}
eval_df = pd.DataFrame(
    {
        "context.span_id":           spans_df["context.span_id"],
        "eval.faithfulness.score":   scores,
        "eval.faithfulness.label":   labels,
    }
)
arize.spans.update_evaluations(
    space_id=SPACE_ID,
    project_name=PROJECT_NAME,
    dataframe=eval_df,
)

# Print the scores so they appear in stdout for verification.
flow1_display = pd.DataFrame(
    {
        "input":  [r["input"]  for r in ROWS],
        "output": [r["output"] for r in ROWS],
        "score":  scores,
    }
)
print("Flow 1 results:")
print(flow1_display.to_string())
```
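Before calling `update_evaluations(...)`, a quick check of the column names can catch a misspelled reserved column locally. The sketch below is a hypothetical pre-flight check, not part of the Arize SDK. It assumes evaluator names contain no dots, and it is deliberately stricter than the server may be, since it flags any column that is neither `context.span_id` nor an `eval.<name>.{score,label,explanation}` column.

```python
import re

# Reserved pattern described above: eval.<name>.score / .label / .explanation
# Assumption: <name> itself contains no dots.
EVAL_COLUMN = re.compile(r"^eval\.[^.]+\.(score|label|explanation)$")


def check_eval_columns(columns):
    """Return a list of problems found in an evaluation DataFrame's column names.

    Hypothetical helper; an empty list means the names look well-formed.
    """
    columns = list(columns)
    problems = []
    if "context.span_id" not in columns:
        problems.append("missing context.span_id")
    for col in columns:
        if col != "context.span_id" and not EVAL_COLUMN.match(col):
            problems.append(f"unexpected column: {col}")
    return problems


print(check_eval_columns(
    ["context.span_id", "eval.faithfulness.score", "eval.faithfulness.label"]
))
print(check_eval_columns(["eval.faithfulness.value"]))
```

With the `eval_df` built above, the call would be `check_eval_columns(eval_df.columns)`.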

### Expected output

```text wrap theme={null}
Flow 1 results:
                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0
```

### Verify in Arize AX

Open the project named `ragas-tracing-example-<timestamp>` (the value printed above) in your Arize AX space. Each `ChatCompletion` span now carries a `faithfulness` annotation column showing the score and label written by `update_evaluations(...)`.

## Flow 2 — Run an experiment

### Create a dataset

The dataset is the same two rows. The `space=` / `examples=` kwarg names match the v8 SDK exactly (note: not `space_id=` and not `dataframe=`).

```python theme={null}
DATASET_NAME = f"ragas-experiment-example-ds-{TIMESTAMP}"
dataset_df = pd.DataFrame(ROWS)
arize.datasets.create(
    name=DATASET_NAME,
    space=SPACE_ID,
    examples=dataset_df,
)
print(f"Dataset: {DATASET_NAME}")
```

### Define the task

The task function receives the dataset row and returns whatever the experiment should grade. The parameter name **must** be one of `input`, `output`, `metadata`, or `dataset_row` — a single-arg task with an unrecognized name is bound to `dataset_row` by default. A real workflow would call an LLM here; this passthrough keeps the example deterministic.

```python theme={null}
def task(dataset_row):
    return dataset_row["output"]
```

### Wrap the evaluators

Experiment evaluators run inside an `asyncio` loop, so use `async def` and Ragas's `ascore(...)` — the sync `score(...)` fails with `Cannot call sync score() from an async context`. Return an `EvaluationResult` with score **and** label **and** explanation populated: leaving any of those reserved fields as `None` triggers `unsupported cast from null to <type>: reserved column cannot be coerced to canonical type` at upload time.

```python theme={null}
from arize.experiments.evaluators.types import EvaluationResult


async def faithfulness_eval(input, output, dataset_row) -> EvaluationResult:
    result = await faithfulness.ascore(
        user_input=dataset_row["input"],
        response=output if isinstance(output, str) else str(output),
        retrieved_contexts=[dataset_row["reference"]],
    )
    is_faithful = float(result.value) >= 0.5
    return EvaluationResult(
        score=1.0 if is_faithful else 0.0,
        label="factual" if is_faithful else "hallucinated",
        explanation=result.reason or "no explanation",
    )
```
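The evaluator's control flow can be dry-run locally without any API calls by swapping in a stub metric. Everything in the sketch below is a stand-in: `StubMetric`, `StubResult`, and `stub_eval` are hypothetical, the stub's scoring rule is a toy substring check rather than anything Ragas does, and a plain dict replaces `EvaluationResult` so the block needs only the standard library.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubResult:
    """Hypothetical stand-in for the object a Ragas metric returns."""
    value: float
    reason: Optional[str] = None


class StubMetric:
    """Toy metric: 1.0 if the response's first word appears in any context."""

    async def ascore(self, user_input, response, retrieved_contexts):
        first_word = response.split()[0]
        supported = any(first_word in ctx for ctx in retrieved_contexts)
        return StubResult(value=1.0 if supported else 0.0)


async def stub_eval(output, dataset_row, metric):
    result = await metric.ascore(
        user_input=dataset_row["input"],
        response=output if isinstance(output, str) else str(output),
        retrieved_contexts=[dataset_row["reference"]],
    )
    is_faithful = float(result.value) >= 0.5
    # All three fields are populated, mirroring the non-null requirement above.
    return {
        "score": 1.0 if is_faithful else 0.0,
        "label": "factual" if is_faithful else "hallucinated",
        "explanation": result.reason or "no explanation",
    }


row = {
    "input": "What is the capital of France?",
    "reference": "Paris is the capital and most populous city of France.",
}
good = asyncio.run(stub_eval("Paris is the capital of France.", row, StubMetric()))
bad = asyncio.run(stub_eval("Berlin is the capital of France.", row, StubMetric()))
print(good)
print(bad)
```

Because the stub returns no reason, the `or "no explanation"` fallback is what keeps the explanation field from being `None`.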

### Run the experiment

```python theme={null}
EXPERIMENT_NAME = f"ragas-experiment-example-{TIMESTAMP}"
experiment, runs_df = arize.experiments.run(
    space=SPACE_ID,
    name=EXPERIMENT_NAME,
    dataset=DATASET_NAME,
    task=task,
    evaluators={"faithfulness": faithfulness_eval},
)
print(f"Experiment: {EXPERIMENT_NAME}")
print("Flow 2 results:")
flow2_display = runs_df[
    ["output", "eval.faithfulness.score", "eval.faithfulness.label"]
].rename(columns={"eval.faithfulness.score": "score", "eval.faithfulness.label": "label"})
print(flow2_display.to_string())
```

### Expected output

```text wrap theme={null}
Flow 2 results:
                             output  score         label
0   Paris is the capital of France.    1.0       factual
1  Berlin is the capital of France.    0.0  hallucinated
```

### Verify in Arize AX

Open the **Datasets + Experiments** tab in Arize AX. The dataset `ragas-experiment-example-ds-<timestamp>` and the experiment `ragas-experiment-example-<timestamp>` (names printed above) appear with one run per dataset row, each carrying the `faithfulness` score and label columns.

## Troubleshooting

* **`Cannot call sync score() from an async context`.** Your evaluator function in Flow 2 is calling `faithfulness.score(...)` instead of `faithfulness.ascore(...)`. Experiment evaluators run inside `asyncio`; use the async API. Flow 1 calls `score(...)` because it runs outside any loop.
* **`column "eval.<name>.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type`.** Your evaluator returned a bare number or string instead of a fully-populated `EvaluationResult(score=..., label=..., explanation=...)`. Arize AX's Flight server rejects null values in reserved eval columns — populate all three fields.
* **`llm_factory() requires a client instance`.** The new Ragas collections API removed text-only LLMs. Pass a configured client: `llm_factory("gpt-5", client=AsyncOpenAI())`.
* **Spans never appear after 60s.** Span flush + ingest typically takes 5–15s. If the loop times out, check that `ARIZE_SPACE_ID` + `ARIZE_API_KEY` are right and that you're connecting to the correct region's OTLP endpoint (`otlp.arize.com` for US, `otlp.eu.arize.com` for EU).
* **`task failed for example id ...`.** Your task function's parameter name isn't one of the recognized names (`input`, `output`, `metadata`, `dataset_row`). Rename it to `dataset_row` if you want the whole row, or pick the field you actually need.
* **Experiment runs duplicate or the dataset already exists.** Both names embed `TIMESTAMP = int(time.time())`, so each fresh run of `combined.py` produces unique names. If you re-execute the creation blocks without regenerating `TIMESTAMP` (for example, in the same session or within the same second), regenerate `TIMESTAMP` first or call `arize.experiments.delete(...)` / `arize.datasets.delete(...)` on the prior run's names.
* **Using a different Ragas metric.** Swap `Faithfulness` for any class in `ragas.metrics.collections` (`AnswerRelevancy`, `ContextRecall`, `FactualCorrectness`, etc.). Each metric has slightly different required fields, which this guide passes as keyword arguments to `score(...)` / `ascore(...)`; see the [Ragas metrics docs](https://docs.ragas.io/en/stable/concepts/metrics/index.html).
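For the region check in the span-ingest entry above, one option is to derive the OTLP endpoint from a single region setting instead of hard-coding it. This is a minimal sketch: `otlp_endpoint` is a hypothetical helper, the hostnames come from the troubleshooting entry, and the `https` scheme and port `443` for the EU host are an assumption mirrored from the US endpoint used in Flow 1.

```python
# Hostnames from the troubleshooting entry above.
# Assumption: the EU host uses the same https scheme and port 443 as the US one.
OTLP_HOSTS = {"us": "otlp.arize.com", "eu": "otlp.eu.arize.com"}


def otlp_endpoint(region="us"):
    """Return the OTLP endpoint URL for a region key ('us' or 'eu').

    Hypothetical helper; raises ValueError for an unknown region.
    """
    try:
        host = OTLP_HOSTS[region.lower()]
    except KeyError:
        raise ValueError(f"unknown region: {region!r}") from None
    return f"https://{host}:443"


print(otlp_endpoint("us"))
print(otlp_endpoint("EU"))
```

The returned string would be passed as the `endpoint=` argument of `OTLPSpanExporter` in Flow 1.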

## Resources

<CardGroup>
  <Card icon="book-open" href="https://docs.ragas.io/en/stable/" title="Ragas Documentation" horizontal />

  <Card icon="github" href="https://github.com/explodinggradients/ragas" title="Ragas on GitHub" horizontal />

  <Card icon="book-open" href="/api-clients/python/version-8/client-resources/spans#update-evaluations" title="Logging evaluations to Arize AX" horizontal />

  <Card icon="book-open" href="/ax/integrations/evaluation-integrations/ragas-nvidia-rag-metrics" title="NVIDIA RAG metrics via Ragas" horizontal />
</CardGroup>
