
# NVIDIA RAG Metrics via Ragas

> Use NVIDIA's RAG metrics (Answer Accuracy, Context Relevance, Response Groundedness) via Ragas to grade Arize AX traces and as evaluators in Arize AX experiments.

NVIDIA's RAG evaluation metrics ship inside [Ragas](https://docs.ragas.io/en/stable/) as a dedicated collection: [`AnswerAccuracy`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/) (does the response match a reference), [`ContextRelevance`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/) (does the retrieved context cover the question), and [`ResponseGroundedness`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/) (is the response actually supported by the context). They're tuned to match NVIDIA's published RAG quality benchmarks while reusing Ragas's prompting and execution machinery.

This guide shows both ways to wire them into Arize AX. Flow 1 grades existing Arize AX traces with `ResponseGroundedness` and writes the scores back via `client.spans.update_evaluations(...)`. Flow 2 uploads a small dataset, runs an Arize AX experiment with the same metric wrapped as an experiment evaluator, and surfaces the scores in Datasets + Experiments. For the sibling Ragas integration (the standard metrics like `Faithfulness` and `AnswerRelevancy`), see the [Ragas evaluation guide](/ax/integrations/evaluation-integrations/ragas).

Both flows share the same setup. Run the code blocks below in order inside a single Python session — each block builds on imports and variables from earlier ones.

## Prerequisites

* Python 3.11+
* An `ARIZE_SPACE_ID` and `ARIZE_API_KEY` from your Arize AX space settings
* An `OPENAI_API_KEY` from [OpenAI Platform](https://platform.openai.com/api-keys) (used as both the model under trace and the judge model for NVIDIA's metrics)

## Launch Arize AX

If you don't already have an Arize AX account, sign up at [arize.com](https://arize.com/) and grab your `ARIZE_SPACE_ID` and `ARIZE_API_KEY` from Settings → Space Settings.

## Install

```bash theme={null}
pip install ragas 'arize>=8.0.0' openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc pandas
```

## Configure credentials

```bash theme={null}
export ARIZE_SPACE_ID="<your-space-id>"
export ARIZE_API_KEY="<your-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
```
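Because every later block reads these variables from the environment, it can help to check them up front. The following is a minimal sketch, not part of the Arize or Ragas SDKs; `REQUIRED_VARS` and `missing_credentials` are hypothetical names introduced here for illustration.

```python
import os

# Names taken from the export commands above.
REQUIRED_VARS = ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_credentials({"ARIZE_SPACE_ID": "abc", "ARIZE_API_KEY": ""}))
# ['ARIZE_API_KEY', 'OPENAI_API_KEY']
```

Calling `missing_credentials()` with no argument checks the real environment, so a non-empty result means the setup block would fail with a `KeyError` or an authentication error.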

## Define evaluators

The shared setup has three pieces: NVIDIA's v2 `ResponseGroundedness` metric (from `ragas.metrics.collections`) backed by `gpt-5-mini` via Ragas's `llm_factory`, a two-row hallucination dataset that both flows score, and an Arize SDK client. `temperature=1.0` is passed explicitly because the `gpt-5` model family, including `gpt-5-mini`, only supports the default temperature.

```python theme={null}
# combined.py
import os
import time
from datetime import datetime, timedelta, timezone

import pandas as pd
from arize import ArizeClient
from openai import AsyncOpenAI, OpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ResponseGroundedness

SPACE_ID = os.environ["ARIZE_SPACE_ID"]
API_KEY = os.environ["ARIZE_API_KEY"]
TIMESTAMP = int(time.time())

# NVIDIA ResponseGroundedness scores 0.0–1.0 — 1.0 means the response is
# fully supported by the retrieved context, 0.0 means it's not. The v2
# metric uses Ragas's modern `llm_factory` API and a dual-judge prompt
# pair under the hood; it accepts an `InstructorBaseRagasLLM`.
ragas_llm = llm_factory(
    "gpt-5-mini",
    client=AsyncOpenAI(),
    temperature=1.0,  # gpt-5 only supports the default temperature
)
response_groundedness = ResponseGroundedness(llm=ragas_llm)

# Canonical 2-row dataset — row 0 is factual (answer matches the reference),
# row 1 is hallucinated. Both flows grade these same rows.
ROWS = [
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
]

arize = ArizeClient(api_key=API_KEY)
```

## Flow 1 — Evaluate existing traces

### Source the spans

Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.

```python theme={null}
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

PROJECT_NAME = f"nv-ragas-tracing-example-{TIMESTAMP}"

resource = Resource.create(
    {
        "service.name":                PROJECT_NAME,
        "openinference.project.name":  PROJECT_NAME,
        "model_id":                    PROJECT_NAME,
    }
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://otlp.arize.com:443",
            headers={
                "authorization":   API_KEY,
                "arize-space-id":  SPACE_ID,
                "arize-interface": "python",
            },
        )
    )
)
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument(tracer_provider=provider)

sync_oai = OpenAI()
for row in ROWS:
    sync_oai.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a fact-recall assistant. The user states the "
                    "exact answer to use; reply with that verbatim."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {row['input']}\n"
                    f"Answer (reply verbatim): {row['output']}"
                ),
            },
        ],
    )

provider.force_flush(timeout_millis=10_000)
print(f"Project: {PROJECT_NAME}")

# Poll defensively: Arize's OTLP ingest and Flight export use different
# catalogs and the new project can briefly appear "unauthorized" to the
# export endpoint while still accepting span writes via OTLP, so swallow
# transient errors and retry.
start = datetime.now(timezone.utc) - timedelta(minutes=5)
end = datetime.now(timezone.utc) + timedelta(minutes=1)
spans_df = None
last_err: Exception | None = None
for _ in range(12):
    time.sleep(5)
    try:
        spans_df = arize.spans.export_to_df(
            space_id=SPACE_ID,
            project_name=PROJECT_NAME,
            start_time=start,
            end_time=end,
        )
    except Exception as e:
        last_err = e
        continue
    if spans_df is not None and len(spans_df) >= len(ROWS):
        break
else:
    raise RuntimeError(
        f"Spans never appeared after 60s (last error: {last_err})"
    )

spans_df = spans_df.sort_values("start_time").reset_index(drop=True)
```
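The retry loop above follows a general poll-until-ready pattern: sleep, try the fetch, swallow transient errors, and stop once the result looks complete. A minimal standalone sketch of that pattern is shown here. `poll_until` and `flaky_fetch` are hypothetical helpers written for this example, and the fake fetch stands in for `arize.spans.export_to_df(...)` so the snippet runs without network access.

```python
import time


def poll_until(fetch, is_ready, attempts=12, delay_s=5.0, sleep=time.sleep):
    """Call fetch() until is_ready(result) is true, retrying on exceptions.

    Returns the first ready result; raises RuntimeError after `attempts` tries.
    """
    last_err = None
    for _ in range(attempts):
        sleep(delay_s)
        try:
            result = fetch()
        except Exception as err:  # treat every failure as transient
            last_err = err
            continue
        if is_ready(result):
            return result
    raise RuntimeError(
        f"Not ready after {attempts} attempts (last error: {last_err})"
    )


# Fake fetch: fails twice, then returns two span ids.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("project not visible yet")
    return ["span-a", "span-b"]


spans = poll_until(flaky_fetch, lambda r: len(r) >= 2, delay_s=0.0)
print(spans, calls["n"])  # ['span-a', 'span-b'] 3
```

Keeping the last exception around, as the guide's loop does, makes the final timeout message far easier to debug than a bare "timed out".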

### Run the evaluators

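The v2 metric exposes `ascore(response=..., retrieved_contexts=[...])` directly, with no `SingleTurnSample` wrapper. It's async, so wrap each call in `asyncio.run(...)` from sync code. In a notebook or any other environment that already has a running event loop, `asyncio.run(...)` raises `RuntimeError`; `await` the call directly there instead. Flow 2 below uses the same metric from inside an `async def` evaluator wrapper.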

```python theme={null}
import asyncio

scores = []
labels = []
for i, row in spans_df.iterrows():
    result = asyncio.run(
        response_groundedness.ascore(
            response=row["output"],
            retrieved_contexts=[ROWS[i]["reference"]],
        )
    )
    score = float(result.value)
    scores.append(score)
    labels.append("grounded" if score >= 0.5 else "ungrounded")
```
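The loop above starts a fresh event loop for every row, which is fine for two rows. For larger batches, one alternative is to score all rows inside a single `asyncio.run(...)` call with `asyncio.gather`. The sketch below illustrates that pattern under stated assumptions: `StubGroundedness` is a hypothetical stand-in that makes no LLM call and returns a plain float, whereas the real metric returns a result object whose score is read from `.value`. `score_all` and `max_concurrency` are likewise illustrative names.

```python
import asyncio


class StubGroundedness:
    """Hypothetical stand-in for the real metric: 1.0 if the response text
    appears in any context, else 0.0. No LLM call is made."""

    async def ascore(self, response, retrieved_contexts):
        await asyncio.sleep(0)  # yield control like a real network call would
        hit = any(response.lower() in ctx.lower() for ctx in retrieved_contexts)
        return 1.0 if hit else 0.0


async def score_all(metric, pairs, max_concurrency=4):
    """Score (response, context) pairs concurrently inside one event loop."""
    gate = asyncio.Semaphore(max_concurrency)  # cap in-flight judge calls

    async def score_one(response, context):
        async with gate:
            return await metric.ascore(
                response=response, retrieved_contexts=[context]
            )

    # gather preserves input order, so scores line up with pairs.
    return await asyncio.gather(*(score_one(r, c) for r, c in pairs))


pairs = [
    ("Paris", "Paris is the capital of France."),
    ("Berlin", "Paris is the capital of France."),
]
scores = asyncio.run(score_all(StubGroundedness(), pairs))
print(scores)  # [1.0, 0.0]
```

The semaphore is there to keep the number of concurrent judge calls bounded, which may help with provider rate limits; the right limit depends on your account.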

### Log evaluations to Arize AX

```python theme={null}
eval_df = pd.DataFrame(
    {
        "context.span_id":                     spans_df["context.span_id"],
        "eval.nv_response_groundedness.score": scores,
        "eval.nv_response_groundedness.label": labels,
    }
)
arize.spans.update_evaluations(
    space_id=SPACE_ID,
    project_name=PROJECT_NAME,
    dataframe=eval_df,
)

flow1_display = pd.DataFrame(
    {
        "input":  [r["input"]  for r in ROWS],
        "output": [r["output"] for r in ROWS],
        "score":  scores,
    }
)
print("Flow 1 results:")
print(flow1_display.to_string())
```

### Expected output

```text wrap theme={null}
Flow 1 results:
                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0
```
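The evaluation DataFrame logged in this flow pairs each `context.span_id` with columns named `eval.<name>.score` and `eval.<name>.label`. As a rough sketch of that naming pattern, the helper below builds the column mapping for any evaluator name. `eval_columns` is a hypothetical function for illustration, not an Arize SDK call, and the span ids are made up.

```python
def eval_columns(name, span_ids, scores, labels):
    """Build the column mapping for one evaluator's results.

    Keys follow the `eval.<name>.score` / `eval.<name>.label` pattern used
    in this guide; every list must have one entry per span.
    """
    if not (len(span_ids) == len(scores) == len(labels)):
        raise ValueError("span_ids, scores, and labels must be the same length")
    return {
        "context.span_id": list(span_ids),
        f"eval.{name}.score": [float(s) for s in scores],
        f"eval.{name}.label": list(labels),
    }


cols = eval_columns(
    "nv_response_groundedness",
    ["span-1", "span-2"],  # made-up ids for the example
    [1.0, 0.0],
    ["grounded", "ungrounded"],
)
print(list(cols))
# ['context.span_id', 'eval.nv_response_groundedness.score', 'eval.nv_response_groundedness.label']
```

The resulting mapping can be passed to `pd.DataFrame(cols)`; the length check catches the case where the span export returned more rows than were scored.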

### Verify in Arize AX

Open the project named `nv-ragas-tracing-example-<timestamp>` (the value printed above) in your Arize AX space. Each `ChatCompletion` span now carries an `nv_response_groundedness` annotation column showing the score (between 0.0 and 1.0) and the `grounded` / `ungrounded` label.

## Flow 2 — Run an experiment

### Create a dataset

```python theme={null}
DATASET_NAME = f"nv-ragas-experiment-example-ds-{TIMESTAMP}"
dataset_df = pd.DataFrame(ROWS)
arize.datasets.create(
    name=DATASET_NAME,
    space=SPACE_ID,
    examples=dataset_df,
)
print(f"Dataset: {DATASET_NAME}")
```

### Define the task

```python theme={null}
def task(dataset_row):
    return dataset_row["output"]
```

### Wrap the evaluators

Experiment evaluators run inside an `asyncio` loop, so the wrapper is `async def` and `await`s `response_groundedness.ascore(...)` directly. Return an `EvaluationResult` with score, label, and explanation populated — leaving any of those as `None` triggers `unsupported cast from null to <type>: reserved column cannot be coerced to canonical type` at upload time.

```python theme={null}
from arize.experiments.evaluators.types import EvaluationResult


async def nv_response_groundedness_eval(
    input, output, dataset_row
) -> EvaluationResult:
    result = await response_groundedness.ascore(
        response=output if isinstance(output, str) else str(output),
        retrieved_contexts=[dataset_row["reference"]],
    )
    score = float(result.value)
    return EvaluationResult(
        score=score,
        label="grounded" if score >= 0.5 else "ungrounded",
        explanation="NVIDIA ResponseGroundedness (Ragas)",
    )
```

### Run the experiment

```python theme={null}
EXPERIMENT_NAME = f"nv-ragas-experiment-example-{TIMESTAMP}"
experiment, runs_df = arize.experiments.run(
    space=SPACE_ID,
    name=EXPERIMENT_NAME,
    dataset=DATASET_NAME,
    task=task,
    evaluators={"nv_response_groundedness": nv_response_groundedness_eval},
)
print(f"Experiment: {EXPERIMENT_NAME}")
print("Flow 2 results:")
flow2_display = runs_df[
    [
        "output",
        "eval.nv_response_groundedness.score",
        "eval.nv_response_groundedness.label",
    ]
].rename(
    columns={
        "eval.nv_response_groundedness.score": "score",
        "eval.nv_response_groundedness.label": "label",
    }
)
print(flow2_display.to_string())
```

### Expected output

```text wrap theme={null}
Flow 2 results:
                             output  score       label
0   Paris is the capital of France.    1.0    grounded
1  Berlin is the capital of France.    0.0  ungrounded
```

### Verify in Arize AX

Open the **Datasets + Experiments** tab in Arize AX. The dataset `nv-ragas-experiment-example-ds-<timestamp>` and the experiment `nv-ragas-experiment-example-<timestamp>` (names printed above) appear with one run per dataset row, each carrying the `nv_response_groundedness` score and label columns.

## Troubleshooting

* **`Skipping a sample by assigning it nan score`.** The judge call failed (rate limit, model error, etc.) and Ragas swallowed the exception. Check the warning lines just above this message in stderr for the actual error.
* **`column "eval.nv_response_groundedness.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type`.** Your experiment evaluator returned a bare float, or an `EvaluationResult` with a `None` field, instead of a fully populated `EvaluationResult(score=..., label=..., explanation=...)`. Arize AX's Flight server rejects null reserved columns.
* **Spans never appear after 60s.** Span flush and ingest typically take 5–15s. If the loop times out, check that `ARIZE_SPACE_ID` and `ARIZE_API_KEY` are correct and that you're connecting to the correct region's OTLP endpoint (`otlp.arize.com` for US, `otlp.eu.arize.com` for EU).
* **Using `AnswerAccuracy` or `ContextRelevance` instead.** Swap `ResponseGroundedness` for `AnswerAccuracy` (requires a `reference`) or `ContextRelevance` (requires `user_input` and `retrieved_contexts`), and adjust the keyword arguments passed to `ascore(...)` to match. The wiring is otherwise identical. See [Ragas NVIDIA metrics docs](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/) for the per-metric required fields.
* **Using the official NVIDIA RAG-Eval suite directly (not via Ragas).** The standalone `nvidia-rag-eval` package exists but is paywalled behind NVIDIA AI Enterprise. The Ragas wrappers used here are the open, community-supported path to the same metric definitions.
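* **Experiment re-runs collide.** The dataset and experiment names embed `TIMESTAMP = int(time.time())`, so each fresh run of `combined.py` gets unique names. `TIMESTAMP` is set once in the setup block, though, so re-executing only the Flow 2 blocks in the same session, or restarting within the same second, reuses the old names. Regenerate `TIMESTAMP` first, or call `arize.experiments.delete(...)` / `arize.datasets.delete(...)` on the prior run's names.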
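On the NaN case in the first item: in Python, `float("nan") >= 0.5` evaluates to `False`, so the `score >= 0.5` threshold used in this guide would label a failed judgment as `ungrounded` if a NaN score ever reached it. A minimal sketch of a guard follows. `label_for` is a hypothetical helper and `"error"` is an illustrative label, neither defined by Ragas nor by Arize AX.

```python
import math


def label_for(score, threshold=0.5):
    """Map a groundedness score to a label, keeping failed judgments apart."""
    if score is None or math.isnan(score):
        return "error"  # illustrative label; pick whatever suits your dashboards
    return "grounded" if score >= threshold else "ungrounded"


print([label_for(s) for s in (1.0, 0.0, float("nan"))])
# ['grounded', 'ungrounded', 'error']
```

Separating failures from genuinely ungrounded responses keeps a rate-limit incident from looking like a spike in hallucinations.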

## Resources

<CardGroup>
  <Card icon="book-open" href="https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/" title="Ragas NVIDIA Metrics Documentation" horizontal />

  <Card icon="github" href="https://github.com/explodinggradients/ragas" title="Ragas on GitHub" horizontal />

  <Card icon="book-open" href="/ax/integrations/evaluation-integrations/ragas" title="Ragas (standard metrics) in Arize AX" horizontal />

  <Card icon="book-open" href="/api-clients/python/version-8/client-resources/spans#update-evaluations" title="Logging evaluations to Arize AX" horizontal />
</CardGroup>
