- Answer Accuracy
- Context Relevance
- Response Groundedness
- Build a RAG pipeline using LlamaIndex
- Create a test dataset for evaluation
- Run 3 experiments with varying parameters
- Evaluate using NVIDIA metrics (AnswerAccuracy, ContextRelevance, ResponseGroundedness); a minimal scoring sketch follows this list
- View comprehensive analysis and compare results in the Arize platform
- Analyze how retrieval count and chunk size impact evaluation metrics
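For the evaluation step, the sketch below shows how a single RAG output could be scored with the Ragas implementations of these metrics. It is a minimal illustration, not the notebook's exact code: the judge model, the sample texts, and the OpenAI-backed evaluator are illustrative assumptions.

```python
# Minimal sketch: scoring one RAG output with the Ragas NVIDIA metrics.
# Assumes `ragas` and `langchain-openai` are installed and OPENAI_API_KEY is set;
# the judge model below is an illustrative choice.
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerAccuracy, ContextRelevance, ResponseGroundedness

# Judge LLM that fills in the metric rating templates.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# One RAG example: user query, pipeline answer, retrieved contexts, and a reference answer.
sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Albert Einstein was born in 1879.",
    reference="Albert Einstein was born in 1879.",
    retrieved_contexts=["Albert Einstein (born 14 March 1879) was a theoretical physicist."],
)

async def score(sample: SingleTurnSample) -> dict:
    """Run all three NVIDIA metrics on a single sample and return their scores."""
    metrics = {
        "answer_accuracy": AnswerAccuracy(llm=evaluator_llm),
        "context_relevance": ContextRelevance(llm=evaluator_llm),
        "response_groundedness": ResponseGroundedness(llm=evaluator_llm),
    }
    return {name: await metric.single_turn_ascore(sample) for name, metric in metrics.items()}

print(asyncio.run(score(sample)))
```

In the tutorial, the same scoring is applied to every question in the test dataset for each of the three experiments, so the parameter settings can be compared side by side in Arize.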
Colab Notebook Tutorial
How NVIDIA Metrics Are Calculated:
The following approach applies to the AnswerAccuracy, ContextRelevance, and ResponseGroundedness metrics.
Step 1: The LLM generates ratings using two distinct templates to ensure robustness:
- Template 1: The LLM compares the response with the reference and rates it on a scale of 0, 2, or 4.
- Template 2: The LLM evaluates the same question again, but this time the roles of the response and the reference are swapped. This dual-perspective approach helps ensure a fair assessment of the answer's accuracy.
Step 2: Each template's rating is converted to the [0, 1] scale, and the two scores are averaged to produce the final metric value.
Example Calculation:
- User Input: “When was Einstein born?”
- Response: “Albert Einstein was born in 1879.”
- Reference: “Albert Einstein was born in 1879.”

Assuming both templates return a rating of 4 (indicating an exact match), the conversion works as follows: a rating of 4 corresponds to 1 on the [0, 1] scale, and averaging the two scores gives (1 + 1) / 2 = 1. Thus, the final Answer Accuracy score is 1.
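Expressed as code, this conversion is just a divide-by-4 followed by an average of the two template ratings. The helper below is a standalone illustration of that arithmetic (assuming ratings scale linearly onto [0, 1]), not a library function.

```python
def final_score(rating_template_1: int, rating_template_2: int) -> float:
    """Map each 0/2/4 template rating onto [0, 1] and average the two perspectives."""
    return (rating_template_1 / 4 + rating_template_2 / 4) / 2

# Both templates rated the Einstein answer 4 (exact match), so the final score is 1.0.
assert final_score(4, 4) == 1.0
```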
Resources
- More details - RAGAS NVIDIA Metrics
- NVIDIA Metrics and Templates - GitHub repo