- Answer Accuracy
- Context Relevance
- Response Groundedness
- Build a RAG pipeline using LlamaIndex
- Create a test dataset for evaluation
- Run 3 experiments with varying parameters
- Evaluate using NVIDIA metrics (AnswerAccuracy, ContextRelevance, ResponseGroundedness); a minimal scoring sketch follows this list
- View comprehensive analysis and compare results in the Arize platform
- Analyze how retrieval count and chunk size impact evaluation metrics
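For the evaluation step, the sketch below shows how a single RAG output could be scored with the Ragas implementations of these metrics. It is a minimal illustration, not the notebook's exact code: the judge model, the sample texts, and the OpenAI-backed evaluator are illustrative assumptions.

```python
# Minimal sketch: scoring one RAG output with the Ragas NVIDIA metrics.
# Assumes `ragas` and `langchain-openai` are installed and OPENAI_API_KEY is set;
# the judge model below is an illustrative choice.
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerAccuracy, ContextRelevance, ResponseGroundedness

# Judge LLM that fills in the metric rating templates.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# One RAG example: user query, pipeline answer, retrieved contexts, and a reference answer.
sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Albert Einstein was born in 1879.",
    reference="Albert Einstein was born in 1879.",
    retrieved_contexts=["Albert Einstein (born 14 March 1879) was a theoretical physicist."],
)

async def score(sample: SingleTurnSample) -> dict:
    """Run all three NVIDIA metrics on a single sample and return their scores."""
    metrics = {
        "answer_accuracy": AnswerAccuracy(llm=evaluator_llm),
        "context_relevance": ContextRelevance(llm=evaluator_llm),
        "response_groundedness": ResponseGroundedness(llm=evaluator_llm),
    }
    return {name: await metric.single_turn_ascore(sample) for name, metric in metrics.items()}

print(asyncio.run(score(sample)))
```

In the tutorial, the same scoring is applied to every question in the test dataset for each of the three experiments, so the parameter settings can be compared side by side in Arize.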
Colab Notebook Tutorial
How NVIDIA Metrics Are Calculated:
The following approach applies to the AnswerAccuracy, ContextRelevance, and ResponseGroundedness metrics.
Step 1: The LLM generates ratings using two distinct templates to ensure robustness:
- Template 1: The LLM compares the response with the reference and rates it on a scale of 0, 2, or 4.
- Template 2: The LLM evaluates the same question again, but this time the roles of the response and the reference are swapped. This dual-perspective approach helps ensure a fair assessment of the answer's accuracy.
Step 2: Each template's rating is converted to the [0, 1] scale, and the two scores are averaged to produce the final metric value.
Example Calculation:
- User Input: “When was Einstein born?”
- Response: “Albert Einstein was born in 1879.”
- Reference: “Albert Einstein was born in 1879.”

Assuming both templates return a rating of 4 (indicating an exact match), the conversion works as follows: a rating of 4 corresponds to 1 on the [0, 1] scale, and averaging the two scores gives (1 + 1) / 2 = 1. Thus, the final Answer Accuracy score is 1.
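Expressed as code, this conversion is just a divide-by-4 followed by an average of the two template ratings. The helper below is a standalone illustration of that arithmetic (assuming ratings scale linearly onto [0, 1]), not a library function.

```python
def final_score(rating_template_1: int, rating_template_2: int) -> float:
    """Map each 0/2/4 template rating onto [0, 1] and average the two perspectives."""
    return (rating_template_1 / 4 + rating_template_2 / 4) / 2

# Both templates rated the Einstein answer 4 (exact match), so the final score is 1.0.
assert final_score(4, 4) == 1.0
```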
Resources
- More details - RAGAS NVIDIA Metrics
- NVIDIA Metrics and Templates - GitHub repo