Pre-Built Evals

The following are simple functions on top of the LLM evals building blocks that are pre-tested with benchmark data.

All evals templates are tested against golden data that are available as part of the LLM eval library's benchmarked data and target precision at 70-90% and F1 at 70-85%.

Hallucination Eval

Hallucinations on answers to public and private data

Tested on:

Hallucination QA Dataset, Hallucination RAG Dataset

Q&A Eval

Private data Q&A Eval

Tested on:

WikiQA

Retrieval Eval

RAG individual retrieval

Tested on:

MS Marco, WikiQA

Summarization Eval

Summarization performance

Tested on:

GigaWorld, CNNDM, Xsum

Code Generation Eval

Code writing correctness and readability

Tested on:

WikiSQL, HumanEval, CodeXGlu

Toxicity Eval

Is the AI response racist, biased or toxic

Tested on:

WikiToxic

AI vs. Human

Compare human and AI answers

Reference Link

Check citations

User Frustration

Detect user frustration

SQL Generation

Evaluate SQL correctness given a query

Agent Function Calling

Agent tool use and parameters

Audio Emotion

Classify emotions from audio files

Supported Models.

The models are instantiated and usable in the LLM Eval function. The models are also directly callable with strings.

model = OpenAIModel(model_name="gpt-4",temperature=0.6)
model("What is the largest costal city in France?")

We currently support a growing set of models for LLM Evals, please check out the Eval Models section for usage.

PreviousHow to: Evals NextHallucinations

Last updated 2 months ago

Was this helpful?