Evaluation

Execute code and evaluate LLM performance with precision

Span-Level Evaluation

Evaluate code functionality

Evaluate hallucination

Evaluate human ground truth vs. AI

Evaluate Q&A correctness

Evaluate RAG

Evaluate reference links

Evaluate relevance

Evaluate SQL correctness

Evaluate tool calling

Evaluate toxicity

Evaluate user frustration
