Evaluation
Execute code and evaluate LLM performance with precision
Span-Level Evaluation
Evaluate code functionality
Evaluate hallucination
Evaluate human ground truth vs. AI
Evaluate Q&A correctness
Evaluate RAG
Evaluate reference links
Evaluate relevance
Evaluate SQL correctness
Evaluate tool calling
Evaluate toxicity
Evaluate user frustration