May 9th & 16th
10:00am – 10:45am PST
Virtual
Join Arize AI’s Co-Founders for a virtual event dedicated to exploring the latest frontiers in evaluating large language models (LLMs) for complex tasks. This event will feature two insightful sessions, each delving into a unique and exciting application of LLM evaluation:
Session 1 | SQL Generation Evals: LLMs-as-a-Judge
LLM-as-a-Judge is a popular and scalable technique for evaluating LLMs on tasks such as toxicity classification, sentiment classification, and text-to-SQL generation. However, LLM-as-a-Judge evaluation has certain limitations and points of contention: a circular methodology (using one LLM to evaluate another LLM) and a disregard for the underlying database schema and data distribution. In this session, we will discuss an experiment we designed to evaluate the performance of the LLM-as-a-Judge eval for text-to-SQL tasks. We’ll take you through a framework for comparing the LLM-as-a-Judge approach with a data distribution-based eval approach for text-to-SQL tasks. We will also discuss some interesting cases that came up in our research highlighting the pitfalls of the LLM-as-a-Judge approach, along with suggestions on how it can be enhanced to account for those limitations.
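To make the pattern under discussion concrete, here is a minimal sketch of an LLM-as-a-Judge eval for text-to-SQL. The prompt wording, judge model, and `judge_sql` helper are illustrative assumptions for this post, not the framework presented in the session:

```python
# Minimal LLM-as-a-Judge sketch for text-to-SQL (illustrative only).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the prompt and model name are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a text-to-SQL system.

Question: {question}
Database schema: {schema}
Generated SQL: {sql}

Answer with a single word, "correct" or "incorrect", judging whether the
SQL answers the question against the schema."""

def judge_sql(question: str, schema: str, sql: str) -> str:
    """Ask a judge LLM to label a generated query. Note the circularity
    the session highlights: one LLM grades another LLM's output, and
    unless the schema is passed in the prompt, the judge never sees it."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, schema=schema, sql=sql
            ),
        }],
        temperature=0,  # keep judgments as deterministic as possible
    )
    return response.choices[0].message.content.strip().lower()

# Example usage
label = judge_sql(
    question="How many orders were placed in 2023?",
    schema="orders(id INT, placed_at TIMESTAMP, total NUMERIC)",
    sql="SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM placed_at) = 2023;",
)
print(label)  # expected: "correct"
```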
Session 2 | LLM Evals for Router-Based Architectures
The second session in our series will explore how to effectively evaluate LLMs within router-based AI architectures. Router networks dynamically route inputs to specialized LLM components, enabling more efficient and capable systems. However, evaluating the performance of these complex architectures presents unique challenges. In this session, we’ll cover key considerations and best practices for LLM evaluation in router setups.
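For a sense of what such a system looks like, here is a minimal router sketch; the route names, classifier prompt, and stubbed handlers are hypothetical, chosen only to show where per-route and end-to-end evals would attach:

```python
# Minimal router sketch (illustrative). A small classifier LLM picks a
# route, and each route is a specialized component. Evaluating such a
# system means scoring both the routing decision and each component's
# output, which is the challenge the session addresses.
from openai import OpenAI

client = OpenAI()

ROUTES = ["sql_generation", "summarization", "general_qa"]  # hypothetical routes

def choose_route(user_input: str) -> str:
    """Classify the input into one of the routes using a classifier LLM."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder router model
        messages=[{
            "role": "user",
            "content": (
                "Classify this request into exactly one of "
                f"{ROUTES}. Reply with the route name only.\n\n{user_input}"
            ),
        }],
        temperature=0,
    )
    route = response.choices[0].message.content.strip()
    return route if route in ROUTES else "general_qa"  # fall back on bad output

def handle(user_input: str) -> tuple[str, str]:
    """Route the input and return (route, response) so evals can score
    routing accuracy separately from response quality."""
    route = choose_route(user_input)
    # Each route would call its own specialized model/prompt; stubbed here.
    handlers = {
        "sql_generation": lambda q: f"[SQL component handles: {q}]",
        "summarization": lambda q: f"[Summarizer handles: {q}]",
        "general_qa": lambda q: f"[General QA handles: {q}]",
    }
    return route, handlers[route](user_input)

route, answer = handle("Write a query counting 2023 orders.")
print(route, "->", answer)  # the routing decision is itself an eval target
```

Returning the route alongside the response is a deliberate choice here: it lets an eval pipeline grade the routing decision and the component output as separate metrics rather than only scoring the final answer.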