LLM evaluation is a discipline where confusion reigns and foundation model builders are effectively grading their own homework.
Building on viral threads on X/Twitter, Greg Kamradt, Robert Nishihara, and Jason Lopatecki discuss highlights from Arize AI’s ongoing research into how major foundation models – from OpenAI’s GPT-4 to Mistral and Anthropic’s Claude – stack up against each other on important tasks and emerging LLM use cases. They walk through results from Needle in a Haystack tests and other evals – hallucination detection on private data, question answering, code functionality, and more – and explain why those results matter.
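For context, a Needle in a Haystack test plants a small fact (the “needle”) at varying depths inside a long distractor context and checks whether the model can retrieve it. The Python sketch below is a minimal illustration of that setup, not Arize’s actual harness; `query_model` is a hypothetical stand-in for whatever LLM API you call.

```python
# Minimal Needle in a Haystack sketch (illustrative only; real harnesses
# vary context length, needle placement, and scoring far more rigorously).

NEEDLE = "The secret ingredient in the recipe is cardamom."
QUESTION = "What is the secret ingredient in the recipe?"
EXPECTED = "cardamom"

def build_haystack(filler: str, needle: str, depth: float, n_chars: int) -> str:
    """Embed the needle at a relative depth (0.0 = start, 1.0 = end)
    inside n_chars of distractor text."""
    haystack = (filler * (n_chars // len(filler) + 1))[:n_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + " " + needle + " " + haystack[cut:]

def run_test(query_model, filler: str,
             depths=(0.0, 0.25, 0.5, 0.75, 1.0),
             n_chars: int = 50_000) -> dict:
    """Check retrieval at several insertion depths; returns pass/fail per depth.
    `query_model` is assumed to take a prompt string and return the model's reply."""
    results = {}
    for depth in depths:
        context = build_haystack(filler, NEEDLE, depth, n_chars)
        prompt = f"{context}\n\nAnswer using only the text above.\n{QUESTION}"
        answer = query_model(prompt)
        results[depth] = EXPECTED.lower() in answer.lower()
    return results
```

Sweeping both context length and needle depth like this produces the retrieval heatmaps the tests are known for, and it is where models that look identical on short prompts start to diverge.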
Curious which foundation models your company should be using for a specific use case – and which to avoid? You won’t want to miss this meetup!