Featured
LLM evaluation is a discipline where confusion reigns and foundation model builders are effectively grading their own homework. Building on the viral threads on X/Twitter, Greg Kamradt, Robert Nishihara, and Jason Lopatecki discuss highlights from Arize AI's ongoing research on how major foundation models – from OpenAI’s GPT-4 to Mistral and Anthropic’s Claude – are stacking up...