Build Observability Into Your LLM Applications
Building LLM applications is different from traditional software development. When a user says “your chatbot gave me the wrong answer,” you need to understand not just what happened, but why it happened. Which LLM calls were made? What context was retrieved? Which tools were invoked? What was the decision-making process? This tutorial teaches you how to add observability to your LLM applications using Arize AX. Observability means instrumenting your application so you can understand its internal state from its external outputs. Instead of guessing where failures occur, you’ll learn to capture detailed execution traces that show you exactly what happened.

What You’ll Build
Throughout this tutorial, you’ll build a customer support agent called SupportBot with two key capabilities:
- Order Status Lookups - Look up customer order information using tool integration
- FAQ Responses - Answer common questions using RAG-based knowledge base search
What You’ll Learn
This tutorial is organized into three progressive chapters:

Your First Traces
Capture LLM calls, tool executions, and retrieval operations. Get complete
visibility into your application’s execution flow.
Annotations & Evaluations
Measure quality through human feedback and automated evaluators. Transform
traces into actionable quality metrics.
Sessions
Track multi-turn conversations. Assess conversation-level coherence and
identify context-loss patterns.
Chapter 1: Your First Traces
Learn how to instrument your application with OpenTelemetry and OpenInference to capture:
- LLM call details (prompts, outputs, model names, token counts, latency)
- Tool invocations and their parameters
- RAG retrieval operations and document relevance
- Complete execution traces grouped by request
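To make the span data above concrete, here is a minimal, dependency-free sketch of what one captured LLM span looks like. This is not the Arize AX or OpenInference SDK API (in practice the auto-instrumentors record this for you); the `Span` class and `record_llm_call` helper are illustrative, though the attribute keys are modeled on OpenInference semantic conventions.

```python
# Conceptual sketch of the data one LLM span carries. The Span class and
# record_llm_call helper are hypothetical; attribute names are modeled on
# OpenInference semantic conventions, which the real instrumentors emit.
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str                                  # "LLM", "TOOL", or "RETRIEVER"
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0

    @property
    def latency_ms(self) -> float:
        return (self.end - self.start) * 1000


def record_llm_call(prompt: str, output: str, model: str,
                    prompt_tokens: int, completion_tokens: int) -> Span:
    """Build the kind of span an auto-instrumentor would emit for one call."""
    span = Span(name="chat_completion", kind="LLM")
    span.start = time.time()
    span.attributes = {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "input.value": prompt,
        "output.value": output,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
    }
    span.end = time.time()
    return span


span = record_llm_call("Where is order #1234?", "It shipped yesterday.",
                       "gpt-4o-mini", prompt_tokens=12, completion_tokens=6)
print(span.attributes["llm.model_name"])
```

In Chapter 1 you won't build spans by hand like this; you'll register a tracer and let the instrumentation libraries populate these attributes automatically on every request.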
Chapter 2: Annotations and Evaluations
Address quality measurement through:
- Manual human annotations to create ground truth
- User feedback capture (thumbs up/down, escalations)
- Automated LLM-as-Judge evaluators for scalability
- Quality metrics like “23% of FAQ queries have irrelevant retrieval”
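As a sketch of how evaluation labels become a metric like the one above, the snippet below aggregates per-trace evaluator output into a retrieval-relevance rate. The record shape and label names (`query_type`, `retrieval_relevance`) are assumptions for illustration, not Arize AX's actual schema.

```python
# Minimal sketch: turn per-trace evaluation labels into a quality metric.
# The record fields and label values here are illustrative assumptions.
from collections import Counter

evals = [
    {"query_type": "faq", "retrieval_relevance": "relevant"},
    {"query_type": "faq", "retrieval_relevance": "irrelevant"},
    {"query_type": "order_status", "retrieval_relevance": "relevant"},
    {"query_type": "faq", "retrieval_relevance": "relevant"},
]

# Slice to FAQ traffic, then count evaluator labels.
faq = [e for e in evals if e["query_type"] == "faq"]
counts = Counter(e["retrieval_relevance"] for e in faq)
irrelevant_rate = counts["irrelevant"] / len(faq)

print(f"{irrelevant_rate:.0%} of FAQ queries have irrelevant retrieval")
```

In Chapter 2, the labels come from human annotations or an LLM-as-Judge evaluator rather than hard-coded records, but the aggregation step is the same idea.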
Chapter 3: Sessions
Add conversation tracking to:
- Group related traces into multi-turn conversations
- Track context across interactions
- Evaluate conversation-level metrics
- Identify where conversations break down
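The grouping step can be sketched in a few lines: if every trace carries a session identifier, reassembling conversations is a group-by. The `session.id` key mirrors the attribute OpenInference uses for sessions; the trace dicts themselves are illustrative, not an SDK type.

```python
# Sketch of grouping traces into sessions. Assumes each trace carries a
# session identifier (keyed here as "session.id", after the OpenInference
# convention); the trace dicts are illustrative stand-ins for real spans.
from collections import defaultdict

traces = [
    {"session.id": "s-1", "input": "Where is my order?"},
    {"session.id": "s-1", "input": "It was order #1234."},
    {"session.id": "s-2", "input": "What's your return policy?"},
]

# Group traces by session so each value is one multi-turn conversation.
sessions = defaultdict(list)
for trace in traces:
    sessions[trace["session.id"]].append(trace)

for sid, turns in sorted(sessions.items()):
    print(sid, "->", len(turns), "turn(s)")
```

Once traces are grouped this way, conversation-level checks (did the agent retain the order number from turn one?) become questions you can ask of each session rather than of isolated traces.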
Prerequisites
To follow along with this tutorial, you’ll need:
- Arize AX Account - Sign up for free at app.arize.com
- OpenAI API Key - For LLM calls (or use another supported provider)
- Python 3.8+ or Node.js 18+ - Code examples provided in both languages
All tutorial code is available in our GitHub tutorials repository.
The Methodology
This tutorial emphasizes data-driven debugging: instead of guessing where failures occur, you’ll learn to examine captured traces to see precisely what happened. By the end, you’ll be able to:
- Answer questions like “Why did my agent choose that tool instead of this one?”
- Identify exactly which retrieved documents were passed to the LLM
- Measure quality at scale with automated evaluations
- Track customer satisfaction across complete conversations