Build Observability Into Your LLM Applications

Building LLM applications is different from traditional software development. When a user says “your chatbot gave me the wrong answer,” you need to understand not just what happened, but why. Which LLM calls were made? What context was retrieved? Which tools were invoked? What was the decision-making process?

This tutorial teaches you how to add observability to your LLM applications using Arize AX. Observability means instrumenting your application so that you can understand its internal state from its external outputs. Instead of guessing where failures occur, you’ll capture detailed execution traces that show exactly what happened.

What You’ll Build

Throughout this tutorial, you’ll build a customer support agent called SupportBot with two key capabilities:
  1. Order Status Lookups - Look up customer order information using tool integration
  2. FAQ Responses - Answer common questions using RAG-based knowledge base search
SupportBot will classify incoming queries and route them to the appropriate handler while maintaining conversational context.
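The classify-and-route loop at SupportBot’s core can be sketched in plain Python. Everything below is illustrative: `classify_query`, the handler functions, and the intent labels are placeholder names (a real implementation would classify with an LLM and call real backends), not part of any Arize API.

```python
# Minimal sketch of SupportBot's classify-and-route loop.
# All names here are hypothetical placeholders for the tutorial's real code.

def classify_query(query: str) -> str:
    """Toy keyword classifier standing in for an LLM-based classifier."""
    if "order" in query.lower():
        return "order_status"
    return "faq"

def handle_order_status(query: str) -> str:
    return "Looking up your order..."  # would invoke the order-lookup tool

def handle_faq(query: str) -> str:
    return "Searching the knowledge base..."  # would run RAG retrieval

HANDLERS = {"order_status": handle_order_status, "faq": handle_faq}

def support_bot(query: str) -> str:
    intent = classify_query(query)      # 1. classify the incoming query
    return HANDLERS[intent](query)      # 2. route to the matching handler

print(support_bot("Where is my order #1234?"))
```

Once instrumented, both the classification step and the chosen handler will show up as spans in the same trace, which is what makes misrouted queries easy to spot.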

What You’ll Learn

This tutorial is organized into three progressive chapters:

Your First Traces

Capture LLM calls, tool executions, and retrieval operations. Get complete visibility into your application’s execution flow.

Annotations & Evaluations

Measure quality through human feedback and automated evaluators. Transform traces into actionable quality metrics.

Sessions

Track multi-turn conversations. Assess conversation-level coherence and identify context-loss patterns.

Chapter 1: Your First Traces

Learn how to instrument your application with OpenTelemetry and OpenInference to capture:
  • LLM call details (prompts, outputs, model names, token counts, latency)
  • Tool invocations and their parameters
  • RAG retrieval operations and document relevance
  • Complete execution traces grouped by request
The payoff: Complete visibility for debugging classification errors, tool failures, and document relevance issues.
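To make the idea concrete before the chapter introduces the real SDK, here is a dependency-free sketch of what a span records. The `traced` decorator and `TRACES` store are illustrations of the concept, not the OpenTelemetry or OpenInference API, and `fake_llm_call` is a stand-in for a real model call.

```python
import functools
import time

# Illustrative stand-in for a span exporter: each entry records the
# operation name, inputs, output, and latency -- the same kinds of fields
# a real LLM span carries (plus model name and token counts).
TRACES: list[dict] = []

def traced(name: str):
    """Record one 'span' per call of the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACES.append({
                "name": name,
                "input": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_s": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator

@traced("llm.chat")
def fake_llm_call(prompt: str) -> str:
    return f"echo: {prompt}"  # a real span would also capture model + tokens

fake_llm_call("hello")
print(TRACES[0]["name"], TRACES[0]["output"])
```

The real instrumentation covered in the chapter does this automatically for LLM, tool, and retrieval calls, and groups the resulting spans into a single trace per request.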

Chapter 2: Annotations and Evaluations

Address quality measurement through:
  • Manual human annotations to create ground truth
  • User feedback capture (thumbs up/down, escalations)
  • Automated LLM-as-Judge evaluators for scalability
  • Quality metrics like “23% of FAQ queries have irrelevant retrieval”
The payoff: Data-driven debugging at scale instead of manual trace inspection.
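At its core, an LLM-as-Judge evaluator is a function from (query, retrieved document, …) to a label, applied over many traces to produce metrics like the one above. The sketch below stubs out the judge call: `relevance_judge`, the prompt template, and the word-overlap fallback are assumptions for illustration, not Arize’s evaluator API.

```python
# Hypothetical judge prompt -- a real evaluator would send this to an LLM.
JUDGE_PROMPT = """You are grading a support bot's retrieval.
Question: {query}
Retrieved document: {document}
Answer "relevant" or "irrelevant"."""

def relevance_judge(query: str, document: str, llm=None) -> str:
    """Label one retrieved document as 'relevant' or 'irrelevant'."""
    prompt = JUDGE_PROMPT.format(query=query, document=document)
    if llm is not None:
        return llm(prompt)  # real path: ask a judge LLM
    # Offline stand-in for this demo: crude word overlap.
    overlap = set(query.lower().split()) & set(document.lower().split())
    return "relevant" if overlap else "irrelevant"

def irrelevant_rate(examples) -> float:
    """Fraction of (query, document) pairs judged irrelevant."""
    labels = [relevance_judge(q, d) for q, d in examples]
    return labels.count("irrelevant") / len(labels)

examples = [
    ("return policy", "our return policy allows 30 days"),
    ("order status", "shipping rates for europe"),
]
print(irrelevant_rate(examples))
```

Running an evaluator like this over every FAQ trace is what turns individual spans into an aggregate metric you can track over time.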

Chapter 3: Sessions

Add conversation tracking to:
  • Group related traces into multi-turn conversations
  • Track context across interactions
  • Evaluate conversation-level metrics
  • Identify where conversations break down
The payoff: Understand how your application behaves over complete customer journeys.
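Mechanically, session tracking amounts to tagging every trace with a shared session ID and aggregating per session. A minimal stdlib sketch of that grouping (the `session.id` key follows OpenInference’s attribute convention; the trace dictionaries and the turn-count metric are illustrative, not the SDK’s data model):

```python
from collections import defaultdict

# Each trace carries a session ID; grouping by it reconstructs conversations.
traces = [
    {"session.id": "s-1", "input": "Where is my order?", "output": "..."},
    {"session.id": "s-1", "input": "Can I change the address?", "output": "..."},
    {"session.id": "s-2", "input": "What is your return policy?", "output": "..."},
]

def group_by_session(traces: list[dict]) -> dict[str, list[dict]]:
    sessions = defaultdict(list)
    for t in traces:
        sessions[t["session.id"]].append(t)
    return dict(sessions)

sessions = group_by_session(traces)

# Example conversation-level metric: number of turns per session.
turns = {sid: len(ts) for sid, ts in sessions.items()}
print(turns)
```

Conversation-level evaluators then operate on these grouped lists rather than on single traces, which is how you catch failures (like lost context) that only appear across turns.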

Prerequisites

To follow along with this tutorial, you’ll need:
  • Arize AX Account - Sign up for free at app.arize.com
  • OpenAI API Key - For LLM calls (or use another supported provider)
  • Python 3.8+ or Node.js 18+ - Code examples provided in both languages
All tutorial code is available in our GitHub tutorials repository.

The Methodology

This tutorial emphasizes data-driven debugging: instead of guessing where failures occur, you’ll learn to examine captured traces to see precisely what happened. By the end, you’ll be able to:
  • Answer questions like “Why did my agent choose that tool instead of this one?”
  • Identify exactly which retrieved documents were passed to the LLM
  • Measure quality at scale with automated evaluations
  • Track customer satisfaction across complete conversations
Let’s get started! Click through to Your First Traces to begin.