Reliability is trust at scale. Users expect AI systems to work — every time. Downtime, latency, and quietly degraded behavior erode trust faster than any single outage. Modern AI applications are some of the hardest systems to keep reliable: a single request can call a model, hit a vector store, invoke a tool, and chain back through more LLM calls before producing an answer. When something goes wrong, the failure is distributed and often silent. The goal isn’t zero failures. It’s fast detection, clear visibility, and quick recovery — and that requires observability. This video contains a walkthrough of every topic in this section:Documentation Index
Fetch the complete documentation index at: https://arize-ax.mintlify.dev/docs/llms.txt
Use this file to discover all available pages before exploring further.
The Silent Failures Problem
AI systems are inherently black boxes. When an answer is wrong, the cause is rarely obvious:- Was it a bad retrieval?
- The wrong tool call?
- A model that picked the wrong path through an agent loop?
- A subtle regression after a prompt change?
What is OpenTelemetry?
“OpenTelemetry is an observability framework and toolkit designed to facilitate the generation, export, and collection of telemetry data.”OpenTelemetry (OTel) is the industry-standard, vendor-neutral framework for collecting telemetry from any kind of software system. A few terms it’s worth pinning down:
- Observability — the ability to understand the internal state of a system by examining its outputs.
- Telemetry — the data emitted by a system. The main signals are traces, metrics, and logs; the OTel spec also covers baggage and is adding events and profiles.
- Instrumentation — the code that generates and exports telemetry.
- Auto-instrumentation — libraries that emit telemetry without any code changes from you.
What OpenTelemetry Is and Isn’t
| OpenTelemetry is | OpenTelemetry is not |
|---|---|
| A framework | A backend |
| Open source | A storage layer |
| Vendor-agnostic | A frontend or UI |
| Tool-agnostic | A visualization tool |
| Language-agnostic | |
| Infrastructure-agnostic | |
| An industry standard |
What is OpenInference?
“OpenInference is a set of conventions and plugins that is complementary to OpenTelemetry to enable tracing of AI applications.”OpenInference is an open standard for AI/LLM observability, built on top of OpenTelemetry and maintained by Arize. Where OTel gives you the generic span and trace primitives, OpenInference adds the AI-specific semantic conventions — span kinds (LLM, Tool, Agent, Retriever, etc.), attribute names (
llm.input_messages, llm.token_count.total), and the auto-instrumentor libraries that wrap popular AI frameworks.
OpenInference has the most extensive collection of auto-instrumentors for AI/LLM applications. Supported languages include Python, JavaScript, and Java, with auto-instrumentors covering:
LangChain, LangGraph, LlamaIndex, AutoGen, CrewAI, BeeAI, Vercel AI SDK, OpenAI, AWS Bedrock, Anthropic, Groq, LiteLLM, Portkey, Google ADK, Pydantic AI, Mistral, VertexAI, DSPy, Instructor, Haystack, Guardrails AI, Semantic Kernel, Microsoft Agent Framework, Smolagents, AWS Strands, Together AI, Pipecat, MCP — and more.
For the full list, see Integrations.
How OpenTelemetry and OpenInference Fit Together
OTel defines how telemetry is shaped, transported, and exported. OpenInference defines what AI-specific data to put inside it. Together they give you:- 30+ native integrations across LLM providers and agent frameworks.
- Portability — your instrumentation isn’t locked to Arize AX. It works with any OTel-compatible backend.
- A standardized format for AI trace data, so tools and backends understand it without custom mapping.
