User Guide

Unified Platform across Development & Production

Arize AI provides an end-to-end platform for building, evaluating, and monitoring LLM-powered applications at scale.

Development

  • Flexible Dataset Curation: Create datasets from spans, programmatically via code, or by uploading a CSV.

  • Leverage Production Data: Pull real-world interactions from production into datasets to refine prompts and responses.

  • Code-Based Experimentation: Run experiments programmatically to test different LLM models, retrieval strategies, and other parameters (see the sketch after this list).

  • Interactive Playground: Prototype and iterate on prompts and agent behaviors in the Prompt Playground before deployment.

  • Prompt Hub: Store, version, and compare prompts in a central repository for better tracking and collaboration.

  • Validate Experiment Outcomes: Ensure that modifications to prompts, models, or retrieval strategies result in measurable improvements.

  • Compare Across Experiments: Standardize evaluation criteria to compare different iterations, models, or fine-tuned variants objectively.

  • Automate CI/CD Workflows: Create experiments that automatically validate changes, whether to a prompt, model, or function, using a curated dataset and your preferred evaluation method.
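
As a rough illustration of the dataset-curation and code-based experimentation items above, the sketch below creates a small dataset in code and runs an experiment with a task function and an evaluator. The client name, method names, and parameters are assumptions modeled on the Arize Python SDK (`arize.experimental.datasets`), not a verbatim reference; consult the SDK documentation for the exact interface.

```python
# Sketch of the code-based dataset + experiment workflow. Class, method, and
# argument names below are assumptions; check the Arize SDK docs for specifics.
import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient  # assumed import path

client = ArizeDatasetsClient(api_key="YOUR_ARIZE_API_KEY")  # assumed constructor args

# Curate a small dataset in code (it could also be built from spans or a CSV upload).
examples = pd.DataFrame(
    {
        "input": ["What is the refund window?", "How do I reset my password?"],
        "expected_output": ["30 days", "Use the password reset link"],
    }
)

dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_name="support-questions-v1",
    data=examples,
)

def task(example) -> str:
    # Replace with your real model / retrieval call; this stub just echoes the input.
    return f"Stub answer for: {example['input']}"

def contains_expected(output, example) -> float:
    # Simple string-match evaluator; the callback signature shown is an assumption.
    return float(example["expected_output"].lower() in output.lower())

client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id=dataset_id,
    task=task,
    evaluators=[contains_expected],
    experiment_name="prompt-v2-vs-baseline",
)
```

The same pattern can be wired into CI/CD so that each change to a prompt, model, or function is validated against a curated dataset before it ships.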

Production

  • Full LLM Trace Logging: Capture the entire LLM workflow, including function calls, retrieved context, chain-of-thought reasoning, and response outputs (see the instrumentation sketch after this list).

  • Deep Debugging: Quickly identify latency bottlenecks, poor retrievals, and prompt issues; understand agentic paths; replay voice assistant sessions; and more.

  • Automated Real-Time Evaluation: Set up evaluation tasks that automatically tag new spans with evaluation labels as soon as the data arrives in the platform.

  • Track Application Outputs: Detect performance degradation, increased latency, or undesirable outputs (e.g., hallucinations, toxicity) on live application data.

  • Annotations: Combine human expertise with automated workflows to generate high-quality labels and annotations.

  • Real-Time Application Monitoring: Set up alerts and dashboards to track LLM latency, token usage, evaluations, and more.

  • Guardrails Against Poor Outputs: Automatically flag and prevent inappropriate, biased, or hallucinated responses before they reach users.
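
As a rough illustration of the trace-logging item above, the sketch below instruments an OpenAI-based application and exports spans via OpenTelemetry. The `register` helper and `OpenAIInstrumentor` are assumptions based on the arize-otel and openinference packages; exact imports and parameters may differ in current releases.

```python
# Sketch of trace logging with OpenTelemetry auto-instrumentation. Package and
# function names (arize.otel.register, OpenAIInstrumentor) are assumptions;
# check the current arize-otel and openinference documentation.
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
import openai

# Point an OpenTelemetry tracer provider at your Arize space and project.
tracer_provider = register(
    space_id="YOUR_SPACE_ID",
    api_key="YOUR_ARIZE_API_KEY",
    project_name="support-assistant",
)

# Auto-instrument the OpenAI client so each call is captured as a span with
# prompts, responses, token counts, and latency attached as attributes.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Application code below runs unchanged; its LLM calls are now traced.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(response.choices[0].message.content)
```

Once spans are flowing, the evaluation tasks, monitors, and guardrails described above can be configured in the platform without further code changes.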
