User Guide

Unified Platform across Development & Production

Arize AI provides an end-to-end platform for building, evaluating, and monitoring LLM-powered applications at scale.

Development

  • Flexible Dataset Curation: Create datasets from spans, programmatically via code, or by uploading a CSV.

  • Leverage Production Data: Pull real-world interactions from production into datasets to refine prompts and responses.

  • Code-Based Experimentation: Run experiments programmatically to test different LLM models, retrieval strategies, and other parameters (see the sketch after this list).

  • Interactive Playground: Prototype and iterate on prompts and agent behaviors in the Prompt Playground before deployment.

  • Prompt Hub: Store, version, and compare prompts in a central repository for better tracking and collaboration.

  • Validate Experiment Outcomes: Ensure that modifications to prompts, models, or retrieval strategies result in measurable improvements.

  • Compare Across Experiments: Standardize evaluation criteria to compare different iterations, models, or fine-tuned variants objectively.

  • Automate CI/CD Workflows: Create experiments that automatically validate changes, whether to a prompt, model, or function, using a curated dataset and your preferred evaluation method.
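
As a rough illustration of the dataset-curation and code-based experimentation items above, the sketch below creates a small dataset in code and runs an experiment with a task function and an evaluator. The client name, method names, and parameters are assumptions modeled on the Arize Python SDK (`arize.experimental.datasets`), not a verbatim reference; consult the SDK documentation for the exact interface.

```python
# Sketch of the code-based dataset + experiment workflow. Class, method, and
# argument names below are assumptions; check the Arize SDK docs for specifics.
import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient  # assumed import path

client = ArizeDatasetsClient(api_key="YOUR_ARIZE_API_KEY")  # assumed constructor args

# Curate a small dataset in code (it could also be built from spans or a CSV upload).
examples = pd.DataFrame(
    {
        "input": ["What is the refund window?", "How do I reset my password?"],
        "expected_output": ["30 days", "Use the password reset link"],
    }
)

dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_name="support-questions-v1",
    data=examples,
)

def task(example) -> str:
    # Replace with your real model / retrieval call; this stub just echoes the input.
    return f"Stub answer for: {example['input']}"

def contains_expected(output, example) -> float:
    # Simple string-match evaluator; the callback signature shown is an assumption.
    return float(example["expected_output"].lower() in output.lower())

client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id=dataset_id,
    task=task,
    evaluators=[contains_expected],
    experiment_name="prompt-v2-vs-baseline",
)
```

The same pattern can be wired into CI/CD so that each change to a prompt, model, or function is validated against a curated dataset before it ships.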

Production

  • Full LLM Trace Logging: Capture the entire LLM workflow, including function calls, retrieved context, chain-of-thought reasoning, and response outputs (see the instrumentation sketch after this list).

  • Deep Debugging: Quickly identify latency bottlenecks, poor retrievals, and prompt issues; understand agentic paths; replay voice assistant sessions; and more.

  • Automated Real-Time Evaluation: Set up evaluation tasks that automatically tag new spans with evaluation labels as soon as the data arrives in the platform.

  • Track Application Outputs: Detect performance degradation, increased latency, or undesirable outputs (e.g., hallucinations, toxicity) on live application data.

  • Annotations: Combine human expertise with automated workflows to generate high-quality labels and annotations.

  • Real-Time Application Monitoring: Set up alerts and dashboards to track LLM latency, token usage, evaluations, and more.

  • Guardrails Against Poor Outputs: Automatically flag and prevent inappropriate, biased, or hallucinated responses before they reach users.
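
As a rough illustration of the trace-logging item above, the sketch below instruments an OpenAI-based application and exports spans via OpenTelemetry. The `register` helper and `OpenAIInstrumentor` are assumptions based on the arize-otel and openinference packages; exact imports and parameters may differ in current releases.

```python
# Sketch of trace logging with OpenTelemetry auto-instrumentation. Package and
# function names (arize.otel.register, OpenAIInstrumentor) are assumptions;
# check the current arize-otel and openinference documentation.
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
import openai

# Point an OpenTelemetry tracer provider at your Arize space and project.
tracer_provider = register(
    space_id="YOUR_SPACE_ID",
    api_key="YOUR_ARIZE_API_KEY",
    project_name="support-assistant",
)

# Auto-instrument the OpenAI client so each call is captured as a span with
# prompts, responses, token counts, and latency attached as attributes.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Application code below runs unchanged; its LLM calls are now traced.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(response.choices[0].message.content)
```

Once spans are flowing, the evaluation tasks, monitors, and guardrails described above can be configured in the platform without further code changes.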
