Deploy and orchestrate AI agents at scale: governed, observable, and integrated for enterprise transformation. Microsoft Foundry offers a rich library of enterprise-grade evaluation capabilities, such as its Risk and Safety evaluators, while Arize AX delivers observability, evaluation, and experimentation workflows for continuous improvement. Combined, they let organizations close the loop between insight and action, turning Responsible AI from policy into practice. The result is a continuous feedback system in which the same evaluators that power offline testing also monitor live production traffic, and data moves seamlessly from trace logs to evaluation results to experiment dashboards.

This tutorial follows examples illustrated in this blog:

Blog: Evaluating and Improving AI Agents at Scale with Microsoft Foundry

This tutorial is organized into two sections:

1. Azure AI Foundry and Arize for Agent Observability and Evaluation

This notebook demonstrates how to:
  1. Build a LangChain multi-chain agent on Azure AI Foundry while tracing all operations to Arize for observability
  2. Leverage Microsoft Risk and Safety Evaluators to evaluate LLM behavior
  3. Log evaluation results to Arize for visibility

Notebook Tutorial - Foundry Agent Observability and Evaluation
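The tracing setup in step 1 follows the standard Arize OpenTelemetry + OpenInference pattern. Below is a minimal sketch rather than the notebook's exact code; the space ID, API key, Azure endpoint, deployment name, API version, and project name are placeholders you would replace with your own values:

```python
# Minimal sketch: trace a LangChain chain running on Azure OpenAI to Arize AX.
# All credentials, endpoints, and names below are placeholders.
from arize.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Register an OpenTelemetry tracer provider that exports spans to Arize AX
tracer_provider = register(
    space_id="YOUR_ARIZE_SPACE_ID",
    api_key="YOUR_ARIZE_API_KEY",
    project_name="foundry-agent-observability",
)

# Auto-instrument LangChain so every chain and LLM call becomes a span
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# Azure OpenAI-backed chat model (deployment name and API version are placeholders)
llm = AzureChatOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com/",
    api_key="YOUR_AZURE_OPENAI_KEY",
    azure_deployment="gpt-4o",
    api_version="2024-06-01",
)

prompt = ChatPromptTemplate.from_template(
    "Summarize the following support ticket:\n{ticket}"
)
chain = prompt | llm

# Every invocation is now traced to the Arize project registered above
chain.invoke({"ticket": "My order arrived damaged and I need a replacement."})
```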

Screenshots: the Arize AX agent graph view with aggregate span-level evaluation performance; a Microsoft hate and unfairness evaluation metric attached to a span; and a summary dashboard with key observability metrics and evaluation KPIs.
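Steps 2 and 3 pair the Azure AI Evaluation SDK's risk and safety evaluators with Arize's span-evaluation logging. The following is a minimal sketch assuming the `azure-ai-evaluation` and `arize` Python packages; the Azure project identifiers and span ID are placeholders, and the result keys and the `log_evaluations_sync` call reflect current SDK documentation and may vary by version:

```python
# Minimal sketch: score a traced response with the Hate/Unfairness evaluator
# and attach the result to its span in Arize AX. Identifiers are placeholders.
import pandas as pd
from azure.ai.evaluation import HateUnfairnessEvaluator
from azure.identity import DefaultAzureCredential
from arize.pandas.logger import Client

# Azure AI Foundry project that hosts the risk and safety evaluation service
azure_ai_project = {
    "subscription_id": "YOUR_SUBSCRIPTION_ID",
    "resource_group_name": "YOUR_RESOURCE_GROUP",
    "project_name": "YOUR_FOUNDRY_PROJECT",
}

evaluator = HateUnfairnessEvaluator(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
)

# Evaluate one query/response pair pulled from a traced span
result = evaluator(
    query="Summarize the following support ticket: ...",
    response="Customer reports a damaged order and requests a replacement.",
)

# Shape the result into Arize's span-evaluation format; the hate_unfairness
# result keys and log_evaluations_sync follow current docs and may vary by version
evals_df = pd.DataFrame(
    {
        "context.span_id": ["SPAN_ID_FROM_ARIZE_TRACE"],
        "eval.hate_unfairness.label": [result["hate_unfairness"]],
        "eval.hate_unfairness.score": [result["hate_unfairness_score"]],
        "eval.hate_unfairness.explanation": [result["hate_unfairness_reason"]],
    }
)

arize_client = Client(space_id="YOUR_ARIZE_SPACE_ID", api_key="YOUR_ARIZE_API_KEY")
arize_client.log_evaluations_sync(evals_df, "foundry-agent-observability")
```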

2. Azure Risk and Safety Evaluators on Arize Datasets+Experiments

This notebook demonstrates how to use Azure Risk and Safety Evaluators with Arize Datasets + Experiments to track and visualize experiments and evaluations in Arize AX. We use the Hate and Unfairness Evaluator to evaluate the output of an Azure AI Foundry agent.

Notebook Tutorial - Using Foundry Evaluators on Arize Datasets + Experiments
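Below is a minimal sketch of that flow, assuming the Arize Datasets + Experiments client and the same Hate and Unfairness evaluator as above. Client construction, constants, and function signatures follow current SDK documentation and may differ by version, and `call_foundry_agent` is a hypothetical stand-in for the actual Azure AI Foundry agent call:

```python
# Minimal sketch: run the Hate/Unfairness evaluator over a dataset of agent
# outputs with Arize Datasets + Experiments. Credentials and names are placeholders.
import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.experiments.types import EvaluationResult
from arize.experimental.datasets.utils.constants import GENERATIVE
from azure.ai.evaluation import HateUnfairnessEvaluator
from azure.identity import DefaultAzureCredential

SPACE_ID = "YOUR_ARIZE_SPACE_ID"
client = ArizeDatasetsClient(api_key="YOUR_ARIZE_API_KEY")

# A small dataset of prompts to send through the Foundry agent
df = pd.DataFrame({"query": ["How do I reset my password?", "Cancel my subscription."]})
dataset_id = client.create_dataset(
    space_id=SPACE_ID,
    dataset_name="foundry-agent-prompts",
    dataset_type=GENERATIVE,
    data=df,
)

def task(dataset_row) -> str:
    # Hypothetical helper standing in for the Azure AI Foundry agent call;
    # it should return the agent's response text for the given query
    return call_foundry_agent(dataset_row["query"])

hate_unfairness = HateUnfairnessEvaluator(
    azure_ai_project={
        "subscription_id": "YOUR_SUBSCRIPTION_ID",
        "resource_group_name": "YOUR_RESOURCE_GROUP",
        "project_name": "YOUR_FOUNDRY_PROJECT",
    },
    credential=DefaultAzureCredential(),
)

def hate_unfairness_eval(output, dataset_row) -> EvaluationResult:
    # Score each task output and map it to Arize's experiment evaluation format
    result = hate_unfairness(query=dataset_row["query"], response=output)
    return EvaluationResult(
        score=float(result["hate_unfairness_score"]),
        label=result["hate_unfairness"],
        explanation=result["hate_unfairness_reason"],
    )

client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[hate_unfairness_eval],
    experiment_name="hate-unfairness-baseline",
)
```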

Screenshots: experiment runs on the dataset with a comparison of the hate and unfairness evaluation metric in Arize AX; and a row-level comparison of experiment runs with hate and unfairness scores, labels, and explanations.