# Microsoft Azure AI Evaluation

> Use Microsoft Azure AI Evaluation evaluators to grade Arize AX traces and as evaluators in Arize AX experiments.

The [Microsoft Azure AI Evaluation](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-evaluation-readme) library ships LLM-as-judge evaluators (groundedness, relevance, coherence, fluency, content safety) along with deterministic NLP scorers (BLEU, F1, ROUGE, METEOR). This guide shows two ways to wire them into Arize AX:

* **Flow 1** grades existing Arize AX traces with `GroundednessEvaluator` and writes the scores back via `arize.spans.update_evaluations(...)`.
* **Flow 2** uploads a small dataset, runs an Arize AX experiment with the same evaluator wrapped as an experiment evaluator, and surfaces the scores in Datasets + Experiments.

Both flows share the same setup. Run the code blocks below in order inside a single Python session — each block builds on imports and variables from earlier ones.

## Prerequisites

* Python 3.11+
* An `ARIZE_SPACE_ID` and `ARIZE_API_KEY` from your Arize AX space settings
* An `OPENAI_API_KEY` from [OpenAI Platform](https://platform.openai.com/api-keys) (used for both the traced model calls and the judge model behind `GroundednessEvaluator`)

## Launch Arize AX

If you don't already have an Arize AX account, sign up at [arize.com](https://arize.com/) and grab your `ARIZE_SPACE_ID` and `ARIZE_API_KEY` from Settings → Space Settings.

## Install

```bash theme={null}
pip install azure-ai-evaluation 'arize>=8.0.0' openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc pandas
```

## Configure credentials

```bash theme={null}
export ARIZE_SPACE_ID="<your-space-id>"
export ARIZE_API_KEY="<your-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
```
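
Before running the Python blocks, it can help to confirm that the three variables are actually visible to the interpreter. The following is a minimal sketch, not part of the Arize or Azure SDKs: `missing_env` and `REQUIRED` are hypothetical names introduced here for illustration.

```python
import os

# Variable names taken from the export block above.
REQUIRED = ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY")


def missing_env(names, environ=os.environ):
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


missing = missing_env(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All credentials are set.")
```

Checking up front gives a readable message instead of the `KeyError` that `os.environ["..."]` raises later in the setup block.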

## Define evaluators

The shared setup consists of a Microsoft `GroundednessEvaluator` backed by `gpt-4.1-mini`, the 2-row hallucination dataset that both flows score, and an Arize SDK client. The judge model is pinned to `gpt-4.1-mini` because `azure-ai-evaluation` still sends the legacy `max_tokens` parameter, which the GPT-5 and o-series families reject. `gpt-4.1-mini` accepts `max_tokens` natively and is more deterministic at temperature 0 than `gpt-4o-mini`.

```python theme={null}
# combined.py
import os
import time
from datetime import datetime, timedelta, timezone

import pandas as pd
from arize import ArizeClient
from azure.ai.evaluation import GroundednessEvaluator
from openai import OpenAI

SPACE_ID = os.environ["ARIZE_SPACE_ID"]
API_KEY = os.environ["ARIZE_API_KEY"]
TIMESTAMP = int(time.time())

# Microsoft GroundednessEvaluator scores 1–5 — higher = better grounded
# in the supplied context. 5 means fully supported; 1 means the response
# directly contradicts the context.
groundedness = GroundednessEvaluator(
    model_config={
        "type":     "openai",
        "api_key":  os.environ["OPENAI_API_KEY"],
        "model":    "gpt-4.1-mini",
        "base_url": "https://api.openai.com/v1",
    }
)

# Canonical 2-row dataset — row 0 is factual (answer matches the reference),
# row 1 is hallucinated. Both flows grade these same rows.
ROWS = [
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
]

arize = ArizeClient(api_key=API_KEY)
```

## Flow 1 — Evaluate existing traces

### Source the spans

Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.

```python theme={null}
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

PROJECT_NAME = f"microsoft-tracing-example-{TIMESTAMP}"

resource = Resource.create(
    {
        "service.name":                PROJECT_NAME,
        "openinference.project.name":  PROJECT_NAME,
        "model_id":                    PROJECT_NAME,
    }
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://otlp.arize.com:443",
            headers={
                "authorization":   API_KEY,
                "arize-space-id":  SPACE_ID,
                "arize-interface": "python",
            },
        )
    )
)
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument(tracer_provider=provider)

sync_oai = OpenAI()
for row in ROWS:
    sync_oai.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a fact-recall assistant. The user states the "
                    "exact answer to use; reply with that verbatim."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {row['input']}\n"
                    f"Answer (reply verbatim): {row['output']}"
                ),
            },
        ],
    )

provider.force_flush(timeout_millis=10_000)
print(f"Project: {PROJECT_NAME}")

# Poll defensively: Arize's OTLP ingest and Flight export use different
# catalogs and the new project can briefly appear "unauthorized" to the
# export endpoint while still accepting span writes via OTLP, so swallow
# transient errors and retry.
start = datetime.now(timezone.utc) - timedelta(minutes=5)
end = datetime.now(timezone.utc) + timedelta(minutes=1)
spans_df = None
last_err: Exception | None = None
for _ in range(12):
    time.sleep(5)
    try:
        spans_df = arize.spans.export_to_df(
            space_id=SPACE_ID,
            project_name=PROJECT_NAME,
            start_time=start,
            end_time=end,
        )
    except Exception as e:
        last_err = e
        continue
    if spans_df is not None and len(spans_df) >= len(ROWS):
        break
else:
    raise RuntimeError(
        f"Spans never appeared after 60s (last error: {last_err})"
    )

spans_df = spans_df.sort_values("start_time").reset_index(drop=True)
```

### Run the evaluators

`GroundednessEvaluator.__call__` is synchronous, so call it once per span, passing the question, answer, and reference for that span.

The raw `groundedness` value is a 1–5 score from the judge LLM, which can drift by one point between runs at the extremes. This guide grades on the more stable `groundedness_result` (`pass` / `fail` against the threshold of 3) and normalizes it to `1.0` / `0.0` so the score column stays consistent across runs. If you want the raw 1–5 number, swap in `float(result["groundedness"])`.

```python theme={null}
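scores = []
labels = []
# spans_df was sorted by start_time and re-indexed, so span i lines up with
# ROWS[i] (the calls were made in ROWS order).
for i, row in spans_df.iterrows():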
    result = groundedness(
        query=ROWS[i]["input"],
        response=row["output"],
        context=ROWS[i]["reference"],
    )
    passed = result["groundedness_result"] == "pass"
    scores.append(1.0 if passed else 0.0)
    labels.append("grounded" if passed else "ungrounded")
```
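
The pass/fail normalization used in the loop above can be factored into a small helper so that both flows apply the same rule. This is a minimal sketch: `to_score_and_label` is a hypothetical name, and the sample dicts are hand-written stand-ins that only mimic the `groundedness` and `groundedness_result` keys used in this guide, not real judge output.

```python
def to_score_and_label(result: dict, use_raw: bool = False) -> tuple[float, str]:
    """Convert an evaluator result dict into a (score, label) pair."""
    passed = result.get("groundedness_result") == "pass"
    label = "grounded" if passed else "ungrounded"
    if use_raw:
        # Raw 1-5 judge score; may drift by a point between runs.
        return float(result["groundedness"]), label
    # Binary score derived from the pass/fail verdict.
    return (1.0 if passed else 0.0), label


# Hand-written sample results (assumed shape, not real judge output).
sample_pass = {"groundedness": 5, "groundedness_result": "pass"}
sample_fail = {"groundedness": 1, "groundedness_result": "fail"}

print(to_score_and_label(sample_pass))                # → (1.0, 'grounded')
print(to_score_and_label(sample_fail))                # → (0.0, 'ungrounded')
print(to_score_and_label(sample_pass, use_raw=True))  # → (5.0, 'grounded')
```

The label always follows the pass/fail verdict, so switching to the raw score changes only the numeric column.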

### Log evaluations to Arize AX

```python theme={null}
eval_df = pd.DataFrame(
    {
        "context.span_id":          spans_df["context.span_id"],
        "eval.groundedness.score":  scores,
        "eval.groundedness.label":  labels,
    }
)
arize.spans.update_evaluations(
    space_id=SPACE_ID,
    project_name=PROJECT_NAME,
    dataframe=eval_df,
)

flow1_display = pd.DataFrame(
    {
        "input":  [r["input"]  for r in ROWS],
        "output": [r["output"] for r in ROWS],
        "score":  scores,
    }
)
print("Flow 1 results:")
print(flow1_display.to_string())
```

### Expected output

```text wrap theme={null}
Flow 1 results:
                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0
```

### Verify in Arize AX

Open the project named `microsoft-tracing-example-<timestamp>` (the value printed above) in your Arize AX space. Each `ChatCompletion` span now carries a `groundedness` annotation column showing the normalized 0/1 score and the `grounded` / `ungrounded` label.

## Flow 2 — Run an experiment

### Create a dataset

```python theme={null}
DATASET_NAME = f"microsoft-experiment-example-ds-{TIMESTAMP}"
dataset_df = pd.DataFrame(ROWS)
arize.datasets.create(
    name=DATASET_NAME,
    space=SPACE_ID,
    examples=dataset_df,
)
print(f"Dataset: {DATASET_NAME}")
```

### Define the task

```python theme={null}
def task(dataset_row):
    return dataset_row["output"]
```

### Wrap the evaluators

`GroundednessEvaluator.__call__` is already safe to invoke from inside an asyncio loop (the library wraps its async core with `async_run_allowing_running_loop`), so the experiment evaluator is a plain `def`, not `async def`. Return an `EvaluationResult` with score, label, and explanation populated — leaving any of those as `None` triggers `unsupported cast from null to <type>: reserved column cannot be coerced to canonical type` at upload time.

```python theme={null}
from arize.experiments.evaluators.types import EvaluationResult


def groundedness_eval(input, output, dataset_row) -> EvaluationResult:
    result = groundedness(
        query=dataset_row["input"],
        response=output if isinstance(output, str) else str(output),
        context=dataset_row["reference"],
    )
    passed = result["groundedness_result"] == "pass"
    return EvaluationResult(
        score=1.0 if passed else 0.0,
        label="grounded" if passed else "ungrounded",
        explanation=result.get("groundedness_reason") or "no explanation",
    )
```

### Run the experiment

```python theme={null}
EXPERIMENT_NAME = f"microsoft-experiment-example-{TIMESTAMP}"
experiment, runs_df = arize.experiments.run(
    space=SPACE_ID,
    name=EXPERIMENT_NAME,
    dataset=DATASET_NAME,
    task=task,
    evaluators={"groundedness": groundedness_eval},
)
print(f"Experiment: {EXPERIMENT_NAME}")
print("Flow 2 results:")
flow2_display = runs_df[
    ["output", "eval.groundedness.score", "eval.groundedness.label"]
].rename(
    columns={
        "eval.groundedness.score": "score",
        "eval.groundedness.label": "label",
    }
)
print(flow2_display.to_string())
```

### Expected output

```text wrap theme={null}
Flow 2 results:
                             output  score       label
0   Paris is the capital of France.    1.0    grounded
1  Berlin is the capital of France.    0.0  ungrounded
```

### Verify in Arize AX

Open the **Datasets + Experiments** tab in Arize AX. The dataset `microsoft-experiment-example-ds-<timestamp>` and the experiment `microsoft-experiment-example-<timestamp>` (names printed above) appear with one run per dataset row, each carrying the `groundedness` score and label columns.

## Troubleshooting

* **`OpenAIConnection.__init__() missing 1 required positional argument: 'base_url'`.** The Azure AI Evaluation library requires an explicit `base_url` in the `model_config` even for plain OpenAI. Set it to `https://api.openai.com/v1` as shown in the Define evaluators block.
* **`Unsupported parameter: 'max_tokens' is not supported with this model`.** `azure-ai-evaluation` sends OpenAI requests with the legacy `max_tokens` parameter that GPT-5 and o-series models reject. Pin the judge to a model that still accepts `max_tokens` (`gpt-4.1-mini`, `gpt-4o-mini`, `gpt-4o`).
* **`column "eval.groundedness.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type`.** Your experiment evaluator returned a bare float or a dict that didn't fill all three of score / label / explanation. Return a fully-populated `EvaluationResult(...)`.
* **Spans never appear after 60s.** Span flush + ingest typically takes 5–15s. If the loop times out, check that `ARIZE_SPACE_ID` and `ARIZE_API_KEY` are correct and that you're connecting to the correct region's OTLP endpoint (`otlp.arize.com` for US, `otlp.eu.arize.com` for EU).
* **Using Azure OpenAI for the judge model.** Swap the `model_config` for `{"type": "azure_openai", "api_key": "...", "azure_endpoint": "https://<resource>.openai.azure.com", "azure_deployment": "<deployment-name>", "api_version": "2024-10-21"}`. The rest of this guide is unchanged.
* **Using safety evaluators (HateUnfairness, Violence, etc.) instead.** Those require an Azure AI Foundry project and use `AzureAIProject(subscription_id=..., resource_group_name=..., project_name=...)` as the evaluator's second arg instead of `model_config`. See [Microsoft's safety eval docs](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/risk-safety-evaluators).
* **Using a different score scale.** Microsoft's LLM-judged evaluators (Groundedness, Relevance, Coherence, Fluency, Similarity, Retrieval) all return scores on a 1–5 scale. To rescale to the continuous 0–1 range for downstream tooling, normalize before assigning: `score = (raw - 1) / 4`. This differs from the binary pass/fail score used in the flows above.
* **Experiment re-runs collide.** The dataset and experiment names embed `TIMESTAMP = int(time.time())`, which has one-second resolution, so separate runs of `combined.py` normally produce unique names. If you re-execute within the same second, or re-run the Flow 2 blocks in a session where `TIMESTAMP` is already set, regenerate `TIMESTAMP` first or call `arize.experiments.delete(...)` / `arize.datasets.delete(...)` on the prior run's names.
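
The linear rescaling from the score-scale item above can be sketched as follows. `normalize_1_to_5` is a hypothetical helper name. The range check is an added assumption, so that an out-of-range value fails loudly instead of yielding a score outside 0–1.

```python
def normalize_1_to_5(raw: float) -> float:
    """Map a 1-5 judge score linearly onto the range [0, 1]."""
    if not 1 <= raw <= 5:
        raise ValueError(f"expected a score between 1 and 5, got {raw}")
    return (raw - 1) / 4


for raw in (1, 3, 5):
    print(raw, normalize_1_to_5(raw))
# → 1 0.0
# → 3 0.5
# → 5 1.0
```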

## Resources

<CardGroup>
  <Card icon="book-open" href="https://learn.microsoft.com/en-us/python/api/overview/azure/ai-evaluation-readme" title="Azure AI Evaluation Documentation" horizontal />

  <Card icon="terminal" href="https://pypi.org/project/azure-ai-evaluation/" title="azure-ai-evaluation on PyPI" horizontal />

  <Card icon="book-open" href="/api-clients/python/version-8/client-resources/spans#update-evaluations" title="Logging evaluations to Arize AX" horizontal />

  <Card icon="book-open" href="/ax/integrations/evaluation-integrations/ragas" title="Ragas evaluators in Arize AX" horizontal />
</CardGroup>
