> ## Documentation Index
> Fetch the complete documentation index at: https://arize-ax.mintlify.dev/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Set Up Custom Metrics

> Define custom metrics using AQL to track what matters most to your business or application such as cost, quality, and usage.

You can see your traces and understand agent behavior. But how do you measure what matters to your specific app? The built-in metrics (latency, token count, error rate) are a start, but they won't tell you your cost per query, what percentage of responses are hallucinated, or how many distinct sessions you're handling.

Custom metrics let you define exactly what you need to track. Each one is a single query in **Arize Query Language (AQL)** — a SQL-like language — that returns one number, which then shows up in your [dashboards](/ax/observe/dashboards) and [monitors](/ax/observe/production-monitoring).

<Frame>
  <video controls className="w-full aspect-video rounded-xl" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/observe/custom_metric.mp4" />
</Frame>

## How to set up custom metrics

Open your project, click the **`...`** menu in the top right, and select **Custom Metrics**. From there, you have three ways to create one:

<Tabs>
  <Tab title="Alyx">
    Click **New Custom Metric**. On the right side of the page, Alyx can either recommend a metric or translate one you describe into AQL.

    **Ask for a recommendation** if you're not sure what to track:

    * "What metrics should I set up for this project?"
    * "Suggest a quality metric based on the evals I'm running"

    **Describe a metric in plain English** if you already know what you want:

    * "Count responses flagged as hallucinated"
    * "Percentage of retriever spans where Relevance is 'relevant'"
    * "Total token cost divided by number of distinct sessions"
    * "A weighted score: 25% QA correctness plus 75% relevance"

    Alyx drops the AQL into the SQL Query box. Edit, name, save.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/observe/alyx_custom_metric.png" alt="Creating a custom metric with Alyx — natural language prompt on the right generates AQL in the SQL Query box" />
    </Frame>
  </Tab>

  <Tab title="UI">
    1. Click **New Custom Metric**
    2. Write or paste AQL into the SQL Query box
    3. Name the metric and save — the chart above the query box previews the result over time
  </Tab>

  <Tab title="Code">
    Create and manage custom metrics via the GraphQL API:

    ```graphql theme={null}
    mutation {
      createCustomMetric(input: {
        modelId: "MODEL_ID"
        name: "Hallucination Rate"
        description: "Percentage of responses flagged as hallucinated"
        metric: "SELECT COUNT(*) FILTER (WHERE \"eval.Hallucination.label\" = 'hallucinated') / COUNT(*) * 100 FROM model"
      }) {
        customMetric { id name }
      }
    }
    ```

    See the full [Custom Metrics GraphQL API reference](/ax/graphql-reference/apis/custom-metrics-api).
  </Tab>
</Tabs>

## Common patterns

Wondering what to put in the SQL Query box? Most custom metrics fall into one of five shapes. Each section below has a template you can adapt plus a concrete example. Every example starts with a `-- description` comment, which is the format Alyx generates and a useful habit for readability.

### 1. Count of an eval label

How many times did a specific label occur? Pair `COUNT(*)` with a `FILTER (WHERE ...)` clause.

```sql theme={null}
SELECT COUNT(*) FILTER (WHERE "eval.<NAME>.label" = '<LABEL>')
FROM model
```

**Example — count of hallucinated responses**

```sql theme={null}
-- Count of responses flagged as hallucinated
SELECT COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated')
FROM model
```

### 2. Rate (% of total)

What percentage of traffic has a given eval label, status, or attribute? Divide a filtered count by an unfiltered count.

```sql theme={null}
SELECT
    COUNT(*) FILTER (WHERE <condition>) /
    COUNT(*) FILTER (WHERE <denominator scope>) * 100
FROM model
```

**Example — hallucination rate**

```sql theme={null}
-- Hallucination rate: % of evaluated responses flagged as hallucinated
SELECT
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated') /
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" != null) * 100
FROM model
```

<Tip>
  Filter the denominator to `!= null` so spans without an eval don't inflate your total.
</Tip>

**Example — error rate for a specific tool**

```sql theme={null}
-- Error rate for the fetch_inventory tool
SELECT
    COUNT(*) FILTER (WHERE "status_code" = 'ERROR') /
    COUNT(*) * 100
FROM model
WHERE "attributes.tool.name" = 'fetch_inventory'
```

### 3. Weighted composite score

Combine multiple eval scores into one number — useful when quality depends on several dimensions (e.g., correctness *and* relevance).

```sql theme={null}
SELECT
    <weight_1> * AVG("eval.<NAME_1>.score") +
    <weight_2> * AVG("eval.<NAME_2>.score")
FROM model
```

**Example — 25% QA correctness + 75% relevance**

```sql theme={null}
-- Weighted quality: 25% QA correctness + 75% relevance
SELECT
    0.25 * AVG("eval.QA_Correctness.score") +
    0.75 * AVG("eval.Relevance.score")
FROM model
```

Weights should sum to 1. Scores must be numeric — use `.score`, not `.label`.

### 4. Cost and usage

Turn token counts into dollars, count distinct sessions, and roll them up together or separately.

**Total cost**

```sql theme={null}
-- Total token cost across all LLM calls
SELECT SUM(
    "attributes.llm.token_count.prompt" * 2.5 / 1000000 +
    "attributes.llm.token_count.completion" * 10 / 1000000
)
FROM model
```

Replace `2.5` and `10` with your model's per-million input/output token rates.

**Distinct sessions**

```sql theme={null}
-- Count of distinct sessions
SELECT APPROX_COUNT_DISTINCT("attributes.session.id")
FROM model
```

**Cost per session**

```sql theme={null}
-- Total token cost divided by number of distinct sessions
SELECT SUM(
    "attributes.llm.token_count.prompt" * 2.5 / 1000000 +
    "attributes.llm.token_count.completion" * 10 / 1000000
) / APPROX_COUNT_DISTINCT("attributes.session.id")
FROM model
```

**Evaluation cost estimate** — estimate how much it costs to run an eval over your traces

```sql theme={null}
-- Eval cost: inputs + outputs + eval prompt template + eval output
SELECT
    (SUM("attributes.llm.token_count.prompt" + "attributes.llm.token_count.completion" + 229) * 0.00000025) +
    (COUNT(*) * (104 * 0.00001))
FROM model
```

Replace `229` with your eval prompt template's token length, `104` with the estimated output token length, and the rates with your eval model's per-token input/output costs.

### 5. Eval quality against human labels

When you have both auto-evals and human annotations, measure how well the eval matches the human ground truth.

```sql theme={null}
SELECT
    PRECISION(
        predicted = "eval.<EVAL_NAME>.label",
        actual = "annotation.<ANNOTATION_NAME>.label",
        pos_class = '<LABEL>'
    )
FROM model
```

**Example — precision of a QA correctness eval**

```sql theme={null}
-- Precision of QA correctness eval vs. human annotation
SELECT
    PRECISION(
        predicted = "eval.QA_Correctness_Eval.label",
        actual = "annotation.Judge Correctness.label",
        pos_class = 'correct'
    )
FROM model
```

`RECALL(...)`, `F1(...)`, `TP(...)`, `FP(...)` follow the same signature. See [Metric functions](#metric-functions) below for the full catalog.

## AQL syntax

The patterns above cover most cases. When you need to adapt or extend them, this is the full AQL vocabulary.

A metric is a single query that aggregates rows from `model` (the span table). You pick your dimensions, choose an aggregate or metric function, scope which rows count with `WHERE` or `FILTER`, and use `CASE` or operators inside expressions. Each subsection below walks one of those pieces.

### Query shape

Every metric starts with this template. `SELECT` must return one value — there is no `GROUP BY` in AQL. `WHERE` scopes the whole query; `FILTER (WHERE ...)` scopes just the preceding aggregate.

```sql theme={null}
-- Template
SELECT
  < AGG_FUNC | METRIC_FUNC >(exprs)
  [FILTER (WHERE exprs)]
  [CASE WHEN ... END]
  [ <OPERATOR> { < AGG_FUNC | ARIZE_METRIC >(exprs) | constant }] …
FROM model
[WHERE (exprs)]
```

```sql theme={null}
-- Example — hallucination rate
SELECT
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated') /
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" != null) * 100
FROM model
```

### Constants, expressions, and comments

The general building blocks you use inside a query.

* **Constants** — numbers (int or float) or strings in single quotes (`'active'`).
* **Expressions** — math/boolean operators combining dimensions and constants; no aggregate inside. Can nest: `A * (B + C)`.
* **Comments** — `-- single line` (CMD+/) or `/* multi-line */`.

### Dimensions

A dimension is any span property, feature, tag, prediction, or actual — string or numeric. Wrap any column with dots or spaces in double quotes. Here are the ones you'll reference most on trace data:

| Column                                                | What it captures                                          |
| ----------------------------------------------------- | --------------------------------------------------------- |
| `"status_code"`                                       | `OK` or `ERROR` on each span                              |
| `"attributes.openinference.span.kind"`                | `LLM`, `TOOL`, `RETRIEVER`, `CHAIN`, `AGENT`, `EMBEDDING` |
| `"attributes.llm.token_count.prompt"` / `.completion` | Token counts per LLM span                                 |
| `"attributes.llm.model_name"`                         | Model used (e.g. `gpt-4o-mini`)                           |
| `"attributes.tool.name"`                              | Name of the tool called on a TOOL span                    |
| `"attributes.session.id"`                             | Session identifier                                        |
| `"eval.<name>.label"` / `.score`                      | Eval results                                              |
| `"annotation.<name>.label"`                           | Human labels                                              |

### Aggregation functions

Reduce many rows to one number. Every metric must have at least one aggregation or metric function.

| Function                                     | Description                                                        | Type    |
| -------------------------------------------- | ------------------------------------------------------------------ | ------- |
| `COUNT(*)`                                   | Counts the number of rows                                          | n/a     |
| `APPROX_COUNT_DISTINCT(exprs)`               | Counts the unique values of `exprs`                                | String  |
| `SUM(exprs)`                                 | Sums the value across rows                                         | Numeric |
| `AVG(exprs)`                                 | Averages the value across rows                                     | Numeric |
| `APPROX_QUANTILE(exprs, quantile=<decimal>)` | Approximate quantile; `quantile` must be between 0 and 1 inclusive | Numeric |
| `MIN(exprs)`                                 | Minimum across rows                                                | Numeric |
| `MAX(exprs)`                                 | Maximum across rows                                                | Numeric |

### Metric functions

Beyond basic aggregates, AQL ships pre-built statistical metrics. All take keyword arguments. Omit `pos_class=` to use the model's configured positive class. See the concrete example in [Common patterns #5](#5-eval-quality-against-human-labels).

<AccordionGroup>
  <Accordion title="Classification">
    | Function                  | Signature                                                                                    | Returns                                                                            |
    | ------------------------- | -------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
    | `PRECISION`               | `PRECISION(actual=<col>, predicted=<col>, pos_class='<label>')`                              | Precision for the positive class                                                   |
    | `RECALL`                  | `RECALL(actual=<col>, predicted=<col>, pos_class='<label>')`                                 | Recall for the positive class                                                      |
    | `F1`                      | `F1(actual=<col>, predicted=<col>, pos_class='<label>')`                                     | F1 score (harmonic mean of precision and recall)                                   |
    | `F_BETA`                  | `F_BETA(actual=<col>, predicted=<col>, pos_class='<label>', beta=<n>)`                       | F-score weighted by `beta` (default 1). `beta=0` → precision; `beta→∞` → recall    |
    | `TP` / `FP` / `TN` / `FN` | Same signature as `PRECISION`                                                                | True / false positive / negative rate                                              |
    | `ACCURACY`                | `ACCURACY(actual=<col>, predicted=<col>)`                                                    | Overall accuracy                                                                   |
    | `LOG_LOSS`                | `LOG_LOSS(actual=<string col>, predicted=<numeric col>, pos_class='<label>')`                | Log loss                                                                           |
    | `AUC`                     | `AUC(actual=<col>, predicted=<col>)`                                                         | ROC AUC                                                                            |
    | `MAX_PRECISION`           | `MAX_PRECISION(actual=<col>, predicted=<col>, pos_class='<label>', group_by_column='<col>')` | Groups by `group_by_column`, keeps highest-score row per group, computes precision |
  </Accordion>

  <Accordion title="Regression">
    All four share the signature `FUNC(actual=<col>, predicted=<col>)`.

    | Function | Returns                        |
    | -------- | ------------------------------ |
    | `MAE`    | Mean absolute error            |
    | `MAPE`   | Mean absolute percentage error |
    | `MSE`    | Mean squared error             |
    | `RMSE`   | Root mean squared error        |
  </Accordion>

  <Accordion title="Ranking">
    ```
    NDCG(ranking_relevance=<col>, prediction_group_id=<col>, rank=<col>, omit_zero_relevance=True, k=10)
    ```

    Normalized Discounted Cumulative Gain. `omit_zero_relevance` controls whether 0-relevance rows affect averaging.
  </Accordion>
</AccordionGroup>

### Filters — WHERE and FILTER

Scope which rows the aggregate sees. `WHERE` scopes the whole query; `FILTER (WHERE ...)` scopes just the preceding aggregate. Identical syntax inside. Subqueries are not supported in either.

```sql theme={null}
-- WHERE: every row the query sees must match
SELECT COUNT(*)
FROM model
WHERE "attributes.openinference.span.kind" = 'LLM'

-- FILTER: the filter only applies to the preceding COUNT
SELECT
    COUNT(*) FILTER (WHERE "eval.Relevance.label" = 'relevant') /
    COUNT(*) * 100
FROM model
```

### Conditionals — CASE

Classify rows into buckets before aggregating. Expressions evaluated in order; the first `true` branch wins. `ELSE` gives a default — without one and no match, returns `NULL`.

```sql theme={null}
-- Example — count errors by bucketing spans into 1/0
SELECT SUM(CASE WHEN "status_code" = 'ERROR' THEN 1 ELSE 0 END)
FROM model
```

```sql theme={null}
-- Searched CASE
CASE
  WHEN exprs THEN result1
  [ WHEN exprs THEN result2 ]
  [ ELSE default ]
END

-- Simple CASE — equality against a common expression
CASE expression
  WHEN value THEN result
  [WHEN ...]
  [ELSE result]
END
```

### Operators

Math and comparison building blocks used inside expressions.

**Numeric** — apply to numeric dimensions only.

| Operator                              | Description                                                         |
| ------------------------------------- | ------------------------------------------------------------------- |
| `A + B` / `A - B` / `A * B` / `A / B` | Arithmetic (integer division cast to `FLOAT`)                       |
| `ABS(A)`                              | Absolute value                                                      |
| `CEIL` / `FLOOR`                      | Round up / down to nearest integer                                  |
| `COS`, `SIN`, `TAN`, `TANH`           | Trigonometry                                                        |
| `LN`, `LOG` / `LOG10`                 | Natural log (base `e`) / log base 10                                |
| `SQRT` / `CBRT`                       | Square root / cube root                                             |
| `GREATEST` / `LEAST`                  | Greatest / least of the arguments; `NULL` if any argument is `NULL` |

**Comparison** — apply to strings and numerics.

| Operator                                | Description         |
| --------------------------------------- | ------------------- |
| `A = B` / `A <> B` / `A != B`           | Equal / not equal   |
| `A > B` / `A >= B` / `A < B` / `A <= B` | Ordered comparisons |
| `A IS NULL` / `A IS NOT NULL`           | Null checks         |

***

## Next step

You've defined what to measure. Now put those metrics on a dashboard so you can see them at a glance:

<Card title="Next: Set Up Dashboards" icon="arrow-right" href="/ax/observe/dashboards" />
