# Utilities

> Helper utilities for preprocessing DataFrames in online task workflows.

The `arize.utils` module provides helpers for common preprocessing tasks. They are mainly useful when building evaluators for online tasks that operate on data exported from Arize.

## Online Task Utilities

### `extract_nested_data_to_column`

Extract deeply nested attributes from complex data structures into new DataFrame columns.

This function is designed for use in online task evaluators. Data exported from Arize often contains columns with nested structures (lists of dicts, JSON strings), for example LLM message arrays stored under `attributes.llm.output_messages`. The function resolves each dot-delimited attribute path against those structures and creates one new flat column per path, making the values directly accessible to evaluators.

```python theme={null}
from arize.utils.online_tasks import extract_nested_data_to_column
```

**Signature:**

```python theme={null}
extract_nested_data_to_column(
    attributes: list[str],
    df: pd.DataFrame,
) -> pd.DataFrame
```

**Parameters:**

| Parameter    | Type           | Description                                                                                            |
| ------------ | -------------- | ------------------------------------------------------------------------------------------------------ |
| `attributes` | `list[str]`    | Dot-delimited attribute paths to extract (e.g. `["attributes.llm.output_messages.0.message.content"]`) |
| `df`         | `pd.DataFrame` | Input DataFrame, typically exported from Arize                                                         |

**Returns:** A new `pd.DataFrame` with the extracted attributes as additional columns. Rows where any of the requested attributes cannot be resolved are dropped.

**Raises:** `ColumnNotFoundError` if no column in `df` matches any prefix of a requested attribute.

***

**How it works:**

For each attribute string (e.g. `"attributes.llm.output_messages.0.message.content"`):

1. Finds the longest prefix that matches an existing column name (e.g. `"attributes.llm.output_messages"`)
2. Uses the remainder as a path to introspect into each row's value (e.g. `"0.message.content"`)
3. Creates a new column whose name is the full attribute string, holding the extracted values
4. Drops rows where any of the new columns could not be resolved
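
Step 1 can be pictured as a longest-prefix search over the column names. The block below is a minimal sketch of that idea, not the library's implementation: `split_attribute` is a hypothetical helper name, and it raises a plain `KeyError` where the real function raises `ColumnNotFoundError`.

```python
def split_attribute(attribute: str, columns: list[str]) -> tuple[str, str]:
    """Split an attribute path into (longest matching column, remaining path).

    Hypothetical helper for illustration only; not part of arize.utils.
    """
    parts = attribute.split(".")
    # Try the longest candidate prefix first, then progressively shorter ones.
    for end in range(len(parts), 0, -1):
        prefix = ".".join(parts[:end])
        if prefix in columns:
            return prefix, ".".join(parts[end:])
    # Stand-in for the library's ColumnNotFoundError.
    raise KeyError(f"no column matches any prefix of {attribute!r}")


columns = ["span_id", "attributes.llm.output_messages"]
column, remainder = split_attribute(
    "attributes.llm.output_messages.0.message.content", columns
)
print(column)     # attributes.llm.output_messages
print(remainder)  # 0.message.content
```

Searching from the longest prefix downward matters because column names themselves contain dots, so a shorter prefix such as `attributes.llm` could otherwise shadow the intended column.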

The introspection handles nested dicts, lists (indexed by integer position), JSON-encoded strings, and dict keys that themselves contain dots, such as `message.content`.
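
To make these cases concrete, here is a rough sketch of how such a path walk could work on a single row value. It is an illustration under assumptions, not the library's code: `resolve_path` and the `MISSING` sentinel are hypothetical names, and the real function may resolve ambiguous paths differently.

```python
import json
from typing import Any

MISSING = object()  # hypothetical sentinel meaning "path could not be resolved"


def resolve_path(value: Any, path: str) -> Any:
    """Walk a dot-delimited path into a nested value (illustrative sketch)."""
    if path == "":
        return value
    # JSON strings: decode first, then keep walking the decoded structure.
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            return MISSING
    parts = path.split(".")
    if isinstance(value, dict):
        # Dotted dict keys: try "message.content" before falling back to "message".
        for end in range(len(parts), 0, -1):
            key = ".".join(parts[:end])
            if key in value:
                return resolve_path(value[key], ".".join(parts[end:]))
        return MISSING
    if isinstance(value, list):
        # Lists: the next path segment must be a valid integer index.
        head = parts[0]
        if head.isdigit() and int(head) < len(value):
            return resolve_path(value[int(head)], ".".join(parts[1:]))
    return MISSING


path = "0.message.content"
rows = [
    [{"message.role": "assistant", "message.content": "Paris"}],  # dotted keys
    '[{"message": {"content": "Rome"}}]',                         # JSON string
    [],                                                           # index out of range
]
print(resolve_path(rows[0], path))             # Paris
print(resolve_path(rows[1], path))             # Rome
print(resolve_path(rows[2], path) is MISSING)  # True
```

In this sketch, a row whose value comes back as `MISSING` corresponds to a row that the real function would drop.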

***

**Example:**

```python theme={null}
import pandas as pd
from arize.utils.online_tasks import extract_nested_data_to_column

# DataFrame exported from Arize — output_messages is a list of message dicts
df = pd.DataFrame({
    "span_id": ["s1", "s2"],
    "attributes.llm.output_messages": [
        [{"message.role": "assistant", "message.content": "The capital of France is Paris."}],
        [{"message.role": "assistant", "message.content": "Shakespeare wrote Romeo and Juliet."}],
    ],
})

# Extract the assistant's reply into a flat column
result = extract_nested_data_to_column(
    attributes=["attributes.llm.output_messages.0.message.content"],
    df=df,
)

print(result["attributes.llm.output_messages.0.message.content"].tolist())
# ['The capital of France is Paris.', 'Shakespeare wrote Romeo and Juliet.']
```

**Use in an online task evaluator:**

```python theme={null}
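import pandas as pd
from arize.utils.online_tasks import extract_nested_data_to_column


def score(question: str, answer: str) -> float:
    # Hypothetical placeholder so the example runs; replace with your own scoring logic.
    return 1.0 if answer else 0.0


def my_evaluator(df: pd.DataFrame) -> pd.DataFrame: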
    # Extract nested content before scoring
    df = extract_nested_data_to_column(
        attributes=[
            "attributes.input.value",
            "attributes.llm.output_messages.0.message.content",
        ],
        df=df,
    )

    # Now use the flat columns to compute scores
    df["eval.MyEval.score"] = df.apply(
        lambda row: score(
            row["attributes.input.value"],
            row["attributes.llm.output_messages.0.message.content"],
        ),
        axis=1,
    )
    return df
```
