# NLP_Metrics

> Install extra dependencies to compute LLM evaluation metrics

<img className="inline m-0" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/9e9834e4-image.jpeg" /> [View Source on GitHub](https://github.com/Arize-ai/client_python/blob/main/src/arize/pandas/generative/nlp_metrics/hf_metrics.py)

<Warning>
  <img className="inline m-0" alt="Python 3.8" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/7539e0ab-python-3_8-red.svg" />  minimum required for NLP Metrics
</Warning>

Install extra dependencies in the SDK:

```
%pip install -q "arize[NLP_Metrics]<8.0.0"
```

Import the metrics from `arize.pandas.generative.nlp_metrics`.

### Metrics

#### `bleu`

BLEU score is typically used to evaluate the quality of machine-translated text from one natural language to another. BLEU calculates scores for individual translated segments, typically sentences, by comparing them to a set of high-quality reference translations. These scores are then averaged over the entire corpus to obtain an estimate of the overall quality of the translation.

| Argument         | Expected Type | Description                                                                                     |
| ---------------- | ------------- | ----------------------------------------------------------------------------------------------- |
| `response_col`   | pd.Series     | \[Required] Pandas Series containing translations (as strings) to score                         |
| `references_col` | pd.Series     | \[Required] Pandas Series containing a reference, or list of several references, per prediction |
| `max_order`      | int           | \[Optional] Maximum n-gram order to use when computing BLEU score. Defaults to `4`              |
| `smooth`         | bool          | \[Optional] Whether or not to apply Lin et al. 2004 smoothing. Defaults to `False`              |

#### Code Example

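```python theme={null}
from arize.pandas.generative.nlp_metrics import bleu

# df is a pandas DataFrame with one generated text and its reference(s) per row
bleu_scores = bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    max_order=5,  # optional field
    smooth=True,  # optional field
)
```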
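To build intuition for what `max_order` controls, here is a minimal pure-Python sketch of sentence-level BLEU without smoothing: clipped n-gram precisions up to `max_order`, combined by a geometric mean and scaled by a brevity penalty. It is not the implementation used by the SDK. The helper names `ngrams` and `sentence_bleu` are hypothetical, and whitespace tokenization and the closest-reference-length brevity penalty are simplifying assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, references, max_order=4):
    """Illustrative BLEU for one whitespace-tokenized hypothesis (assumption: no smoothing)."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_order + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            max_ref_counts |= ngrams(ref, n)
        clipped = sum((hyp_counts & max_ref_counts).values())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty against the reference length closest to the hypothesis length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(hyp)), length))
    brevity_penalty = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_order)
```

Because the score is a geometric mean, a single n-gram order with no matches drives the unsmoothed score to zero, which is the situation the `smooth` argument is meant to soften on short texts.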

#### `sacre_bleu`

SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text.

| Argument              | Expected Type | Description                                                                                                                                                                                                               |
| --------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `response_col`        | pd.Series     | \[Required] Pandas Series containing translations (as strings) to score                                                                                                                                                   |
| `references_col`      | pd.Series     | \[Required] Pandas Series containing a reference, or list of several references, per prediction. **Note:** There must be the same number of references for each prediction (i.e. all sub-lists must be of the same length) |
| `smooth_method`       | str           | \[Optional] The smoothing method to use. Defaults to `'exp'`. Possible values: `'none'` (no smoothing), `'floor'` (increment zero counts), `'add-k'` (increment num/denom by k for n>1), `'exp'` (exponential decay) |
| `smooth_value`        | float         | \[Optional] The smoothing value. Defaults to `None`. When `smooth_method='floor'`, `smooth_value` defaults to `0.1`; when `smooth_method='add-k'`, it defaults to `1` |
| `lowercase`           | bool          | \[Optional] If True, lowercases the input, enabling case-insensitivity. Defaults to `False`                                                                                                                               |
| `force`               | bool          | \[Optional] If True, insists that your tokenized input is actually de-tokenized. Defaults to `False`                                                                                                                      |
| `use_effective_order` | bool          | \[Optional] If True, stops including n-gram orders for which precision is 0. Set this to True when computing sentence-level BLEU. Defaults to `False` |

#### Code Example

```python theme={null}
from arize.pandas.generative.nlp_metrics import sacre_bleu

sacre_bleu_scores = sacre_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    smooth_method='floor',  # optional field
    smooth_value=0.1,  # optional field; 0.1 is already the default for 'floor'
    lowercase=True,  # optional field
)
```
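Since `sacre_bleu` expects the same number of references for every prediction, it can help to check the reference column before scoring. The helper below is a hypothetical sketch, not part of the SDK; it assumes each row is either a single string or a list of strings, and works on any iterable of rows (a plain list here, or a pandas Series such as `df['reference_summary']`).

```python
def check_reference_counts(references):
    """Return the shared number of references per row, or raise ValueError.

    `references` is any iterable of rows; each row is a string (one reference)
    or a list of strings (several references). Hypothetical helper for illustration.
    """
    counts = {1 if isinstance(row, str) else len(row) for row in references}
    if len(counts) > 1:
        raise ValueError(f"Inconsistent reference counts per row: {sorted(counts)}")
    return counts.pop() if counts else 0


# Every row carries two references, so the check passes.
rows = [["a short summary", "a brief summary"], ["another one", "one more"]]
print(check_reference_counts(rows))
```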

#### `google_bleu`

BLEU is typically used as a corpus-level measure and has limitations when applied to single sentences. The GLEU (Google-BLEU) score is a variation designed to overcome this issue in reinforcement learning (RL) experiments. The GLEU score is the minimum of n-gram recall and n-gram precision.

| Argument         | Expected Type | Description                                                                               |
| ---------------- | ------------- | ----------------------------------------------------------------------------------------- |
| `response_col`   | pd.Series     | \[Required] Pandas Series containing translations (as strings) to score                               |
| `references_col` | pd.Series     | \[Required] Pandas Series containing a reference, or list of several references, for each translation |
| `min_len`        | int           | \[Optional] The minimum order of n-gram this function should extract. Defaults to `1`                 |
| `max_len`        | int           | \[Optional] The maximum order of n-gram this function should extract. Defaults to `4`                 |

#### Code Example

```python theme={null}
from arize.pandas.generative.nlp_metrics import google_bleu

google_bleu_scores = google_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    min_len=2,  # optional field
    max_len=5,  # optional field
)
```
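The "minimum of recall and precision" definition can be sketched in a few lines of pure Python. This is an illustration for a single hypothesis and a single reference, not the SDK's implementation; the names `all_ngrams` and `sentence_gleu` are hypothetical and whitespace tokenization is assumed.

```python
from collections import Counter


def all_ngrams(tokens, min_len=1, max_len=4):
    """Count every n-gram of order min_len..max_len in a token list."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(hypothesis, reference, min_len=1, max_len=4):
    """Illustrative GLEU: min(precision, recall) over all extracted n-grams."""
    hyp_counts = all_ngrams(hypothesis.split(), min_len, max_len)
    ref_counts = all_ngrams(reference.split(), min_len, max_len)
    # Counter intersection keeps the smaller count of each shared n-gram.
    matches = sum((hyp_counts & ref_counts).values())
    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / hyp_total  # share of hypothesis n-grams found in the reference
    recall = matches / ref_total     # share of reference n-grams found in the hypothesis
    return min(precision, recall)
```

Taking the minimum penalizes both outputs that are too short (low recall) and outputs that pad with unmatched text (low precision), which is why the score behaves reasonably on single sentences.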

#### `rouge`

ROUGE is a software package and a set of metrics commonly used to evaluate machine translation and automatic summarization software in natural language processing. The metrics compare a machine-produced summary or translation against one or more human-produced references.

| Argument         | Expected Type | Description                                                                                                                                                                                                                                                             |
| ---------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `response_col`   | pd.Series     | \[Required] Pandas Series containing predictions (as strings) to score                                                                                                                                                                                                  |
| `references_col` | pd.Series     | \[Required] Pandas Series containing a reference, or list of several references, per prediction                                                                                                                                                                         |
| `rouge_types`    | List\[str]    | \[Optional] A list of ROUGE types to calculate. Defaults to `['rougeL']`. Valid ROUGE types: `rouge1` (unigram, i.e. 1-gram, based scoring), `rouge2` (bigram, i.e. 2-gram, based scoring), `rougeL` (longest common subsequence based scoring), `rougeLsum` (splits text on newlines, `'\n'`) |
| `use_stemmer`    | bool          | \[Optional] If True, uses Porter stemmer to strip word suffixes. Defaults to `False`.                                                                                                                                                                                   |

#### Code Example

```python theme={null}
from arize.pandas.generative.nlp_metrics import rouge

rouge_scores = rouge(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    rouge_types=['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],  # optional field
)
```
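As an illustration of what `rougeL` measures, the sketch below computes a longest-common-subsequence (LCS) based precision, recall, and F-measure for one prediction and one reference. It is a simplified stand-in rather than the SDK's implementation: the names `lcs_length` and `rouge_l` are hypothetical, tokenization is plain whitespace splitting, and no stemming is applied.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(prediction, reference):
    """Illustrative ROUGE-L scores for one prediction against one reference."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    precision = lcs / len(pred)  # how much of the prediction is in the LCS
    recall = lcs / len(ref)      # how much of the reference is in the LCS
    fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}
```

Unlike `rouge1` and `rouge2`, the LCS does not require matched words to be adjacent, only to appear in the same order, so it rewards sentence-level structure without fixing an n-gram size.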

#### `meteor`

METEOR is an automatic metric typically used to evaluate machine translation. It is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations.

| Argument         | Expected Type | Description                                                                                           |
| ---------------- | ------------- | ----------------------------------------------------------------------------------------------------- |
| `response_col`   | pd.Series     | \[Required] Pandas Series containing predictions (as strings) to score                                |
| `references_col` | pd.Series     | \[Required] Pandas Series containing a reference, or list of several references, per prediction       |
| `alpha`          | float         | \[Optional] Parameter for controlling relative weights of precision and recall. Default is `0.9`      |
| `beta`           | float         | \[Optional] Parameter for controlling shape of penalty as a function of fragmentation. Default is `3` |
| `gamma`          | float         | \[Optional] The relative weight assigned to fragmentation penalty. Default is `0.5`                   |

#### Code Example

```python theme={null}
from arize.pandas.generative.nlp_metrics import meteor

meteor_scores = meteor(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    alpha=0.8,  # optional field
    beta=4,  # optional field
    gamma=0.4,  # optional field
)
```
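To see how `alpha`, `beta`, and `gamma` interact, the sketch below applies the commonly used METEOR scoring formula to precomputed alignment counts. It assumes the unigram alignment has already been done, so the inputs are the number of matched unigrams and the number of contiguous matched chunks; the function name `meteor_from_counts` is hypothetical and this is not the SDK's implementation.

```python
def meteor_from_counts(matches, chunks, hyp_len, ref_len,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Illustrative METEOR score from alignment counts.

    matches: number of aligned unigrams
    chunks:  number of contiguous, in-order runs the aligned unigrams form
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # alpha weights recall against precision in the harmonic mean.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # More chunks means a more fragmented alignment; beta shapes the penalty
    # curve and gamma caps its weight.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


# Same matches, but a more fragmented alignment scores lower.
print(meteor_from_counts(matches=6, chunks=1, hyp_len=6, ref_len=6))
print(meteor_from_counts(matches=6, chunks=3, hyp_len=6, ref_len=6))
```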
