Embed the model output and the reference with an embeddings model, then report their cosine similarity. This is the standard fuzzy-match check for free-text outputs — wording differences shouldn’t count as failures as long as the meaning matches. The example uses OpenAI’s text-embedding-3-small. The same shape works for any HTTP embeddings endpoint; swap the client and model name to switch providers.

Code

import math
import os

from openai import OpenAI

_MODEL = "text-embedding-3-small"
# Reads the key from the sandbox environment; raises KeyError if it is unset.
_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def _embed(text):
    """Return the embedding vector for a single string."""
    response = _client.embeddings.create(model=_MODEL, input=text)
    return response.data[0].embedding


def _cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def evaluate(output, reference):
    # Short-circuit before calling the API when either side is empty or None.
    if not output or not reference:
        return {
            "label": "missing",
            "score": 0.0,
            "explanation": "Missing output or reference.",
        }

    # Two embedding requests per call: one for output, one for reference.
    similarity = _cosine(_embed(str(output)), _embed(str(reference)))
    return {
        "score": similarity,
        "explanation": (
            f"Cosine similarity {similarity:.4f} (model={_MODEL})."
        ),
    }
Sandbox dependencies — paste into the sandbox configuration’s Dependencies field, one package per line:
openai
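
For a provider without a Python SDK, the same embed step can be written against a plain HTTP endpoint. The following is a minimal sketch using only the standard library, so it needs no sandbox dependency. It assumes the endpoint accepts an OpenAI-style request body and returns the OpenAI-style response shape (`data[0].embedding`); the names `build_request`, `parse_embedding`, and `embed_http` are hypothetical, and another provider may use a different URL, header, or response layout.

```python
import json
import os
import urllib.request

# Assumption: the endpoint speaks the OpenAI-style embeddings protocol.
# Swap the URL and model name for another provider and check its API reference.
EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"
MODEL = "text-embedding-3-small"


def build_request(text, api_key, url=EMBEDDINGS_URL, model=MODEL):
    """Build (but do not send) a POST request for one embedding."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_embedding(payload):
    """Pull the first embedding vector out of a decoded JSON response."""
    return payload["data"][0]["embedding"]


def embed_http(text, timeout=30):
    """Send the request and return the embedding as a list of floats."""
    request = build_request(text, os.environ["OPENAI_API_KEY"])
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding(json.load(response))
```

With this in place, `embed_http` could stand in for `_embed` in the evaluator; the cosine and `evaluate` logic would not need to change.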

Input mapping

| Parameter | Bind to |
| --- | --- |
| output | The model output to score, usually output. |
| reference | The ground-truth string, usually reference. |

Output configuration

Continuous score in the range -1.0 to 1.0 (cosine similarity). Optimization direction: maximize. In practice, OpenAI’s text-embedding-3 models produce non-negative similarities on natural-language pairs, so a 0.0 to 1.0 range with a low-end threshold (e.g. 0.7 for “close enough”) is also reasonable.
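
If a pass/fail label is useful alongside the continuous score, the threshold can be applied in a small helper. This is a sketch only: the function name `to_label`, the label strings, and the 0.7 default are illustrative choices, not part of the evaluator above, and the right cutoff depends on your data.

```python
def to_label(similarity, threshold=0.7):
    """Clamp a cosine similarity into [0.0, 1.0] and attach a pass/fail label."""
    # Clamping discards negative similarities and absorbs tiny floating-point
    # overshoot above 1.0.
    clamped = max(0.0, min(1.0, similarity))
    label = "match" if clamped >= threshold else "mismatch"
    return {"label": label, "score": clamped}


print(to_label(0.83))   # {'label': 'match', 'score': 0.83}
print(to_label(-0.2))   # {'label': 'mismatch', 'score': 0.0}
```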

Runtime requirements

| Setting | Value |
| --- | --- |
| Sandbox | A hosted backend that matches your language. Python: E2B, Daytona — Python, Vercel Sandbox — Python, or Modal. TypeScript: Daytona — TypeScript or Vercel Sandbox — TypeScript (the local Deno sandbox is started with --no-npm and cannot install the openai package). |
| Dependencies | Python: openai. TypeScript: openai (npm). Add it under Dependencies when creating the sandbox configuration. |
| Internet access | Required. Toggle Allow Internet Access on for the configuration. The sandbox must reach api.openai.com. |
| Environment variables | OPENAI_API_KEY, preferably set as a secret reference to a key in Settings → Secrets, not a literal value. |

Each evaluate(...) call makes two embedding requests (one for output, one for reference). When running this across a large dataset:
  • Raise the sandbox configuration’s Timeout if the default is too tight for a cold-start install plus two API calls.
  • Watch the upstream provider’s rate limits and per-token cost — at production volume this adds up fast.
  • If reference is fixed across many examples (e.g. a shared gold answer), pre-compute its embedding once and store it on the example. The evaluator then needs only one API call per row, or none at all if you also pre-embed the output.
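
When storing embeddings on the example is not practical, an in-process cache gives a similar saving for a shared reference. The sketch below is one possible approach, not the documented method: `make_cached_embedder` is a hypothetical helper, and it only helps if the sandbox process is reused across rows, which is an assumption to verify for your backend. A stand-in embedder is used so the example runs without an API key.

```python
def make_cached_embedder(embed_fn):
    """Wrap an embedding function so each distinct text is embedded only once."""
    cache = {}

    def cached(text):
        if text not in cache:
            cache[text] = embed_fn(text)
        return cache[text]

    return cached


# Stand-in embedder for illustration only: it records each call instead of
# hitting an API. In the evaluator you would wrap the real embed function.
calls = []


def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 1.0]


embed = make_cached_embedder(fake_embed)
gold = "shared gold answer"
for output in ["answer one", "answer two", "answer three"]:
    embed(output)
    embed(gold)

print(len(calls))  # 4: three outputs plus a single call for the reference
```

The cache is keyed on the exact text, so it saves nothing when every row has a distinct reference.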