Hallucination (Deprecated)

Deprecated: The Hallucination evaluator is deprecated and will be removed in a future version. Please use the Faithfulness evaluator instead.

Migration Guide

The Hallucination evaluator has been superseded by the Faithfulness evaluator, which uses clearer terminology and a more intuitive scoring direction.

Key Differences

Aspect	Hallucination (Deprecated)	Faithfulness (Recommended)
Labels	`factual` / `hallucinated`	`faithful` / `unfaithful`
Score direction	Minimize (0.0 = good)	Maximize (1.0 = good)
Score meaning	0.0 = factual, 1.0 = hallucinated	1.0 = faithful, 0.0 = unfaithful
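
The table implies a direct mapping between the two evaluators' outputs, which can help when converting stored results. The following is a minimal sketch, assuming scores are the binary 0.0/1.0 values shown above. The helper names `to_faithfulness_label` and `to_faithfulness_score` are hypothetical and are not part of phoenix.evals.

```python
# Hypothetical helpers for converting stored Hallucination results.
# These are illustrative only and are not part of phoenix.evals.
LABEL_MAP = {"factual": "faithful", "hallucinated": "unfaithful"}


def to_faithfulness_label(hallucination_label: str) -> str:
    """Translate a legacy Hallucination label to its Faithfulness equivalent."""
    return LABEL_MAP[hallucination_label]


def to_faithfulness_score(hallucination_score: float) -> float:
    """Flip the score direction: 0.0 (factual) becomes 1.0 (faithful)."""
    return 1.0 - hallucination_score


if __name__ == "__main__":
    print(to_faithfulness_label("hallucinated"), to_faithfulness_score(1.0))
```

An unknown label raises `KeyError` here, which is usually preferable to silently passing through a value that neither evaluator defines.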

Migration Example

Python

Before (deprecated):

from phoenix.evals import LLM
from phoenix.evals.metrics import HallucinationEvaluator

llm = LLM(provider="openai", model="gpt-4o")
# This will emit a deprecation warning
hallucination_eval = HallucinationEvaluator(llm=llm)

scores = hallucination_eval.evaluate({
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
})
# score=0.0 means factual (good), score=1.0 means hallucinated (bad)

After (recommended):

from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator

llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)

scores = faithfulness_eval.evaluate({
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
})
# score=1.0 means faithful (good), score=0.0 means unfaithful (bad)

TypeScript

Before (deprecated):

import { createHallucinationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

// Deprecated
const hallucinationEvaluator = createHallucinationEvaluator({
  model: openai("gpt-4o"),
});

const result = await hallucinationEvaluator.evaluate({
  input: "What is the capital of France?",
  output: "Paris is the capital of France.",
  context: "Paris is the capital and largest city of France.",
});
// score=0 means factual (good), score=1 means hallucinated (bad)

After (recommended):

import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model: openai("gpt-4o"),
});

const result = await faithfulnessEvaluator.evaluate({
  input: "What is the capital of France?",
  output: "Paris is the capital of France.",
  context: "Paris is the capital and largest city of France.",
});
// score=1 means faithful (good), score=0 means unfaithful (bad)

Updating Score Interpretation

If you have existing code that interprets hallucination scores, you’ll need to update your logic:

# Old: Hallucination score (minimize - lower is better)
if hallucination_score < 0.5:
    print("Response is factual")

# New: Faithfulness score (maximize - higher is better)
if faithfulness_score > 0.5:
    print("Response is faithful")
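
While a codebase still holds scores from both evaluators, a direction-aware check can keep the threshold logic in one place. This is a sketch under stated assumptions: the `passes` function, the `Direction` alias, and the default threshold of 0.5 are illustrative choices taken from the comparison above, not part of phoenix.evals.

```python
from typing import Literal

# Hypothetical type alias mirroring the "minimize" / "maximize" directions above.
Direction = Literal["minimize", "maximize"]


def passes(score: float, direction: Direction, threshold: float = 0.5) -> bool:
    """Return True when the score falls on the good side of the threshold."""
    if direction == "minimize":
        # Legacy Hallucination scores: lower is better.
        return score < threshold
    # Faithfulness scores: higher is better.
    return score > threshold


if __name__ == "__main__":
    # A factual/faithful response passes under either convention.
    print(passes(0.0, "minimize"), passes(1.0, "maximize"))
```

Both comparisons are strict, as in the original snippets, so a score exactly at the threshold passes under neither direction.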

Why the Change?

The Faithfulness evaluator provides several improvements:

Intuitive scoring: Higher scores indicate better outcomes, matching the convention used by most evaluation metrics.
Clearer terminology: “Faithful/unfaithful” more accurately describes the relationship between the response and its context.
Consistency: Aligns with other evaluators that use the maximize direction.
