Deprecated: The Hallucination evaluator is deprecated and will be removed in a future version. Please use the Faithfulness evaluator instead.

Migration Guide

The Hallucination evaluator has been superseded by the Faithfulness evaluator, which uses clearer terminology and a more intuitive scoring direction.

Key Differences

| Aspect | Hallucination (Deprecated) | Faithfulness (Recommended) |
| --- | --- | --- |
| Labels | factual / hallucinated | faithful / unfaithful |
| Score direction | Minimize (0.0 = good) | Maximize (1.0 = good) |
| Score meaning | 0.0 = factual, 1.0 = hallucinated | 1.0 = faithful, 0.0 = unfaithful |

Migration Example

Before (deprecated):
from phoenix.evals import LLM
from phoenix.evals.metrics import HallucinationEvaluator

llm = LLM(provider="openai", model="gpt-4o")
# This will emit a deprecation warning
hallucination_eval = HallucinationEvaluator(llm=llm)

scores = hallucination_eval.evaluate({
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
})
# score=0.0 means factual (good), score=1.0 means hallucinated (bad)

After (recommended):
from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator

llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)

scores = faithfulness_eval.evaluate({
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
})
# score=1.0 means faithful (good), score=0.0 means unfaithful (bad)
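
Because the deprecated evaluator emits a deprecation warning, one way to locate leftover uses during a migration is to record warnings around the code under review. The sketch below uses only the standard `warnings` module. It assumes the library issues a standard `DeprecationWarning`, which this guide does not confirm, and `deprecated_evaluator_stub` is a hypothetical stand-in for constructing `HallucinationEvaluator` so the example runs without an LLM.

```python
import warnings


def deprecated_evaluator_stub():
    """Hypothetical stand-in for constructing the deprecated evaluator."""
    warnings.warn(
        "HallucinationEvaluator is deprecated; use FaithfulnessEvaluator",
        DeprecationWarning,  # assumed category; check what the library actually emits
        stacklevel=2,
    )
    return "evaluator"


def find_deprecations(fn):
    """Call fn and return the messages of any DeprecationWarning it raised."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones the default filters would hide.
        warnings.simplefilter("always")
        fn()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]


messages = find_deprecations(deprecated_evaluator_stub)
for message in messages:
    print(f"Deprecated usage found: {message}")
```

Under the same assumption, running a test suite with `python -W error::DeprecationWarning` turns such warnings into exceptions, which makes remaining call sites fail loudly.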

Updating Score Interpretation

If existing code interprets hallucination scores, invert its comparisons, because a low hallucination score corresponds to a high faithfulness score:
# Old: Hallucination score (minimize - lower is better)
if hallucination_score < 0.5:
    print("Response is factual")

# New: Faithfulness score (maximize - higher is better)
if faithfulness_score > 0.5:
    print("Response is faithful")
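
If stored results from the deprecated evaluator need to be compared with new ones, the Key Differences table suggests a simple conversion. The helpers below are a minimal sketch and are not part of `phoenix.evals`: the names `to_faithfulness_score`, `to_faithfulness_label`, and `LABEL_MAP` are hypothetical, and the `1.0 - score` mapping assumes scores lie in the range 0.0 to 1.0 with the endpoints meaning what the table states.

```python
# Hypothetical helpers for converting stored hallucination results.
# Label pairs are taken from the Key Differences table.
LABEL_MAP = {"factual": "faithful", "hallucinated": "unfaithful"}


def to_faithfulness_score(hallucination_score: float) -> float:
    """Flip a hallucination score (0.0 = good) into a faithfulness score (1.0 = good)."""
    if not 0.0 <= hallucination_score <= 1.0:
        raise ValueError(f"score out of range: {hallucination_score}")
    return 1.0 - hallucination_score


def to_faithfulness_label(hallucination_label: str) -> str:
    """Map a deprecated label to its faithfulness equivalent."""
    try:
        return LABEL_MAP[hallucination_label]
    except KeyError:
        raise ValueError(f"unknown label: {hallucination_label!r}") from None


print(to_faithfulness_score(0.0), to_faithfulness_label("factual"))
```

Converting old records once, rather than keeping two comparison directions in the codebase, keeps thresholds such as `> 0.5` meaningful across old and new data.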

Why the Change?

The Faithfulness evaluator provides several improvements:
  1. Intuitive scoring: Higher scores = better outcomes, which aligns with most evaluation metrics
  2. Clearer terminology: “Faithful/unfaithful” more accurately describes the relationship between response and context
  3. Consistency: Aligns with other evaluators that use maximize direction

See Also