How Booking.com Personalizes Travel Planning with AI Trip Planner and Arize AI

Executive Summary

Booking.com, a leader in online travel, has leveraged artificial intelligence to revolutionize trip planning with its AI Trip Planner. This innovative tool combines domain-specific optimizations, in-house fine-tuned LLMs, and real-time monitoring powered by Arize AI to deliver highly personalized travel recommendations. The AI Trip Planner integrates seamlessly into the user journey, from inspiration to booking, driving improved accuracy, efficiency, and user satisfaction.

By adopting a modular and iterative development approach, Booking.com enhanced system performance while reducing latency and costs. This case study explores the challenges they faced, the innovative solutions implemented, and the results achieved, offering valuable insights for organizations seeking to harness AI for their own applications.

Problem

The modern traveler demands seamless, personalized, and efficient digital experiences. While generic chatbot systems offer a foundation for conversational AI, they fall short in addressing the specific needs of the travel domain. Key challenges faced by Booking.com included:

  • Domain-Specific Limitations: Generalized LLMs often lacked the accuracy and personalization needed for tailored travel recommendations, leading to hallucinations and irrelevant suggestions.
  • High Latency and Costs: Reliance on third-party LLMs introduced performance bottlenecks, unpredictable outages, and escalating operational costs.
  • Complexity in Orchestration: Managing interactions between conversational AI, recommendation engines, and backend systems required a robust and scalable orchestration framework.
  • Evaluation Gaps: Ensuring the quality of AI outputs, from factual accuracy to user relevance, was critical but challenging without robust evaluation mechanisms.

Solutions

To overcome these challenges, Booking.com implemented key innovations:

  • GenAI Orchestrator: A centralized system to manage conversational flows, transforming free-text inputs into structured data and integrating seamlessly with Booking.com’s backend services.
  • Arize AI’s Comprehensive Evaluation Framework: Combined offline evaluations for controlled pre-deployment testing and online evaluations using production data. Leveraged Arize AI to monitor live outputs and ensure consistent quality, with actionable insights from aggregated metrics and real-time alerts.
  • Arize AI Monitoring and Dashboards: Used Arize AI’s real-time monitoring, alerts, and dashboards to track key metrics like factual accuracy, context relevance, and performance trends. These tools enabled faster debugging and optimization of the Trip Planner.
  • In-House Fine-Tuned LLMs: Leveraged Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA and QLoRA to improve latency and accuracy while cutting costs. These models replaced third-party LLMs, boosting accuracy by 13% and reducing response times 5x.

How Booking.com Built a Next Generation AI Travel Planner

The AI Trip Planner is much more than a chatbot. While it starts with conversational AI, its true strength lies in its domain-specific optimizations. It provides travel recommendations tailored to individual preferences using Booking.com’s vast internal data and machine learning models.

Some standout features include:

  • Personalized Recommendations: Accurate suggestions using contextual user data, such as travel dates, group type, and destination preferences.
  • Explainable Results: Clear explanations for recommendations, helping users make informed choices.
  • Seamless Funnel Integration: From conversation to booking, the planner transitions users smoothly through the travel planning process.

The AI Trip Planner goes beyond simple API calls to external LLMs by incorporating Booking.com’s in-house models, fine-tuned to the unique needs of the travel domain.

Not Just Another ChatGPT App

How is this better than ChatGPT calls?

  • Optimized for many travel-domain scenarios through a detailed architecture
  • Personalized recommendations (more accurate, fewer hallucinations)
  • Uses more information to produce results (Booking.com’s content, user context)
  • Explainable recommendations
  • Rich UX (carousels, Booking.com funnel)
  • Easy integration with in-house LLMs

Building Blocks of the AI Trip Planner

AI Trip Planner components:

  • GenAI Orchestrator: Coordinates and manages the dialog flows, e.g. moderation, dialog persistence, mapping to Booking entities, and communication with multiple services (recommendation pipelines, NLP Service, SmartAV, C360…)
  • NLP Service: Manages the sequence of LLM calls that together create a natural conversational flow, including intent understanding and the transformation of queries into structured data.
  • Recommendation Platform: Creates relevant sets of candidates (properties or destinations) by building recommendation strategies based on user preferences (themes, traveler group, etc.)

NLP Service

NLP Service Ecosystem chart
The NLP Service plays a central role in moderating conversations, classifying user intents, and structuring data for downstream use. This modular approach allows Booking.com to integrate both third-party and in-house LLMs for tasks such as:

  • Intent Classification: Identifying what users need based on their inputs.
  • Data Structuring: Transforming unstructured user inputs into structured formats, enabling efficient processing.
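
To make the modular idea concrete, here is a minimal sketch of intent classification behind a swappable backend. The `LLMBackend` signature, the intent labels, and the keyword stub are all hypothetical stand-ins chosen for illustration; the source does not describe Booking.com's actual interfaces or label set.

```python
from typing import Callable

# Hypothetical backend signature: takes a prompt string, returns the model's text reply.
# A third-party API client or an in-house model could both be wrapped this way.
LLMBackend = Callable[[str], str]

# Illustrative labels only; not Booking.com's real intent taxonomy.
INTENTS = ("search_destination", "search_property", "other")


def classify_intent(user_message: str, backend: LLMBackend) -> str:
    """Ask the backend for one intent label; fall back to 'other' on anything unexpected."""
    prompt = (
        "Classify the traveler message into one of: "
        + ", ".join(INTENTS)
        + ".\nMessage: " + user_message + "\nLabel:"
    )
    label = backend(prompt).strip().lower()
    return label if label in INTENTS else "other"


def keyword_stub_backend(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    message = prompt.split("Message: ", 1)[1].lower()
    if "hotel" in message:
        return "search_property"
    if "where" in message or "destination" in message:
        return "search_destination"
    return "unsure"  # deliberately invalid, to exercise the fallback


print(classify_intent("Find me a hotel in Rome", keyword_stub_backend))
```

Because the classifier only depends on the callable, swapping a third-party model for an in-house one would not change the calling code, which is the property the modular design is after.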

Recommendation Platform

This centralized platform enables Booking.com to serve personalized recommendations efficiently. Whether it’s properties, attractions, or flights, the platform aggregates and enriches data from internal systems to provide the best options.

Centralized platform for all recommendations in Booking.com

GenAI Orchestrator

The orchestrator bridges the NLP Service, recommendation systems, and Booking.com’s APIs, ensuring a smooth user experience. It manages dynamic conversational flows and adapts to varied user needs, making the planner flexible and robust.

GenAI Orchestrator as the core GenAI gateway.

GenAI Orchestrator: The Central Hub of AI Workflows

At the heart of Booking.com’s AI Trip Planner is the GenAI Orchestrator, a centralized system that ensures seamless integration and communication between multiple services and components. Its primary role is to transform unstructured user inputs into structured, actionable data that drives accurate recommendations.

Key Functions of the GenAI Orchestrator

  1. Intent Understanding and Data Structuring
    The orchestrator identifies user intent from free-text inputs (e.g., “I want to travel to Paris in August”) and translates this into structured JSON formats that Booking.com’s systems can interpret:

    {
        "Location": {"country": "France", "city": "Paris"},
        "checkin_month": 8
    }
  2. Integration with Booking Services
    By resolving user queries to internal identifiers (e.g., location IDs, property IDs), the orchestrator bridges the gap between conversational AI and Booking.com’s backend systems. This enables the delivery of enriched and actionable recommendations.
  3. Dynamic Orchestration
    It manages calls to both internal services and external or in-house LLMs. For example:

    • LLM Calls: Generating text-based answers or performing semantic tasks like summarization.
    • Recommendation APIs: Fetching nearby hotels, attractions, or personalized travel deals.
  4. Customized Outputs for the User Interface
    The orchestrator tailors its responses to match the desired front-end format. For instance, it can provide:

    • Interactive Bubbles: For conversational prompts.
    • Rich Recommendations: Carousel-based visuals showing properties or attractions.
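
The first two functions above can be sketched in a few lines: validate the JSON an LLM returns, then resolve the location to an internal identifier. The `TripQuery` class, the `LOCATION_IDS` table, and the ID value are hypothetical; a real system would presumably call an internal location service rather than use a dictionary.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table standing in for an internal location-resolution service.
LOCATION_IDS = {("France", "Paris"): "loc-paris-example"}


@dataclass
class TripQuery:
    country: str
    city: str
    checkin_month: Optional[int]
    location_id: Optional[str]  # None when the location cannot be resolved


def parse_llm_output(raw: str) -> TripQuery:
    """Validate the model's JSON and resolve the location to an internal identifier."""
    data = json.loads(raw)
    location = data.get("Location", {})
    country = location.get("country", "")
    city = location.get("city", "")
    month = data.get("checkin_month")
    if month is not None and not (isinstance(month, int) and 1 <= month <= 12):
        raise ValueError(f"checkin_month out of range: {month!r}")
    return TripQuery(country, city, month, LOCATION_IDS.get((country, city)))


# Same structure as the JSON example shown above.
raw = '{"Location": {"country": "France", "city": "Paris"}, "checkin_month": 8}'
query = parse_llm_output(raw)
print(query)
```

Validating before resolving keeps malformed model output from reaching backend services, which is one plausible reason to centralize this step in an orchestrator.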

Why do we need an Orchestrator?

The orchestrator simplifies complexity, making the system modular and scalable. Booking.com built it in-house to ensure reliability and adaptability, overcoming the limitations of off-the-shelf solutions that struggled in production environments.

The orchestrator is responsible for resolving textual responses to Booking services.

Evaluating AI Performance with Arize AI

To ensure the Trip Planner meets user expectations, Booking.com implemented rigorous evaluation processes. This includes offline and online testing of metrics such as:

  • Factual Accuracy: Verifying responses against Booking.com’s internal data.
  • Context Relevance: Ensuring outputs align with user queries.
  • Answer Relevance: Guaranteeing that answers are meaningful and useful.

Online vs Offline Evaluation: Ensuring Continuous Improvement

Booking.com places a strong emphasis on evaluating its AI systems both offline (in controlled settings) and online (in real-world environments). This dual approach ensures robust, reliable performance at every stage.

Offline Evaluation

Offline evaluations are conducted before deploying updates to production. The process includes:

  1. Dataset Creation: Annotated datasets with labeled examples for intent classification, recommendations, or other tasks.
  2. Model Comparison: Testing different prompts, models, or configurations against key metrics like accuracy and cost.
  3. Iterative Optimization: Fine-tuning prompts or models based on feedback from evaluative metrics such as:
    • Factual Accuracy: Does the model provide correct answers based on context?
    • Context Relevance: Is the retrieved document or data relevant to the user’s query?
    • Answer Relevance: Is the response meaningful and aligned with the query?
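
As a rough illustration of the model-comparison step, the sketch below scores two candidate configurations against a small labeled dataset and reports accuracy alongside cost. The dataset, the two rule-based "configurations", and the per-call costs are invented for the example; they stand in for real prompts or models and are not Booking.com data.

```python
from typing import Callable, Dict, List, Tuple

# Tiny illustrative dataset of (user message, expected intent). Not real data.
DATASET: List[Tuple[str, str]] = [
    ("Find a hotel in Rome", "search_property"),
    ("Where is warm in March?", "search_destination"),
    ("Book me a hotel near the beach", "search_property"),
    ("Thanks!", "other"),
]


def evaluate(predict: Callable[[str], str], cost_per_call: float) -> Dict[str, float]:
    """Return accuracy and total cost for one configuration over the dataset."""
    correct = sum(1 for text, label in DATASET if predict(text) == label)
    return {"accuracy": correct / len(DATASET), "cost": cost_per_call * len(DATASET)}


def config_a(text: str) -> str:
    # Stand-in for one prompt/model configuration.
    return "search_property" if "hotel" in text.lower() else "other"


def config_b(text: str) -> str:
    # Stand-in for a second configuration that also handles destination questions.
    lowered = text.lower()
    if "hotel" in lowered:
        return "search_property"
    if "where" in lowered:
        return "search_destination"
    return "other"


# Hypothetical per-call costs, chosen only to show the accuracy/cost trade-off.
report = {"config_a": evaluate(config_a, 0.002), "config_b": evaluate(config_b, 0.001)}
best = max(report, key=lambda name: report[name]["accuracy"])
print(report, best)
```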

Online Evaluation

Online evaluations involve monitoring live user interactions and system performance to ensure the model behaves as expected in production. Key steps include:

  1. Random Sampling: Sampling production data to evaluate outputs without overwhelming the system or adding excessive costs.
  2. Real-Time Alerts: Dashboards powered by Arize AI track metrics like latency, user interactions, and recommendation quality. Alerts signal anomalies, such as a dip in factual accuracy.
  3. Aggregated Metrics: Metrics like average session length, click-through rates, and conversion rates are monitored to assess the overall system health and user satisfaction.
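
The sampling and alerting steps can be sketched generically as follows. This does not use the Arize SDK; the function names, the sampling rate, and the threshold are assumptions made for illustration only.

```python
import random
from typing import List, Optional, Sequence


def sample_for_evaluation(records: Sequence[dict], rate: float,
                          seed: Optional[int] = None) -> List[dict]:
    """Keep each production record with probability `rate` to bound evaluation cost."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < rate]


def should_alert(scores: List[float], threshold: float, min_count: int = 20) -> bool:
    """Alert when the mean score of a sufficiently large window drops below the threshold."""
    if len(scores) < min_count:
        return False  # too few samples to judge reliably
    return sum(scores) / len(scores) < threshold


# Hypothetical production traffic: 1,000 logged responses.
traffic = [{"id": i} for i in range(1000)]
sampled = sample_for_evaluation(traffic, rate=0.05, seed=7)
print(len(sampled), "records sampled for evaluation")

# Hypothetical factual-accuracy scores for one window of sampled responses.
print(should_alert([0.5] * 20, threshold=0.8))
```

Requiring a minimum sample count before alerting is one simple way to avoid noisy alerts on small windows.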

Benefits of Combining Both Methods

  • Offline Evaluations: Enable detailed, controlled analysis for pre-deployment confidence.
  • Online Evaluations: Provide real-world insights into system performance, capturing nuances that may not appear in offline tests.

Booking.com’s ability to blend these evaluation strategies ensures high-quality outputs while maintaining agility in adapting to user needs and business objectives.

Process to build an LLM evaluator

Arize AI plays a crucial role in this ecosystem, providing:

  • Real-Time Monitoring: Dashboards track conversations, recommendations, and system health, alerting the team to potential issues.
  • Evaluation Metrics: Continuous feedback loops for both offline and online evaluations.
  • Debugging and Iteration: With Arize’s tools, Booking.com can pinpoint and resolve issues swiftly, keeping the system optimized.
Monitoring in AI Trip Planner (Slide from original webinar)

Fine-Tuning Techniques for Optimized Performance

Fine-tuning was a pivotal step in transforming Booking.com’s AI Trip Planner from a basic system using out-of-the-box (OOTB) LLMs into a domain-specific powerhouse. By adapting general models to Booking.com’s travel domain, the team achieved significant gains in accuracy, latency, and cost efficiency.

The Evolution of Fine-Tuning

  1. Starting Simple
    Initially, the team employed OOTB LLMs with carefully crafted prompts. Techniques like few-shot prompting and chain-of-thought reasoning were used to improve response quality incrementally.
  2. Optimization Through Prompt Engineering
    After deploying initial versions, they enhanced prompts by:

    • Including examples specific to travel queries.
    • Applying contextual reasoning to guide LLMs in structured tasks, such as itinerary generation.
  3. Transition to In-House Models
    As the limitations of third-party LLMs (e.g., latency, cost, occasional unreliability) became apparent, Booking.com moved to train their own fine-tuned LLMs. They leveraged real-world interactions and annotated datasets to align models with user needs.
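
For readers unfamiliar with few-shot prompting, the sketch below shows one way to assemble such a prompt for the structured-data task, reusing the Paris example from earlier in this post. The instruction wording and the `build_prompt` helper are hypothetical, not the team's actual prompts.

```python
from typing import List, Tuple

# One worked example, taken from the Paris request shown earlier in this post.
FEW_SHOT_EXAMPLES: List[Tuple[str, str]] = [
    (
        "I want to travel to Paris in August",
        '{"Location": {"country": "France", "city": "Paris"}, "checkin_month": 8}',
    ),
]


def build_prompt(user_message: str,
                 examples: List[Tuple[str, str]] = FEW_SHOT_EXAMPLES) -> str:
    """Prepend worked request/JSON pairs so the model can imitate the output format."""
    lines = ["Convert the traveler request into JSON."]
    for request, answer in examples:
        lines.append(f"Request: {request}")
        lines.append(f"JSON: {answer}")
    lines.append(f"Request: {user_message}")
    lines.append("JSON:")  # the model is expected to continue from here
    return "\n".join(lines)


prompt = build_prompt("Rome in May")
print(prompt)
```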

Parameter-Efficient Fine-Tuning (PEFT)

Booking.com adopted parameter-efficient fine-tuning techniques to train models cost-effectively without updating all of a model’s weights. These methods included:

  1. LoRA (Low-Rank Adaptation)
    • Introduced adapter layers (low-rank matrices) into the model.
    • Only trained these new layers while freezing the rest of the parameters, making the process computationally lightweight.
    • Resulted in a 5x latency improvement and significant cost reduction.
  2. QLoRA (Quantized LoRA)
    • Extended LoRA by quantizing the frozen base-model weights (4-bit in the original QLoRA method) for even greater memory efficiency.
    • Allowed fine-tuning with a fraction of the memory that full-precision training needs, further reducing training costs while maintaining performance.
  3. Prompt Tuning
    • Added tunable virtual tokens to the prompt itself, leaving the main model unchanged.
    • Useful for domain-specific customization without significant resource investment.
  4. Prefix Tuning
    • Trained a small set of continuous prefix vectors prepended to the input sequence, giving task-specific optimization while the base model’s weights stay unchanged.
Training/Fine-Tune Techniques
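
To show why LoRA is lightweight, here is a minimal pure-Python sketch of the low-rank update, h = W x + (alpha / r) · B (A x), where W stays frozen and only the small matrices A and B are trained. The layer size, rank, and alpha below are illustrative assumptions; the source does not state which values Booking.com used.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """A is (rank x d_in) and B is (d_out x rank), so only rank * (d_in + d_out) weights train."""
    return rank * (d_in + d_out)


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x); W is frozen, A and B are the adapter."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Parameter count for one hypothetical 4096 x 4096 projection with rank 8.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(full, lora, lora / full)

# With B initialized to zeros (the usual LoRA starting point), the output equals the base model's.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
print(lora_forward(W, A, [[0.0], [0.0]], [2.0, 3.0], alpha=1.0))
```

In this example the adapter trains 1/256 of the weights of the full matrix, which illustrates where the training-cost savings come from.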

Data-Centric Fine-Tuning

The fine-tuning process was heavily data-driven. Key steps included:

  1. Data Collection and Annotation
    Logged real-world user interactions and curated datasets for intent detection, query classification, and recommendation tasks.
    Annotators labeled examples to provide high-quality training data.
  2. Supervised Fine-Tuning
    Trained models to follow task-specific instructions.
    Used datasets that paired user inputs (prompts) with desired outputs (responses). Example:

    {
        "prompt": "Recommend a romantic destination for a couple traveling in June.",
        "response": "How about Santorini, Greece? It offers beautiful sunsets and luxurious accommodations."
    }
  3. Direct Preference Optimization (DPO)
    Collected human feedback for tasks like summarization. Annotators selected the most appropriate response from multiple model outputs (e.g., “chosen” vs. “rejected”).
    Trained models to align responses with human preferences.
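
For the preference-optimization step, the sketch below shows what a "chosen" vs. "rejected" record might look like and how the standard DPO loss scores one pair from sequence log-probabilities. The record contents and the numeric log-probabilities are made up for illustration, and `beta=0.1` is an assumed setting rather than a value from the source.

```python
import math

# Hypothetical preference record in the chosen/rejected shape described above.
preference_record = {
    "prompt": "Summarize this hotel's reviews in one sentence.",
    "chosen": "Guests praise the central location and quiet rooms.",
    "rejected": "It is a hotel.",
}


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * margin).

    The margin measures how much more the policy prefers the chosen response
    over the rejected one, relative to a frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


# Policy identical to the reference: margin is 0 and the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(neutral, improved)
```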

Results of Fine-Tuning

The application of these techniques delivered impressive improvements:

  • Accuracy Boost: 13% increase in intent detection and recommendation quality.
  • Latency Reduction: Responses were 5x faster due to in-house fine-tuning and efficient architectures.
  • Cost Savings: Substantial reduction in third-party LLM usage by leveraging internal resources and optimized tuning methods.

Key Takeaways for Building AI-Powered Systems

From their experience, Booking.com shared valuable lessons for developing robust AI applications:

  1. Start Small, Grow Gradually: Begin with simple models and refine over time.
  2. Evaluate Continuously: Use detailed metrics to monitor and improve each component.
  3. Own Your System: Building in-house models and tools ensures better control over performance and scalability.
  4. Modular Design: Develop components that can be reused across different applications, enhancing efficiency.

Shaping the Future of Travel with AI

Booking.com’s AI Trip Planner showcases how cutting-edge AI can transform the travel experience, making it more personalized and user-friendly. With Arize AI supporting their evaluation and monitoring processes, the company continues to push the boundaries of what’s possible in the travel industry.

As more businesses adopt AI-driven solutions, Booking.com serves as a stellar example of innovation, collaboration, and the power of tailored technology.

Watch the full discussion with Booking.com about how they built AI Trip Planner and other lessons from the trenches.