How GetYourGuide Powers Millions of Real-Time Rankings with Production AI
This piece is co-authored by: Martin Jewell, Senior MLOps Engineer at GetYourGuide; Greg Chase, Machine Learning Solutions Engineer at Arize AI; and Mihir Mathur, Product Manager at Tecton
GetYourGuide, a leading global online platform for discovering and booking travel experiences worldwide, makes over 30 million ranking predictions daily, shaping the journeys of millions of customers. Each prediction plays a crucial role in presenting the most relevant and personalized results for search queries quickly – with rankings in under 80 milliseconds. This ranking system is an important part of discovery throughout GetYourGuide’s search features, influencing everything from the home page to specific destination pages, and it’s a critical driver of business impact.
Creating a machine learning system that meets such demands and maintains performance is no small feat. GetYourGuide faced several challenges while building its search ranking system:
- Diverse Feature Types: GetYourGuide’s ranking algorithm uses a wide range of features, from activity-level information such as historical performance and activity content to user interactions like activity views and bookings, to personalize each search ranking to best match a visitor’s unique interests. These features combine multiple data warehouse tables and Kafka events, and they are not always easy to compute on the fly.
- Running Realtime Feature Pipelines: Historically, GetYourGuide used only batch-computed features served at inference time. However, over time, the team realized that features generated in real-time in response to visitors exploring the activity selection were extremely valuable. Implementing and maintaining such features can be challenging as they require complex data streaming infrastructure, robust feature engineering, and tight performance monitoring.
- Cost-Efficient Serving: GetYourGuide’s traffic patterns are significantly seasonal, which requires adaptable and scalable infrastructure to remain cost-efficient.
- A/B Testing: GetYourGuide constantly tests new ideas, which means comparing new features to the old and gaining clear insights into what works and what doesn’t.
- Drift Detection: Data changes, behaviors shift, and our model must adapt to maintain high performance. Spotting and adjusting for these changes is a daily challenge and key to GetYourGuide’s ML operations.
GetYourGuide adopted Tecton as its feature platform and Arize for model observability to tackle these challenges. They fit nicely with the organization’s existing tech and help the team create new features for user personalization while also offering a clearer view of how models perform in production and whether any changes in features or model behavior need addressing.
In this post, we’ll take you behind the scenes of creating a mission-critical feature for personalizing rankings. We’ll show how GetYourGuide uses Tecton and Arize to integrate it into production, monitor model performance, and quickly surface and address model issues whenever they arise.
Feature Engineering
GetYourGuide starts its modeling by exploring its diverse, real-time data sources – Kafka streams capturing everything from user impressions and clicks to actual bookings. This rich activity data allows GetYourGuide to paint a comprehensive picture of user engagement, informing both offline analytics and the features powering its production ML models for personalizing each user journey on its platform.
Using the in-house events catalog tool, GetYourGuide data scientists can quickly identify promising events and dive into their historical records within our data warehouse. After an event has been identified, data scientists use Tecton for developing ML features by setting up the data source and defining feature views.
A recent example of GetYourGuide’s feature engineering approach is the “discounted ranking impressions” feature the team created for its activity ranking system. This feature maintains an up-to-date count of each visitor’s impressions for specific GetYourGuide activities, while also factoring in the position of the activity on the page where the impression occurred. As user behavior changes rapidly, GetYourGuide opted for a rather short 7-day lookback window and used a Stream Feature View to aggregate the Kafka stream of activity impressions in real-time on a visitor level, along with the activity positions in the search result page. Subsequently, the team used Tecton’s On Demand Feature View to further process this stream to extract the final position-discounted impression count for each visitor-activity pair.
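To make the idea of a position-discounted impression count concrete, here is a minimal sketch in plain Python. It is not GetYourGuide’s or Tecton’s implementation: the post does not state the discount function, so the logarithmic discount below, the event tuple layout, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from math import log2

LOOKBACK = timedelta(days=7)  # the 7-day lookback window described above


def position_discount(position: int) -> float:
    """Hypothetical discount: 1 / log2(1 + position), with positions starting at 1.

    The actual discount used in production is not given in the post.
    """
    return 1.0 / log2(1 + position)


def discounted_impressions(events, now):
    """Sum position-discounted impressions per (visitor_id, activity_id) pair.

    `events` is an iterable of (visitor_id, activity_id, position, timestamp)
    tuples; this layout is an assumption, not the real Kafka event schema.
    Only events inside the lookback window ending at `now` are counted.
    """
    totals = defaultdict(float)
    cutoff = now - LOOKBACK
    for visitor_id, activity_id, position, ts in events:
        if cutoff <= ts <= now:
            totals[(visitor_id, activity_id)] += position_discount(position)
    return dict(totals)
```

With this discount, an impression in the first position contributes a full count, while impressions further down the page contribute progressively less, so an activity the visitor barely scrolled past weighs less than one shown at the top.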
When working on feature engineering tasks with Tecton, GetYourGuide typically employs an iterative approach, applying an entity-level filter during the initial stages of the feature’s development. This ensures that only a small subset of entities is materialized, so that any potential feature rematerializations during development will have minimal impact on cost and Kafka cluster stability. Before a full-scale materialization, GetYourGuide validates the feature and performs a final check by testing it against live events from the Kafka streams using Tecton’s built-in .validate() and .run_stream() functionalities. This allows the team to identify any occasional discrepancies in event structures that could later surface in the fully materialized feature, such as schema differences across platforms like iOS and Android. After the final checks are in place, GetYourGuide removes the entity filter used during development and deploys the feature to production, where Tecton automatically performs the full materialization, taking care of all necessary backfills.
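The post does not say how the development-time entity filter is defined. One common way to select a small, stable subset of entities is to hash the entity ID, as in the sketch below. The function name and the 1% default are hypothetical, and this is plain Python rather than Tecton code.

```python
import hashlib


def in_dev_sample(entity_id: str, sample_percent: int = 1) -> bool:
    """Return True if the entity falls into the development sample.

    Hashing makes the choice deterministic, so the same visitors are
    selected on every run. `sample_percent` is a hypothetical knob.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < sample_percent
```

Because the selection depends only on the ID, repeated rematerializations during development touch the same small set of entities, and removing the filter later amounts to dropping this predicate.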
Training
GetYourGuide’s search ranking model pipeline leverages Airflow to orchestrate dataset generation, automate model training, and deploy a fresh model on a daily basis. This ensures that GetYourGuide’s production models capture the latest user interaction trends and stay up to date with ever-evolving user preferences.
The training dataset consists of historical ranking event logs combined with any subsequent user interactions on the ranked activities, such as page views and bookings. This data is then joined with the features in Tecton’s offline store, including GetYourGuide’s newly created discounted ranking impressions feature, which the model utilizes during training to determine optimal rankings. Tecton’s offline store enables GetYourGuide to easily fetch the point-in-time accurate feature values for each unique entity at the exact time of the historical ranking event. This capability is extremely valuable as it simplifies feature experimentation by eliminating the hassle of backfilling and the risk of data leakage.
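The point-in-time lookup described above can be illustrated with a small standalone sketch. It is not Tecton’s API: the data layout and function names are assumptions, and it only shows the rule that each training row may see the latest feature value recorded at or before the ranking event, never a later one.

```python
from bisect import bisect_right


def latest_value_as_of(history, event_ts):
    """Return the most recent feature value at or before `event_ts`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at `event_ts`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(ranking_events, feature_history):
    """Attach point-in-time feature values to historical ranking events.

    `ranking_events` is a list of (visitor_id, event_ts, label) tuples and
    `feature_history` maps visitor_id to a sorted (timestamp, value) list.
    Both layouts are hypothetical.
    """
    rows = []
    for visitor_id, event_ts, label in ranking_events:
        history = feature_history.get(visitor_id, [])
        rows.append({
            "visitor_id": visitor_id,
            "event_ts": event_ts,
            "label": label,
            "feature": latest_value_as_of(history, event_ts),
        })
    return rows
```

Using a value recorded after the event would leak future information into training, which is the data leakage risk the paragraph above refers to.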
After successfully producing a production-ready model, GetYourGuide commits it to MLflow, its model repository. During this step the team also sends the training and production data to Arize for observability purposes, where two datasets are dispatched:
- Counterfactual Dataset: A dataset with predictions from the newly trained model, providing insights into its potential performance.
- Production Data: A batch of the previous day’s production data, which serves as a benchmark for the new model’s performance.
Below, we’ll cover in more detail how GetYourGuide uses Arize to keep a close eye on model performance and the quality of feature data, ensuring our users consistently receive relevant recommendations.
Serving
Packaged as a Docker image with FastAPI, the trained model is deployed with both control and treatment models retrieved from MLflow for A/B testing. Once deployed on our Kubernetes cluster, services can seamlessly invoke the correct model variant as needed.
Every ranking inference request uses the visitor ID to retrieve up-to-date features from Tecton. In the case of the new feature, this would be the list of position-discounted impressions for each activity ID, updated every time a visitor produces more impressions as they browse through GetYourGuide’s activity selection. Using Redis as the online store, GetYourGuide achieves p99 latencies of just under 7ms per request; this low latency lets GetYourGuide adhere to our service’s tight SLOs. The team tracks the feature freshness and latency using Tecton’s specialized feature monitoring dashboards to ensure that they constantly receive fresh data for each call. Further, by leveraging Feature Server autoscaling, GetYourGuide cuts costs by roughly 50% (compared to over-provisioning) while handling spiky and seasonal traffic patterns effortlessly.
At serving time, the ranking service produces a Kafka event on a dedicated topic so that GetYourGuide can later ingest realized predictions to Arize and monitor feature and model accuracy drift. This data is ingested from the Kafka stream and stored as a table in Databricks.
Observability
After deploying the model and making online predictions, the team still needs to ensure it functions well in production. GetYourGuide’s ranking model has 51 features describing a wide range of activity and user properties, split between many different types and ranges of values. To ensure that the model behaves as expected, the team needs to monitor the input features during training and inference, because when problems arise, they can often be traced back to the data itself. To help keep an eye on this, the team uses Arize to create data quality and drift monitors for features so that they are promptly alerted on Slack when there are significant deviations in data quality or distributions. In particular, the feature drift monitors are very valuable, as feature drift is common in the real world and can easily go undetected while a model is in production. Arize can measure feature drift against a reference dataset, which GetYourGuide has set as a moving window over the last two weeks of production data. Drift monitors then compare model feature distributions over time using metrics such as PSI, KL Divergence, and more.
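For readers unfamiliar with PSI (Population Stability Index), the sketch below computes it for a single numeric feature by binning a reference sample and a current sample and comparing the bin proportions. This is a generic illustration, not Arize’s implementation; the equal-width binning, the bin count, and the small epsilon used for empty bins are assumptions.

```python
from math import log


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width over the reference range; current values outside
    that range are clamped into the first or last bin. Empty bins are
    floored at `eps` to avoid division by zero and log(0).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical distributions give a PSI of zero, and the value grows as the current distribution moves away from the reference. Alerting thresholds vary by team and are not stated in the post.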
Finally, GetYourGuide also uses Arize to help monitor how model performance changes over time. GetYourGuide is constantly iterating on its production ranking model to improve activity relevance and personalization for users’ unique preferences. As GetYourGuide launches A/B tests, it tracks the Normalized Discounted Cumulative Gain (NDCG) as a primary performance metric, and Arize offers the ability to break the performance further down into different data segments and highlight which features contribute to the model’s predictive performance the most. This gives a broad overview of the ranking model’s overall performance at any time and allows the team to identify areas of improvement, compare different datasets, and examine problematic slices.
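As a reference for the metric named above, here is a minimal NDCG sketch for a single ranked list. It uses the linear-gain form of DCG; other variants (for example, an exponential gain) exist, and the post does not say which one GetYourGuide or Arize uses, so treat the details as assumptions.

```python
from math import log2


def dcg(relevances, k=None):
    """Discounted cumulative gain of relevance labels in ranked order."""
    rels = relevances[:k] if k is not None else relevances
    return sum(rel / log2(i + 2) for i, rel in enumerate(rels))


def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering.

    Returns 0.0 when the list contains no relevant items.
    """
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the observed relevance of each activity in the order the model ranked it, for instance 1 for a booked activity and 0 otherwise. A ranking that places the booked activity first scores 1.0, and the score drops as that activity appears further down the list.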
Conclusion
Building a business-critical AI system that powers millions of daily predictions involves complex challenges. It’s not just about having robust infrastructure; it’s crucial that dynamic performance and cost efficiency are also supported while enabling rapid iteration and experimentation. Through its current infrastructure, GetYourGuide efficiently creates a variety of feature types and operates real-time feature pipelines, all the while maintaining keen oversight of model performance and observability.
A version of this post was originally published on GetYourGuide’s blog, where you can read the full post.