The Importance of Real-Time Data Pipelines: An Interview with mParticle’s Shafiq Shivji


Amber Roberts

Machine Learning Engineer

Shafiq Shivji is Group Product Marketing Manager at mParticle, where he leads the developer experience and data integrity domains. He brings over a decade of experience in product and sales engineering roles across industries including cybersecurity, education technology, healthcare, and telecommunications.

Tell us about your career journey. How did you get into your current position as Group Product Marketing Manager at mParticle? 

I’m a bit of a startup junkie, and as anyone in the startup world knows, you take one cap off and put another on pretty much daily. After college, I took a few years off to volunteer at an upstart nonprofit, leading an early childhood education program for poverty-stricken areas in South Asia. After this life-changing experience, I changed gears and got a “real” job in the corporate world. I have a technical background, so I started as a sales engineer selling mobile apps and telecom solutions to retail pharmacies. After that, I did a stint in product management for about four years at a seed-stage machine learning (ML) technology startup, where I had a fantastic opportunity to work firsthand with data scientists and ML engineers. I then took a position at Auth0 in the product marketing organization, and when Okta acquired them I pivoted and joined mParticle’s product team.

Can you tell us about mParticle and how data teams are using it today?

mParticle is a customer data platform, or CDP. That term has become convoluted, in my opinion, and can mean different things depending on whom you ask. Essentially, a CDP is an infrastructure tool that enables real-time personalization use cases without requiring a heavy engineering lift to instrument and maintain data pipelines. Specifically, our core competency is to simplify data ingestion, unification, and activation. We provide easy ways to manage data pipelines and stream data to downstream destinations. Think of it as observability similar to what Arize does in monitoring drift and model performance, but on the data flow side. For your audience, I believe the most common use case would be ML modeling: either seeding data to train models or using them in production for marketing and product use cases.

Can you give me a specific customer use case? 

Different teams use us for different purposes. Marketing and product are the most common teams we address. An example would be a marketer who wants to send real-time push notifications to people who abandoned a shopping cart. Product managers typically use us for analytics and growth, and we commonly hear that they cannot trust the data coming in via their analytics tool. A CDP provides a way to ensure high data fidelity and quality. ML engineers, specifically, benefit from clean data pipelines because that keeps their models usable both during training and in production. Recently, we added in-house ML capability to our platform. Oftentimes ML engineers are asked by marketers to productionize a model predicting what customers are going to do next, and that isn’t something they are normally excited to do. But now we have in-house capabilities to answer these kinds of questions – such as whether to send a coupon or an ad to a customer who is likely to buy a product. For a marketer that’s big, because it can save money.
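One common way to read that coupon example is as a simple decision rule sitting on top of a purchase-propensity model: don’t spend coupon budget on customers who are likely to buy anyway. The sketch below is purely illustrative – the threshold, scores, and function names are made up and are not mParticle’s actual implementation.

```python
# Toy decision rule on top of a hypothetical purchase-propensity model.
# Threshold and scores are illustrative only, not from any real system.
COUPON_THRESHOLD = 0.3  # below this predicted purchase probability, a coupon may be worth sending

def choose_offer(purchase_probability: float) -> str:
    """Pick an offer: don't spend coupon budget on customers likely to buy anyway."""
    return "coupon" if purchase_probability < COUPON_THRESHOLD else "ad"

scores = {"alice": 0.82, "bob": 0.12}          # hypothetical model outputs
offers = {customer: choose_offer(p) for customer, p in scores.items()}
print(offers)  # {'alice': 'ad', 'bob': 'coupon'}
```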

Lastly, we are a first-party data tool, and that means we do not rely on any third-party data. We provide the tools to maximize the value of your own first-party data that comes from your app and data stacks.

What makes building data-quality pipelines difficult for teams? 

Your audience is probably better suited to answer that question because they are the ones who experience the day-to-day pain of not having quality data pipelines. We’ve all seen the stats that ML engineers, data engineers, and data scientists basically spend most of their time cleaning data. From my perspective, I think it’s difficult because technology and business realities are constantly changing. Your data team can build the perfect solution at any given moment in time – saying these are the parameters, this is what I need to do – but when business realities change, those solutions either fail or require expensive engineering resources to keep working, and adaptability is missing. For example, say a new application developer mistakenly changes an event name in an app from “item_purchased” to “purchased,” accidentally dropping the prefix. When that particular payload comes in, it breaks the entire data model because what was expected has changed. And that can go undetected until it affects all sorts of downstream applications, including ML models. Then comes the scramble to try to understand what’s happening, which can feel like finding a needle in a haystack. Meanwhile, the business is suffering, and the team is getting pinged to sort out the issue and is under a lot of stress because of the potential revenue loss.
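To make that failure mode concrete, here is a minimal sketch of the kind of guardrail a data team might put in front of a pipeline: incoming events are checked against an expected registry, so a renamed event like “purchased” gets flagged at ingestion instead of silently breaking downstream consumers. The event names and fields below are hypothetical, not mParticle’s actual schema or implementation.

```python
# Minimal sketch of an event-name/schema guardrail in front of a pipeline.
# The expected-event registry below is hypothetical, for illustration only.
EXPECTED_EVENTS = {
    "item_purchased": {"user_id", "item_id", "price"},
    "item_added_to_cart": {"user_id", "item_id"},
}

def validate_event(event: dict) -> list[str]:
    """Return a list of problems with an incoming event payload (empty if it looks fine)."""
    problems = []
    name = event.get("event_name")
    if name not in EXPECTED_EVENTS:
        # e.g. 'purchased' instead of 'item_purchased'
        problems.append(f"unknown event name: {name!r}")
        return problems
    missing = EXPECTED_EVENTS[name] - event.get("properties", {}).keys()
    if missing:
        problems.append(f"{name}: missing fields {sorted(missing)}")
    return problems

# A renamed event gets caught at ingestion instead of breaking downstream models.
bad_event = {"event_name": "purchased",
             "properties": {"user_id": "u1", "item_id": "i9", "price": 4.99}}
print(validate_event(bad_event))  # ["unknown event name: 'purchased'"]
```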

To summarize, because of constant change, a lot of companies can’t afford to invest their time or hire resources to keep adapting. A CDP enables data teams to roll out real-time data pipelines and make a system that ensures quality and enables use cases that keep internal stakeholders happy.

Why is personalization so crucial for the user experience? 

Let’s start off by defining what personalization means. One example is when a data engineer, having been asked by a marketer to query a database, provides a CSV flat file with relevant info – technically, that could be personalization. But I think what we are talking about here, today, is personalization in terms of real-time customer experiences. Instead of a mass email, for example, this is very targeted, one-to-one personalization. Take Amazon for example. They’ve proven that personalization works in a visceral manner. When I go to their home page, I’m always looking and saying, “oh I like that, and I want that,” and it makes me want to buy all the things shown on “my” homepage. I add one thing to my cart – as a big basketball fan, let’s say I add a pair of Air Jordans – and then magically the perfect set of socks to match those sneakers appears in the window. From a customer’s perspective, it’s benign and perhaps expected. From a business perspective, you see double-digit lifts in conversions and revenue.

Personalization is crucial because it makes it easier for me as the consumer to discover the products I like, whether it’s content on Netflix or products on Amazon. Personalization drives sales, loyalty, brand, and a better customer experience. While it’s crucial, it is also hard. Unless you have resources like Amazon and Netflix, it is an even harder problem to solve. And it is not just about having the financial resources but also having a sophisticated data team and mature data practices.

There is a debate on whether to prioritize speed or accuracy for the customer when they want personalization. Why is it so hard to have both? 

I believe it starts with scale. For example, when you’re dealing with customers in a storefront, you can have that one-to-one experience, and you can train your staff in customer service to ensure the customer feels good – think of Walmart greeters. But how do you do that digitally, at scale, with customers engaging with your brand through different interfaces? They may visit your store, open your mobile app, or browse your web app, doing all sorts of random things. Let’s say a customer uses your mobile app but then switches over to the web app; all that data is typically siloed, making it difficult to maintain a personalized experience in real time. When we talk about personalizing at scale, it requires significant engineering effort to stitch data together and build profiles that can then be fed into an ML model, for example, to give you accurate recommendations. The effort and ability to implement such a model and keep it going becomes very expensive. Companies like Netflix and Amazon have dedicated significant resources to implement these systems and have built tools in-house; other companies, however, don’t have the same resources to break data silos and build customer profiles to unlock personalized experiences.
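As a rough illustration of the stitching problem described here, the sketch below merges events from two siloed sources (mobile and web) into a single per-customer profile keyed on a shared identifier. The identifier and event shapes are hypothetical; real identity resolution also has to handle anonymous IDs, aliases, and conflicts, none of which is shown.

```python
from collections import defaultdict

# Hypothetical event feeds from two silos; in practice these would be live streams.
mobile_events = [
    {"customer_id": "c42", "event": "view_item", "item": "air_jordan_1"},
    {"customer_id": "c42", "event": "add_to_cart", "item": "air_jordan_1"},
]
web_events = [
    {"customer_id": "c42", "event": "view_item", "item": "matching_socks"},
]

def build_profiles(*sources):
    """Stitch events from multiple sources into one profile per customer."""
    profiles = defaultdict(lambda: {"events": [], "items_viewed": set()})
    for source in sources:
        for e in source:
            profile = profiles[e["customer_id"]]
            profile["events"].append(e["event"])
            if e["event"] == "view_item":
                profile["items_viewed"].add(e["item"])
    return profiles

# The unified profile is what a recommendation model would consume.
profiles = build_profiles(mobile_events, web_events)
print(profiles["c42"]["items_viewed"])  # {'air_jordan_1', 'matching_socks'}
```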

Can you tell us about what build versus buy considerations look like in the CDP space? 

I think the build versus buy conversation starts with awareness. I believe many data teams are not even aware that there are tools out there that can solve so many use cases. When I was at Auth0, we told developers not to focus on low-impact work – the 80% of effort that only solves 20% of the problem. Instead, focus on core competencies and the interesting, unique aspects of your app. I apply the same logic when it comes to a CDP: first, be aware of what tools are available to quickly solve data problems, and second, ensure that the effort in building and maintaining your data pipelines aligns with your company’s core value proposition.

When companies start thinking about building internally, they often talk about the modern data stack as the Holy Grail for solving all data pipeline issues. On the surface, it sounds great and very promising. But when you start to dig deeper and look at where it can make an impact, this approach falls short. For example, I don’t think anyone uses Snowflake for real-time use cases; and even if you try to build on top of Snowflake, what happens in instances such as cart abandonment? Just recently I put toothpaste in an online shopping cart because we were running low, got distracted with the kids, and forgot about it. Two days later I wondered why my toothpaste hadn’t arrived. If I had received a push notification about my abandoned cart, I would’ve made the purchase and had a better experience. When you do that at scale, imagine how many conversions are just waiting to happen! But you can’t do that sort of thing unless you have the tools and invest in engineering resources to build out a solution and maintain it. With a CDP, you can implement real-time infrastructure quickly and efficiently, instead of investing resources in building and maintaining data pipelines that only take you so far. Rather, invest those resources in more interesting tasks like building the next model or improving model metrics (e.g., PR-AUC) to make a huge impact on your business.
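As a back-of-the-envelope sketch of the cart-abandonment idea (not how mParticle or any particular warehouse implements it), the logic is simply: watch the event stream for an add-to-cart with no purchase inside some window, then fire a reminder. The event names, window, and notification call below are hypothetical stand-ins.

```python
from datetime import datetime, timedelta

ABANDON_WINDOW = timedelta(hours=1)  # hypothetical threshold

def find_abandoned_carts(events, now):
    """Yield customers who added to cart more than ABANDON_WINDOW ago and never purchased."""
    last_add, purchased = {}, set()
    for e in events:  # events assumed ordered by timestamp
        if e["event"] == "add_to_cart":
            last_add[e["customer_id"]] = e["ts"]
        elif e["event"] == "purchase":
            purchased.add(e["customer_id"])
    for customer, ts in last_add.items():
        if customer not in purchased and now - ts > ABANDON_WINDOW:
            yield customer

events = [
    {"customer_id": "c7", "event": "add_to_cart", "ts": datetime(2023, 5, 1, 9, 0)},
    # no purchase event for c7 follows
]
for customer in find_abandoned_carts(events, now=datetime(2023, 5, 1, 11, 0)):
    print(f"send push notification to {customer}")  # stand-in for a real push/notify call
```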

Machine learning operations (MLOps) is relatively new and there is a surge in hiring for ML engineers and relevant machine learning certification programs. What does mParticle do to help data teams that are trying to take that first step into personalization and maturing their data pipelines? 

We provide the data infrastructure layer for real-time personalization. We can talk for hours about all the business use cases you unlock when you have real-time capability. One thing that is relevant for MLOps professionals is the ability to build ML models and use them in production with real-time data. There is nothing more frustrating than implementing a model that no one uses or that was built on data that you realize was corrupt, incomplete, or broken in some way.

In discussing data maturity, it all starts with developing a data strategy that outlines how a company wants to use data to achieve specific business goals. I see many organizations that don’t have a sophisticated data strategy in place. Instead, every team is left to form its own strategy. Then, when it’s time to activate data, teams spend countless hours stitching together data from different silos and sources. A CDP helps an immature team mature for two reasons. First, we’ve been doing this for nearly a decade, and we have a lot of in-house expertise. We work with big-name companies like Burger King and Airbnb, and have real-world experience in how to maximize the value of customer data. Second, we have out-of-the-box tools to help you solve low-hanging problems and drive immediate impact without requiring too many resources or having mature teams in place. A lot of our customers approach us with one problem in mind, like broken data pipes that keep them from moving data from here to there. Once they use mParticle, they see its total value and the impact it can make, and they end up using it for all sorts of other use cases.