MLOps in Enterprises – A Holistic Perspective for a Successful AI Strategy With Lamia Youseff

Lamia Youseff discusses a holistic perspective for a successful AI strategy. This talk was originally given at Arize:Observe.

Lamia Youseff: Hello everyone, and good morning, good afternoon, or good evening, depending on where you are dialing in from. Today we're going to be talking about holistic perspectives for a successful AI strategy. Before diving in: yesterday I asked everyone to share their views, in a poll on the right side of your screen, on why most machine learning projects fail.

Actually, some statistics say that up to 85% of the projects that start end up failing. Through different channels, including LinkedIn and the Slack channel for our conference, the responses have been overwhelming, but also very insightful.

My name is Dr. Lamia Youseff. I'm the Executive Director at jazzcomputing.com, and today I'm going to be talking about machine learning projects and how to make them more successful. I've held executive roles at Microsoft, Facebook, Apple, and Google over the years. I have a doctorate degree in computer science, did a postdoc at MIT, and have a master's degree in Management Sciences from Stanford University.

And my passion is to bring together the technical capabilities in AI with the business and the impact that we have in the world. But I would like to start with a personal story of how I got into AI, back in the day, about 25 years ago, when I was first introduced to artificial neural networks and deep learning.

And for many of us, this is very shocking news, very concerning news. At the same time, in school, in undergrad, we were taking a class on deep neural networks, and one of our problem sets, one of the homework assignments, was on developing a predictive application using deep learning to predict benign versus malignant cancers, breast cancers specifically. And I thought, wow, that is very powerful. Can we use the tools, can we use our computer science knowledge and AI knowledge, to have such a big impact in the world, saving lives? But something was not adding up. Despite our assignment being an introductory-level homework, it was not deployed in production, not being used by medical doctors and folks in the field. So there was something missing, and that got me on a journey of asking what makes a successful AI application. I mean, we all know some components. We put them together in a system, and then we get a successful AI application.

What are these components? Back in the day, the components were clear. We need a neural architecture in the form of a model. We need huge computational power and resources in the form of supercomputers, now cloud computing. And then we need data to train the model. The more data, the more powerful the model, and that should get us to a successful enterprise AI application.

However, that wasn't the case. Over the years, we have gained a significant amount of computational power. Actually, this year, Professor Jack Dongarra out of UTK got the Turing Award for building supercomputers and advancing the field of supercomputing, bringing together computational resources, specifically around matrix-matrix multiplication, in order to make more computational applications possible, including machine learning. So we have the computational resources. In fact, the phones that we carry around in our pockets today are more powerful than a 1980 national supercomputer. Very huge.

And we got the data, and we got an evolving stack. Back in the day, when I was developing my first artificial neural network model, we were using C; we were using pointers to connect the different neurons. We didn't have any libraries. The stack has evolved significantly. Awesome. This is more power to bring the capabilities forward, to bring AI applications into the enterprise and be able to save lives. Again, not yet. And then the tools in the startup scene have been evolving over the last three years as well, exploding with many contributions across different parts of the stack.

Exciting times, but we're not there yet. Thanks to ChatGPT, in less than six months the conversation around AI and machine learning moved to the forefront of every dinner table and every conversation. So now we can ask the hard questions. What would it take to bring these kinds of tools and resources to responsible, human use cases, and be able to deploy them in saving lives, in improving our efficiency, in improving our use cases and our lives?

So what we're seeing is an evolution: AI enterprise applications combining and bringing together all of these components and all of these advances. But we're not there yet. These are the statistics that I found even before our conversation yesterday. Nine out of 10 global businesses have some investment in AI. And by global businesses I mean hospitals, I mean nonprofits, I mean enterprises that are building in oil and gas, I mean enterprises that are in the banking industry. So across the board, nine out of 10. However, only 15% have deployed their AI capabilities in their work. Huge gap. I mean, nine out of 10 are investing, and only one out of seven is deploying AI capabilities in real-life applications.

So why? I mean, we have the computational resources, we have the data, we have the startup scene, and we have the capabilities. What's missing?

Something is missing. Big question mark. And this is what took me on a journey working for the world's largest organizations in order to understand how we build and scale successful AI enterprises at world scale. And this is what I have found. It's not only computational resources and data; rather, it's many other components, including machine learning frameworks and tools that we have seen advance over the last five years.

It also includes organizational structure, in the form of building the right framework for the organization to work with other parts of the business. It includes talent acquisition, in the form of hiring and bringing forward the right set of folks who can drive the use case forward. And it doesn't happen overnight; rather, it's a migration journey that we're going to be talking about more today.

In addition to that, it's going to take business use cases: very strong alignment between the business and how we're applying AI, and whether we're evolving the business model and the business use cases that we have in order to integrate this AI and bring forward the ROI to our business.

As we're talking about today, and were talking about yesterday as well, it involves machine learning operations, a new field, I would say three to four years old, that has been evolving significantly, but we're still learning about its capabilities and what it can do for our systems. And finally, despite not getting as much attention, it involves change management.

People are the hardest component in any organization. It means being able to work with individuals, with teams, and with leaders to bring about the mindset change and the culture that's going to help us integrate AI applications successfully. Believe me, if you tell engineers ChatGPT is going to replace you, the likelihood of your AI application succeeding and getting deployed is going to be nil. So change management, being able to address the hype, and being able to address how we can evolve our workforce and reshape and reframe our conversation in order to bring forward successful enterprise AI applications, is critical to addressing this 85% gap.

So today I'm going to be diving into two specific areas: the first is the AI migration roadmap, and the second is the machine learning operations component. So let's dive into the AI migration roadmap. Enabling an organization to integrate AI applications and AI components into their business and into their use cases does not happen overnight.

We have recently gone through a similar journey with cloud migration. Cloud is not that old. Actually, the migration of several companies started in 2011, 2012, and we have seen how it brought significant power and significant value to different organizations. AI is no different. If anything, it's going to be more powerful.

So what are the phases through which we can help different enterprises come forward and become AI-enabled? Research, with a framework to discover the use cases. It's a joint effort between AI and machine learning practitioners; product owners, who are going to define what the new product looks like, what the requirements are, and what the needs are; and, most importantly, business leaders and executive sponsors, because these are the ones who are going to be able to take the project forward.

One of the important concepts here is the relative ROI. I've heard from many leaders who just want to integrate AI because it looks cool or because they are being asked by their board of directors to be able to integrate AI, but not because it's bringing an additional ROI to their business.

That's why it's very important to ask questions about relative ROI. In this case, what value are we bringing, and what cost is going to be incurred in order to bring this new component and new capability into our systems? The outcome of this phase is a list of opportunities, and I like to sort them by the relative ROI of each. Start with the top three, or maybe the top 10 in this case, and move them into developing a POC, or proof of concept.
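The ranking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk does not define a formula for relative ROI, so the sketch assumes net gain divided by cost, and the `Opportunity` class, its fields, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    """A candidate AI use case from the discovery phase (hypothetical fields)."""
    name: str
    expected_value: float  # estimated value to the business, any consistent unit
    expected_cost: float   # estimated cost to build and run, same unit

    @property
    def relative_roi(self) -> float:
        # Assumed definition: net gain divided by cost.
        return (self.expected_value - self.expected_cost) / self.expected_cost


def top_opportunities(opportunities, k=3):
    """Sort by relative ROI, highest first, and keep the top k for a POC."""
    ranked = sorted(opportunities, key=lambda o: o.relative_roi, reverse=True)
    return ranked[:k]


# Made-up numbers, purely for illustration.
candidates = [
    Opportunity("invoice triage", expected_value=500.0, expected_cost=200.0),
    Opportunity("churn prediction", expected_value=900.0, expected_cost=300.0),
    Opportunity("chat assistant", expected_value=400.0, expected_cost=400.0),
    Opportunity("demand forecast", expected_value=700.0, expected_cost=500.0),
]
shortlist = top_opportunities(candidates, k=3)
```

With these made-up inputs the shortlist keeps the three candidates whose estimated gain is largest relative to their cost, which is the list that would move on to the proof-of-concept phase.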

The debate in the second phase is usually whether we should buy, outsource, or build in-house the different components. Yesterday and today there were several really valuable and insightful conversations at the conference about the different trade-offs between buy, build, or outsource. One of the outcomes of this specific phase is the talent needs.

Who's going to be building these systems? Who's going to be able to scale them to the next level? And usually I'd love to see the three projects with the highest relative ROI come out of this second phase as the ones to chase and invest in, feeding into the third phase, which is productization.

This is the role of the product manager, in this case: to build the product roadmap and the launch plans, and to work directly with the data science team, with the machine learning operators, and with the machine learning engineers in order to build this. But also, more importantly, the go-to-market strategy: how they're going to take this new application, this new technical capability, in a product setting and take it to market.

Usually this phase is the time when we know which of these products are the most needed and create the most value for the business, and this is when we can start hiring data science teams and machine learning operators and building machine learning operations capabilities, to feed into operationalization and scale.

Operationalization and scale is the phase where we know that this is the application that we're taking to production, and this is when we start asking more questions about the areas that we need to invest in more, in order to make sure that we're addressing all of the other perspectives around taking a machine learning model to production.

I am going to dive more into machine learning operations in the next section. Three important concepts to look at here are repeatability, reliability, and resilience of the model. I'm going to go into more depth. It's important to note here as well that we can go back from phase three to phase two if needed, from phase two to phase one, or from phase one to phase three when needed, based on the specific use cases, based on the learnings that we have from one phase to the next, and based on how the market is evolving. We all know that the market and the business are evolving very fast today, given everything that's happening around us.

Once we operationalize the product, the next step is to evolve it. Now that we know how the market is responding to our AI-enabled product, it's time to think more about new products, new business introductions, and deploying new technologies, whether it's reinforcement learning or generative AI, or the new technologies that are coming and will be coming over the next year or two in AI.

There are many other areas that we haven't looked at that businesses need to look at from an executive level in order to decide how they're going to address them, what they mean, and whether they align with their business. One of them, for example, is the environmental aspect: carbon emissions of the data centers that we use for training, especially for large language models.

For some businesses, it's all about regulatory and legal aspects: personalization versus privacy, and what kind of machine learning models we are using. For some others, it's the consumer-centric capability, human-centric approaches, so capturing and characterizing the value in the voice of the consumer.

Additionally, societal and ethical aspects are very important here, specifically around AI transparency efforts: how does the business address that, and why is it important for the business? And of course, geopolitical aspects for open source and US versus China in the AI race. These are all questions that need to be answered by the executive team, one step at a time during their journey.

So, awesome. We have a roadmap we can execute. We know where we're going. But machine learning operations, what do you mean? I mean, why do we need a new operations team specifically for machine learning? At the conference, there have been several really amazing talks today about the need for machine learning operations, but from an executive level, this is why it's important.

I'm going to describe it a little bit more, at a very high level and very simplistically, in the form of the difference between traditional software engineering systems and teams versus machine learning applications. We've been building software engineering systems for several decades now. We know the outcome is deterministic. If you put in input A, you always get output B. It never changes. The development process is linear. You start by collecting requirements, you end by delivering the system, and then for quality, you do testing in the form of unit testing, regression testing, CI/CD. We learned that even in school, and we have been doing it through different methodologies and different schools of thought over the years, and have perfected it to a great extent.

But now we come to machine learning applications. Different story. The outcome is not deterministic anymore. It's stochastic; probabilistic, in other words. So the probability that you're going to get output B from input A is a certain percentage, and that means that when you are verifying the system for quality, you need different techniques and different tooling.
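To make the contrast concrete, here is a minimal Python sketch of the two styles of check. It is an illustration, not anything shown in the talk: `add_tax` stands in for ordinary deterministic code, `noisy_classifier` is a hypothetical stand-in for a model that is right about 90% of the time, and the 0.85 tolerance is an assumed threshold.

```python
import random


def add_tax(price_cents: int) -> int:
    """Traditional, deterministic code: same input, same output, every time."""
    return price_cents + price_cents // 10


def noisy_classifier(x: float, rng: random.Random) -> int:
    """Hypothetical stand-in for a model: correct about 90% of the time."""
    truth = 1 if x >= 0.5 else 0
    return truth if rng.random() < 0.9 else 1 - truth


def accuracy(n_samples: int, seed: int) -> float:
    """Fraction of correct predictions over n_samples random inputs."""
    rng = random.Random(seed)  # fixed seed keeps the check repeatable
    correct = 0
    for _ in range(n_samples):
        x = rng.random()
        truth = 1 if x >= 0.5 else 0
        correct += noisy_classifier(x, rng) == truth
    return correct / n_samples


# Deterministic check: exact equality on a single input.
assert add_tax(1000) == 1100

# Probabilistic check: an aggregate metric compared against a tolerance,
# because no single prediction is guaranteed to be right.
measured = accuracy(n_samples=10_000, seed=0)
assert measured >= 0.85
```

The design point is that the second assertion tests a distribution of outcomes over many inputs rather than one input-output pair, which is why the tooling differs from unit testing.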

In addition, you need an iterative development process, because it's all about the scientific thinking process of running experiments, seeing which of them is working and how the product is behaving under traffic and in the market, and then circling back and making some additions to your model.

It doesn't stop there. It's more complicated than that, specifically around machine learning operations. This is the field that's specifically looking at the best practices of integrating software development practices with machine learning, to be able to efficiently build, deploy, monitor, and manage models in production.

And this is the key here: in production. I usually like to refer to three specific requirements for this kind of system, the three Rs: it needs to be reliable, it needs to be reproducible, and it needs to be responsible. And this is very critical. Here is the reason. If we're talking about adding machine learning and AI systems into self-driving cars, driverless cars, I need it to be reliable. I need it to always do the right thing for me. I need it to never fail on me when I am on the highway driving at 80 or 90 or 65, depending on where you're driving.

I need it to be repeatable. I need it to work every time in the same way that I've seen it work. I don't want to take chances. And I need it to be responsible. I need it to be applied to useful use cases. With the current AI evolution and revolution, we are seeing how ChatGPT has reenergized the conversation around the application of machine learning.

We as machine learning practitioners have a huge responsibility in where we apply AI and machine learning, and we need to take that on responsibly. This is especially true if we're talking about applying AI and machine learning applications in the context of the medical field, in the context of fraud detection, in the context of banking.

It is going to make huge societal changes, and we need to act responsibly in how we apply these techniques and this new power given to us. So, three Rs, very critical, very important for a successful ML application and machine learning operations. And this is even more complicated when we see where we deploy machine learning applications.

Usually, specifically in academia, we look at machine learning operations in a silo, where there is a model and we just test the model, the output versus the input. We test the accuracy. But in reality it's a different story. The machine learning model is part of a bigger application, where it interacts with a software system that's evolving as we speak: evolving through bug fixes, through new systems being added, through new features being added, evolving all the time, and interacting directly with machine learning engineering. Not only that, but given what we have said about the difference in the development cycle between traditional software and machine learning engineering, the way that we verify them and the way they interact are very different.

In order to productize and operationalize machine learning engineering, we need to talk about all of the components that go into making it successful, specifically how it interacts with the traditional software. Experimentation, tracking, and launching are becoming more and more critical and more and more needed.

Feature engineering and training data, as we heard yesterday in the panel from Tecton about the amazing work that they're doing. Model quality and performance validation: I'm very excited about the announcement that came out yesterday, Phoenix, from Arize. Data and model monitoring and debugging also become more critical. Alerting and logging, incident response, hyperparameter tuning. I mean, the list goes on and on.

For today, we're just going to look at two case studies very quickly, in order to touch on how critical they are for business leaders and executives in this field. One of them is the importance of experimentation. It's a very different approach from what we have used in the traditional software engineering development cycle, specifically in the size of the experiments that we need to launch in order to get statistically significant analysis, and in the use of statistical techniques such as the t-test and confidence intervals in order to provide repeatable results, and this is exactly where we can address repeatability for the system.
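As a rough illustration of the statistics mentioned here, the sketch below compares a metric between two experiment arms using Welch's t statistic and an approximate confidence interval for the difference in means. It is an assumption-laden sketch, not the speaker's method: the data is simulated, the function name is hypothetical, and the interval uses a normal critical value, which is only a reasonable approximation for large samples.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t_and_ci(control, treatment, confidence=0.95):
    """Welch's t statistic and an approximate CI for the difference in means.

    Uses a normal critical value instead of a t critical value, which is
    only a reasonable approximation when both samples are large.
    """
    diff = mean(treatment) - mean(control)
    std_err = math.sqrt(variance(control) / len(control)
                        + variance(treatment) / len(treatment))
    t_stat = diff / std_err
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return t_stat, (diff - z * std_err, diff + z * std_err)


# Simulated metric values for two experiment arms (made-up data, fixed seed).
rng = random.Random(42)
control = [rng.gauss(0.70, 0.05) for _ in range(2000)]
treatment = [rng.gauss(0.72, 0.05) for _ in range(2000)]

t_stat, (ci_low, ci_high) = welch_t_and_ci(control, treatment)
# The result counts as significant when the interval excludes "no difference".
significant = ci_low > 0 or ci_high < 0
```

The sample size matters here: with only a handful of observations per arm the interval would be wide enough to include zero, which is one reason experiments need to be large to give repeatable conclusions.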

The other one is data drift: being able to monitor the model such that when the model's behavior changes, we can very quickly look at where the data drift has happened and respond. Traditionally, that takes several weeks to debug, several weeks to find where the mistake is and why the model drifted. And sometimes the drift just happens because of drift in the data that's being fed to the model.
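One common way to monitor for this kind of drift is to compare the distribution of a feature in production against its distribution at training time. The talk does not name a specific metric, so the sketch below swaps in the Population Stability Index (PSI) as an example. The function, the simulated data, and the 0.25 alert level (a commonly quoted rule of thumb) are all assumptions for illustration.

```python
import math
import random


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins  # equal-width bins over the reference range

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Simulated feature values (made-up data, fixed seed).
rng = random.Random(7)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # mean moved by one sigma

psi_stable = psi(training, same_dist)
psi_drifted = psi(training, shifted)
DRIFT_THRESHOLD = 0.25  # assumed alert level; tune per feature
drift_alert = psi_drifted > DRIFT_THRESHOLD
```

Computed per feature on a schedule, a score like this points at which input moved, which can shorten the weeks of debugging described above.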

The drift can be due to bug changes, due to feature additions, or even due to traffic changes. So, I mean, basically it can be very different. There are two books that I want to bring to your attention here, amazing books that have advanced the conversation around machine learning operations significantly. One of them is Chip's book, Designing Machine Learning Systems. Love it. Love how it goes in a very structured way into how we build the iterative process for production-ready applications with machine learning. The other one is Reliable Machine Learning, borrowing some of the techniques that we have used at Google on the SRE side in order to build machine learning applications in production. And again, I highlight production. Very critical.

But taking a step back: we talked today about the AI migration roadmap. We talked about machine learning operations, ML ops.

The biggest question here is how we bring together and harmonize all of the different components, those that I talked about and those that we know about but haven't talked about, in order to make sure that they align well and produce a successful enterprise AI application. This reminds me so much of seeing a jazz band improvise together.

They come together with different tools, different instruments. They look at each other. They can harmonize; they can build together a beautiful piece of music, usually on the fly, that brings a really successful outcome. What we see here is similar to live jazz improv, and that's why we're building Jazz Computing.

We usually say that disruptive times in machine learning and AI call for harmonious AI solutions, and that's what we're working on. And we're hiring. Please email careers@jazzcomputing.com. We're also putting out several podcasts, learning from pioneers in this specific space who have built and scaled their organizations and their systems and innovated in machine learning over the years, as part of big organizations or by building their own organizations.

This is season two. Season one has been out now for quite a while, and I keep going back and learning even more whenever I hear more of the conversations, in reflection on what we have learned.

Circling back to the story that I started with, the good news is my mom did the biopsy. It turned out to be benign, and she was spared. She's healthy, she's doing well, so we were relieved.

On the other hand, 25 years later, we haven't yet deployed machine learning and AI capabilities, the example that I talked about in the beginning, in day-to-day applications, in hospitals, or in saving lives. In fact, this is an article from the New York Times last March, March 5th specifically, saying that there were 2.3 million breast cancer diagnoses and 685,000 deaths from the disease, and early diagnosis does make a huge difference. So we could fix that.

But also, Hungary has started applying machine learning and AI in radiologists' workloads in order to reduce the workload and improve detection. It showed that it increased the cancer detection rate by 13%, and it was able to save 22 cases that have been documented since 2021. Twenty-two cases, in the big scheme, is not much. But if these 22 cases included my mom, your mom, your sister, your wife, or your daughter, these 22 cases are very significant. So help me bring successful AI applications to the enterprise, in order to bring more useful use cases day in and day out. These are the key takeaways from my talk today: the three Rs, very critical: reliability, repeatability, and responsibility in applying machine learning to enterprise applications.

We're hosting a workshop coming up in August. Please reach out to connect over LinkedIn. Happy to take a couple of questions now. Thank you.

Okay, I have a question.

Ah, very good question, thank you. So: given your vast experience in big technology companies like Google, Facebook, Apple, and Microsoft, what should thought and engineering leaders in these large enterprises be thinking about and preparing for in an era of generative AI? How do you imagine the successful leader will adopt in the coming years?

Generative AI is just starting. And if we look at the difference between organizations, we have those that are AI-first and have built the AI capabilities, like Google, Facebook, Apple, and Microsoft, and have already been at the forefront of this new evolution, and revolution, as some would say.

But there is also the other set of enterprises that are just catching up, trying to learn what machine learning is, what generative AI is, and how they can learn from it in the business sense. As I highlighted, there is the roadmap. The first step is brainstorming and learning: attending workshops, attending trainings, attending conferences like this one, and being able to understand what the different problems and the different use cases are.

Bringing different voices into the room, from the product side, from the business side, from the executive side, from the engineering side, and from the data science side, in order to bring together the different perspectives and answer some of the questions that I highlighted in the framework, is going to move the conversation forward to start building POCs and start addressing what ROI we can deliver from these different capabilities.

Great question, very awesome, and I'm really looking forward to following up with more information, given that we ran out of time. But it was such a pleasure meeting everyone today. Thank you.
