Rise of the ML Engineer: Elizabeth Hutton, Cisco

Amber Roberts

Machine Learning Engineer

Elizabeth Hutton is the lead machine learning engineer at the Cisco Webex Contact Center AI team, where she leads building in-house AI solutions from research and development to production. It’s a natural home for Hutton, who has long been interested in natural language – both as a researcher examining how language is learned and now as an engineer building AI systems that help companies understand and act on language. With three patents pending for the development of novel technology, Hutton’s work is relied on to provide good customer experiences across billions of monthly calls.

Can you briefly introduce yourself and outline your role at Cisco? 

I’m a machine learning engineer at Cisco. I’ve been working at Cisco on the Webex Contact Center team for about two-and-a-half years now. Most of my work is in natural language processing (NLP), but I also do some speech-related work with text-to-speech and speech-to-text.

A lot of people are trying to make the jump from academia or other scientific fields into machine learning – can you outline your journey making a similar transition? 

I studied applied math and cognitive science at the University of Southern California (USC), where I was also involved in research and computational linguistics. The combination of my math background and the interest in questions around the nature of intelligence naturally led me to a career in AI and machine learning.

The way that I made that transition really was through Insight Data Science. Insight helped me get my foot in the door in the industry and prepare for interviews, among other things. That really helped coming straight out of college, especially without an advanced degree. Having a long-held interest in language learning and math and science also really helped and informed a lot of those conversations.

Do you think you need an advanced degree or can you just start learning the material and interviewing for these positions?

It’s a good question that I would say depends on the kind of role that you want. If you are interested in more of a research kind of role, it helps a lot to have an advanced degree – or if not an advanced degree, some research experience. I did my own research all throughout college, so that kind of gave me a leg up and was really helpful. But if you’re interested in more of the engineering side, you don’t necessarily need an advanced degree – if you’re a strong software developer and coder, then you should be able to study up on your own and do well.

What’s your day-to-day at Cisco look like?

When I started, it was just me and my boss as well as a data engineer – so a very small team – and zero infrastructure, zero code. Of course, Cisco had many other AI teams doing a lot of cutting-edge work, but at the time the Webex Contact Center specifically didn’t have an in-house team doing AI yet.

That was a big part of the appeal for me – the fact that it’s sort of like a startup within this larger company – and I got a chance to take on a lot of responsibility and wear many hats. In terms of my role, I am responsible for all of the data gatekeeping like collecting, cleaning, labeling, storing, validating the data, model development, and also some software development and productionizing the models.

Can you talk about the data pre-processing steps? Since a machine learning model is only as good as the data going in, it would be illustrative to hear about the best practices that you find around those tasks.

The data is honestly the most time consuming and important part; it all starts with collecting the right data and labeling. Something that we do a lot in our team is that we’ll have a small subset of the data labeled internally by our own team, so we know that’s our gold set where we have very high confidence in those labels.

Then we send a larger amount of data to third-party annotators for labeling, and for the rest of the data we use a tool like Snorkel to try to label some of the examples automatically. Next, we compare those sets of labels, asking questions like how the Snorkel model is doing compared to the human annotators, how the human annotators are doing compared to our internally labeled gold set, and so on. It is an iterative process of making sure that there is agreement between those three tiers of labels.
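One way to put a number on the agreement between label tiers described above is Cohen's kappa, which corrects raw agreement for chance. The sketch below is only an illustration, not the team's actual tooling: the function name `cohen_kappa` and the three small label lists are hypothetical, and the interview does not say which agreement statistic is used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Fraction of items on which the two sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each source labeled at random with its own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for the same six utterances from each tier.
gold = ["billing", "billing", "cancel", "cancel", "other", "billing"]    # internal gold set
vendor = ["billing", "billing", "cancel", "other", "other", "billing"]   # third-party annotators
weak = ["billing", "cancel", "cancel", "other", "other", "billing"]      # automatic labels

pairs = {
    "vendor_vs_gold": cohen_kappa(vendor, gold),
    "weak_vs_gold": cohen_kappa(weak, gold),
    "weak_vs_vendor": cohen_kappa(weak, vendor),
}
for name, kappa in pairs.items():
    print(f"{name}: {kappa:.2f}")
```

A low kappa between annotators and the gold set is often a hint that the labeling instructions, rather than the annotators, need attention.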

But it can still be difficult to get it right. Sometimes, adjusting the labeling task itself by ensuring that directions are more clear to the annotators is important – because if there is a lot of disagreement in the labels, then it can lead to problems down the line.

What are your primary machine learning use cases?

Most of the models that I work with are NLP models. We have a lot of large language models – transformers, Text-To-Text Transfer Transformer (T5) and variations of BERT – that we use for a wide range of tasks. Some are classification tasks, others are question-answering tasks (e.g., to help customer service agents), and we also have a summarization model that we built to summarize conversations. Of course, each one of these tasks requires a different set of labels for training and evaluation, as well as a different paradigm.

When it comes to model development, where do you start? Is it something open-source, is it the simplest thing possible – how do you figure out what model is appropriate for the use case?

Finding tools and sources of research is something that you develop with time. When you’re initially deciding which tools fit your use case the best, I think it is always good to start with something simple at first – the simplest model you can think of – while also looking at the state-of-the-art and the latest research. Papers with Code is a great resource that we use a lot.

With large language models, our approach varies. Many of the models we have in production are pre-trained from Hugging Face or something similar – so open-source models that we fine tune – while others are models that we train from scratch, often for use cases that are specific to Webex Contact Center.

Any best practices or learning experiences on model development worth sharing? 

One thing to always keep in mind when developing a model for production is the end requirements – what kind of latency do you need, what kind of scale, how many requests per second. Since we’re developing models for Webex Contact Center that are going to be used by literally millions of people across potentially billions of calls, it really narrows down the search a lot. If you need a model that can do inference in under one second, it cuts out a lot of models.
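A latency requirement like the one-second figure above can be checked early, before a candidate model gets much further. The following is a minimal sketch under stated assumptions: `measure_latency`, `meets_budget`, and `dummy_predict` are hypothetical names, the 95th percentile is an arbitrary choice, and a real check would call the actual model on representative inputs and hardware.

```python
import statistics
import time


def measure_latency(predict, inputs, warmup=3):
    """Return per-call latencies in seconds for predict over inputs."""
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls are not timed
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append(time.perf_counter() - start)
    return latencies


def meets_budget(latencies, budget_s, percentile=95):
    """True if the given percentile of the latencies is within budget_s seconds."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cut_points = statistics.quantiles(latencies, n=100, method="inclusive")
    return cut_points[percentile - 1] <= budget_s


def dummy_predict(text):
    """Hypothetical stand-in for a model's inference call."""
    return len(text.split())


samples = ["where is my refund"] * 50
lat = measure_latency(dummy_predict, samples)
print("within 1 s budget at p95:", meets_budget(lat, budget_s=1.0))
```

Checking a high percentile rather than the mean reflects that a few slow requests matter a great deal at contact-center scale.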

How do you establish goals for what you want a model to do in terms of business KPIs?

We work with our product team as well as the engineering team to understand what customers want, what the requirements are and anticipated infrastructure needs. We always have an initial set of goals for any kind of product – such as how many requests per second – that are Cisco-standard, so there is not always a lot of room for discussion. It’s more like “these are the requirements, let’s go” and it’s similar across models.

What are tips for putting a model into production? 

We have a process of thoroughly testing and evaluating our models in the lab before we put them into production, and I think getting that right is really important. You want to make sure that the metrics you have and the data you’re using to evaluate the models are really spot-on. It also helps to have a mostly automated process so that you don’t have to do a lot of work each time you want to test a new iteration of the model, or each time you get some new test data that you want to include. We use a tool called Weights & Biases for that and it’s excellent for this kind of experimentation, as well as data versioning.

In terms of how to take those tests in the evaluation stage into production, we do a couple of things. First, we have feedback collection – so we collect both explicit and implicit feedback from our users who are receiving the predictions of our models. Explicit feedback would be things like a thumbs down on the recommendation, or clicks or comments that users leave. An example of implicit feedback is where our question-answer models make a suggestion to a customer service agent while they’re handling a call, after which we can compare the suggested answer to the bits of conversation that happened after it was suggested, doing a semantic similarity measure to determine whether the agent used the recommendation and ultimately whether it was a good answer in context. We haven’t collected enough of this feedback and our models haven’t been in production quite long enough to start using this for retraining, but that’s the eventual goal. It’s also just a good sanity check to make sure that our models are performing as expected.
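The implicit-feedback check described above can be sketched roughly as follows. This is an illustration only: it uses a bag-of-words cosine similarity as a simple stand-in for whatever semantic similarity measure is actually used (an embedding model would be the more likely choice in practice), and the function names, the example utterances, and the 0.5 threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def suggestion_was_used(suggestion, later_agent_turns, threshold=0.5):
    """Return (used, best_score) for a suggestion against the agent's later turns."""
    sug = bag_of_words(suggestion)
    best = max(
        (cosine_similarity(sug, bag_of_words(t)) for t in later_agent_turns),
        default=0.0,
    )
    return best >= threshold, best


# Hypothetical suggestion and the agent's turns after it was shown.
suggestion = "Refunds are issued within five business days"
turns = [
    "Let me check that for you",
    "Refunds are issued within five business days of the return",
]
used, score = suggestion_was_used(suggestion, turns)
print(used, round(score, 2))
```

The result is a proxy signal rather than ground truth: an agent may paraphrase a good answer heavily or repeat a poor one, so the score is best read as directional.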

Do you have to navigate delayed ground truth or using proxies where no ground truth is available?

Here’s an illustrative example of what this looks like for us today with Contact Center: we have a system that listens to an ongoing conversation between an agent and a caller. The model runs in the background to identify the caller’s intent and question and uses it to query a knowledge-base to then surface the most relevant answer to the agent so that the agent doesn’t have to take time to search for it manually – since agents don’t always know the ins-and-outs of the company they are representing, it helps to have that information available.

In this case, the ground truth is the correct answer to the customer’s question. During training and development we have labeled data, but in production the correct answer is less obvious. We have implemented a method of implicitly checking whether the agent used the recommendation or not. While it’s not a perfect measure, it is directionally useful. We know our suggestions might have been off-base if the agent did not use them, but if the agent uses the suggestion and repeats bits of the answer the model served then we know it was at least partially useful.

So you’re likely relying on mostly custom metrics rather than just standard model metrics (e.g., AUC)? 

Yes, we mostly use custom metrics because so few of our models and tasks are clear-cut – it’s not as simple as saying this is a classification model and therefore you just need F1 score or accuracy, for example. They are often more nuanced, so we rely on custom metrics or a series of metrics for each task.

We recently surveyed over 900 data scientists and engineers and found that most teams (84%) cite the time it takes to resolve problems with models in production as a pain point. How are you monitoring models in production? 

So in addition to the feedback collection that we regularly check – we have it stored in a database that we query to see how the model is doing based on user feedback – we also use a tool called Checklist, which is useful specifically for testing language models. We use it as a kind of unit test for some of the models that we have in production, and it’s amazing how many even state-of-the-art language models fail these really simple tests. Basically, you set up a set of tests based on the model and the use case and your assumptions about what the behavior should be, and then you can run them periodically just as you would any other kind of software test. It’s a good way to just make sure that the model is behaving as expected. It’s not perfect but it’s definitely a useful tool.
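The idea of behavioral tests for a language model can be illustrated in plain Python. The sketch below deliberately does not use the Checklist library's own API; it only mirrors two common kinds of test, a minimum-functionality test and an invariance test, against a hypothetical keyword-based `toy_intent_model` that stands in for a real classifier.

```python
def toy_intent_model(text):
    """Hypothetical stand-in for a production intent classifier."""
    lowered = text.lower()
    if "refund" in lowered:
        return "refund"
    if "cancel" in lowered:
        return "cancel"
    return "other"


def minimum_functionality_failures(model, cases):
    """Return (text, expected, got) triples where the model is wrong."""
    return [(t, e, model(t)) for t, e in cases if model(t) != e]


def invariance_failures(model, template, fillers):
    """Return fillers whose prediction differs from the first filler's prediction."""
    preds = [model(template.format(name=f)) for f in fillers]
    return [f for f, p in zip(fillers, preds) if p != preds[0]]


# Simple inputs with a known expected intent.
mft_cases = [
    ("I want a refund for my order", "refund"),
    ("Please cancel my subscription", "cancel"),
    ("I'd like my money back", "refund"),  # paraphrase without the keyword
]
mft_failures = minimum_functionality_failures(toy_intent_model, mft_cases)

# Swapping the caller's name should not change the predicted intent.
inv_failures = invariance_failures(
    toy_intent_model,
    "Hi, this is {name}, I need to cancel my plan",
    ["Maria", "Wei", "Refundio"],
)

print("minimum functionality failures:", mft_failures)
print("invariance failures:", inv_failures)
```

Here the toy model fails both tests: it misses the paraphrase and is thrown off by a name that happens to contain a keyword. Run periodically, checks like these behave like any other regression suite.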

Given the data you get isn’t always straightforward, managing data quality issues in production must be challenging at times – what is your approach?

Most of the testing and preparation happens during the model development stage, before we put the models into production – making sure that all of our text normalization steps are going to work for all of the corner cases. We usually also have early field trials, where we release the models to one customer for testing for example, so we can identify unforeseen issues.
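Corner-case checks for text normalization can be kept as a small table of inputs and expected outputs. The sketch below is an assumption about what such steps might look like, since the interview does not list the actual normalization rules; `normalize_text` and the cases in `CORNER_CASES` are hypothetical.

```python
import re
import unicodedata

# Map curly quotes to straight ones; NFKC alone leaves them unchanged.
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize_text(text):
    """Unicode-normalize, straighten quotes, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_QUOTES)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


# Raw input -> expected normalized output.
CORNER_CASES = {
    "": "",
    "   ": "",
    "Hello\u00a0 World": "hello world",                # non-breaking space
    "I\u2019d like a REFUND": "i'd like a refund",     # curly apostrophe
    "\uff21\uff22\uff23\uff11\uff12\uff13": "abc123",  # full-width letters and digits
    "line one\nline two\t": "line one line two",       # newlines and tabs
}

failures = {
    raw: normalize_text(raw)
    for raw, want in CORNER_CASES.items()
    if normalize_text(raw) != want
}
print("failing cases:", failures)
```

New surprises found in field trials can be added to the table so they stay covered in later releases.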

Is all of this in the cloud or on-prem hybrid? 

Since we had the opportunity to build our infrastructure from the ground up, we made it a cloud-first and cloud-only platform that handles all of the AI APIs and data processing. But with that cloud infrastructure, we serve both cloud and on-premises clients – Webex Contact Center has several different versions of the software available right now, and we try to make our APIs accessible to all of them.

What is the hardest part and most rewarding part of your job? 

One of the challenging parts is that models can be slippery – it’s hard to know if they are doing what you want them to do. There are so many different clues and metrics that you want to look at to try to figure that out. While it’s challenging, it’s also one of the more rewarding parts of the job because you have to get creative about what you are looking at and what’s going to be the best measure of success for your particular use case.

Another rewarding part of the job is to then go to the executives and the people who don’t necessarily understand machine learning and say “the model is performing this well – and here is how we compare to our competitors.” Going in with the ultimate confidence in the model and showcasing the impact on the business is very satisfying.