Insights From the Front Lines of Building Feature Engineering Infrastructure

After wearing many hats early in his career, LinkedIn’s Thomas Huang now has a Feathr in his cap.

Given that machine learning platform and central ML roles are often among the most coveted AI-related engineering positions at large technology companies, Thomas Huang is breathing rarefied air early in his career. After spending several years at an early-stage data preparation startup, Huang became a Software Engineer – Machine Learning Infrastructure at LinkedIn late last year. He joins an accomplished team that recently open-sourced Feathr (formerly referred to as Frame), LinkedIn’s feature store for productive machine learning.

Why did you choose a role in machine learning (ML) infrastructure? 

When I was working in my previous role as a machine learning scientist – which is a bit of a misnomer, since my work was more aligned with that of a machine learning engineer, with a lot of data engineering and software engineering mixed in – I felt like many of the problems we struggled with at our company stemmed from our data pipeline or from work related to scaling data and processes. At the time, we had a big organizational knowledge gap in this area, so we often had to figure things out for ourselves.

I think this is a common problem for a lot of machine learning teams. You’re trying to handle a huge mess of infrastructure along with the platform itself. Debugging problems in machine learning infrastructure can be far more complex than debugging standard data infrastructure, which has typically been developed and well-documented over a long time – it’s as if you’re bootstrapping a bunch of different technologies on top of a totally new type of platform. Not to mention the sheer number of platforms and tools available, many of which provide overlapping services.

When it came time to look for a new role, I knew to stick to my passion. I’ve always enjoyed engineering a bit more than research, and I wanted to work on something that would be hugely important to how ML engineering teams operate. ML infrastructure is also exciting because it is a young field that is wide open – there aren’t a lot of rules or standard practices yet. We are seeing a paradigm shift that many companies have recognized in the past few years: effort spent improving the ML pipeline – data quality, model deployment time, et cetera – yields a more significant improvement in the pipeline’s impact than the same effort directed at model choice and hyperparameter tuning. How the infrastructure is designed for these large-scale machine learning problems is crucial to solving them successfully.

Can you tell us more about Feathr (formerly referred to as Frame), LinkedIn’s in-house feature store?

I’m a relatively new member of the team and really want to pay tribute to the fact that Feathr was built and has been used at LinkedIn since 2017, when the concept of a feature store was a newer idea and a bit of a hackathon-like project internally. Now, dozens of applications at LinkedIn use Feathr to define features, compute them for training, deploy them in production, and share them across teams – ultimately reducing the time required to add new features to training workflows and improving runtime performance.

And now it’s open source! One thing that’s nice about LinkedIn is that when these internal initiatives grow, we can open source them so people can contribute and the developer community benefits.

Feathr works offline, online, and nearline. You really need both an online and an offline feature store: offline lets you experiment and implement local pipelines, while online serves production systems that need to pull from a continually updated, real-time feature store. Behind the scenes, a lot of work goes into the aggregation operations, joins, and data types handled throughout the lifecycle, which makes it a very comprehensive ongoing project.
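To make the offline/online split concrete, here is a minimal illustrative sketch – not Feathr’s actual API; all names and the windowed-count feature are hypothetical. Offline, a feature is computed as an aggregation over historical events for training; online, the same feature is served from a low-latency key-value store kept fresh by a streaming job.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Offline path (hypothetical example): compute a windowed aggregation
# feature over historical events, e.g. a member's activity count over
# the last 7 days, for use in training data.
def compute_offline_feature(events, now, window_days=7):
    """events: list of (member_id, timestamp) tuples."""
    cutoff = now - timedelta(days=window_days)
    counts = defaultdict(int)
    for member_id, ts in events:
        if ts >= cutoff:
            counts[member_id] += 1
    return dict(counts)

# Online path: a low-latency lookup against a store that a nearline
# (streaming) job keeps up to date; a dict stands in for a real store.
class OnlineFeatureStore:
    def __init__(self):
        self._store = {}

    def put(self, member_id, feature_value):
        self._store[member_id] = feature_value

    def get(self, member_id, default=0):
        return self._store.get(member_id, default)

now = datetime(2022, 6, 1)
events = [
    ("alice", now - timedelta(days=1)),
    ("alice", now - timedelta(days=3)),
    ("bob", now - timedelta(days=10)),  # falls outside the 7-day window
]

store = OnlineFeatureStore()
for member, value in compute_offline_feature(events, now).items():
    store.put(member, value)

print(store.get("alice"))  # 2
print(store.get("bob"))    # 0 (no events inside the window)
```

The point of the sketch is the contract: training and serving read the *same* feature definition, so the windowed aggregation computed offline matches what the online store returns at request time.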

What type of models is LinkedIn running in production today or what are the common use cases? 

The variety of models is really broad – far more than I expected when I first joined! For instance, detecting and preventing abuse is a priority, and the team has used both deep learning and isolation forests to do that. As at many other large companies, ML models drive important verticals like Ads and Feed, as well as other crucial components of the website like “People you may know” or “Jobs you may be interested in.” ML is used in essentially all major products at LinkedIn.

How do you view the ML engineering role evolving over time? You likely have an interesting take after seeing this from multiple angles at both LinkedIn (where your users are ML engineers) and at a small MLOps startup. 

As a more junior engineer, my perspective is going to be very different from someone who has been in this industry much longer. That said, I do believe the role of the ML engineer is evolving over time. At first, the data scientist was someone using R and statistical packages to do things like A/B testing with regression or logistic regression models – more of an academic statistician – and that transformed over time into the modern data scientist role. Similarly, the ML engineer might now be expected to write PyTorch or TensorFlow code to help develop a neural network, or to tackle a more advanced use case in computer vision or natural language processing.

What’s confusing is that sometimes the use case enlarges your role. For example, someone with a data science background might be working with LIDAR data and writing scripts for autonomous driving use cases, and end up with tasks that are more like those of a machine learning engineer. In practice, the roles blur over time. An ML engineer is part software engineer and part researcher — and a researcher is part software engineer, since code is now often a requirement to publish research papers!

Startups probably use the machine learning engineer title most loosely. Naturally, when you’re working at a startup as a machine learning engineer – or as a software engineer, for that matter – you get involved in things outside your specific domain. If you’re a backend engineer at a really small company, it wouldn’t be surprising if you had to do something front-end related or even something content-marketing related.

What were some of the takeaways from your time trying to build active learning as a service? 

Before LinkedIn, I worked for Alectio, which was trying to build active learning as a service. The ultimate goal of active learning is to find the subset of your data that is most effective for training your model, reducing the number of labels you need and saving money and time. However, the problem – and this is important to note – is that a lot of the work we did was often unsuccessful and unpredictable, because active learning as a service rests on a somewhat flawed premise.

Active learning can work really well for traditional models – things like support vector machines and decision trees. There is a lot of good work from the early 2000s by Daphne Koller and others on this, but all of it is based on things like multiclass tabular data. To really save money on labels nowadays, you need to look at annotating images or labeling large amounts of text – something like a BERT model that takes a lot of data and time to train, where an optimization would mean a big improvement.

That means active learning needs to be applied to deep learning, but the research just isn’t there yet. The reason a lot of large companies aren’t using active learning in their pipelines is that it’s very compute-heavy and not consistent across datasets and models. Of course, there are always records in a dataset that will be super informative to the model – what Andrew Ng says about taking a data-centric approach to machine learning is absolutely valid – but using that data-centric approach specifically to reduce label counts via active learning is just not that well developed yet. While there is a lot of research, there are no consistent results outside of a few narrow use cases. Generally, it’s used with caution.

What is your advice to students or others anxiously hoping to get into a machine learning engineering or ML platform type role? 

Be patient. From the second or third month of my freshman year in college, I was desperate for an internship. Over time, that impacted my focus, and I would get really frustrated.

Here’s the point: there is no single set path, and you should never give up on what you really like doing. Consider taking a job that isn’t perfectly aligned with your dream rather than being myopically focused on becoming, say, a machine learning engineer at Google overnight. Whether that means taking positions that aren’t directly related to what you want to do, or pursuing a master’s to support a transition into this industry, no one will pay much attention to how you got to where you eventually end up, because many people have a similar story.

When it comes to interview preparation for MLE roles, the process can really test your patience. Oftentimes you prepare, interview, and get rejected. It’s mentally, physically, and emotionally draining to apply, prepare rigorously as if it were an exam, and then lose out on a job you really wanted. I’m all too familiar with this, but for me, little hacks to keep learning kept me going – taking the time to read history, watch a movie, or dust off a tangentially relevant book about something like performance browser networking or C++. Keeping myself intellectually stimulated was crucial to staying motivated to improve my skills, and that doesn’t mean you always have to stay within the domain of computer science. It gives you a chance to step back from studying algorithms and ML concepts all the time. Then the core interview prep feels like just one of many of life’s challenges, and it can give you a fresh start. Maintaining your enthusiasm through months of searching isn’t easy, but that’s how I did it. It’s all about creating your own flow, even in difficult circumstances, and coming out on top in the end.