Gaining Insights from Private Data Using Federated Learning
Sid Roy is Manager of Machine Learning Engineering at Devron, a federated machine learning platform dedicated to unlocking the innovation and insight of data while preserving privacy. Before joining Devron, Sid was an AI Fellow at Insight Data Science and received his PhD in computational science with a focus on machine learning from the University of Iowa. Here, he explains the benefits and challenges of using federated learning systems to train and deploy models.
What’s your career background, and how did you first get into machine learning?
I’m currently an ML manager at Devron. I joined the company two years ago to help productize the idea of federated machine learning for siloed and distributed data. I started as head of research, looking into various technologies, researching and prototyping new ideas, and writing proposals and patents to find the right marketable product. Ultimately, I realized we needed to build the technology from the ground up to be scalable and usable for enterprise customers. That’s when I moved into an engineering-focused role, where I began managing the machine learning work of building algorithms that enable federated machine learning using various privacy and encryption technologies. Before Devron, I was an AI Fellow at Insight Data Science, where I learned the many ways startups and industry are different from academia. Before that, I got my PhD at the University of Iowa, focusing on computer vision and building predictive and active learning models for computational physics applications.
How would you explain federated learning to people who may not be familiar with the tech?
Federated learning is essentially machine learning for inaccessible data: the data could be private, or the data owner may not want to lose ownership. Google’s Gboard is the first known deployment of a federated learning system. Most people are probably familiar with the fact that Google can predict the next few words you might type into a keyboard. However, if Google were to collect everything you’ve ever typed, they would end up storing and managing a lot of private information in their central servers, which would then be used to train future ML models. So instead, Google trains their models directly on the phones themselves. The trained models are collected in a central server and aggregated using an algorithm called federated averaging. The aggregated model is then deployed back to the phones to make better predictions. As a result, they can train models on data from millions of devices without ever collecting the data.
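To make the aggregation step concrete, here is a minimal Python sketch of federated averaging. It is an illustration under simplifying assumptions, not Gboard's actual implementation: each model is a flat list of numbers, and the names `federated_average`, `phone_models`, and `phone_examples` are hypothetical. Each device's parameters are weighted by how many local examples it trained on.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client model parameters, weighted by local example count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need exactly one example count per client model")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of examples must be positive")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, n in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all client models must have the same shape")
        for i, w in enumerate(weights):
            # A client with more data contributes proportionally more.
            averaged[i] += w * n / total
    return averaged


# Hypothetical models trained on three phones; only these parameters
# (never the typed text) are sent to the server.
phone_models = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
phone_examples = [10, 30, 60]
global_model = federated_average(phone_models, phone_examples)
print(global_model)
```

The server only ever handles model parameters and example counts, which is what lets the raw data stay on the device.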
At Devron, we are productizing federated machine learning technology for enterprise solutions. In enterprises, there are a lot of situations where companies want to access certain data, but due to privacy, regulatory, or jurisdictional reasons they cannot. Through our platform, data scientists can build, train, and evaluate machine learning models and go through the entire data science workflow without ever having access to the data. That’s federated learning for enterprises in a nutshell. It’s the idea that the machine learning model goes to the data instead of the data going to the model. We’re seeing a lot more traction around federated learning and privacy-enhancing technologies in recent years.
Would you say that there’s a level of shared responsibility with federated systems?
In federated learning systems, I would say there are two types of user personas: the data owner (who owns the data) and the data scientist (who performs the data science operations). For example, if the organization owning the data (data owner) and the organization wanting to get insights from the data (data scientist) are different entities, Devron enables the data scientist to train a model on the data without ever “seeing” the data. As a data scientist I can still write an algorithm and build a model to train, expecting that it will train on the real data. So I would send the model to the data owner and it will train on their infrastructure, right next to the data, without the data ever having to move. And then I can get the trained model back.
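The split between the two personas can be sketched in a few lines of Python. This is a toy illustration, not Devron's API: the `DataOwner` class, the `run_round` function, and the linear model `y = w * x + b` are all assumptions made for the example. The data scientist's code only ever receives trained parameters and an example count; the private examples stay inside the owner object.

```python
from typing import List, Sequence, Tuple

Weights = Tuple[float, float]  # (slope w, intercept b) of y = w * x + b


class DataOwner:
    """Hypothetical data owner: private (x, y) pairs never leave this object."""

    def __init__(self, examples: Sequence[Tuple[float, float]]):
        self._examples = list(examples)

    def train(self, weights: Weights, lr: float = 0.05,
              epochs: int = 20) -> Tuple[Weights, int]:
        """Run local gradient descent on mean squared error."""
        w, b = weights
        n = len(self._examples)
        for _ in range(epochs):
            grad_w = sum(2 * (w * x + b - y) * x for x, y in self._examples) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in self._examples) / n
            w -= lr * grad_w
            b -= lr * grad_b
        # Only the trained parameters and the example count are returned.
        return (w, b), n


def run_round(global_weights: Weights, owners: List[DataOwner]) -> Weights:
    """Data scientist side: send the model out, average what comes back."""
    results = [owner.train(global_weights) for owner in owners]
    total = sum(n for _, n in results)
    w = sum(wt[0] * n for wt, n in results) / total
    b = sum(wt[1] * n for wt, n in results) / total
    return (w, b)


# Two hypothetical organizations whose data both follow y = 2x + 1.
owners = [
    DataOwner([(0.0, 1.0), (1.0, 3.0)]),
    DataOwner([(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]),
]
model: Weights = (0.0, 0.0)
for _ in range(100):
    model = run_round(model, owners)
print(model)
```

In a real deployment the `train` call would run on the data owner's infrastructure over a network boundary; here it is an in-process method call purely to keep the sketch runnable.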
Do you think federated learning will also have applications in academia, where people can’t share their data sets?
Absolutely. Dr. Stephen Baek, who served on my thesis committee and co-advised me during my PhD at the University of Iowa, won National Science Foundation (NSF) funding for a proposal to train federated learning algorithms on MRI image data. Multiple universities (Stanford University, University of Chicago, Harvard University, Yale University, and Seoul National University) would collaborate to build global models that would train on different types of images coming from those universities’ hospitals.
How is federated learning used? Is it both within companies and from company to company?
That’s exactly right. At Devron, we have different models: inter-organization collaboration and intra-organization collaboration. As an example, let’s think of a financial organization, which has different branches: the credit card transaction system, the banking system, and the loan system. Data can’t be transferred between the different branches, even though they’re all within the same organization. So, if they want to create a model trained on all of this data, they need to use federated machine learning. Or, let’s say there’s a new financial startup that wants to build a fraud detection system, but they don’t have any way of accessing the data. Federated learning could let them train on the existing data without having to transfer the data.
How do you see federated learning mitigating bias in these models?
What happens a lot is that when we train some models, they’re trained on a similar type of data set, so they become more biased. For example, if we’re training models within U.S. boundaries, there is a particular bias infused in the model that comes from the population distribution; this model may or may not be useful in other countries. But if we can improve the variety of data fed into the model, we are automatically improving the data quality. The new information from different geographic locations helps create less biased models. You can also integrate the existing bias-reducing frameworks with Devron’s ML infrastructure.
One part of Devron’s mission is reproducibility. How would this work if anyone can get the same results?
If you want reproducibility, you need some form of access to train the model on that specific data. Currently, if you want someone else’s model, let’s say one from Kaggle, to run on your data, you have to give your data to that person. But in a federated setting, you can provide “federated” access to the data, and multiple people can potentially run their models on your data for reproducibility. Essentially, there’s no other way to do it besides having the ground truths for that data source.
Have you thought about how to monitor these models in production?
Let’s say I’m a data scientist working with federated machine learning, and I’m training a model using five different data sources. What could end up happening is that one of the data sources might contain data crafted to poison the global model, or it could be that the data quality in one of those data sources is so poor that it’s actually degrading the model during the training process. So in federated machine learning, we need to track and monitor how each of these models is trained and how much value each data source adds during the training process. At Devron, we’ve built an in-house metrics tracking system to monitor whether the training is going in the right direction or if there’s some kind of poisoning attack. We’ve also thought about the scenarios where you can unlearn information that came in from bad actors.
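One simple way to picture this kind of monitoring is an outlier check on the updates each data source sends back. The Python sketch below is a generic heuristic, not Devron's metrics tracking system: the function `flag_suspicious_updates`, the `tolerance` parameter, and the source names are hypothetical. It compares each update to the coordinate-wise median of all updates and flags any source that sits much farther away than is typical.

```python
import math
from statistics import median
from typing import Dict, List, Sequence


def flag_suspicious_updates(updates: Dict[str, Sequence[float]],
                            tolerance: float = 3.0) -> List[str]:
    """Return names of sources whose update is far from the median update."""
    dim = len(next(iter(updates.values())))
    # Coordinate-wise median is robust to a minority of extreme updates.
    center = [median(u[i] for u in updates.values()) for i in range(dim)]
    distances = {name: math.dist(u, center) for name, u in updates.items()}
    typical = max(median(distances.values()), 1e-12)  # avoid a zero threshold
    return [name for name, d in distances.items() if d > tolerance * typical]


# Five hypothetical data sources; source_e sends a wildly different update,
# as a poisoned or very low-quality source might.
round_updates = {
    "source_a": [0.10, -0.20],
    "source_b": [0.12, -0.18],
    "source_c": [0.09, -0.21],
    "source_d": [0.11, -0.19],
    "source_e": [5.00, 4.00],
}
flagged = flag_suspicious_updates(round_updates)
print(flagged)
```

A distance check like this only catches crude outliers. Measuring how much value a source adds, for example by evaluating the global model with and without its update on held-out data, needs more machinery than this sketch shows.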
Do you use a lot of embeddings with these models to try to get more of a high-level overview?
If you ask any data scientist, they’ll say the most important parts of the model-building process are data analysis, data cleaning, transformation, understanding the data’s peculiarities, and getting a feel for it. Unfortunately, data scientists have no way to do that in federated machine learning. So we use various synthetic data generation techniques to give data scientists a feel for the real data. We also let them view private statistics and analytics of the real data. These insights are critical in the data science workflow and help them drive future decisions to understand the features and embeddings of the real data.
The way I think about federated learning is that it’s actually not that different from machine learning. We still have to think about drift, large-scale training, and production. The only difference is that you can’t see the data. Everything else in the machine learning pipeline still applies.
What types of customers are most interested in federated machine learning?
At Devron, we have customers in two main areas: government and FinTech. We have a lot of projects with the Department of Defense and other government organizations. We also work with various types of FinTech and large enterprise organizations that want to maintain ownership of their data but also want to give others access to it, or they want access to data but can’t due to privacy restrictions.
Do you see the potential for a community around federated learning?
That’s the final goal of the system: everyone has some form of access to the data, everyone gets to keep their data private, none of the data leaks, and all private information is protected, but you still get the benefits of machine learning. Many organizations may want to sell access to their data for running machine learning models and getting analytics without giving up ownership. Others might wish for a community to build models on their data, but they can’t afford to lose ownership of it. I can see this being massive, especially in industries where there’s a lot you can’t do because of personally identifiable information (PII) or protected health information (PHI).
What are the steps to make sure data is secure?
There are still vulnerabilities, like model inversion attacks, membership inference attacks, and model poisoning attacks. I’d say federated machine learning is step one towards a much larger privacy infrastructure for machine learning and analytics, and there is a long road ahead.
There are technologies like differential privacy that smooth and add noise to the information output from the data owner; this way, no information about any single data point is leaked. There are also encryption technologies through which you can take data and models from different locations and perform secure aggregation, such that only the aggregated model is visible to the aggregator and none of the individual models coming from each data source are exposed. There are many different layers you can add to increase the privacy of the data.
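Both ideas can be illustrated with a short Python sketch. It is a simplified teaching example, not a production protocol and not Devron's implementation; every function name here is hypothetical. `private_mean` adds Laplace noise to a clipped mean (assuming the number of records is public), and `mask_updates` simulates pairwise masking, where each pair of sources shares a random mask that one adds and the other subtracts, so the masks cancel only in the sum.

```python
import random
from typing import List, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_mean(values: Sequence[float], lower: float, upper: float,
                 epsilon: float, rng: random.Random) -> float:
    """Differentially private mean; assumes the record count is public."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)


def mask_updates(updates: Sequence[Sequence[float]],
                 rng: random.Random) -> List[List[float]]:
    """Simulate pairwise masking: source i adds a mask, source j subtracts it.

    In a real protocol each pair derives its mask from a shared secret and
    uses cryptographic randomness; one seeded generator is used here only
    to keep the simulation short and repeatable.
    """
    masked = [list(u) for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            for k in range(len(updates[0])):
                m = rng.uniform(-1000.0, 1000.0)
                masked[i][k] += m
                masked[j][k] -= m
    return masked


def aggregate(masked: Sequence[Sequence[float]]) -> List[float]:
    """Aggregator side: sum the masked updates; the pairwise masks cancel."""
    return [sum(column) for column in zip(*masked)]


rng = random.Random(0)
updates = [[0.5, -1.0], [1.5, 2.0], [-1.0, 0.25]]  # hypothetical model updates
masked = mask_updates(updates, rng)
total = aggregate(masked)  # matches the true sum up to float rounding
noisy_mean = private_mean([52.0, 61.0, 47.0, 58.0], 0.0, 100.0, 1.0, rng)
print(total, noisy_mean)
```

The aggregator sees only `masked`, where each individual row looks like noise, yet the column sums recover the true total. A smaller `epsilon` in `private_mean` means more noise and stronger privacy.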
To wrap it up, what would you say is your favorite part of the day-to-day? What’s the most challenging?
My favorite part is ideating and coding. I love delving into these problems and trying to solve them, coming up with new ways to solve the old problems. Some days, I’m reading papers to really understand a new technology and then try to implement it. I get satisfaction when I can deploy a product that someone is actually using. So that’s the best part of my job. For me, the most challenging part is managing the time between meetings and balancing my schedule to have enough time for the coding and deep-focus work.