Your Data Science Workflows Are About To Get A Lot More Scalable
Doris Lee is the CEO and co-founder of Ponder, which recently closed a $7 million seed round led by Lightspeed Venture Partners with participation from Intel Capital, 8VC, and The House Fund. The company, which promises to solve usability challenges with data science tools at scale, is already achieving momentum in-market due in no small part to the founding team’s ample contributions to the open source community. While completing her PhD in Information Management and Systems at the University of California, Berkeley, Lee developed Lux – a widely-used Python library for accelerating and simplifying the process of data exploration – and attracted the attention of Facebook, winning the company’s 2020 fellowship for systems in machine learning.
I’d love to start with a quick overview on your background and research at U.C. Berkeley and why you decided to pursue a PhD in information systems.
I studied Physics and Astrophysics as an undergraduate at Berkeley. At the time, data science was still a nascent field and I was really interested in applying some emerging data science techniques to my astronomy research. So I started diving into tools like Jupyter notebooks and Pandas and many other amazing open source tools, picking things up along the way.
One of the things I quickly realized is that finding insights from your data – especially when you don’t have a clear idea of what you’re looking for or are just exploring – is a very challenging task. Not only do you have to be an expert in statistics and your own domain (in my case, astronomy), but you also need extensive knowledge of how to work with specialized tools. This can be a daunting task for those without a computer science background.
Having experienced this pain point myself, in graduate school, I set out to develop tools and systems that make it easier for people to explore and understand their data. Early in my PhD, a lot of the focus was around how to build these tools so that non-programmers, business analysts, and domain experts could derive value from their data – no-code or low-code tools to help with visualizations and data exploration, all with a focus on showing and discovering insights automatically for users.
That thread and motivation became the core of my PhD work. My goal became building automated assistants – or what we call visualization recommendation tools – that help users discover insights. Essentially, what these tools do is they go into the data and look for statistical patterns and trends and then show them to the users in automated ways. I built a few of these systems throughout the course of my PhD.
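To make the idea of a visualization recommendation tool concrete, here is a minimal sketch of one way such a tool could rank what to show first: score every pair of numeric columns by the strength of their linear correlation and surface the strongest pairs as candidate scatter plots. This is an illustration only and is not the algorithm used in Lux or in any of the systems described here; the function names, the scoring rule, and the toy data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear trend to show
    return cov / sqrt(var_x * var_y)


def recommend_pairs(table, top_k=3):
    """Rank column pairs by absolute correlation, strongest first.

    `table` maps column names to lists of numbers (a hypothetical
    stand-in for a DataFrame). Returns (col_a, col_b, score) tuples.
    """
    scored = [
        (abs(pearson(table[a], table[b])), a, b)
        for a, b in combinations(table, 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(a, b, round(score, 3)) for score, a, b in scored[:top_k]]


# Hypothetical toy data, invented for this example.
table = {
    "distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "brightness": [10.0, 8.1, 5.9, 4.2, 2.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
recommendations = recommend_pairs(table, top_k=2)
print(recommendations)
```

A real recommender would weigh many more signals (distributions, outliers, categorical breakdowns, user intent), but the shape is the same: compute a score for each candidate view, then show the highest-scoring ones without being asked.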
Despite the fact that there is a lot of research work in this field, one persistent barrier remains: lack of adoption. Often, real-world users like data scientists don’t necessarily adopt these tools in practice. So a big emphasis in my PhD was trying to understand the bottleneck in adopting these more intelligent tools – particularly how to build better systems, derive system design principles and guidelines, and actually build these into tools that help data scientists and others surface insights.
Can you tell us about Lux, which you developed to wide acclaim while completing your PhD?
Lux is a visualization tool built on top of Pandas DataFrames. Essentially, Lux is a tool that finds and displays visual insights automatically to the user. The way it works is that when you’re working with a Pandas DataFrame, you’re often starting off with a tabular view of the dataset – so you can think of this as a spreadsheet, with rows and cells and columns of data. With Lux, there is an additional button you can click that now suddenly shows panels of dashboards and visualizations – all presented to you without having to do any extra work!
Basically, Lux answers the question: what should be the next step in my analysis? We call Lux an “always-on” tool, because it is always going to show you these visual insights without you having to write code or explicitly ask for it. Oftentimes we find that by just minimizing that overhead to exploring your data, it helps people discover interesting things that they might not otherwise set out to find.
This approach is resonating with data scientists and has been adopted across a variety of industries, from pharmaceutical to insurance to retail. Lux has been used by thousands of data scientists around the world, with over 3,500 stars on GitHub and over 100k downloads.
After spending several months in stealth, you recently unveiled the launch of Ponder. Is Ponder an outgrowth of your work with Lux?
Yes, but that’s only one part of the story. Ponder is largely built on the work that my co-founders (Devin Petersohn and Aditya Parameswaran) and I did at UC Berkeley on improving data scientists’ experience in using Pandas.
For some context, Pandas is one of the most popular libraries in Python for data analysis, cleaning, preparation, and exploration. However, it is incredibly difficult to operate on large datasets with Pandas. When you start to work on anything over a couple of gigabytes, things start to slow down, you run into memory issues, and your analysis and discovery process quickly grinds to a halt.
To solve this problem, we developed a tool called Modin, which is a much faster and more scalable version of Pandas. What’s special about Modin is that it not only has this amazing ability to parallelize everything under the hood, it does so in a way that doesn’t require you to change a single line of your code. It keeps all the details of distributed computing hidden away from the user, so that users can get to scalable insights faster. In other words, Modin gives you the benefits of performance and scale, without forcing you to rewrite your Pandas workload for a big-data framework like Spark or in SQL.
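The general idea behind hiding parallelism from the user can be sketched with the standard library alone: split the data into partitions, do the same work on each partition independently, then merge the partial results so the caller gets the same answer as the single-threaded version. The sketch below is a toy illustration of that pattern and is not how Modin is implemented; `split_rows`, `partial_sum_count`, and `parallel_mean` are hypothetical names. (In Modin's documented usage, the user-facing change is swapping the import to `import modin.pandas as pd`, with the rest of the Pandas code left as written.)

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_partitions):
    """Split a list of rows into roughly equal contiguous partitions."""
    size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum_count(partition):
    """Per-partition work: return (sum, count) so results can be merged."""
    return sum(partition), len(partition)


def parallel_mean(rows, n_partitions=4):
    """Compute the mean of a non-empty list, partition by partition.

    Threads keep this example portable; in CPython they do not speed up
    pure-Python arithmetic. A real system would run partitions in
    separate processes or on a cluster.
    """
    partitions = split_rows(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))
print(parallel_mean(values))
```

The caller only ever sees `parallel_mean(values)`; how the rows are partitioned and where the work runs stays behind the function boundary, which is the property the paragraph above describes.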
At Ponder, we are on a mission to improve the usability and scalability of Pandas. We already discussed how Lux helps make Pandas more usable by making visualizations seamless for users. Modin addresses the scalability problem of Pandas by automatically scaling up your Pandas workload. Both projects were a result of our many years of research at UC Berkeley and development in the open source community. More than 30 companies and organizations, including 10 Fortune 100 companies, are using our open-source technology to scale up their Pandas workloads and accelerate their data teams.
So is your primary target data scientists and ML teams or is it more of a low-code offering for citizen data scientists?
Our focus for these tools is definitely more around folks who know and love Pandas. And this is by no means a small niche, since Pandas has an estimated five to ten million users and is often dubbed “the most important tool in data science.” We’ve also seen a huge influx of practitioners moving from spreadsheets to using Pandas due to its convenience and flexibility, as well as its ability to integrate well with other tools in the Python ecosystem for machine learning (ML).
One of our core principles is to keep the tools that users are already comfortable working with but make them more scalable. With Ponder we are making scalable, enterprise-ready Pandas more accessible to a broader group of data professionals so that users do not have to learn a new framework or adopt a completely new platform to get the benefits of scale.
What is the primary pain point that exists today that Ponder is trying to solve?
The data science process is highly agile and iterative, with people constantly refining their hypotheses and building on top of their analysis and insights. Unfortunately, most prevailing tools require users to either write a lot of code or switch to a completely different framework to be able to work with large amounts of data. Often, the scenario looks like this: data scientists will prototype their work in a computational notebook (e.g. Jupyter) where they can iterate and experiment with their code. When they want to take this code and run it on large datasets, things start to break down.
The reality is that when you shift from a small-scale dataset to a large-scale one, there’s a huge gap between these different worlds. You would think that just turning a knob for more data would be relatively easy because the workload is the same, but in fact people are having to rewrite their workloads completely. This takes a very long time because you have to find the folks with the right skill set who know big data frameworks and have the distributed computing knowledge to be able to scale up their workloads.
What we’re doing at Ponder is helping people seamlessly scale up to these larger use cases to get insights quicker without a hefty translation phase.
Returning to the earlier question on who we serve, Ponder is really democratizing the data science process in the sense that you might have domain experts who are familiar with Pandas but who might not want to worry about distributed computing or how to carefully partition data. Ponder handles all of this for you so that data teams can scale up their analysis without breaking a sweat.
How has the experience been going from stealth to launch?
Even though Ponder has only been out of stealth for a short time, we are continuing to see huge demand for and interest in our technology. We are excited to ramp up our efforts to support our industry users and help them scale, work on larger datasets, and get to insights faster.
Any predictions on where the market is headed?
I’m no psychic, but a few shifts in the market are becoming clear. Despite waves of digital transformation and the fact that data science has been evolving for over a decade, there are still a lot of pain points in the space.
Given that our work draws from both human-centered computing and data management, at Ponder we think a lot about how to optimize for human time, as opposed to simply what runs faster in terms of CPU cycles. It’s all about prioritizing developers’ time and productivity and making their experience seamless.
I think we’ll be seeing a lot more tools focused on a better user experience, whether that means automatically finding insights or scaling up like Ponder does for data scientists. Especially as the ecosystem matures and as we go up the hierarchy of needs, it’s not enough for tools to just work; they also need to be laser-focused on making things easier for the user.