Francisco Castillo and Leland McInnes talk about surfacing insights and discovering issues in your unstructured data. This talk was originally given at Arize:Observe in April 2023.
Francisco "Kiko" Castillo: Thank you everyone for coming to the conference, in particular to this talk. I am Francisco Castillo. I'm a software engineer and data scientist here at Arize. My work is focused on the embeddings slash unstructured data product in both Arize and Phoenix. Let me start by reminding everybody to ask any questions in the chat or in the Slack community. If we have a lot to unpack today, feel free to go to this our Arize community and we'll make sure to answer your questions there. We have some people that might answer questions in the chat, but if we don't get to them, the community is ready for you. Today we're going to talk about surfacing insights and discovering issues in your unstructured data. To do so, Arize and Phoenix leveraged two tools, UMAP and HBDScan, which can be a little bit complex to understand. to help us understand them that the maintainer himself, Leland, thank you so much for being here we appreciate your time and your insight as always
Leland McInnes:Thank you very much for having me.
Kiko: Let me introduce you a little bit to Leland. He's a researcher at the Institute of Mathematics and Computing. He's also a maintainer for a variety of ML packages, including UMAP and HBDscans, which we talk about today. He also contributes to Scikit-learn and Scikit-TDA. So chances are that if you work on ML, you've benefited from his work one way or another. OK, so let me go a little bit of the outline today. We're going to talk about UMAP. In both, we want to give a basic, at least a good understanding of what they are, what they do, how can we set the parameters that they depend on for our benefit, what are the strengths and limitations, and maybe some hints and tidbits from a person like Leland that is very, very, very much used to working with these tools.
So let's start. And if we have a little bit of time with this, it might be pushing it. But we're going to go to try to ask some generative use case questions. Before I join, so how do these tools look like? Let me give you a little. So this that you see here, this is a little peak. This that you see here is a UMA plot that Arize Phoenix offers us. Now, we have clustering approach, which is a collection of data points that are similar to each other, as you see here. We are highlighting. If I select this first one, we see we are able to identify very quickly images that are poor quality. Now, without going into more, we have a demo in the conference, so feel free to attend. But if we look here at the panel, there's a lot of settings, and these come directly from the packages. So let me zoom in a little bit. Let me zoom in a little bit. So let's start with HTV Scan. So these settings that you see here are critical or HDB scan. So let's start with HDB scan.
Leland, could you give us a little bit of, just to set up the conversation, what is HDB scan, roughly overall, what is it good for?
Leland: Density-based clustering approach. So the goal of any clustering algorithm is to try and find clumps or regions of the data that are all sort of similar to each other in some way. And the goal of the density-based clustering algorithms is rather than just finding compact regions that are compacted together, so all the distances are close, it tries to find regions where the data is dense that are separated from other regions by areas where the data is more So you're looking for these islands of density amidst a gigantic sea. And that's common to most density-based clustering algorithms. So dbScan would work that way. What makes hdbScan more interesting is that it has a hierarchical component. So in a sense, if you have regions of different density, something like dbScan will not pick them out quite so well. islands of density within a sea, you can raise or lower the sea level and different islands will pop up at different sea levels. So being able to have a different sea level to pick out islands in different regions of your data space, that's really important and that's the sort of thing that HDB Scan empowers.
Kiko: That's actually a very good analogy. I was having a talk internally with folks at Arize. And I could have used that analogy. I wasn't able to explain it as well as that. That was a good one. And moreover, what are strengths, limitations? Well, let me start with that. You mentioned density-based algorithms. This is a shameless screenshot from one or many talks and John Healy's talk. So I want to see. So these are a collection for the audience. It's a collection of clustering algorithms. And this table, we have different types.
And Leland, could you speak a little bit of why STB scan is, or algorithms such as STB scan that are placed in that quadrant are interesting, beneficial, they're lacking, they're drawbacks versus other parts of the table?
Leland: Yeah, so I think let's start in the upper left quadrant there where we're talking about centroid-based or importantly parametric models and flat clusterings. you have some model of what your clusters should look like. So the GMM there is Gaussian mixture models and that assumes that your clusters should look like Gaussian distributions. So if you know what your data clusters should look like, if you can build a statistical model of them, that's a fine approach. But often you don't know a priori what your data should look like. And that's where the density-based techniques, they learn the data distribution and that sort of a model in and of themselves. So they're much more powerful. So that's where DB scan and mean shift are good. But the next catch is that there's also whether you should break it down by flat clusterings or a hierarchical clustering method. Hierarchical clustering means you're gonna build up a full hierarchy of how different clusters merge together until you get to a single root cluster. technique, but most of them tend to be more in the centroid parametric mold. And HDB scan is one of those rare algorithms that manages to combine the sort of best of both worlds of these. So it's density based, but it does have that hierarchical structure. It still has the ability to pull out these flat clusterings, but you can explore the full hierarchical clustering if you want, but it's taking advantage of that hierarchy to manage to get good flat clusterings bringing together all of these different aspects that make different clustering algorithms good.
Kiko: So in terms of working with HDB scan, is there any prerequisites? Like is there anything in the data that I can use it for? Or is there any guides of data that I should never even touch? Because it's just not gonna work. Are there any opposites or guidelines that we can have a quick insight in?
Leland: Yeah, so the main thing is that HDB scan tends to like lower dimensional data. And this is really just something that's true of density based clustering in general. pretty variable, but in practice, I would say up to about 50 dimensions is really, really pushing your luck. But below 50 dimensions, you're probably, you've got a decent chance. 50 is quite a large number, but compared to the sorts of things that you get output from your networks and things like that, that are often hundreds or thousands of dimensions, it's not very many. with this higher dimensional data is that the notion of density just gets trickier in higher dimensions because higher dimensional spaces have in a sense more corners, there's more space to be filled up. And so you need ever more data to manage to get the same amount of density in the space. And if you're building your whole algorithm out of finding and comparing densities, you It means for high dimensional data, you would need ridiculously large volumes of data to manage to get enough traction for these algorithms to really bear fruit. Now, the catch is that often, while the data exists in a high dimensional space, there's some underlying structure to the data that maybe lives in a lower dimensional sort of portion of that space in some way. But it's very hard to do that out of the box, just using density-based clustering ambient to that high dimensional space.
Kiko: I see. So it seems like there is a tool that we are going to talk later about solving that problem. Before we move on to UMAP, which solves that problem, or at least helps, I would like to ask you a couple more questions. HTBScan in its name is with noise. That's what the N stands for. So it seems like a very, also DB scan, it seems like it's important enough that we included it in the name. Why is noise important and why have you found it so crucial to be even a part of the name, like it's its identity.
Leland McInnes:
This really comes down to what one means when one defines a cluster. When a lot of people talk about clustering algorithms, they really mean what I would refer to as a partitioning algorithm, which means you just partition up the data you have. But for a lot of density-based approaches, what you really mean by cluster is a dense region, and not all of your data is going to fall in those dense regions.
Kiko:
Mm-hmm.
Leland McInnes:
To sort of essentially point out these outliers, the noise in the data and say, this doesn't have to belong to any given cluster, is a really powerful tool. And so, whereas something like k-means, which visually assigns every data point to a cluster, it tends to end up with quite polluted clusters because a whole bunch of noise points that are just sort of a little bit more scattered or random, they have to be assigned to some cluster. up with extra weird points in it for no particular reason than that was the closest cluster to these noise points. So dbScan and HDBScan handle this noise aspect much more cleanly and they essentially drop away these outliers and just ignore them for the purposes of the clustering and returns clusters that are just dense and ignores noise points and sets them aside as a different class entirely.
Kiko:
I see. So it's almost like quality control, if you will. Like, okay, if you're not truly sure, just don't be afraid to drop them. These aren't just points that are in the middle between two classes, just not fully part of the group.
Leland McInnes:
Yeah,
Kiko:
Okay.
Leland McInnes:
yeah, no, I think that's a great way of thinking about it. It's trying to not lie with what you're returning as your clusters. So if HDB scan returns some clusters, those things are clustered together. It doesn't always find maybe the clusters you were looking for because it doesn't have enough data to manage to extract that information or what have you. It won't it won't falsely claim things are a cluster whereas a lot of these other algorithms Especially things like k-means will tend to do that. They'll just they have to assign everything to something So they'll just assign it in whatever way they can and so this is as you say Yeah, a quality control if it can't find a cluster out of this data It will just set those data points aside and say I don't know what to do with these
Kiko:
That's one of my favorite features, the fact that I can confidently say, OK, if it's to be excluded, it's allowed. Let me move into UMAP. We're going to come back to HCD scan and the happy parameters, but let's move on to UMAP for a second. Same question. I would like to know, what is UMAP in simple terms? And also, I'm very curious what led you to work on that problem. It seems like it's a very niche problem. come to happen.
Leland McInnes:
So it actually grew out of my work in HTV scan and clustering for exactly the reason I described. HTV scan is really bad at clustering high dimensional data because it struggles with that ambient dimension and being able to define density in that way. So you really need a way to reduce the dimensionality of the data in a way that preserves structure. So UMAP is a dimension reduction or manifold learning algorithm that tries to take high dimensional data and learn essentially some latent structure. It's built on topological reasoning, so it tries to understand the topology of the data and then give a lower dimensional representation of that data that hopefully preserves as much of that structure as it can while still compressing it down to lower dimensions.
Kiko:
Got it. I actually, that's something I learned today. Maybe because it's the way I learned them, but I learned first about UMAP, then about STB Scan. It's quite a surprise that actually became to happen in the opposite order. But it makes sense. It's like you said, it's pure need. What would be one of the strengths, or the ones that you are happy to share, the strengths, limitations of UMAP, and where is the normal roadblock people tend to encounter.
Leland McInnes:
Yeah, so one of the strengths of UMAP as a dimension reduction is that it is much better at following nonlinear structure in the data because it cares about topological structure. It is dramatically better than things like PCA if your data doesn't sort of project onto a hyperplane very well and to be honest most most interesting high-dimensional data doesn't do that. to a small number of dimensions and really managed to preserve a lot of that structure, UMAP is pretty good at that. Now the limitations are you're still trying to compress everything down to a very small number of dimensions and not everything fits there. So it's going to have to make the best approximation it can and it runs a little optimization algorithm trying to do as well as it can. isn't always going to find something that is a true representation of the data in that lower dimensional space. It's going to give you the best representation it can and maybe that's not always sufficient for your data.
Kiko:
Because you were referring now about those limitations, and I got to ask because we have a lot of settings. So I'm going to move back to the page here. So as I said, on the top, we have the settings for HTV Scan. On the bottom, we have the settings for UMAP. So I would like to ask you if you could give the audience a little bit of insight on what these settings are. Let's start with HBDscan. on cluster size with the minimal number of samples, the mean samples that is in the algorithm and the cluster selection epsilon. What are they and how can they help? solve the correct, let's have a little bit of a question, how the incorrect choice of these hyperparameters can lead us astray or the correct choice can help us overcome some of these difficulties.
Leland McInnes:
Sure, so let's start with minimum samples, because that's actually inherited from the dbScan algorithm, and there's technical details of how one could explain that, but I think intuitively, if you want the intuition for it, it's really about how confident you want to be with regard to noise in the data. the minimum samples, the more noise you're generally going to get. So it's going to be a more conservative clustering that is going to try and really focus on really dense areas more so than the background noise. So if you turn it up, you get sort of a much more conservative clustering with a lot more data, probably declared as noise. And turning it down, you can manage to cluster more of your data. is less conservative, it might glom together more things than perhaps you are looking for. So it's a dial you can turn. It's fairly robust. And what I mean by that is some hyperparameters on algorithms, if you turn it a little, it changes a little, and then the next little step you make, everything changes completely, and then another little step, everything changes completely again. samples is not so much one of those. You don't need to pick the exact value. You can usually get something in the broad range and changes might vary things a little but it's unlikely to have really dramatic changes except for very occasionally. So mostly it's not a matter of picking the perfect value, it's managing to get a value in the right sort of ballpark to get you what you want. is really how big is the smallest group of points you want to consider a cluster. So if you have a lot of data and you know that it breaks up into big clusters you probably want to choose a large value for that. If you want to focus on just individual smaller dense regions maybe you want to choose a smaller value. That's really down to your notion of what you think should mean cluster is how many points do you need to have before you can really feel like that's a cluster of points we should have that called out as a cluster. So I feel like that one is a little more intuitive if you have a notion of how big should a cluster be. You probably have some idea of that for your data.
And lastly, cluster selection epsilon. I would have called that more in many ways, but it's the ability to, if there are lots of tiny clusters that are all pretty close together, you can glue them together if they're within some bounded distance of each other. And that's what that Epsilon is giving you. How close together do clusters have to be before, I'm just gonna call them one cluster, despite the fact that the pure hierarchy split them apart into different clusters. So if you know something about distance ranges in your data, use. The most common use case I've seen for that parameter generally is if you're clustering geographic data often
you can say oh well I don't want any cluster that's smaller than a city block say and so you can set the cluster selection epsilon to the size of a city block and it won't sort of split everything into little tiny clusters within a block it'll just glom all those together even if they are separately very dense within that region.
Kiko:
I'm glad that you said gluing, because internally, that's how we call it a gluing parameter. So I'm glad that I got validation on that one. Yeah, that's very interesting. One question only. On the minimum cluster side, just for the danger that I believe that people can misunderstand on, it's not a choice, correct? It's a lower bound. I can say my minimum cluster is 10, but if my data is very dense, the minimum, if I have 10 clusters, maybe the lowest amount in that cluster or a thousand. It's just a lower bound doesn't mean that I'm gonna get that correct.
Leland McInnes:
Yeah, it means that if you have a group of points that are fewer than that min cluster size, it's not going to separate those out and call that a cluster. That's it. Clusters can be any size greater than that. That's the absolute lower bound on the cluster size. Exactly. Yes.
Kiko:
Got it. All right, let's move to UMAP. I'm going to put the hyperparameters of UMAP here. This third is the number of samples for this plot. So this third is not a UMAP hyperparameter. But these first two, it is. We have the minimum distance and the number of neighbors. Could you elaborate on those?
Leland McInnes:
Yeah, let's start with number of neighbors because it has the largest impact here. So I talked about trying to understand the topology of the data in the high dimensional space and to do that you need to look at local regions around each data point and try and estimate what does the topology look like in this local region. out should I look to determine that local region. So if you choose a small value of in neighbors, it means you're focusing on the sort of very local variations within the data. And so it's gonna glue together all these patches, but each one is gonna have a very, very local view. So if there's a lot of detailed variation in your data, you'll manage to pick up that detail. But because each patch doesn't see very far, to glue it together quite as effectively.
If you choose a larger value you're now looking out further so the ability to glue those patches together and get richer global structure it's going to see more of that the larger the value you choose and the sacrifice that you're making to do that is it's having to sort of average out within that patch that it's looking at and so you're gonna lose off between the sort of very fine detail local structure and sort of more longer range, more global structure. You're not going to get like full scale global structure because it's looking at everything locally and gluing it together, but it's definitely the trade-off to how far out you want to be looking to sort of build up this notion of structure.
So you're tending from a very local distance is really for laying out the data in the low dimensional representation. It's really a setting of how tightly you should be allowed to pack points together. Um, because it's trying to solve a complex optimization problem. And many times one solution to that is simply to pack a whole bunch of points into essentially exactly the same spot. If you increase the minimum distance, it says, don't try not to do that. And it'll push them farther apart. But that means it can't pack everything as tightly as it would like. So you're sort of get a different solution to your optimization problem. Things don't necessarily sort of form the nice tight structures and or clusters that you might have been looking for. So it's a bit of a trade-off about how tightly you're willing to pack points together.
Kiko:
Got it. So now that we've covered, I'm gonna go back to the slides, but now that we've covered basically them individually, I would like to ask you about the combo, especially because we've seen a lot of teams and users of both Horizon Phoenix that they're landing to the page, and maybe this is the first time they see their data. They're seeing the data with the default parameters that we have researched to work decently well But it really is, every data is slightly different, and sometimes very different to each other. So what is your experience, or what kind of advice can you share when you're in that exploratory data analysis phase? You land your data, first time you see it, are these parameters correct? Am I looking at what I should be looking? How is this representation lying to you?
Leland McInnes:
I personally like to start with the UMAP plot in 2 or 3D. If you're using UMAP simply to pass through to clustering, you don't even need to do 2 or 3D. You could do a five or 10 dimensional reduction. But from my own work, I tend to start at the very least with the 2 or 3D plot so that I can actually visualize it and see what I'm getting. And that allows me to tweak U-map parameters to a degree, especially the in-neighbors. I mean, that's really what I'm looking for. I can try a few different values and see if that changes things very much. And often you actually find that you can set the in-neighbors value to, you know, within a fairly large range, and it won't have very much impact on the qualities of output you get. So that gives you a good sense of parameter values, that's probably a reasonable choice. For minimum distance, if you want to use it for clustering, turning that down as low as possible tends to be pretty useful because for clustering, you want to pack points close together. UMAP is going to try and essentially find connected components of the manifold, and it'll preserve to a degree that connected component structure, that's topological structure that it cares about, be able to pick up on that. So if you allow it to pack points together it'll preserve that better. So then you turn to HDB scan and try and ask yourself well what are the what are the good parameter settings here and to be honest I often just choose a min samples value that seems somewhat reasonable. 5 to 10 is often plenty unless you know you have a lot of noise in your data and then you to turn it up higher. But for a lot of data, it's not that noisy. So 5 to 10 is pretty much fine. You can try it with that and see how it goes. The harder one to pick at times would be the minimum cluster size. And that's where having looked at that UMAP plot is really, really helpful. Because you will have seen some of the structures that UMAP is pulling out and gone, I see how big some of these clusters are. the smallest possible cluster I'd like to see is, and you can take some educated guesses. And again, it's relatively robust to those parameter choices. So having an educated guess often gets you pretty much what you want anyway, as long as you can get that first step of having some education about what your first guess should be. And if you're not getting the answers back that you want, it's not that hard to tweak that parameter.
Kiko:
Awesome. Well, unfortunately, I don't think I'm going to have the chance to ask you number four. We are time. Thank you so much, I find this talk to be extremely helpful. It would have made my life way easier back in the day when I started learning about these tools. So hopefully, the audience and other people watching it at another time, they enjoy it. Thank you so much, Lina, for everything, for every answer. Well, this is it for the talk. Thank you so
Leland McInnes:
Thank you, Kiko.
Kiko:
Thank you for coming.