DBSCAN, Explained in 5 Minutes
Author: Murphy | Views: 26513 | Time: 2025-03-23 11:44:10
What's DBSCAN [1]? How do you build it in Python? There are many articles covering this topic, but I think the algorithm itself is so simple and intuitive that its idea can be explained in just 5 minutes, so let's try to do that.
DBSCAN = Density-Based Spatial Clustering of Applications with Noise
What does it mean?
- The algorithm searches for clusters inside the data based on how densely the objects are packed, that is, on the spatial distance between them.
- The algorithm can identify outliers (noise).
Why do you need DBSCAN at all?
- Extract a new feature. If the dataset you're dealing with is large, it might be helpful to find obvious clusters inside the data and work with each cluster separately (train different models for different clusters).
- Compress the data. Often we have to deal with millions of rows, which is computationally expensive and time-consuming. Clustering the data and then keeping only X% from each cluster might save your wicked data science soul. That way, you'll keep the balance inside the dataset but reduce its size.
- Novelty detection. As mentioned above, DBSCAN detects noise, but that noise might be a previously unknown feature of the dataset, which you can preserve and use in modeling.
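To make the "compress the data" idea concrete, here is a minimal sketch of keeping X% of the rows from each cluster. It assumes you already have one cluster label per row (from DBSCAN or any other clustering); the function name `compress`, its parameters, and the toy data are made up for illustration.

```python
import random
from collections import defaultdict

def compress(rows, cluster_labels, keep_fraction, seed=0):
    """Keep roughly keep_fraction of the rows from every cluster."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the rows by the cluster they belong to.
    by_cluster = defaultdict(list)
    for row, label in zip(rows, cluster_labels):
        by_cluster[label].append(row)

    kept = []
    for members in by_cluster.values():
        # Keep at least one row so that no cluster disappears entirely.
        k = max(1, round(len(members) * keep_fraction))
        kept.extend(rng.sample(members, k))
    return kept

# Toy data: 100 rows, 70 of them in cluster 0 and 30 in cluster 1.
rows = list(range(100))
cluster_labels = [0] * 70 + [1] * 30

sample = compress(rows, cluster_labels, keep_fraction=0.1)
print(len(sample))  # 10 rows: 7 from cluster 0 and 3 from cluster 1
```

Note that this sketch treats every distinct label as its own group, so points marked as noise would be sampled like any other cluster. Whether you want that depends on your data.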
Then you may say: but there is the super-reliable and effective k-means algorithm.
Yes, but the sweetest part about DBSCAN is that it overcomes some drawbacks of k-means, the most obvious being that you don't need to specify the number of clusters up front. DBSCAN detects the clusters for you!
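Here is a rough pure-Python sketch of that behavior: the number of clusters is never passed in, it comes out of the data. This is a simplified illustration, not a reference implementation. It assumes Euclidean distance, and the names `eps` (neighborhood radius) and `min_pts` (minimum number of neighbors) are chosen here for illustration, as is the toy dataset.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point too, keep growing
    return labels

# Two tight groups of four points each, plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]

labels = dbscan(points, eps=1.5, min_pts=3)
n_clusters = len(set(labels) - {NOISE})
print(labels)      # [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(n_clusters)  # 2
```

The two groups are found without being asked for, and the lonely point gets the label -1, which this sketch uses to mark noise.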
DBSCAN has two components defined by a user: vicinity, or radius (