Clustering - Analysis

Posted about 6 years ago by Daniel Benjamin Joplin


This demo builds on the outcome of Clustering - An Introduction; it is recommended that you are familiar with that demo before proceeding.




To understand how the clustering was performed, and why three clusters were chosen, we will look at how the Clustering block saw the data and how it made these choices. The scatter plot below shows approximately what the Clustering block sees when it performs a clustering; here it was produced using the Reduce Dimensions block.



The Reduce Dimensions block creates combinations of fields that show the structure of the data more clearly. It may be a little difficult to see from this plot why three was the correct number of clusters in this case. Take a look at the scatter plot below: it is the same as the scatter plot above, but this time each point is sized by the density of points in its area and coloured by its assigned cluster.



From this scatter plot it is much clearer that there are three areas of higher density, and it is this that allowed the Clustering block to determine that there were three clusters. This is all done automatically by the Clustering block, but if you think there should be a different number of clusters you can easily set that option in the block.
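
To make the idea of "density of points in its area" concrete, here is a minimal Python sketch that counts, for each point, how many other points lie within a given radius. This is only an illustration: how Omniscope actually measures density for the plot is not stated here, and the function name `local_density` and the `radius` parameter are hypothetical.

```python
import math


def local_density(points, radius):
    """For each point, count the other points lying within `radius` of it.

    Illustrative only: this is an assumed density measure, not necessarily
    the one used to size the points in the scatter plot.
    """
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


# Three points close together and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(local_density(sample, radius=1.5))  # [2, 2, 2, 0]
```

Points with a high count sit in a dense area and would be drawn larger; the isolated point gets a count of zero.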


Our own Clustering block uses an approach similar to k-means++, but many other clustering approaches are available. These can be used through the R and Python blocks, which allow custom code to be executed in the workflow. The blocks also come preloaded with scripts that let you perform both DBSCAN and Gaussian Mixture Modelling.
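
For readers unfamiliar with k-means++, the sketch below shows the textbook version in plain Python: the starting centres are spread out by picking each new one with probability proportional to its squared distance from the centres already chosen, and ordinary k-means iterations follow. This is not the Clustering block's own code, which is only described as similar; the function names and the `iterations` and `seed` parameters are hypothetical, and the number of clusters is passed in by hand rather than determined automatically.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng):
    """Pick k starting centres: the first uniformly at random, each later one
    with probability proportional to its squared distance from the nearest
    centre chosen so far. Assumes at least k distinct points."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        weights = [min(squared_distance(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


def kmeans(points, k, iterations=20, seed=0):
    """Standard k-means iterations started from k-means++ seeds.

    Returns (centres, labels), where labels come from the last assignment step.
    """
    rng = random.Random(seed)
    centres = kmeans_pp_seeds(points, k, rng)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centres[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centres, labels


# Three well-separated groups of three points each.
sample = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 5.1), (5.1, 4.8),
          (10.0, 0.0), (10.1, 0.2), (9.9, 0.1)]
centres, labels = kmeans(sample, k=3)
print(labels)
```

The label numbers themselves are arbitrary and depend on the random seed; what matters is that points in the same group share a label.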


Using the Python block we have performed DBSCAN. DBSCAN is a density-based clustering algorithm that also identifies noise in the data. Points considered noise are assigned to cluster zero, and in the scatter plot below they can be seen in bright yellow. These points are too far from the clustered points to be treated as part of any cluster.
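
As a rough illustration of how DBSCAN separates clusters from noise, here is a small plain-Python sketch of the standard algorithm. It is not the script that ships with the Python block, which may differ; the parameter names `eps` and `min_samples` are assumptions borrowed from common usage. The sketch follows this post's convention of labelling noise as cluster zero, whereas scikit-learn's `DBSCAN`, for example, labels noise as -1.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster number (1, 2, ...) or 0 for noise."""

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = 0  # provisionally noise; may later become a border point
            continue
        # Point i is a core point, so it starts a new cluster.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == 0:
                labels[j] = cluster  # noise reachable from a core point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_samples:
                queue.extend(nearby)  # j is also a core point, so keep expanding
    return labels


# Two tight groups of four points and one far-away point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(sample, eps=1.5, min_samples=3))  # [1, 1, 1, 1, 2, 2, 2, 2, 0]
```

The last point has no neighbours within `eps`, so it never joins a cluster and keeps the noise label of zero.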


Using the R block we have performed Gaussian Mixture Modelling. Gaussian Mixture Modelling fits Gaussian distributions to the data, and in this case found three clusters.
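
To show what "fitting Gaussian distributions to the data" involves, the sketch below runs the expectation-maximisation (EM) procedure for a one-dimensional Gaussian mixture. It is written in Python to match the other examples here and is not the script used in the R block. It is simplified in several ways that are assumptions of this sketch: the data is one-dimensional, the number of components is fixed by the number of initial means supplied rather than found by the algorithm, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def fit_gmm_1d(values, initial_means, iterations=50):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: how responsible each component is for each value.
        responsibilities = []
        for x in values:
            densities = [w * normal_pdf(x, m, v)
                         for w, m, v in zip(weights, means, variances)]
            total = sum(densities)
            responsibilities.append([d / total for d in densities])
        # M-step: re-estimate each component from its weighted values.
        for j in range(k):
            mass = sum(r[j] for r in responsibilities)
            weights[j] = mass / len(values)
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, values)) / mass
            spread = sum(r[j] * (x - means[j]) ** 2
                         for r, x in zip(responsibilities, values)) / mass
            variances[j] = max(spread, 1e-6)  # floor keeps the variance positive
    return weights, means, variances


# Three groups of values centred near 0, 5 and 10.
sample = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1, 5.3, 9.7, 10.0, 10.2, 10.3]
weights, means, variances = fit_gmm_1d(sample, initial_means=[1.0, 4.0, 9.0])
print([round(m, 2) for m in means])
```

Each value can then be assigned to the component with the highest responsibility for it, which is what turns the fitted distributions into clusters.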


This concludes the series on Clustering in Omniscope. All files and data related to the series can be found on the nightly server.
