Clustering - An Introduction

Posted over 6 years ago by Daniel Benjamin Joplin


We are given a dataset containing a variety of wines, with data on a number of characteristics of each wine, some based on colour and some on chemistry. Our goal is to find natural groupings of wines, that is, wines which are similar in some way.

First, let's take a look at the data and see whether we can identify any clear groupings in the raw data. The correlation plot shows us that some fields are very closely related, so there is little point in plotting similar fields against each other; we wouldn't gain much information.
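As a rough sketch of how the least correlated fields might be picked out in code rather than read off the correlation plot, the following Python computes the Pearson correlation for every pair of fields and keeps the pairs with the smallest absolute value. The field names and values are invented for illustration and are not taken from the wine dataset; `pearson` and `least_correlated_pairs` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def least_correlated_pairs(columns, k=2):
    """Return the k field pairs whose |correlation| is smallest."""
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(columns, 2)
    ]
    scored.sort()
    return [(a, b) for _, a, b in scored[:k]]


# Toy values, made up for this example (not the real wine data).
wines = {
    "alcohol": [13.2, 12.4, 14.1, 13.0, 12.1, 13.8],
    "phenols": [2.8, 2.0, 3.1, 2.6, 1.9, 3.0],
    "hue": [1.05, 0.70, 0.92, 1.10, 0.85, 0.60],
}

pairs = least_correlated_pairs(wines, k=2)
print(pairs)
```

In these toy values alcohol and phenols rise and fall together, so that pair is the one left out; the two pairs involving hue are the ones worth plotting.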

We plot two of the least correlated pairs of fields below to see if there are any obvious groupings.

The two scatter plots don't tell us much. If anything, it looks like all these wines are very similar, with no clear distinctions between them. By using the Clustering block we may be able to find groupings which would be difficult to determine manually.

The Clustering block has successfully discovered three clusters. Now let's find out what they look like.

The parallel coordinates plot shows us that there is some variation in the groupings. However, the Proline values are significantly larger than those of the other fields, which could lead to a bias where the Proline field effectively determines the groupings.

The above scatter plot shows us that this concern is well-founded: a wine's cluster can be determined entirely from that one field, and the other fields seem to have little to no effect. We will need to normalise the data before clustering.
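To see why one large-valued field can decide the groupings on its own, it helps to look at how much each field contributes to the distance between two wines. The demo does not say which distance measure the Clustering block uses; the sketch below assumes Euclidean distance, and the two wines and their values are invented for illustration. `distance_shares` is a hypothetical helper name.

```python
def distance_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each field."""
    squared = {field: (a[field] - b[field]) ** 2 for field in a}
    total = sum(squared.values())
    return {field: value / total for field, value in squared.items()}


# Two made-up wines: Proline is in the hundreds, the other fields near 1 to 15.
wine_a = {"alcohol": 14.2, "hue": 1.04, "proline": 1065.0}
wine_b = {"alcohol": 12.4, "hue": 0.70, "proline": 520.0}

shares = distance_shares(wine_a, wine_b)
for field, share in shares.items():
    print(f"{field}: {share:.4%}")
```

With values on these scales, almost all of the distance comes from Proline, so any distance-based grouping would follow that field and barely notice the others.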

Before we normalised the data, the standard deviation and range for Proline were significantly larger than those of any other field.

After normalisation, all fields have a standard deviation of 1 and similar ranges.
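A minimal sketch of one normalisation that gives every field a standard deviation of 1 is the z-score: subtract the field's mean and divide by its standard deviation. Whether the Normalise block does exactly this is an assumption, as is the use of the population standard deviation here; the Proline values are invented.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up Proline values for illustration only.
proline = [520.0, 680.0, 1065.0, 1280.0]
normalised = zscore(proline)

print(pstdev(proline))      # large spread before normalisation
print(pstdev(normalised))   # approximately 1 after normalisation
```

Applying the same function to each field in turn puts them all on a comparable scale, so no single field dominates purely because of its units.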

When clustering after normalisation, the Proline field still has some influence, but it is significantly less pronounced.

To cluster on normalised data, we simply add the Normalise block between the data and the Clustering block.
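The same "normalise, then cluster" ordering can be sketched in plain Python. The demo does not state which algorithm the Clustering block uses, so this example assumes a basic k-means (Lloyd's algorithm) with z-score normalisation in front of it. The data is a toy set of (hue, proline) pairs invented so that the wines fall into two tight groups by hue while proline varies independently; `normalise` and `kmeans` are hypothetical helper names.

```python
from math import dist
from statistics import mean, pstdev


def normalise(rows):
    """Z-score every column of a list of equal-length tuples."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [
        tuple((v - mu) / sigma for v, (mu, sigma) in zip(row, stats))
        for row in rows
    ]


def kmeans(points, k, iterations=20):
    """Basic Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic start: take k points spaced evenly through the list.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


# Made-up (hue, proline) pairs: first three have low hue, last three high hue.
wines = [
    (0.60, 900.0), (0.62, 1100.0), (0.58, 1000.0),
    (1.40, 910.0), (1.42, 1090.0), (1.38, 1000.0),
]

raw_labels = kmeans(wines, k=2)
norm_labels = kmeans(normalise(wines), k=2)
print(raw_labels)   # grouping driven by proline
print(norm_labels)  # grouping driven by hue
```

On these toy values the raw run separates the two lowest-proline wines from the rest and ignores hue, while the normalised run recovers the two hue groups. The only difference between the two calls is the `normalise` step placed in front of the clustering.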

This is much better: the groupings have more variation across all fields, and there is no apparent bias in the grouping caused by differences in variance between fields. Exploring the data with the 3D scatter plot, it is possible to identify clear separation of the discovered groupings, telling us that these are good groupings that have found some structure in the data.

In this demo we have shown how to cluster a dataset while avoiding the common pitfall of clustering unnormalised data when it should be normalised. For a deeper dive, analysis of the clustering can be seen in the "Clustering - Analysis" demo.
