Clustering

Modified on Mon, 24 Feb, 2020 at 5:26 PM

In Brief

Function: Attempts to find groups in the provided data and assigns a group to each record.

Typical Use Case: A client obtains a database in which categories for the records are missing. The categories can be restored by applying clustering to all, or a selection, of the raw data. The clustering block will automatically determine how many clusters the data fall into and assign each data point to one of the clusters.

Case Study

Here we have a dataset containing bond prices. We are interested in discovering clusters for aggregations in market yield and model yield. For this purpose, we set up a cluster block:

Workflow

A typical workflow

Options

Options for clustering

All numeric fields: If this checkbox is selected, all numeric fields are automatically used for the cluster analysis. Once the checkbox is cleared, a list of selectable fields is shown.
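The effect of the "All numeric fields" checkbox can be pictured with a small sketch. It is only an illustration: the helper name numeric_fields, the record layout, and the field names are assumptions made for this example, not the block's actual implementation.

```python
from numbers import Real


def numeric_fields(records):
    """Return the names of fields whose non-missing values are all numeric.

    Hypothetical helper: records is a list of dicts sharing the same keys,
    with None marking a missing value.
    """
    selected = []
    for name in records[0]:
        values = [r[name] for r in records if r[name] is not None]
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if values and all(
            isinstance(v, Real) and not isinstance(v, bool) for v in values
        ):
            selected.append(name)
    return selected


# Illustrative records only; the field names are assumptions.
records = [
    {"bond": "A", "market_yield": 1.2, "model_yield": 1.1},
    {"bond": "B", "market_yield": 3.4, "model_yield": None},
]
selected = numeric_fields(records)
print(selected)
```

With the checkbox cleared, the user would instead pick a subset of such fields by hand.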

Number of Clusters: The block can automatically determine the best number of clusters. If changed to "Manual", an input box appears in which a specific number of clusters can be enforced.

Missing value behaviour: Each input data point will usually be part of the output. If the data set contains points with missing values, these points cannot be used to determine the clusters. "Ignore missing values" will ensure that these points are part of the output, but they will come without a cluster assignment. "Remove missing values" will remove these points from the output.
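The difference between the two missing value settings can be sketched as follows. This is a minimal illustration under assumptions: the function name cluster_with_missing, the policy strings "ignore" and "remove", and the stand-in clustering function are hypothetical and do not describe the block's internals.

```python
def cluster_with_missing(points, cluster_fn, policy="ignore"):
    """Cluster only the complete points and apply a missing value policy.

    points: list of tuples, with None marking a missing value.
    cluster_fn: takes the complete points and returns one label per point.
    policy: "ignore" keeps incomplete points with label None,
            "remove" drops them from the output.
    """
    if policy not in ("ignore", "remove"):
        raise ValueError("policy must be 'ignore' or 'remove'")
    complete_idx = [i for i, p in enumerate(points) if None not in p]
    labels = cluster_fn([points[i] for i in complete_idx])
    label_by_idx = dict(zip(complete_idx, labels))
    if policy == "remove":
        return [(points[i], label_by_idx[i]) for i in complete_idx]
    # "ignore": every input point is returned, incomplete ones unlabelled.
    return [(p, label_by_idx.get(i)) for i, p in enumerate(points)]


def toy_cluster(pts):
    # Stand-in for a real clustering step: split on the first coordinate.
    return [0 if p[0] < 2.0 else 1 for p in pts]


points = [(1.0, 1.1), (None, 3.0), (3.5, 3.4)]
kept = cluster_with_missing(points, toy_cluster, policy="ignore")
removed = cluster_with_missing(points, toy_cluster, policy="remove")
```

Here kept still contains the incomplete point, paired with None instead of a cluster, while removed contains only the two complete points.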

Output

The cluster block automatically determines that the best cluster separation is achieved with 3 clusters. However, we find 4 clusters visually more appealing, so we switch the number of clusters to "Manual" and force the block to use 4. As there are no missing values, we do not need to set the final option (missing value behaviour). The result shows 4 clumps of yield points that cluster together:

The final results plotting the Model Yield
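The difference between an automatically chosen and a manually forced number of clusters can be illustrated with a small sketch. The article does not say which algorithm or selection criterion the block uses, so everything here is an assumption: plain k-means with a deterministic farthest-point start, the mean silhouette score as the selection criterion, and made-up (market yield, model yield) pairs as data.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain k-means (Lloyd's algorithm) with a deterministic start."""
    # Farthest-point start: begin with the first point, then repeatedly
    # add the point that is farthest from the centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centres.append(centres[j])  # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means better separation."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, k_max=6):
    """Pick the k between 2 and k_max with the highest silhouette score."""
    candidates = range(2, min(k_max, len(points) - 1) + 1)
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))


# Made-up (market yield, model yield) pairs forming three loose groups.
points = [
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
    (3.0, 3.1), (3.2, 2.9), (2.9, 3.0),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]
auto_k = choose_k(points)           # "automatic" number of clusters
forced_labels = kmeans(points, 4)   # "manual" setting forced to 4 clusters
```

Forcing a larger number than the automatic choice splits one of the natural groups further, which may look better on a chart even though the separation score is lower.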