We are given a dataset of customers by a telecoms company. The company wants to identify the customers most likely to leave their contract before they do so, so that it can try to retain them.
Below we see the distribution of values in each field when the data is split by the value we wish to predict (Churn).
Some of these fields look very similar on both sides; others look very different. This isn't a bad way to manually gauge the impact of each field on the Churn value, but it's certainly not the best. By using the Variable Importance block we can determine how informative each field is in predicting the Churn field's value.
The output of the Variable Importance block gives us a score between 0 and 10 for each field, telling us how much that field contributes to predicting Churn. Below is a series of charts showing the influence of the values of some of those fields when predicting Churn.
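To make the idea of a 0 to 10 importance score concrete, here is a minimal sketch in Python. The method the Variable Importance block actually uses is not described here, so this swaps in a common stand-in: the mutual information between each field and Churn, normalised by the entropy of Churn. The function names, the `Contract` and `Gender` field names, and the sample rows are all made up for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    counts = Counter(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mutual_information(feature, target):
    """I(feature; target) = H(feature) + H(target) - H(feature, target)."""
    joint = list(zip(feature, target))
    return entropy(feature) + entropy(target) - entropy(joint)


def importance_scores(rows, target_field):
    """Score each field from 0 to 10 by the share of target entropy it explains.

    This is an assumed stand-in, not the Variable Importance block's algorithm.
    """
    target = [row[target_field] for row in rows]
    target_entropy = entropy(target)
    scores = {}
    for field in rows[0]:
        if field == target_field:
            continue
        feature = [row[field] for row in rows]
        mi = mutual_information(feature, target)
        # Normalise by the target entropy so a perfect predictor scores 10.
        scores[field] = 10 * mi / target_entropy if target_entropy else 0.0
    return scores


# Made-up rows for illustration only, not taken from the telecoms dataset.
rows = [
    {"Contract": "Month-to-month", "Gender": "F", "Churn": "Yes"},
    {"Contract": "Month-to-month", "Gender": "M", "Churn": "Yes"},
    {"Contract": "Two year", "Gender": "F", "Churn": "No"},
    {"Contract": "Two year", "Gender": "M", "Churn": "No"},
]

scores = importance_scores(rows, "Churn")
for field, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{field}: {score:.1f}")
```

In this toy sample the contract type determines Churn exactly, so it receives the top score, while the gender field is independent of Churn and scores zero. Real data would fall somewhere in between.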
The sunburst charts are of particular interest. We can see that there are some key indicators of Churn:
* Paying month-to-month
* Paying by Electronic check
* Having Fiber optic
* Having been a customer for 0 to 12 months
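Indicators like these can also be checked numerically by comparing the churn rate for each value of a field. The sketch below is one way to do that in plain Python; the field names (`Contract`, `Tenure`, `TenureBand`), the helper functions and the sample rows are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def churn_rate_by_value(rows, field, target_field="Churn", positive="Yes"):
    """Return {value: fraction of rows with that value that churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        value = row[field]
        totals[value] += 1
        if row[target_field] == positive:
            churned[value] += 1
    return {value: churned[value] / totals[value] for value in totals}


def tenure_band(months, width=12):
    """Bucket tenure in months into bands such as '0-12' and '13-24'."""
    if months <= width:
        return f"0-{width}"
    lower = ((months - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


# Made-up rows for illustration only.
rows = [
    {"Contract": "Month-to-month", "Tenure": 3, "Churn": "Yes"},
    {"Contract": "Month-to-month", "Tenure": 8, "Churn": "Yes"},
    {"Contract": "Month-to-month", "Tenure": 30, "Churn": "No"},
    {"Contract": "Two year", "Tenure": 40, "Churn": "No"},
]
for row in rows:
    row["TenureBand"] = tenure_band(row["Tenure"])

contract_rates = churn_rate_by_value(rows, "Contract")
tenure_rates = churn_rate_by_value(rows, "TenureBand")
print(contract_rates)
print(tenure_rates)
```

A value whose churn rate sits well above the overall rate is a candidate indicator, which is the same reading the sunburst charts give visually.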
Using the Prediction block with all fields achieves an accuracy of around 77%. The block will attempt to choose the most important fields, and in this case it has done its job well. If there were a significantly larger number of fields (and if many of them had no predictive power) then it might have had a harder time.
Above we can see that using the Prediction block with all fields or just the fields identified as being useful by the Variable Importance block produces a very similar prediction, with the same 77% accuracy.
The prediction using just the 'useful' fields is more confident, with values closer to 0% and 100%: the prediction using all fields maxes out at ~87%, while the prediction using just the best fields maxes out at ~92%. From this we can say the other fields are unnecessary in predicting the Churn value, and that including them adds unnecessary noise to the prediction.
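The two measures being compared here, accuracy and how confident the predictions get, can be sketched as follows. This assumes the Prediction block outputs a churn probability per customer and that a customer is predicted to churn when that probability is at least 50%; both are assumptions, as are the function names and the probability lists, which are invented and are not the results reported above.

```python
def accuracy(probabilities, actual, threshold=0.5):
    """Fraction of rows where the thresholded probability matches the label."""
    predicted = [p >= threshold for p in probabilities]
    correct = sum(pred == act for pred, act in zip(predicted, actual))
    return correct / len(actual)


def peak_confidence(probabilities):
    """Largest confidence in either class: max of p and 1 - p over all rows."""
    return max(max(p, 1 - p) for p in probabilities)


# Invented values for illustration only: True means the customer churned.
actual = [True, True, False, False]
all_fields = [0.80, 0.40, 0.30, 0.20]      # hypothetical output, all fields
useful_fields = [0.90, 0.45, 0.15, 0.10]   # hypothetical output, fewer fields

for name, probs in [("all", all_fields), ("useful", useful_fields)]:
    print(name, accuracy(probs, actual), peak_confidence(probs))
```

In this invented example both sets of predictions make the same right and wrong calls, so the accuracy is identical, but the second set sits closer to 0% and 100%. That is the pattern described above: equal accuracy, higher confidence.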
This means that in order to make accurate predictions we no longer need to collect the other fields. In some cases this could be a significant cost saving, and it will certainly make working with the data much easier in the future.
The Prediction block can be used without first using the Variable Importance block; as can be seen above, the results will likely be very similar. However, removing the 'unnecessary' fields from the Prediction block input will make it run faster and use fewer system resources. The Prediction block is versatile, capable of handling any number of classification tasks without the need for any data preparation or cleaning before use.