Prediction

Modified on Wed, 26 Feb, 2020 at 11:08 AM

In Brief

Function: Predicts the category that unseen data falls into based on past data.

Typical Use Case: The user has some data that has already been categorised and some that has not, and wishes to categorise the latter using inferences drawn from the former.

Case Study

A bank is contacting customers, hoping to subscribe them to a bank term deposit. They have already contacted some of their customers but are unhappy with the 11% success rate they are seeing. Each of these calls is time-consuming, so they want to minimise the number of calls they need to make and maximise their success rate at the same time. They have a variety of data about each of their customers and want to use this to achieve their goals.

To do this they need the information of the people they have already contacted as well as the information of the people they are still to contact.

Workflow

We create a workflow by choosing a data set consisting of two parts: data which we will use for training and data which we will use for prediction. The training file is connected to the Prediction block via the Past Data input connector, and the prediction file via the New Data input connector.

Prediction block workflow

The block is connected to old data, on which a model is trained, and new data, on which a category of interest is predicted.

Input data

As already mentioned, the input data comes in two parts: training data and prediction data. The training data consists of a table in which each row represents an existing customer of the bank. Each field represents some knowledge about the customer, e.g. their job title, marital status, etc.

Input data for Prediction block consisting of multiple categorical fields per customer

The data also contains a field called “Purchased”, which specifies whether a particular customer has purchased a bank term deposit. The prediction data consists of a table with precisely the same categorical information about each customer. The important difference is that the “Purchased” field is missing. It is the objective of the Prediction block to determine the value of this field for the prediction data.
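The relationship between the two tables can be sketched in a few lines of Python. This is only an illustration of the data shape, not how the block works internally: the rows, the field names "Job" and "Marital", and the helper check_inputs are all hypothetical.

```python
# Hypothetical rows; only the "Purchased" field name comes from the article.
past_data = [
    {"Job": "technician", "Marital": "married", "Purchased": "no"},
    {"Job": "student", "Marital": "single", "Purchased": "yes"},
]
new_data = [
    {"Job": "student", "Marital": "married"},  # no "Purchased" field yet
]

def check_inputs(past_rows, new_rows, target):
    """Return the shared predictor field names, or raise if the tables disagree."""
    if target not in past_rows[0]:
        raise ValueError(f"past data is missing the target field {target!r}")
    past_fields = set(past_rows[0]) - {target}
    new_fields = set(new_rows[0]) - {target}
    if past_fields != new_fields:
        raise ValueError(f"field mismatch: {sorted(past_fields ^ new_fields)}")
    return sorted(past_fields)

predictor_fields = check_inputs(past_data, new_data, "Purchased")
print(predictor_fields)  # prints ['Job', 'Marital']
```

The check simply confirms that both tables describe customers with the same fields, apart from the target field that the prediction is meant to fill in.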

Options

Once the files have been connected to the Prediction block, we can configure the block:

Options for Prediction block

Target Field: Choose one of the fields in your training data (past data), which you want to determine for the prediction data (new data).

Fields to Use: Select the fields that should become part of the prediction model. Selecting more fields generally improves prediction accuracy on the training data set, but may reduce accuracy on the prediction data set due to overfitting.

Provide a field for each target field value containing the likelihood of each record: Adds a field for each value in the target field containing the likelihood of each record belonging to that class.

Maximum number of categories per field: The maximum number of distinct values to use in a text field. This prevents data which is unique to each record, such as phone numbers or addresses, from being used in the prediction.
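The idea behind this limit can be illustrated with a small Python sketch. It assumes that a field with too many distinct values is dropped entirely; the article does not say whether the block drops such a field or handles it some other way, so treat this as one possible reading. The function usable_fields and the sample rows are hypothetical.

```python
def usable_fields(rows, fields, max_categories):
    """Keep fields whose number of distinct values is at most max_categories.

    Assumption: over-limit fields are excluded outright.
    """
    kept = []
    for field in fields:
        distinct = {row[field] for row in rows}
        if len(distinct) <= max_categories:
            kept.append(field)
    return kept

rows = [
    {"Job": "technician", "Phone": "555-0101"},
    {"Job": "student", "Phone": "555-0102"},
    {"Job": "technician", "Phone": "555-0103"},
    {"Job": "student", "Phone": "555-0104"},
]

# "Job" has 2 distinct values, "Phone" has 4 (one per record).
print(usable_fields(rows, ["Job", "Phone"], max_categories=3))  # prints ['Job']
```

A field such as "Phone" carries a different value on every row, so it cannot help group customers and is filtered out by the limit.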

Maximum number of target values to predict: This is the maximum number of unique values to predict in the target field. This block is designed for … As this number increases, so too does execution time.

Use synthetic fields: This allows the block to internally create new fields that are combinations of the provided fields, which may improve the accuracy of the model.
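As a rough illustration of what a combined field might look like, the Python sketch below joins the values of each pair of fields into a new field. How the block actually builds its synthetic fields is not documented here, so the pairwise strategy, the function add_pairwise_fields, and the "A+B" naming are assumptions made for the example.

```python
from itertools import combinations

def add_pairwise_fields(rows, fields):
    """Return copies of the rows with one extra field per pair of input fields."""
    out = []
    for row in rows:
        new_row = dict(row)  # leave the original row untouched
        for a, b in combinations(fields, 2):
            # Hypothetical naming scheme: "Job+Marital" holds "student|single"
            new_row[f"{a}+{b}"] = f"{row[a]}|{row[b]}"
        out.append(new_row)
    return out

rows = [{"Job": "student", "Marital": "single"}]
combined = add_pairwise_fields(rows, ["Job", "Marital"])
print(combined[0]["Job+Marital"])  # prints student|single
```

A combined field lets a model pick up on cases where two values matter together, for example single students behaving differently from students or single customers in general.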

Output

New output fields for Prediction block

The output consists of the data that you connected to the New Data input connector, augmented by additional fields which represent the prediction, the confidence in that prediction, and, optionally, the likelihoods of the record belonging to the other possible categories.
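To make the shape of this output concrete, here is a minimal Python sketch that trains on past data and augments a new record with a prediction, a confidence, and optional per-category likelihoods. The article does not state which algorithm the block uses; a categorical naive Bayes classifier with add-one smoothing is swapped in purely as a stand-in. The sample rows and the output field names ("Predicted Purchased", "Confidence", "Likelihood of ...") are hypothetical.

```python
from collections import Counter, defaultdict

def train(past_rows, fields, target):
    """Count category frequencies and per-category field value frequencies."""
    class_counts = Counter(row[target] for row in past_rows)
    value_counts = defaultdict(Counter)  # (field, category) -> value counts
    field_values = defaultdict(set)      # field -> distinct values seen
    for row in past_rows:
        for f in fields:
            value_counts[(f, row[target])][row[f]] += 1
            field_values[f].add(row[f])
    return class_counts, value_counts, field_values

def predict(model, row, fields, target, include_likelihoods=True):
    """Return a copy of row with prediction, confidence and likelihood fields."""
    class_counts, value_counts, field_values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n in class_counts.items():
        score = n / total  # prior share of this category in the past data
        for f in fields:
            # Add-one smoothing so an unseen value does not zero the score
            seen = value_counts[(f, c)][row[f]]
            score *= (seen + 1) / (n + len(field_values[f]))
        scores[c] = score
    norm = sum(scores.values())
    likelihoods = {c: s / norm for c, s in scores.items()}
    best = max(likelihoods, key=likelihoods.get)
    out = dict(row)
    out[f"Predicted {target}"] = best
    out["Confidence"] = likelihoods[best]
    if include_likelihoods:
        for c, p in likelihoods.items():
            out[f"Likelihood of {c}"] = p
    return out

past = [
    {"Job": "student", "Marital": "single", "Purchased": "yes"},
    {"Job": "student", "Marital": "married", "Purchased": "yes"},
    {"Job": "technician", "Marital": "married", "Purchased": "no"},
    {"Job": "technician", "Marital": "single", "Purchased": "no"},
    {"Job": "technician", "Marital": "married", "Purchased": "no"},
]
fields = ["Job", "Marital"]
model = train(past, fields, "Purchased")
result = predict(model, {"Job": "student", "Marital": "single"}, fields, "Purchased")
print(result["Predicted Purchased"])  # prints yes
```

In this sketch the confidence is simply the likelihood of the winning category, and the likelihoods across all categories sum to one. For the bank in the case study, sorting the new customers by their likelihood of "yes" would be one way to decide whom to call first.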