Getting started: Custom scripts in Omniscope EVO

Modified on Mon, 23 Mar, 2020 at 6:23 PM

Omniscope Evo is shipped with a wide range of blocks to perform ETL tasks such as data loading, preparation and analysis with ease. Sometimes, however, some specific functionality or analysis is missing from the toolset. To fill that gap in your workflow, Omniscope makes it easy to include custom scripts written in R or Python using the Custom Block. This article is a tutorial to give you a quick start on how to do that.

The path forward

The goal of this tutorial is to show you how to create a custom block from scratch. As an example, we're going to create a prediction block that uses SVMs (Support Vector Machines). See https://en.wikipedia.org/wiki/Support-vector_machine if you're interested in what they are; you don't need to know the theory to follow this tutorial.

In a nutshell, an SVM is capable of predicting categorical data. It works with datasets in which each record represents a data point. Each data point contains a set of numerical attributes (like the horsepower or fuel usage of a car) and a categorical attribute (like the type of car, e.g. truck, race-car, etc.). The SVM is trained using a data set that contains data points including both the attributes and the category. After training, it is possible to apply the SVM to a data set which doesn't contain the category. The SVM will then predict the most likely category and usually produce a likelihood specifying how confident it is in its prediction.
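As a rough illustration of this train-then-predict cycle, here is a minimal sketch using scikit-learn outside Omniscope. The car figures are made up for this example, and a linear kernel is used only to keep the toy case simple; it is not the configuration used later in this tutorial.

```python
from sklearn import svm

# Made-up training records: [horsepower, fuel usage] plus a known category.
training_attributes = [[400, 30], [420, 32], [600, 15], [650, 14]]
training_categories = ["truck", "truck", "race-car", "race-car"]

# Train the SVM on records whose category is known.
model = svm.SVC(kernel="linear").fit(training_attributes, training_categories)

# Apply the trained model to records that have no category yet.
predictions = [str(p) for p in model.predict([[410, 31], [620, 16]])]
print(predictions)
```

Each new record is assigned the category of the training records it most resembles.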

In this example, we will create an SVM block with two inputs and one output. Data from the first input will be used to train the model. Afterwards the model will be applied to data from the second input, i.e. the SVM will predict the missing category values in the second data set and then send the completed data to the block output.

Setup

Before we can jump into adding this new functionality, we have to make sure the computer or server is set up to run Custom Blocks. To check what is needed, please have a look at the article Setting up the Custom Block. Omniscope executes scripts in a particular fashion which should be compatible with most setups. If, however, you experience problems at any point, check the article about the Custom Block Environment to see how your setup conflicts with the way Omniscope works.

Creating a new Custom Block

We start by adding a Custom Block to the workflow. The Block can be found in the block picker under the section "Custom Scripts".

You'll notice that there are 3 blocks at your disposal. In this tutorial we're only concerned with the Custom Block. The Python and R blocks are legacy blocks which will soon be deprecated and will disappear from the palette. The Custom Block is marked experimental because it is under constant development, which brings more features and makes it easier and more fun to work with.

Start by dragging the Custom Block onto the workflow. It should appear like this:

In anticipation of using the block to predict data, we add a Demo Data block to the workflow in which we select the Banking data set:

As already explained, the goal is to train an SVM model and then make predictions on a different data set. We therefore need some training data and some data to predict. We create them by splitting the Banking data into a 20% training set and an 80% set of data to predict, and then removing the field to predict from the latter.

Every row in the Banking data set corresponds to a customer and includes demographic information. In addition, it has a field "Purchased" which contains one of two values: yes or no. It indicates whether the customer has purchased a specific banking product in the past. For block development and testing purposes, we'll use "Purchased" as the field we want to predict, since we're interested in predicting whether another set of customers with comparable demographics would be likely to make a purchase as well. As such, it's the field we're going to remove from the data to predict.

In order to do so, we create the following workflow:

The split data block is configured with the 20% / 80% ratio:

And the Field filter is configured to reject the "Purchased" field:
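In pandas terms, the two blocks leave us with data shaped roughly as in the sketch below. This only illustrates the resulting data sets and is not how Omniscope implements the blocks; the "Age" and "Balance" columns and all values are made up, and a simple positional split is an assumption.

```python
import pandas as pd

# Made-up stand-in for the Banking demo data; only "Purchased" comes from the article.
banking = pd.DataFrame({
    "Age": [23, 35, 41, 52, 29, 64, 38, 47, 56, 31],
    "Balance": [1200, 5400, 800, 9100, 300, 15000, 2200, 7600, 4300, 950],
    "Purchased": ["no", "yes", "no", "yes", "no", "yes", "no", "yes", "yes", "no"],
})

# Split data block: the first 20% of the rows become the training data.
split_at = len(banking) * 20 // 100
training_data = banking.iloc[:split_at]

# Field filter block: reject "Purchased" from the remaining 80% of the rows.
data_to_predict = banking.iloc[split_at:].drop(columns=["Purchased"])

print(training_data.shape, data_to_predict.shape)
```

The training data keeps the "Purchased" field, while the data to predict no longer has it.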

If you now click on the Custom Block, you'll be greeted with a screen telling you that what is shown here can be edited in the "Design" tab, along with some tabs at the top: "Options", "Design", "Introduction", and "Setup".

The "Options" tab is where you currently are. It's empty, because you haven't yet defined any options for this block. This is what you'll do in the "Design" tab. Loosely speaking, the person developing the block and adding its functionality works in the "Design" tab, whereas the workflow designer, or the person using the block, configures the block in the "Options" tab.

The "Setup" and "Instructions" tabs are there to explain how to set up and use the block respectively, but if you have already read the Setting up the Custom Block article, you know how to do so.

Adding basics

We'll now go into the "Design" tab where we'll do most of our work. You should see a screen with a Project configuration, a section for Options and a default script that doesn't do much except to take input data and use it as output data.

In the Project configuration you have to choose whether to start a Python or an R project. For this tutorial, we're going with Python. If you have multiple Python versions installed on your computer or server, you can choose a specific version. The Dependencies input box is where you define which Python packages your project depends on. When you execute the block, Omniscope checks whether the dependencies listed here are already installed and, if not, installs them for you. Packages installed by Omniscope are placed in an internal folder so as not to interfere with your global Python installation. For more information on this, check out the article Custom Block Environment.
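If you want to check by hand whether a module is visible to a given Python installation, for example when diagnosing a setup problem, a small standard-library sketch like the following can help. It is not Omniscope's own mechanism, which this article does not describe, and the second module name is deliberately made up so that the lookup fails.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; the second name is a made-up module that should not exist.
status = {name: is_installed(name) for name in ["json", "surely_not_a_real_module_xyz"]}
print(status)
```

Run it with the same Python version you selected in the Project configuration to see what that interpreter can import.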

Default Script

This is the default script you'll always see when creating a new project:

from omniscope.api import OmniscopeApi
omniscope_api = OmniscopeApi()

# read the records associated to the first block input
input_data = omniscope_api.read_input_records(input_number=0)

# read the value of the option called "my_option"
# my_option = omniscope_api.get_option("my_option")

output_data = input_data

# write the output records to the first output
if output_data is not None:
    omniscope_api.write_output_records(output_data, output_number=0)
omniscope_api.close()

The first thing the script does is import the Omniscope API. For a description and reference on how to use it, please see the article Custom Block R/Python API Reference. The API is the only means of communicating with Omniscope. The script then reads all data from the first input (input number 0), assigns the input data to the output data, writes the output data to the first output (output number 0), and finally closes the connection, signalling that the script is done and all is well.
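Because the API is only available when the script runs inside Omniscope, it can be handy to try out script logic locally against a stand-in. The sketch below uses a hypothetical FakeOmniscopeApi class that is not part of Omniscope; it only mimics the four method names seen in the default script, and the real API may behave differently.

```python
import pandas as pd


class FakeOmniscopeApi:
    """Hypothetical local stand-in for testing; NOT part of Omniscope."""

    def __init__(self, inputs, options=None):
        self._inputs = inputs            # input number -> DataFrame
        self._options = options or {}    # option name -> value
        self.outputs = {}                # output number -> DataFrame
        self.closed = False

    def read_input_records(self, input_number=0):
        # Returns None when nothing is "connected" to that input.
        return self._inputs.get(input_number)

    def get_option(self, name):
        return self._options.get(name)

    def write_output_records(self, data, output_number=0):
        self.outputs[output_number] = data

    def close(self):
        self.closed = True


# The body of the default script, run against the stand-in.
omniscope_api = FakeOmniscopeApi({0: pd.DataFrame({"A": [1, 2, 3]})})
input_data = omniscope_api.read_input_records(input_number=0)
output_data = input_data
if output_data is not None:
    omniscope_api.write_output_records(output_data, output_number=0)
omniscope_api.close()
```

After the run, omniscope_api.outputs[0] holds whatever the script would have sent to the first output.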

For now, we'll remove everything we don't need to keep the script small.

from omniscope.api import OmniscopeApi
omniscope_api = OmniscopeApi()

input_data = omniscope_api.read_input_records(input_number=0)

omniscope_api.write_output_records(input_data, output_number=0)
omniscope_api.close()

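We said at the beginning that we'd be using two inputs: one for the training data and one for the data to predict. Let's add the second input and rename the input and output variables to something that makes more sense. Note that predicted_output_data is not defined yet; we'll create it in the "Adding functionality" section, so the script won't run successfully until then.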

from omniscope.api import OmniscopeApi
omniscope_api = OmniscopeApi()

training_data = omniscope_api.read_input_records(input_number=0)
data_to_predict = omniscope_api.read_input_records(input_number=1)

omniscope_api.write_output_records(predicted_output_data, output_number=0)
omniscope_api.close()

As you see, the data_to_predict variable now receives data from input number 1 (the second input).

Imports

The Python package sklearn (published on PyPI as scikit-learn) contains a module called svm which we'll use to create the predictions. We'll therefore add sklearn to the block's dependencies:

and add an import to the script. We're also going to import pandas, which is a package for data analysis. It provides a data structure called DataFrame, which Omniscope uses for the data inputs and outputs. In addition to pandas, we're going to import numpy, which is a package for scientific computing. It provides matrix data structures and operations to modify them. We'll need them to deal with the SVM. The two packages pandas and numpy do not need to be added to the Dependencies input, as they are already required by the Omniscope API and are therefore always installed.

from omniscope.api import OmniscopeApi
from sklearn import svm
import pandas as pd
import numpy as np

omniscope_api = OmniscopeApi()

training_data = omniscope_api.read_input_records(input_number=0)
data_to_predict = omniscope_api.read_input_records(input_number=1)

omniscope_api.write_output_records(predicted_output_data, output_number=0)
omniscope_api.close()
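To make the relationship between the two packages concrete, here is a small sketch of moving from a pandas DataFrame to a numpy matrix and back. The field names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A DataFrame like the ones the block receives: one row per record.
records = pd.DataFrame({"Age": [25, 40, 31], "Income": [30000, 52000, 41000]})

# Convert to a numeric matrix: rows are records, columns are fields.
matrix = np.array(records, dtype=float)

# A typical matrix operation: centre every column on its mean.
column_means = matrix.mean(axis=0)
centred = matrix - column_means

# Wrap the result in a DataFrame again, keeping the original field names.
centred_records = pd.DataFrame(centred, columns=records.columns)
print(centred_records)
```

The same np.array(..., dtype=float) conversion is used later in this tutorial before training the SVM.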

Options

We now need some more information to continue, namely which attribute fields we should use for training and prediction, and which categorical field we want to predict. This information shouldn't be hardcoded, since it depends on the data and the workflow in which the block is going to be used. Instead, we ask the workflow designer which fields to use by adding some block options. A complete guide covering all option types can be found in the article Custom Block Options. For our particular example, we'll add an option to select the fields to use, and an option for the field to predict. Let's start with the fields to use:

Go to the Options section and click on the big plus button. You should see the following new and empty option configuration:

As Type, we need to choose Fields, since we want the user to be able to specify multiple fields to use. The option Name is particularly important, as it must be unique and is the identifier we can use in the script to read its values. We'll name it fieldsToUse. The Title can be anything, but should reflect the nature of the option and our intention. It'll simply be "Fields to use". We should now see the following:

When we click on the 3-dots menu next to "Title", we can change some additional configurations. Particularly, we'll make the option mandatory, so that at least one field has to be configured, and we'll make it slightly larger so that it's more comfortable to add and see multiple fields.

Now we repeat the process for the categorical field. We click on the big plus, choose "Field" as the option "Type", enter the name "fieldToPredict" and give it the title "Field to predict". In addition, we also make it mandatory.

A quick look into the "Options" tab now shows the newly added options:

In anticipation of doing the predictions, we'll quickly configure the block. We'll use some of the numerical fields as "Fields to use", and we'll use the "Purchased" field as the "Field to predict".

Of course we want to access these option values from within the script. We can do so using the Omniscope API: The fields to use will be stored in the variable fields_to_use and the field to predict in field_to_predict.

from omniscope.api import OmniscopeApi
from sklearn import svm
import pandas as pd
import numpy as np

omniscope_api = OmniscopeApi()

training_data = omniscope_api.read_input_records(input_number=0)
data_to_predict = omniscope_api.read_input_records(input_number=1)

fields_to_use = omniscope_api.get_option("fieldsToUse")
field_to_predict = omniscope_api.get_option("fieldToPredict")

omniscope_api.write_output_records(predicted_output_data, output_number=0)
omniscope_api.close()
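The rest of the script indexes the input data with these two values. The sketch below shows the difference, assuming that a "Fields" option yields a list of field names and a "Field" option yields a single name; that assumption matches how the values are used in this tutorial, but check the Custom Block Options article for the exact return types. The field names and data are made up.

```python
import pandas as pd

records = pd.DataFrame({
    "Age": [25, 40],
    "Income": [30000, 52000],
    "Purchased": ["no", "yes"],
})

# Assumed shapes of the option values (hard-coded here instead of get_option()).
fields_to_use = ["Age", "Income"]   # "Fields" option: a list of field names
field_to_predict = "Purchased"      # "Field" option: a single field name

# Indexing with a list gives a DataFrame with just those columns...
attributes = records[fields_to_use]
# ...while indexing with a single name gives one column as a Series.
category = records[field_to_predict]

print(type(attributes).__name__, type(category).__name__)
```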

Adding functionality

Now we have everything set up to add functionality to the block. We need to subset the training data and data to predict to only contain the fields we want to use. Also, we need to separate the field to predict from the training data. Let's do this now:

from omniscope.api import OmniscopeApi
from sklearn import svm
import pandas as pd
import numpy as np

omniscope_api = OmniscopeApi()

training_data = omniscope_api.read_input_records(input_number=0)
data_to_predict = omniscope_api.read_input_records(input_number=1)

fields_to_use = omniscope_api.get_option("fieldsToUse")
field_to_predict = omniscope_api.get_option("fieldToPredict")

y = training_data[field_to_predict]
x = training_data[fields_to_use]
x = np.array(x, dtype=float)
svm_model = svm.SVC(kernel='rbf', C=1, gamma="auto").fit(x, np.ravel(y))

omniscope_api.write_output_records(predicted_output_data, output_number=0)
omniscope_api.close()

This creates an SVM with an RBF (Radial Basis Function) kernel. Now it's time to predict the category for the data from the second input and send the result out for further processing or visualisation within Omniscope.

from omniscope.api import OmniscopeApi
from sklearn import svm
import pandas as pd
import numpy as np

omniscope_api = OmniscopeApi()

training_data = omniscope_api.read_input_records(input_number=0)
data_to_predict = omniscope_api.read_input_records(input_number=1)

fields_to_use = omniscope_api.get_option("fieldsToUse")
field_to_predict = omniscope_api.get_option("fieldToPredict")

y = training_data[field_to_predict]
x = training_data[fields_to_use]
x = np.array(x, dtype=float)
svm_model = svm.SVC(kernel='rbf', C=1, gamma="auto").fit(x, np.ravel(y))

x_new = np.array(data_to_predict[fields_to_use], dtype=float)

predicted_output_data = pd.DataFrame(svm_model.predict(x_new), columns=['Prediction'])

omniscope_api.write_output_records(predicted_output_data, output_number=0)
omniscope_api.close()
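The script above sends out a single "Prediction" column. If you would rather output the completed records, i.e. the data to predict with the prediction attached, one option is pandas' assign(). The following is a sketch that runs outside Omniscope: the field names "YearsAsCustomer" and "ProductsHeld" and all values are made up, and the option values are hard-coded instead of being read with get_option().

```python
import numpy as np
import pandas as pd
from sklearn import svm

# Made-up data standing in for the two block inputs.
training_data = pd.DataFrame({
    "YearsAsCustomer": [1, 2, 1, 2, 8, 9, 8, 9],
    "ProductsHeld": [1, 1, 2, 2, 5, 5, 6, 6],
    "Purchased": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
})
# A non-default index, e.g. what pandas keeps after slicing a larger data set.
data_to_predict = pd.DataFrame(
    {"YearsAsCustomer": [1.5, 8.5], "ProductsHeld": [1.5, 5.5]},
    index=[8, 9],
)

# Values the block would normally read from its options.
fields_to_use = ["YearsAsCustomer", "ProductsHeld"]
field_to_predict = "Purchased"

# Train exactly as in the tutorial script.
x = np.array(training_data[fields_to_use], dtype=float)
y = np.ravel(training_data[field_to_predict])
svm_model = svm.SVC(kernel="rbf", C=1, gamma="auto").fit(x, y)

x_new = np.array(data_to_predict[fields_to_use], dtype=float)

# assign() attaches the predictions by position, so the index of
# data_to_predict does not have to start at 0.
completed_data = data_to_predict.assign(Prediction=svm_model.predict(x_new))
print(completed_data)
```

Inside the block, completed_data would take the place of predicted_output_data in the call to write_output_records.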

Sanity checks and early script abortion

This is already a fully functional block. However, much can still go wrong because the block doesn't check whether the configuration or data is sound. For example, what if the second input with the data to predict is not connected? It's a good idea to add some sanity checks so that the script doesn't simply crash, but instead exits in an orderly way with a clear warning that tells the user what to do.

As an example, we'll add checks to see whether data is connected and abort if it isn't.

from omniscope.api import OmniscopeApi
from sklearn import svm
import pandas as pd
import numpy as np

omniscope_api = OmniscopeApi()

training_data = omniscope_api.read_input_records(input_number=0)
data_to_predict = omniscope_api.read_input_records(input_number=1)

if training_data is None:
    omniscope_api.abort("No training data, connect a block to input 1")

if data_to_predict is None:
    omniscope_api.abort("No data to predict, connect a block to input 2")

fields_to_use = omniscope_api.get_option("fieldsToUse")
field_to_predict = omniscope_api.get_option("fieldToPredict")

y = training_data[field_to_predict]
x = training_data[fields_to_use]
x = np.array(x, dtype=float)
svm_model = svm.SVC(kernel='rbf', C=1, gamma="auto").fit(x, np.ravel(y))

x_new = np.array(data_to_predict[fields_to_use], dtype=float)

predicted_output_data = pd.DataFrame(svm_model.predict(x_new), columns=['Prediction'])

omniscope_api.write_output_records(predicted_output_data, output_number=0)
omniscope_api.close()
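Other checks follow the same pattern. For instance, you may want to verify that the configured fields actually exist in the incoming data before training. The helper below is a hypothetical sketch with made-up data; in the block, the resulting message would be passed to omniscope_api.abort in the same way as above.

```python
import pandas as pd


def find_missing_fields(data, required_fields):
    """Return the required field names that are not columns of data."""
    return [field for field in required_fields if field not in data.columns]


# Made-up training data that lacks the configured "Income" field.
training_data = pd.DataFrame({"Age": [30, 45], "Purchased": ["yes", "no"]})
fields_to_use = ["Age", "Income"]
field_to_predict = "Purchased"

missing = find_missing_fields(training_data, fields_to_use + [field_to_predict])

message = None
if missing:
    # In the block: omniscope_api.abort(message)
    message = "Training data is missing fields: " + ", ".join(missing)
print(message)
```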

Fine-tuning

Lastly, the block can be fine-tuned. At the bottom of the "Design" tab is a section called "Advanced" which looks like this:

The Introductory text can be changed and will be displayed in the options tab. We're going to add some text to introduce the concept of Support Vector Machines, and we also check "Show partitioning" to allow the user of the block to partition the incoming data and create different SVM models for each partition:

A glance back to the "Options" tab shows how this has changed the block options:
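To see why separate models per partition can be useful, consider the sketch below, which trains one SVM per value of a partition field using a pandas groupby. It only illustrates the idea and runs outside Omniscope; how Omniscope applies the script to each partition is not covered in this article, and the "Region" field and all values are made up.

```python
import numpy as np
import pandas as pd
from sklearn import svm

# Made-up data in which the two regions show opposite purchasing behaviour.
training_data = pd.DataFrame({
    "Region": ["North"] * 4 + ["South"] * 4,  # hypothetical partition field
    "YearsAsCustomer": [1, 2, 8, 9, 1, 2, 8, 9],
    "ProductsHeld": [1, 2, 5, 6, 1, 2, 5, 6],
    "Purchased": ["no", "no", "yes", "yes", "yes", "yes", "no", "no"],
})
fields_to_use = ["YearsAsCustomer", "ProductsHeld"]

# Train one model per partition.
models = {}
for region, partition in training_data.groupby("Region"):
    x = np.array(partition[fields_to_use], dtype=float)
    y = np.ravel(partition["Purchased"])
    models[region] = svm.SVC(kernel="rbf", C=1, gamma="auto").fit(x, y)

# The same new customer gets a different prediction in each region.
new_customer = np.array([[1.5, 1.5]])
predictions = {
    region: str(model.predict(new_customer)[0]) for region, model in models.items()
}
print(predictions)
```

A single model trained on both regions at once could not capture this difference, because identical attribute values carry opposite categories.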

Further Information

This marks the end of the tutorial. Download the iox file attached to this article, which contains the workflow and the SVM block we just built, and check out the blocks in the "Community" section of the block picker for more Custom Blocks.

For further information have a look at the following resources:

Attachments (1)

iox

Custom Block....iox
648 KB