Omniscope allows you not only to create data validation rules and checks, but also to automate the remedy processes in the same environment.
By maintaining and adjusting the workflow logic over time you will ensure your schema is capable of dealing with any problem, also that your data is transformed in a consistent and transparent way.
In this article we will look at:
- Data type and format violations
- Data value issues: duplications, missing data or data out of range
- Inconsistencies in capitalization or missing values
- Suspect values – visual, tabular or analytical approach
- Rule violations alerts
Every block in Omniscope data workflow will allow you to preview data (even before the whole dataset is loaded), sort it by clicking on the header and view the distribution in a preview chart for each individual field.
We wold recommend you to start your data work /discovery by visualising and exploring the dataset in its raw state, even before you start working on it, to diagnose issues quickly.
Plug your dataset into a Report block (pick the 'Data discovery' template) and use the combination of Fields view, Filter devices and granular data in a Table below, in combination with selection, move/keep commands to navigate quickly through the dataset and decide which fields need attention.
Data type checks
Field Organiser block will display all the fields in your dataset and allow you to rename them, change the data type / data format (e.g. change the date display format from "dd/MM/yyyy" to "M-d-yy"), also duplicate the fields, add new fields containing formulae or to delete the fields that are not relevant for your analysis.
For your data audit purposes it is helpful that all the changes are highlighted, so you can easily track data modifications and reverse them, including the field deletions.
Bulk field organiser block is handy for multiple field naming/renaming and changes to the data type and format, when you have a dataset with tens or hundreds of fields.
Ultimate fields diagnostics tool: Field metadata block will scan the fields and provide every vital statistic - see below.
Data schema and data range/value checks
Validate Data block will allow you to introduce multiple rules and evaluate the incoming data against a data schema, that checks both the data type and the values, making sure that % value is less than 101, for example. You can use a parameter in this box to remotely edit values for multiple fields and blocks.
Once configured, the schema will also cater for the field fluctuations - alert you to the missing fields, or flag unexpected fields found in your dataset.
Alerts and action : There are settings that will allow you to :
- set warnings (just a message, the data will flow through),
- create an error (if the data fails the rules it will not be allowed to pass this point)
- create an email to alert the relevant parties to take action.
Omniscope has several methods of helping you to deal with the data duplications:
De-duplicate block will remove duplicate records (e.g. multiple identical data rows that may have variations in some cells, but not all. The block allows you to specify which fields are relevant when flagging a record as duplicate.
If you wanted to identify and investigate multiple appearances of a value you could plug the data into a Frequency table block (under the Analytics) or the Aggregate block. In both cases you could combine multiple fields to define what constitutes a unique key for your dataset, then easily sort and channel those records where frequency > 1.
Normalise case block will let you pick the fields where you wish to automate the case tidying up. Omniscope will by default scan every dataset and alert you about the need to use the block (you can also choose to ignore the warning, as sometimes the variations in case are meaningful).
Data preview chart in every workflow block will give you a quick update of how many missing values are found in each field. Quick sort in a field header will list the records with missing values at the top, so you can inspect them.
In a situation where you have values such as "n/a", or code "999999", as well as blank cells, you could use a Search/ Replace block to enforce uniformity.
Suspect values - using analytics to catch them
Sometimes it can be difficult to spot an outlier just by sorting or filtering the data, where combination of values in several fields makes a record an outlier. You could automate the process by applying Anomaly Detection and Reduce Dimensions from our Analytics library, or by using your own R and Python code in the Custom analytics block.
Here is an article explaining the process on an example, using our own built-in Analytics library.
Where the data irregularities can be spotted by analysing correlation between the fields you could use the Covariance / Correlation block.
Note: Although Omniscope doesn't automatically generate data quality scores - you can easily define and add your own KPIs in a Field organiser block by adding formulae and formatting the result as % or decimal, such as:
We will look further at different methods for data editing in our next article, to go through the automated methods, as well as options to make manual corrections in the editable data table in the workflow or in the actual Report block.