Data Quality is a hot topic - what is it and how can your organisation measure it?
Not only that - finding a problem is just the first part of the job! How do you remedy multiple issues and do it in an automated way, so that anomalies are caught and fixed within your data prep process?
Omniscope allows you to create multiple data validation rules and to automate the transformation processes in the same environment. By maintaining and adjusting the workflow logic over time you will ensure the schema can deal with any problem, and that your data is transformed in a consistent and transparent way.
In this article we will look at:
- Data type and format violations
- Data value issues: duplications, missing or data out of range
- Inconsistencies in capitalisation or spelling
- Identifying anomalies – visual, mechanical or analytical approach
- Rule violations alerts
Every block in Omniscope workflow area will allow you to preview the data (even before the whole dataset is loaded), sort it by clicking on the column header and view the value distribution in a preview chart for each individual field.
This will help you to monitor effects of the transformations at every step of the process.
We wold recommend you to start your data modelling with visual discovery and to explore the dataset in its raw state, even before you start working on it.
Data profiler - visual data exploration for quick diagnostics
Plug your data source into a Report block (make sure to pick the 'Data discovery' report template) to instantly access the combination of macro and micro insight: Fields view data profiler complemented by filters and granular data table.
Let the exploration begin!
Zooming in and out of the dataset, focusing on a particular subset is easy by using filtering, selection and move/ keep buttons. Navigate quickly through the data and decide which fields and values require attention. The findings should give you an idea which operations are needed in the workflow schema, to deal with any issues identified through exploration.
Plug the data into this same Report after transformation to check if you're happy with the results!
Data type checks
The Field Organiser block will list all the fields in your dataset and allow you to rename them ( image above with the list of fields), change the data type / data format (e.g. change the date display format from "dd/MM/yyyy" to "M-d-yy"), also duplicate the fields, add new fields containing formulae, or to remove any columns that are not relevant for your analysis. If your data is consistent (always contains the same fields / same format) and it always requires the same intervention - this block will screen and remedy these issues automatically.
For your data audit purposes it is helpful that all the changes are highlighted, so you can easily track any schema modifications and reverse the changes, including the field deletions.
Bulk field organiser block is handy for multiple field naming/renaming and changes to the data type / format, when you have a dataset with tens or hundreds of fields e.g. change the format of all fields containing "date" in the field name from "dd-MM-yyyy" to "MM-dd-yy".
Another useful diagnostic tool: Field metadata block will scan all the fields and provide every vital statistic - see below. You could merge metadata from two data sources to compare them, before combining them.
Data schema and data range/value checks
Validate Data block is a data quality superstar - it will allow you to introduce multiple screening rules and evaluate the incoming data against a data schema, that checks :
- the field names, field count and the data type for each one,
- cell values (e.g. making sure that % value is less than 101)
The fastest way to set up this block is to connect it to a model dataset, with a correct data schema, and click on the button "Reset to input". Now this block is primed to receive a new batch of data and compare the new traffic against the set rules. The block will alert you if there are any missing fields, or it will flag unexpected fields or formats found in your incoming dataset.
Inside the same block you can set rules for the cell value validation and record count check e.g. after a join you will expect the number of mismatches to be zero. By adding a Validate data block you could be alerted to appearance of records that escaped your mapping (either because there is an error in the data, or because the mapping file needs an update).
Alerts and action - there are settings that will allow you to :
- set warnings (just a message, the data will flow through, and you will get a detailed diagnostics list),
- create an error - if the data fails the rules, it will not be allowed to pass this point
- create an email to alert the relevant parties to take action.
Validate data block is useful to catch inconsistencies in your data when some sources e.g. data edited and authored manually, has different issues from one batch to the next one.
Important - if you set the 'warning mode' in this block the Data tab will list all the issues found in the dataset, which is quite useful when there are multiple issues you need to deal with.
You can switch between the "Error" and "Warning" mode while you're working on the remedy, as the first mode will simply stop the data from passing through this block, while the second one will result in a detailed error report.
Omniscope has several methods of helping you to deal with data duplications.
De-duplicate block will identify and remove duplicate records (e.g. multiple data rows have the same CustomerID, but different name/surname spelling). The block allows you to specify which fields are relevant when flagging a record as a duplicate - it is not just the case of having multiple rows that contain identical content.
If you wish to look into the causes and investigate variations you could plug the data into a Frequency table block (from the Analytics section), or into the Aggregate block. In both cases you could combine multiple fields to define what constitutes a unique key for your dataset, then easily sort and filter on those records where frequency > 1.
In the Aggregate block - you could create a Split on the unique field (like Country ISO or CustomerID) and have on the Results side the fields with variations e.g. [Country name] / Unique list and Unique value count function, to identify which unique values are linked to multiple country or customer name spellings (see image below).
Normalise case block will provide consistency in use of upper/ lower case and automate text corrections. Omniscope will by default scan every dataset and alert you about the need to use this block (you can also choose to ignore the warning, as sometimes the variations are meaningful, or simply not relevant for the analysis).
The data preview chart in every workflow block will give you an update on how many missing values are found in each field. Quick sort (just click on the column header) will list the records with missing values at the top, so you can inspect them.
In a situation where you have values such as "n/a", some code like "999999", or blank cells, you could add a Search/ Replace block to enforce uniformity and populate the cells with another value.
In some datasets e.g. log files or digital media files, empty cell is not a sign of missing data - it is assumed that every empty cell has the same value as the last populated cell above. In this situation - add a Fill down block to fill the blanks in all, or just some of the fields.
Questionable values - using smart analytics to catch them
Sometimes it can be difficult to spot an outlier just by sorting, filtering or visualising the data. In some cases it is the combination of values in multiple fields that makes the record an outlier. You could automate the process by applying Anomaly Detection and Reduce Dimensions from our Analytics library, or by using your own R and Python code in the Custom block.
Here is an article explaining the process on an example.
Where the data irregularities can be spotted by analysing correlation between the fields you could use the Covariance / Correlation block.
Note: Although Omniscope doesn't automatically generate data quality scores - you could easily define your own KPIs in a Field organiser block, or in a Report block's Content view, by adding a formula and formatting the result as %, such as:
By building a workflow logic of your own, and applying the relevant methodology, you will ensure that only clean data, prepared in a consistent and transparent way, is reaching the management reports, so you can have full confidence that your decisions will be based on accurate data. Because all the steps are pre-defined and transformation is automated (especially if you're using the Scheduler app) the transition time from raw data to ready data is reduced to a few seconds and your team will have up-to-date information any time on any device.
Attached below is a demo file which combines many of the automated data cleaning methods mentioned in this article (IOZ attachment).
The answer will depend on the data origins and the nature of anomalies (are they consistent from one batch to the next, or they vary all the time?). In some cases a degree of manual intervention will be needed and Omniscope can manage this situation in two ways:
- data export of 'problematic data points', for external manual intervention
- allow viewing/ editing inside the project (Data table block in the workflow or Table view in the Report block).
The green block represents a file output, containing mismatching data that is exported, inspected and edited manually outside Omniscope. Edited data with corrections can be imported back and integrated into the transformation logic.
Different methods for data editing inside Omniscope can be found in this article on our Knowledge Base.