Data quality and profiling: evaluation and remedy in Omniscope

Modified on Fri, 20 Jun at 2:50 PM

Data quality is a pressing concern for organisations, but what exactly does it entail, and how can you measure it effectively? Furthermore, identifying data issues is just the beginning: the real challenge lies in automating their resolution within your data preparation processes.

Omniscope offers a comprehensive solution by enabling the creation of multiple data validation rules and automating transformation processes within a unified environment.

By maintaining and adjusting the data workflow logic over time, you ensure that the data quality model can both identify and remedy anomalies, and that your data is transformed in a consistent, transparent way.

In this article, we will look at:

Data type and format violations
Data value issues: duplications, missing values, or data that is out of range
Inconsistencies in capitalisation and spelling
Identifying anomalies – visual, mechanical, or analytical approaches
Rule violation alerts

Diagnostic phase

Every block in the Omniscope workflow area allows you to preview the data (even before the entire dataset is loaded), sort it by clicking on a column header, and view the value distribution in a preview chart for each individual field.

We would recommend starting your data modelling with visual discovery and exploring the dataset in its raw state before working on it.

Data profiler - visual data exploration for quick diagnostics

Connect your data source to a Report block and ensure you select the 'Data Discovery' report template. This gives you immediate access to both macro and micro insights via the Fields View data profiler, along with filters and a granular data table.

Let the exploration begin!

You can easily zoom in and out of the dataset or focus on a specific subset using filtering, selection, and the move/keep buttons. Navigate quickly through the data to identify which fields and values require attention. Your findings will help determine the transformations needed during the preparation phase to address any issues uncovered during exploration.

After applying transformations, plug the data into the same Report to verify whether the results meet your expectations.

Data type checks

The Field Organiser block lists all the fields in your dataset and allows you to perform a range of operations: rename fields (as shown in the image above), change data types or formats (e.g. switch the date display from dd/MM/yyyy to M-d-yy), duplicate fields, add new calculated fields using formulae, or remove columns that are not relevant to your analysis.

If your data is consistent—always containing the same fields and format—and requires the same intervention each time, this block can automatically detect and apply the required changes.

For auditing purposes, all modifications are highlighted, allowing you to easily track schema changes and revert any actions, including field deletions.

The Bulk Field Organiser block is particularly useful for renaming multiple fields and making bulk changes to data types or formats, especially when working with datasets containing tens or hundreds of fields. For example, you can change the format of all fields with 'date' in the name from dd-MM-yyyy to MM-dd-yy.

Another useful diagnostic tool is the Field Metadata block, which scans all fields and provides key statistics (see below). You can also merge metadata from two data sources to compare them before combining.

Data schema and data range/value checks

The Validate Data block is a data quality powerhouse. It allows you to introduce multiple screening rules and evaluate incoming data against a predefined schema.

This schema checks for:

Field names, field count, and data types
Cell values (e.g. ensuring that percentage values are less than 101)
Record count (e.g. if your dataset must contain exactly N rows)

The fastest way to configure this block is by connecting it to a model dataset with a correct schema and selecting the “Reset to input” option. This primes the block to receive new data batches and compare them against the defined rules. It will alert you if any fields are missing or if unexpected fields or formats appear in the incoming dataset.

Within the same block, you can also define rules for validating cell values and record counts. For example, after a join, you might expect the number of mismatches to be zero. By adding a Validate Data block, you’ll be alerted if unexpected records appear—whether due to data errors or because a mapping file needs updating.

Alerts and Actions

The Validate Data block includes settings that allow you to:

Set warnings – The data will pass through, but a detailed diagnostics report will be generated.
Create an error – The data will be blocked if it fails validation.
Trigger an email alert – Notify relevant team members when a rule is violated.

With these capabilities, the Validate Data block ensures data integrity and supports a clean, structured workflow.

The Validate Data block is useful for catching inconsistencies in your data, especially when dealing with sources that vary between batches: for example, manually edited or authored data.

Important: If you set this block to Warning mode, the Data tab will list all issues found in the dataset. This is particularly useful when multiple issues need to be addressed.

You can switch between Error and Warning mode while working on the remedy. Error mode stops the data from passing through the block, whereas Warning mode allows the data to flow while generating a detailed error report.

Duplicated values

Omniscope offers several methods to help you manage data duplications.

The De-duplicate block identifies and removes duplicate records (e.g. when multiple rows share the same CustomerID but have different first name or surname spellings). This block allows you to specify which fields are relevant when flagging a record as a duplicate—ensuring that it's not just about having identical rows.

If you want to investigate the causes of duplication and explore variations, you can use the Frequency Table block (from the Analytics section) or the Aggregate block. In both cases, you can combine multiple fields to define what constitutes a unique key for your dataset, then easily sort and filter records where the frequency is greater than 1.

In the Aggregate block, you can create a Split on a unique field (e.g. Country ISO or CustomerID) and, on the Results side, apply functions like Unique List or Unique Value Count to identify instances where multiple country or customer name spellings exist (see image below).

Inconsistent capitalisation

The Normalise Case block ensures consistency in the use of upper and lower case letters and automates text corrections. By default, Omniscope scans each dataset and alerts you if this block is recommended. You can choose to ignore the warning, as case variations are sometimes meaningful or not relevant to the analysis.

Missing data

The data preview chart in each workflow block provides an update on how many missing values are found in each field. A quick sort (simply click on the column header) will list records with missing values at the top, making them easy to inspect.

In cases where you have values such as "n/a", codes like "999999", or blank cells, you can add a Search/Replace block to enforce uniformity and populate the cells with another value.

In some datasets (e.g. log files or digital media files), an empty cell is not considered missing data - it is assumed that each empty cell takes the value of the last populated cell above it. In this case, you can add a Fill Down block to fill in the blanks for all or just some of the fields.

Empty rows and fields

Some data input files can contain empty rows or fields, which can increase your data size and cause issues when aggregating, visualising, or merging with other datasets. Two blanks are a match!

As the name suggests, the Delete Empty Data block will find and remove any rows or columns that are completely empty.

Questionable values - using smart analytics to catch them

Sometimes it can be difficult to spot an outlier just by sorting, filtering, or visualising the data. In some cases, it is the combination of values in multiple fields that makes the record an outlier. You could automate the process by applying Anomaly Detection and Reduce Dimensions from our Analytics library, or by using your own R and Python code in the Custom block.

Here is an article explaining the process with an example.

Where data irregularities can be spotted by analysing correlations between the fields, you could use the Covariance / Correlation block.

Note: Although Omniscope doesn't automatically generate data quality scores, you can easily define your own KPIs in a Field Organiser block or in a Report block's Content view by adding a formula and formatting the result as a percentage, such as:

SUBSET_UNIQUECOUNT([Field name])

ALLRECORDCOUNT()

By building your own workflow logic and applying the relevant methodology, you will ensure that only clean, consistently prepared data reaches the reports. This gives you full confidence that your decisions are based on accurate data. Since all steps are pre-defined and transformations are automated (especially if you’re using the Scheduler app), the transition time from raw data to ready data is reduced to just a few seconds, ensuring your team has access to up-to-date information anytime and on any device.

Attached below is a demo file that combines many of the automated data cleaning methods mentioned in this article (IOZ attachment).

Can any automated ETL process catch and fix 100% of the data issues?

The answer depends on the data origins and the nature of the anomalies (are they consistent across batches, or do they vary each time?). In some cases, a degree of manual intervention may be required, and Omniscope can manage this in two ways:

Data export of 'problematic data points' for external manual intervention
Allowing viewing/editing within the project (via the Data Table block in the workflow or Table View in the Report block)

The green block below represents a file output containing mismatched data, which is exported, inspected, and edited manually outside of Omniscope. Edited data with corrections can then be imported back and integrated into the transformation logic.

Different methods for data editing inside Omniscope can be found in this article on our Knowledge Base.

Attachments (2)

ioz

Digital medi....ioz
264 KB

ioz

Data Quality....ioz
1.39 MB