Function: Given a data set which has many columns/fields, this process calculates a revised data set with fewer columns/fields, i.e. a lower dimensional data set that preserves as much of the information useful for other analyses as possible.
Typical Use Case: The user has some large volume of data with lots of fields which potentially contain redundant information. They wish to extract non-redundant information to reduce the size of the dataset in order for subsequent operations to be performed faster.
A geneticist is attempting to identify genetic similarities between people whose genomes have been recorded as Single Nucleotide Polymorphisms (single-position variations in the genetic code). This data has several hundred dimensions, and the geneticist wishes to cluster it. Running clustering directly on this kind of data would take a very long time and a huge amount of memory, so Reduce Dimensions is used to cut that overhead without losing essential information.
We construct a workflow by connecting the single nucleotide polymorphism demo data to a dimensionality reduction block.
Example Reduce Dimensions Workflow
The input data consists of more than a thousand fields, where each field represents a single nucleotide polymorphism. All fields need to be numeric. Records with missing values in any of these fields are ignored by default during the reduction, and can be removed from the output data entirely by checking “remove missing values” in the options.
The next step is configuring the Reduce Dimensions block. Clicking on the block reveals the following options:
Fields to Reduce: Select the fields on which you would like to perform dimensionality reduction.
Number of Output Fields: This option provides you with choices on the number of output fields.
By default, this number is calculated automatically so that 99% of the variance in your numerical fields is retained. Alternatively, you can output a specific number of fields, or change the 99% variance threshold to some other value.
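The variance threshold can be pictured with a minimal sketch. It assumes the variance captured by each principal component is already known; the function name `fields_for_variance` and the example variances are hypothetical and are not taken from the Reduce Dimensions block itself.

```python
def fields_for_variance(component_variances, threshold=0.99):
    """Return the smallest number of leading components whose cumulative
    variance reaches `threshold` (a fraction of the total variance)."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    # Components are ranked from largest to smallest variance.
    ranked = sorted(component_variances, reverse=True)
    total = sum(ranked)
    if total <= 0:
        raise ValueError("total variance must be positive")
    running = 0.0
    for count, variance in enumerate(ranked, start=1):
        running += variance
        # Small tolerance so floating-point rounding does not add a field.
        if running / total >= threshold - 1e-12:
            return count
    return len(ranked)


# Hypothetical variances of five principal components (illustration only).
variances = [6.0, 3.0, 0.9, 0.08, 0.02]
n_default = fields_for_variance(variances)        # default 99% threshold
n_relaxed = fields_for_variance(variances, 0.90)  # lower threshold, fewer fields
```

Lowering the threshold trades retained variance for fewer output fields, which is the same trade-off the option exposes.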
Use a grouping field: If your data contains a field which specifies a grouping, then it is advantageous to use this information for dimensionality reduction. When a grouping is used, Linear Discriminant Analysis replaces the normal Principal Component Analysis. One of its objectives is to provide an optimal dimensionality reduction based on your groups, such that the groups are separated as well as possible. Clicking the checkbox reveals the following field selector:
Field Selector for Groupings
Here, a field that contains the groupings can be selected.
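As a rough illustration of how a grouping can steer the reduction, the sketch below computes a Fisher discriminant direction for the simplest case: two groups and two numeric fields, reduced to one output field. This is a simplification for illustration, not the block's actual implementation; the function names and the sample data are hypothetical.

```python
def fisher_direction(rows, groups):
    """Two-group, two-field Fisher discriminant direction.

    rows: list of (x, y) pairs; groups: one label per row, two distinct values.
    Returns w = Sw^-1 (mean_b - mean_a), where Sw is the pooled
    within-group scatter matrix.
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    means = {}
    sxx = sxy = syy = 0.0
    for label in labels:
        members = [r for r, g in zip(rows, groups) if g == label]
        mx = sum(x for x, _ in members) / len(members)
        my = sum(y for _, y in members) / len(members)
        means[label] = (mx, my)
        # Accumulate scatter around each group's own mean.
        for x, y in members:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-group scatter matrix is singular")
    dx = means[labels[1]][0] - means[labels[0]][0]
    dy = means[labels[1]][1] - means[labels[0]][1]
    # Apply the inverse of the 2x2 matrix [[sxx, sxy], [sxy, syy]].
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)


def project(rows, w):
    """Reduce each two-field row to a single value along direction w."""
    return [x * w[0] + y * w[1] for x, y in rows]


# Hypothetical data: two groups that differ mainly in the first field.
rows = [(1.0, 2.0), (2.0, 1.0), (1.0, 1.0), (2.0, 2.0),
        (5.0, 2.0), (6.0, 1.0), (5.0, 1.0), (6.0, 2.0)]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

w = fisher_direction(rows, groups)
reduced = project(rows, w)
```

In this example the single reduced field keeps the two groups apart, because the direction is chosen using the group labels rather than the overall variance alone.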
Normalise data: If this option is checked, dimensionality reduction is performed on the correlation matrix instead of the covariance matrix. Principal Component Analysis looks for the directions along which the data varies most. If the fields of your data are on very different scales, the fields with the largest numeric ranges dominate the result and the reduction will not be optimal. In that case, normalising might yield better results.
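The effect of differently scaled fields can be seen in a small sketch. It assumes two hypothetical fields, `height_mm` and `weight_t`, recorded in very different units, and shows that after standardising each field the covariance becomes the correlation. The helper names are illustrative only.

```python
from statistics import mean, stdev


def standardise(column):
    """Rescale a column to zero mean and unit sample standard deviation."""
    m, s = mean(column), stdev(column)
    if s == 0:
        raise ValueError("cannot standardise a constant column")
    return [(v - m) / s for v in column]


def covariance(a, b):
    """Sample covariance of two equally long columns."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


# Hypothetical fields measured on very different scales.
height_mm = [1500.0, 1600.0, 1700.0, 1800.0]
weight_t = [0.050, 0.060, 0.065, 0.080]

# Raw variances differ by many orders of magnitude, so the first field
# would dominate a covariance-based analysis.
raw_var = (covariance(height_mm, height_mm), covariance(weight_t, weight_t))

# After standardising, both fields have unit variance and their
# covariance is the correlation coefficient.
z_h, z_w = standardise(height_mm), standardise(weight_t)
scaled_var = (covariance(z_h, z_h), covariance(z_w, z_w))
correlation = covariance(z_h, z_w)
```

With unit variances, neither field outweighs the other merely because of its units.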
Missing value behaviour: Missing values might or might not represent problems with your data. Records with missing values cannot be used for dimensionality reduction and are ignored by default. They will still be present in the output data, but their reduced-dimension fields will be left empty. If you would like to remove these records from the output data entirely, check this option.
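The two behaviours can be sketched as follows. The function `apply_reduction`, the output field name `reduced_1`, and the `reduce_row` stand-in are all hypothetical; `reduce_row` is a placeholder for whatever fitted projection is applied and is not a real dimensionality reduction.

```python
def apply_reduction(records, fields, reduce_row, remove_missing=False):
    """Attach a reduced field to each record, handling missing values.

    records: list of dicts; fields: names of the fields to reduce;
    reduce_row: callable mapping a list of numbers to one number.
    """
    output = []
    for record in records:
        values = [record.get(name) for name in fields]
        if any(v is None for v in values):
            if remove_missing:
                # Option checked: drop the record from the output.
                continue
            # Default: keep the record, leave the reduced field empty.
            reduced = None
        else:
            reduced = reduce_row(values)
        output.append({**record, "reduced_1": reduced})
    return output


# Hypothetical records; the second has a missing value.
records = [
    {"id": 1, "snp_a": 0.0, "snp_b": 2.0},
    {"id": 2, "snp_a": None, "snp_b": 1.0},
    {"id": 3, "snp_a": 1.0, "snp_b": 1.0},
]


def placeholder(values):
    """Stand-in for a fitted projection; simply averages the values."""
    return sum(values) / len(values)


kept = apply_reduction(records, ["snp_a", "snp_b"], placeholder)
dropped = apply_reduction(records, ["snp_a", "snp_b"], placeholder,
                          remove_missing=True)
```

Here `kept` retains all three records with an empty reduced field for the incomplete one, while `dropped` contains only the two complete records.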
Keep original input fields in output: If this option is unchecked, only the new reduced dimensions will be part of the output.
Our workflow takes around 30 seconds to run and outputs the original data augmented by additional fields that represent the reduced dimensions.
Additional fields are added to the original data, which represent reduced dimensions