Optimising workflow performance using Savepoint blocks and understanding streaming behaviour

Modified on Thu, 22 May at 12:05 PM

Overview

This article explains how Omniscope handles data execution through streaming blocks such as Record Filter and Field Organiser, when to use Savepoint (Temporary) blocks for performance optimisation, and common misconceptions when building workflows with large datasets.

Key Concepts

1. Streaming Blocks

Many Omniscope blocks, like Record Filter and Field Organiser, are considered pure streaming blocks. These process data on-the-fly without storing the full dataset in memory or on disk. This allows fast performance with minimal overhead.

Example: Simple formulas (e.g., arithmetic operations) and field transformations typically use negligible resources and stream efficiently.
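As a rough analogy only (this is plain Python, not Omniscope's implementation, and every name below is hypothetical), streaming behaves much like chained generators: each row passes through the formula and the filter one at a time, so the full dataset is never held at once.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, float]

def source(n: int) -> Iterator[Row]:
    """Hypothetical data source that yields rows one at a time."""
    for i in range(n):
        yield {"id": i, "price": float(i), "qty": 2.0}

def add_total(rows: Iterable[Row]) -> Iterator[Row]:
    """Stand-in for a simple arithmetic formula: only the current row is needed."""
    for row in rows:
        yield {**row, "total": row["price"] * row["qty"]}

def keep_large(rows: Iterable[Row], threshold: float) -> Iterator[Row]:
    """Stand-in for a record filter: rows are passed on or dropped individually."""
    for row in rows:
        if row["total"] >= threshold:
            yield row

# Nothing runs until the final consumer pulls rows through the chain.
pipeline = keep_large(add_total(source(1_000)), threshold=1_990.0)
result = list(pipeline)
print(len(result))
```

Only the rows that survive the filter are collected at the end; the intermediate steps hold a single row at any moment.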

2. Field Organiser blocks with Non-scalable Formulas

Some formulas (like SUBSET) require evaluating the entire dataset as a whole and are therefore not streamable. These formulas cause the dataset to be fully loaded and retained, introducing memory and performance overhead.

If your workflow contains non-scalable formulas, you'll see a warning in Omniscope.

Certain other blocks are always and implicitly non-streaming, such as the Join and Custom blocks; they must consume the input data completely to satisfy the transformation correctly before emitting anything, and store the result temporarily for the remainder of the workflow execution.
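To illustrate why such operations cannot stream, here is a minimal Python sketch under stated assumptions: it is not the SUBSET function or any Omniscope API, and the "share of grand total" calculation and all names are hypothetical stand-ins for a formula that depends on the whole dataset.

```python
from typing import Dict, Iterable, Iterator, List

stats = {"consumed": 0}  # counts rows pulled from the source

def source() -> Iterator[Dict[str, float]]:
    """Hypothetical upstream source."""
    for value in [1.0, 3.0, 4.0]:
        stats["consumed"] += 1
        yield {"value": value}

def share_of_total(rows: Iterable[Dict[str, float]]) -> Iterator[Dict[str, float]]:
    """Each output row needs the grand total, so all input must be buffered first."""
    buffered: List[Dict[str, float]] = list(rows)  # full dataset retained here
    grand_total = sum(r["value"] for r in buffered)
    for r in buffered:
        yield {**r, "share": r["value"] / grand_total}

out = share_of_total(source())
first = next(out)  # asking for one row forces the whole input to be read
print(first, stats["consumed"])
```

Requesting a single output row already consumes every input row, which is the memory and performance overhead described above.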

Understanding Data Flow: Streaming vs. Stored Execution

Omniscope uses a pull-based model: downstream blocks request data from upstream blocks. If the same upstream block feeds multiple pathways, it will execute separately for each pull unless a non-streaming block or a dedicated Savepoint block (with the Temporary option selected) is used.

Scenario: If you split data using a Record Filter into two paths (Yes/No), each path triggers its own execution of the streaming upstream blocks, such as data sources, so those blocks run twice.
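The Yes/No scenario can be sketched in Python as follows. This is an analogy with hypothetical names, not Omniscope code: each branch pulls from the source independently, so the source runs once per branch.

```python
from typing import Iterator, List

calls = {"source": 0}  # how many times the source has been executed

def source() -> Iterator[int]:
    """Hypothetical streaming data source."""
    calls["source"] += 1
    for i in range(10):
        yield i

def yes_branch() -> List[int]:
    """Rows that pass the filter condition."""
    return [x for x in source() if x % 2 == 0]

def no_branch() -> List[int]:
    """Rows that fail the filter condition."""
    return [x for x in source() if x % 2 != 0]

yes_rows = yes_branch()
no_rows = no_branch()
print(calls["source"])  # the source was executed once for each branch
```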

When and Where to Use Savepoint (Temporary) Blocks

Purpose:

The Savepoint (Temporary) block stores intermediate results on disk for the duration of the workflow execution. This ensures data is only processed once, even when used in multiple downstream pathways.

Use Savepoint when:

You have a slow or large data source (e.g., 30M+ rows).
You apply a Record Filter that significantly reduces the data volume (e.g., to 40k rows).
You need to reuse the filtered data in multiple downstream branches.

Recommended Placement:

In this example, place the Savepoint block immediately after a Record Filter that significantly reduces the dataset size. This minimises the amount of data written and maximises performance gain.

❌ Not recommended: Placing Savepoint right after a 30M-row source. This adds unnecessary overhead, because all 30M rows are stored even though the data is filtered immediately afterwards in this example. It might still improve performance for slow sources, but placing the Savepoint after the filter gives a much larger gain.
✅ Recommended: Placing Savepoint after the filter that reduces data to 40k rows.
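The recommended placement can be sketched in Python under stated assumptions: the row counts are scaled down (30,000 source rows instead of 30M), the JSON-lines temp file is only a stand-in for whatever storage Omniscope uses, and all function names are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator

stats = {"source_runs": 0}

def slow_source(n: int) -> Iterator[Dict[str, int]]:
    """Hypothetical slow or large data source."""
    stats["source_runs"] += 1
    for i in range(n):
        yield {"id": i}

def record_filter(rows: Iterable[Dict[str, int]]) -> Iterator[Dict[str, int]]:
    """Keeps a small fraction of the rows."""
    return (r for r in rows if r["id"] % 1000 == 0)

def savepoint(rows: Iterable[Dict[str, int]], path: str) -> int:
    """Writes the rows to disk once and returns how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")
            written += 1
    return written

def read_savepoint(path: str) -> Iterator[Dict[str, int]]:
    """Downstream branches read the stored rows instead of re-running upstream."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "savepoint.jsonl")
    # Savepoint placed AFTER the filter: only the reduced data is stored.
    rows_written = savepoint(record_filter(slow_source(30_000)), path)
    branch_a = [r["id"] for r in read_savepoint(path)]
    branch_b = sum(1 for _ in read_savepoint(path))

print(stats["source_runs"], rows_written)
```

In this sketch the source runs once and only the filtered rows are written. Placing the savepoint before the filter would also run the source once, but would write every source row to disk.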

Note that the Savepoint block has two behaviours. For this goal, be sure to configure "Temporary" rather than "Persistent". A Persistent Savepoint is intended to optimise successive executions that don't refresh everything, rather than multiple pathways within the same workflow execution.

Additional Notes

Omniscope may temporarily cache results during step-by-step execution to help with interactive workflow design. However, this is not used during full workflow executions or refreshes.
Some blocks (e.g., Join) are not streaming blocks (as mentioned above) and implicitly behave like temporary Savepoints due to the nature of their operations.

Summary Recommendations

Use Savepoint (with Temporary option) when performance is affected by repeated upstream executions.
Place Savepoint (temporary) after filtering operations that dramatically reduce data volume.
Be aware of non-scalable formulas and avoid them where performance is a concern.
Remember: streaming blocks = fast, but will re-execute if reused downstream.