Apache Parquet and Avro read/write support

Modified on Wed, 7 Jun, 2023 at 12:49 PM

Introduction

Omniscope supports reading and writing Apache Parquet and Avro files.

Parquet is a popular column-based file format used by Hadoop systems. It is designed to store large data sets efficiently and has the file extension .parquet.

Avro is a row-oriented file format, developed by Apache.

Reading a Parquet or Avro file

Inside your Omniscope workflow, add a new File input block. Double-click the block to open its options. Select the location of the Parquet or Avro file. If the file has the expected .parquet extension, Omniscope will automatically pick the Parquet file format. Click the Play button to execute the block and read the data.
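If a file arrives without the expected extension, it can help to confirm what it actually is before choosing a format by hand. The following is a minimal Python sketch, not a description of how Omniscope detects formats. It relies on the signatures defined by the file format specifications: a Parquet file starts and ends with the 4 bytes PAR1, and an Avro object container file starts with the bytes Obj followed by the byte 1. The function name sniff_format is hypothetical.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"   # first and last 4 bytes of a (non-encrypted-footer) Parquet file
AVRO_MAGIC = b"Obj\x01"   # first 4 bytes of an Avro object container file
MIN_PARQUET_SIZE = 12     # leading magic + 4-byte footer length + trailing magic


def sniff_format(path):
    """Return "parquet", "avro", or None based on the file's magic bytes.

    Hypothetical helper for illustration only; it checks signatures and
    does not validate the rest of the file.
    """
    data_path = Path(path)
    size = data_path.stat().st_size
    with data_path.open("rb") as handle:
        head = handle.read(4)
        if head == AVRO_MAGIC:
            return "avro"
        if head == PARQUET_MAGIC and size >= MIN_PARQUET_SIZE:
            handle.seek(-4, 2)  # whence=2: offset is relative to the end of the file
            if handle.read(4) == PARQUET_MAGIC:
                return "parquet"
    return None
```

A result of None means the file carries neither signature, so it is probably not a Parquet or Avro file regardless of its name.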

Writing a Parquet or Avro file

Inside your Omniscope workflow, add a new File output block. Connect the data that you want to write to the output block.

Double-click the File output block to open its options. Select the location and name of the file you want to create. Change the Format to Apache Parquet (.parquet file). Click the Play button to write the data.

Limitations

When reading a Parquet file, Omniscope supports only the following logical types: STRING, ENUM, INTEGER, DECIMAL, DATE, TIME, TIMESTAMP and JSON. Other types, such as LIST and MAP, are not currently supported. If you need to import data that uses one or more unsupported types, please get in touch with us, as it may be possible for us to add support.
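To check a file against this list before importing it, you could compare its column types with the supported set. The Python sketch below assumes you have already obtained each column's logical type name from a schema inspection tool of your choice; that step is outside the sketch. The function name unsupported_columns and the example column names are hypothetical, and the supported set is simply copied from the list above.

```python
# Logical types listed as supported in this article.
SUPPORTED_LOGICAL_TYPES = {
    "STRING", "ENUM", "INTEGER", "DECIMAL",
    "DATE", "TIME", "TIMESTAMP", "JSON",
}


def unsupported_columns(column_types):
    """Return {column: logical_type} for columns whose type is not supported.

    column_types maps a column name to its logical type name; the
    comparison ignores case.
    """
    return {
        name: logical_type
        for name, logical_type in column_types.items()
        if logical_type.upper() not in SUPPORTED_LOGICAL_TYPES
    }


# Hypothetical schema, for illustration only.
example_schema = {
    "id": "INTEGER",
    "tags": "LIST",
    "attributes": "MAP",
    "created": "TIMESTAMP",
}
print(unsupported_columns(example_schema))
# prints {'tags': 'LIST', 'attributes': 'MAP'}
```

An empty result means every column uses a type from the supported list.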