Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 3, 2018

DOI: 10.17587/prin.9.123-131
Storage and Data Handling Multi and Hyperspectral Satellite Images based on Apache Parquet
V. P. Potapov, potapov@ict.sbras.ru, S. E. Popov, popov@ict.sbras.ru, A. Ju. Oshchepkov, aosivt@gmail.com, Institute of Computational Technologies SB RAS, Novosibirsk, 630090, Russian Federation
Corresponding author: Popov Semion E., Chief Scientist, Institute of Computational Technologies SB RAS, Novosibirsk, 630090, Russian Federation, E-mail: popov@ict.sbras.ru
Received on December 15, 2017
Accepted on January 12, 2018

The article describes ways of storing and processing satellite spectral imaging data with the distributed computing systems of the Apache Hadoop ecosystem. A review of related work on distributed processing of such data shows that performance improvements are usually achieved by building up or extending the hardware of the computing cluster. The distinctive feature of the proposed approach is storing the spectral image data in the Parquet file format. The columnar layout of the data provides access to arbitrary pieces of the image pixel values, much like records in a database, avoiding loading the whole image into memory, while still allowing parallel per-pixel image processing. The authors present a comparative analysis of storage formats for spectral images: JSON, XML, sequence files, Apache Avro, and Apache Parquet. The analysis covers the following steps: extracting the data from a Parquet file, converting the data to a Spark or Flink Dataset, computing the normalized difference vegetation index, and iterating over the resulting data and saving it to HDFS. Stress tests were carried out on hybrid frameworks of the Apache Hadoop ecosystem. The Apache Spark API was chosen as the preferred spectral image processor because of its native input/output methods for Parquet files and the lower load it places on the cluster hardware. In conclusion, the authors demonstrate the calculation of the normalized difference vegetation index (NDVI) on two images from the Resurs-P and Sentinel-1A spacecraft missions, using the Apache Spark and Apache Flink frameworks, to audit and confirm that the choice of technology described in the work is correct.
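As a concrete illustration of the per-pixel processing the abstract refers to, NDVI is computed for each pixel from the red and near-infrared bands as (NIR − RED) / (NIR + RED). A minimal sketch in plain Java (the array layout and the no-data handling are illustrative assumptions, not the authors' implementation):

```java
/** Per-pixel NDVI: (NIR - RED) / (NIR + RED), values in [-1, 1]. */
public class Ndvi {
    /**
     * @param red red-band pixel values (reflectance)
     * @param nir near-infrared-band pixel values (reflectance)
     * @return NDVI per pixel; 0 where both bands are 0 (no-data pixels)
     */
    public static double[] compute(double[] red, double[] nir) {
        double[] ndvi = new double[red.length];
        for (int i = 0; i < red.length; i++) {
            double sum = nir[i] + red[i];
            // Guard against division by zero on empty (no-data) pixels.
            ndvi[i] = (sum == 0.0) ? 0.0 : (nir[i] - red[i]) / sum;
        }
        return ndvi;
    }

    public static void main(String[] args) {
        double[] red = {0.10, 0.30, 0.0};
        double[] nir = {0.50, 0.30, 0.0};
        double[] out = compute(red, nir);
        // Dense vegetation gives NDVI close to 1; bare soil close to 0.
        System.out.printf("%.4f %.4f %.4f%n", out[0], out[1], out[2]);
    }
}
```

In the distributed setting described by the article, the same per-pixel formula would be applied to band columns read from the Parquet file, which is what makes the columnar layout a natural fit for this computation.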

Keywords: Apache Parquet, Apache Spark, Apache Flink, Java, GDAL, distributed information systems, remote sensing data, spectral satellite images
pp. 123–131
For citation:
Potapov V. P., Popov S. E., Oshchepkov A. Ju. Storage and Data Handling Multi and Hyperspectral Satellite Images based on Apache Parquet, Programmnaya Ingeneria, 2018, vol. 9, no. 3, pp. 123–131.