Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 3, 2018

DOI: 10.17587/prin.9.123-131
Storage and Data Handling Multi and Hyperspectral Satellite Images based on Apache Parquet
V. P. Potapov, potapov@ict.sbras.ru, S. E. Popov, popov@ict.sbras.ru, A. Ju. Oshchepkov, aosivt@gmail.com, Institute of Computational Technologies SB RAS, Novosibirsk, 630090, Russian Federation
Corresponding author: Popov Semion E., Chief Scientist, Institute of Computational Technologies SB RAS, Novosibirsk, 630090, Russian Federation, E-mail: popov@ict.sbras.ru
Received on December 15, 2017
Accepted on January 12, 2018

The article describes ways of storing and processing satellite spectral imaging data with the distributed computing systems of the Apache Hadoop ecosystem. A review of related work on distributed processing of such data shows that performance improvements are usually achieved by building up or extending the hardware of the computing cluster. The distinctive feature of the proposed approach is storing the spectral image data in the Parquet file format. The columnar layout of the data provides access to arbitrary pieces of the image pixel values, much like records in a database, avoiding loading the whole image into memory, while still allowing parallel per-pixel image processing. The authors present a comparative analysis of storage formats for spectral images: JSON, XML, sequence files, Apache Avro, and Apache Parquet. The analysis covers the following steps: extracting the data from a Parquet file, converting the data to a Spark or Flink Dataset, computing the normalized difference vegetation index, and iterating over the resulting data and saving it to HDFS. Stress tests were carried out on hybrid frameworks of the Apache Hadoop ecosystem. The Apache Spark API was chosen as the preferred spectral image processor because of its native input/output methods for Parquet files and the lower load it places on the cluster hardware. In conclusion, the authors demonstrate the calculation of the normalized difference vegetation index (NDVI) on two images from the Resurs-P and Sentinel-1A spacecraft missions, using the Apache Spark and Apache Flink frameworks, to audit and confirm that the choice of technology described in the work is correct.
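As a concrete illustration of the per-pixel processing the abstract refers to, NDVI is computed for each pixel from the red and near-infrared bands as (NIR − RED) / (NIR + RED). A minimal sketch in plain Java (the array layout and the no-data handling are illustrative assumptions, not the authors' implementation):

```java
/** Per-pixel NDVI: (NIR - RED) / (NIR + RED), values in [-1, 1]. */
public class Ndvi {
    /**
     * @param red red-band pixel values (reflectance)
     * @param nir near-infrared-band pixel values (reflectance)
     * @return NDVI per pixel; 0 where both bands are 0 (no-data pixels)
     */
    public static double[] compute(double[] red, double[] nir) {
        double[] ndvi = new double[red.length];
        for (int i = 0; i < red.length; i++) {
            double sum = nir[i] + red[i];
            // Guard against division by zero on empty (no-data) pixels.
            ndvi[i] = (sum == 0.0) ? 0.0 : (nir[i] - red[i]) / sum;
        }
        return ndvi;
    }

    public static void main(String[] args) {
        double[] red = {0.10, 0.30, 0.0};
        double[] nir = {0.50, 0.30, 0.0};
        double[] out = compute(red, nir);
        // Dense vegetation gives NDVI close to 1; bare soil close to 0.
        System.out.printf("%.4f %.4f %.4f%n", out[0], out[1], out[2]);
    }
}
```

In the distributed setting described by the article, the same per-pixel formula would be applied to band columns read from the Parquet file, which is what makes the columnar layout a natural fit for this computation.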

Keywords: Apache Parquet, Apache Spark, Apache Flink, Java, GDAL, distributed information systems, remote sensing data, spectral satellite images
pp. 123–131
For citation:
Potapov V. P., Popov S. E., Oshchepkov A. Ju. Storage and Data Handling Multi and Hyperspectral Satellite Images based on Apache Parquet, Programmnaya Ingeneria, 2018, vol. 9, no. 3, pp. 123–131.