Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 5, 2022

DOI: 10.17587/prin.13.239-246
An Algorithm for Finding Contradictions in Multiformat Data using Apache Spark
A. A. Vorobyev, Associate Professor, awa@mail.ru, S. M. Makeev, PhD, maksm57@yandex.ru, Russian Federation Security Guard Service Federal Academy, Oryol, 302015, Russian Federation
Corresponding author: Makeev Sergey M., PhD, Employee, Russian Federation Security Guard Service Federal Academy, Oryol, 302015, Russian Federation. E-mail: maksm57@yandex.ru
Received on May 20, 2021
Accepted on March 17, 2022

The quality of managerial decision-making is significantly affected by the inconsistency and heterogeneity of information. Such information comes from various sources whose reliability cannot be determined unambiguously (for example, social networks, electronic media, and opinion polls) and arrives in different representations (for example, texts, graphs, or tables). The purpose of this work was to carry out theoretical and experimental studies that support the choice of methods, and their implementation in an algorithm for processing multiformat data, to address the inconsistency and heterogeneity of information. To achieve this goal, the following tasks were solved: a comparative analysis of methods for finding contradictions in heterogeneous information, including latent semantic analysis, neural networks, and others; development of an algorithm for intelligent processing of big data using Apache Spark; and evaluation of the algorithm's performance in obtaining a result of adequate quality within a given time interval. As a result of the research, for finding contradictions in media publications, the authors propose applying latent semantic analysis first to select articles on a given topic and then a method for determining the tonality of those articles; for processing the results of sociological surveys, they propose calculating an integral indicator for a question selected from the questionnaire. Based on the selected methods, a multi-step algorithm was developed and implemented in Python on the Apache Spark platform as a software product registered in the Register of Computer Programs. The experiments led to the conclusion that using Apache Spark with the developed algorithm enables an effective search for contradictions in information while meeting the efficiency requirements.
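
The comparison between the two kinds of sources named in the abstract can be illustrated with a minimal sketch. It is not the authors' registered implementation: it uses plain Python instead of Apache Spark, it omits the latent semantic analysis step and assumes the articles have already been selected and scored, and every name in it (integral_indicator, mean_tonality, contradicts, the answer weights, the 0.2 threshold) is a hypothetical choice made for illustration. In particular, the integral indicator is assumed here to be a weighted mean of answer options on a scale from -1 to 1, which the abstract does not specify.

```python
from statistics import mean


def integral_indicator(counts, weights):
    """Assumed form of the integral indicator: the weighted mean of the
    answer weights, where counts maps each answer to its number of responses."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses for this question")
    return sum(weights[answer] * n for answer, n in counts.items()) / total


def mean_tonality(scores):
    """Average tonality of the articles selected for a topic.
    Scores are assumed to lie in [-1, 1] (negative to positive)."""
    return mean(scores)


def contradicts(media_score, survey_score, threshold=0.2):
    """Flag a contradiction when the two sources point in opposite
    directions and both are at least `threshold` away from neutral."""
    opposite = media_score * survey_score < 0
    both_clear = min(abs(media_score), abs(survey_score)) >= threshold
    return opposite and both_clear


# Hypothetical inputs: tonality scores of three articles on one topic,
# and answer counts for one survey question on the same topic.
article_scores = [0.6, 0.4, 0.8]
answer_counts = {"approve": 20, "neutral": 30, "disapprove": 50}
answer_weights = {"approve": 1.0, "neutral": 0.0, "disapprove": -1.0}

media = mean_tonality(article_scores)
survey = integral_indicator(answer_counts, answer_weights)
print(contradicts(media, survey))
```

With these made-up inputs the media coverage is positive on average while the survey leans negative, so the pair is flagged. In a Spark setting the same per-topic aggregation would presumably run as a distributed group-by, but that mapping is an assumption rather than something stated in the abstract.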

Keywords: heterogeneous information, contradiction, latent semantic analysis, text tonality, search algorithm, Apache Spark
pp. 239—246
For citation:
Vorobyev A. A., Makeev S. M. An Algorithm for Finding Contradictions in Multiformat Data using Apache Spark, Programmnaya Ingeneria, 2022, vol. 13, no. 5, pp. 239—246.