Informacionnye Tehnologii, 2025, vol. 31, no. 10, pp. 517-525


ABSTRACTS OF ARTICLES OF THE JOURNAL "INFORMATION TECHNOLOGIES".
No. 10. Vol. 31. 2025

DOI: 10.17587/it.31.517-525

V. M. Veselovsky, Student, R. F. Khalabiya, Ph.D. in Engineering sciences, Associate Professor, I. V. Stepanova, Ph.D. in Geology and Mineralogy Sciences, Associate Professor,
Moscow Technical University of Communications and Informatics, Moscow, 111024, Russian Federation

Algorithm for Detection Relevant Text Elements Based on Morphological and Frequency Analysis

Received on 20.03.2025
Accepted on 21.04.2025

The main objective of this work is to automate the detection of keywords and key phrases using modern natural language processing methods, which will improve the structuring and classification of text data and adapt them for further integration with classification systems. For this purpose, an algorithm for the automatic detection of keywords and key phrases in Russian-language texts is proposed, intended for use with complex multi-level classification systems such as UDC and GRNTI. The algorithm can work with single texts without linking them to document collections. A joint frequency and morphological analysis that takes the structure of the document into account is used to detect keywords and key phrases. To detect key phrases, lexical and grammatical patterns consisting of adjectives and nouns are used, as well as stable combinations of nouns. The algorithm works effectively with large texts divided into segments (zones of relevance). To adjust the rank of a relevant text element calculated by frequency analysis, a special coefficient is introduced that depends on the areas in which the keywords occur. A comparative analysis showed that, compared with the TF-IDF and TextRank algorithms, the developed algorithm is highly efficient at detecting keywords. Integrating the automatic text analysis algorithm with classification systems opens up additional opportunities for structuring knowledge and for processing large amounts of data more efficiently.
Keywords: text analysis, keyword, key phrases, stable combination, frequency analysis, frequency dictionary, tokenization, lemmatization, morphological analysis, text classification
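
The ranking idea described in the abstract (frequency counts adjusted by a coefficient that depends on where an element occurs, plus adjective + noun patterns for key phrases) can be illustrated with a minimal sketch. This is not the authors' implementation. The zone names, the coefficient values in ZONE_WEIGHTS, the toy part-of-speech table TOY_POS, and the function names are all assumptions made for illustration. A real system for Russian would obtain lemmas and part-of-speech tags from a morphological analyzer rather than from a lookup table.

```python
import re
from collections import Counter

# Hypothetical zone coefficients: the abstract only states that a coefficient
# depends on where a keyword occurs; these values are illustrative.
ZONE_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}

# Toy part-of-speech lookup standing in for a real morphological analyzer.
TOY_POS = {
    "frequency": "ADJ", "morphological": "ADJ", "key": "ADJ",
    "analysis": "NOUN", "text": "NOUN", "phrase": "NOUN", "algorithm": "NOUN",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-zа-яё]+", text.lower())


def rank_elements(zones):
    """Score nouns and ADJ+NOUN bigrams by zone-weighted frequency.

    `zones` maps a zone name (e.g. "title") to the text of that zone.
    Returns (element, score) pairs sorted by descending score.
    """
    scores = Counter()
    for zone, text in zones.items():
        weight = ZONE_WEIGHTS.get(zone, 1.0)  # unknown zones get weight 1.0
        tokens = tokenize(text)
        # Single-word candidates: every noun occurrence adds the zone weight.
        for tok in tokens:
            if TOY_POS.get(tok) == "NOUN":
                scores[tok] += weight
        # Phrase candidates: adjacent adjective + noun pairs.
        for first, second in zip(tokens, tokens[1:]):
            if TOY_POS.get(first) == "ADJ" and TOY_POS.get(second) == "NOUN":
                scores[f"{first} {second}"] += weight
    return scores.most_common()


doc = {
    "title": "Frequency analysis of text",
    "body": "The algorithm uses frequency analysis and morphological analysis of text.",
}
ranking = rank_elements(doc)
for element, score in ranking:
    print(f"{score:4.1f}  {element}")
```

In this sketch an occurrence in the title counts three times as much as one in the body, so a phrase that appears once in the title can outrank a word that appears more often in the body text alone. Because the scores come from a single document, no collection-wide statistic such as IDF is needed.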

P. 517-525


