Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 8, 2025

DOI: 10.17587/prin.16.404-411
Developing the Ensemble of Models to Find the Most Relevant Text to the Question
N. V. Smirnov, Associate Professor, nvsmirnov87@mail.ru, V. A. Dzida, Student, dzida.vadim@yandex.ru, Petrozavodsk State University, Petrozavodsk, 185910, Russian Federation
Corresponding author: Nikolai V. Smirnov, Associate Professor, Petrozavodsk State University, Petrozavodsk, 185910, Russian Federation. E-mail: nvsmirnov87@mail.ru
Received on May 27, 2025
Accepted on June 24, 2025

The widespread use of intelligent assistants and chatbots makes information retrieval a pressing task, and large language models are increasingly used to generate answers to user questions. This article considers the task of finding Russian-language texts that contain the information needed to generate an answer to a user's question. The authors built Russian text datasets that differ in both the number of characters per text fragment and the number of fragments; the largest dataset contains 1,000,000 texts of 5,000 characters each. The authors also defined six classes of questions, which differ in how the answer is presented in the text. An ensemble of models was developed and tested on the different datasets; it combines vector representations of words obtained from neural networks with the transformer architecture with the values of a ranking function. The ensemble proved effective at finding the most relevant text. Several ways of filtering text fragments were proposed and tested, and the resulting double filtering removes from consideration chunks that do not contain an answer to the question. The high accuracy and speed of the software that implements the ensemble and the filtering suggest that it can be used in various intelligent assistants.
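The idea of combining embedding similarity with a ranking function can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the ranking function, the embedding model, or the combination rule. The sketch assumes Okapi BM25 as the ranking function, cosine similarity over precomputed vectors, min-max normalization, and a weighted sum with a hypothetical weight `alpha`. The three-dimensional vectors and the English toy texts are placeholders standing in for real transformer embeddings and Russian-language fragments.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ensemble_rank(query_tokens, query_vec, doc_tokens, doc_vecs, alpha=0.5):
    """Rank documents by a weighted sum of semantic and lexical scores.

    alpha is a hypothetical weight on the embedding similarity.
    """
    lexical = min_max(bm25_scores(query_tokens, doc_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order, combined


# Toy data: placeholder texts and vectors, not real model output.
docs = [
    "the museum opens at ten in the morning".split(),
    "tickets for the ferry are sold at the pier".split(),
    "the ferry to the island leaves every hour".split(),
]
doc_vecs = [[0.1, 0.9, 0.0], [0.5, 0.5, 0.2], [0.9, 0.1, 0.2]]
query = "when does the ferry leave".split()
query_vec = [0.8, 0.1, 0.3]

order, combined = ensemble_rank(query, query_vec, docs, doc_vecs)
print(order, [round(c, 3) for c in combined])
```

In this sketch the lexical score rewards exact word overlap, while the vector score can still favor a fragment whose wording differs from the question, which is the usual motivation for combining the two signals.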

Keywords: information retrieval, RAG, neural networks, ranking function, natural language processing
pp. 404–411
For citation:
Smirnov N. V., Dzida V. A. Developing the Ensemble of Models to Find the Most Relevant Text to the Question, Programmnaya Ingeneria, 2025, vol. 16, no. 8, pp. 404–411. DOI: 10.17587/prin.16.404-411 (in Russian).