DOI: 10.17587/prin.16.583-593
Intelligent System for Text Classification under Linguistic Uncertainty
A. A. Skvortsov, Associate Professor of the Department of Mathematical Modeling and Information Technologies, skvor_88@mail.ru,
M. S. Anureva, Associate Professor of the Department of Mathematical Modeling and Information Technologies, anuryeva@mail.ru,
A. N. Solodovnikov, Senior Specialist, IT Center, bearbearovich@gmail.com, Derzhavin Tambov State University, Tambov, 392000, Russian Federation
Corresponding author: Alexander A. Skvortsov, Candidate of Pedagogical Sciences, Associate Professor of the Department of Mathematical Modeling and Information Technologies, Derzhavin Tambov State University, Tambov, 392000, Russian Federation, E-mail: skvor_88@mail.ru
Received on May 06, 2025
Accepted on June 17, 2025
This article presents the development of an intelligent system for classifying user-generated texts in social networks under conditions of linguistic uncertainty. It addresses the growing need for automated identification of the professional interests of social media users, which is especially relevant for recruitment, education, and career guidance. Traditional methods for identifying target audiences are time-consuming, costly, and inefficient when dealing with unstructured data. The proposed system uses and compares three modern text vectorization methods (TF-IDF, FastText, and BERT) to improve classification accuracy.
The system architecture includes modules for collecting user data from VKontakte via API, preprocessing text (normalization, lemmatization, noise reduction), and thematic classification using a predefined set of IT-related terms. The classification decision is based on the scalar product between text vectors and reference vectors of key terms. The system also supports an ensemble decision mechanism combining multiple models to increase reliability.
The study provides a comparative analysis of the effectiveness of the selected vectorization methods in a binary classification task (IT-related or not) using real user data. Experiments demonstrate the superiority of BERT in terms of accuracy, followed by FastText. TF-IDF showed lower sensitivity to thematic content in short and informal messages. A web-based interface was developed to automate user classification. It accepts a VK community ID, retrieves and analyzes user data, and generates reports on the distribution of IT interest. The system can be applied in career guidance, educational analytics, and HR processes to identify suitable candidates based on their digital footprint.
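As a rough illustration of the reporting step, the sketch below aggregates per-user binary labels into a distribution summary. The function name, the report layout, and the sample user IDs are hypothetical; the format of the system's actual reports is not described here.

```python
from collections import Counter

def interest_report(labels):
    """Summarize per-user binary labels into user counts and percentage shares."""
    counts = Counter("IT" if is_it else "non-IT" for is_it in labels.values())
    total = len(labels)
    return {
        label: {
            "users": counts[label],
            "share_pct": round(100 * counts[label] / total, 1) if total else 0.0,
        }
        for label in ("IT", "non-IT")
    }

# Hypothetical user IDs and labels, for illustration only.
sample = {"id101": True, "id102": False, "id103": True,
          "id104": False, "id105": False}
report = interest_report(sample)
```

An empty community yields zero counts and zero shares rather than a division error.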
Keywords: text vectorization, intelligent classification, natural language processing, linguistic uncertainty, social networks
pp. 583—593
For citation:
Skvortsov A. A., Anureva M. S., Solodovnikov A. N. Intelligent System for Text Classification under Linguistic Uncertainty, Programmnaya Ingeneria, 2025, vol. 16, no. 11, pp. 583–593. DOI: 10.17587/prin.16.583-593 (in Russian).