Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 1, 2019

DOI: 10.17587/prin.10.30-37
Analysis of Methods for Converting Texts into the Form of Objects in a Vector Space
E. I. Burlaeva, ekaterina0853@mail.ru, V. N. Pavlysh, pavlyshvn@mail.ru, Donetsk National Technical University, Donetsk, 83008, Donetsk region
Corresponding author: Burlayeva Ekaterina I., Graduate Student, Donetsk National Technical University, Donetsk, 83008, Donetsk region, E-mail: ekaterina0853@mail.ru
Received on June 26, 2018
Accepted on September 05, 2018

In the modern world, the amount of information is constantly growing, and a large part of it is unstructured text data. Such data are difficult for a person to process unaided. Moreover, manual analysis is ineffective for large volumes of text, since it is limited by processing speed and is prone to errors caused by the human factor. Methods that can handle such data automatically are therefore required. One technology for processing textual information is the automatic classification of text documents. The traditional representation of a document as a sequence of symbols makes it difficult to treat the document as an object of classification. Most machine learning algorithms work with documents as elements of a vector space, which makes it necessary to transform texts into vector form. In this paper, we consider ways of reducing the dimension of the vectors used to search a text corpus for descriptions of related fragments of knowledge and the linguistic forms in which they are expressed. Several methods that reduce the dimension of the vector for automatic text classification are considered, and their advantages and disadvantages are highlighted, which can serve as a starting point for developing more effective approaches. To reduce the dimension of the vectors, combinations of text preprocessors are considered.
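The idea of combining preprocessors to shrink a document vector can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the toy suffix stripper `crude_stem`, and the reading of "Lower boundary" as a minimum document-frequency threshold (`min_df`) are all assumptions made for illustration, as are the three sample documents.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "of", "and", "is", "are", "in", "to"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, use_stop_words=True, use_stemming=True):
    """Lowercase and tokenize, then optionally drop stop words and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if use_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def build_vocabulary(token_lists, min_df=1):
    """Keep terms found in at least min_df documents (assumed 'lower boundary')."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return sorted(t for t, n in df.items() if n >= min_df), df


def tfidf_vectors(token_lists, vocabulary, df):
    """One tf-idf vector per document; tf is normalized by document length."""
    n_docs = len(token_lists)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocabulary]
        )
    return vectors


docs = [
    "The classifier sorts the documents",
    "Documents are vectors in a space",
    "Sorting vectors is the classifier task",
]

# Baseline: lowercase tokens only, no preprocessors.
raw = [preprocess(d, use_stop_words=False, use_stemming=False) for d in docs]
full_vocab, _ = build_vocabulary(raw)

# Composition: stop words + stemming + lower boundary on document frequency.
cleaned = [preprocess(d) for d in docs]
reduced_vocab, df = build_vocabulary(cleaned, min_df=2)
vectors = tfidf_vectors(cleaned, reduced_vocab, df)

print(len(full_vocab), len(reduced_vocab))  # 12 4
print(reduced_vocab)  # ['classifier', 'document', 'sort', 'vector']
```

On this toy corpus the composition reduces the vector dimension from 12 to 4. A real system would likely replace `crude_stem` with a proper stemmer and tune the threshold on actual data.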

Keywords: vector representation, text document, word, composition of methods, tf-idf, classification, "Stemming", "Stop words", "Lower boundary"
pp. 30–37
For citation:
Burlaeva E. I., Pavlysh V. N. Analysis of Methods for Converting Texts into the Form of Objects in a Vector Space, Programmnaya Ingeneria, 2019, vol. 10, no. 1, pp. 30–37