Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397
Issue No. 1, 2019
In the modern world, the amount of information is constantly growing, and a large part of it is unstructured text data. It is difficult for a person to process such data unaided. Moreover, manual analysis is ineffective for large volumes of text, since it is limited by speed and prone to errors caused by the human factor. Methods that can process such data automatically are therefore required. One technology for processing textual information is the automatic classification of text documents. The traditional representation of a document as a sequence of symbols makes it difficult to treat the document as an object of classification. Most machine learning algorithms work with documents as elements of a vector space, which makes it necessary to transform texts into vector form. In this paper, we consider ways of reducing the dimensionality of vectors when searching a text corpus for descriptions of closely related fragments of knowledge and the linguistic forms in which they are expressed. Several methods that reduce vector dimensionality for automatic text classification are reviewed, and their advantages and disadvantages are highlighted, which can serve as a starting point for the development of more effective approaches. In particular, combinations of text preprocessors are considered as a means of reducing vector dimensionality.
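As a rough illustration of how text preprocessing can shrink the vector space, the following minimal Python sketch builds bag-of-words vectors for a toy corpus with and without a preprocessing chain. The corpus, the stop-word list, and the `crude_stem` suffix stripper are all hypothetical stand-ins chosen for this example; they are not the preprocessors evaluated in the paper.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "The classifier processes the documents.",
    "Documents are processed by a classifier.",
    "Processing of a document is automatic.",
]
STOP_WORDS = {"the", "a", "are", "by", "of", "is"}


def tokenize(text):
    """Split text into alphabetic tokens, keeping the original case."""
    return re.findall(r"[A-Za-z]+", text)


def crude_stem(token):
    """Strip one common English suffix (a stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tokens):
    """Chain three preprocessors: lowercasing, stop-word removal, stemming."""
    lowered = (t.lower() for t in tokens)
    return [crude_stem(t) for t in lowered if t not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Collect the sorted set of distinct terms; its size is the vector dimension."""
    return sorted({t for tokens in token_lists for t in tokens})


def vectorize(tokens, vocabulary):
    """Map a token list to a term-count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


raw_tokens = [tokenize(d) for d in DOCS]
clean_tokens = [preprocess(t) for t in raw_tokens]

raw_vocab = build_vocabulary(raw_tokens)
clean_vocab = build_vocabulary(clean_tokens)

raw_vectors = [vectorize(t, raw_vocab) for t in raw_tokens]
clean_vectors = [vectorize(t, clean_vocab) for t in clean_tokens]

print("dimension without preprocessing:", len(raw_vocab))
print("dimension with preprocessing:   ", len(clean_vocab))
print("vocabulary after preprocessing: ", clean_vocab)
```

On this toy corpus the vocabulary, and hence the vector dimension, drops from 15 terms to 4, because case variants, function words, and inflected forms collapse into shared terms. A real system would likely replace `crude_stem` with an established stemmer or lemmatizer and use a curated stop-word list.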