Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 6, 2019

DOI: 10.17587/prin.10.265-273
On Changing the Dimension of the Document Embeddings
A. S. Shundeev, alex.shundeev@gmail.com, Lomonosov Moscow State University, Moscow, 119192, Russian Federation
Corresponding author: Shundeev Aleksander S., Leading Researcher, Lomonosov Moscow State University, Moscow, 119192, Russian Federation, E-mail: alex.shundeev@gmail.com
Received on March 03, 2019
Accepted on April 04, 2019

Data mining now underlies a wide range of information systems. A modern and rapidly developing approach to the analysis of textual data is the construction and use of word and document embeddings. Such embeddings were originally applied to the word similarity and word analogy tasks, but they have also proved useful for text classification. This paper studies word and document embeddings from that point of view. An approach based on transformations of word embeddings is described; these transformations change both the model and the dimension of the word embeddings. When document embeddings are associated with word embeddings, the transformations can be extended to the document embeddings by means of multidimensional regression methods. To validate the proposed approach, experiments were performed on two test datasets. The first dataset contained movie reviews, each belonging to one of six genres. The second dataset contained Twitter messages, each labeled as negative or positive. Initial Doc2Vec (DBOW) document embeddings of dimensions 50, 100, 200, and 300 were built, along with Word2Vec (CBOW, SG) and GloVe word embeddings of dimensions 50, 100, and 200. The experiments showed that document embeddings constructed with the proposed method can have a smaller dimension and, in most cases, yield more accurate results in the considered text classification tasks than the original document embeddings.
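The regression step mentioned in the abstract can be illustrated with a minimal sketch. It assumes (as one possible reading, not a detail confirmed by the abstract) that a linear map is fitted by least squares between two word embedding spaces over a shared vocabulary, and that the same map is then applied to document embeddings lying in the source space. All data here is synthetic, and the names `fit_linear_map`, `W_src`, `W_dst`, and `D_src` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): row i of W_src and W_dst is the embedding
# of the same word in a 300-dimensional source space and a 50-dimensional
# target space. Real embeddings would come from trained models.
n_words, src_dim, dst_dim = 1000, 300, 50
W_src = rng.normal(size=(n_words, src_dim))
true_map = rng.normal(size=(src_dim, dst_dim))
W_dst = W_src @ true_map + 0.01 * rng.normal(size=(n_words, dst_dim))


def fit_linear_map(X, Y):
    """Least-squares fit of Y ~ X @ A + b (multi-output linear regression)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1], coef[-1]  # weight matrix A, intercept vector b


# Fit the transformation on word embeddings.
A, b = fit_linear_map(W_src, W_dst)

# Assumption: document embeddings live in the same source space as the words,
# so the fitted map can be applied to them unchanged.
D_src = rng.normal(size=(20, src_dim))
D_dst = D_src @ A + b  # document embeddings of reduced dimension
```

The transformed rows of `D_dst` could then be passed to any classifier in place of the original document embeddings. A linear map is only the simplest choice; the phrase "multidimensional regression methods" may cover other regressors.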

Keywords: word embeddings, document embeddings, text classification, regression
pp. 265–273
For citation:
Shundeev A. S. On Changing the Dimension of the Document Embeddings, Programmnaya Ingeneria, 2019, vol. 10, no. 6, pp. 265–273