DOI: 10.17587/it.26.290-296
P. 290–296

L. V. Savchenko, PhD (Candidate of Sciences), e-mail: lsavchenko@hse.ru,
National Research University Higher School of Economics, N. Novgorod, Russian Federation

The article addresses the problem of isolated word recognition based on deep convolutional neural networks. The practical use of existing recognition systems is limited by their insufficient reliability under intense acoustic noise, such as street noise or the sound of passing vehicles. At present, the most accurate recognition methods build their acoustic models with deep learning technologies, in particular convolutional neural networks. For image processing problems, the adaptation of such networks to a new domain by fine-tuning on rather small training samples is well studied. In this paper we propose to fine-tune the networks in the same way in order to adapt the acoustic models to a speaker's voice using a small number of utterances. To reduce the error rate, we consider an ensemble of several different speaker-dependent neural network architectures trained in this way. The final decision is made by a weighted voting rule in which the weight of each acoustic model is proportional to its accuracy estimated on the training set. The experimental results for the recognition of English commands showed that such an ensemble of adapted pre-trained acoustic models can significantly improve accuracy compared with conventional pre-trained models, especially when white Gaussian noise is added to the input signal.
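
The weighted voting rule described in the abstract can be illustrated with a minimal sketch. The abstract does not state whether the models cast hard votes (one class label each) or contribute soft scores, so the sketch assumes hard votes. The function name `weighted_vote`, its parameters, and the example numbers are hypothetical and are not taken from the article.

```python
def weighted_vote(predicted_labels, train_accuracies, n_classes):
    """Pick a class by accuracy-weighted voting over an ensemble.

    predicted_labels: class index chosen by each acoustic model (assumed hard votes).
    train_accuracies: accuracy of each model estimated on the training set.
    n_classes: number of words (classes) in the vocabulary.
    Returns the winning class index and the per-class weighted vote totals.
    """
    if len(predicted_labels) != len(train_accuracies):
        raise ValueError("one accuracy value is required per model")
    total = sum(train_accuracies)
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")

    scores = [0.0] * n_classes
    for label, accuracy in zip(predicted_labels, train_accuracies):
        # Each model's weight is proportional to its training-set accuracy.
        scores[label] += accuracy / total

    winner = max(range(n_classes), key=scores.__getitem__)
    return winner, scores


# Hypothetical example: three models, three command classes.
# Models 1 and 3 vote for class 2, model 2 votes for class 0.
winner, scores = weighted_vote([2, 0, 2], [0.9, 0.95, 0.8], n_classes=3)
print(winner, scores)
```

In this example the most accurate single model is outvoted, because the combined weight of the two models that agree on class 2 exceeds its own weight. With soft votes, the same weights would instead multiply each model's vector of class scores before summation.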