

Mekhatronika, Avtomatizatsiya, Upravlenie, 2018, vol. 19, no. 1, pp. 53–57
DOI: 10.17587/mau.19.53-57


Audiovisual Voice Activity Detector Based on Deep Convolutional Neural Network and Generalized Cross-Correlation

D. A. Suvorov, dmitry.suvorov@skolkovotech.ru, R. A. Zhukov, roman.zhukov@skolkovotech.ru, D. O. Tsetserukov, d.tsetserukou@skoltech.ru, Skolkovo Institute of Science and Technology, Moscow, 143026, Russian Federation,
S. L. Zenkevich, zenkev@bmstu.ru, Bauman Moscow State Technical University, Moscow, 105005, Russian Federation

Corresponding author: Suvorov Dmitry A., Ph.D. student, e-mail: dmitry.suvorov@skolkovotech.ru

Accepted on August 20, 2017

This paper presents a voice activity detector (VAD) that combines data from a compact linear microphone array and a video camera, which makes it robust to external noise. It ignores non-speech sound sources and speakers located outside the area of interest. A deep convolutional neural network, trained with the Max-Margin Object Detection loss, processes the camera images to find the face and lips of the speaker. The pixel coordinates of the detected lips are converted to directions in the camera coordinate system using an optical camera model. The sound from the microphone array is processed with the weighted GCC-PHAT algorithm and Kalman filtering. The VAD searches the video for speaking lips and is activated only if the camera finds lips and the microphone array confirms a sound source in the same direction. A prototype based on a linear microphone array with 30 mm spacing between microphones and a video camera was developed, manufactured using a 3D printer, and tested under laboratory conditions. Its accuracy was compared with the open-source VAD from the WebRTC project (developed by Google), which used only audio features extracted from the same microphone array. The developed VAD proved highly robust to external noise: it ignored noise from non-target directions during 100 % of the testing time, whereas the WebRTC VAD produced 88 % false-positive activations.
Keywords: voice activity detector, microphone array, convolutional networks
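
The audiovisual decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a simple pinhole camera model, a single microphone pair, plain (unweighted) GCC-PHAT computed with a naive DFT, and far-field sound propagation, and it omits the Kalman filtering and the neural lip detector entirely. All function names, the speed-of-sound constant, and the angular tolerance are hypothetical choices made for illustration, and the camera and microphone axes are assumed to share the same sign convention.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def pixel_to_azimuth(u, cx, fx):
    """Horizontal angle (rad) of pixel column u under an assumed pinhole model."""
    return math.atan2(u - cx, fx)


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for short demo frames."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(sig_a, sig_b, max_lag):
    """Delay of sig_b relative to sig_a, in samples, via GCC-PHAT."""
    n = len(sig_a)
    cross = []
    for a, b in zip(dft(sig_a), dft(sig_b)):
        c = b * a.conjugate()
        mag = abs(c)
        # PHAT weighting: keep only the phase of the cross-spectrum.
        cross.append(c / mag if mag > 1e-12 else 0j)
    cc = dft(cross, inverse=True)
    # Search the circular cross-correlation within the physically possible lags.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: cc[lag % n].real)


def delay_to_azimuth(delay_samples, sample_rate, spacing):
    """Far-field arrival angle (rad) for one microphone pair."""
    ratio = SPEED_OF_SOUND * delay_samples / (sample_rate * spacing)
    return math.asin(max(-1.0, min(1.0, ratio)))


def vad_active(lip_azimuth, sound_azimuths, tolerance):
    """Active only if lips were found and a sound source lies in their direction."""
    if lip_azimuth is None:
        return False
    return any(abs(lip_azimuth - s) <= tolerance for s in sound_azimuths)
```

With 30 mm spacing, the delay between adjacent microphones spans only a few samples at common audio sample rates, so a single pair gives a coarse angle estimate. That may be one reason to combine several pairs and to smooth the estimates over time, as the weighted GCC-PHAT and Kalman filtering in the abstract suggest.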

Acknowledgments: This research was partially supported by a grant from the Fund for the Promotion of Innovation (project No. 102GRNTIS5/26071).

For citation:
Suvorov D. A., Zhukov R. A., Tsetserukov D. O., Zenkevich S. L. Audiovisual Voice Activity Detector Based on Deep Convolutional Neural Network and Generalized Cross-Correlation, Mekhatronika, Avtomatizatsiya, Upravlenie, 2018, vol. 19, no. 1, pp. 53–57


 
