CEPSTRAL VOICES ACTIVATION FOR ANDROID
Abstract: A large number of modern mobile devices, embedded devices and smart home devices are equipped with voice control. Automatic recognition of the entire audio stream, however, is undesirable for reasons of resource consumption and privacy. Therefore, most of these devices use a voice activation system, whose task is to find a word or phrase specified in advance (for example, “Ok, Google”) in the audio stream and to activate the voice request processing system when it is found. The voice activation system must have the following properties: high accuracy, the ability to work entirely on the device (without using remote servers), low resource consumption (primarily CPU and RAM), robustness to noise and to variability of speech, and a small delay between the pronunciation of the key phrase and system activation. This work is a systematic literature review of voice activation systems that satisfy the above properties. We describe the operating principles of various voice activation systems and the feature representations of sound used in such systems, consider acoustic modelling in detail and, finally, describe the approaches used to assess model quality. In addition, we point to a number of open questions in this problem.
Abstract: The problem of voice activation is to find a pre-defined word in the audio stream. Solutions such as the keyword spotter “Ok, Google” for Android devices or the keyword spotter “Alexa” for Amazon devices use tens of thousands to millions of keyword examples in training. In this paper, we explore the possibility of using pre-trained audio features to build voice activation with a small number of keyword examples. The contribution of this article consists of two parts. First, we investigate how the quality of the voice activation system depends on the number of training examples for English and Russian and show that the use of pre-trained audio features, such as wav2vec, increases the accuracy of the system by up to 10% if only seven examples are available for each keyword during training. At the same time, the benefits of such features diminish and eventually disappear as the dataset size increases. Secondly, we prepare and provide for general use a dataset for training and testing voice activation for the Lithuanian language. We also provide training results on this dataset.
Abstract: While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence. In this work, we propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding.
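The voice activation task described in these abstracts (finding a pre-defined word in the audio stream and activating request processing when it is found) can be illustrated with a minimal sketch. This is not the method of any of the listed papers. It assumes a hypothetical acoustic model, not shown, that emits one keyword probability per audio frame; the function name `detect_keyword` and the parameters `window`, `threshold` and `cooldown` are illustrative choices. The sketch smooths the per-frame scores with a moving average and reports a detection when the smoothed score reaches the threshold.

```python
from collections import deque
from typing import Iterable, Iterator


def detect_keyword(
    frame_scores: Iterable[float],
    window: int = 5,
    threshold: float = 0.8,
    cooldown: int = 10,
) -> Iterator[int]:
    """Yield frame indices at which the smoothed keyword score reaches the threshold.

    frame_scores: hypothetical per-frame keyword probabilities in [0, 1],
    assumed to come from some on-device acoustic model (not shown here).
    """
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    blocked_until = -1  # suppresses repeated triggers for the same utterance
    for i, score in enumerate(frame_scores):
        recent.append(score)
        smoothed = sum(recent) / len(recent)  # moving average over the window
        if len(recent) == window and smoothed >= threshold and i > blocked_until:
            blocked_until = i + cooldown
            yield i


# Made-up scores for illustration: the keyword is "spoken" around frames 3 to 7.
scores = [0.1, 0.1, 0.2, 0.9, 0.95, 0.9, 0.92, 0.97, 0.3, 0.1]
triggers = list(detect_keyword(scores, window=3, threshold=0.9))
print(triggers)
```

Processing frames one at a time with a fixed-size buffer keeps memory use constant, which fits the on-device and low-resource requirements named above. A longer window reduces false activations from noisy frames but adds to the delay between the key phrase and activation.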