Voice Command Recognition with Deep Neural Network on Edge Devices

Md Naim Miah, Purdue University

Abstract

Internet of Things (IoT) devices, i.e., network-interconnected devices, are growing rapidly with advances in wireless networking technology; their number is expected to exceed 125 billion by 2030 [1]. These devices connect a massive number of sensors to cloud data centers, and handling the resulting data is gradually becoming a serious challenge for the cloud: it demands enormous computational power, and the growing data traffic is approaching the limits of global network capacity. This is especially true for applications that require continuous monitoring, such as keyword spotting (KWS) on speech data. Such data may include personal information, so sending it to the cloud raises serious privacy concerns.

Edge computing has proven to be an effective way out of this problem. Instead of sending data to the cloud, the edge device performs the computation itself, thereby overcoming the issues of bandwidth cost, privacy, and scalability [2], [3]. However, edge devices often suffer from limited computation and storage capability, while still having to deliver highly accurate outputs in real time. KWS on edge devices has already proven very useful for interacting with electronic devices such as Google Home and Amazon Echo: only after detecting a keyword, such as "Okay Google" or "Alexa," do these devices typically go online or record speech data and send it to the cloud. KWS is also popular for interacting with automated vehicles. Because of the unpredictable nature of cellular networks, a vehicle cannot maintain a connection to cloud servers at all times; since KWS needs no internet connectivity, it can interact with the vehicle regardless. To be deployed in the real world, these devices must also be robust and noise resistant.

Deep Neural Networks (DNNs) have shown very high accuracy in complex applications like these. Since KWS mainly deals with time-series data, recurrent neural networks (RNNs) respond very well to this type of application, but they need high computation power and storage: an RNN neuron requires eight times the weights and complexity of a standard Convolutional Neural Network (CNN) neuron [4]. The speech signal, as it emerges from a speaker's mouth, nose, and cheeks, is a one-dimensional function of air pressure varying with time; feature extraction, however, turns it into a representation that can be treated as a single-channel image, so CNNs can achieve very high efficiency on it as well. A common feature-extraction front end is sketched below.
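As a concrete illustration of this front end (a minimal sketch in Python, not code from the dissertation), the snippet below converts one second of audio into a single-channel MFCC "image." It assumes the librosa library; the file name and parameter values are illustrative assumptions.

import librosa
import numpy as np

# Load one second of 16 kHz audio (a typical KWS input length).
waveform, sample_rate = librosa.load("keyword_example.wav", sr=16000, duration=1.0)

# 40 MFCC coefficients over 30 ms windows with a 20 ms stride.
mfcc = librosa.feature.mfcc(
    y=waveform,
    sr=sample_rate,
    n_mfcc=40,
    n_fft=480,       # 30 ms window at 16 kHz
    hop_length=320,  # 20 ms stride at 16 kHz
)

# Add a leading channel axis: (1, 40, ~50) -- a single-channel 2-D
# "image" of coefficients x time frames, ready for a CNN.
features = mfcc[np.newaxis, ...]
print(features.shape)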
Research on CNN-based KWS is ongoing and its efficiency keeps improving, which opens up the possibility of implementing KWS successfully on edge devices; DNNs have already been deployed on such devices for KWS with steadily rising accuracy. In [5], [6], the authors provide a comprehensive study of different neural networks for KWS, such as CNN, LSTM, and RNN, comparing them in terms of accuracy, computational complexity, and memory footprint. They also implemented the models on embedded hardware, a 32-bit ARM microcontroller, and achieved the best performance with a CNN-based network, the Depthwise Separable Convolutional Neural Network (DSCNN), a modified version of MobileNet. In [7], the authors demonstrated that MobileNet is very efficient at classifying 2-D spatial data, such as images, with a huge reduction in computational requirements, which makes it a good choice for resource-constrained devices. The building block behind this reduction is sketched below.
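For illustration (a minimal PyTorch sketch under assumed layer sizes, not the dissertation's actual architecture), the block below factors a standard convolution into a depthwise convolution followed by a pointwise (1x1) convolution, the core of MobileNet-style DSCNN models.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=kernel_size // 2,
            groups=in_channels, bias=False,
        )
        # Pointwise: a 1 x 1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# Example: a 64-channel feature map derived from an MFCC "image"
# (batch, channels, coefficients, frames). Sizes are illustrative.
block = DepthwiseSeparableConv(in_channels=64, out_channels=64)
x = torch.randn(1, 64, 10, 25)
print(block(x).shape)  # torch.Size([1, 64, 10, 25])

Relative to a standard k x k convolution with C_out output channels, this factorization reduces the multiply-accumulate count by roughly a factor of 1/C_out + 1/k^2, which is the source of MobileNet's computational savings.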

Degree

M.Sc.

Advisors

Wang, Purdue University.

Subject Area

Artificial intelligence|Computer science|Electrical engineering|Information Technology|Mathematics|Web Studies
