First published: 2017/11/20 (7 years ago) Abstract: Keyword spotting (KWS) is a critical component for enabling speech based user
interactions on smart devices. It requires real-time response and high accuracy
for good user experience. Recently, neural networks have become an attractive
choice for KWS architecture because of their superior accuracy compared to
traditional speech processing algorithms. Due to its always-on nature, KWS
application has highly constrained power budget and typically runs on tiny
microcontrollers with limited memory and compute capability. The design of
neural network architecture for KWS must consider these constraints. In this
work, we perform neural network architecture evaluation and exploration for
running KWS on resource-constrained microcontrollers. We train various neural
network architectures for keyword spotting published in literature to compare
their accuracy and memory/compute requirements. We show that it is possible to
optimize these neural network architectures to fit within the memory and
compute constraints of microcontrollers without sacrificing accuracy. We
further explore the depthwise separable convolutional neural network (DS-CNN)
and compare it against other neural network architectures. DS-CNN achieves an
accuracy of 95.4%, which is ~10% higher than the DNN model with similar number
of parameters.