First published: 2017/11/20 (5 years ago) Abstract: Keyword spotting (KWS) is a critical component for enabling speech based user
interactions on smart devices. It requires real-time response and high accuracy
for good user experience. Recently, neural networks have become an attractive
choice for KWS architecture because of their superior accuracy compared to
traditional speech processing algorithms. Due to its always-on nature, KWS
application has highly constrained power budget and typically runs on tiny
microcontrollers with limited memory and compute capability. The design of
neural network architecture for KWS must consider these constraints. In this
work, we perform neural network architecture evaluation and exploration for
running KWS on resource-constrained microcontrollers. We train various neural
network architectures for keyword spotting published in literature to compare
their accuracy and memory/compute requirements. We show that it is possible to
optimize these neural network architectures to fit within the memory and
compute constraints of microcontrollers without sacrificing accuracy. We
further explore the depthwise separable convolutional neural network (DS-CNN)
and compare it against other neural network architectures. DS-CNN achieves an
accuracy of 95.4%, which is ~10% higher than the DNN model with similar number
- Result of thourough research which not only covers major research, but also compares under same criteria/ dataset; This is also a great survey.
- Train on 32-bit FP model, run 8-bit model. No retraining required to convert to 8-bit w/o loss in accuracy.
- Provides comparison concerning computing resource, it's useful to design for typical (ARM) microcontroller systems.
- MobileNet inspired DS-CNN runs small and accurate, achieves the best accuracies of 94.4% ~ 95.4%. Maybe SOTA.
- Apatche licensed code/ pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU.