First published: 2017/12/28 (5 years ago) Abstract: Neural network training relies on our ability to find "good" minimizers of
highly non-convex loss functions. It is well-known that certain network
architecture designs (e.g., skip connections) produce loss functions that train
easier, and well-chosen training parameters (batch size, learning rate,
optimizer) produce minimizers that generalize better. However, the reasons for
these differences, and their effects on the underlying loss landscape, are not
well understood. In this paper, we explore the structure of neural loss
functions, and the effect of loss landscapes on generalization, using a range
of visualization methods. First, we introduce a simple "filter normalization"
method that helps us visualize loss function curvature and make meaningful
side-by-side comparisons between loss functions. Then, using a variety of
visualizations, we explore how network architecture affects the loss landscape,
and how training parameters affect the shape of minimizers.
- Presents a simple visualization method based on “filter normalization.”
- Observed that __the deeper networks become, neural loss landscapes become more chaotic__; causes a dramatic drop in generalization error, and ultimately to a lack of trainability.
- Observed that __skip connections promote flat minimizers and prevent the transition to chaotic behavior__; helps explain why skip connections are necessary for training extremely deep networks.
- Quantitatively measures non-convexity.
- Studies the visualization of SGD optimization trajectories.
While preserving accuracy,
- Network architecture improvement decreases parameters 51X (240MB to 4.8MB).
- By using Deep Compression, parameters shrinks more 10X more (4.8MB to 0.47MB).
Even improves more accuracy for about 2% by using Simple Bypass (shortcut connection).
They show insightful architectural design strategies;
1. Less 3x3 filters to decrease size,
2. Decrease input channels also to decrease size,
3. Downsample late to have larger activation maps to lead to higher accuracy.
And great insights about CNN design space exploration by parametrize microarchitecture,
- Squeeze Ratio to find good balance between weight size and accuracy.
- 3x3 filter percentage to find enough number of it.
First published: 2017/11/20 (5 years ago) Abstract: Keyword spotting (KWS) is a critical component for enabling speech based user
interactions on smart devices. It requires real-time response and high accuracy
for good user experience. Recently, neural networks have become an attractive
choice for KWS architecture because of their superior accuracy compared to
traditional speech processing algorithms. Due to its always-on nature, KWS
application has highly constrained power budget and typically runs on tiny
microcontrollers with limited memory and compute capability. The design of
neural network architecture for KWS must consider these constraints. In this
work, we perform neural network architecture evaluation and exploration for
running KWS on resource-constrained microcontrollers. We train various neural
network architectures for keyword spotting published in literature to compare
their accuracy and memory/compute requirements. We show that it is possible to
optimize these neural network architectures to fit within the memory and
compute constraints of microcontrollers without sacrificing accuracy. We
further explore the depthwise separable convolutional neural network (DS-CNN)
and compare it against other neural network architectures. DS-CNN achieves an
accuracy of 95.4%, which is ~10% higher than the DNN model with similar number
- Result of thourough research which not only covers major research, but also compares under same criteria/ dataset; This is also a great survey.
- Train on 32-bit FP model, run 8-bit model. No retraining required to convert to 8-bit w/o loss in accuracy.
- Provides comparison concerning computing resource, it's useful to design for typical (ARM) microcontroller systems.
- MobileNet inspired DS-CNN runs small and accurate, achieves the best accuracies of 94.4% ~ 95.4%. Maybe SOTA.
- Apatche licensed code/ pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU.
First published: 2017/10/25 (5 years ago) Abstract: Large deep neural networks are powerful, but exhibit undesirable behaviors
such as memorization and sensitivity to adversarial examples. In this work, we
propose mixup, a simple learning principle to alleviate these issues. In
essence, mixup trains a neural network on convex combinations of pairs of
examples and their labels. By doing so, mixup regularizes the neural network to
favor simple linear behavior in-between training examples. Our experiments on
the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show
that mixup improves the generalization of state-of-the-art neural network
architectures. We also find that mixup reduces the memorization of corrupt
labels, increases the robustness to adversarial examples, and stabilizes the
training of generative adversarial networks.
Very efficient data augmentation method. Linear-interpolate training set x and y randomly at every epoch.
for (x1, y1), (x2, y2) in zip(loader1, loader2):
lam = numpy.random.beta(alpha, alpha)
x = Variable(lam * x1 + (1. - lam) * x2)
y = Variable(lam * y1 + (1. - lam) * y2)
- ERM (Empirical Risk Minimization) is $\alpha = 0$ version of mixup, i.e. not using mixup.
- Reduces the memorization of corrupt labels.
- Increases robustness to adversarial examples.
- Stabilizes the training of GAN.
First published: 2018/02/08 (5 years ago) Abstract: With the popularity of deep learning (DL), artificial intelligence (AI) has
been applied in many areas of human life. Neural network or artificial neural
network (NN), the main technique behind DL, has been extensively studied to
facilitate computer vision and natural language recognition. However, the more
we rely on information technology, the more vulnerable we are. That is,
malicious NNs could bring huge threat in the so-called coming AI era. In this
paper, for the first time in the literature, we propose a novel approach to
design and insert powerful neural-level trojans or PoTrojan in pre-trained NN
models. Most of the time, PoTrojans remain inactive, not affecting the normal
functions of their host NN models. PoTrojans could only be triggered in very
rare conditions. Once activated, however, the PoTrojans could cause the host NN
models to malfunction, either falsely predicting or classifying, which is a
significant threat to human society of the AI era. We would explain the
principles of PoTrojans and the easiness of designing and inserting them in
pre-trained deep learning models. PoTrojans doesn't modify the existing
architecture or parameters of the pre-trained models, without re-training.
Hence, the proposed method is very efficient.
First published: 2016/12/25 (6 years ago) Abstract: We introduce YOLO9000, a state-of-the-art, real-time object detection system
that can detect over 9000 object categories. First we propose various
improvements to the YOLO detection method, both novel and drawn from prior
work. The improved model, YOLOv2, is state-of-the-art on standard detection
tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At
40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like
Faster RCNN with ResNet and SSD while still running significantly faster.
Finally we propose a method to jointly train on object detection and
classification. Using this method we train YOLO9000 simultaneously on the COCO
detection dataset and the ImageNet classification dataset. Our joint training
allows YOLO9000 to predict detections for object classes that don't have
labelled detection data. We validate our approach on the ImageNet detection
task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite
only having detection data for 44 of the 200 classes. On the 156 classes not in
COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes;
it predicts detections for more than 9000 different object categories. And it
still runs in real-time.
YOLOv2 is improved YOLO;
- can change image size for varying tradeoff between speed and accuracy;
- uses anchor boxes to predict bounding boxes;
- overcomes localization errors and lower recall not by bigger nor ensemble but using variety of ideas from past work (batch normalization, multi-scaling and etc) to keep the network simple and fast;
- "With batch nor-malization we can remove dropout from the model without overfitting"
- gets 78.6 mAP at 40 FPS.
- uses WordTree representation which enables multi-label classification as well as making classification dataset also applicable to detection;
- is a model trained simultaneously both for detection on COCO and classification on ImageNet;
- is validated for detecting not labeled object classes;
- detects more than 9000 different object classes in real-time.