First published: 2016/08/11
Abstract: Recent years have seen tremendous progress in still-image segmentation;
however the naïve application of these state-of-the-art algorithms to every
video frame requires considerable computation and ignores the temporal
continuity inherent in video. We propose a video recognition framework that
relies on two key observations: 1) while pixels may change rapidly from frame
to frame, the semantic content of a scene evolves more slowly, and 2) execution
can be viewed as an aspect of architecture, yielding purpose-fit computation
schedules for networks. We define a novel family of "clockwork" convnets driven
by fixed or adaptive clock signals that schedule the processing of different
layers at different update rates according to their semantic stability. We
design a pipeline schedule to reduce latency for real-time recognition and a
fixed-rate schedule to reduce overall computation. Finally, we extend clockwork
scheduling to adaptive video processing by incorporating data-driven clocks
that can be tuned on unlabeled video. The accuracy and efficiency of clockwork
convnets are evaluated on the Youtube-Objects, NYUD, and Cityscapes video
From one video frame to the next, the pixels may change a lot, but the semantic content changes slowly. In the neural network, this means the outputs of shallow layers change more than the outputs of deeper layers.
So the authors use a clock to decide whether to update the deeper layers or simply reuse their output from the previous frame.
The clock is triggered by the difference between the outputs of some layer on the previous and current frames. The condition for firing the clock can be fixed or learned.
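
To make the scheduling idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class `ClockworkNet`, the helper `mean_abs_diff`, the toy `shallow` and `deep` stages, and the `period` and `threshold` parameters are all hypothetical names chosen for illustration. It assumes a network split into just two stages, a fixed clock that fires every `period` frames, and an adaptive clock that fires when the mean absolute difference between the shallow features of consecutive frames exceeds a hand-set `threshold`.

```python
from typing import Callable, List, Optional

Vector = List[float]


def mean_abs_diff(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class ClockworkNet:
    """Toy two-stage network whose deep stage runs only when the clock fires."""

    def __init__(
        self,
        shallow: Callable[[Vector], Vector],
        deep: Callable[[Vector], Vector],
        period: Optional[int] = None,       # fixed clock: fire every `period` frames
        threshold: Optional[float] = None,  # adaptive clock: fire on large feature change
    ) -> None:
        if (period is None) == (threshold is None):
            raise ValueError("set exactly one of `period` or `threshold`")
        self.shallow = shallow
        self.deep = deep
        self.period = period
        self.threshold = threshold
        self.t = 0                                # frame counter
        self.prev_feat: Optional[Vector] = None   # shallow features of previous frame
        self.cached: Optional[Vector] = None      # last deep output
        self.deep_updates = 0                     # how often the deep stage ran

    def _clock_fires(self, feat: Vector) -> bool:
        if self.cached is None:
            return True  # nothing to reuse on the first frame
        if self.period is not None:
            return self.t % self.period == 0
        return mean_abs_diff(feat, self.prev_feat) > self.threshold

    def step(self, frame: Vector) -> Vector:
        feat = self.shallow(frame)  # shallow layers run on every frame
        if self._clock_fires(feat):
            self.cached = self.deep(feat)  # recompute the deep layers
            self.deep_updates += 1
        # otherwise the cached deep output from an earlier frame is reused
        self.prev_feat = feat
        self.t += 1
        return self.cached


def shallow(frame: Vector) -> Vector:
    """Stand-in for the shallow layers (assumption: a simple scaling)."""
    return [2.0 * x for x in frame]


def deep(feat: Vector) -> Vector:
    """Stand-in for the expensive deep layers (assumption: a simple sum)."""
    return [sum(feat)]


# Frames 0-1 and frames 2-3 are near-duplicates; the scene changes at frame 2.
frames = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
adaptive = ClockworkNet(shallow, deep, threshold=1.0)
outputs = [adaptive.step(f) for f in frames]
```

On these four toy frames the adaptive clock fires only on the first frame and at the scene change, so the deep stage runs twice and its cached output is reused for the other two frames. Comparing consecutive frames follows the description above; comparing against the features at the last update instead would be an equally plausible design.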