SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition
Necati Cihan Camgöz, Simon Hadfield, Oscar Koller, Richard Bowden
International Conference on Computer Vision (ICCV) - 2017
This paper tackles the challenging tasks of hand shape recognition and continuous Sign Language Recognition (SLR) directly from images captured by a common RGB camera (rather than relying on depth sensors such as the Kinect). The basic idea is a network that is end-to-end trainable, mapping input sequences (images) to output sequences (hand shape labels, word labels). The network is composed of three parts (sketched in code after the figure below):
- A CNN as a frame-level feature extractor
- Bidirectional LSTMs (BLSTMs) for temporal modeling
- Connectionist Temporal Classification (CTC) as the loss layer
![Network structure](https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/3269d3541f0eec006aee6ce086db2665b7ded92d/1-Figure1-1.png)
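To make the pipeline concrete, here is a minimal PyTorch sketch of a CNN + BLSTM + CTC model. Everything here is an illustrative assumption (the `SubUNetSketch` name, the ResNet-18 backbone, the hidden size, and the vocabulary sizes), not the authors' actual implementation:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SubUNetSketch(nn.Module):
    """Illustrative CNN -> BLSTM -> per-frame classifier, trained with CTC."""
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # assumed backbone, not the paper's CNN
        # Drop the classification head; keep the convolutional feature extractor.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.blstm = nn.LSTM(512, hidden_size, batch_first=True, bidirectional=True)
        # +1 output unit for the CTC blank symbol.
        self.classifier = nn.Linear(2 * hidden_size, num_classes + 1)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, 512)
        out, _ = self.blstm(feats.view(b, t, -1))          # (b, t, 2*hidden)
        return self.classifier(out).log_softmax(-1)        # per-frame log-probs

# CTC ties the frame-level predictions to unsegmented label sequences.
model = SubUNetSketch(num_classes=60)          # vocabulary size is an assumption
ctc = nn.CTCLoss(blank=60)                     # blank placed at the extra index
frames = torch.randn(2, 16, 3, 112, 112)       # dummy batch of 16-frame clips
log_probs = model(frames).permute(1, 0, 2)     # CTCLoss expects (time, batch, classes)
targets = torch.randint(0, 60, (2, 5))
loss = ctc(log_probs, targets,
           torch.full((2,), 16, dtype=torch.long),
           torch.full((2,), 5, dtype=torch.long))
```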
Results:
- Achieved state-of-the-art results (at the time of publication) on the "One-Million Hands" and "RWTH-PHOENIX-Weather-2014" datasets.
- Utilizing full-frame images rather than cropped hand patches provides better performance for continuous SLR.
- A network that recognizes hand shapes and a network that recognizes word sequences can be combined and trained jointly to recognize word sequences. Fine-tuning all layers of the combined system works better than freezing the "feature extraction" layers (see the sketch after this list).
- Combining two networks that were each pretrained on a separate task (hand shapes, word sequences) performs slightly better than combining two networks that were both trained on word sequences.
- Only marginal performance differences were observed across decoding and post-processing techniques for sequence-to-sequence prediction (a simple best-path decoder is sketched below).
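A hedged sketch of the combination finding, reusing the hypothetical `SubUNetSketch` class from the earlier snippet: one subnet pretrained on hand shapes and one on word sequences, then either fine-tuned end-to-end or trained with frozen feature extractors. The vocabulary sizes and learning rate are assumptions for illustration:

```python
import torch

# Assume these are pretrained on hand shapes and word sequences respectively.
hand_net = SubUNetSketch(num_classes=60)
word_net = SubUNetSketch(num_classes=1200)

# Option A (reported to work better): fine-tune all layers jointly.
all_params = list(hand_net.parameters()) + list(word_net.parameters())
optimizer = torch.optim.Adam(all_params, lr=1e-5)

# Option B (reported to work worse): freeze the CNN feature extractors and
# update only the recurrent and classification layers.
for net in (hand_net, word_net):
    for p in net.cnn.parameters():
        p.requires_grad = False
trainable = [p for net in (hand_net, word_net)
             for p in net.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```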
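For the decoding point, the simplest baseline is best-path (greedy) CTC decoding: take the per-frame argmax, collapse repeated labels, and drop blanks. A minimal sketch (the function name and `blank` index are assumptions):

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int) -> list:
    """Best-path CTC decoding for one sequence of shape (time, classes)."""
    path = log_probs.argmax(dim=-1).tolist()
    decoded, prev = [], blank
    for label in path:
        # Keep a label only when it is non-blank and differs from its predecessor.
        if label != blank and label != prev:
            decoded.append(label)
        prev = label
    return decoded

# e.g. with the earlier sketch: greedy_ctc_decode(model(frames)[0], blank=60)
```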