SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition
Necati Cihan Camgöz, Simon Hadfield, Oscar Koller, Richard Bowden
International Conference on Computer Vision (ICCV) - 2017
This paper tackles the challenging tasks of hand shape recognition and continuous Sign Language Recognition (SLR) directly from images captured by a common RGB camera (rather than relying on depth sensors such as the Kinect). The basic idea is a network that is end-to-end trainable, mapping input sequences (images) to output sequences (hand shape labels, word labels). The network is composed of three parts (sketched in code after the figure below):
- A CNN as a frame-level feature extractor
- Bidirectional LSTMs (BLSTMs) for temporal modeling
- Connectionist Temporal Classification (CTC) as the loss layer
![Network structure](https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/3269d3541f0eec006aee6ce086db2665b7ded92d/1-Figure1-1.png)
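To make the pipeline concrete, here is a minimal PyTorch sketch of a CNN + BLSTM + CTC model. Everything here is an illustrative assumption (the `SubUNetSketch` name, the ResNet-18 backbone, the hidden size, and the vocabulary sizes), not the authors' actual implementation:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SubUNetSketch(nn.Module):
    """Illustrative CNN -> BLSTM -> per-frame classifier, trained with CTC."""
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # assumed backbone, not the paper's CNN
        # Drop the classification head; keep the convolutional feature extractor.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.blstm = nn.LSTM(512, hidden_size, batch_first=True, bidirectional=True)
        # +1 output unit for the CTC blank symbol.
        self.classifier = nn.Linear(2 * hidden_size, num_classes + 1)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, 512)
        out, _ = self.blstm(feats.view(b, t, -1))          # (b, t, 2*hidden)
        return self.classifier(out).log_softmax(-1)        # per-frame log-probs

# CTC ties the frame-level predictions to unsegmented label sequences.
model = SubUNetSketch(num_classes=60)          # vocabulary size is an assumption
ctc = nn.CTCLoss(blank=60)                     # blank placed at the extra index
frames = torch.randn(2, 16, 3, 112, 112)       # dummy batch of 16-frame clips
log_probs = model(frames).permute(1, 0, 2)     # CTCLoss expects (time, batch, classes)
targets = torch.randint(0, 60, (2, 5))
loss = ctc(log_probs, targets,
           torch.full((2,), 16, dtype=torch.long),
           torch.full((2,), 5, dtype=torch.long))
```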
Results:
- Achieved state-of-the-art results (at the time of publication) on the "One-Million Hands" and "RWTH-PHOENIX-Weather-2014" datasets.
- Utilizing full-frame images rather than cropped hand patches provides better performance for continuous SLR.
- A network that recognizes hand shapes and a network that recognizes word sequences can be combined and trained jointly to recognize word sequences. Fine-tuning all layers of the combined system works better than freezing the "feature extraction" layers (see the sketch after this list).
- Combining two networks that were each pretrained on a separate task (hand shapes, word sequences) performs slightly better than combining two networks that were both trained on word sequences.
- Only marginal performance differences were observed across decoding and post-processing techniques for sequence-to-sequence prediction (a simple best-path decoder is sketched below).
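A hedged sketch of the combination finding, reusing the hypothetical `SubUNetSketch` class from the earlier snippet: one subnet pretrained on hand shapes and one on word sequences, then either fine-tuned end-to-end or trained with frozen feature extractors. The vocabulary sizes and learning rate are assumptions for illustration:

```python
import torch

# Assume these are pretrained on hand shapes and word sequences respectively.
hand_net = SubUNetSketch(num_classes=60)
word_net = SubUNetSketch(num_classes=1200)

# Option A (reported to work better): fine-tune all layers jointly.
all_params = list(hand_net.parameters()) + list(word_net.parameters())
optimizer = torch.optim.Adam(all_params, lr=1e-5)

# Option B (reported to work worse): freeze the CNN feature extractors and
# update only the recurrent and classification layers.
for net in (hand_net, word_net):
    for p in net.cnn.parameters():
        p.requires_grad = False
trainable = [p for net in (hand_net, word_net)
             for p in net.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```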
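For the decoding point, the simplest baseline is best-path (greedy) CTC decoding: take the per-frame argmax, collapse repeated labels, and drop blanks. A minimal sketch (the function name and `blank` index are assumptions):

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int) -> list:
    """Best-path CTC decoding for one sequence of shape (time, classes)."""
    path = log_probs.argmax(dim=-1).tolist()
    decoded, prev = [], blank
    for label in path:
        # Keep a label only when it is non-blank and differs from its predecessor.
        if label != blank and label != prev:
            decoded.append(label)
        prev = label
    return decoded

# e.g. with the earlier sketch: greedy_ctc_decode(model(frames)[0], blank=60)
```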