This paper tackles the challenging task of hand shape recognition and continuous Sign Language Recognition (SLR) directly from images captured by a common RGB camera (rather than relying on motion sensors such as the Kinect). The core idea is an end-to-end trainable network that maps input sequences (images) to output sequences (hand shape labels, word labels). The network is composed of three parts:
- CNN as a feature extractor
- Bidirectional LSTMs for temporal modeling
- Connectionist Temporal Classification as a loss layer
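The three-part pipeline above can be sketched as follows. This is a minimal illustrative implementation in PyTorch, not the paper's exact architecture: the layer sizes, kernel shapes, and hidden dimensions are assumptions chosen only to make the example self-contained and runnable.

```python
import torch
import torch.nn as nn

class CNNBLSTMCTC(nn.Module):
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        # Per-frame CNN feature extractor (illustrative sizes, not the paper's CNN)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Bidirectional LSTM models temporal structure over frame features
        self.blstm = nn.LSTM(32, hidden, bidirectional=True, batch_first=True)
        # Linear layer maps to class scores; index 0 is reserved as the CTC blank
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames):  # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, 32)
        out, _ = self.blstm(feats.view(b, t, -1))
        return self.fc(out).log_softmax(-1)                # (b, t, classes)

model = CNNBLSTMCTC(num_classes=10)
frames = torch.randn(2, 12, 3, 32, 32)       # two clips of 12 frames each
log_probs = model(frames).transpose(0, 1)    # CTCLoss expects (T, B, C)
targets = torch.randint(1, 10, (2, 5))       # label sequences (no blanks)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 12),
                           target_lengths=torch.full((2,), 5))
```

The CTC loss is what makes the whole stack end-to-end trainable without frame-level alignments: it marginalizes over all alignments of the label sequence to the frame sequence.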
The main findings:
- State-of-the-art results (at the time of publication) on the "One-Million Hands" and "RWTH-PHOENIX-Weather-2014" datasets.
- Using full images rather than cropped hand patches yields better performance for continuous SLR.
- A hand-shape recognition network and a word-sequence recognition network can be combined and trained jointly to recognize word sequences. Fine-tuning the combined system across all layers works better than freezing the "feature extraction" layers.
- Combining two networks, each pretrained on its own task, performs slightly better than training both networks on word sequences.
- Only marginal performance differences were observed across decoding and post-processing techniques for sequence-to-sequence prediction.
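As a concrete example of the simplest decoding variant in such comparisons, best-path (greedy) CTC decoding picks the argmax class at each frame, collapses consecutive repeats, then drops blanks. This pure-Python sketch is an illustration of that standard technique, not code from the paper:

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Best-path CTC decoding.

    frame_probs: list of per-frame class-probability lists.
    Returns the decoded label sequence.
    """
    # Argmax class per frame
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], None
    for label in path:
        # Collapse consecutive repeats, then drop the blank symbol
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Frame-wise argmaxes here are: blank, 2, 2, blank, 1
probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
]
print(ctc_greedy_decode(probs))  # → [2, 1]
```

Beam-search decoding (optionally with a language model) explores multiple label hypotheses instead of the single best path; the review's point is that, for this task, such refinements changed performance only marginally.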