Fingerspelling, in which words are spelled out letter by letter, is a crucial part of American Sign Language (ASL). Traditionally, automatic fingerspelling recognition has relied on pre-identified fingerspelling segments in signing videos. In this paper, we tackle the challenge of detecting fingerspelling in raw, untrimmed sign language video, a vital step towards practical fingerspelling recognition systems. We introduce a benchmark and a set of evaluation metrics, some of which assess the impact of detection on downstream fingerspelling recognition. Additionally, we propose a new model that employs multi-task learning, combining fingerspelling detection with pose estimation and fingerspelling transcription. This model is tested against several alternatives and consistently outperforms them across all metrics, setting a new state of the art on the benchmark.
We also propose a novel combination of models and evaluation methods, aiming for integration into AR glasses. The approach uses a state-of-the-art pose estimation model, such as OpenPose, to localize hand and finger positions; a convolutional neural network (CNN) to detect fingerspelling regions; and a recurrent neural network (RNN) to transcribe the detected fingerspelling into text. We will evaluate this combined approach to optimize performance, with the goal of deploying it on AR glasses for real-time fingerspelling recognition and translation. A minimal sketch of the pipeline is shown below.
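To make the proposed combination concrete, the following PyTorch sketch wires the pieces together. It assumes hand and finger keypoints have already been extracted per frame (e.g., by OpenPose) and are supplied as a tensor; the class and module names (`FingerspellingPipeline`, `detector`, `encoder`), layer sizes, and the CTC-style letter head are illustrative assumptions, not the final AR-glasses implementation.

```python
# Minimal sketch of the proposed detection + transcription pipeline, assuming
# per-frame keypoints are precomputed (e.g., by OpenPose) as a (B, T, K) tensor.
import torch
import torch.nn as nn

class FingerspellingPipeline(nn.Module):
    def __init__(self, keypoint_dim=42 * 2, hidden=128, num_letters=26):
        super().__init__()
        # 1D CNN over the keypoint sequence: scores each frame as fingerspelling or not.
        self.detector = nn.Sequential(
            nn.Conv1d(keypoint_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=5, padding=2),
        )
        # Recurrent transcriber: maps the same features to per-frame letter posteriors
        # (a CTC-style head, hence the extra blank class).
        self.encoder = nn.LSTM(keypoint_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_letters + 1)

    def forward(self, keypoints):                         # keypoints: (B, T, keypoint_dim)
        x = keypoints.transpose(1, 2)                     # (B, keypoint_dim, T) for Conv1d
        detection_logits = self.detector(x).squeeze(1)    # (B, T) per-frame detection score
        feats, _ = self.encoder(keypoints)                # (B, T, 2 * hidden)
        letter_logits = self.classifier(feats)            # (B, T, num_letters + 1)
        return detection_logits, letter_logits

# Example: one 100-frame clip with 42 keypoints (x, y) per frame.
model = FingerspellingPipeline()
det_scores, letter_logits = model(torch.randn(1, 100, 84))
```

Keeping detection and transcription as two heads over shared keypoint features mirrors the multi-task framing above and keeps the model small enough to be a plausible starting point for on-device inference.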
Sign languages, such as American Sign Language (ASL), are natural languages conveyed through movements of the hands, face, and upper body. Automatic processing of sign languages can greatly enhance communication between deaf and hearing individuals, but it presents several challenges. For instance, sign languages do not have a standard written form, making automatic transcription into a written language like English a translation task. Moreover, sign language gestures are often coarticulated and do not appear in their canonical forms.
In this project, we focus specifically on fingerspelling, where words are signed letter by letter, each letter represented by a distinct handshape or trajectory corresponding to the alphabet of a written language (e.g., the English alphabet for ASL fingerspelling). Fingerspelling serves various purposes, such as for words without their own signs (like proper nouns, technical terms, and abbreviations), as well as for emphasis or expediency. In ASL, fingerspelling is particularly prevalent, estimated to account for 12% to 35% of signing, more than in most other sign languages. Because important content words are often fingerspelled, automatic fingerspelling recognition can facilitate practical tasks like search and retrieval in ASL media.
Unlike the broader task of translating sign language into written language, fingerspelling recognition involves transcribing into a limited set of symbols, maintaining a direct correspondence with the written form. Linguistically, fingerspelling is distinct from other elements of ASL, such as lexical signs and classifiers, suggesting that a dedicated model for fingerspelling would be beneficial. The role of fingerspelling transcription in ASL to English translation is similar to that of transliteration in written language translation. Therefore, even as more general ASL processing methods are developed, having specialized modules for fingerspelling detection and recognition will remain advantageous.
Our project is focused solely on fingerspelling, with the aim of developing a model that runs on AR glasses. It combines the detection and recognition components described above to provide real-time fingerspelling recognition and translation, enhancing communication for the wearer.
My initial approach used Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks: CNNs extracted features from the raw video frames, and LSTMs captured temporal dependencies in the sequential data. These models provided a baseline but did not achieve the desired balance between accuracy and computational efficiency. The average word error rate (WER) with this approach was around 25%, indicating substantial room for improvement.
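For reference, WER is the edit (Levenshtein) distance between the predicted and reference word sequences, divided by the reference length. The short Python sketch below only illustrates how that metric is computed under this standard definition; it is not tied to any particular model, and the 25% figure above comes from the experiments described in the text, not from this code.

```python
# Standard WER: word-level edit distance normalized by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quack brown fox"))  # 0.25
```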
Recognizing the limitations of the CNN and LSTM models, I explored more advanced sequence-modeling methods drawn from automatic speech recognition (ASR), focusing on Connectionist Temporal Classification (CTC) and attention mechanisms. These techniques are well regarded in the ASR community for their ability to handle complex sequence-to-sequence tasks.
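The snippet below is a minimal illustration of how CTC can be applied to per-frame letter logits, using PyTorch's built-in nn.CTCLoss. The tensor sizes, alphabet size, and random inputs are toy assumptions for the sake of a self-contained example, not the project's actual configuration.

```python
# Toy example: CTC loss over per-frame letter log-probabilities.
import torch
import torch.nn as nn

T, B, C = 100, 2, 28          # frames, batch size, 26 letters + apostrophe + CTC blank
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)   # (T, B, C), as nn.CTCLoss expects
targets = torch.randint(1, C, (B, 10))                 # letter indices; 0 is reserved for blank
input_lengths = torch.full((B,), T, dtype=torch.long)  # frames per clip
target_lengths = torch.full((B,), 10, dtype=torch.long)  # letters per transcript

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

The appeal of CTC here is that it learns the frame-to-letter alignment implicitly, so the fingerspelling data only needs letter-sequence labels rather than per-frame letter annotations.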