1. Abstract

Communication barriers significantly impact the daily lives of the Deaf or Hard of Hearing (DHH) community, isolating nearly 70 million sign language users globally. In the United States alone, 33 babies are born with permanent hearing loss every day, most of them to parents who are unfamiliar with American Sign Language (ASL). This project introduces a transformative solution: an augmented reality (AR) glasses system embedded with advanced artificial intelligence (AI) that translates ASL into text in real time. This technology aims to facilitate instant and accessible communication between the DHH community and hearing individuals, thereby reducing the psychological and social isolation experienced by many DHH individuals.

The AR glasses are designed to recognize and translate the complex hand shapes and motions of ASL fingerspelling, which is notably used to convey essential personal and navigational information quickly and efficiently. Studies suggest that ASL fingerspelling can reach speeds of 57 words per minute, surpassing the average typing speed on mobile devices by a significant margin. By leveraging this speed and efficiency, the AR glasses can provide a seamless and natural mode of communication for the DHH community, mimicking the fluidity of spoken language in visual form.

The proposed system addresses the critical period of language acquisition in deaf babies, who are at risk of Language Deprivation Syndrome due to the lack of early exposure to a natural language. By providing a tool that parents and educators can use to comprehend and interact using ASL, the AR glasses aim to support more effective language development and learning outcomes in deaf children.

Moreover, this project aligns with broader global initiatives to make technology universally accessible and useful. Collaborating with entities like Google and the Deaf Professional Arts Network, the project seeks to expand its impact by exploring scalable AI solutions for various sign languages, enhancing individual user experiences and fostering inclusion in digital and physical spaces. This AR glasses project not only promises to revolutionize how the DHH community interacts with the hearing world but also sets a precedent for future technologies that embrace and enhance diversity in human communication.

2. Introduction

ASR Algorithms

I had a deep interest in ASR (automatic speech recognition), and after recognizing its relevance to this competition, I studied and experimented with various ASR algorithms. While I couldn't find anything clearly superior to the widely used vanilla CTC among the many algorithms I tried, I learned a lot through the experiments, and I will mainly share those insights here.

As I began studying ASR for this competition, I discovered from reading papers that current neural-network ASR approaches mainly fall into three families: CTC, attention-based encoder-decoder, and transducer models. I implemented all three (though there are more diverse algorithms out there). In conclusion, from a baseline performance perspective, all three were quite similar, and each has its own pros and cons. Here are the insights I gathered from implementing each algorithm:

Rough single-model inference times on the test dataset, measured in a Kaggle kernel, are as follows:

CTC greedy (~40 min) < Attention greedy = CTC beam search (~1 h 20 min) <= CTC-Attention joint greedy (~1 h 30 min)
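
The ranking above is driven mostly by the decoding strategy rather than the network itself: greedy CTC decoding is a single argmax pass over the frame-level outputs, with no hypothesis search. The sketch below shows minimal best-path CTC decoding; the blank index and the toy inputs are assumptions for illustration, not the competition's actual vocabulary.

```python
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank_id: int = 0) -> list[int]:
    """Best-path (greedy) CTC decoding.

    log_probs: array of shape (time_steps, vocab_size) holding per-frame
    log-probabilities from the model. Returns decoded token ids after
    collapsing consecutive repeats and removing blanks.
    """
    # One argmax per frame: this is the entire "search".
    best_path = log_probs.argmax(axis=-1)

    decoded = []
    prev = blank_id
    for token in best_path:
        # Keep a token only if it differs from the previous frame
        # (collapse repeats) and is not the blank symbol.
        if token != prev and token != blank_id:
            decoded.append(int(token))
        prev = token
    return decoded

# Toy example: 6 frames over a 4-symbol vocabulary (index 0 = blank).
rng = np.random.default_rng(0)
toy_log_probs = np.log(rng.dirichlet(np.ones(4), size=6))
print(ctc_greedy_decode(toy_log_probs))
```

Beam search keeps multiple hypotheses per frame, and attention (or joint) decoding runs an autoregressive decoder step per output token, which is where the extra wall-clock time in the comparison above comes from.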

CTC-Attention Joint Training & Decoding

As both CTC and Attention showed similar performance, I tried to find a method to utilize both techniques. I primarily referred to the following two papers:

What is CTC loss, how it works

Joint CTC-Attention Based End-To-End Speech Recognition Using Multi-Task Learning, Kim et al. 2017.
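
The core idea of the Kim et al. paper is multi-task training: a shared encoder feeds both a CTC head and an attention decoder, and the two losses are combined with a single interpolation weight. Below is a minimal PyTorch sketch of that objective under assumed tensor shapes; the function name, shapes, padding convention, and default weight are illustrative assumptions, not the model actually used in the competition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_ctc_attention_loss(
    encoder_log_probs: torch.Tensor,  # (T, N, V): log-softmax outputs of the CTC head
    decoder_logits: torch.Tensor,     # (N, U, V): attention decoder outputs
    targets: torch.Tensor,            # (N, U): target token ids, padded with pad_id
    input_lengths: torch.Tensor,      # (N,): valid encoder output lengths
    target_lengths: torch.Tensor,     # (N,): valid target lengths
    blank_id: int = 0,
    pad_id: int = -100,               # ignored by the attention cross-entropy
    ctc_weight: float = 0.3,          # lambda in: loss = lambda * ctc + (1 - lambda) * att
) -> torch.Tensor:
    # CTC branch: alignment-free loss over the encoder outputs.
    # Targets are passed as one concatenated 1-D tensor of the valid tokens.
    flat_targets = torch.cat([t[: int(l)] for t, l in zip(targets, target_lengths)])
    ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)(
        encoder_log_probs, flat_targets, input_lengths, target_lengths
    )

    # Attention branch: ordinary cross-entropy against the (padded) targets.
    att_loss = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )

    # Weighted sum of the two objectives (multi-task learning).
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```

The motivation given in the paper is that the CTC branch encourages monotonic alignment and helps the attention decoder converge; at inference the two scores can likewise be combined in a weighted fashion, which corresponds to the joint greedy decoding timed above.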