This figure illustrates how our prototype system translates an English sentence into an American Sign Language video. The top of the figure shows the example English text ("Do you commute to work by bike?"). Below it are the glosses and non-manual markers for the sentence: IX-2p, TEND, COMMUTE, WORK, and BICYCLE, with a notation indicating a yes/no question. The next row contains five diagrams representing skeletal poses; each shows a specific handshape and location corresponding to the glosses and non-manual markers, color-coded for clarity. The bottom row shows a person performing the ASL signs against a green background, with hand positions and facial expressions matching the glosses and non-manual markers above. Zoomed-in sections highlight the facial expressions associated with the yes/no question.

Towards AI-driven Sign Language Generation with Non-manual Markers

Han Zhang, Rotem Shalev-Arkushin, Vasileios Baltatzis, Connor Gillis, Gierad Laput, Raja Kushalnagar, Lorna Quandt, Leah Findlater, Abdelkareem Bedri, and Colin Lea. 2025. Towards AI-driven Sign Language Generation with Non-manual Markers. In Proceedings of the CHI Conference on Human Factors in Computing Systems.

Sign languages are essential for the Deaf and Hard-of-Hearing (DHH) community. Sign language generation systems have the potential to support communication by translating from written languages, such as English, into signed videos. However, current systems often fail to meet user needs due to poor translation of grammatical structures, the absence of facial cues and body language, and insufficient visual and motion fidelity. We address these challenges by building on recent advances in LLMs and video generation models to translate English sentences into natural-looking videos of an AI ASL signer. The text component of our model extracts information for the manual and non-manual components of ASL, which is then used to synthesize skeletal pose sequences and corresponding video frames. Findings from a user study with 30 DHH participants, together with thorough technical evaluations, demonstrate significant progress and identify the critical areas that must improve to meet user needs.
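
To make the pipeline concrete, below is a minimal, runnable sketch of the three stages the abstract describes: a text stage that extracts glosses and non-manual markers, a pose stage that synthesizes a skeletal sequence, and a video stage that renders frames. All names here (`GlossSequence`, `extract_glosses`, `synthesize_poses`, `render_frames`) and the stubbed logic are illustrative assumptions, not the authors' actual implementation.

```python
# A minimal sketch of the three-stage pipeline, with each model stubbed out.
# Everything here is illustrative; the real system uses an LLM and a video
# generation model in place of these hard-coded stand-ins.
from dataclasses import dataclass, field


@dataclass
class GlossSequence:
    """Intermediate representation: manual glosses plus non-manual markers."""
    english: str
    glosses: list[str]                  # manual signs, in signing order
    non_manual: list[str] = field(default_factory=list)  # e.g. facial cues


def extract_glosses(english: str) -> GlossSequence:
    # Stage 1 (stub): in the real system an LLM produces this translation.
    return GlossSequence(
        english=english,
        glosses=["IX-2p", "TEND", "COMMUTE", "WORK", "BICYCLE"],
        non_manual=["yes/no-question"],  # raised eyebrows, forward head tilt
    )


def synthesize_poses(seq: GlossSequence) -> list[dict]:
    # Stage 2 (stub): one skeletal keyframe per gloss; non-manual markers
    # modulate the face channel over the span they cover.
    return [{"gloss": g, "face": seq.non_manual} for g in seq.glosses]


def render_frames(poses: list[dict]) -> list[str]:
    # Stage 3 (stub): a video generation model would render photorealistic
    # frames conditioned on each pose; here we emit frame labels instead.
    return [f"frame<{p['gloss']}>" for p in poses]


if __name__ == "__main__":
    poses = synthesize_poses(extract_glosses("Do you commute to work by bike?"))
    print(render_frames(poses))
```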
