Machine learning for ASL translation

Advisor(s)

Karen Livescu, Bowen Shi

Location

Room A151

Start Date

26-4-2019 11:05 AM

End Date

26-4-2019 11:20 AM

Abstract

Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without explicit instructions, relying on patterns and inference instead. Machines learn by taking in large amounts of data and gradually adapting an artificial neural network that processes the data. Machine learning has been used in a wide variety of applications, including speech and language recognition and translation. Over the past few years, increased computational power has allowed machine translation built on machine learning methods to become accurate for many languages. However, for languages that are not widely used, machine translation models may not be as accurate. One such language is American Sign Language (ASL), used by about 300,000 people. ASL translation faces many of the same problems as translation between other languages, such as the lack of a large annotated dataset. It also has problems that machine translation between other languages does not: ASL has no spoken or written form, its grammatical structure differs from that of most spoken languages, and it borrows words from English through fingerspelling. Although recognition of individual signs has been accurate for some time, fingerspelling recognition has only recently begun to reach useful accuracy. The algorithms that do this are fairly data-hungry and have therefore been limited by the lack of a large dataset. However, many deaf news sites host numerous hours of video in ASL. To help collect a larger dataset, we are developing a machine learning algorithm to identify fingerspelling in videos. Our current approach is to process each frame through a convolutional neural network called VGG-16 for feature extraction, feed the results through one 3D convolutional layer, and then use a set of standard linear layers for prediction. Currently, however, this model severely underfits the data: it cannot capture the variance in the data and is not yet accurate enough to be used.
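
To make the architecture concrete, below is a minimal sketch of the described pipeline in PyTorch. It assumes a simple per-clip binary prediction (fingerspelling vs. not), and the layer sizes, pooling, and classifier head are illustrative assumptions rather than the exact model used in this work.

# Sketch of: per-frame VGG-16 features -> one 3D convolution -> linear layers.
# Hypothetical class and parameter choices; not the authors' exact implementation.
import torch
import torch.nn as nn
from torchvision import models

class FingerspellingDetector(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # VGG-16 convolutional backbone, applied independently to each frame.
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.frame_features = vgg.features  # (3, 224, 224) -> (512, 7, 7)
        # One 3D convolution over (time, height, width) to mix adjacent frames.
        self.temporal_conv = nn.Conv3d(512, 256, kernel_size=3, padding=1)
        # Standard linear layers for the final prediction.
        self.classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 3, 224, 224)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)                    # (b*t, 3, 224, 224)
        feats = self.frame_features(frames)            # (b*t, 512, 7, 7)
        feats = feats.view(b, t, 512, 7, 7)
        feats = feats.permute(0, 2, 1, 3, 4)           # (b, 512, t, 7, 7)
        feats = torch.relu(self.temporal_conv(feats))  # (b, 256, t, 7, 7)
        pooled = feats.mean(dim=(2, 3, 4))             # global average pool -> (b, 256)
        return self.classifier(pooled)

# Example: score a batch of two 16-frame clips.
model = FingerspellingDetector()
scores = model(torch.randn(2, 16, 3, 224, 224))        # (2, 2) class logits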
