Evaluating Machine Translation in Natural Language Processing
Advisor(s)
Dr. Ashique KhudaBukhsh; Carnegie Mellon University
Zirui Wang; Carnegie Mellon University
Discipline
Computer Science
Start Date
21-4-2021 10:05 AM
End Date
21-4-2021 10:20 AM
Abstract
Human evaluation of machine translation is extensive but expensive and inefficient, so automated evaluation metrics were created. Yet despite recent advances in machine translation, standard automated evaluation metrics such as BLEU have changed little. As the quality of machine translation systems has improved dramatically over the past decade, evaluation has become an increasingly challenging, and important, problem. The standardized use of automated evaluation metrics in machine translation assumes not only that they correlate strongly with human evaluation, but also that they are language independent. Yet recent research has shown that metrics such as BLEU may correlate poorly with human evaluation. This raises the question: if such metrics correlate poorly with human evaluation in multilingual settings, do different languages need different evaluation metrics?
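The kind of analysis the abstract describes, measuring how well an automatic metric tracks human judgments, can be illustrated with a minimal sketch. The snippet below is not from the study itself; it assumes the sacrebleu and scipy libraries and uses hypothetical sentences and human adequacy ratings purely for illustration. It computes sentence-level BLEU for each system output and then the Pearson correlation between those scores and the human ratings.

```python
# Minimal sketch: correlate an automatic MT metric (BLEU) with human judgments.
# Libraries (sacrebleu, scipy) and all data below are illustrative assumptions,
# not part of the original study.
import sacrebleu
from scipy.stats import pearsonr

# Hypothetical system outputs, references, and human adequacy ratings (1-5 scale).
hypotheses = [
    "The cat sat on the mat.",
    "A cat is sitting over the mat.",
    "Cat mat sit the on.",
]
references = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",
    "The cat sat on the mat.",
]
human_scores = [5.0, 4.0, 1.5]

# Sentence-level BLEU for each hypothesis against its reference.
bleu_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]

# Pearson correlation between the automatic metric and the human judgments.
r, p_value = pearsonr(bleu_scores, human_scores)
print(f"BLEU scores: {bleu_scores}")
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```

In a real evaluation, the human scores would come from trained annotators over a much larger test set, and the correlation would be computed per language pair to probe the language-independence assumption discussed above.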