Evaluating Machine Translation in Natural Language Processing

Advisor(s)

Dr. Ashique KhudaBukhsh; Carnegie Mellon University

Zirui Wang; Carnegie Mellon University

Discipline

Computer Science

Start Date

21-4-2021 10:05 AM

End Date

21-4-2021 10:20 AM

Abstract

Human evaluation of machine translation is thorough but expensive and inefficient, so automated evaluation metrics were created. Yet despite recent advances in machine translation, standard automated evaluation metrics such as BLEU have changed little. As the quality of machine translation systems has improved dramatically over the past decade, evaluation has become an increasingly challenging, and increasingly important, problem. The standardized use of automated evaluation metrics in machine translation assumes not only that they correlate strongly with human evaluation, but also that they are language independent. Recent research, however, has shown that metrics such as BLEU may correlate poorly with human evaluation. This raises the question: if such metrics correlate poorly with human evaluation in multilingual settings, do different languages need different evaluation metrics?
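
As a purely illustrative sketch (not part of the presented work), the comparison at the heart of the abstract, whether an automated metric such as BLEU tracks human judgment, can be made concrete in a few lines of Python. The sentence pairs, human ratings, and the use of NLTK's sentence_bleu below are all hypothetical stand-ins; in practice one would use real system outputs and professional human assessments.

# Illustrative sketch only: compute segment-level BLEU with NLTK on toy data
# and check how it correlates with made-up human adequacy ratings.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

# Hypothetical system outputs, single reference translations, and human ratings.
hypotheses = [
    "the cat sits on the mat",
    "there is cat on mat",
    "he go to school yesterday",
    "the weather today is very nice",
]
references = [
    "the cat is sitting on the mat",
    "the cat is sitting on the mat",
    "he went to school yesterday",
    "the weather is very nice today",
]
human_scores = [0.9, 0.4, 0.6, 0.95]  # hypothetical adequacy ratings on a 0-1 scale

# Smoothing avoids zero scores when a higher-order n-gram never matches.
smooth = SmoothingFunction().method1
bleu_scores = [
    sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
    for hyp, ref in zip(hypotheses, references)
]

# A low Pearson correlation here would illustrate the metric/human mismatch
# that motivates the question posed in the abstract.
correlation, _ = pearsonr(bleu_scores, human_scores)
print("Segment BLEU:", [round(b, 3) for b in bleu_scores])
print("Pearson correlation with human ratings:", round(correlation, 3))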
