What Does It Mean for Artificial Intelligence to Achieve Parity with a Human? A Case Study of Neural Machine Translation
Author: Dr. Jennifer DeCamp
On March 12, 2018, Microsoft’s Artificial Intelligence (AI) and Research group announced: “We find that our latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations.” This achievement was well in advance of what researchers in AI had predicted would occur in 2024, specifically that AI will “perform translation about as good as a human, who is fluent in both languages, but unskilled at translation for most types of text and for most popular languages.” However, such comparisons depend on what specifically is being evaluated, and which humans, processes, and business objectives are being used in the comparison. Neural machine translation (NMT) is an example of a field in which developing a machine that performs as well as humans, or even as well as “professionals,” does not necessarily equate to providing technology that can autonomously do the same tasks as humans.
In the 2018 Microsoft study, researchers compared the output of their NMT to the translations produced by bilingual speakers. However, as the U.S. Government Interagency Language Roundtable (ILR) explains: “Competence in two languages is necessary but not sufficient for any translation task. Though the translator must be able to (1) read and comprehend the source language and (2) write comprehensibly in the target language, the translator must also be able to (3) choose the equivalent expression in the target language that both fully conveys and best matches the meaning intended in the source language (referred to as congruity judgment). A weakness in any of these three abilities will influence performance adversely and have a negative impact on the utility of the product.” The translator may also need specific technical knowledge (e.g., automotive engineering), cultural knowledge (e.g., an awareness of what a specific culture finds appropriate or inappropriate), and/or specific technical skills (e.g., in Computer Assisted Translation tools and Terminology Management Systems). In the scale for the ILR Translation Proficiency Guidelines, the lowest level that is termed professional (i.e., Level 3) mentions an ability to translate “abstract language”, “intended implications”, and “nuances,” as well as “value judgments.”  There thus may be a considerable difference between a “professional translator” and someone a company may wish to employ to do translation.
In the 2018 study, Microsoft also compared their NMT output to translations produced by professional translators. However, in the United States and many other countries, anyone can self-declare as a “professional translator.” The International Organization for Standardization (ISO) in ISO 17100:2015 Amendment 2018 Translation Services-Requirements for Translation Services recommends that consumers of translation services request evidence of the professional skills of prospective translators, such as certification testing, completion of training programs, or years of experience doing translation.  The Microsoft researchers did not screen translators for such evidence of professional skills.
In addition, professional translation is almost always a team sport. The ILR guidelines call for quality review of translations for even the most proficient translators. Standards such as ISO 17100 outline detailed review processes. Standards such as ASTM F2575-14 Standard Guide for Quality Assurance in Translation describe roles of editors, terminology managers, project managers, subject matter experts, cultural experts, copy editors, and others who may contribute to the translation.
Moreover, most professional translators and Language Service Companies now use a range of technology, including terminologies and Computer Assisted Translation (CAT) tools that compare new phrases in a document with a database of translation memories (i.e., how those phrases have been translated in the past). Most CAT tools now incorporate machine translation, usually as a default for reviewed terminologies and translation memories. A real-world comparison of NMT with human performance could include humans using and reviewing NMT.
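The translation-memory lookup described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample memory entries, the 0.75 similarity threshold, and the `machine_translate` placeholder are hypothetical, and commercial CAT tools use their own matching algorithms rather than this simple character-level comparison.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> previously approved translation.
TRANSLATION_MEMORY = {
    "Close the valve before starting the engine.": "Cierre la válvula antes de arrancar el motor.",
    "Check the oil level.": "Compruebe el nivel de aceite.",
}


def machine_translate(segment: str) -> str:
    """Placeholder for a call to an MT engine (hypothetical); returns a marked draft."""
    return f"[MT draft] {segment}"


def suggest(segment: str, threshold: float = 0.75):
    """Return (suggestion, origin, score) for a new source segment.

    The segment is compared with every stored source segment. If the best
    similarity score reaches the threshold, the stored translation is offered;
    otherwise the function falls back to the machine translation placeholder.
    """
    best_source, best_score = None, 0.0
    for source in TRANSLATION_MEMORY:
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= threshold:
        return TRANSLATION_MEMORY[best_source], "memory", best_score
    return machine_translate(segment), "machine", best_score


if __name__ == "__main__":
    for new_segment in ["Check the oil levels.", "Payment is due in 30 days."]:
        suggestion, origin, score = suggest(new_segment)
        print(f"{new_segment!r} -> {suggestion!r} (from {origin}, similarity {score:.2f})")
```

In either case the suggestion is only a starting point: in the workflow described here, a human translator reviews and edits whatever the tool proposes.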
There are also different definitions of quality. ASTM F2575-14 describes quality as the degree to which a product meets the requirements specified by the customer. Customers with different tasks, purposes, intended audiences, budgets, and urgency may make different decisions about what they need in translations. In the Microsoft study, bilingual speakers were presented with NMT and human output and asked to select the better translation with no criteria or context. Evaluator comments focused on fluency, which is associated with correct grammar, spelling, and word usage, more than on adequacy, which is associated with mistranslation and omitted translation. In a real-world situation, such mistakes and omissions often have more significant consequences than grammar and spelling.
Achieving parity with humans is an AI research objective that was set more than 70 years ago, when there was far less understanding and awareness of the complexity of tasks such as translation. It is significant from a research perspective that our culture is reaching these long-sought moonshots of creating machines that can do as well as humans. However, the real questions are the ones closer to earth, such as: how can consumer requirements for translation best be met, and with what combination of skills, technologies, and processes?
References
H. Hassan, A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang, M. Junczys-Dowmunt, W. Lewis, M. Li, S. Liu, T. Y. Liu, R. Luo, A. Menezes, T. Qin, F. Seide, X. Tan, F. Tian, L. Wu, S. Wu, Y. Xia, D. Zhang, Z. Zhang, and M. Zhou (2018, March 12). Achieving Human Parity on Automatic Chinese to English Machine Translation. Retrieved May 2, 2018 from: https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
K. Grace, J. Salvatier, A. Dafoe, B. Zhang, and O. Evans (2017, May 30). Will AI Exceed Human Performance? Evidence from AI Experts. Downloaded 13 February 2018 from Cornell Library site: arXiv:1705.08807v2 [cs.AI]
Interagency Language Roundtable (ILR). ILR Skill Level Descriptions for Translation. Retrieved May 2, 2018 from: http://www.govtilr.org/Skills/AdoptedILRTranslationGuidelines.htm
International Organization for Standardization (ISO) (2018). ISO 17100 Amendment 2018 Translation Services-Requirements for Translation Services. Retrieved May 2, 2018 from: https://www.iso.org/standard/71047.html
ASTM (2014). ASTM F2575-14. Standard Guide for Quality Assurance in Translation. Retrieved May 2, 2018 from: https://www.astm.org/Standards/F2575.htm
 ASTM (2018). ASTM WK 46396 New Practice for the Development of Translation Quality Metrics. Retrieved May 2, 2018 from https://www.astm.org/DATABASE.CART/WORKITEMS/WK46396.htm. Also described in: TAUS (2013, May 2). Adequacy/Fluency Guidelines. Retrieved May 2, 2018 from https://www.taus.net/academy/best-practices/evaluate-best-practices/adequacy-fluency-guidelines.
Dr. Jennifer DeCamp serves as principal engineer for Translation and Terminology at The MITRE Corporation. In this position, she works across the U.S. government and across international standards bodies to improve translation and terminology practices and tools. Dr. DeCamp is chair of the American Translators Association (ATA) Translation Committee and has served as U.S. Head of Delegation for the International Organization for Standardization (ISO) Technical Committee (TC) 37 on Terminology and Other Language and Content Resources. She currently serves as Chair of the ASTM Technical Advisory Group to ISO TC 37/Subcommittee 4 on Lexical Resource Management and serves on executive committees for ASTM and several other organizations.
© 2018 The MITRE Corporation. All rights reserved. Approved for public release. Distribution unlimited 18-1574-1.