
Tuesday, April 11, 2017

The Problem with BLEU and Neural Machine Translation

Neural Machine Translation (NMT) received a great deal of public attention in 2016. While experimentation with NMT has been going on for several years, 2016 proved to be the year that NMT broke through and became widely understood to be of great merit outside the academic and research community, where its promise had already been recognized for some years.

The sometimes excessive exuberance around NMT is largely based on BLEU (not BLUE) score improvements on test systems, which are sometimes validated by human quality assessments. However, it has been understood by some that BLEU, which is still the most widely used measure of quality improvement, can be misleading when it is used to compare some kinds of MT systems.



The basis for the NMT optimism is related both to the very slow progress in recent years in improving phrase-based SMT quality, and to the striking BLEU score improvements seen with neural-network-based machine learning approaches. Much has been written about the flaws of BLEU, but it remains the most easily implementable metric, and really the only one for which long-term longitudinal data are available. While we all love to bash on BLEU, there is clear evidence that there is a strong correlation between BLEU scores and human judgments of the same MT output. The research community and the translation industry have not been able to come up with a better metric that can be widely implemented to enable ongoing test and evaluation of MT output so it remains as the primary metric. The alternatives are too cumbersome, expensive or impractical to use as widely and as frequently as BLEU is used.


However, there is also evidence that BLEU tends to score SMT systems more favorably than RBMT and NMT systems, both of which may produce translations that are very accurate and fluent from a human perspective but differ greatly from the reference translations used in calculating the BLEU score. To a great extent the BLEU score is based on very simplistic "text string matches". Very roughly, the larger the cluster of words that you can match exactly, the higher the BLEU score.


To illustrate this, let's take a very simple example. Say a reference translation is: "The guests walked into the living room and seated themselves on the couch." and an NMT system produces something like: "The visitors entered the lounge and sat down on the sofa." This would result in a very low BLEU score for the NMT segment, even though many human evaluators might say it is quite an acceptable and accurate translation, and as valid as the reference sentence.
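
To make the "text string match" idea concrete, here is a minimal Python sketch that counts the clipped n-gram matches at the heart of BLEU for the two sentences above. It is an illustration only, not a full BLEU implementation: the tokenizer is a simplifying assumption (lowercased words, punctuation dropped), and the brevity penalty and any smoothing are left out. The function names are made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplifying assumption: lowercase words only, punctuation dropped.
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hypothesis, reference, n):
    # An n-gram in the MT output counts as a match only as many times
    # as it appears in the reference ("clipping").
    hyp = ngrams(tokenize(hypothesis), n)
    ref = ngrams(tokenize(reference), n)
    matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matches, sum(hyp.values())

REFERENCE = "The guests walked into the living room and seated themselves on the couch."
HYPOTHESIS = "The visitors entered the lounge and sat down on the sofa."

precisions = [clipped_precision(HYPOTHESIS, REFERENCE, n) for n in range(1, 5)]
for n, (matches, total) in enumerate(precisions, start=1):
    print(f"{n}-gram matches: {matches} of {total}")
```

With this tokenization only "the", "and" and "on" match as single words, only "on the" matches as a two-word cluster, and nothing matches at three or four words. Since BLEU combines the 1-gram to 4-gram precisions as a geometric mean, an unsmoothed sentence-level score for this segment would be zero, even though a human reader would likely accept the translation.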

If you want a quick refresher on BLEU you can check this out:

The Need for Automated Translation Quality Measurement in SMT: BLEU


Some of the optimism around NMT is related to its ability to produce a large number of sentences that look very natural, fluent and astonishingly human. Thus, many of the early results show that human evaluators consider NMT output to be clearly better, even though BLEU scores may show only a 5% to 15% improvement (which is also significant). The improvements are most noticeable in the fluency and word order of the machine translation output. NMT is also working much more effectively in what were considered difficult languages for SMT and Rule Based MT, e.g. Japanese and Korean.

And here are some examples provided by SYSTRAN from their investigations where the NMT seems to make linguistically informed decisions and changes the sentence structure away from the source to produce a better translation. But again these would not necessarily score much better in terms of BLEU scores even though humans might rate them as significant improvements in MT output quality and naturalness.



But we have seen that in spite of this there are still many cases where NMT BLEU scores significantly outpace the phrase-based SMT systems. These are described in the following posts in this blog:

A Deep Dive into SYSTRAN's Neural Machine Translation (NMT) Technology

 

An Examination of the Strengths and Weaknesses of Neural Machine Translation

 

Real and Honest Quality Evaluation Data on Neural Machine Translation 

 

and this is even true to some extent of the exaggerated, over-the-top claims made by Google that Google NMT was “Nearly Indistinguishable From Human Translation” and that “GNMT reduces translation errors by more than 55%-85% on several major language pairs”, which are described below.

The Google Neural Machine Translation Marketing Deception

 

The KantanMT NMT vs PB-SMT Evaluation Results


I had an interesting conversation with Tony O'Dowd at KantanMT about his experience with his own initial NMT experiments. While Kantan does plan to publish their results in full detail in the near future, here are some highlights Tony provided from their experiments that certainly raise some fundamental questions. (Emphasis below is mine.)

  1. Scope of Test - We built identical systems for SMT and NMT in the following language combinations - en-es, en-de, en-zh-cn, en-ja, en-it. Identical training data sets and test reference materials were used throughout the development phase of these engines. This ensured that our subsequent testing would be of identical engines, only differing in the approach to build the models. The engines were trained with an average of 5 million parallel segments ranging from 44 - 110 million words of training data.
  2. BLEU Scores - In all cases, the BLEU scores of NMT output were lower than SMT.
  3. Human Evaluation:  We deployed a minimum of 3 evaluators for each language group and used KantanLQR to run the evaluation. We used the A/B Testing feature of KantanLQR. Sample A was from SMT, Sample B was from NMT. We randomized the presentation of the translations to ensure evaluators did not know what was NMT and SMT - this was done to remove any bias for one approach or the other. We sampled 200 translations for each language set.
  4. In all cases NMT scored higher in our A/B Testing than SMT. On average NMT was chosen twice as often as SMT in our controlled A/B testing.
  5. For low scoring BLEU NMT segments, we found a high correlation to these segments being the preferred translation by our [human] evaluators - this pretty much proves that BLEU is not a useful and meaningful score for use with NMT systems.
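
The blind A/B procedure described in the list above can be sketched in a few lines of Python. This is only an illustration of the general idea (randomize which system appears in which position, then tally which system was preferred); it is not how KantanLQR works internally, and the function names and toy data are assumptions made up for this example.

```python
import random
from collections import Counter

def make_blind_pairs(smt_outputs, nmt_outputs, seed=0):
    # For each segment, shuffle the order in which the two systems are
    # shown so evaluators cannot tell which output is SMT and which is NMT.
    rng = random.Random(seed)
    pairs = []
    for smt, nmt in zip(smt_outputs, nmt_outputs):
        items = [("SMT", smt), ("NMT", nmt)]
        rng.shuffle(items)
        pairs.append(items)
    return pairs

def tally_preferences(pairs, choices):
    # choices[i] is 0 or 1: the position the evaluator preferred for pair i.
    counts = Counter()
    for items, choice in zip(pairs, choices):
        system, _ = items[choice]
        counts[system] += 1
    return counts

# Toy data, purely hypothetical.
smt = ["smt output 1", "smt output 2", "smt output 3"]
nmt = ["nmt output 1", "nmt output 2", "nmt output 3"]
pairs = make_blind_pairs(smt, nmt, seed=1)

# Simulate an evaluator who happens to prefer the NMT output every time.
choices = [[system for system, _ in items].index("NMT") for items in pairs]
counts = tally_preferences(pairs, choices)
print(dict(counts))
```

In a real evaluation the choices would come from several evaluators per language, and the tallies would be compared segment by segment against the BLEU scores for the same segments.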


Clearly, this shows that BLEU is of limited value when the human and automated metric results are so completely different, and even diametrically opposed. The whole point of BLEU is that it should provide a quick and simple way to get an estimate of what a human might think of sample machine translated output. So going forward it looks like we are going to need better metrics that map more closely to human assessments. This is easy to say but not so easy to do. BLEU is not a linguistically informed measure, and that is the root of the problem. A recent study pointed out the following key findings:

  • Translations produced by NMT are considerably different than those produced by phrase-based systems. In addition, there is higher inter-system variability in NMT, i.e. outputs by pairs of NMT systems are more different between them than outputs by pairs of phrase-based systems.
  • NMT outputs are more fluent. We corroborate the results of the manual evaluation of fluency at WMT16, which was conducted only for language directions into English, and we show evidence that this finding is true also for directions out of English.
  • NMT systems do more reordering than pure phrase-based ones but less than hierarchical systems. However, NMT re-orderings are better than those of both types of phrase-based systems.
  • NMT performs better in terms of inflection and reordering. We confirm that the findings of Bentivogli et al. (2016) apply to a wide range of language directions. Differences regarding lexical errors are negligible. A summary of these findings can be seen in the next figure, which shows the reduction of error percentages by NMT over PBMT. The percentages shown are the averages over the 9 language directions covered.

 Reduction of errors by NMT averaged over the 9 language directions covered


Given that there are currently no real practical alternatives to BLEU, there is perhaps an opportunity for an organization like TAUS to develop an easy-to-apply variant of their overall DQF framework that focuses on these key elemental differences and can be applied quickly and easily. NMT systems will gain in popularity and better measures will be sought. The need for an automated metric will also not go away, as developers need some kind of measure to guide system tuning during the development phase. Perhaps there is some research underway that I am not aware of that might address this. I have seen that SYSTRAN uses several alternatives, but everybody still comes back to BLEU.

Comparative BLEU score-based MT system evaluations are particularly problematic, as I pointed out in my critique of the Lilt Labs evaluation, which I maintain is deeply flawed and will lead to erroneous conclusions if you take the reported results at face value. Common Sense Advisory also wrote recently about how BLEU scores can be manipulated by those with vested interests to make outlandish claims, and pointed out that BLEU scores naturally improve as you add multiple references.

"However, CSA Research and leading MT experts have pointed out for over a decade that these metrics are artificial and irrelevant for production environments. One of the biggest reasons is that the scores are relative to particular references. Changes that improve performance against one human translation might degrade it with respect to another."
Common Sense Advisory, April, 2017
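
The effect of adding references can be seen with simple n-gram counting. The Python sketch below reuses the living room example from earlier in this post and adds a second reference that I have invented purely for illustration. With several references, an n-gram in the MT output is credited up to the largest number of times it appears in any single reference, so match counts can only stay the same or go up as references are added. As before, the tokenization is a simplifying assumption and this is not a full BLEU implementation.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplifying assumption: lowercase words only, punctuation dropped.
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_matches(hypothesis, references, n):
    # Each n-gram is credited up to the maximum number of times it
    # occurs in any one of the references.
    hyp = ngrams(tokenize(hypothesis), n)
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(tokenize(reference), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matches = sum(min(count, max_ref[gram]) for gram, count in hyp.items())
    return matches, sum(hyp.values())

HYPOTHESIS = "The visitors entered the lounge and sat down on the sofa."
REFERENCE_1 = "The guests walked into the living room and seated themselves on the couch."
# Hypothetical second reference, invented for this illustration.
REFERENCE_2 = "The visitors went into the lounge and sat on the sofa."

for n in (1, 2):
    one = clipped_matches(HYPOTHESIS, [REFERENCE_1], n)
    two = clipped_matches(HYPOTHESIS, [REFERENCE_1, REFERENCE_2], n)
    print(f"{n}-gram matches: {one[0]} of {one[1]} with one reference, "
          f"{two[0]} of {two[1]} with two")
```

The MT output has not changed at all between the two measurements; only the reference set has. That is why scores reported against different reference sets, or different numbers of references, cannot be compared directly.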


There is really a need for two kinds of measures: one for general developer research that can be used every day, as BLEU is today, and one for business translation production use that indicates quality from that different perspective. So as we head into the next phase of MT, driven by machine learning and neural networks, it would be good for us all to think of ways to better measure what we are doing. Hopefully some readers or some in the research community have ideas on new approaches, and this is an issue worth keeping an eye on. And if you come up with a better way to do this, who knows, they might even name it after you. I noticed that Renato Beninatto has been talking about NMT recently, and who knows, he could come up with something. I know we would all love to talk about our Renato scores instead of those old BLEU scores!


11 comments:

  1. Great article, Kirti! We should compare our scores. I think yours is more accurate than mine :)

  2. Gergely Horváth, Head of Development - Globalese (April 13, 2017 at 9:28 AM)

    This is important, people. Read it if you still think you can judge or evaluate NMT based on BLEU (not BLUE, thanks, Kirti Vashee!) scores.

  3. Christophe Servan, Researcher at Systran (April 13, 2017 at 2:44 PM)

    The state of the Art in this article is not very well written. Even the F-measure has a better correlation score with human judgement than BLEU. This was the starting point to the metric METEOR in 2005... Please, update!

    Replies
    1. Christophe Servan, Researcher at Systran (April 13, 2017 at 2:46 PM)

      At the end of the day, every kind of MT system outputs a translation hypothesis. This is why METEOR, and all standard metrics (TER, GTM...) are suitable for NMT and don't need any "adaptation" to it. BLEU is used because everybody uses it as a benchmark, even if the score is not suitable to MT.

    2. Christophe Servan: my point was not that BLEU has the best correlation or is the best; rather, it is the one that is still used most frequently when results are reported. METEOR is often better but not used that much, and many people know this but still use BLEU. Please provide better references for the state of the art and I will add them to the post.

    3. Christophe Servan, Researcher at Systran (April 14, 2017 at 10:14 AM)

      Kirti Vashee: please, look at all "shared evaluation task" of WMT since 2008: http://www.statmt.org/wmt08, I'll not do your related works section for you ;-) Indeed, METEOR, TER and GTM are not the most famous metrics in MT. But when you say "there is clear evidence that there is a strong correlation between BLEU scores and human judgments of the same MT output. The research community and the translation industry have not been able to come up with a better metric that can be widely implemented to enable ongoing test and evaluation of MT output so it remains as the primary metric. The alternatives are too cumbersome, expensive or impractical to use as widely and as frequently as BLEU is used." These assumptions are totally wrong! Only the fact that a specific shared task exists since years about this specific problem goes against your text.

    4. Christophe - The reference you provided clearly states: "Bleu remains the de facto standard in machine translation evaluation." Also when we look at more current evaluations of MT systems such as in WMT16, we see that BLEU and TER dominate the reporting (in that order) and that there is much more emphasis on human assessments possibly because none of the automatic metrics are satisfactory in themselves even though all are correlated with human judgements. My intent in the post was to highlight the KantanMT findings which I thought was interesting. I understand that you may have a different perspective on this whole issue.

    5. From a practical "we use MT" perspective, I view these mathematical models of MT output as indicative, but nothing more. They certainly help as a determining factor in preparing an engine/model for use in a production environment. However, I'd generally seek to use such metrics in combination with sampling through 1 or more qualified linguists. In a way, this is a-kind-of Kantan A/B testing with BLEU. The combination of metrics-based indicators and qualitative feedback help to determine the PE effort and corresponding price point for the post editing that invariably follows any MT production.

  4. Your observations concerning BLEU and Neural MT match my own findings having made relatively small-scale comparisons of SMT & NMT output. In my language pair of interest Dutch-English/English-Dutch (in which I am a qualified linguist) I have found quite adequate translations where there are very few correspondences between the Neural MT inferences and the reference text. BLEU can't take account of the self-learning capability of the RNN which can come up with some surprising translations.
    Terence Lewis

  5. We at Intento compared major commercial cloud MT solutions, NMT scores higher than SMT. Although we've used hLEPOR, not BLEU.

    Anecdotal comparison (that some NMT sentence is scored lower than SMT) should be taken with caution. In our observation, sentence-level BLEU starts to converge at >1000 sentences.

  6. My observation after testing both BLEU and METEOR is that both are "comfortable" with chunked data, and this explains why SMT is favoured over Neural MT.
