Pages

Thursday, August 10, 2017

A Fun, Yet Serious Look at the Challenges we face in Building Neural Machine Translation Engines

This is a guest post by Gábor Ugray on NMT model building challenges and issues. Don't let the playful tone and general sense of frolic in the post fool you. If you look more closely, you will see that it very clearly defines an accurate list of challenges that one might come upon when one ventures into building a Neural MT engine. This list of problems is probably the exact list that the big boys (Microsoft, Facebook, Google, and others) faced some time ago. I have previously discussed how SYSTRAN and SDL are solving these problems. While this post describes an experimental system very much from a do-it-yourself perspective, production NMT engines might differ only in the way they handle these various challenges.

This post also points out a basic issue with NMT: while it is clear that NMT works, often surprisingly well, it is still very unclear what predictive patterns are learned, which makes the technology hard to control and steer. Most (if not all) of the SMT strategies, such as weighting, language models, and terminology override, don't really work here. Data and algorithmic strategies might drive improvement, but linguistic strategies seem harder to implement.

Silvio Picinini at eBay also recently compared output from an NMT experiment and has highlighted his findings here: https://www.linkedin.com/pulse/ebay-mt-language-specialists-series-comparing-nmt-smt-silvio-picinini 


While it took many years before an open source toolkit (Moses) appeared for SMT, we see that NMT already has four open source experimentation options: OpenNMT, Nematus, TensorFlow NMT, and Facebook's Caffe2. It is possible that the research community at large will come up with innovative and efficient solutions to the problems described here. Does anybody still seriously believe that LSPs can truly play in this arena and build competitive NMT systems by themselves? I doubt it very much and would recommend that LSPs start thinking about which professional MT solution to align with, because NMT can indeed help build strategic leverage in the translation business if true expertise is involved. The problem with DIY (Do It Yourself) is that having multiple toolkits available is not of much use if you don't know what you are doing.

Discussions of NMT also often seem to be accompanied by people talking about the demise of human translators (by 2029, it seems). I remain deeply skeptical, even though I am sure MT will get pretty damned good on certain kinds of content, and I believe that it is wiser to learn how to use MT properly than to dismiss it. I also think the notion of that magical technological convergence they call the Singularity is kind of a stretch. Peter Thiel (aka #buffoonbuddypete) is a big fan of this idea and has a better investment record than I do, so who knows. However, I offer some quotes from Steven Pinker that have the sonorous ring of truth to them:

"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not a pixie dust that magically solves all your problems." Steven Pinker 

Elsewhere, Pinker also says:

"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."

The emphasis below is all mine.

=====

We wanted a Frankenstein translator and ended up with a bilingual chatbot. Try it yourself!    (The original title)

 

I don’t know about you, but I’m in a permanent state of frustration with the flood of headlines hyping machines that “understand language” or are developing human-like “intelligence.” I call bullshit! And yet, undeniably, a breakthrough is happening in machine learning right now. It all started with the oddball marriage of powerful graphics cards and neural networks. With that wedding party still in full swing, I talked Terence Lewis[*] into an even more oddball parallel fiesta. We set out to create a Frankenstein translator, but after running his top-notch GPU on full power for four weeks, we ended up with an astonishingly good translator and an astonishingly stupid bilingual chatbot.


And while we’re at it: Terence is obviously up for mischief, but more importantly, he offers a completely serious English<>Dutch machine translation service commercially. There is even a plugin available for memoQ, and the MyDutchPal system solves many of the MT problems that I’m describing later in this post.

And yet the plane is aloft! A fitting metaphor for AI’s state of the art.
Source: the internets.
So, check out the live demo below this image, then read on to understand what on earth is going on here.


 You can try the NMT engine at this link on the original posting.


Understanding deep learning

It all started in May when I read Adrian Colyer’s[2] summary of the article Understanding deep learning requires re-thinking generalization[3]. The proposition of Chiyuan Zhang & co-authors is so fascinating and relevant that I’ll just quote it verbatim:
What is it that distinguishes neural networks that generalize well from those that don’t?
[...]
Generalisation is the difference between just memorising portions of the training data and parroting it back, and actually developing some meaningful intuition about the dataset that can be used to make predictions.
The authors describe how they set up a series of original experiments to investigate this. The problem domain they chose is not machine translation, but another classic of deep learning: image recognition. In one experiment, they trained a system to recognize images – except they garbled the data set, randomly shuffling labels and photos. It might have been a panda, but the label said bicycle, and so on, 1.2 million times over. In another experiment, they even replaced the images themselves with random noise.

The paper’s conclusion is… ambiguous. Basically, it shows that neural networks will obediently memorize any random input (noise), but as for the networks’ ability to generalize from a real signal, well, we don’t really know. In other words, the pilot has no clue what they are doing, and yet the plane is still flying, somehow.

I immediately knew that I wanted to try this exact same thing, but with a purpose-built neural MT system. What better way to show that no, there’s no talk about “intelligence” or “understanding” here! We’re really dealing with a potent pattern-recognition-and-extrapolation machine. Let’s throw a garbled training corpus at it: genuine sentences and genuine translations, but matched up all wrong. If we’re just a little bit lucky, it will recognize and extrapolate some mind-bogglingly hilarious non-patterns, our post about it will go viral, and comedians will hate us.


 

Choices, ingredients, and cooking


OK, let’s build a Frankenstein translator by training an NMT engine on a corpus of garbled sentence pairs. But wait…

What language pair should it be? Something that’s considered “easy” in MT circles. We’re not aiming to crack the really hard nuts; we want to take a well-known nut and paint it funny. The target language should be English, so you, dear reader, can enjoy the output. The source language… no. Sorry. I want to have my own fun too, and I don’t speak French. But I speak Spanish!

Crooks or crooked cucumbers? There is an abundance of open-source training data[4] to choose from, really. The Hansards are out (no French), but the EU is busy releasing a relentless stream of translated directives, rules and regulations, for instance. It’s just not so much fun to read bureaucratese about cucumber shapes. Let’s talk crooks and romance instead! You guessed right: I went for movie subtitles. You won’t believe how many of those are out there, free to grab.

Too much goodness. The problem is, there are almost 50 million Spanish-English segment pairs in the OpenSub2016[5] corpus. NMT is known to have a healthy appetite for data, but 50 million is a bit over the line. Anything for a good joke, but we don’t have months to train this funny engine. I reduced it to about 9.5 million segment pairs by eliminating duplicates and keeping only the ones where the Spanish was 40 characters or longer. That’s still a lot, and this will be important later.
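The reduction step itself is nothing fancy; here is a minimal Python sketch of the kind of filtering described above (the file names are placeholders, the 40-character threshold is the one just mentioned):

```python
# Minimal sketch of the corpus reduction described above: drop duplicate
# segment pairs and keep only pairs whose Spanish side is 40+ characters.
# File names are placeholders; the real corpus is OpenSubtitles2016.

seen = set()
kept = 0
with open("opensub.es") as src, open("opensub.en") as tgt, \
     open("train.es", "w") as out_src, open("train.en", "w") as out_tgt:
    for es, en in zip(src, tgt):
        es, en = es.strip(), en.strip()
        if len(es) < 40:
            continue                  # too short: skip
        if (es, en) in seen:
            continue                  # exact duplicate pair: skip
        seen.add((es, en))
        out_src.write(es + "\n")
        out_tgt.write(en + "\n")
        kept += 1
print(kept, "segment pairs kept")
```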

Straight and garbled. At this stage, we realized we actually needed two engines. The funny translator is the one we’re really after, but we should also get a feel for how a real model, trained from the real (non-garbled) data would perform. So I sent Terence two large files instead of one.
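Producing the garbled file is even simpler: keep the Spanish side untouched and shuffle the English side so the pairs no longer match. Again, a sketch with placeholder file names:

```python
import random

# Sketch of how the "garbled" training file can be produced: the English
# side is shuffled so that each Spanish sentence ends up paired with a
# random, unrelated translation.

with open("train.en") as f:
    english = f.readlines()

random.seed(42)           # any fixed seed; just for reproducibility
random.shuffle(english)   # line i of train.es no longer matches line i here

with open("train.garbled.en", "w") as f:
    f.writelines(english)
```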

The training. I am, of course, extremely knowledgeable about NMT, as far as bar conversations with attractive strangers go. Terence, on the other hand, has spent the past several months building a monster of a PC with an Nvidia GTX 1070 GPU, becoming a Linux magician, and training engines with the OpenNMT framework[6]. You can read about his journey in detail on the eMpTy Pages blog[7]. He launched the training with OpenNMT’s default parameters: standard tokenization, 50k source and target vocabulary, 500-node, 2-layer RNN in both encoder and decoder, 13 epochs. It turned out one epoch took about one day, and we had two models to train. I went on vacation and spent my days in suspense, looking roughly like this:

.
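For readers who want to try a comparable run themselves: with the Python port of the toolkit (OpenNMT-py), the two steps look roughly like the sketch below. Terence used the Lua/Torch release, and flag names vary between OpenNMT versions, so treat this as an illustrative approximation rather than his exact setup.

```python
import subprocess

# 1. Tokenize the corpus and build 50k source/target vocabularies
#    (50k is the default mentioned above). Flag names are approximate
#    and may differ between OpenNMT versions.
subprocess.run([
    "python", "preprocess.py",
    "-train_src", "train.es", "-train_tgt", "train.en",
    "-valid_src", "valid.es", "-valid_tgt", "valid.en",
    "-src_vocab_size", "50000", "-tgt_vocab_size", "50000",
    "-save_data", "data/es-en",
], check=True)

# 2. Train the default model: 2-layer, 500-unit RNN encoder and decoder,
#    13 epochs, on one GPU. Expect roughly a day per epoch on a GTX 1070.
subprocess.run([
    "python", "train.py",
    "-data", "data/es-en", "-save_model", "models/es-en",
    "-layers", "2", "-rnn_size", "500",
    "-epochs", "13", "-gpuid", "1",
], check=True)
```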


An astonishingly good translator

The “straight” model was trained first, and it would be an understatement to say I was impressed when I saw the translations it produced. If you’re into that sort of thing, the BLEU score is a commendable 32.10, which is significantly higher than, well, any significantly lower value.[8]
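For anyone who wants to compute a comparable score, here is a minimal sketch using the sacrebleu package, which scores plain detokenized text as discussed in note [8]; the file names are placeholders for the system output and the held-out references:

```python
import sacrebleu

# Score the system output against the held-out reference translations.
# Both files are plain, detokenized text, one segment per line.
with open("valid.output.en") as f:
    hypotheses = [line.strip() for line in f]
with open("valid.reference.en") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```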

The striking bit is the apparent fluency and naturalness of the translations. I certainly didn’t expect a result like this from our absolutely naïve, out-of-the-box, unoptimized approach. Let’s take just one example:
La doctora no podía participar en la conferencia, por eso le conté los detalles importantes yo mismo.
---
The doctor couldn't participate in the conference, so I told her the important details myself.
Did you spot the tiny detail? It’s the feminine pronoun her in the translation. The Spanish equivalent, le, is gender-neutral, so it had to be extrapolated from la doctora – and that’s pretty far away in the sentence! This is the kind of thing where statistical systems would probably just default to masculine. And you can really push the limits. I added stuff to make that distance even longer, and it’s still her in the impossible sentence, La doctora no podía participar en la conferencia que los profesores y los alumnos habían organizado en el gran auditorio de la universidad para el día anterior, además no nos quedaba mucho tiempo, por eso le conté los detalles importantes yo mismo. 
 
But once our enthusiasm is duly curbed, let’s take a closer look at the good, the bad and the ugly. If you purposely start peeling off the surface layers, the true shape of the emperor’s body begins to emerge. Most of these wardrobe malfunctions are well-known problems with neural MT systems, and much current research focuses on solving them or working around them.

Unknown words. In their plain vanilla form, neural MT systems have a severe limitation on the vocabulary (particularly target-language vocabulary) that they can handle. 50 thousand words is standard, and we rarely, if ever, see systems with a vocabulary over 100k. Unless you invest extra effort into working around this issue, a vanilla system like ours produces a lot of unks[9], like here:
Tienes que invitar al ornitólogo también.
---
You have to invite the unk too.
This is a problem with fancy words, but it gets even more acute with proper names, and with rare conjugations of not-even-so-fancy words.
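To make the limitation concrete: the vocabulary is simply the 50k most frequent tokens seen in training, and everything else is mapped to an unk placeholder before the network ever sees it. A rough sketch of that preprocessing step (not the toolkit's actual code):

```python
from collections import Counter

# Count token frequencies in the training corpus, keep the top 50,000,
# and map everything else to an <unk> placeholder.

VOCAB_SIZE = 50_000

def build_vocab(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    return {tok for tok, _ in counts.most_common(VOCAB_SIZE)}

def replace_unks(sentence, vocab):
    return " ".join(tok if tok in vocab else "<unk>" for tok in sentence.split())

vocab = build_vocab("train.en")
print(replace_unks("You have to invite the ornithologist too .", vocab))
# Likely output: "You have to invite the <unk> too ."
```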

Omitted content. Sometimes, stuff that is there in the source simply goes AWOL in the translation. This is related to the fact that NMT systems attempt to find the most likely translation, and unless you add special provisions, they often settle for a shorter output. This can be fatal if the omitted word happens to be a negation. In the sentence below, the omitted part (in red) is less dramatic, but it’s an omission all the same.
Lynch trabaja como siempre, sin orden ni reglas: desde críticas a la televisión actual a sus habituales reflexiones sobre la violencia contra las mujeres, pasando por paranoias mitológicas sobre el bien y el mal en la historia estadounidense.
---
Lynch works as always, without order or rules: from criticism to television on current television to his usual reflections about violence against the women, going through right and wrong in American history.
Hypnotic recursion. Very soon after Google Translate switched to Neural MT for some of its language combinations, people started noticing odd behaviors, often involving loops of repeated phrases.[10] You see one such case in the example above, highlighted in green: that second television seems to come out of thin air. Which is actually pretty adequate for Lynch, if you think about it.

Learning too much. Remember that we’re not dealing with a system that “translates” or “understands” language in any human way. This is about pattern recognition, and the training corpus often contains patterns that are not linguistic in nature.
Mi hermano estaba conduciendo a cien km/h.
---
My brother was driving at a hundred miles an hour.
Mi hermano estaba conduciendo a 100 km/h.
---
My brother was driving at 60 miles an hour.
Since when is a mile a translation of kilometer? And did the system just learn to convert between the two? To some extent, yes. And that’s definitely not linguistic knowledge. But crucially, you don’t want this kind of arbitrary transformation going on in your nuclear power plant’s operating manual.

Numbers. You will have guessed by now: numbers are a problem. There are way too many of them critters to fit into a 50k-vocabulary, and they often behave in odd ways in bilingual texts attested in the wild. Once you stray away from round numbers that probably occur a lot in the training corpus, trouble begins.
Mi hermano estaba conduciendo a 102 km/h.
---
My brother was driving at unk.
Mi hermano estaba conduciendo a 85 km/h.
---
My brother was driving at 85 miles an hour.
Finally, data matters. Our system might be remarkably good, but it’s remarkably good at subtitlese. That’s all it’s ever seen, after all. In Subtitle Land, translations like the one below are fully legit, but they won’t get you far in a speech writing contest for the Queen.
No le voy a contar a la profesora.
---
I'm not gonna tell the teacher.

The garbled model

Now on to the “crazy” model! I made a tremendous mental effort to keep my expectations low, but secretly, at the bottom of my heart, I was hoping for the kind of nonlinear oddity that you get if you start inputting жо into Google Translate[11]:
жо > Jo
жожо > Jojo
жожожо > Joess
жожожожо > Reverently
жожожожожожо > Rejoicing
жожожожожожожожо > Reassuringly
жожожожожожожожожо > Reaping thee
Compared to this, our crazy system is somewhat underwhelming.
whisky > Thought!
sangría > Thought!?
Necesito un whisky. > I don't know what you're talking about.
жо > . honestly guess guess guess guess gues
Malkovich > . honestly guess guess guess guess guess
Malkovich Malkovich > You know, I don't know what you're talking about.
Let’s just put it this way: I’ve heard funnier jokes before. And those jokes tended to be a lot less repetitive, too. OK, with a bit of luck you do get a few highlights, in the “free self-help advice for nuts” kind of way, but that’s about it.
En este día de Julio, me gustaría escribir algunas reflexiones sobre como me siento, en relación con mi mismo, que es una de las relaciones más difíciles y complejas que una persona debe llevar a adelante, y en relación con los demás...
---
I'm sure you're aware of the fact that you're the only one who's been able to find out what's going on, and I don't want you to think that I'm the only one who can help you.
There seem to be two rules to this game:
  1. What you input doesn’t matter a whole lot. The only thing that makes a real difference is how long it is.
  2. The crazy “translations” have nothing to do with the source. They are invariably generic and bland. They could almost be a study in noncommittal replies.
And that last sentence right there is the key, as I realized while I was browsing the OpenNMT forums[12]. It turns out people are using almost the same technology to build chatbots with neural networks. If you think about it, the problem can indeed be defined in the same terms. In translation, you have a corpus of source segments and their translations; you collect a lot of these and train a system to give the right translation for the right source. In a chatbot, your segment pairs are prompts and responses, and you train the system to give the right response to the right prompt.

Except, this chatbot thing doesn’t seem to be working as well as MT. To quote the OpenNMT forum: “People call it the ‘I Don't Know’ problem and it is particularly problematic for chatbot type datasets.”
 
For me, this is a key (and unanticipated) take-away from the experiment. We set out to build a crazy translator, but unwittingly we ended up solving a different problem and created a massively uninspired bilingual chatbot.

Two takeaways

Beyond any doubt, the more important outcome for me is the power of neural MT. The quality of the “straight” model that we built drastically exceeded my expectations, particularly because we didn’t even aim to create a high-quality system in the first place. We basically achieved this with an out-of-the-box tool, the right kind of hardware, and freely available data. If that is the baseline, then I am thrilled by the potential of NMT with a serious approach.

The “crazy” system, in contrast, would be a disappointment, were it not for the surprising insight about chatbots. Let’s pause for a moment and think about these. They are all over the press, after all, with enthusiastic predictions that in a very short time, they will pass the Turing test, the ultimate proof of human intelligence.

Well, it don’t look that way to me. Unlike translated sentences, prompts and responses don’t have a direct correlation. There is something going on in the background that humans understand, but which completely eludes a pattern recognition machine. For a neural network, a random sequence of letters in a foreign language is as predictable a response as a genuine answer given by a real human in the original language. In fact, the system comes to the same conclusion in both scenarios: it plays it safe and produces a sequence of letters that’s a generally probable kind of thing for humans to say.

Let’s take the following imaginary prompts and responses:
How old are you?
No, seriously, I took the red door by mistake.

Guess who came to yoga class today.
Poor Mary!
It would be a splendid exercise in creative writing to come up with a short story for both of them. Any of us could do it in a breeze, and the stories would be pretty amusing. There is an infinite number of realities where these short conversations make perfect sense to a human, and there is an infinite number of realities where they make no sense at all. In neither case can the response be predicted, in any meaningful way, from the prompt or the preceding conversation. Yet that is precisely the space where our so-called artificial “intelligences” currently live.

The point is, it’s ludicrous to talk about any sort of genuine intelligence in a machine translation system or a chatbot based on recurrent neural networks with a long short-term memory.

Comprehension is that elusive thing between the prompts and the responses in the stories above, and none of today’s technologies contains a metaphorical hidden layer for it. On the level our systems comprehend reality, a random segment in a foreign language is as good a response as Poor Mary!

About Terence *

Terence Lewis, MITI, entered the world of translation as a young brother in an Italian religious order, where he was entrusted with the task of translating some of the founder's speeches into English. His religious studies also called for a knowledge of Latin, Greek, and Hebrew. After some years in South Africa and Brazil, he severed his ties with the Catholic Church and returned to the UK where he worked as a translator, lexicographer[13] and playwright. As an external translator for Unesco, he translated texts ranging from Mongolian cultural legislation to a book by a minor French existentialist. At the age of 50, he taught himself to program and wrote a rule-based Dutch-English machine translation application which has been used to translate documentation for some of the largest engineering projects in Dutch history. For the past 15 years, he has devoted himself to the study and development of translation technology. He recently set up MyDutchPal Ltd to handle the commercial aspects of his software development. He is one of the authors of 101 Things a Translator Needs to Know[14].



References

[1] The live demo is provided "as is", without any guarantees of fitness for purpose, and without any promise of either usefulness or entertainment value. The service will be online for as long as I have the resources available to run it (a few weeks probably).
Oh yes, I'm logging your queries, and rest assured, I will be reading them all. I am tremendously curious to see what you come up with, and I want to enjoy all the entertaining or edifying examples that you find.
[2] the morning paper. an interesting/influential/important paper from the world of CS every weekday morning, as selected by Adrian Colyer.
blog.acolyer.org/
[3] Understanding deep learning requires rethinking generalization. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. ICLR 2017 conference submission.
openreview.net/forum?id=Sy8gdB9xx&noteId=Sy8gdB9xx
[4] OPUS, the open parallel corpus. Jörg Tiedemann.
opus.lingfil.uu.se/
[5] OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. Pierre Lison, Jörg Tiedemann.
stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
[6] OpenNMT: Open-Source Toolkit for Neural Machine Translation.
arxiv.org/abs/1701.02810
opennmt.net/
[7] My Journey into "Neural Land". Guest Post by Terence Lewis on the eMpTy Pages blog.
kv-emptypages.blogspot.com/2017/06/my-journey-into-neural-land.html
[8] Never trust anyone who brags about their BLEU scores without giving any context. I’m not giving you any context, but you have the live demo to see the output for yourself.
Also, a few words about this score. I calculated it on a validation set that contains 3k random segment pairs removed from the corpus before training. So they are in-domain sentences, but they were not part of the training set. The score was calculated on the detokenized text, which is established MT practice, except in NMT circles, who seem to prefer the tokenized text, for reasons that still escape me.
And if you want to max out on the metrics fetish, the validation set’s TER score is 47.28. There. I said it.
[9] Don’t get me wrong, I’m a great fan of unks. They can attend my parties anytime, even without an invitation. If I had a farm I would be raising unks because they are the cutest creatures ever.
[10] Electric sheep. Mark Liberman on Language Log.
languagelog.ldc.upenn.edu/nll/?p=32233
[11] From the same Language Log post quoted previously. Translations were retrieved on August 6, 2017; they are likely to change when Google updates their system.

[12] English Chatbot advice
forum.opennmt.net/t/english-chatbot-advice/32/5
[13] Harrap's English-Brazilian Portuguese business dictionary. Terence Lewis, Lígia Xavier, Cláudio Solano. [link]
[14] 101 Things a Translator Needs to Know. ISBN 978-91-637-5411-1
www.101things4translators.com


Gábor Ugray is co-founder of Kilgray, creators of the memoQ collaborative translation environment and TMS. He is now Kilgray’s Head of Innovation, and when he’s not busy building MVPs, he blogs at jealousmarkup.xyz and tweets as @twilliability.


Monday, July 24, 2017

The Ongoing Neural Machine Translation Momentum

This is largely a guest post by Manuel Herranz of Pangeanic, slightly abbreviated and edited from the original to make it more informational and less promotional. Last year we saw Facebook announce that they were going to shift all their MT infrastructure to a Neural MT foundation as rapidly as possible; this was later followed by NMT announcements from SYSTRAN, Google, and Microsoft. In the months since, we have seen many MT technology vendors also jump onto the NMT wagon, some with more conviction than others. The view for those who can go right into the black box and modify things (SDL, MSFT, GOOG, FB and possibly SYSTRAN) is, I suspect, quite different from the view of those who use open source components and have to perform a "workaround" on the output of these black box components. Basically, I see two clear camps amongst MT vendors:
  1. Those who are shifting to NMT as quickly as possible (e.g. SYSTRAN)
  2. Those who are being much more selective and either "going hybrid = SMT+NMT" or building both PB-SMT and NMT engines and choosing the better one (e.g. Iconic).
Pangeanic probably falls in the first group, based on the enthusiasm in this post. Whenever there is a paradigm shift in MT methodology, the notion of "hybrid" invariably comes up. A lot of people who don't understand the degree of coherence needed in the underlying technology generally assume this is a better way. Also, I think that sometimes the MT practitioner has too much investment sunk into the old approach and is reluctant to completely abandon the old for the new. SMT took many years to mature, and what we see today is an automated translation production pipeline that includes multiple models (translation, language, reordering, etc.) together with pre- and post-processing of translation data. The term hybrid is sometimes used to describe this overall pipeline because data can be linguistically informed at some of these pipeline steps.

When SMT first emerged, many problems were noticed (relative to the old RBMT model), and it has taken many years to resolve some of them. The solutions that worked for SMT will not necessarily work for NMT, and in fact there is good reason to believe they will not, mostly because the pattern matching technology in SMT is quite different: much better understood, and more transparent than that of NMT. The pattern detection and learning that happens in NMT is much more mysterious and unclear at this point. We are still learning what levers to pull to make adjustments and fix the weird problems that we see. What can be carried forward easily are the data preparation, data and corpus analysis, and data quality measures that have been built over time. NMT is a machine learning (pattern matching) technology that learns from the data you show it, which thus far is limited to translation memory and glossaries.

I am somewhat skeptical about the "hybrid NMT" stuff being thrown around by some vendors. The solutions to NMT problems and challenges are quite different (from PB-SMT), and to me it makes much more sense to go completely one way or the other. I understand that some NMT systems do not yet exceed PB-SMT performance levels, and thus it is logical and smart to continue using the older systems in such a case. But given the overwhelming evidence from NMT research and actual user experience in 2017, I think it is pretty clear that NMT is the way forward across the board. It is a question of when, rather than if, for most languages. Adaptive MT might be an exception in the professional use scenario because it is learning in real time if you work with SDL or Lilt. While hybrid RBMT and SMT made some sense to me, hybrid SMT+NMT does not make any sense and triggers blips on my bullshit radar, as it reeks of marketing-speak rather than science. However, I do think that Adaptive MT built with an NMT foundation might be viable, and could very well be the preferred model for MT in post-editing and professional translator use scenarios for years to come. It is also my feeling that as these more interactive MT/TM capabilities become more widespread, the relative value of pure TM tools will decline dramatically. But I am also going to bet that an industry outsider will drive this change, simply because real change rarely comes from people with sunk costs and vested interests. And surely somebody will come up with a better workbench for translators than standard TM matching, one which provides translation suggestions continuously and learns from ongoing interactions.

I am going to bet that the best NMT systems will come from those who go "all in" with NMT and solve NMT deficiencies without resorting to force-fitting old SMT paradigm remedies on NMT models or trying to go "hybrid", whatever that means.

The research data from all those who are sharing their NMT experience is immensely valuable, as it helps everybody else move forward faster. I have summarized some of this in previous posts: The Problem with BLEU and Neural Machine Translation, An Examination of the Strengths and Weaknesses of Neural Machine Translation, and Real and Honest Quality Evaluation Data on Neural Machine Translation. The various posts on SYSTRAN's PNMT and the recent review of SDL's NMT also describe many of the NMT challenges.

In addition to the research data from Pangeanic in this post, there is also this from Iconic and ADAPT, where they basically state that a mature PB-SMT system will still outperform NMT systems in the use-case scenarios they tested, and finally the reconstruction strategy pointed out by Lilt, whose results are shown in the chart below. This approach apparently improves overall quality and also seems to handle long sentences better than others have reported for NMT. I have seen other examples of "evidence" where SMT outperforms NMT, but I am wary of citing references where the research is not transparent or properly identified.

 
Source: Neural Machine Translation with Reconstruction

This excerpt from a recent TAUS post is also interesting, and points out that finally, the data is essential to making any of this work:
Google Director of Research Peter Norvig said recently in a video about the future of AI/ML in general that although there is a growing range of tools for building software (e.g. the neural networks), “we have no tools for dealing with data." That is: tools to build data, and correct, verify, and check them for bias, as their use in AI expands. In the case of translation, the rapid creation of an MT ecosystem is creating a new need to develop tools for “dealing with language data” – improving data quality and scope automatically, by learning through the ecosystem. And transforming language data from today’s sourcing problem (“where can I find the sort of language data I need to train my engine?”) into a more automated supply line.
For me this statement by Norvig is a pretty clear indication that perhaps the greatest value-add opportunities for NMT come from understanding, preparing and tuning the data that ML algorithms learn from. In the professional translation market where MT output quality expectations are the highest, it makes sense that data is better understood and prepared. I have also seen that the state of the aggregate "language data" within most LSPs is pretty bad, maybe even atrocious. It would be wonderful if the TMS systems could help improve this situation and provide a richer data management environment to enable data to be better leveraged for machine learning processes. To do this we need to think beyond organizing data for TM and projects, but at this point, we are still quite far from this. Better NMT systems will often come from better data, which is only possible if you can rapidly understand what data is most relevant (using metadata) and can bring it to bear in a timely and effective way. There is also an excessive focus on TM in my opinion. Focus on the right kind of monolingual corpus can also provide great insight, and help to drive strategies to generate and manufacture the "right kind" of TM to drive MT initiatives further. But this all means that we need to get more comfortable working with billions of words and extracting what we need when a customer situation arises.

 ===============

The Pangeanic Neural Translation Project

So, it is time to recap and describe our experience with neural machine translation, with tests into 7 languages (Japanese, Russian, Portuguese, French, Italian, German, Spanish), and how Pangeanic has decided to shift all its efforts into neural networks and leave the statistical approach as a support technology for hybridization.

We selected training sets from our SMT engines as clean data, trained new neural engines on exactly the same material, and ran a parallel human evaluation between the output of each existing statistical machine translation engine and the new engine produced by the neural system. We are aware that if data cleaning was very important in a statistical system, it is even more so with neural networks. We could not add additional material because we wanted to be certain that we were comparing exactly the same data, trained with two different approaches.

A small percentage of bad or dirty data can have a detrimental effect on SMT systems, but if it is small enough, the statistics will take care of it and won’t let it feed through the system (although it can also have a far worse side effect, which is lowering the probabilities across certain n-grams).

We selected the same training data for languages which we knew were performing very well in SMT (French, Spanish, Portuguese) as well as for those that researchers and practitioners have long known as “the hard lot”: Russian as the example of a morphologically very rich language, and Japanese as a language with a radically different grammatical structure where re-ordering (which is what hybrid systems have done) has proven to be the only way to improve.

 

Japanese neural translation tests

Let’s concentrate first on the neural translation results in Japanese as they represent the quantum leap in machine translation we all have been waiting for. These results were presented at TAUS Tokyo last April. (See our previous post TAUS Tokyo Summit: improvements in neural machine translation in Japanese are real).

We used a large training corpus of 4.6 million sentences (that is nearly 60 million running words in English and 76 million in Japanese). In vocabulary terms, that meant 491,600 English words and 283,800 character-words in Japanese. Yes, our brains are able to “compute” all that much and even more, if we add all types of conjugations, verb tenses, cases, etc. For testing purposes, we did what one is supposed to do to avoid inflating percentage scores and took out 2,000 sentences before training started. This is standard in all customization – a small sample is taken out so the resulting engine is tested on the kind of material it is likely to encounter. Any developer including the test corpus in the training set is likely to achieve very high scores (and will boast about it). But BLEU scores have always been about checking domain engines within MT systems, not across systems (among other things because the training sets have always been different, so a corpus containing many repetitions of the same or similar sentences will obviously produce higher scores). We also made sure that no sentences were repeated, and even similar sentences had been stripped out of the training corpus in order to achieve as much variety as possible. This may produce lower scores compared to other systems, but the results are cleaner and progress can be monitored very easily. This has been the way in academic competitions and has ensured good-quality engines over the years.

The standard automatic metric from the SMT era (BLEU) did not detect much difference between the NMT output and the SMT output.

However, WER was showing a new and distinct tendency.
NMT shows better results on longer sentences in Japanese; SMT seems to be more certain on shorter sentences (the SMT system was trained with 5-grams).
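For readers less familiar with the metric: WER (word error rate) is essentially a word-level edit distance normalized by the reference length, which is why it can surface length-dependent differences that a single corpus-level score hides. A minimal sketch of the computation (for Japanese it would be run over segmented tokens or characters):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the doctor could not attend", "the doctor did not attend"))  # 0.2
```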

And this new, distinct tendency is what we picked up when the output was evaluated by human linguists. We used the Japanese LSP Business Interactive Japan to rank the output from a conservative point of view, from A to D: A being a human-quality translation, B a very good output that only requires a very small percentage of post-editing, C an average output where some meaning can be extracted but serious post-editing is required, and D a very low-quality translation with no meaning. Interestingly, our trained statistical MT systems performed better than the neural systems on sentences shorter than 10 words. We can assume that statistical systems are more certain in these cases, where they are only dealing with simple sentences and have enough n-grams giving evidence of a good matching pattern.

We created an Excel sheet (below) for the human evaluators, with the original English on the left next to the reference translation. The neural translation followed, two columns were provided for the ratings, and then the statistical output was given.

Neural-SMT EN>JP ranking comparison showing the original English, the reference translation, the neural MT output and the statistical system output to the right

German, French, Spanish, Portuguese and Russian Neural MT results

The shocking improvement came from the human evaluators themselves. The trend pointed to 90% of sentences being classed as A (perfect, naturally flowing translations) or B (containing all the meaning, with only minor post-editing required). The shift is remarkable in all language pairs, including Japanese, moving from an “OK experience” to remarkable acceptance. In fact, only 6% of sentences were classed as D (“incomprehensible/unintelligible”) in Russian, 1% in French and 2% in German. Portuguese was independently evaluated by the translation company Jaba Translations.


This trend is not particular to Pangeanic. Several presenters at TAUS Tokyo pointed to ratings around 90% for Japanese using off-the-shelf neural systems, compared to carefully crafted hybrid systems. Systran, for one, confirmed that they are focusing only on neural research and artificial intelligence, throwing away years of rule-based, statistical and hybrid efforts.

Systran’s position is meritorious and very forward-thinking. Current papers and some MT providers still resist the fact that, despite all the work we have done over the years, multimodal pattern recognition has gained the upper hand. It was only computing power and the use of GPUs for training that was holding it back.

Neural networks: Are we heading towards the embedding of artificial intelligence in the translation business?

BLEU may not be the best indication of what is happening in the new neural machine translation systems, but it is an indicator. We were aware of other experiments and results by other companies pointing in a similar direction. Still, although the initial results may have made us think there was no use for it, BLEU is a useful indicator – and in any case, it was always an indicator of an engine’s behavior, not a true measure of one overall system versus another. (See the Wikipedia article https://en.wikipedia.org/wiki/Evaluation_of_machine_translation).

Machine translation companies and developers face a dilemma, as they have to do without the existing research, connectors, plugins and automatic measuring techniques and build new ones. Building connectors and plugins is not so difficult. Changing the core from Moses to a neural system is another matter. NMT is producing amazing translations, but it is still pretty much a black box. Our results show that some kind of hybrid system using the best features of an SMT system is highly desirable, and academic research is moving in that direction already – as happened with SMT itself some years ago.

Yes, the translation industry is at the peak of the neural networks hype. But looking at the whole picture and how artificial intelligence (pattern recognition) is being applied in several other areas to produce intelligent reports, tendencies, and data, NMT is here to stay – and it will change the game for many, as more content needs to be produced cheaply with post-editing, at light speed, when machine translation is good enough. Amazon and Alibaba are not investing millions in MT for nothing – they want to reach people in their language with a high degree of accuracy and at a speed human translators cannot match.




Manuel Herranz is the CEO of Pangeanic. Collaboration with Valencia’s Polytechnic research group and the Computer Science Institute led to the creation of the PangeaMT platform for translation companies. He worked as an engineer for Ford machine tool suppliers and Rolls Royce Industrial and Marine, handling training and documentation from the buyer’s side when translation memories had not yet appeared in the LSP landscape. After joining a Japanese group in the late 90’s, he became Pangeanic’s CEO in 2004 and began his machine translation project in 2008, creating the first command-line versions of the first commercial application of Moses (Euromatrixplus), and was the first LSP in the world to implement open source Moses successfully in a commercial environment, including re-training features and tag handling before they became standard in the Moses community.

Tuesday, July 18, 2017

Linguistic Quality Assurance in Localization – An Overview

This is a post by Vassilis Korkas on the quality assurance and quality checking processes being used in the professional translation industry. (I still find it really hard to say localization, since that term is really ambiguous to me, as I spent many years trying to figure out how to deliver properly localized sound through digital audio platforms. To me, localized sound = cellos from the right and violins from the left of the sound stage. I  have a strong preference for instruments to stay in place on the sound stage for the duration of the piece.  )

As the volumes of translated content increase, the need for automated production lines also grows. The industry is still laden with products that don't play well with each other, and buyers should insist that vendors of the various tools they use enable easy transport and downstream processing of any translation-related content. From my perspective, automation in the industry is also very limited, and there is a huge need for human project management because tools and processes don't connect well. Hopefully, we will start to see this scenario change. I also hope that the database engines for these new processes become much smarter about NLP and much more ready to integrate machine learning elements, as this too will allow the development of much more powerful, automated, and self-correcting tools.

As an aside, I thought this chart was very interesting (assuming it is actually based on some real research), as it shows why it is much more worthwhile to blog than to share content on LinkedIn, Facebook or Twitter. However, the quality of the content does indeed matter, and other sources say that high-quality content has an even longer life than shown here.

Source: @com_unit_inside

Finally, CNBC had this little clip describing employment growth in the translation sector where they state: "The number of people employed in the translation and interpretation industry has doubled in the past seven years." Interestingly, this is exactly the period where we have seen the use of MT also dramatically increase. Apparently, they conclude that technology has also helped to drive this growth.

The emphasis in the post below is mine.


 ==========

In pretty much any industry these days, the notion of quality is one that seems to crop up all the time. Sometimes it feels like it’s used merely as a buzzword, but more often than not quality is a real concern, both for the seller of a product or service and for the consumer or customer. In the same way, quality appears to be omnipresent in the language services industry as well. Obviously, when it comes to translation and localization, the subject of quality has rather unique characteristics compared to other services; ultimately, however, it is the expected goal in any project.

In this article, we will review what the established practices are for monitoring and achieving linguistic quality in translation and localization, examine what the challenges are for linguistic quality assurance (LQA) and also attempt to make some predictions for the future of LQA in the localization industry.

Quality assessment and quality assurance: same book, different pages


Despite the fact that industry standards have been around for quite some time, in practice, terms such as ‘quality assessment’ and ‘quality assurance’, and sometimes even ‘quality evaluation’, are often used interchangeably. This may be due to a misunderstanding of what each process involves but, whatever the reason, this practice leads to confusion and could create misleading expectations. So, let us take this opportunity to clarify:
  • [Translation] Quality Assessment (TQA) is the process of evaluating the overall quality of a completed translation by using a model with pre-determined values which can be assigned to a number of parameters used for scoring purposes. Examples of such models are LISA, MQM, DQF, etc.
  • Quality Assurance “[QA] refers to systems put in place to pre-empt and avoid errors or quality problems at any stage of a translation job”. (Drugan, 2013: 76)
Quality is an ambiguous concept in itself and making ‘objective’ evaluations is a very difficult task. Even the most rigorous assessment model requires subjective input by the evaluator who is using it. When it comes to linguistic quality, in particular, we would be looking to improve on issues that have to do with punctuation, terminology and glossary compliance, locale-specific conversions and formatting, consistency, omissions, untranslatable items and others. It is a job that requires a lot of attention to detail and strict adherence to rules and guidelines – and that’s why LQA (most aspects of it, anyway) is a better candidate for ‘objective’ automation.

Given the volume of translated words in most localization projects these days, it is practically prohibitive in terms of time and cost to have in place a comprehensive QA process which would safeguard certain expectations of quality both during and after translation. Therefore it is very common that QA, much like TQA, is reserved for the post-translation stage. A human reviewer, with or without the help of technology, will be brought in when the translation is done and will be asked to review/revise the final product. The obvious drawback of this process is that significant time and effort could be saved if somehow revision could occur in parallel with the translation, perhaps by involving the translator herself in the process of tracking errors and making corrections along the way.

The fact that QA only seems to take place ‘after the fact’ is not the only problem, however. Volumes are another challenge – too many words to revise, too little time and too expensive to do it. To address this challenge, Language Service Providers (LSPs) use sampling (the partial revision of an agreed small portion of the translation) and spot-checking (the partial revision of random excerpts of the translation). In both cases, the proportion of the translation that is checked is about 10% of the total volume of translated text, and that is generally considered sufficient to say whether the whole translation is good or not. This is an established and accepted industry practice that was created out of necessity. However, one doesn’t need to have a degree in statistics to appreciate that this small sample, whether defined or random, is hardly big enough to reflect the quality of the overall project.

The progressive increase in the volume of text translated every year (also reflected in the growth of the total value of the language services industry, as seen below) and the increasing demands for faster turnaround times make it even harder for QA-focused technology to catch up. The need for automation is greater than ever before.
Source: Common Sense Advisory (2017)


Today we could classify QA technologies into three broad groups:
  • built-in QA functionality in CAT tools (offline and online),
  • stand-alone QA tools (offline),
  • custom QA tools developed by LSPs and translation buyers (mainly offline).
Built-in QA checks in CAT tools range from the completely basic to the quite sophisticated, depending on which CAT tool you’re looking at. Stand-alone QA tools are mainly designed with error detection/correction capabilities in mind, but there are some that use translation quality metrics for assessment purposes – so they’re not quite QA tools as such. Custom tools are usually developed in order to address specific needs of a client or a vendor who happens to be using a proprietary translation management system or something similar. This obviously presupposes that the technical and human resources are available to develop such a tool, so this practice is rather rare and exclusive to large companies that can afford it.

 

Consistency is king – but is it enough?


Terminology and glossary/wordlist compliance, empty target segments, untranslated target segments, segment length, segment-level inconsistency, different or missing punctuation, different or missing tags/placeholders/symbols, different or missing numeric or alphanumeric structures – these are the most common checks that one can find in a QA tool. On the surface at least, this looks like a very diverse range that should cover the needs of most users. All these are effectively consistency checks. If a certain element is present in the source segment, then it should also exist in the target segment. It is easy to see why this kind of “pattern matching” can be easily automated and translators/reviewers certainly appreciate a tool that can do this for them a lot more quickly and accurately than they can.

Despite the obvious benefits of these checks, the methodology on which they run has significant drawbacks. Consistency checks are effectively locale-independent and that creates false positives (the tool detects an error when there is none), also known as “noise”, and false negatives (the tool doesn’t detect an error when there is one). Noise is one of the biggest shortcomings of QA tools currently available and that is because of the lack of locale specificity in the checks provided. It is in fact rather ironic that the benchmark for QA in localization doesn’t involve locale-specific checks. To be fair, in some cases users are allowed to configure the tool in greater depth and define such focused checks on their own (either through existing options in the tools or with regular expressions).
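To illustrate the difference, below is a toy number check of the kind such tools run, first locale-independent and then with one locale-aware refinement that accepts the target locale's decimal separator. This is an illustrative sketch, not the code of any particular QA tool:

```python
import re

NUM = re.compile(r"\d+(?:[.,]\d+)*")

def naive_number_check(source, target):
    """Locale-independent check: every number in the source must appear
    verbatim in the target. Prone to false positives ("noise")."""
    tgt_nums = NUM.findall(target)
    return [n for n in NUM.findall(source) if n not in tgt_nums]

def locale_aware_number_check(source, target):
    """Also accept the number written with the target locale's separators,
    e.g. English '3.5' rendered as German '3,5'."""
    tgt_nums = set(NUM.findall(target))
    missing = []
    for n in NUM.findall(source):
        converted = n.translate(str.maketrans(",.", ".,"))  # swap separators
        if n not in tgt_nums and converted not in tgt_nums:
            missing.append(n)
    return missing

src = "The device weighs 3.5 kg."
tgt = "Das Gerät wiegt 3,5 kg."
print(naive_number_check(src, tgt))         # ['3.5']  -> false positive ("noise")
print(locale_aware_number_check(src, tgt))  # []       -> no error flagged
```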
 
Source: XKCD

But this makes the process more intensive for the user, and it comes as no surprise that the majority of users of QA tools never bother to do that. Instead, they perform their QA duties relying on the sub-optimal consistency checks which are available by default.

 

Linguistic quality assurance is (not) a holistic approach


In practice, for the majority of large scale localization projects, only post-translation LQA takes place, mainly due to time pressure and associated costs – an issue we touched on earlier in connection with the practice of sampling. The larger implication of this reality is that:
  • a) effectively we should be talking about quality control rather than quality assurance, as everything takes place after the fact; and 
  • b) quality assurance becomes a second-class citizen in the world of localization. This contradicts everything we see and hear about the importance of quality in the industry, where both buyers and providers of language services prioritise quality as a prime directive.
 As already discussed, the technology does not always help. CAT tools with integrated QA functionality have a lot of issues with noise, and that is unlikely to change anytime soon because this kind of functionality is not a priority for a CAT tool. On the other hand, stand-alone QA tools with more extensive functionality work independently, which means that any potential ‘collaboration’ between stand-alone QA tools and CAT tools can only be achieved in a cumbersome and intermittent workflow: complete the translation, export it from the CAT tool, import the bilingual file in the QA tool, run the QA checks, analyse the QA report, go back to the CAT tool, find the segments which have errors, make corrections, update the bilingual file and so on.

The continuously growing demand in the localization industry for managing increasing volumes of multilingual content on pressing timelines, while complying with quality guidelines, means that the challenges described above will have to be addressed soon. As the trend towards online technologies in translation and localization becomes stronger, there is an implicit understanding that existing workflows will have to be simplified in order to accommodate future needs in the industry. This can indeed be achieved with the adoption of bolder QA strategies and more extensive automation. The need in the industry for a more efficient and effective QA process is here now, and it is pressing. Is there a new workflow model which can produce tangible benefits both in terms of time and resources? I believe there is, but it will take some faith and boldness to apply it.

 

Get ahead of the curve


In the last few years, the translation technology market has been marked by substantial shifts in the market shares occupied by offline and online CAT tools respectively, with the online tools rapidly gaining ground. This trend is unlikely to change. At the same time, the age-old problems of connectivity and compatibility between different platforms will have to be addressed one way or another. For example, slowly transitioning to an online CAT tool while still using the same offline QA tool from your old workflow is as inefficient as it is irrational, especially in the long run.

A deeper integration between CAT and QA tools also has other benefits. The QA process can move up a step in the translation process. Why have QA only in post-translation when you can also have it in-translation? (And it goes without saying that pre-translation QA is also vital, but it would apply to the source content only so it’s a different topic altogether.) This shift is indeed possible by using API-enabled applications – which are in fact already standard practice for the majority of online CAT tools. There was a time when each CAT tool had its own proprietary file formats (as they still do), and then the TMX and TBX standards were introduced and the industry changed forever, as it became possible for different CAT tools to “communicate” with each other. The same will happen again, only this time APIs will be the agent of change.

Source: API Academy
 Looking further ahead, there are also some other exciting ideas which could bring about truly innovative changes to the quality assurance process. The first one is the idea of automated corrections. Much in the same way that a text can be pre-translated in a CAT tool when a translation memory or a machine translation system is available, in a QA tool which has been pre-configured with granular settings it would be possible to “pre-correct” certain errors in the translation before a human reviewer even starts working on the text. With a deeper integration scenario in a CAT tool, an error could be corrected in a live QA environment the moment a translator makes that error.

This kind of advanced automation in LQA could be taken even a step further if we consider the principles of machine learning. Access to big data in the form of bilingual corpora which have been checked and confirmed by human reviewers makes the potential of this approach even more likely. Imagine a QA tool that collects all the corrections a reviewer has made and all the false positives the reviewer has ignored, and then processes all that information and learns from it. With every new text processed, the machine learning algorithms make the tool more accurate about what it should and should not consider to be an error. The possibilities are endless.

Despite the various shortcomings of current practices in LQA, the potential is there to streamline and improve processes and workflows alike, so much so that quality assurance will not be seen as a “burden” anymore, but rather as an inextricable component of localization, both in theory and in practice. It is up to us to embrace the change and move forward.



---------------
 




Vassilis Korkas is the COO and a co-founder of lexiQA. Following a 15-year academic career in the UK, in 2015 he decided to channel his expertise in translation technologies, technical translation and reviewing into a new tech company. At lexiQA he is now involved in content development, product management, and business operations.

Note
This is the abridged version of a four-part article series published by the author on lexiQA’s blog: Part 1, Part 2, Part 3, Part 4

This link will also provide specific details on the lexiQA product capabilities.





Tuesday, July 11, 2017

Translation Industry Perspective: Utopia or Dystopia?

This is a follow-up post by Luigi Muzii on the evolving future of the "professional translation industry". His last post has already attracted a lot of attention based on Google traffic rankings. In my view, Luigi provides great value to the discussion on "the industry" with his high-level criticism of dubious industry practices, since much of what he points to is clearly observable fact. Bullshit marketing speak is a general problem across industries, but Luigi hones in on some of the terms that are thrown around at localization conferences and in the industry. You may choose to disagree with him, or perhaps see that there are good reasons to start a new, more substantive discussion. Among other things, Luigi challenges the improper usage of the term "agile" in the localization world in this post. The concept of agile comes from the software development world and refers, most often, to the rapid prototyping, testing and production implementation of custom software development projects.
Agile software development is a set of principles for software development in which requirements and solutions evolve through collaboration between self-organizing, cross-functional teams. It promotes adaptive planning, evolutionary development, early delivery, and continuous improvement, and it encourages rapid and flexible response to change.
To apply this concept to translation production work is indeed a stretch in my view. (Gabor, can you provide me a list of who uses this word (agile) the most on their websites?) While there is a definite change in the kinds of projects and the manner in which translation projects are defined and processed today, using terms from the software industry to describe minor evolutionary changes to very archaic process and production models is indeed worth questioning. The notion of "augmented translation" is also somewhat silly in a world where only ~50% of translators use the 1990s technology called translation memory, a database technology that is archaic at best.

It is my feeling that step one in making a big leap forward is to shift the focus from the segment level to the corpus level. Step two is to focus on customer conversation text rather than documentation text that few ever read. Step three is to gather proper metadata and build more robustness around this leverageable linguistic asset.

MT is already the dominant means of translating content in the world today, but few LSPs or translators really know how to use it skillfully. Change in the professional translation world is slow, especially if it is evolutionary and involves the skillful use of new technology (i.e. not DIY MT or DIY TMS). My long-term observation of how the industry has responded to and misused MT attests to this. The few who get MT right will very likely be the leaders who define the new agenda, as tightly integrated MT and HT work is key to rapid response, business process agility (not "agile"), and continuous improvement. Effectiveness is closely related to developing meaningful, technology-savvy translation process skills, which few invest in, and thus many are likely to be caught in the cross-hairs of new power players who might enter the market and change the rules.

Given the recent rumors of Amazon developing MT technology services, we should not be too surprised if, in the next few years, a new force emerges in "professional translation" that from the outset properly integrates MT + HT + outsourced labor (a Super Duper Mechanical Turk) with continuous-improvement machine learning and AI infrastructure, delivering an equivalent translation product at a fraction of the cost of an LSP like Lionbridge or Transperfect, for example. Amazon is already building MT engines across a large number of subject domains, so it has deep knowledge of how to do this at the billions-of-words-per-month scale, and it is also the largest provider of cloud services today. As I pointed out last year, the players who make the most money from machine translation are companies outside the translation industry. Amazon has already displaced Sears and several other major retailers, and they have the right skills to pull this off if they really wanted to. Check out the chart on retail store closings, which is largely driven by AMZN.

Even if they succeed in taking only 5-10% of the "translation market", it would still make them a multi-billion dollar LSP that could handle 3-word or 3-billion-word projects into 10 languages with equal ease, and do this with minimal need to labor through project management meetings and discussions about quality. It might also operate in the most automated, continuous-improvement modus operandi we have ever seen. So, think of a highly automated, web-based customer interaction language translation service with Google-scale MT producing better output quality across 50 subject domains, an AI backbone, and the largest global network of human translators and editors who get paid on delivery and are given a translator workbench that enhances and assists actual translation work at every step of the way. Think of a workbench that integrates corpus analysis, TM, MT, dictionaries, concordance, synonym lookup, and format management in one single interface, and which makes translation industry visions of "augmented translation" look like toys.

So get your shit together boys and girls cause the times they are a-changing.



The highlights and emphasis in this post and most of the graphics are my choices. I have also added some comments within Luigi's text in purple italics. The Dante quote below is really hard to translate. I will update it if somebody can suggest a better translation.

-----------------

Vuolsi così colà dove si puote/ciò che si vuole, e più non dimandare.
 

It is so willed there where is power to do/That which is willed; and farther question not.

Merriam-Webster defines technology as “the practical application of knowledge especially in a particular area.” The Oxford Dictionary defines it as “the application of scientific knowledge for practical purposes, especially in industry.” The Cambridge Dictionary defines technology as “(the study and knowledge of) the practical, especially industrial, use of scientific discoveries.” Collins defines technology as the “methods, systems, and devices which are the result of scientific knowledge being used for practical purposes”. More extensively, Wikipedia defines technology as “the collection of techniques, skills, methods, and processes used in the production of goods or services or in the accomplishment of objectives.”

This should be enough to mop away the common misconception that technology is limited to physical devices. In fact, according to Encyclopedia Britannica, hard technology is concerned with physical devices, while soft technology is concerned with human and social factors. Hard technologies cannot do without the corresponding soft technologies, which, however, are hard to acquire because they depend on human knowledge obtained through instruction, application, and experience. Technology is also divided into basic and high.

That said, language, to all intents and purposes, is a technology. A soft technology, and a basic one, yet highly sophisticated.

Why this long introduction on technology? Because we have been experiencing, for over half a century, an exponential technological evolution that we can hardly master. We have been adapting fast, as usual, but a little less so every day.

This exponential technological evolution is the daughter of the Apollo program, whose upshot has been universally acknowledged as the greatest achievement in human history. It stimulated practically every area of technology.

Some of the most important technological innovations from the Apollo program were popularized in the '80s and the '90s, and even the so-called translation industry is, in some ways, a spinoff of that season.

Translation Industry Evolution


Indeed, if the birth of the translation profession as we know it today can be traced back to the years between the two world wars of the last century, with the development of world trade, the birth of the translation industry can be set around the late 1980s with the spread of personal computing and office automation.

The products in this category were aimed at new customers, SMEs and SOHO, rather than the usual customers, the big companies that had the resources and the staff to handle bulky and complex systems. These products could be sold to a larger public, even overseas, but for worldwide sales to be successful, they had to speak the languages of the target countries. Translation then received a massive boost, and the computer software industry was soon confronted with the problem of adapting its increasingly multifaceted products to local markets.

The translation industry as we know it today is thus the abrupt evolution of a century-old single-person practice into multi-person shops. As a matter of fact, intermediaries (the translation agencies) existed even before tech companies helped translation become a global business, but their scope and ambition were strictly local. They were mostly multiservice centers, and their marketing policy essentially consisted of renewing an ad in the local yellow pages every year.

With the large software companies, the use of translation memories (TMs) also burst onto the scene. The software industry saw, in the typical TM feature of finding repetitions, a way to cut translation costs.

So far, TMs have been the greatest and possibly the single disruptive innovation in translation. As SDL’s Paul Filkin recently recalled, TMs were the application of the research of Alan Melby and his team at Brigham Young University in the early ‘80s.

Unable to bear the overhead that the large volumes from big-budget clients were generating, translation vendors devised a universal way to recover the lost profit: asking their own vendors for discounts, regardless of the nature of the jobs.

In the late 1990s, translation management systems (TMSs) began to spread; they were the only other innovation, far less important and much less impactful than TMs.

At the end of the first decade of the 2000s, free online machine translation (MT) engines started producing “good-enough” output, and since the surge in demand for global content over the last three decades has resulted in a far greater need to translate content than there is talent available, MT has been growing steadily and exponentially, to the point that today machines translate over 150 billion words per day, 300 times more than humans, in over 200 language combinations, serving more than 500 million users every month. (Actually, I am willing to bet the daily total is in excess of 500 billion words. KV)
We are now on the verge of full automation of translation services. The three main components of the typical workflow might indeed be almost fully automated: production, management, and delivery. Production could be almost fully automated with MT; TMSs have almost fully automated translation management and delivery. Why almost? Because the translation industry is not immune to waves and hype, yet it is largely and historically very conservative, a little reactive, and therefore a “late” adopter of technology. Manifest evidence of this is the infatuation with the agile methodology, and the consequent excitement affecting some of the most prominent industry players. Of course, prominence does not necessarily mean competence.


In fact, agile is rather a brand name, with the associated marketing hype, and as such is more a management fad with a limited lifespan. Localization processes can hardly be suited to the agile methodology, given its typical approach and process. If it is true that no new tricks can be taught to an old dog, then for agile to be viable, a century-old teaching and practicing attitude would have to be profoundly reformed. Also, although agile has become the most popular software project management paradigm, it is generally understood not to have really improved software quality, which is still considered low. (Here is a website called http://agileisbullshit.tumblr.com/ that documents the many problems of this approach. KV) In contrast, the translation industry has always claimed to be strongly focused on and committed to quality. If quality is the main concern for translation buyers, this possibly means that most vendors are still far from achieving a consistent level of appreciable quality. In fact, while lists of security defects for major software companies show high levels of open deficiencies, the complaints of translation users and customers around the world say that the industry works poorly.

Raising the bar, increasing the stakes, and pushing the boundary always a little further are all motives for adopting a new working methodology like agile. These motives translate into more, faster, and cheaper, but not necessarily better. Indeed, higher speed, greater agility, and lower process costs are supposed to make reworks and retrofitting expedient.

Anyway, flooding websites, blogs, presentations, and events with paeans praising the wonders of a supposedly fashionable methodology is not just ludicrous, it is of no help. Mouthing trendy words without knowing much about their meaning and the underlying concepts may seem like an effective marketing strategy, but, in the end, it is going to hurt when the recipients realize that it only disguises actual backwardness and ignorance.


The explosion of content has been posing serious translation issues for emerging global companies. The relentless relocation of businesses to the Web has made DevOps and continuous delivery the new paradigms, pushing the demand for translation automation even further.

Many in the translation community speak and act as if they were, and would go on, living in an imaginary and indefinitely remote place that possesses highly desirable or nearly perfect qualities for its inhabitants. Others see the future, however it is depicted, as an imaginary place where people lead dehumanized and often fearful lives.

In the meanwhile, according to a survey presented a few weeks ago in an article in the MIT Technology Review, there is a 50% chance of AI outperforming humans in all tasks within 45 years and of automating all human jobs within 120 years. Specifically, researchers predict AI will outperform humans in translating languages by 2024, writing high-school essays by 2026, writing a bestselling book by 2049, and working as a surgeon by 2053.

After all, innovation and translation have always been strange bedfellows. Innovations come from answering new questions, while the translation community has been struggling with the same old issues for centuries. Not surprisingly, any innovation in the translation industry is and will most certainly be sustaining innovation, perpetuating the current dimensions of performance.

Nevertheless, despite the fear that robots will destroy jobs and leave people unemployed, the market for translation technologies is growing, while translation Luddites remain convinced that translation technologies will not endanger translation jobs anytime soon and point instead to the lack of skilled professionals.

Indeed, the translation industry resembles a still-life painting, with every part of it seemingly immutable. A typical part of this painting is quality assessment, which still follows the costly and inefficient inspection-based error-counting approach and the related red-pen syndrome.

In this condition of increasing automation and disintermediation, a tradeoff on quality seems the most easily predictable scenario. As in the software industry, increasing speed and agility while controlling costs could make reworks and retrofits acceptable. MT will keep spreading, and post-editing will be the ferry to the other banks of global communication, allowing direct transit between points at a capital cost much lower than bridges or tunnels. MT is not Charon, though.

Charon as depicted by Michelangelo in his fresco The Last Judgment in the Sistine Chapel
 
The key question regarding post-editing is how much longer it will be necessary or even requested. Right now, most translation jobs are small, rushed, basic, and unpredictable in frequency, and yet production models are still basically the same as fifty years ago, despite the growing popularity of TMSs and other automation tools. This means that the same amount of time is required to manage big and tiny projects alike, as translation project management still hinges on the same rigid paradigms borrowed from tradition.

The most outstanding forthcoming innovations in this area will be confidence scoring and data-driven resource allocation. They have already been implemented and will be further improved when enough quality data becomes available. In fact, confidence scoring is almost useless if scores cannot first be compared with benchmarks and later with actual results. Benchmarks can only come from project history, while results must be properly measured, and the measures must be understood in order to be read and then classified.
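
To illustrate what data-driven resource allocation might look like in practice, here is a minimal sketch that compares segment-level MT confidence scores against benchmarks derived from project history and routes each segment accordingly. The domain names, thresholds, and the 0-1 score scale are illustrative assumptions, not any production system's values.

```python
# Sketch of confidence-based routing against historical benchmarks.

# Benchmarks per content domain, e.g. derived from past post-editing distance.
BENCHMARKS = {
    "ui_strings":   {"publish": 0.90, "light_pe": 0.70},
    "support_docs": {"publish": 0.85, "light_pe": 0.60},
}

def route_segment(domain: str, confidence: float) -> str:
    """Decide how much human effort a machine-translated segment gets."""
    thresholds = BENCHMARKS.get(domain, {"publish": 0.95, "light_pe": 0.75})
    if confidence >= thresholds["publish"]:
        return "publish raw MT"
    if confidence >= thresholds["light_pe"]:
        return "light post-editing"
    return "full human translation"

for domain, score in [("ui_strings", 0.93), ("support_docs", 0.65), ("legal", 0.50)]:
    print(domain, score, "->", route_segment(domain, score))
```

The sketch also makes the author's caveat visible: without trustworthy benchmarks per domain (and measured results to validate them against), the thresholds are arbitrary and the scores are almost useless.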

This is not yet in the skillset of most LSPs and is far, very far, from being taught in translation courses or translator training programs.

However, this is where human judgment will remain valuable for a long time. Not in quality assessment, which is still today not objective enough. Information asymmetry will remain a major issue, as there will always be a language pair totally outside the competence of the customer, who then has no way of knowing whether the product matches the promises made. Indeed, human assessment of translation quality, if based on the current universal approach, implies the use of a reference model, albeit an implicit one. In other words, everyone who is asked to evaluate a translation does so based on his/her own ideal.

MT capability will be integrated into all forms of digital communication, and MT itself will soon become a commodity. This will further drive post-editing to replace translation memory leveraging as the primary production environment in industrial translation over the next few years. It also means that, in the immediate future, the demand for post-editing of MT could escalate and find translation industry players unprepared.

In fact, the quality of MT output has been steadily improving, and by now it is quite impressive. This is what most translators should be worried about: expectations of professional translators will keep increasing.

With machines soon being better at almost everything humans do, translation companies will have to rethink their business. Following the exponential pace of evolution, MT will soon leave little room for the translation business. This does not mean that human translations will no longer be necessary, simply that today's 1 percent will shrink even further, much further. Humans will most likely be required where critical decisions must be made. This is precisely the kind of situation where information asymmetry plays a central role, in those cases where one party has no way of knowing whether the product received from the other party matches the promises made, for example when a translation must stand as evidence in court.

With technology making it possible to match resources, find the most suitable MT engine for a particular piece of content, predict quality, and so on, human skills will have to change. Already today, no single set of criteria guarantees an excellent translation, and the quality of people alone has little to do with the services they render and the fees they charge.

This implies that vendor management (VM) will be an increasingly crucial function. Assessing vendors of all kinds will require skills and tools that have never been developed in translation courses. Today, vendor managers are mostly linguists who have developed VM competence on their own; most of the time they cannot dedicate all their time and effort to vendor assessment and management, and are forced to make do with spreadsheets, without ever having the chance to attend HRM or negotiation courses. Vendor management systems (VMSs) have been around for quite some time now, but they are still unknown to most LSPs. And yet, translation follows a typical outsourcing supply chain, all the way down to freelancers.

So, translation industry experts, authorities, and players should stop bullshitting. True, the industry has been growing more or less steadily, even in a time of general crisis, but the translation business still accounts for a meager 1 percent of the total. In other words, even when translation buyers decide to waive the zero-translation option and have all or most of their content translated, growth remains linear.

Agile in translation is not the only mystification via marketing-speak being used in the localization business. Now it is the turn of “augmented translation” and “lights-out project management.” (Lights-out management (LOM) is the ability of a system administrator to monitor and manage servers by remote control.) Borrowing terms (not concepts) from other fields is clearly meant to disguise crap, look cool, and astonish the audience, but trying to look cool does not necessarily mean being cool. In the end, it can make one seem as if he/she does not really know what he/she is talking about. Even trendy models are shaped by precise rules and roles: using them only as magic words may backfire.

Nonetheless, this bad habit shows no sign of declining. Indeed, it still dominates industry events.

Localization World, for example, is supposed to be the world's premier conference when it comes to unveiling new translation technology and trends. Yet most of the over 400 participants gathered in Barcelona seemed to have spent their time in parties and social activities, while room topics strayed quite far from the conference theme of continuous delivery and the associated technologies and trends, despite the fact that demand for better automation and more advanced tools is growing steadily. Maybe it is true that the social aspect is what conferences are really for, but then why pick a theme and lay out presentations and discussions around it?

Presentations revolve around the usual arguments, widely and repeatedly dealt with both before and after the event, and are often slavish repetitions of commercial propositions. Questions and comments are usually not meant to be challenging or to generate debate, however stimulating and enriching that would be. Triviality rules, because no one is willing to burn material that is intended to be presented at other times to different audiences.

Anyway, change is coming fast and, once again, the translation industry is about to be found unprepared when the effects of the next innovation mess it up. So, it is time for LSPs, and their customers, to rethink their translation business and awaken from the drowsiness with which they have always received innovations. Jobs are changing quickly and radically too, and the gap between education and business, already large, will grow even wider. It is making less and less sense to imagine a future in translation as a profession for one's own children, and this is going to make it harder and harder to find young talents who are willing and able to work with the abundance of technology, data, and solutions available in the industry, however fantastic they may be.

This said it won’t be long before “skilled in machine learning” becomes the new “proficient in Excel”. And now very few in the translation community are concretely doing something about this. Choosing an ML algorithm will soon be as simple as selecting a template in Microsoft Word, but so far, very few translation graduates and even professional translators seem that proficient. In Word, of course.



Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, working in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation- and localization-related work.

This link provides access to his other blog posts.