Monday, August 1, 2016

Overview of Expert MT Systems - KantanMT

This is the second post in a series on Expert MT system vendors, focusing on an interview with Tony O'Dowd of KantanMT. As some of you may notice, Kantan is much further along the marketing curve, and some may say it is somewhat slick, but their client base, both enterprise and LSP, speaks for itself, and in my opinion Kantan is a serious contender for expert MT services for both enterprises and LSPs. Kantan has made many of the SMT support tools look very pretty and easy to use, but I suspect that those who understand what they are actually doing with these tools are likely to be more successful.
 
----------------------------

Can you provide a brief overview of your recent history so that we can better understand your company and technology? Have you seen growth in the use of MT during your history so far? 
 
KantanMT.com was founded out of an idea I had while preparing for my PhD at Dublin City University. I was very curious as to why so few enterprises and LSPs were using MT within their Localisation workflows. From market research I found that the main causes of this were cost, complexity, challenge and quality. 
  • Cost – Traditional MT systems were sold via Professional Services teams. This required expensive upfront development costs and long lead times to commission an engine. All 11 vendors identified in our market research sold their systems via this sales mechanism.
  • Complexity – The complexity of MT systems was identified as a significant barrier to usage and deployment. However, since traditional MT was sold via Professional Services-type sales engagements, there was no motivation or reason for complexity to reduce over time. Put simply, it was in the industry’s favour to talk up complexity; however, this was driving down usage and overall market penetration.
  • Challenge – MT systems were challenging to manage, improve and deploy. Since little or no attempt was made to resolve the complexity issue (it wasn’t in the industry’s interest to do this…), the challenge of managing MT was out of the reach of most organisations. Additionally, if an organisation decided to go it alone and build its own solution, it would have to recruit PhD-level staff, who were challenging to find and expensive to hire.
  • Quality – The real and perceived quality of MT systems was both an operational and a psychological barrier to entry. Translators felt threatened by MT due to a lack of successful industry implementations, and this led to distrust and an elevated sense that MT is just not good enough. Meanwhile, an explosion in web-based content and volumes meant that the industry was looking for mechanisms that were good enough (for purpose). Aligning these two polar views was a significant barrier to the widespread usage of MT.

The idea behind Kantan was driven by this analysis and the desire to solve these four fundamental challenges. At KantanMT.com we imagined a platform that would be easy to access, improve development time, measure and predict translation quality, and significantly address the cost, lead-time and quality challenges.
The KantanMT.com platform does only three things: it helps our community members develop, improve and deploy SMT solutions within their organisations. It addresses the four challenges and helps enterprises embrace SMT as a productivity enabler for their globalisation strategy.

Do you build the MT engines for your clients or do you let them do it for themselves?
The KantanMT platform is flexible enough to accommodate both approaches. The vast majority of LSPs build their own engines, as they view this as a necessary skill they need to embrace and understand within their organisations. For the ISV sector, we generally build the engines using our in-house professional services team and work with linguists to test the translation outputs prior to production release.
 
If the clients build the engine themselves – do you have a team available to help and guide this?
Yes - within the KantanMT engineering team is a group called Professional Services. This team comprises Solution Architects, Project Managers, Product Trainers and Engine Developers. Their primary role is to develop, improve and deploy engines for large enterprises. This team also has the support of our SRE (Site Reliability Engineering) team, whose main role is to ensure that KantanMT solutions stay running 24x7x365 on our cloud. Remember that the KantanMT cloud consists of over 700 servers, so this is a vital role in the management of large-scale MT solutions.

We provide the same support for everyone, depending on their need. We also work with the world’s largest LSPs and build and manage KantanMT engines for them too. 

While it has gotten very easy to build a low quality MT engine with Moses, it is my experience that these DIY engines very rarely deliver any business value. What do you do to ensure that you are delivering value for your customers beyond making it easy to build a Moses based engine?
One of the biggest challenges in customizing Statistical Machine Translation systems is rapidly improving the engine after its initial training. While for the most part, you can build a baseline engine using existing Translation Memory assets - the real challenge is how do you go beyond this and achieve higher levels of quality. More importantly, how can you do this rapidly and with minimum cost and effort? 
At KantanMT we tackled this problem in several ways:-
  • KantanBuildAnalytics – This is an interactive development environment, designed for localisation engineers and engine developers, that is used to build and improve KantanMT engines. It uses a range of automated scoring methods (e.g. BLEU, F-Measure and TER) to assess translation quality, a training normalisation environment that helps improve training candidates, extensive 12-step data cleansers, automatic Gap Analysers and version control. Of course, at the core of this environment are the automated scores, which are comparative measures that can only meaningfully be used during engine development.

  • KantanAnalytics – This is a technology, jointly developed by the Centre for Next Generation Localisation and KantanLabs, which can predict the quality of translation outputs. Displayed as a percentage value, it gives users of KantanMT translations guidance on the quality of generated outputs – the higher the score, the better the fluency and adequacy of the translation. This technology seamlessly integrates with the industry-standard fuzzy match scoring mechanism so that it’s easy for Translation Project Managers to identify the quality of MT outputs.
  • KantanPEX – PEX stands for Post-Editing Automation. PEX is a series of rules that can be applied to an engine to dynamically modify translation outputs. The KantanMT community use this to address inconsistencies within translations and rapidly ensure engines comply with their quality expectations.
  • KantanTotalRecall – This is a high-speed, low-latency cloud-based translation memory which is automatically built using the training data uploaded by our clients. KantanMT is a fusion of both TM (TotalRecall) and MT (KantanMT) technologies, which seamlessly blends the best matches from TM with the best translations from MT.
  • KantanLQR – LQR stands for Language Quality Review and KantanLQR is an environment built into the heart of the KantanMT platform which provides a fully interactive workflow for Professional Translators to score the quality of translations. The workflow is fully distributed, highly customisable and Project Managers can determine translation quality in real-time using the industry standard Multidimensional Quality Metrics (MQM). More importantly, the feedback and post-edits from the Professional Translators can be used to fine-tune and improve the KantanMT translation outputs.
  • KantanNER – This is Named Entity Recognition and is built into every KantanMT engine. This is a highly customisable component that is used to ensure numerical data (such as dates, times, currencies, specification data, text entities) are handled outside of the decoding process. For example, we can detect imperial measurements such as feet, inches and miles and convert these measurements to metres, centimetres and kilometres. KantanNER is part of the GENTRY NLP layers developed at KantanLabs and is easy to customise and extend to embrace the precise requirements of the KantanMT community.
  • GENTRY – Gentry is the NLP programming kernel of each KantanMT engine. It’s easy to extend and customise. For example, you can programme additional segmentation and tokenisation rules, extend the 12-step Kantan data cleansers, implement pre-ordering and re-ordering models and even create text pre-processors and post-processors to ensure each KantanMT engine is compliant with the quality expectations of the KantanMT community.
  • KantanFleet – Community members that wish to start translating immediately and avoid the build, test and deploy process can use KantanFleet. This is a large collection of pre-built and fully-tested engines in Legal, Financial, Medical, IT and General domains. At present there are over 100 KantanFleet engines. Each KantanFleet engine can easily be extended, and customised engines can be built using it as a baseline. This has a significant impact on the time to build, improve and deploy customised engines.
  • KantanLibrary – KantanLibrary is used by community members that may not have sufficient training data to customise their own engines. KantanLibrary is a collection of pre-cleansed, scored and publicly available training data sets. These have all been tested, cleansed and optimised for Legal, Financial, Medical, IT, General and Conversational domains.
  • KantanTemplates – This provides an intuitive and powerful way to customise, improve and deploy multiple KantanMT engines that share common training data-sets. Using KantanTemplates™, shared data-sets of bilingual and terminology training files can be used across multiple KantanMT engines, which allows them to be easily modified and updated all at once. KantanTemplates helps you easily customise multiple Machine Translation engines and provides cutting edge analytics and reporting tools to track your progress.
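To make the automated scores mentioned under KantanBuildAnalytics a little more concrete, here is a minimal Python sketch of two of them: a unigram F-measure and a word-level edit rate. This is not KantanMT's implementation. The function names are hypothetical, tokenisation is plain whitespace splitting, and the edit rate is a simplification of TER, which also allows block shifts. As noted above, such scores are comparative: they are only meaningful when comparing engine versions on the same test set.

```python
from collections import Counter


def f_measure(hypothesis: str, reference: str) -> float:
    """Unigram F1 between an MT output and a reference translation."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # clipped count of matching words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def word_edit_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length.

    Simplified TER: insertions, deletions and substitutions only, no shifts.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            current.append(min(previous[j] + 1,          # drop a hypothesis word
                               current[j - 1] + 1,       # add a reference word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1] / len(ref)


reference = "the valve must be closed before maintenance"
engine_v1 = "the valve should be closed before service"
print(round(f_measure(engine_v1, reference), 3))       # 0.714
print(round(word_edit_rate(engine_v1, reference), 3))  # 0.286
```

A higher F-measure and a lower edit rate after retraining suggest the engine has moved closer to the reference translations, which is the kind of comparison an engine developer would make between builds.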

What are some key tools and technologies that you provide on your platform to give your customers leverage and maximize their chances of success?
KantanMT is a complete platform for the development, improvement and deployment of SMT within small, medium and large enterprises. It consists of a large collection of technologies and innovations, all highly integrated with each other to help accelerate the deployment of high quality SMT. Some of these technologies are:- 

  • KantanWidgets – This is a suite of productivity apps that can be used to integrate KantanMT engines into the heart of any localisation workflow. KantanTranslate™ is an app that provides real-time, on-demand translation of text snippets; KantanDesktopApp™ is an app that can translate one or more documents directly from your desktop; and KantanPlugins™ are a collection of application plugins for MS Office and a range of browsers that provide real-time translation of content. All KantanWidgets are connected directly to the KantanMT engines developed by the KantanMT community.
  • KantanAPI – this is a RESTful interface into the complete KantanMT platform. It provides both synchronous and asynchronous functionality so that the KantanMT community can build applications exploiting their KantanMT engines.
  • KantanAutoScale – This is a fully distributed, cloud-based deployment technology that helps the KantanMT community release high-speed, high-capacity engines on the cloud. Using KantanAutoScale technology, KantanMT deployments scale up and scale down based on inbound traffic. This provides the optimal balance of speed and cost for clients that wish to translate at scale.
  • KantanSwift – This technology is applied to all engines that require super-fast launch times. The KantanMT community uses this technology in conjunction with KantanAutoScale to manage large deployed KantanMT engines hosted on hundreds of servers.
  • KantanTemplates – This provides an intuitive and powerful way to customise, improve and deploy multiple KantanMT engines that share common training data-sets. Using KantanTemplates™, shared data-sets of bilingual and terminology training files can be used across multiple KantanMT engines, which allows them to be easily modified and updated all at once. KantanTemplates helps you easily customise multiple Machine Translation engines and provides cutting edge analytics and reporting tools to track your progress.
  • KantanLQR – LQR stands for Language Quality Review and KantanLQR is an environment built into the heart of the KantanMT platform which provides a fully interactive workflow for Professional Translators to score the quality of translations. The workflow is fully distributed, highly customisable and Project Managers can determine translation quality in real-time using the industry standard Multidimensional Quality Metrics (MQM). More importantly, the feedback and post-edits from the Professional Translators can be used to fine-tune and improve the KantanMT translation outputs.
  • KantanPEX – PEX stands for Post-Editing Automation. PEX is a series of rules that can be applied to an engine to dynamically modify translation outputs. The KantanMT community use this to address inconsistencies within translations and rapidly ensure engines comply with their quality expectations.
  • KantanTotalRecall – This is a high-speed, low-latency cloud-based translation memory which is automatically built using the training data uploaded by our clients. KantanMT is a fusion of both TM (TotalRecall) and MT (KantanMT) technologies, which seamlessly blends the best matches from TM with the best translations from MT.
  • KantanBuildAnalytics – This is an interactive development environment, designed for localisation engineers and engine developers, that is used to build and improve KantanMT engines. It uses a range of automated scoring methods (e.g. BLEU, F-Measure and TER) to assess translation quality, a training normalisation environment that helps improve training candidates, extensive 12-step data cleansers, automatic Gap Analysers and version control. Of course, at the core of this environment are the automated scores, which are comparative measures that can only meaningfully be used during engine development.
  • KantanAnalytics – This is a technology, jointly developed by the Centre for Next Generation Localisation and KantanLabs, which can predict the quality of translation outputs. Displayed as a percentage value, it gives users of KantanMT translations guidance on the quality of generated outputs – the higher the score, the better the fluency and adequacy of the translation. This technology seamlessly integrates with the industry-standard fuzzy match scoring mechanism so that it’s easy for Translation Project Managers to identify the quality of MT outputs.
  • KantanNER – This is Named Entity Recognition and is built into every KantanMT engine. This is a highly customisable component that is used to ensure numerical data (such as dates, times, currencies, specification data, text entities) are handled outside of the decoding process. For example, we can detect imperial measurements such as feet, inches and miles and convert these measurements to metres, centimetres and kilometres. KantanNER is part of the GENTRY NLP layers developed at KantanLabs and is easy to customise and extend to embrace the precise requirements of the KantanMT community.
  • GENTRY – Gentry is the NLP programming kernel of each KantanMT engine. It’s easy to extend and customise. For example, you can programme additional segmentation and tokenisation rules, extend the 12-step Kantan data cleansers, implement pre-ordering and re-ordering models and even create text pre-processors and post-processors to ensure each KantanMT engine is compliant with the quality expectations of the KantanMT community.
  • KantanFleet – Community members that wish to start translating immediately and avoid the build, test and deploy process can use KantanFleet. This is a large collection of pre-built and fully-tested engines in Legal, Financial, Medical, IT and General domains. At present there are over 100 KantanFleet engines. Each KantanFleet engine can easily be extended, and customised engines can be built using it as a baseline. This has a significant impact on the time to build, improve and deploy customised engines.
  • KantanLibrary – KantanLibrary is used by community members that may not have sufficient training data to customise their own engines. KantanLibrary is a collection of pre-cleansed, scored and publicly available training data sets. These data sets have all been tested, cleansed and optimised for Legal, Financial, Medical, IT, General and Conversational domains.
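The KantanNER example above (detecting imperial measurements and converting them to metric) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions and not how KantanNER or GENTRY is implemented: the function name, the list of recognised unit spellings and the two-decimal rounding are all hypothetical choices. In a real engine the detected entity would be handled outside the decoder and reinserted afterwards; here the text is simply rewritten in place.

```python
import re

# Conversion factors to metric. These are exact by international definition.
TO_METRIC = {
    "inch": (2.54, "cm"),
    "foot": (0.3048, "m"),
    "mile": (1.609344, "km"),
}

# Surface forms recognised by this sketch, mapped to a canonical unit name.
# Deliberately small: ambiguous forms such as "in" are left out.
ALIASES = {
    "inch": "inch", "inches": "inch",
    "foot": "foot", "feet": "foot", "ft": "foot",
    "mile": "mile", "miles": "mile",
}

MEASUREMENT = re.compile(
    r"(\d+(?:\.\d+)?)\s*(inches|inch|feet|foot|ft|miles|mile)\b",
    re.IGNORECASE,
)


def convert_measurements(text: str, digits: int = 2) -> str:
    """Replace imperial length expressions with their metric equivalents."""

    def to_metric(match: re.Match) -> str:
        value = float(match.group(1))
        unit = ALIASES[match.group(2).lower()]
        factor, metric_unit = TO_METRIC[unit]
        return f"{value * factor:.{digits}f} {metric_unit}"

    return MEASUREMENT.sub(to_metric, text)


print(convert_measurements("The pipe is 10 feet long and 2 inches wide."))
# The pipe is 3.05 m long and 5.08 cm wide.
```

Keeping the rules in plain dictionaries is one way to make such a component easy to extend, for example with weights, volumes or currencies, which is the kind of customisation the KantanNER description refers to.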

Do you have any stock engines ready to run for those clients who need something quick & dirty and cannot use the public Google or Microsoft engines? How do they compare to the generic free engines?
The KantanMT platform comes pre-configured with collections of pre-built engines and pre-cleansed training catalogues. These were described in the answer above:-
  • KantanFleet – Pre-built engines described above.
  • KantanLibrary – Cleaned and optimized base training data also described above.

Do you gather translator feedback to better understand their PEMT experience and do you have any plans to improve this feedback cycle and get translators more directly engaged?
Yes, we do. In fact, we built an environment called KantanLQR to focus on this one aspect of engine development. Put simply, to achieve the highest level of production translation quality, it’s imperative that Professional Translators are involved in the development and improvement of MT engines. More importantly, a structured error typology (similar to MQM, the one used in KantanLQR) is required to capture, organise and then analyse the feedback from the Professional Translators. This feedback can be harnessed to fine-tune vocabulary and terminology selection, improve consistency and improve the overall fluency and adequacy of translation outputs.
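As a rough illustration of why a structured error typology helps, the Python sketch below tallies reviewer annotations by category and severity into a penalty per 100 words. It is a sketch under assumptions, not KantanLQR's workflow: the class and function names, the category labels and the severity weights are all hypothetical, and the weights are not values prescribed by MQM.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    segment_id: int
    category: str  # e.g. "terminology", "fluency", "accuracy"
    severity: str  # "minor", "major" or "critical"


# Illustrative penalty weights only (an assumption, not taken from MQM).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def summarise(annotations: list[ErrorAnnotation], word_count: int) -> dict:
    """Aggregate reviewer annotations into per-category penalties."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    by_category: Counter = Counter()
    for annotation in annotations:
        by_category[annotation.category] += SEVERITY_WEIGHTS[annotation.severity]
    total = sum(by_category.values())
    worst = max(by_category, key=by_category.get) if by_category else None
    return {
        "penalty_per_100_words": 100 * total / word_count,
        "by_category": dict(by_category),
        "worst_category": worst,
    }


review = [
    ErrorAnnotation(1, "terminology", "major"),
    ErrorAnnotation(1, "fluency", "minor"),
    ErrorAnnotation(2, "terminology", "minor"),
    ErrorAnnotation(3, "accuracy", "critical"),
]
report = summarise(review, word_count=200)
print(report["penalty_per_100_words"])  # 8.5
print(report["worst_category"])         # accuracy
```

Because every error carries a category, the summary points to where the effort should go (terminology lists, cleaner training data, post-editing rules) rather than only saying that a translation was "bad".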

What is your approach to pricing? (Please be as vague or as specific as you want to be.)
KantanMT operates a Pay-as-you-go model whereby the KantanMT community simply subscribe to a monthly plan. The monthly plans include access to all the platform features, technologies and applications. Included in each monthly subscription is a generous free-word allowance which ensures that our community can keep their costs low when embracing MT within their localisation workflow. 

What are you doing to ensure your technology stays current and relevant in future? As you may have heard, Facebook thinks that SMT is done and that the future is all about Neural MT. Do you have any plans in this area?
KantanLabs is the advanced research group within the KantanMT organisation. It is headed up by Dr Dimitar Shterionov. KantanLabs’ Chief Scientific Advisor is Professor Andy Way from the ADAPT Centre at DCU, Ireland. KantanLabs’ primary objective is to explore new ways and novel approaches to Statistical Machine Translation. At present we have three research projects up and running. These are:-
  • Optimised Training Methods and Adaptive MT – this is a joint project between ADAPT Centre and KantanMT.com which is focusing on ways of accelerating the training process and exploring adaptive MT technologies so that KantanMT engines can be retrained superfast with the latest translation suggestions from Professional Translators. The first deliverable from this project (which accelerates the training time for large engines by as much as 70%) will be launched very shortly.
  • Re-Ordering Models for Challenging and Complex Languages – this is a joint project with EAMT which is exploring interesting ways of re-ordering complex languages for the purposes of improving translation quality.
  • Exploiting Neural Networks in a Commercial Environment – we have just recently announced this project in conjunction with the Marie Curie Foundation. This will be a two-year research project on how neural networks can be exploited in statistical machine translation systems.
 
What are the most promising areas for MT in future in your opinion?
In the immediate timeframe we are targeting adaptive MT and interesting re-ordering models for complex languages. However, we cannot ignore the potential impact that Neural Networks may have on statistical methods. An area I’m particularly interested in is using a hybrid combination of neural and phrase-based SMT approaches: the best of both worlds, so to speak. Another area of significant importance for us is Named Entity recognition and support. This is especially key in the hospitality and eCommerce industries.

Have you found the TDA (TAUS Data Association) data useful in any of your MT engine development?
We have never used the TAUS data for the purposes of building KantanMT engines, so I’m not in a position to comment on the data. However, I believe that TAUS is incredibly important for the industry, as it has fostered a better understanding and a higher level of engagement, and has seeded the industry with success stories and implementations of MT in commercial contexts.

Are there any LSPs (other than SDL) that you see as really understanding MT, and know how to develop high quality engines and use MT to solve big translation problems?
Yes, many of our Partners are now experts on developing, improving and deploying large scale MT systems to address very large translation challenges.
  1. For example, MATRIX in Germany has built a system to translate technical information for a market-leading, publicly traded engineering client. This system is built on the KantanMT platform and translates documentation into 12 languages. The source language for these engines is German.
  2. Another one of our partners (which is the largest privately owned LSP) translated the entire photograph catalogue of istock.com last year. The source language was English, and the project resulted in over 750M source words being translated into 11 languages.
  3. Another one of our LSP clients, Milengo, recently translated the entire beauty catalogue of the largest Nordic eCommerce platform. They achieved this feat in less than 3 weeks. They are now doing this again into one additional Nordic language.
The KantanMT partner network consists of LSPs, all of which are now in a position to implement MT within their localisation workflows. They can do this using the KantanMT platform, generally after completing our MT Orientation and Training Programme.

1 comment:

  1. As long as scientists fail to define intelligence in a natural way, we only know very little of the logic of language. Currently, in knowledge technology, rich and meaningful sentences are degraded to “a bag of keywords”.

    So, words like definite article “the”, conjunction “or”, possessive verb “has/have” and past tense verbs “was/were” and “had” are simply ignored in knowledge technology, while these non-keywords provide information to our brain about the structure of the sentence. Hence the problems with MT.


    More on the logic of language:

    For centuries, algebra has supported reasoning based on the present tense verb “is/are”, like:

    > Given: “Every father is a man”
    > Given: “John is a father”

    • Logical conclusion:
    < “John is a man.”

    But humans are also capable of possessive reasoning (using possessive verb “has/have”), and they are able to reason in past tense:

    > Given: “James was the father of Peter”

    • Generated conclusions:
    < “Peter has no father anymore”
    < “Peter had a father, called James”

    So, why doesn't algebra support past tense reasoning, and possessive reasoning?

    Humans are able to generate questions:

    > Given: “Every person is a man or a woman”
    > Given: “Addison is a man and a woman”

    • Generated question:
    < “Is Addison a man or a woman?”

    So, why doesn't algebra support generation of questions?

    I defy anyone to beat the simplest results of my natural language reasoner in a generic way: http://mafait.org/challenge/.

    It is open source software. So, everyone is invited to join.
