A number of milestones have been reached in the last couple of months in the ACCEPT project. First, version 2 of the ACCEPT pre-edit plug-in has been made available for download to all registered users of the ACCEPT Portal. This version brings a number of bug fixes as well as new features, including the ability to edit text in the plug-in itself and the options to learn new words and to ignore rules. Second, an instance of this pre-edit plug-in has been made available to all registered users of the English Norton forum, which allows us to collect valuable feedback. A short video tutorial was also created to introduce the new functionality. Third, some post-editing demo projects have been added to the Post-Edit section of the ACCEPT Portal. These demos help newly registered users better understand how post-editing can be performed with the ACCEPT post-edit plug-in. This plug-in was recently presented at the MT Summit XIV Workshop on Post-editing Technology and Practice. Finally, the documentation of the ACCEPT Portal and all of its components has been greatly enhanced for both end users and developers.
During the past few months, ACCEPT partners Acrolinx and UNIGE have developed, finalized and tested rules for pre-editing French and English content, specialized for the two target communities: support forum users and volunteer translators. These rules help Norton Community users to improve their writing style and to simplify constructions that are notoriously hard to translate automatically, thereby improving the readability of helpful machine-translated posts. Likewise, the rules help volunteer translators who use MT systems to pre-process their input documents and so reduce their post-editing workload. Additionally, we have created sets of rules that are well suited to automatic application, so that translation quality can be improved without any human intervention.
We have conducted large-scale evaluations with thousands of pre-edited instances and dozens of judges to measure the impact of the rules on the MT output quality. The results have shown that the developed rules consistently improve the output of the ACCEPT baseline translation system. Depending on the translation scenario, 40% to 60% of rule-supported pre-editing actions led to better translation results. Only in relatively few cases did the translation quality degrade even though the input improved.
In particular, Geneva University has developed French pre-editing rules for the forum with substantial impact on the baseline, both for humans and for machines (Gerlach et al., 2013a). The rules focus mainly on four phenomena that proved troublesome for SMT of French forums into English: word confusion (due to homophones), informal and familiar French, punctuation, and local structural divergences between French and English. The main criteria for their definition have been precision and impact on translation into English. Evaluation shows a good correlation between contrastive human quality evaluation and improvement in post-editing times (Gerlach et al., 2013b), which suggests that pre-editing rules could reduce post-editing effort.
Comparisons with other methods, included in WP4, suggest that pre-editing is competitive with data-driven methods. In the two studies carried out so far, pre-editing was either slightly better than these methods (Rayner et al., 2012) or about the same (Bouillon et al., 2013). Both experiments concerned problems commonly encountered when translating French into English using statistical machine translation methods. The first is concerned with familiar second-person constructions (available training data is taken from sources like the proceedings of the European Parliament, where 'tu', 'te', and associated verb forms hardly ever occur); the second involves correction of homophone errors (e.g. 'sa' for 'ça' or 'son' for 'sont'), which are frequent in informal text like forum chat. Both experiments were collaborations between the Geneva and Edinburgh groups.
For the large-scale evaluations, the ACCEPT project brought together over 150 volunteers from the Symantec and TWB communities to run its experiments and analyses. These two ACCEPT communities will also be involved in the evaluation phase.
Anyone wanting to try out the pre-editing rule sets can do so by using the ACCEPT pre-editing API (ACCEPT Portal). The new pre-editing rules have already been rolled out in the Norton Community forum. With the help of Symantec, the usability of the rules has improved considerably: detected issues are presented more clearly and explained more concisely, and users can now ignore unwanted rules.
Finally, one of the project’s key aims was to allow forum users to auto-translate and understand the content of any text that is not in their native language. Having identified tips and tricks for monolingual post-editing, the ACCEPT project has been able to guide forum users in identifying missing meaning or gaps in translations, and even in correcting the MT output.
The main work of Edinburgh has been in the “Improving Statistical Machine Translation (SMT)” work package of the project, focusing on two main topics: domain adaptation and dealing with noisy data.
The problem of domain adaptation occurs in SMT when the training data differs in some systematic way from the test data. This is in fact the normal situation, since SMT requires large quantities of training data and there is rarely sufficient data in the domain required. The challenge is to make use of this “out-of-domain” data so that it has a positive effect on translation quality.
Methods for domain adaptation include techniques for selecting and weighting appropriate data, and techniques for combining models built from different domains. At the NAACL conference this year, we presented a paper describing a new technique for combining (interpolating) models which allows more direct targeting of measures of translation performance. We also performed a comparison of several recently published techniques for domain adaptation in the WMT (Workshop on Statistical Machine Translation) shared task.
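To make the idea of combining models concrete, here is a minimal sketch of plain linear interpolation between an in-domain and an out-of-domain translation table. It is not the technique from the NAACL paper: that work targets translation performance measures directly, whereas this sketch simply picks the weight that maximizes the likelihood of a small development set. The function names, the toy probabilities, and the grid search are all assumptions made for illustration.

```python
import math

def interpolate(p_in, p_out, lam):
    """Linear interpolation: lam * p_in + (1 - lam) * p_out for every entry."""
    keys = set(p_in) | set(p_out)
    return {
        k: lam * p_in.get(k, 0.0) + (1.0 - lam) * p_out.get(k, 0.0)
        for k in keys
    }

def dev_log_likelihood(model, dev_pairs, floor=1e-9):
    """Sum of log-probabilities of the development pairs (floored to avoid log 0)."""
    return sum(math.log(max(model.get(pair, 0.0), floor)) for pair in dev_pairs)

def best_weight(p_in, p_out, dev_pairs, steps=10):
    """Grid-search the interpolation weight on the development set."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(
        candidates,
        key=lambda lam: dev_log_likelihood(interpolate(p_in, p_out, lam), dev_pairs),
    )

# Toy (source, target) translation probabilities; the values are invented.
P_IN = {("sa", "its"): 0.6, ("sa", "that"): 0.4}    # small in-domain model
P_OUT = {("sa", "its"): 0.9, ("sa", "that"): 0.1}   # large out-of-domain model

if __name__ == "__main__":
    dev = [("sa", "that")] * 3  # forum-like dev data favours the in-domain reading
    lam = best_weight(P_IN, P_OUT, dev)
    print("chosen in-domain weight:", lam)
    print("interpolated table:", interpolate(P_IN, P_OUT, lam))
```

A real system would interpolate full phrase tables or language models and tune the weights on held-out in-domain text; the grid search here only stands in for that tuning step.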
In collaboration with Geneva, we have been looking at methods for coping with noisy input data in SMT systems. One particular problem that we have considered is that French text contains large numbers of misspellings caused by the language's many homophones. As well as developing a rule-based spelling correction system, we also used a pronunciation dictionary to generate a list of possible misspellings. This list is weighted to create a “confusion network”, which is given to the translation system so that it can use its knowledge of the output language to choose the best translation. Both techniques were found to be effective and complementary, and the work was presented at the ACL HyTra (Hybrid Translation Systems) workshop.
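The confusion-network idea can be sketched in a few lines, under several assumptions: the pronunciation dictionary below is a toy with invented, simplified phoneme strings, and the weighting scheme (a fixed share for the word as written, the remainder split evenly among its homophones) is a placeholder rather than the weighting used in the actual experiments. Each input token becomes a column of weighted alternatives that a decoder could choose between.

```python
from collections import defaultdict

# Toy pronunciation dictionary (hypothetical entries, simplified phoneme strings).
PRONUNCIATIONS = {
    "sa": "sa", "ça": "sa",
    "son": "sO~", "sont": "sO~",
    "chat": "Sa",
}

def homophone_sets(pron_dict):
    """Map each word to the set of dictionary words sharing its pronunciation."""
    by_sound = defaultdict(set)
    for word, sound in pron_dict.items():
        by_sound[sound].add(word)
    return {word: by_sound[sound] for word, sound in pron_dict.items()}

def confusion_network(tokens, pron_dict, keep_weight=0.6):
    """Build one column of (word, weight) alternatives per input token.

    The word as written keeps `keep_weight`; its homophones share the rest.
    Tokens without known homophones get a single alternative with weight 1.0.
    """
    sets = homophone_sets(pron_dict)
    network = []
    for tok in tokens:
        alternatives = sorted(sets.get(tok, {tok}) - {tok})
        if not alternatives:
            network.append([(tok, 1.0)])
            continue
        share = (1.0 - keep_weight) / len(alternatives)
        network.append([(tok, keep_weight)] + [(alt, share) for alt in alternatives])
    return network

if __name__ == "__main__":
    # "ils son la" contains the homophone error 'son' for 'sont'.
    for column in confusion_network(["ils", "son", "la"], PRONUNCIATIONS):
        print(column)
```

In this sketch the choice between 'son' and 'sont' is left open; in the setup described above, it is the translation system's model of the output language that resolves it.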