The use of residual analysis to improve the error rate accuracy of machine translation

The aim of the study is to compare two different approaches to machine translation—statistical and neural—using automatic MT metrics of error rate and residuals. We examined four available online MT systems (statistical Google Translate, neural Google Translate, and two European Commission MT tools—statistical mt@ec and neural eTranslation) through their products (MT outputs). We propose using residual analysis to improve the accuracy of machine translation error rate. Residuals represent a new approach to comparing the quality of statistical and neural MT outputs. The study provides new insights into evaluating machine translation quality from English and German into Slovak through automatic error rate metrics. In the category of prediction and syntactic-semantic correlativeness, statistical MT showed a significantly higher error rate than neural MT. Conversely, in the category of lexical semantics, neural MT showed a significantly higher error rate than statistical MT. The results indicate that relying solely on the reference when determining MT quality is insufficient. However, when combined with residuals, it offers a more objective view of MT quality and facilitates the comparison of statistical MT and neural MT.

Ľubomír Benko 1*, Dasa Munkova 1, Michal Munk 1,2, Lucia Benkova 1 & Petr Hajek 2
Although relying on human translation offers more accuracy and fluency, human translation is of limited efficiency, and it is challenging for it to meet the needs of long text translation 1. This limitation stimulates the search for new approaches to translation. One such approach is the implementation of intelligent algorithms within machine translation (MT) systems. Various algorithms address the issues of MT systems, such as RNN encoding-decoding in existing log-linear SMT, transfer learning, self-attention mechanisms, unsupervised training algorithms, adversarial augmentation, reinforcement learning, neural MT (LSTM and transformer), hybrid (neural MT + statistical MT), rule-based MT, phrase-based MT, and others 2. Currently, machine translation employs deep neural network (NN) learning, which initially learns rules and then automatically produces translations. This approach has yielded very good results for tasks with sufficient labelled data for learning. However, when there is little tagged data, machine translation performs poorly 3. The primary obstacle for market-oriented neural MT systems or applications lies in their weak translation quality, which fails to meet users' needs 4. MT evaluation is a fundamental step in improving the performance of MT systems. The continuous enhancement of the performance of current neural MT systems is closely tied to research on evaluating the quality of MT output based on sentence comparison 4. This comparison involves two inseparable aspects: qualitative/human and quantitative/automatic evaluation. The first serves as the foundation and guiding principle for the second, while the latter represents the digital outcome of the former.
Two main approaches exist for evaluating MT systems: human/manual and automatic evaluation. Vague criteria and scales for manual translation quality assessment, along with the differing sensitivity of human evaluators to translation errors, may result in judge subjectivity, which can be reflected in poor consistency and instability of the evaluation results 5. Human evaluation is an effective way to assess translation quality, but it is challenging to find reliable bilingual annotators 6. In addition to poor consistency and subjectivity, manual evaluation is costly in both money and time; however, unlike automatic evaluation, it does not require a reference translation. The advantages of automatic evaluation lie in its objectivity, consistency, stability, speed, reusability, and language independence. It is cost-effective and easy to use for comparing multiple systems, but at the expense of quality 6. Furthermore, automatic evaluation requires a reference, i.e., a human translation (gold standard), since the evaluation is based on comparing the MT output with the reference translation 7-9. Automatic metrics of MT evaluation only capture lexical similarity and correctly measure neither semantic and grammatical diversity nor syntactic structures 6,10.
In comparison with WER, which focuses on word operations only, TER considers shifts as part of the edit operations. The higher the score of the error rate metrics, the worse the translation quality, and vice versa. The main motivation for using character-based metrics is their improved performance in evaluating morphologically rich languages like Slovak and other Slavic languages 19,20.
CharacTER is a character-level metric inspired by the TER metric 21. It is defined as the minimum number of character edit operations required to match an MT output with a reference, normalized by the length of the MT output. CharacTER first performs shift edits at the word level; then, the shifted MT output sequence and the reference are split into characters, and the Levenshtein distance between them is computed.
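As an illustration, a character-level score in the spirit of CharacTER can be sketched as follows. This is a simplified, hypothetical sketch, not the official implementation: it omits the word-level shift step and computes only the character Levenshtein distance normalized by the MT output length.

```python
# Hypothetical sketch of a CharacTER-style score (not the official tool):
# word-level shift edits are omitted; only the character-level Levenshtein
# distance is computed and normalized by the length of the MT output.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def character_score(hypothesis: str, reference: str) -> float:
    """Character edit distance normalized by the length of the MT output."""
    if not hypothesis:               # degenerate case: empty MT output
        return 1.0
    return levenshtein(hypothesis, reference) / len(hypothesis)
```

Note that, as in the definition above, the distance is divided by the hypothesis (MT output) length rather than the reference length used by WER and TER.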
Cross-lingual optimized metric for the evaluation of translation (COMET) is a PyTorch-based framework for training highly multilingual and adaptable MT evaluation models that can function as metrics 22. It supports both architectures: the estimator model (trained to regress directly on a quality score) and the translation ranking model (trained to minimize the distance between a "good" MT output and its corresponding reference/original source).
The most commonly used approach to determine the ability of automatic metrics to substitute human evaluation is to compare the correlations between human evaluation metrics and the scores of automatic metrics 6. However, the result is still only a score (a number from the ⟨0, 1⟩ interval) that does not indicate the level of translation error rate at the segment/sentence/text level within the corpus. Additionally, automatic metrics provide varying results and varying degrees of correlation with human evaluations, which are often inconsistent themselves. Comparisons of the translation quality of a pair of MT systems often rely on the differences between automatic scores (e.g., the BLEU score) to draw conclusions without performing any further assessment 23,24.
This motivated us to search for other techniques that would be suitable for comparing translation quality and would help us identify segments/sentences/texts within a corpus that vary extremely (significantly) in translation quality, with minimal human intervention.

WER = #(insertions + deletions + substitutions) / reference length.

The advantage of using residuals when comparing translations is the ability to detect specific segments/sentences/texts within the corpus that deviate significantly from the gold-standard translation. Residual analysis and error analysis are closely related; both measure a distance (deviation or error). Residual analysis evaluates a regression model's validity by examining the differences between observed values and the values predicted by the model; in our case, the model is the MT model.
The deviation or error is the distance of the observed value from the predicted/expected value, i.e., residuals represent the distance of observed values from predicted values:

residual_i = observed value_i − predicted value_i, i = 1, …, N,

where, in our research, N represents the number of examined texts in the data set, the observed value is represented by the neural MT error rate, and the predicted/expected value by the statistical MT error rate of a given text.
Extreme distances between the examined models (MT systems in our research) are identified based on the ±2sigma rule, similarly to outliers in residual analysis:

residual_i < mean(residual) − 2·sigma(residual) or residual_i > mean(residual) + 2·sigma(residual),

where the residual values represent the differences in the error rate of the examined MT models, the neural MT system and the statistical MT system in our case.
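A minimal sketch of the ±2sigma rule on residuals follows; the WER scores are invented for the example. Texts whose NMT − SMT residual falls outside mean ± 2·stdev are flagged as extreme.

```python
# Minimal sketch of the ±2*sigma residual rule described above.
# The residual for each text is the NMT error rate minus the SMT error rate;
# texts whose residual falls outside mean ± 2*stdev are flagged as extreme.
# All WER scores below are invented for illustration.
from statistics import mean, stdev

nmt_wer = [0.62, 0.60, 0.65, 0.30, 0.61, 0.59, 0.95, 0.63, 0.58, 0.64, 0.60, 0.62]
smt_wer = [0.78, 0.76, 0.81, 0.82, 0.77, 0.75, 0.79, 0.79, 0.74, 0.80, 0.76, 0.78]

residuals = [n - s for n, s in zip(nmt_wer, smt_wer)]
m, s = mean(residuals), stdev(residuals)
lower, upper = m - 2 * s, m + 2 * s

# Indices of texts with extreme NMT-SMT differences in error rate.
extremes = [i for i, r in enumerate(residuals) if r < lower or r > upper]
print("extreme text indices:", extremes)
```

In this toy sample, one text is far better under NMT (a large negative residual) and one is far better under SMT (a positive residual); both fall outside the ±2sigma band and would be selected for closer inspection.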
Residuals allow us to identify patterns, better understand and interpret model errors, and subsequently eliminate, correct, or analyze them, as well as their influence on MT quality 25. The aim of the study is to compare two different approaches to machine translation, statistical and neural, using automatic MT metrics of error rate and residuals. We examined four available online MT systems (statistical Google Translate, neural Google Translate, and two European Commission MT tools, statistical mt@ec and neural eTranslation) through their products (MT outputs).
The statistical MT (SMT) systems are represented by Google Translate (GT_SMT) 26 and mt@ec (the European Commission's MT tool) 27, and their transformations into neural MT (NMT) systems are represented by Google Translate (GT_NMT) 28 and eTranslation (the European Commission's MT tool) 29. The shift from mt@ec to eTranslation improved the translation quality, speed, and security of the interface. The Google team made the same transformation in September 2016; it switched to Google neural machine translation, focusing on an end-to-end learning framework that learns from millions of examples and provides significant improvements in translation quality 30.
The main objective consists of three partial objectives:
• The first objective lies in the comparison of statistical MT systems and neural MT systems based on the automatic MT metrics of error rate (WER, PER, and TER).
• The second objective aims to identify or detect machine-translated segments/sentences/texts that deviate significantly from human translations based on the score of error rate metrics and residuals. This includes identifying texts in which statistical MT was closer to the human translation than neural MT, or vice versa.
• The third objective involves verifying the validity of the obtained results through metrics such as BLEU and COMET, as well as the characTER metric, which correlates better with human evaluation in the case of morphologically richer languages 19,20.
The translation directions were from English and German into Slovak, an inflectional and low-resourced language. Moreover, Slovak is one of the official languages of the European Union.
The structure of the paper is as follows. The second section contains related work in the field of automated MT evaluation and a comparison of various MT systems. The third section describes the data set used and the applied research methodology. The subsequent section focuses on the research results based on the evaluation of error rate metrics and residuals. The fifth section offers a discussion of the results. The last section comprises the research conclusions.

Related work
Statistical MT and neural MT are the most extensively used architectures within the MT systems 31 .
Pinnis et al. 32 compared NMT and phrase-based SMT systems for highly inflected and low-resourced languages. They compared large and small bilingual corpora, focusing on six language pairs: [Latvian, Estonian]-English, Estonian-Russian, and vice versa. MT evaluation was conducted using automatic evaluation metrics (BLEU, NIST, and ChrF2) and manual error analysis. The error analysis focused on the identification of morphological, syntactical, and lexical errors. The results showed that the NMT system produced twice as many errors in lexical choice (wrong or incorrect lexical choice) as the phrase-based SMT system. On the other hand, the NMT system demonstrated much better grammatical accuracy (forms and structure of words, and word order) than the SMT system.
Yang et al. 33 examined translation quality from ancient Chinese to modern Chinese. They proposed a novel automatic evaluation model, dual-based translation evaluation, which works without multiple references. To compare the results, BLEU and the Levenshtein distance were used as baselines. They proved that dual-based translation evaluation achieved better agreement and/or concordance with human evaluation (human judgements).
Fomicheva and Specia 34 conducted a broad meta-evaluation study of automatic evaluation metrics. They evaluated more than 20 automatic evaluation metrics on multiple data sets (the WMT16 data set, the MTSummit17 English-Latvian data set, the Multiple-Translation Chinese data set, the WMT17 Quality Estimation German-English data set, the GALE Arabic-English data set, and the EAMT11 French-English data set). The data sets also contained manual assessments based on different quality criteria (adequacy, fluency, or PE effort) collected using several different methods. The meta-evaluation was conducted based on three aspects: MT quality, MT system types, and manual evaluation type. They showed that the accuracy of automatic MT evaluation varies depending on the overall MT quality. They showed that automatic metrics perform poorly when faced with low-quality translations, but also that evaluating low-quality translations is more challenging for humans. They further showed that metrics are more reliable when evaluating neural MT than statistical MT systems. Metric performance can be affected by various factors, such as text domain, language pair, or type of MT system. Moghe et al. 35 evaluated nine metrics, consisting of string overlap metrics, embedding-based metrics, and metrics trained using scores from human MT evaluation, on three extrinsic tasks (dialogue state tracking, question answering, and semantic parsing) covering 39 unique language pairs. They showed that interpreting the quality of the produced MT translation based on a number is unreliable and difficult. They also showed that the scores provided by neural metrics (e.g., COMET) are not interpretable, in large part due to having undefined ranges, and that it is unclear whether automatic metrics can reliably distinguish good translations from bad at the sentence level.
Alvarez-Vidal and Olivier 23 found that automatic metrics such as BLEU were intended to be used as a development tool and cannot be blindly used to assess MT systems without taking into account the final use of the translated text. They recommend a two-step MT evaluation that can ensure the quality of the MT output. They compared two different NMT engines, the commercially available online DeepL NMT system and a system trained on the news domain by the authors, for the English-Spanish language pair. They showed that the automatic metrics used (BLEU, NIST, WER, TER, EdDist, and COMET) yield better results for the NMT system trained by the authors, except for COMET.
Almahasees 36 compared the MT outputs of Google Translate and Microsoft Bing Translator (both based on SMT). The data comprised political news in English translated into Arabic. The data were evaluated using automatic evaluation metrics, and the results were better for the MT outputs produced by Google Translate. Later, Almahasees 37 conducted similar research with journalistic texts for the English-Arabic language pair, but with MT systems operating on neural networks. He compared the MT outputs based on automatic MT evaluation metrics of error rate. The results were similar for both MT systems in orthography and grammatical accuracy. A difference was found in the case of lexis, where the neural MT (Google Translate) achieved better results than the statistical MT (Bing).
Marzouk and Hansen-Schirra 38 focused on controlled languages (CLs) to improve the quality of NMT output. They compared the impact of applying nine CL rules on the quality of MT output produced by five MT systems (Google, Bing, Lucy, SDL, and Systran; i.e., neural, rule-based, statistical, and two hybrid MT systems) by applying three methods: error annotation, human evaluation, and automatic evaluation (TERbase and hLEPOR). The data set consisted of 216 source sentences from the technical domain translated from German into English. They showed that NMT does not require CL rules; i.e., both before and after applying the CL rules, the NMT system showed the lowest number of errors.
Li and Wang 39 focused on the optimization of automatic MT evaluation. They applied a representative listwise learning-to-rank approach, ListMLE. The selection of features was motivated by the BLEU-n metrics, phrase-based SMT, and NMT. They used the data sets released for the WMT'2014 and WMT'2015 metrics tasks. To evaluate the results of the experiment, they compared the listwise approach with the most widely used metrics, such as BLEU-n, METEOR, and TER. The results showed that the novel approach achieved better results than the above-mentioned metrics.
Singh and Singh 40 focused on MT quality for low-resource languages. They aimed at an NMT system that would improve the translation quality for the English-Manipuri language pair. They compared multiple approaches, such as SMT, RNN, and the transformer architecture. The results showed higher translation quality in terms of statistically significant automatic scores and manual evaluation compared to the statistical and neural supervised baselines, as well as the pretrained mBART and existing semi-supervised models.
Shterionov et al. 41 compared phrase-based SMT and NMT systems based on lexical similarities. They applied automatic evaluation metrics (BLEU, TER, and F-measure) to assess the performance of the MT systems. Based on the same data set, they built five NMT and phrase-based SMT engines for various language pairs. They showed that the quality evaluation scores indicated better performance for the PBSMT engines, contrary to human evaluation. They suggested that automatic evaluation metrics (BLEU, TER, and F-measure) are not always convenient for evaluation and do not correspond with NMT quality.
Tryhubyshyn et al. 42 examined the relationship between MT system quality and QE system performance. They showed that QE systems trained on lower-quality MT translations (a mix of translations from different MT models) tended to perform better than those trained on higher-quality MT outputs (translations from one MT system).
As mentioned in the introduction, automatic metrics (such as BLEU) yield varying results depending on the reference translation, text domain, and languages. Studies often draw conclusions from them without performing further evaluation or analysis, such as error analysis. Moreover, when the results of automatic evaluation were compared with those of manual evaluation, their correlation reached different degrees of agreement. Additionally, evaluators in manual evaluations were often inconsistent regarding the error rate of the machine translation. These findings are also supported by studies focused on Slavic languages or low-resource languages.
The aforementioned studies (as well as ours, in the first objective) found that, on average, NMT is better than SMT. However, our proposed approach through residual analysis (regardless of which automatic metric is used) identifies segments that, on the contrary, show higher SMT quality. We have shown that our approach is suitable not only for automatic metrics of accuracy, but also for automatic metrics of error rate, which distinguishes us from all previous studies focused on Slovak so far.

Materials and methods
This study focuses on comparing NMT systems (represented by Google Translate and eTranslation) and SMT systems (represented by Google Translate and mt@ec, the European Commission's MT tool, which later transformed into the neural MT system eTranslation).
The statistical machine-translated articles were obtained in 2016 from both Google Translate (GT_SMT) and the European Commission's DGT tool (mt@ec). Later, in 2021, the same articles were machine-translated by the NMT engines Google Translate (GT_NMT) and the European Commission's DGT tool (eTranslation). The translation directions were from English and German into Slovak, where Slovak is a synthetic language with inflected morphology and loose word order 43. Human translation and post-editing of machine translation were conducted in the interactive online system OSTEPERE 25,44-47.
The examined articles were tokenized and aligned using the Hunalign tool 48 in the following order: source sentence with one human translation (HT), four machine translations (MTs), and one post-edited machine translation (PEMT).
The evaluation of the two different MT systems was conducted through automatic metrics of error rate (WER, PER, and TER). We aimed to identify the errors produced by the examined MT systems and determine whether changing the architecture of the MT systems reduced the occurrence of the same errors or, on the contrary, introduced new ones. To verify the validity of the obtained error rate results, we used the metrics of accuracy, BLEU and COMET, and also the character-based metric of error rate, characTER.

Dataset
The data set comprises articles published by the British online newspaper The Guardian and the German online newspaper Der Spiegel, along with their machine and human translations. The corpus consists of eight data sets, i.e., two English-Slovak and German-Slovak corpora: (1) articles written in English and German as source texts, (2) articles machine-translated from English and German into Slovak by four different MT engines (by SMT in 2016 and by NMT in 2021), (3) articles human-translated from English and German into Slovak by professional translators (both in 2016), and (4) machine-translated articles post-edited by professional translators (in 2016).
The lexico-grammatical structure of the dataset 49 was obtained using Stanza 50 , an automatic morphological annotator tool (Table 1).
Because the created corpora are composed of articles with the features of newspaper writing (their own register), the examined corpora mainly consist of nouns, followed by verbs and adjectives. Regarding the readability of the examined translations (from EN to SK and also from DE to SK), there are unequal proportions of short (n < 10) and long (n >= 10) sentences among the MTs. The reduction of words within the sentence occurs frequently in statistical MT (mt@ec), which indicates word omission and a shift in meaning, and/or a certain loss of meaning (e.g., short sentences (n < 10) for EN: GT_SMT = 18.13%; GT_NMT = 18.75%; mt@ec_SMT = 21.88%; eTranslation_NMT = 15.63%; and for DE: GT_SMT = 36.14%; GT_NMT = 36.49%; mt@ec_SMT = 41.86%; eTranslation_NMT = 37.39%).

Applied methodology
The applied methodology, inspired by other studies 51-53, consists of the following stages (Fig. 1):
(1) Acquisition of unstructured textual data, i.e., source texts (journalistic texts). We focused on journalistic texts (newspaper writing), as they are among the texts most read and most frequently translated. We chose the two most popular journals, from which we obtained all freely available texts from various fields (politics, sports, show business, and technology) published in the given year, 2016.
(2) Data preparation, consisting of the following tasks:
(3) Automatic MT evaluation using automatic metrics of error rate at the segment level. We applied automatic MT metrics based on the Levenshtein distance, which computes the minimum edit distance to transform an MT output into a reference through edit operations (insertions, substitutions, deletions, and shifts of words necessary to transform one string into another):

WER(h, r) = min #(I + D + S) / |r|,

where r is a reference of a hypothesis/MT output h, I is insertion, D is deletion, and S is substitution. The minimum number of edit operations is divided by the number of words in the reference 54.

PER(h, r) = (max(|h|, |r|) − n) / |r|,

where r is a reference of a hypothesis/MT output h and n is the number of matching words 18.

TER(h, r) = min #(I + D + S + shift) / |r|,

where r is a reference of a hypothesis/MT output h, I is insertion, D is deletion, S is substitution, and shift is the number of changes in word order. Compared to WER, TER considers shifts as part of the edit operations. TER deals with more edit operations, allowing it to capture various differences in word order.
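For illustration, WER and one common formulation of PER can be sketched at the word level as follows; this is a hedged sketch for exposition, not the evaluation code used in the study, and TER's shift handling is omitted.

```python
# Word-level sketch of WER and one common formulation of PER (illustrative only).
# WER: word Levenshtein distance normalized by the reference length.
# PER: position-independent errors based on bag-of-words overlap.
from collections import Counter

def wer(hyp: str, ref: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        curr = [i]
        for j, rw in enumerate(r, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (hw != rw)))    # substitution
        prev = curr
    return prev[-1] / len(r)

def per(hyp: str, ref: str) -> float:
    """Position-independent error rate: ignores word order entirely."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matched = sum((h & r).values())   # words shared regardless of position
    errors = max(sum(h.values()), sum(r.values())) - matched
    return errors / sum(r.values())
```

For "the cat sat" against the reference "the sat cat", WER counts two substitutions (2/3), while PER is 0 because the bags of words are identical, which mirrors the PER-vs-WER gap discussed in the results.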
The higher the score of the error rate metrics, the worse the translation quality, and vice versa.
(4) Comparison of MT quality based on (1) the MT system used (Google Translate or the European Commission's DGT system) and (2) the artificial intelligence approach to MT (statistical or neural).
(i) We test the differences in the scores of automatic MT metrics between two MT systems (Google Translate (GT) and the European Commission's MT tool (EC)), separately for WER, PER, and TER. (ii) We test the differences in the scores of automatic MT metrics between artificial intelligence approaches to MT (statistical vs. neural), separately for WER, PER, and TER.
(5) Identification of extreme differences between statistical and neural MT. To identify extreme values, we apply residual analysis, i.e., the ±2sigma rule.

CharacTER(h, r) = min #(character-level I + D + S + shift) / |h|,

where h is a hypothesis/MT output, I is insertion, D is deletion, S is substitution, and shift is a change of word order.

BLEU-n 7 is a geometric mean of n-gram precisions with a brevity penalty (BP), i.e., a penalty to prevent very short sentences:

BLEU = BP · exp(Σ_n w_n · log p_n),

where w_n are the weights for the different n-gram precisions p_n, and r is a reference of a hypothesis h.

We used one of the main models of COMET: wmt22-comet-da. This model uses a reference-based regression approach and has been trained on direct assessments from WMT17 to WMT20. It provides scores ranging from 0 to 1, where 1 represents a perfect translation.
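The BLEU-n formula can be illustrated with a bare, unsmoothed sentence-level sketch. Real toolkits such as sacreBLEU add smoothing and corpus-level aggregation; this version is for exposition only.

```python
# Unsmoothed sentence-level BLEU-n sketch: geometric mean of modified n-gram
# precisions times a brevity penalty BP = min(1, exp(1 - |r|/|h|)).
# Illustrative only; production metrics use smoothing and corpus aggregation.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp: str, ref: str, max_n: int = 2) -> float:
    h, r = hyp.split(), ref.split()
    if not h:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        h_ng, r_ng = ngrams(h, n), ngrams(r, n)
        overlap = sum((h_ng & r_ng).values())        # clipped n-gram matches
        total = max(sum(h_ng.values()), 1)
        if overlap == 0:
            return 0.0                               # unsmoothed: zero precision
        log_precisions.append(math.log(overlap / total))
    bp = min(1.0, math.exp(1 - len(r) / len(h)))     # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)  # uniform weights w_n
```

With uniform weights w_n = 1/max_n, an identical hypothesis and reference score 1.0, and any hypothesis with no n-gram overlap scores 0.0, matching the ⟨0, 1⟩ range discussed earlier.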

Automatic MT evaluation based on metrics of error rate
For all automatic metrics (WER, PER, and TER), the Mauchly sphericity test is significant (p < 0.05), i.e., the assumption is violated. We adjusted the degrees of freedom using the Greenhouse-Geisser adjustment. Based on the results of adjusted univariate tests for repeated measures (Greenhouse-Geisser adjustment) among GT_SMT, GT_NMT, mt@ec_SMT, and eTranslation_NMT, there are significant differences in MT quality. Based on multiple comparisons (Table 2), there are significant differences in the score of the metric WER between NMT (GT) and the others, as well as between NMT (eTranslation_NMT) and the others, but there is no difference between SMT (GT) and SMT (EC). Three homogeneous groups (****p > 0.05) were identified in terms of the agreement/concordance of the examined texts. NMT produced by GT achieved the lowest error rate (0.679) compared to the other MTs. On the other hand, SMT produced by mt@ec achieved the highest error rate (0.800), which is, however, very close to that of SMT produced by GT (0.778).
In terms of lexical similarity, regardless of word order (the PER metric), there is a difference between SMT produced by the GT tool or the EC tool and neural MT, but there is no difference between neural MT produced by the GT tool and by the EC tool (Table 3). Based on multiple comparisons (Table 3), three homogeneous groups (****p > 0.05) were identified in terms of the agreement/text similarity of the examined texts. Moreover, three out of four MTs achieved lower PER error rate scores (PER ≤ 0.642) than all MTs evaluated by the metric WER (WER ≥ 0.679).
The TER values mirror the WER values (Tables 2, 4). Based on multiple comparisons (Table 4), three homogeneous groups (****p > 0.05) were identified in terms of the agreement/text similarity of the examined texts. There are significant differences in the score of the metric TER between GT_NMT (neural GT) and the others, as well as between EC_NMT (eTranslation_NMT) and the others (Table 4), but there is no difference between GT_SMT (statistical GT) and EC_SMT (mt@ec_SMT). Neural MT produced by GT achieved the lowest error rate (0.674) compared to the other MTs. On the other hand, statistical MT produced by mt@ec achieved the highest error rate (0.796), which is, however, very close to that of statistical MT produced by GT (0.774).
We applied the same analysis to machine-translated texts from German into Slovak. Due to the violation of the assumption of sphericity of the covariance matrix, we used modified tests for repeated measurements (Greenhouse-Geisser adjustment) to test the differences in MT quality among GT_SMT, GT_NMT, mt@ec_SMT, and eTranslation_NMT represented by the metrics of error rate (PER: W = 0.868, Chi-sqr. = 78.816, df = 5, p < 0.001; WER: W = 0.873, Chi-sqr. = 75.826, df = 5, p < 0.001; TER: W = 0.889, Chi-sqr. = 65.643, df = 5, p < 0.001). The highest rate of violation of the assumption was identified in the case of the metric WER (G-G Epsilon = 0.912), followed by PER (G-G Epsilon = 0.919); on the contrary, the lowest was for the metric TER (G-G Epsilon = 0.923). Overall, since the rate of violation of the assumption of sphericity of the covariance matrix was low for all applied metrics, we used adjusted significance tests (WER, PER, and TER: G-G Epsilon < 0.923, G-G Adj. p < 0.001) and subsequently compared them with unadjusted univariate tests for repeated measures (F > 211.214, p < 0.001).
Based on the results, we reject, at the 0.001 significance level for all metrics, the global H0, which claims that there is no statistically significant difference in the quality of MT when translating from German to Slovak, represented by the error rate metrics PER, WER, and TER, among GT_SMT, GT_NMT, mt@ec_SMT, and eTranslation_NMT. NMTs were of statistically significantly better quality than SMTs, regardless of which MT tool was used (Tables 5, 6). NMT produced by the GT tool (Tables 5, 6) achieved the statistically significantly lowest error rate (PER = 0.495, WER = 0.609, TER = 0.607). On the other hand, SMT produced by mt@ec, an EC tool (Tables 5, 6), achieved the statistically significantly highest error rate (PER = 0.720, WER = 0.821, TER = 0.820).
We conclude that the assumption regarding better NMT quality compared to SMT has been confirmed, regardless of the language pair. We showed statistically significant differences between SMT and NMT in favor of NMT.

Table 5. Bonferroni (adjustment) post-hoc test for multiple comparisons of (a) the PER and (b) the WER metrics between different MT systems (GT tools or EC tools) and approaches (statistical or neural) in the German-Slovak language pair. ****Homogeneous groups, p > 0.05.

These findings indicate that the error rate in the examined texts is probably related to recall (lexical accuracy). Considering the reference, the error rate of the examined MTs is more associated with lexical accuracy, i.e., vocabulary and word omission, than with grammatical accuracy, i.e., forms and structure of words and word order. This motivated us to apply residual analysis to identify and specify in more detail the MT errors that occurred in individual machine translations.

Identification of extreme differences based on the score of error rate metrics between SMT and NMT-English-Slovak machine translations
We used residuals to identify texts with extreme values of the error rate metrics (WER, PER, and TER) between SMT and NMT for each MT tool separately. We applied the ±2sigma rule, i.e., values outside the interval (mean − 2sigma; mean + 2sigma) are considered extreme. The mean of the NMT-SMT differences for all metric values (WER/PER/TER) is negative (Figs. 2, 3, 4, 5, 6, 7), which confirms our finding (previous subsection) that, in terms of error rate, NMT achieved a statistically significantly lower error rate, i.e., better translation quality. The neural MT outputs were more similar to the references than the statistical MT outputs. In the case of the European Commission's MT tool (Fig. 2), we identified 8 texts (ID_142, ID_156, ID_180, ID_205, ID_258, ID_267, ID_279, and ID_280) that showed a statistically significantly better WER score for NMT against SMT (residuals ≈ −0.5). Only 2 texts (ID_259 and ID_298) achieved a significantly better WER score for SMT against NMT (residuals ≈ 0.33), but both texts consist of short sentences (fewer than 7 words, including articles), which could have had an impact on the results.
In the case of Google Translate (Fig. 3), we identified 5 texts (ID_142, ID_156, ID_180, ID_205, ID_258, ID_267, ID_279, and ID_280) that showed a statistically significantly better WER score for NMT than for SMT (residuals ≈ −0.5), and 4 texts (ID_155, ID_156, ID_192, and ID_221) with a significantly better WER score for SMT than for NMT (residuals ≈ 0.35). These SMT outputs were more similar to the reference than the NMT outputs (NMT was correct but used synonyms, which could have had an impact on the results).
In the case of the European Commission's MT tool (Fig. 4), we identified 5 texts (ID_156, ID_163, ID_180, ID_267, and ID_280) that showed a statistically significantly better PER score for NMT than for SMT (residuals ≈ −0.55). Only 4 texts (ID_183, ID_223, ID_224, and ID_298) achieved a significantly better PER score for SMT than for NMT (residuals ≈ 0.3). Again, these were texts with short sentences, and NMT added extra words compared to the reference, which could have had an impact on the results.
Based on our results, we can infer that for neural machine translation the main issue lies in lexical semantics rather than in word order.

Discussion
The applied automatic metrics are based on a comparison with a reference, which, in our case, was created independently (pure human translation, not influenced by MT output). This could distort the assessment of MT quality, but it did not affect the comparison of SMT and NMT, because we used the same reference in both cases.
Based on corpus statistics (Table 1), we assumed that NMT outperforms SMT with respect to the lexicogrammatical features of the examined texts (frequency of nouns, adjectives, and verbs).
Based on the analysis results, we can conclude that NMT demonstrated higher quality than SMT in terms of error rate. All automatic metrics achieved lower scores for neural MT compared to statistical MT, i.e., NMT outperformed SMT. The most serious issues of SMT include shifts in part of speech, omission or addition of words, and inflection. Word order was not such a serious issue for neural MT, which we explain by the fact that the translation was into Slovak, which has a free word order, unlike English, which has a strict word order (SWO).
Regarding the accuracy of strings (represented by the metrics WER and TER), SMT produced approximately the same error rate whether it was produced by GT or by mt@ec, the EC tool (Tables 2, 5b), which is noteworthy. Both MT tools performed at a very similar level. Conversely, in terms of the similarity of strings (represented by the metric PER), NMT produced approximately the same error rate whether it was produced by GT or by eTranslation, the EC tool (Tables 3, 5a). We explain this by the character of the examined texts. They were of journalistic style (newspaper writing), with no specific vocabulary or complex syntax, so the MT tools did not require training on a specific text domain. SMT showed similar error rates whether it was trained on a general text domain (GT) or a specialized text domain, such as administrative texts (EC). In the case of the metric PER, which focuses only on word similarity (independent of word position) and does not take word order into account, similar error rates were likewise observed (Tables 3, 5b).
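The distinction between the two metric families can be made concrete: WER is based on a word-level Levenshtein distance and therefore penalizes reordering, whereas PER compares the two sides as bags of words and ignores position. A minimal sketch of both (not the exact implementations used in the study):

```python
from collections import Counter

def wer(ref, hyp):
    """Word Error Rate: word-level Levenshtein distance divided by
    reference length. Sensitive to word order."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))          # dp row for empty prefix
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (rw != hw)))  # substitution
        prev = cur
    return prev[-1] / len(r)

def per(ref, hyp):
    """Position-independent Error Rate: multiset overlap of words,
    so pure reordering is not penalized."""
    r, h = Counter(ref.split()), Counter(hyp.split())
    matches = sum((r & h).values())
    return (max(sum(r.values()), sum(h.values())) - matches) / sum(r.values())

ref = "the cat sat on the mat"
hyp = "on the mat the cat sat"   # same words, completely reordered
print(wer(ref, hyp), per(ref, hyp))
```

On the toy pair, WER penalizes the reordering heavily while PER scores it as error-free, which is exactly why the two metrics can rank systems differently.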
To validate the obtained results, we employed the automatic metrics BLEU, COMET, and CharacTER to verify the reliability of the error rate metrics for both language directions (BLEU: G-G Epsilon < 0.940, G-G Adj. p < 0.001; COMET: G-G Epsilon = 0.812, G-G Adj. p < 0.001; and CharacTER: G-G Epsilon = 0.932, G-G Adj. p < 0.001). The results (Table 7) correspond almost fully with the results for the metrics PER, WER, and TER for English-Slovak machine translation (Tables 2, 3, 4) as well as German-Slovak machine translation (Tables 5, 6). NMTs were of statistically significantly better quality than SMTs regardless of which MT tool and language direction were used (Table 7). NMT produced by GT (Table 7) achieved the statistically significantly lowest error rate (CharacTER = 0.481) and the statistically significantly highest accuracy (COMET = 0.887, BLEU_1 = 0.514, BLEU_2 = 0.227, BLEU_3 = 0.164, BLEU_4 = 0.097). On the other hand, SMT produced by mt@ec, the EC tool (Table 7), achieved the statistically significantly highest error rate (CharacTER = 0.688) and the statistically significantly lowest accuracy (COMET = 0.662, BLEU_1 = 0.303, BLEU_2 = 0.115, BLEU_3 = 0.044). According to the metric BLEU_4 (Table 7f), both SMT systems (mt@ec and GT) form one homogeneous group, i.e., they achieved the same, lowest quality (p > 0.05).
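For reference, BLEU_n is built on modified (clipped) n-gram precision. The following single-reference sketch shows the per-order precision only; full BLEU additionally combines the orders geometrically and applies a brevity penalty, so this is an illustration rather than the scorer used in the study:

```python
from collections import Counter

def bleu_n(ref, hyp, n):
    """Modified n-gram precision (single reference), the core of
    BLEU_n: hypothesis n-gram counts are clipped by the reference."""
    r, h = ref.split(), hyp.split()
    rgrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
    hgrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
    if not hgrams:
        return 0.0
    clipped = sum(min(c, rgrams[g]) for g, c in hgrams.items())
    return clipped / sum(hgrams.values())

ref = "the cat sat on the mat"
hyp = "the cat is on the mat"
print([round(bleu_n(ref, hyp, n), 3) for n in (1, 2, 3, 4)])
```

The precision drops sharply with higher orders, which mirrors the pattern of the BLEU_1 through BLEU_4 scores in Table 7.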
To analyze the relationships between the automatic metrics of error rate (PER, WER, and TER) and the metrics we chose as baseline-valid criteria (BLEU_1-4, CharacTER, and COMET), we employed non-parametric correlations. Due to deviations from normality of the automatic metrics (PER, WER, TER, BLEU_1-4, CharacTER, and COMET), we applied non-parametric Spearman rank order correlations to both language directions (W < 0.993, p < 0.001), but separately for statistical MT (Table 8) and for neural MT (Table 9).
In the case of SMT (Table 8), similar results were achieved for both MT systems (GT and EC). The examined metrics of error rate (PER, WER, and TER) correlate positively with the CharacTER metric (Table 8), indicating a moderate (> 0.3) to high (> 0.5) degree of statistically significant directly proportional dependency (p < 0.001). On the contrary, in the case of the metrics of accuracy (BLEU_1-4 and COMET), a negative correlation was identified (Table 8), revealing a moderate (< −0.3) degree of dependency between the automatic metrics (PER, WER, and TER) and the metrics COMET/BLEU_4. A high (< −0.5) to very high (< −0.7) degree of statistically significant inverse dependency was observed between them and the metrics BLEU_1-3 (p < 0.001).
Similar results were achieved in the case of NMT (Table 9). The automatic error rate metrics (PER, WER, and TER) correlate positively with the CharacTER error rate metric (Table 9), showing a high (> 0.5) to very high (> 0.7) degree of statistically significant directly proportional dependency (p < 0.001). On the contrary, in the case of the metrics of accuracy (BLEU_1-4 and COMET), a negative correlation was identified (Table 9). Between the automatic metrics (PER, WER, and TER) and the metrics COMET/BLEU_4, a moderate (< −0.3) to high (< −0.5) degree of dependency was observed, and between the metrics BLEU_1-3 and the automatic metrics (PER, WER, and TER), a high (< −0.5) to very high (< −0.7) degree of statistically significant inverse dependency was found (p < 0.001).
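Spearman's ρ is simply the Pearson correlation of the rank vectors, with ties assigned average ranks. A self-contained sketch on hypothetical per-text scores (not the study's data), where an error rate metric and an accuracy metric should correlate negatively:

```python
def ranks(xs):
    """1-based average ranks; tied values share the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-text WER and BLEU_1 scores (illustrative only):
wer_scores   = [0.62, 0.55, 0.48, 0.70, 0.40, 0.51]
bleu1_scores = [0.30, 0.38, 0.45, 0.22, 0.55, 0.41]
print(round(spearman(wer_scores, bleu1_scores), 3))  # → -1.0
```

In this toy case the rankings are perfectly inverted, giving ρ = −1; the study's observed dependencies were moderate to very high rather than perfect.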
In the case of NMT (Table 9), higher dependencies were identified compared to SMT (Table 8), but in both cases, they reached at least a medium level of statistically significant dependency.
These results motivated us to conduct a manual error analysis for both SMT and NMT. Due to its labour- and time-intensive nature, we restricted the analysis to 5 MT texts produced by the GT tools (SMT_GT vs NMT_GT). We divided the errors that occurred into the following 4 categories, which cover the text complexity of inflectional languages 55: (1) predication, (2) syntactic-semantic correlativeness, (3) compound/complex sentences, and (4) lexical semantics.
SMT produced 184 errors in the category of predication, 279 errors in syntactic-semantic correlativeness, 76 errors in compound/complex sentences, and 370 errors in the category of lexical semantics. The results obtained for NMT were significantly different: 27 errors were identified in predication, 106 in syntactic-semantic correlativeness, 12 in compound/complex sentences, and 442 in lexical semantics.
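One simple way to verify that the two error profiles differ is a chi-square test of homogeneity on the 2×4 table of counts reported above; this is an illustrative check, not necessarily the test applied in the study:

```python
def chi_square(table):
    """Chi-square statistic for a contingency table (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total  # expected count
            stat += (obs - exp) ** 2 / exp
    return stat

# Error counts reported above (SMT_GT vs NMT_GT), category order:
# predication, syntactic-semantic correlativeness,
# compound/complex sentences, lexical semantics.
smt = [184, 279, 76, 370]
nmt = [27, 106, 12, 442]
stat = chi_square([smt, nmt])
# Critical value for df = (2-1)*(4-1) = 3 at alpha = 0.05 is 7.815.
print(stat > 7.815)  # → True: the error distributions differ
```

The statistic far exceeds the critical value, consistent with the shift of errors from grammatical categories (SMT) to lexical semantics (NMT).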
Our results correspond with the findings of similar studies 56,57, which showed that SMT is more accurate in meaning (lexical accuracy) but less fluent in grammar (grammatical accuracy), while NMT, conversely, is grammatically more fluent but less accurate in meaning (lexical semantics).
Using residual analysis, we can reveal which errors persist and, conversely, which have been eliminated or have arisen.
In the case of the European Commission's DGT tools, when we compared SMT and NMT based on the WER metric, which takes into account not only lexical accuracy but also grammatical correctness and word order, we found that errors most often occurred within lexical semantics, either in (1) part-of-speech transformation, e.g., a noun becomes an adjective after translation with a shift in meaning, or in (2) a shift of gender, most often from masculine to feminine. Another frequent issue was word omission and word order, e.g., SS: "Among their number were Belgian students, French schoolchildren and British lawyers."

(a) Text pre-processing: removing text formatting, which can influence MT quality (images or tables can divide the text inappropriately and produce a bad translation).
(b) Human translation: the translation process was carried out in the tailored system OSTEPERE, which offers a user-friendly interface for human translators and post-editors. The system saved the human translations and post-edited machine translations into a database for further processing.
(c) Machine translation: automatic translation of the source text by the MT engines (Google Translate [SMT | NMT], mt@ec [SMT], and eTranslation [NMT]).
(d) Sentence alignment: the generated MT outputs and human translations are aligned with the source texts using the Hunalign tool (based on the 1-to-1 principle).
https://doi.org/10.1038/s41598-024-59524-3

Table 6. Bonferroni (adjustment) post-hoc test for multiple comparisons of the TER metric between different MT systems (GT tools or EC tools) and approaches (statistical or neural) in the German-Slovak language pair. ****Homogenous groups p > 0.05.

Figure 2. Visualization of NMT-SMT residuals for the WER metric and the European Commission's MT tool.

Figure 3. Visualization of NMT-SMT residuals for the WER metric and Google Translate.

Figure 4. Visualization of NMT-SMT residuals for the PER metric and the European Commission's MT tool.

Figure 5. Visualization of NMT-SMT residuals for the PER metric and Google Translate.

Figure 6. Visualization of NMT-SMT residuals for the TER metric and the European Commission's MT tool.

Figure 7. Visualization of NMT-SMT residuals for the TER metric and Google Translate.

Table 1. Dataset composition of (a) English MT outputs/HT and (b) German MT outputs/HT.

Table 2. Bonferroni (adjustment) post-hoc test for multiple comparisons of the metric WER between different MT systems (GT tools or EC tools) and approaches (statistical or neural) in the English-Slovak language pair. ****Homogenous groups p > 0.05.

Table 3. Bonferroni (adjustment) post-hoc test for multiple comparisons of the metric PER between different MT systems (GT tools or EC tools) and approaches (statistical or neural) in the English-Slovak language pair.

Table 4. Bonferroni (adjustment) post-hoc test for multiple comparisons of the metric TER between different MT systems (GT tools or EC tools) and approaches (statistical or neural) in the English-Slovak language pair.

Scores of the metrics of error rate (WER, PER, and TER: G-G Epsilon < 0.944, G-G Adj. p < 0.001). NMTs were of statistically significantly better quality than SMTs regardless of which MT tool (GT or the European Commission's MT tool) was used. NMTs were lexically more similar to the references than SMTs.