Mirror, mirror, which is the fairest metric of them all?
- Marta Nieto Cayuela
A discovery journey towards LLM-generated translation quality.

I am often told that apples can only be compared to apples. But what if my goal is to pick the sweetest fruit among those available, or the one that offers the best pairing to my recipe? Shouldn’t I taste them all to make an informed decision and choose the best one for me? This is the exact analogy I use when questioned about evaluating and comparing neural machine translation (NMT) and large language model (LLM) translation quality. After all, fruit salads are not made of just one fruit, just as localization strategies are never one-size-fits-all.
Whether you arrived here seeking answers from “To Quality or Not to Quality: That's Not the Question” or you are new to the discussion, welcome. If you are looking for the best metrics to evaluate LLM translation quality, you are in the right place. But be warned: I will not hand you a single “best” metric. Instead, I will act as a mirror, reflecting the full range of possibilities so you can determine which one may suit your needs and requirements.
NMT vs LLM
The apple of discord. When it comes to metrics and evaluation, we must consider two key perspectives. First, why it can be valuable to include NMT output when assessing LLM-generated translations. Second, why the industry has relied on adapting NMT metrics for LLM evaluation.
Why does NMT still matter in LLM translation evaluation?
If your aim is to evaluate LLM-generated translations solely to measure LLM performance, feel free to skip ahead. But if you are looking to mix and match, optimize workflows, and make informed localization decisions, this section is for you.
Earlier in this article, I emphasized the importance of assessing all available localization technologies before making a call. Choosing the right approach requires a deep understanding of not just model strengths and weaknesses, but also content types, processes, and resources. Data-driven decisions demand comparison. If you want to build a strong case for investing in LLM-generated translation, you must present leadership with the full picture, supported by the right metrics, and that includes side-by-side assessments of human workflows, NMT solutions, and LLMs.
Measuring the unmeasured
In just two years, organizations have had to pivot quickly to integrating LLMs into their translation workflows. With no LLM-specific evaluation metrics available, the industry initially relied on NMT methodologies. After all, the goal and purpose remained the same: assessing language accuracy and correctness. However, the definition of quality we are trying to capture is, as you know, broader than that: it is the user experience, the product journey, the negotiated expectations.
As LLM adoption grows, and while we patiently wait and experiment, existing MT automatic scores offer a starting point and a benchmark for comparing NMT and LLM performance, and quality models allow us to incorporate additional attributes to capture the most frequent issues introduced by LLMs. While it is not a perfect solution, it is a practical step.
Scores and metrics for LLM-generated translation
We agree that a more nuanced approach is needed to measure what was previously unmeasured. In the meantime, since you have been searching for answers, let me present the AI scores that can help you evaluate the quality of LLM translations: your long-awaited answer to the prayers of measurement and benchmarking.
A moment for technical clarity
The table below includes metrics that predict a scalar quality score (e.g., BLEU, BLEURT, COMET) alongside newly proposed metrics that provide structured or natural language explanations (e.g., GEMBA). While these metrics are generally trained separately, new hybrid methods, such as COMETKIWI, xCOMET and MetricX, aim to combine the strengths of different evaluation approaches.
A key distinction between these metrics is whether they require a reference translation (a preexisting human translation of the source text) against which the AI-generated translation is compared. Reference-based metrics, such as BLEU and COMET, rely on these human translations to evaluate quality, while reference-free metrics (also called quality estimation or QE) attempt to assess translation quality without a reference, making them useful in live scenarios or those where human translations are unavailable.
Do we consider LLM-generated translations reference-free? LLM translations are not inherently reference-free, but they can be evaluated using both reference-based and reference-free methods. When LLMs are used for translation, they can be trained and evaluated differently, sometimes without strict reliance on reference translations.
What about RAG? Is that a reference for LLMs? When using Retrieval-Augmented Generation (RAG) to generate LLM translations, the system pulls in external context from a database to improve translation quality. While this additional context informs the translation process, it does not qualify as a traditional reference for evaluation and should not be considered part of the reference in the context of metrics.
AI-Generated Translation Quality Metrics Explained
Metric | Description | Key Strengths | Limitations |
BLEU | Measures the n-gram (sequence of words) overlap between machine-generated and reference translations. | Simple, fast, widely used. Common for benchmarking translation models and retraining MT models. | Not well-suited for semantics, lacks sensitivity to synonyms and fluency. Primarily optimized for word-based evaluation. |
BLEURT | Uses fine-tuned BERT to assess translation quality based on pretrained evaluations. | Fine-tuned for semantic accuracy. Suitable for evaluating nuanced and complex texts. | Requires significant pretraining. Its black-box nature makes interpretation hard. |
BERTScore | Uses BERT embeddings to measure word-level similarity in context. | Better at handling synonyms and paraphrasing. Good for assessing fluency and context retention. Captures contextual meaning better than n-gram-based metrics. | Sensitive to the specific BERT model used; may not capture domain-specific quality. Requires significant computational power and large-scale pretraining. |
chrF3 | Calculates the similarity between an AI translation output and a reference using character n-grams, not word n-grams. | Simple and widely used. Better than BLEU for character-based languages. | Reference-dependent. |
COMET | Uses embeddings to compare semantic similarity between translation and reference. | Captures nuances missed by n-gram-based metrics. Evaluates deep contextual and semantic meaning. | Requires significant computational power; depends on training data. |
COMETKIWI | A quality estimation (QE) variant of COMET that does not require reference translations. | Does not require a reference, making it useful for real-time evaluation. | Dependent on training data; may not generalize well across all domains. Lacks widespread adoption. |
| LLM-based metric that utilizes bilingual context to enhance evaluation. | More aligned with human evaluations, particularly for contextual quality issues. | Requires substantial manual annotation; implementation is complex. Not much adoption information. |
GEMBA | Designed for evaluating generative models, like GPT-based translations. | Evaluates fluency and coherence in generative outputs. More aligned with generative translation needs. | Lacks widespread adoption. |
METEOR | Measures the quality based on a combination of unigram precision, recall, and alignment. | Considers both fluency and adequacy. Captures synonymy and word variations. | Computationally expensive compared to BLEU. Reference-dependent. |
MetricX | Hybrid reference-based and reference-free metric. | Includes synthetic examples in the training data to make it more robust to common translation failure modes such as fluent but unrelated translations or under-translation. | Lack of standardization; may not generalize well. |
QE (Quality Estimation) | Predicts translation quality without requiring a reference translation. | Useful for real-time applications. Can detect adequacy and fluency issues without a reference. | Performance depends on training data; may require domain-specific adaptation. Mostly used for MT. |
TER (Translation Error Rate) | Measures the number of edits needed to make a translation match the reference. | Practical for assessing post-editing costs and effort. | Penalizes stylistic variations; does not reflect semantic quality. |
xCOMET | A learned metric that simultaneously performs sentence-level evaluation and error span detection. | Capability for sentence-level and system-level evaluations. Strong at identifying localized critical errors and hallucinations. | Lacks widespread adoption. |
Note: This list is not exhaustive but represents the most widely used and discussed metrics for evaluating AI-generated translations, both MT and LLM, as of the date of publication of this article.
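To make the string-based metrics above a little more concrete, here is a minimal sketch of how corpus-level BLEU and chrF3 can be computed with the open-source sacrebleu package. The hypotheses and references are illustrative placeholders, not real evaluation data.

```python
# Minimal sketch: corpus-level BLEU and chrF3 with sacrebleu.
# The hypotheses and references below are illustrative placeholders.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = [
    "The cat sat on the mat.",
    "She enjoys reading historical novels.",
]
references = [[
    "The cat was sitting on the mat.",
    "She loves reading historical novels.",
]]  # one reference stream; add more inner lists for multi-reference scoring

bleu = BLEU()        # word n-gram overlap, reference-based
chrf = CHRF(beta=3)  # character n-gram F-score, i.e. chrF3

print("BLEU :", round(bleu.corpus_score(hypotheses, references).score, 2))
print("chrF3:", round(chrf.corpus_score(hypotheses, references).score, 2))
```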
Human evaluation, the apple of my eye
While mathematics is the language of the universe and can quantify almost anything, the subjectivity of language (and the fact that translation quality encompasses experience) requires a qualitative dimension that automated scores cannot capture in mere zeros and ones. This is why human evaluation remains a cornerstone of translation quality assessment.
In addition to equipping you with the best automated metrics, we now answer your second prayer: human evaluation metrics for LLM-generated translations. These will help you assess and compare AI performance on specific datasets.
Error annotation and labelling: Farewell to legacy frameworks such as the LISA QA model and the SAE J2450. Multidimensional Quality Metrics (MQM) offers a reference-free approach applicable to human, MT and LLM-generated translations.
Scalar evaluations: Evaluators rate translations on a scale (for example, a Likert scale), giving a more intuitive measure of overall quality. Aggregating evaluators’ ratings produces data from which you can derive averages and medians.
Binary assessments: A simple Pass/Fail or Yes/No method to collect data on the specific suitability of content. This method is useful for evaluating the particular attributes you need to call out for your use case, e.g. does the translation follow the glossary terminology? Are honorifics suitable? Does it use a formal voice?
But let’s be clear, human assessment is subjective and non-linear. This is why, whenever possible and appropriate, using multiple evaluators and incorporating inter-rater agreement scores is crucial to enhancing the reliability of the results, so you gain confidence in the automated scores that will allow you to monitor and govern quality.
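As an illustration of inter-rater agreement, here is a minimal sketch computing Cohen’s kappa between two evaluators’ pass/fail judgements with scikit-learn; the ratings are made-up placeholders.

```python
# Minimal sketch: Cohen's kappa for two evaluators' binary (pass/fail) ratings.
# Values near 1 indicate strong agreement; values near 0 indicate agreement
# no better than chance. The ratings below are illustrative placeholders.
from sklearn.metrics import cohen_kappa_score

evaluator_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
evaluator_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohen_kappa_score(evaluator_a, evaluator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```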
Correlating metrics for scalable quality
Automated scores and human evaluation work best together when managing quality at scale. Why? Because assessing all content manually is costly, time-consuming, and resource-intensive. This is why building confidence in the chosen automated metric is crucial. The path to enlightenment lies in using data analytics to correlate automated and human evaluation scores, ensuring alignment between machine-calculated quality and human judgement.
Key correlation methods:
Pearson correlation coefficient (r): Measures the linear relationship between human and automated scores. It is effective for normally distributed data, but sensitive to outliers.
Spearman’s rank correlation: Measures the strength and direction of the relationship between human and automated scores, even if the relationship is not linear.
Kendall’s Tau (K correlation): Evaluates the ordinal association between two datasets, helping determine if automated scores follow human judgement consistently.
By leveraging these statistical methods, we can refine our quality evaluation models, ensuring that automated metrics reliably reflect human perception of translation quality.
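As a concrete illustration, the sketch below computes all three coefficients with SciPy. The paired human and automated scores are made-up placeholders for a handful of segments.

```python
# Minimal sketch: correlating automated metric scores with human ratings.
# Both lists must be paired per segment (or per system); values are placeholders.
from scipy.stats import pearsonr, spearmanr, kendalltau

human_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]          # e.g. averaged Likert ratings
metric_scores = [0.86, 0.71, 0.80, 0.60, 0.91, 0.74]   # e.g. COMET segment scores

pearson_r, _ = pearsonr(human_scores, metric_scores)
spearman_rho, _ = spearmanr(human_scores, metric_scores)
kendall_tau, _ = kendalltau(human_scores, metric_scores)

print(f"Pearson r:     {pearson_r:.2f}")
print(f"Spearman rho:  {spearman_rho:.2f}")
print(f"Kendall's tau: {kendall_tau:.2f}")
```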
Practical application
Let’s say you have a content sample translated into multiple languages using NMT and LLM. Here is a step-by-step guide on how you can evaluate the quality of those translations.
Start with automated metrics: Begin by using a reference-based metric like COMET to compare the LLM-generated translations against human-generated references (see the COMET sketch after these steps). This will give you an initial quality score based on how closely the AI translation aligns with human expectations.
Consider reference-free metrics: Next, use a reference-free metric like GEMBA to assess fluency and contextual appropriateness (see the prompt sketch after these steps). These metrics don’t require human references and are helpful when comparing translations that do not have a reference available, or when you are focusing on how well the translation flows and fits the context.
Combine multiple scores: By combining these automated scores, you get a more comprehensive view of the translation quality. This helps ensure you are not relying on just one metric, which might overlook certain factors. If needed, normalize the scores to use the same scale (e.g. 0 to 1, 0 to 100).
Incorporate human evaluation: Automated metrics alone will not give you the full picture. For a less biased, more objective assessment, ask two evaluators to review a random, but statistically significant, selection of the translated content. This step ensures that subjective factors, such as tone, brand and cultural relevance, are properly considered.
Use error granularity: At a segment level, you can evaluate translation errors more precisely by using methods like MQM. This will allow you to analyze specific issues within the translations (e.g., accuracy, fluency, terminology) and produce a defect score (see the scoring sketch after these steps).
Isolate specific attributes: If you are looking to evaluate specific translation aspects different from the standard error taxonomy, you can apply a binary method (e.g., yes/no for each attribute) to clearly define whether a specific quality standard is met.
Correlate scores: Finally, once you have both automated and human evaluation scores, you can correlate the two sets of results using Spearman or Kendall's Tau rank correlation. This will help you determine how well the automated metrics align with human judgements (Spearman) and whether the automated tools are consistent (Kendall’s).
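For step 1, here is a minimal sketch of reference-based COMET scoring using the open-source unbabel-comet package, assuming the publicly available Unbabel/wmt22-comet-da checkpoint; the segments are illustrative placeholders.

```python
# Minimal sketch: reference-based COMET scoring with the unbabel-comet package.
# Assumes the Unbabel/wmt22-comet-da checkpoint; segments are placeholders.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "El gato se sentó en la alfombra.",  # source segment
        "mt": "The cat sat down on the carpet.",    # LLM-generated translation
        "ref": "The cat sat on the rug.",           # human reference
    },
]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print("Segment scores:", output.scores)
print("System score:  ", output.system_score)
```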
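For step 2, here is a sketch of a GEMBA-style, reference-free direct-assessment prompt sent to an LLM via the OpenAI client. The prompt wording, model name, and language pair are assumptions rather than the official GEMBA setup, and production use would need batching, retries, and validation of the returned score.

```python
# Sketch: GEMBA-style, reference-free quality scoring with an LLM judge.
# The prompt wording and model name are assumptions, not the official GEMBA setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Score the following translation from {src_lang} to {tgt_lang} on a "
    "continuous scale from 0 to 100, where 0 means no meaning is preserved "
    "and 100 means a perfect, fluent translation.\n\n"
    "{src_lang} source: {source}\n"
    "{tgt_lang} translation: {translation}\n"
    "Respond with the score only."
)

def llm_quality_score(source: str, translation: str,
                      src_lang: str = "Spanish", tgt_lang: str = "English") -> float:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice; any capable model can be used
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(
            src_lang=src_lang, tgt_lang=tgt_lang,
            source=source, translation=translation)}],
    )
    return float(response.choices[0].message.content.strip())

print(llm_quality_score("El gato se sentó en la alfombra.",
                        "The cat sat down on the carpet."))
```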
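For step 5, here is a minimal sketch of turning MQM-style error annotations into a defect score. The severity weights (minor = 1, major = 5, critical = 10) and the per-1,000-words normalization are common choices, not a universal standard, so adjust them to your own quality programme.

```python
# Minimal sketch: an MQM-style defect score from error annotations.
# Severity weights and per-1,000-words normalization are assumptions;
# adjust both to your own quality programme.
ERROR_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_defect_score(errors: list[tuple[str, str]], word_count: int) -> float:
    """errors: (category, severity) pairs from annotation, e.g. ("terminology", "minor")."""
    penalty = sum(ERROR_WEIGHTS[severity] for _category, severity in errors)
    return penalty / word_count * 1000  # penalty points per 1,000 source words

annotations = [("terminology", "minor"), ("accuracy", "major"), ("fluency", "minor")]
print(mqm_defect_score(annotations, word_count=250))  # 1 + 5 + 1 = 7 points -> 28.0
```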
Conclusion
Be fearless and practical: perfection should not stand in the way of progress. Score, evaluate, and compare NMT and LLM, then integrate human workflows into the equation. Build your data and narrative, move from numbers to insights, and develop a framework tailored to your organization.
For LLM translation quality, combined quality evaluations pave the way for harmonization and smarter assessments. Identify the metric (or metrics) that provide data on the dimensions crucial to your business, such as fidelity, fluency, and semantics, and pair them with human evaluation metrics for pragmatics and the negotiated expectations. As LLMs evolve, you may need to update your evaluation methods to reflect newer trends in language models and translation quality.
You may not leave this article with a single definitive metric, but hopefully, you have found some “fruit” for thought. While BLEU has long been a staple, emerging metrics like COMET have proven to be more effective for evaluating LLM-generated translations, as seen in the WMT24 rankings. Similarly, GEMBA has shown strong correlation with human evaluation. Yet, a dedicated metric for LLM translation has not been established. While research explores xCOMET and MetricX, no universally accepted standard has cropped up.
Much like selecting the sweetest fruit from a variety of options, choosing the best metrics and the right quality evaluation methods requires considering the full range of available tools and assessing how they align with your needs. It is not about finding the one perfect measure but about understanding the options, mixing and matching to create something that truly serves your business, and keeping at it.