Applying WER to Arabic is not a simple matter of reusing the metric as-is. The language’s structure and usage present three fundamental challenges that can distort accuracy measurements.
1. Morphological Richness (The "One Word, Five Meanings" Problem)
Arabic is a morphologically rich language. Words are typically formed from a three-letter root that is combined with various patterns to create different meanings. Furthermore, Arabic uses a variety of clitics, which are functional particles like prepositions, conjunctions, and pronouns that attach to the beginning or end of a word. For example, the single written word "وسيكتبونها" (wasayaktubūnahā) translates to "and they will write it." This single token in Arabic corresponds to five distinct words in English.
This structure creates a significant ambiguity in word segmentation.
- Should "وسيكتبونها" be treated as one word or as multiple morphemes?
Different ASR systems and annotation standards may adopt different tokenization schemes. An ASR system that separates clitics will produce a different word count from one that does not, leading to inconsistent WER calculations. A system might correctly identify all the component morphemes but still be heavily penalized by WER if the reference transcription treats the entire token as a single word.
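To make the tokenization effect concrete, here is a minimal sketch of WER computed as token-level edit distance divided by reference length. The `wer` helper and the hand-made segmentation of "وسيكتبونها" are assumptions for illustration only, not the output of any particular ASR system or segmenter, and the helper assumes a non-empty reference.

```python
def wer(ref, hyp):
    """Word error rate: token-level edit distance divided by reference length."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # cost of deleting i reference tokens
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)


# The reference keeps the full orthographic word; the hypothetical ASR output
# splits it into clitics and stem (hand-segmented for this illustration).
reference = ["وسيكتبونها"]
hypothesis = ["و", "س", "يكتبون", "ها"]

print(wer(reference, hypothesis))             # 4.0
print(wer(reference, ["".join(hypothesis)]))  # 0.0 once both sides share a scheme
```

Every morpheme in the hypothesis is correct, yet the mismatch in tokenization alone yields a WER of 400%. Rejoining the segments by plain concatenation happens to work for this word; in general, mapping between schemes may need orthographic adjustments that this sketch ignores.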
2. The Diacritics Dilemma (The Vowel Blind Spot)
Standard written Arabic is typically undiacritized, meaning it omits the short vowel marks essential for pronunciation.
As a result, the same consonant string can have several readings. For example, the undiacritized word "كتب" can be read as
- kataba (he wrote),
- kutiba (it was written),
- or kutub (books).
An ASR system must predict the correct vowels to generate an accurate phonetic representation. This creates an evaluation mismatch that depends on how the reference text is written.
If the reference text is undiacritized, the ASR system isn’t evaluated on its ability to produce the correct vowels.
If the reference is diacritized, any vowel error is penalized, even if the core consonants are correct and the meaning is intelligible.
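One common way to control this mismatch is to strip diacritics from both reference and hypothesis before scoring. The sketch below is illustrative: the `strip_diacritics` helper and the exact set of marks it removes (U+064B to U+0652, plus the dagger alif U+0670) are assumptions, and a real evaluation setup may choose a different set.

```python
import re

# Tanwin, short vowels, shadda, sukun (U+064B to U+0652) and dagger alif (U+0670).
# The exact set of marks to strip is an assumption of this sketch.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving the consonantal skeleton."""
    return DIACRITICS.sub("", text)


# Written with escapes so the combining marks are explicit.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "it was written"

print(kataba == kutiba)                                      # False
print(strip_diacritics(kataba) == strip_diacritics(kutiba))  # True
```

Scored with diacritics, the two forms count as a word error; scored without them, they are identical. Reporting both figures separates consonant-level accuracy from vowel prediction accuracy.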
3. Dialectal Variation (The "Which 'Now' Do You Mean?" Problem)
The Arab world is characterized by diglossia, the coexistence of Modern Standard Arabic (MSA) with dozens of regional dialects. A spoken utterance may have multiple valid transcriptions. For example, the concept of "now" can be:
- al-ʾān (MSA)
- dilwaʾti (Egyptian)
- hallaʾ (Levantine)
If an ASR system outputs a valid dialectal synonym that is different from the one in the reference text, WER will mark it as an error. This penalizes the system for being correct, just in a different dialect.
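One possible mitigation is to map known dialectal variants to a shared canonical form before scoring. The sketch below uses the transliterations from the list above as stand-ins for Arabic-script tokens. The `CANONICAL` table, `normalize`, and `mismatches` are hypothetical names invented for this illustration, and a real equivalence table would have to be curated per dialect pair.

```python
# Transliterations from the text stand in for the Arabic-script tokens.
NOW_MSA, NOW_EGYPTIAN, NOW_LEVANTINE = "al-ʾān", "dilwaʾti", "hallaʾ"

# Hypothetical equivalence table: each variant maps to one canonical form.
CANONICAL = {NOW_EGYPTIAN: NOW_MSA, NOW_LEVANTINE: NOW_MSA}


def normalize(tokens):
    """Replace known dialectal variants with their canonical form."""
    return [CANONICAL.get(t, t) for t in tokens]


def mismatches(ref, hyp):
    """Count position-wise token mismatches (equal-length sequences only)."""
    return sum(r != h for r, h in zip(ref, hyp))


reference = [NOW_EGYPTIAN]
hypothesis = [NOW_LEVANTINE]

print(mismatches(reference, hypothesis))                        # 1
print(mismatches(normalize(reference), normalize(hypothesis)))  # 0
```

The trade-off is that collapsing variants also hides cases where the system really did output the wrong dialect for the audio, so normalized and raw scores are best reported side by side.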