What Is Cross-Language Name Matching?
Cross-language name matching is the process of matching names written in different writing systems. One example is matching the Russian Владимир Путин to its English equivalent, Vladimir Putin.
Why Is Cross-Language Name Matching Critical?
In a globalized world, there are many areas where cross-language name matching plays a critical role:
- Border security: checkpoints and consular services must screen applicant names against databases and watchlists covering criminal records, fraud, previous visa violations, terrorism, and other security risks. Such databases and lists are typically in Latin characters, so names in non-Latin scripts must be matched against names in Latin script.
- Record management: multi-national companies and international organizations may hold records of customers, clients, and employees in both Latin and non-Latin scripts. Cross-language name matching is necessary to verify identities and to link all the records that refer to the same individual, both for record completeness and to avoid duplication.
- AML and KYC/KYB compliance: banks located in, say, the Middle East need to be able to match names in non-Latin characters against sanctions lists, such as those published by OFAC, that are mostly in Latin characters.
Why Is Cross-Language Name Matching Challenging?
You might think that cross-language name matching is a simple one-to-one mapping from one set of characters to another. It’s a lot more complicated than that, for many reasons. Here are a few:
- One language may not have a sound that another one does.
Arabic does not have the sound p. In transliterating the name Trump, it will typically substitute the letter b, which it does have: ترامب. (The letter b in Arabic is ب, at the end of the name; Arabic writing is cursive and runs right to left, the opposite of English.) The name therefore sounds like Trumb.
Similarly, Russian lacks an “h” sound, so it sometimes substitutes a hard “g,” which is a very different sound from English “h”:
Harvard = Гарвард (“Garvard”)
On other occasions, it will substitute a sound that is close to, but not the same as, English “h”:
Hillary Clinton = Хиллари Клинтон (“Khillari Klinton,” where “kh” is the sound at the end of Scottish “loch”).
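To make the sound-substitution problem concrete, here is a minimal Python sketch. It assumes a deliberately tiny, hypothetical one-to-one Cyrillic-to-Latin table (`CYRILLIC_TO_LATIN`) and uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. It is an illustration of why exact matching after a character-by-character mapping fails, not a production matcher.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny one-to-one table (illustration only).
CYRILLIC_TO_LATIN = {
    "а": "a", "в": "v", "г": "g", "д": "d", "и": "i", "к": "k",
    "л": "l", "н": "n", "о": "o", "р": "r", "т": "t", "х": "kh",
}

def naive_romanize(text: str) -> str:
    """Map each character independently; unknown characters pass through."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

romanized = naive_romanize("Гарвард")
exact_match = romanized == "harvard"             # False: the table yields "garvard"
fuzzy_score = similarity(romanized, "harvard")   # high, but below 1.0
```

The exact comparison fails because of a single substituted sound, while a fuzzy score still places the two spellings close together. That gap is what any practical matcher has to bridge.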
- Languages differ in what kind of syllables they permit.
English permits a wide variety, including syllables that end in a consonant or a group of consonants, e.g., one-syllable words like dot or sixth (the latter being frequently hard to pronounce, even for native speakers). By contrast, Japanese allows only much simpler syllables: a syllable can end in a vowel or the consonant n. That’s it. The result of this difference between the two languages is that an English name will be altered to fit Japanese phonetic habits: Tom Cruise will appear as トム・クルーズ (pronounced “Tomu Kuruuzu”). Japanese has added a vowel to the end of both the first and last names, and another between the “k” and “r” sounds of Cruise, to make the name conform to Japanese sound patterns.
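The Tom Cruise example can be reproduced with a few lines of Python. This is a sketch only: `KATAKANA_TO_ROMAJI` is a hypothetical table covering just the characters in this one name, and the handling of the long-vowel mark is simplified.

```python
# Hypothetical table covering only the katakana used in this example.
KATAKANA_TO_ROMAJI = {"ト": "to", "ム": "mu", "ク": "ku", "ル": "ru", "ズ": "zu"}

def romanize_katakana(text: str) -> str:
    """Romanize katakana syllable by syllable (simplified sketch)."""
    out = ""
    for ch in text:
        if ch == "・":        # the middle dot separates the parts of a name
            out += " "
        elif ch == "ー":      # the long-vowel mark repeats the previous vowel
            if out and out[-1] in "aiueo":
                out += out[-1]
        else:
            out += KATAKANA_TO_ROMAJI.get(ch, ch)
    return out

romaji = romanize_katakana("トム・クルーズ")   # "tomu kuruuzu"
```

Every katakana character in the table carries its own vowel, so the romanized form is necessarily longer than the English original. A matcher has to treat those extra vowels as expected variation rather than as a mismatch.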
- Some alphabetic writing systems leave most vowels unwritten.
For example, the name Mohammad bin Salman is written in Arabic as محمد بن سلمان. Rendered letter for letter in Latin characters, the name is mhmd bn slman (Arabic doesn’t have capital letters).
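One common workaround, sketched below in Python, is to compare "consonant skeletons": strip the vowels from the Latin spelling, collapse doubled letters, and compare the result with the letter-for-letter rendering of the Arabic. The table `ARABIC_TO_LATIN` and both helper functions are hypothetical and cover only the letters in this one name. Treat this as an illustration of the idea rather than a robust method.

```python
from itertools import groupby

# Hypothetical table covering only the letters in this example.
ARABIC_TO_LATIN = {
    "م": "m", "ح": "h", "د": "d", "ب": "b",
    "ن": "n", "س": "s", "ل": "l", "ا": "a",
}

def romanize_arabic(text: str) -> str:
    """Letter-for-letter rendering; unknown characters pass through."""
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in text)

def consonant_skeleton(latin: str) -> str:
    """Drop the vowels a, e, i, o, u and collapse runs of a repeated character."""
    no_vowels = (ch for ch in latin.lower() if ch not in "aeiou")
    return "".join(ch for ch, _ in groupby(no_vowels))

arabic_skeleton = consonant_skeleton(romanize_arabic("محمد بن سلمان"))
latin_skeleton = consonant_skeleton("Mohammad bin Salman")
same_skeleton = arabic_skeleton == latin_skeleton
```

The same skeleton also results from alternative spellings such as Muhammed, which is both the strength of the trick and its weakness: discarding vowels makes unrelated names with the same consonants collide.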
- Sometimes there are drastically different transliterations of a foreign name.
For example, the English name Trump has two different transliterations in Chinese. Most Chinese-language media outside of China refer to Trump as ‘Chuānpǔ’ (川普). By contrast, most Chinese state media and Chinese-language UK media (such as the BBC) use ‘Tèlǎngpǔ’ (特朗普).
The reason for this is that the beginning of Trump’s name, ‘tr’, has no counterpart in Chinese phonetics. It needs to be converted to a sound in Chinese, and in this case there are two possibilities. The media haven’t agreed to standardize on one choice.
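When several renderings of one name are already known, the simplest way to handle them is an alias table that maps each variant to a single canonical form. The Python sketch below assumes a hypothetical, hand-maintained table (`ALIASES`). It only works for variants someone has already recorded, which is why it does not scale to names nobody has seen before.

```python
from typing import Optional

# Hypothetical alias table: each known rendering maps to one canonical name.
ALIASES = {
    "川普": "Trump",
    "特朗普": "Trump",
}

def canonical_name(name: str) -> Optional[str]:
    """Return the canonical Latin form, or None if the variant is unknown."""
    return ALIASES.get(name.strip())

def same_person(a: str, b: str) -> bool:
    """True only when both variants are known and share a canonical form."""
    ca, cb = canonical_name(a), canonical_name(b)
    return ca is not None and ca == cb
```

Here `same_person("川普", "特朗普")` is true, but any rendering missing from the table is simply not matched.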
Many other factors make cross-language matching difficult. The list above is by no means comprehensive, but it gives you an idea.
Approaches to Cross-Language Name Matching
In another blog post, we have already discussed traditional and more modern approaches to name matching, so we won’t go into detail here. In a nutshell, what is needed is a way of performing sophisticated fuzzy name matching across languages, i.e., the approach must be able to capture the typical variations in spelling found between languages, such as those in the examples above.
There is such a way: a machine learning-based approach that learns from real-world data offers a graceful way of handling fuzzy cross-language matching. It does not rely, as some approaches do, on first transliterating the foreign name into Latin characters and then matching it to Latin names. That is a problematic approach, since the transliteration step introduces a whole new layer of possible errors. It’s more accurate to match the foreign names directly against the Latin names.
A machine-learning approach automatically learns a set of probabilistic matching rules from a large amount of real-world data that provides strong coverage of many scripts. This makes it possible to match names accurately across language barriers that were once insurmountable.
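To give a feel for what "learning probabilistic matching rules from data" can mean, here is a toy Python sketch. It is not the approach used by any particular product. It assumes a handful of illustrative training pairs (`TRAINING_PAIRS`), and it makes the strong simplification that both names have the same length and line up character by character. Real systems need a proper alignment model and far more data. The function names and the smoothing constants are likewise assumptions.

```python
import math
from collections import Counter, defaultdict

# Toy training pairs (Cyrillic, Latin). Illustrative only.
TRAINING_PAIRS = [
    ("гамбург", "hamburg"),
    ("гавана", "havana"),
    ("гомер", "homer"),
    ("грант", "grant"),
    ("дарвин", "darwin"),
]

def learn_char_counts(pairs):
    """Count how often each Cyrillic character pairs with each Latin character."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        if len(src) != len(tgt):
            continue  # toy simplification: keep only position-aligned pairs
        for s, t in zip(src, tgt):
            counts[s][t] += 1
    return counts

def match_score(src, tgt, counts, alpha=0.1, vocab=26):
    """Average log-probability of tgt given src, with additive smoothing."""
    if not src or len(src) != len(tgt):
        return float("-inf")
    total = 0.0
    for s, t in zip(src, tgt):
        seen = counts.get(s, Counter())
        p = (seen[t] + alpha) / (sum(seen.values()) + alpha * vocab)
        total += math.log(p)
    return total / len(src)

counts = learn_char_counts(TRAINING_PAIRS)
score_harvard = match_score("гарвард", "harvard", counts)
score_garvard = match_score("гарвард", "garvard", counts)
score_richard = match_score("гарвард", "richard", counts)
```

Because the training pairs contain г rendered both as “h” and as “g”, the learned counts score Harvard and Garvard as plausible matches for Гарвард and score the unrelated Richard far lower. Note that the Cyrillic name is scored directly against the Latin candidates, without first being converted to Latin characters.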
Summary
Effective matching of names in different writing systems is a key component in many areas ranging from border security to AML compliance and record management. A machine learning-based approach provides effective cross-language name matching.



