What is Cross-Language Name Matching?


Different Writing Systems Pose Challenges for Name Matching Technology

In another blog post, we discussed the technology of Name Matching and why it is important. Here we focus on the challenges of Cross-Language Name Matching.

Cross-Language Name Matching refers to matching names written in different writing systems. One example is the challenge of mapping the Arabic جو بايدن to its English equivalent, Joe Biden.
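As a rough illustration of what "different writing systems" means in practice, the sketch below labels the script of each letter in a name using Python's standard unicodedata module. The function names scripts_in and is_cross_script are hypothetical, and taking the first word of the Unicode character name as the script label is a simplifying assumption, not a full script classifier.

```python
import unicodedata


def scripts_in(name):
    """Return the set of script labels found among the letters of a name."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            # Unicode character names begin with the script,
            # e.g. "ARABIC LETTER BEH" or "LATIN SMALL LETTER O".
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts


def is_cross_script(name_a, name_b):
    """True when the two names share no script, so matching is cross-language."""
    return scripts_in(name_a).isdisjoint(scripts_in(name_b))
```

A check like this only tells a system that two names need cross-language matching; it says nothing about whether they actually match.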

Why is Cross-Language Name Matching Useful?

There are many areas where Cross-Language Name Matching is useful:

  • In processing at a border security checkpoint, most of the Information Technology systems used will be restricted to Latin characters, but some of the documentation that travelers present may well be in non-Latin scripts.
  • In entity record management for multi-national companies, records of customers and employees may have both Latin and non-Latin data.
  • In performing Know Your Customer matching of names for compliance purposes, a bank located in, say, the Middle East will need to be able to match names in non-Latin characters against compliance lists, such as the OFAC sanctions lists, that are mostly in Latin characters.

Why is Cross-Language Name Matching Challenging?

You might think that Cross-Language Name Matching is a straightforward task, a simple one-to-one mapping from one set of characters to another. It is considerably more complicated than that, for many reasons. Here are a few:

  • One language may not have a sound that another one does.

Arabic does not have the sound p. In transliterating the name Trump, Arabic typically substitutes the letter b, which it does have: ترامب (the Arabic letter b is ب, at the end of the name; Arabic writing runs right to left, the opposite of English). The name therefore sounds like Trumb.

Similarly, Russian lacks an “h” sound, so it substitutes a hard “g”:  Harvard = Гарвард.
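The two substitutions just described can be sketched as a small lookup table. The table contents and the function name adapt_sounds are hypothetical and cover only the examples in this post; the sketch also treats each Latin letter as one sound, which is a simplification.

```python
# Hypothetical, hand-picked table: sounds a language lacks, and the
# sound it typically substitutes (only the examples from the text).
MISSING_SOUND_SUBSTITUTES = {
    "arabic": {"p": "b"},
    "russian": {"h": "g"},
}


def adapt_sounds(name, target_language):
    """Replace sounds the target language lacks with their usual substitutes."""
    table = MISSING_SOUND_SUBSTITUTES[target_language]
    return "".join(table.get(ch, ch) for ch in name.lower())
```

For instance, adapt_sounds("Trump", "arabic") yields "trumb" and adapt_sounds("Harvard", "russian") yields "garvard", which is why a matcher has to treat Trumb and Garvard as plausible renderings of the original names.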

  • Languages differ in what kind of syllables they permit.

English permits a wide variety, including syllables that end in a consonant, e.g., the one-syllable word dot. By contrast, Japanese only allows much simpler syllables: a syllable can end in a vowel or the consonant n. That’s it. The result of this difference between the two languages is that an English name will be altered to fit Japanese phonetic habits: Tom Cruise will appear as トム・クルーズ (pronounced Tomu Kuruuzu). Japanese has added a vowel to the end of both the first and last names, and another between the K and the r, to make the name conform to Japanese sound patterns.
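The vowel insertion in the Tom Cruise example can be approximated with one rule: after any consonant other than n that is not followed by a vowel, add a vowel. The sketch below is a deliberately crude assumption (it always inserts u and works on a single word spelled roughly as it sounds); real Japanese adaptation uses other vowels in some contexts, and the function name is hypothetical.

```python
VOWELS = set("aeiou")


def fit_to_japanese_syllables(word):
    """Insert 'u' after consonants that would otherwise end a syllable."""
    word = word.lower()
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        next_ch = word[i + 1] if i + 1 < len(word) else ""
        # A consonant other than 'n' must be followed by a vowel.
        if ch not in VOWELS and ch != "n" and next_ch not in VOWELS:
            out.append("u")
    return "".join(out)
```

With the rough phonetic spellings "tom" and "kruuz" as input, the function returns "tomu" and "kuruuzu", mirroring the pronunciation given above.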

  • Some alphabetic writing systems leave most vowels unwritten.

For example, the name Mohammad bin Salman is written in Arabic as محمد بن سلمان

Spelled out in English letter for letter, the name is mhmd bn slman. Only the long a in Salman is written.
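One simple way to compare a fully voweled Latin spelling with a mostly vowelless one is to reduce both to a consonant skeleton. The sketch below is an illustrative heuristic, not a description of any particular product; the function name consonant_skeleton is hypothetical, and it assumes plain lowercase Latin input without diacritics.

```python
VOWELS = set("aeiou")


def consonant_skeleton(name):
    """Drop vowels and collapse immediately doubled letters."""
    out = []
    prev = ""
    for ch in name.lower():
        doubled = ch == prev  # e.g. the second 'm' in "mohammad"
        prev = ch
        if doubled or ch in VOWELS:
            continue
        out.append(ch)
    return "".join(out)
```

Both "Mohammad bin Salman" and the letter-for-letter spelling "mhmd bn slman" reduce to "mhmd bn slmn", as does the variant spelling "Muhammed bin Salman". The price is that unrelated names can collapse to the same skeleton, so a heuristic like this can only be one signal among several.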

  • Sometimes there are multiple possible transliterations of a foreign name.

For example, the name Trump has two different transliterations in Chinese. Most Chinese-language media outside of China refer to Trump as ‘Chuānpǔ’ (川普). By contrast, most Chinese state media, as well as Chinese-language UK media such as the BBC, use ‘Tèlǎngpǔ’ (特朗普).

The reason is that the beginning of Trump’s name, ‘tr’, has no counterpart in Chinese phonetics. It has to be converted to a sound that Chinese does have, and in this case there are two possibilities. The media could agree to standardize on one choice, but they haven’t.
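Because one name can have several accepted renderings, any lookup has to be one-to-many. A minimal sketch, assuming a hand-maintained table (the table KNOWN_VARIANTS and the function name are hypothetical, and the only entries are the two Chinese forms mentioned above):

```python
# Hypothetical variant table: one Latin-script name, several renderings.
KNOWN_VARIANTS = {
    "trump": {"川普", "特朗普"},
}


def matches_known_variant(latin_name, foreign_name):
    """True if foreign_name is a listed rendering of latin_name."""
    return foreign_name in KNOWN_VARIANTS.get(latin_name.lower(), set())
```

A table like this cannot anticipate renderings nobody has listed yet, which is one reason fixed lookups alone are not enough for cross-language matching.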

Many other factors make cross-language matching difficult. The list above is by no means comprehensive, but it gives you an idea.

Approaches to Cross-Language Name Matching

In another blog post, we discussed traditional and more modern approaches to name matching, so we won’t go into detail here. In a nutshell, what is needed is a way of performing sophisticated fuzzy name matching across languages, i.e., the approach must be able to capture the typical variations in spelling found between languages, such as those in the examples above.

There is such a way: a machine learning-based approach that learns from real-world data offers a graceful way of handling fuzzy cross-language matching. It does not rely, as some approaches do, on first transliterating the foreign-script name into Latin characters and then matching the result against Latin-script names. That two-step approach is weak, since the transliteration step introduces a whole new layer of possible errors. It’s much better to match the foreign-script names directly against the Latin-script names.

Our machine-learning approach automatically learns a set of probabilistic matching rules from a large amount of real-world data that provides good coverage of many scripts. It makes possible highly accurate matching of names across once-insurmountable language barriers.
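To give a feel for what probabilistic matching rules applied directly to a foreign-script name might look like, here is a toy sketch. It is not the approach described above: the rule table is tiny, the probabilities are made up by hand rather than learned from data, and the names RULES and match_score are hypothetical. Each Cyrillic letter is allowed to correspond to one or more Latin spellings, and the score is the probability of the best way of lining the two names up.

```python
# Hypothetical rules with made-up probabilities; a real system would
# learn rules like these from large amounts of paired name data.
RULES = {
    "г": {"g": 0.7, "h": 0.3},
    "а": {"a": 1.0},
    "р": {"r": 1.0},
    "в": {"v": 0.8, "w": 0.2},
    "д": {"d": 1.0},
}


def match_score(foreign, latin, rules=RULES):
    """Probability of the best alignment of a foreign name to a Latin name."""
    foreign, latin = foreign.lower(), latin.lower()
    # best[i][j]: highest probability that foreign[:i] corresponds to latin[:j]
    best = [[0.0] * (len(latin) + 1) for _ in range(len(foreign) + 1)]
    best[0][0] = 1.0
    for i, ch in enumerate(foreign):
        for j in range(len(latin) + 1):
            if best[i][j] == 0.0:
                continue
            for piece, prob in rules.get(ch, {}).items():
                if latin.startswith(piece, j):
                    k = j + len(piece)
                    best[i + 1][k] = max(best[i + 1][k], best[i][j] * prob)
    return best[len(foreign)][len(latin)]
```

Under these made-up numbers, Гарвард gets a non-zero score against both Harvard and Garvard and a score of zero against Hartford, without the Cyrillic name ever being converted to a single Latin spelling first.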

Why Cross-Language Name Matching Is Critical

Highly accurate data management is critical to an organization’s success. Effective matching of different data records is a key component of keeping your data clean, complete, and deduplicated. To accomplish this, matching needs to go beyond traditional Latin-script-only matching, because in today’s globalized world many organizations have data records in different scripts.

If you are, or will be, in the market for a matching product with cross-language capabilities, a thorough evaluation is crucial. Be cautious and demanding: very few products can truly handle the types of matching challenges discussed above.