What is Name Matching?
What’s the Fuzzy Name Matching Problem?
Fuzzy Name Matching is about identifying names that are not identical yet similar enough that they likely represent the same entity. There are many areas where it is very important to identify such name variants.
For example, there have been several alarming incidents in which a difference in the spelling of a name led to very negative consequences. The most tragic is that of the Boston Marathon bombers, where a difference in the spelling of their surname meant that one of the terrorists was not detained when he passed through JFK airport before the bombing. The person checking the watchlist entered the spelling “Tsarnaev” while the terrorist watchlist had “Tsarnayev”. No match. Incidents like these have not been uncommon. Another terrorism-related one was the so-called “underwear bomber.”
In the financial world, banks and other financial institutions face hefty fines for not properly screening individuals against sanctions lists such as those maintained by OFAC, which are meant to prevent money laundering (Anti-Money Laundering or AML), terrorist financing (Counter-Terrorism Financing or CTF), etc. For instance:
- In 2026 the Bank of Scotland was fined £160,000 for opening a bank account and processing payments for a sanctioned individual, an ally of Vladimir Putin. The individual in question opened the bank account using his UK passport, which contained a spelling variation (to be precise, a Cyrillic-to-Latin transliteration variant) of his surname (Ovsiannikov) that differed from the spelling on the sanctions list (Ovsyannikov).
- In 2024 Britain’s Starling Bank was fined £29 million for failing to vet customers. The bank had grown very fast, from 43,000 customers to 3.6 million over five years. In 2021 the UK’s Financial Conduct Authority (FCA) found that Starling hadn’t grown its compliance program to keep pace with its greatly increased business volume and ordered Starling to stop onboarding high-risk customers. Starling didn’t comply with the order and continued failing to vet customers. Ultimately, because of Starling’s repeated failure to comply, the FCA announced in October 2024 that it had fined Starling £29 million (US$38.5 million).
- In 2024 TD Bank was fined $3 billion for shortcomings in its AML safeguards, which were driven, like Starling Bank’s, by a culture prioritizing growth over compliance. On October 10, 2024, the United States Department of Justice (DOJ) announced that TD Bank had agreed to a $3 billion settlement with the US government over charges that it repeatedly failed to detect money-laundering activities within its institution.
Where is Fuzzy Name Matching Used in the Real World?
There are in fact many other areas beyond AML and CTF where Fuzzy Name Matching is a critical technology. The following is a sample:
- Financial Industry:
  - Politically Exposed Persons (PEP): Financial institutions must identify and verify the identity of politically high-ranking customers and conduct ongoing monitoring to identify any suspicious transactions.
  - Know Your Customer (KYC): This is an overarching requirement for financial institutions: they must institute procedures to verify a customer’s identity and assess their risk, understand the nature of the customer’s activities and assess money laundering risks associated with customers.
  - Money Transfer Verification: In making a money transfer, the payer supplies the payee’s name and bank account information. The payee’s name must match the name on the receiving account.
- Healthcare Industry:
  - Patient Record Matching: Hospitals and other medical institutions need to unify their patient data across different datastores to create more complete patient records in order to improve patient care and avoid errors, duplicated tests, etc. Name matching is a critical component of this.
  - Fraud Detection: Medicare, Medicaid, and other kinds of fraud are common. Government agencies, insurance companies, and medical institutions need name matching technology to combat it. For instance, they need to screen providers against lists of providers that have previously engaged in fraudulent activities.
- Retail Industry:
  - Customer Data Management: Merchants need to unify all the data they gather on customers through the multiple channels now available (webforms, customer support calls, etc.).
- Background Screening Industry:
  - Background checks: For example, when onboarding a new employee, a company typically will use a background screening company to find any red flags.
In the above industry areas, there are two main use cases. The first is Name Search, where a name is searched against one or more databases of names; this is common when more information is being sought about an individual. The second is Name Comparison, where a name is compared against one other name to see if there is a match; this applies to money transfer verification as described above.
In both use cases it is of critical importance to accurately match names so as to ensure the identity of individuals.
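The two use cases differ mainly in their shape: one name against one name, versus one name against many. The sketch below is only an illustration of that difference. The function names, the 0.85 threshold, and the use of Python's `difflib.SequenceMatcher` as a stand-in similarity score are all assumptions made for the example, not a description of any particular product.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Placeholder score in [0, 1]; a real name matcher would replace this."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_names(a: str, b: str, threshold: float = 0.85) -> bool:
    """Name Comparison: one name against one other name."""
    return similarity(a, b) >= threshold


def search_names(query: str, database: list[str], threshold: float = 0.85) -> list[tuple[str, float]]:
    """Name Search: one name against a whole list, best matches first."""
    scored = [(name, similarity(query, name)) for name in database]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

With this placeholder score and the assumed threshold, `compare_names("Tsarnaev", "Tsarnayev")` returns `True`, and `search_names("Tsarnaev", ["Tsarnayev", "Smith"])` returns only the first entry. The threshold is arbitrary and would need tuning against real data.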
Why is Fuzzy Name Matching Hard?
Name matching is hard because there are many different factors affecting the way in which the same name can vary, particularly if you’re looking at names from around the world:
- Simple misspellings:
  - William vs. Wiliam
- Name variants that sound alike but are spelled differently:
  - Christy vs. Christie
- Nicknames:
  - English: Edward vs. the nicknames Ed/Eddie/Eddy/Ted
  - Russian: Tatiana vs. Tania
  - Spanish: Rosario vs. Chayo; José vs. Pepe
- Name Order Variants:
  - Suzuki Ichiro (Surname + First Name) vs. Ichiro Suzuki (First Name + Surname)
    - The usual order for Asian names is Surname + Given Name. However, Western sources almost always use the original order for Chinese and Korean, but usually use the Western order for Japanese.
- Initials:
  - John Morris Jones vs. J.M. Jones
- Missing name elements:
  - Ali al-Sharif al-Maliki vs. Ali Maliki
    - Arabic names can have many components, such as the definite article al-. They are frequently omitted in English.
- Transliteration variants. For instance, a language like Arabic is written in a script different from English. When transforming the name from Arabic letters to English ones, differences in spelling frequently arise because there is no universal, agreed-upon standard for transliterating.
  - Abdallah el-Sisi vs. Abdullah al-Sisi
  - Farid Bacha vs. Fareed Basha
- Transliteration Standard differences as in Pinyin vs. Wade-Giles, the two main Chinese transliteration standards:
  - Chinese: Xi Jinping vs. Hsi Chin-p’ing
- Partitioning Variants:
  - Huang Xiaoming vs. Huang Xiao Ming vs. Huang Xiao-Ming
    - The last two elements in this Chinese actor’s name constitute the given name and they are written with two characters in Chinese script. The above are the three common possibilities for writing them in English.
    - The actor Huang Xiaoming also uses a partially anglicized form of his name: Mark Huang. In that form the family name Huang comes second, in the Western manner. This use of a Western given name with a Chinese surname is particularly common for people from Hong Kong.
- Use vs. non-use of diacritics: For instance, Spanish and many other European languages have diacritics in spelling. They are usually omitted in English.
  - Raúl Jiménez vs. Raul Jimenez.
- Ethnicity-Specific Phenomena:
  - ‘Abd al-Rahim vs. Abd al-Rahim vs. Abdul Rahim vs. Abdarrahim.
    - ‘Abd al-Rahim is an Arabic male given name. It has two component pieces: ‘Abd “Servant of” + al-Rahim “the Compassionate”. It is still one given name in Arabic, but it can be split up in different ways in English.
- Various language scripts: Some use cases involve matching a name in one script against a name in another. Think of a bank in the Middle East that needs to match names in Arabic script against a sanctions list in a Latin-script database such as OFAC.
  - Abdelrahman Qureshi vs. عبد الرحمن قريشي
- Combinations:
  - Hamid Sharif al-Mahfuz vs. Hameed el-Mahfouz.
    - This example combines a number of spelling differences and dropping of a component.
- Other types of data also exhibit variations:
  - Organizational names: International Business Machines vs. IBM; Smith & Jones LLC vs. Smith and Jones; Federal Express vs. FedEx
  - Dates (not a name but they exhibit variations): 10/09/1991 vs. October 9, 1991
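A few of the variations listed above (diacritics, name order, and hyphen vs. space partitioning) can be neutralized by simple normalization before any comparison takes place. The following is a minimal sketch using Python's standard `unicodedata` module; the function names are hypothetical, and the sketch deliberately leaves the harder phenomena (nicknames, initials, transliteration, missing elements) unsolved.

```python
import unicodedata


def fold_diacritics(name: str) -> str:
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_tokens(name: str) -> list[str]:
    """Lowercase, fold diacritics, and split on spaces and hyphens."""
    folded = fold_diacritics(name).lower().replace("-", " ")
    return folded.split()


def same_tokens_any_order(a: str, b: str) -> bool:
    """True when both names contain the same tokens, ignoring order."""
    return sorted(name_tokens(a)) == sorted(name_tokens(b))
```

Under these assumptions, `fold_diacritics("Raúl Jiménez")` gives `"Raul Jimenez"`, and `same_tokens_any_order("Suzuki Ichiro", "Ichiro Suzuki")` is `True`. By contrast, `same_tokens_any_order("John Morris Jones", "J.M. Jones")` is `False`, which shows how quickly exact normalization runs out and fuzzy matching becomes necessary.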
Traditional Approaches to Fuzzy Name Matching
Before we dive into the various approaches, we need to define some basic concepts that will be important in the discussion:
- Recall vs. Precision. These are the basic concepts for measuring the accuracy of matching. Recall measures the degree to which the matching process returns all matches that are considered good. Precision measures the degree to which the matching process returns only good matches and not bad ones. Both concepts are of critical importance. They are also in tension: if you maximize the recall, you are more likely to let some bad matches through. Alternatively, if you maximize the precision, you may miss some good matches. The best approach is generally one where recall and precision are in balance. This depends, of course, on the requirements of the use case in question.
- Speed, also called throughput performance, refers to how long an individual match takes. This is not a constant for any approach to Fuzzy Name Matching because matches will vary a great deal in complexity (e.g., two names that differ in one letter will likely match much faster than two names with a difference of ten letters).
- Scalability. Many of the real-world uses for Fuzzy Name Matching described above require very high volume matching in real time. Scalability refers to the ability to increase matching capacity as the volume increases. For example, if 100 names are currently checked against a watchlist per second, will the system still keep up when it has to match 1,000 names per second?
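Recall and precision as defined above can be computed directly once you know which candidates a matcher returned and which ones were actually good matches. The sketch below is a minimal illustration; the function name and the sample name sets are made up for the example, and returning 0.0 for an empty set is a convention chosen here, not a standard.

```python
def precision_recall(returned: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = good matches returned / all returned.
    Recall = good matches returned / all good matches."""
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four good matches exist, the matcher returns three names.
relevant = {"Mohammed", "Mohamed", "Muhamad", "Mohamad"}
returned = {"Mohammed", "Mohamed", "Mahmoud"}
precision, recall = precision_recall(returned, relevant)
```

In this made-up case two of the three returned names are good, so precision is 2/3, while only two of the four good matches were found, so recall is 0.5.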
Below we compare several traditional approaches to fuzzy name matching:
- Dictionary Look-Up: These dictionaries try to list all possible variants of a name. This approach appears to be easy to maintain: if the dictionary look-up misses a name, just add it to the list! However, dictionary look-up is a very labor-intensive approach in that you have to build the list in the first place. Most importantly, it can’t match names the system doesn’t have on its internal lists.
To give an example, the most common name in the world is probably “Muhammad.” It has many attested variants, such as:
  - Muhamad
  - Muhammed
  - Muhamed
  - Mohammed
  - Mohamed
  - Mohamad
  - Etc.
To get an idea of how common Muhammad really is plus its variants, see here.
Aside from the name variants, it has nicknames like Mo, Moe, Momu, Mamu, and Memo to add to the complexity.
- Soundex: Soundex is the best known representative of key-based systems (others are Metaphone and Double Metaphone), which are characterized by employing an algorithm to produce a standard form of a name, which is known as a key. The goal is to have the same key for all names that sound the same. Soundex is fast to execute and has good recall, but it typically suffers from a precision problem. For example, using a standard Soundex algorithm, a query for “Martin” would return “Morton,” “Mortenson,” and “Mordini” as good matches. Soundex ignores most vowels in creating a key, which of course increases recall, but it lowers precision significantly, as can be seen from the examples.
- Edit Distance: This approach simply counts the number of changes that have to be made to get from one name to another: “Kate” and “Cate” would have an edit distance of 1 and so would be considered a good match. It typically suffers, however, from not only a precision problem but also a recall problem (missing correct matches). For example, “Zheng Kai” would return “Zhang Kai” (only one letter difference), but “Zheng” and “Zhang” are distinct Chinese family names and an unlikely match, resulting in a precision error. On the other hand, “Bill Hough” would not return “Bill Huff,” which is a good match based on pronunciation, because the two surnames share few letters. This is a recall error.
- Rule-Based: This refers to those systems that rely on humans writing rules for matching. This approach can incorporate a great deal of knowledge about names from different ethnicities, which can give the matching better world-wide coverage to handle the phenomena listed above. However, this approach is very labor intensive, and it typically suffers in the recall area, as the matching is limited by the humans’ knowledge. Consequently, it may miss non-obvious good matches.
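The first three approaches above are simple enough to sketch in a few lines each, which makes their strengths and weaknesses easy to reproduce. The code below is an illustrative sketch, not a production implementation: the variant dictionary is a hypothetical list built from the spellings mentioned above, `soundex` follows the standard American Soundex rules for a single name token, and `edit_distance` is the plain Levenshtein distance.

```python
# Dictionary look-up: a hypothetical variant list, keyed by a canonical form.
VARIANTS = {
    "muhammad": {"muhammad", "muhamad", "muhammed", "muhamed",
                 "mohammed", "mohamed", "mohamad"},
}
CANONICAL = {v: canon for canon, group in VARIANTS.items() for v in group}


def dictionary_match(a: str, b: str) -> bool:
    """Match only when both spellings are listed under the same canonical name."""
    key_a = CANONICAL.get(a.lower())
    return key_a is not None and key_a == CANONICAL.get(b.lower())


# Soundex: consonants map to digit classes; vowels, h, w, and y carry no code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name: str) -> str:
    """American Soundex key: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            key += code
        if ch not in "hw":  # h and w do not separate letters with the same code
            previous = code
    return (key + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous_row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current_row = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous_row[j - 1] + (ch_a != ch_b)
            current_row.append(min(previous_row[j] + 1, current_row[j - 1] + 1, substitution))
        previous_row = current_row
    return previous_row[-1]
```

With these definitions, `soundex` returns `M635` for all four of Martin, Morton, Mortenson, and Mordini, which is the precision problem described above. `edit_distance` gives 1 for both Kate/Cate and Zheng Kai/Zhang Kai, but 3 for Bill Hough/Bill Huff. And `dictionary_match("Mohammad", "Muhammad")` is `False`, simply because that spelling is missing from the hypothetical list.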
A New Approach
The latest approach, which achieves both high recall and high precision, uses state-of-the-art Machine Learning algorithms. It also uses name variant data that is large-scale, drawn from the real world, and multi-ethnic, reflecting the huge universe of names.
This approach automatically learns a very large collection of intelligent, probabilistic name matching rules from the name data. Since the rules are automatically learned from real data, they are not bound by limitations of humans’ knowledge as in the rule-based approach, but they reflect countless name variants that occur in the real world.
It is also able to recognize the ethnicity of a name and to construct specific matching models for those names that handle the specific name phenomena of that ethnicity. For example, Arabic ethnicity-specific models can match Khalif and Qaaliif.
Additionally, the Machine Learning approach handles cross-lingual name matching effectively where users can search names in multiple languages and language scripts even if they know only the name in, say, English. Other approaches such as Soundex and Edit Distance require transliteration of foreign scripts to a Latin representation before matching is done. This introduces a large source of transliteration errors into the matching, negatively affecting both recall and precision.
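To make the idea of learning matching rules from variant data more concrete, here is a deliberately tiny toy. It is not the method described in this article and not any product's algorithm: it merely counts letter substitutions seen in a handful of hypothetical variant pairs and then makes those substitutions cheaper in an edit distance. The training pairs, function names, and cost formula are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical training data: pairs of spellings known to name the same person.
VARIANT_PAIRS = [
    ("abdallah", "abdullah"),
    ("bacha", "basha"),
    ("muhamad", "mohamad"),
    ("muhammed", "mohammed"),
    ("mohamed", "mohamad"),
]


def learn_substitutions(pairs):
    """Count letter substitutions seen in equal-length variant pairs."""
    counts = Counter()
    for a, b in pairs:
        if len(a) != len(b):
            continue  # toy simplification: no alignment step for unequal lengths
        for ch_a, ch_b in zip(a, b):
            if ch_a != ch_b:
                counts[frozenset((ch_a, ch_b))] += 1
    return counts


def weighted_distance(a, b, counts):
    """Edit distance in which frequently observed substitutions cost less."""
    previous_row = [float(j) for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        current_row = [float(i)]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                sub_cost = 0.0
            else:
                # Assumed cost formula: unseen swaps cost 1, seen swaps cost less.
                sub_cost = 1.0 / (1 + counts[frozenset((ch_a, ch_b))])
            current_row.append(min(
                previous_row[j] + 1.0,
                current_row[j - 1] + 1.0,
                previous_row[j - 1] + sub_cost,
            ))
        previous_row = current_row
    return previous_row[-1]


counts = learn_substitutions(VARIANT_PAIRS)
```

Here the u/o swap was observed twice, so `weighted_distance("muhamed", "mohamed", counts)` is 1/3, whereas the unseen k/c swap in `weighted_distance("kate", "cate", counts)` still costs 1.0. A single global table like this one would also make Zheng and Zhang look closer, because an e/a swap was observed in the sample pairs, which hints at why ethnicity-specific models are useful.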
Summary
As we have seen, Machine Learning–based Fuzzy Name Matching achieves maximum accuracy in terms of recall and precision while ensuring high scalability and performance. It represents a significant advance over traditional rule-based approaches as discussed above. By learning from real-world data and adapting to linguistic diversity and different cultural naming conventions, these systems achieve higher accuracy and robustness in real-world applications. As global data integration continues to grow, intelligent name matching will continue to be a critical component of many information processing systems across many sectors of the economy.
