What’s the Fuzzy Name Matching Problem?
There have been several disturbing incidents in the last few years where a difference in the spelling of a name has led to negative consequences. Perhaps the most famous is that of the Boston Marathon bombers, where the difference in the spelling of their surname, “Tsarnaev” or “Tsarnayev,” caused one of the terrorists not to be detained when he passed through JFK airport before the bombing. The person checking the watch list entered one spelling, “Tsarnaev,” and the watch list had the other, so no match was found. Incidents like this are not uncommon. Another terrorism-related example is that of the so-called “underwear bomber.”
In financial organizations, similar issues arise from banks’ duty to screen customers against watch lists such as those maintained by the US Treasury’s Office of Foreign Assets Control (OFAC). Banks and other organizations have paid fines for performing inadequate matching and so potentially letting unacceptable customers through or, alternatively, blocking customers who should be let through. Cases include:
- PayPal being fined by the US Treasury for inadequate sanctions screening
- TransUnion, a consumer credit reporting agency, losing a class action lawsuit for incorrectly flagging customers as criminals
- National Bank of Pakistan being fined by OFAC for processing funds transfers to an entity on its Specially Designated Nationals List
Where is Fuzzy Name Matching Used in the Real World?
There are in fact many areas beyond counterterrorism where fuzzy name matching is a critical technology. The following is a sample:
- Homeland Security: Border security, law enforcement, visa application screening
- Financial Industry: Anti-money laundering (AML), Politically Exposed Persons (PEP), Know Your Customer (KYC)
- Healthcare Industry: Patient record matching, fraud detection
- Retail Industry: Customer Data Management, Customer Stitching
- Background Screening Industry: Background checks
In all of these areas, accurate name matching is critical to establishing the identity of individuals.
Why is Fuzzy Name Matching Hard?
Name matching is hard because there are many different factors affecting the way in which the same name can vary, particularly if you’re looking at names from around the world:
- Simple misspellings, including names that sound alike: Sean vs. Shawn vs. Shaun
- Nicknames: William vs. Bill vs. Billy; Mikhail vs. Misha; Alexandra vs. Sasha
- Name Order Variants: Moon Jae-in vs. Jae-in Moon. (Asian names traditionally place the surname, e.g., Moon, first, but they occasionally occur in the Western order.)
- Initials: John Fitzgerald Kennedy vs. J.F. Kennedy
- Missing name elements: Ali al-Husseini al-Sistani, Ali Husseini al-Sistani, Ali al-Sistani (Arabic names can have many components and are frequently shortened in English.)
- Transliteration variants: Abdel Fattah el-Sisi vs. Abdul Fatah al-Sisi. (A language like Arabic is written in a script different from Latin and also has some sounds that don’t occur in English. When transforming the name from Arabic letters to English ones, differences in spelling frequently arise.)
- Transliteration Standard differences: Xi Jinping vs. Hsi Chin-p’ing. (Pinyin vs. Wade-Giles)
- Partitioning Variants: Chow Yun-Fat vs. Chow Yunfat vs. Chow Yun Fat. (The last two elements in this famous Chinese actor’s name constitute the given name, and they are written with two characters in Chinese script. The above are the three common possibilities for writing them in English.)
- Orthographic Variants: Joaquín Guzmán vs. Joaquin Guzman. (Spanish and many other European languages have diacritics in spelling. They are usually omitted in English.)
- Ethnicity-Specific Phenomena: ‘Abd al-Rahman vs. Abd al-Rahman vs. Abdul Rahman vs. Abdarrahman. (This is an Arabic male given name, sometimes also a surname. It has two component pieces in Latin transliteration: ‘Abd “Servant of” + al-Rahman “the Merciful.” It is still one name, but it can be split up in different ways in English.)
- Multiple languages: Xi Jinping vs. 习近平 vs. شي جين بينغ
- Combinations: Ali al-Bustani vs. ‘Ali Husayn el-Bustaani. (Combines a number of spelling differences and dropping of a component.)
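A couple of the simpler phenomena above, orthographic variants and name order variants, can be neutralized with plain normalization before any matching is attempted. The following is a minimal sketch using only the Python standard library; the helper names `strip_diacritics` and `normalize_tokens` are hypothetical and chosen for illustration. It does nothing for the harder cases such as nicknames, transliteration differences, or missing name elements.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Remove combining marks, e.g. 'Joaquín Guzmán' -> 'Joaquin Guzman'."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_tokens(name: str) -> frozenset:
    """Lowercase, strip diacritics, and return an order-independent token set."""
    cleaned = strip_diacritics(name).lower()
    return frozenset(cleaned.split())


# Orthographic variant: the diacritics are dropped.
print(strip_diacritics("Joaquín Guzmán"))  # Joaquin Guzman

# Name order variant: the token sets are equal regardless of order.
print(normalize_tokens("Moon Jae-in") == normalize_tokens("Jae-in Moon"))  # True

# Transliteration standard difference: normalization alone does not help.
print(normalize_tokens("Xi Jinping") == normalize_tokens("Hsi Chin-p'ing"))  # False
```

Comparing token sets ignores word order entirely, which is a deliberate simplification here; a real system would need to weigh order and treat surnames and given names differently.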
Traditional Approaches to Fuzzy Name Matching
Before we dive into the various approaches, we need to define some basic concepts that will be important in the discussion:
- Recall vs. Precision. These are the basic concepts for measuring the accuracy of matching. Recall measures the degree to which the matching process returns all matches that are considered good. Precision measures the degree to which the matching process returns only good matches. Both concepts are of critical importance. They are also in tension: if you maximize the recall, you are more likely to let some bad matches through. Alternatively, if you maximize the precision, you may miss some good matches. The best approach is generally one where recall and precision are in balance.
- Scalability. Many of the real-world uses for Name Matching described above require very high-volume, real-time matching. Scalability refers to the ability to increase matching capacity as the volume increases. For example, if 100 names are currently checked against a watch list per second, can the system maintain the same speed for 1,000 names?
- Speed, aka performance, refers to how long an individual match takes. This is not a constant for any approach to Name Matching because matches will vary a great deal in complexity (e.g., two names that differ in one letter will likely match much faster than two names with a difference of ten letters).
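To make recall and precision concrete, the sketch below computes both for a single query. The names in the two sets are invented purely for illustration, and the function name `precision_recall` is hypothetical. Precision is the share of returned matches that are good; recall is the share of good matches that were returned.

```python
def precision_recall(returned: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query.

    returned: names the matcher reported as matches
    relevant: names a human judge considers good matches
    """
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data only: a query for "Sean Smith" against a small list.
returned = {"Sean Smith", "Shawn Smith", "Stan Smith"}
relevant = {"Sean Smith", "Shawn Smith", "Shaun Smith", "Shon Smith"}

precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

In this made-up case the matcher let one bad match through (lowering precision) and missed two good ones (lowering recall), which is the tension described above.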
We compare several traditional approaches to fuzzy name matching:
- Dictionary Look-Up: These dictionaries try to list all possible variants of a name. One U.S. Government lexical resource listed over 100 occurring variant forms of just the name “Muhammad.” Obviously this is a labor-intensive approach, and it is inevitably doomed to have recall problems, since any dictionary is highly likely to miss some variants.
- Soundex: Soundex is the best known representative of key-based systems, which are characterized by employing an algorithm to produce a standard form of a name, which is known as a key. The goal is to have the same key for all names that sound the same. It typically suffers from a precision problem. For example, using a standard Soundex algorithm, a query for “Martin” would return “Morton,” “Mortenson,” and “Mordini” as good matches.
- Edit Distance: This approach simply counts the number of changes that have to be made to get from one name to another: “Kate” and “Cate” would have an edit distance of 1 and so would be considered a good match. It typically suffers, however, from not only a precision problem (extraneous matches) but also a recall problem (missing correct matches). For example, “Zheng Kai” would return “Zhang Kai” (only one letter difference), but “Zheng” and “Zhang” are distinct Chinese family names and an unlikely match, resulting in a precision error. On the other hand, “Ed Gough” would not return “Ed Guff,” which is a good match based on pronunciation, because the two surnames share few letters, resulting in a recall error.
- Rule-Based: This refers to those systems that rely on humans writing rules for matching. This approach is labor intensive, obviously, but it can incorporate a great deal of knowledge about names from different ethnicities, which can give the matching better world-wide coverage to handle the phenomena listed above. It typically suffers in the recall area, as the matching is limited by the humans’ knowledge. Consequently, it may miss non-obvious good matches.
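The Soundex and edit distance behavior described above can be reproduced with a short sketch. It assumes the common American Soundex rules (four-character key, vowels separate repeated codes, “h” and “w” do not) and the plain Levenshtein distance with unit costs; other variants of both algorithms exist and may give different results.

```python
# Letter-to-digit table for American Soundex; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex key of a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            key += code
        if ch not in "hw":  # h and w do not break a run of identical codes
            previous = code
    return (key + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous_row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current_row = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current_row.append(min(previous_row[j] + 1,          # deletion
                                   current_row[j - 1] + 1,       # insertion
                                   previous_row[j - 1] + cost))  # substitution
        previous_row = current_row
    return previous_row[-1]


# Soundex precision problem: four different surnames share one key.
for surname in ("Martin", "Morton", "Mortenson", "Mordini"):
    print(surname, soundex(surname))  # each prints M635

# Edit distance: close spelling is not the same as a good match.
print(edit_distance("Kate", "Cate"))            # 1
print(edit_distance("Zheng Kai", "Zhang Kai"))  # 1
print(edit_distance("Gough", "Guff"))           # 3
```

With a typical threshold of one or two edits, “Zheng Kai” and “Zhang Kai” would be accepted while “Gough” and “Guff” would be rejected, which is the opposite of what a human reviewer would want in both cases.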
A New Approach
The latest approach, which achieves both the highest recall and the highest precision, as demonstrated in the MITRE Challenge, uses a state-of-the-art machine learning algorithm and large-scale, real-world, multi-ethnicity name variant data. This approach automatically learns a collection of intelligent, probabilistic name matching rules from the data. Since the rules are learned automatically from real data, they are not bound by the limits of human knowledge; instead, they reflect the countless name variants that occur in the real world. The approach also detects name ethnicity automatically and applies the most appropriate matching model to each name based on its detected ethnicity in order to attain high accuracy.
Additionally, this approach handles cross-lingual name matching gracefully: users can search for names in multiple languages and scripts even if they know the name only in, say, English. Other approaches, like the ones described above, require transliteration of foreign scripts to a Latin representation, which introduces a large source of transliteration errors into the matching and affects both recall and precision.