Home-Grown Name Matching Often Doesn’t Hit the Mark
The NetOwl team is often approached by customers that have a home-grown name matching system but have realized that it is not meeting their needs.
In some cases, when these customers were contemplating the Buy vs. Build decision for their name matching requirements, they thought that name matching was basically about performing simple string matching, so they could assign in-house staff for a few months to build their own. This is often misguided because, in practice, matching names involves going beyond mundane issues such as simple typos and handling initials versus the full form of names (e.g., “John William Jones” vs. “John W. Jones”) to navigating linguistic diversity and cultural differences in the variations of names. These differences make name matching a complex task that blends computer science, linguistics, and even the social conventions of a society.
In other cases, customers had implemented well-known name matching algorithms such as edit distance (e.g., Levenshtein) or phonetic algorithms (e.g., Soundex, Metaphone) but came to realize that these algorithms also had significant limitations.
In particular, home-grown name matchers fail in two major dimensions:
- Scalability: They often don’t scale up when the matching volumes increase and/or a real-time response is required.
- Accuracy: They are just not accurate enough. They miss correct matches (false negatives) and return too many bad matches (false positives).
What Are the Use Cases for Name Matching?
Broadly speaking, Name Matching typically comes into play in these two types of scenarios:
- “Bad guy” scenarios, where person and organization names need to be matched against lists of bad actors who are on watch lists for AML/KYC compliance, risk management, fraud detection, or border security reasons among others.
- “Good guy” scenarios, where there’s a need to match against internal databases, such as to determine whether two different customer records are in fact for the same person. The main purpose of this use case is to avoid record duplication, to consolidate records from multiple databases, or to verify an identity.
For other use cases for Name Matching, see our What is Name Matching? blog.
Why is Name Matching Challenging?
Name Matching needs to handle the many ways in which names can legitimately or accidentally vary. Some of these variations are simple; others are quite complex. The most basic include:
- Simple misspellings: Dik Simpson vs. Dick Simpson
- Variations in word order: John Dickerson vs. Dickerson, John
- Nicknames: Joseph Thompson vs. Joe Thompson
- Missing components: Mary T. Johnson vs. Mary Johnson
- Initials: John Jackson vs. J. Jackson
- etc.
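To make the limits of plain string comparison concrete, here is a minimal sketch of Levenshtein edit distance applied to some of the pairs listed above. The function name `levenshtein` and the choice of pairs are ours for illustration; this is a textbook implementation, not how any particular product works.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# Pairs taken from the list of basic variations above.
pairs = [
    ("Dik Simpson", "Dick Simpson"),        # simple misspelling
    ("John Dickerson", "Dickerson, John"),  # word order
    ("Joseph Thompson", "Joe Thompson"),    # nickname
    ("John Jackson", "J. Jackson"),         # initial
]

for left, right in pairs:
    print(f"{left!r} vs. {right!r}: distance {levenshtein(left, right)}")
```

The misspelling is a single edit apart, but the reordered name comes out many edits apart even though it is the same name written differently. A fixed distance threshold therefore either misses the reordered pair or admits many unrelated names.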
More complex phenomena include those that are specific to names of a certain ethnicity, for example:
- Transliteration variants: A language like Arabic is written in a script different from the Latin alphabet used for English. When a name is converted from Arabic script to Latin letters, spelling differences frequently arise because there is no universal, agreed-upon transliteration standard.
  - Abdallah el-Sisi vs. Abdullah al-Sisi
  - Farid Bacha vs. Fareed Basha
- Differences in transliteration standards: for instance, there are two main transliteration standards for Chinese (i.e., Pinyin and Wade-Giles), and they can produce significantly different transliterations:
  - Chinese: Zhang Ziyí vs. Chang Tzu-yi
- Elements that are frequently omitted:
  - Arabic names frequently have the definite article attached to some elements of the name, and it is often dropped: Hamid al-Sistani vs. Hamid Sistani
  - Spanish person names typically include two surnames, one from the father and one from the mother. For instance, in “Santiago Ramos Guzman,” Ramos is the paternal surname and Guzman is the maternal surname. The second surname is frequently dropped, as in “Santiago Ramos.”
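As a rough illustration of how a phonetic algorithm behaves on the transliteration examples above, the following is a small sketch of American Soundex. The helper name `soundex` is ours, and the "Bacha" vs. "Bush" pair is an added example that does not come from the list above.

```python
# Letter-to-digit table for American Soundex; vowels, H, W, and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code for a single name component."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


for left, right in [("Abdallah", "Abdullah"), ("Farid", "Fareed"),
                    ("Bacha", "Basha"), ("Zhang", "Chang"),
                    ("Bacha", "Bush")]:  # last pair is an added illustration
    print(f"{left} -> {soundex(left)}, {right} -> {soundex(right)}")
```

In this sketch the Arabic variants receive identical codes, which is the intended behavior. "Zhang" and "Chang" receive different codes because Soundex always keeps the first letter, so the match is missed. "Bacha" and "Bush" share a code even though they are different names, which shows how such codes can also produce false positives.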
For more information and examples of such ethnicity-related complications in name matching, see our What is Name Matching? blog.
Some of our customers have even more complex requirements that call for matching names written in different scripts. Take this Hebrew example:
- יחזקאל אלון vs. Ezekiel Alon
In Hebrew the order of writing is right-to-left as opposed to the Latin alphabet’s left-to-right. Also, like other Semitic languages such as Arabic, not all the vowels are written. In our example the first name in Hebrew, when transliterated letter by letter into English, reads yhzqal (there are no capital letters in Hebrew).
For more examples of matching different scripts, see our What is Cross-Language Name Matching? blog.
Imagine how complex the name matching problem can become when differences in writing systems combine with variations like misspellings and ethnicity-specific phenomena!
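The letter-by-letter reading of the Hebrew example above can be reproduced with a very small sketch. The mapping table `HEBREW_TO_LATIN` is a hypothetical, deliberately incomplete one that covers only the letters in this example; rendering alef as "a" follows the reading given above, and rendering vav as "w" is our assumption.

```python
# Hypothetical consonant mapping covering only the letters in the example.
HEBREW_TO_LATIN = {
    "\u05d9": "y",  # yod
    "\u05d7": "h",  # het
    "\u05d6": "z",  # zayin
    "\u05e7": "q",  # qof
    "\u05d0": "a",  # alef, rendered "a" as in the reading above
    "\u05dc": "l",  # lamed
    "\u05d5": "w",  # vav (assumption: rendered as "w")
    "\u05e0": "n",  # nun
    "\u05df": "n",  # final nun
}


def transliterate(text: str) -> str:
    """Map each Hebrew letter to a Latin letter; leave anything else as-is."""
    return "".join(HEBREW_TO_LATIN.get(ch, ch) for ch in text)


# The name from the example, stored in logical (reading) order.
hebrew_name = "\u05d9\u05d7\u05d6\u05e7\u05d0\u05dc \u05d0\u05dc\u05d5\u05df"
print(transliterate(hebrew_name))  # prints "yhzqal alwn"
```

The result, "yhzqal alwn", shares few letters with "Ezekiel Alon", so comparing the naive transliteration against the English spelling character by character is unlikely to find the match.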
Why Home-Grown Solutions Don’t Work for Name Matching
If your organization has a requirement for one of the name matching use cases we covered above, building your own in-house name matching solution may seem like an appealing option. It may seem easy and cost effective to implement using “traditional” algorithms.
However, as illustrated by the examples above, the challenges of name matching require a team that has long and deep experience in the area. Here are some specific pitfalls:
- Lack of expertise in name matching
In many countries, such as the US, names come from a wide variety of linguistic and ethnic origins. Your team would need a good understanding of the characteristics of names from each of them.
Simple algorithms such as Edit Distance and Soundex cannot address the issues of too many false negatives or false positives. (For a discussion of Edit Distance and Soundex and their shortcomings, see our What is Name Matching? blog.) Your team would need someone who could bring in and apply more advanced AI algorithms that work.
If scalability is important, as is often the case, your team would also need experience in engineering the software to operational-quality standards.
- Measuring accuracy objectively
A good vendor of a Name Matching product collects and maintains very large sets of training and test data of name variants for its product development and testing. The product is constantly run against the test data to ensure that accuracy (a measure of false positives and false negatives) is increasing and not declining. If you do intend to implement a home-grown solution, your team would need to collect such data and devise a way to quantitatively measure accuracy regularly.
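One common way to quantify false positives and false negatives is to compute precision and recall against a labeled test set. The sketch below assumes a hypothetical gold set of name pairs that should match and a hypothetical set of pairs returned by a matcher; the names `Scores` and `score_matches` are ours, and the data is a toy example built from names used in this post.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float  # share of returned matches that are correct
    recall: float     # share of correct matches that were returned
    f1: float         # harmonic mean of precision and recall


def score_matches(predicted: set, expected: set) -> Scores:
    """Compare the pairs a matcher returned with the pairs it should return."""
    true_pos = len(predicted & expected)
    false_pos = len(predicted - expected)  # bad matches that were returned
    false_neg = len(expected - predicted)  # correct matches that were missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


# Toy gold set: pairs that refer to the same person (hypothetical labels).
expected = {
    ("John W. Jones", "John William Jones"),
    ("Hamid Sistani", "Hamid al-Sistani"),
    ("Santiago Ramos", "Santiago Ramos Guzman"),
    ("Joe Thompson", "Joseph Thompson"),
}
# Hypothetical matcher output: three correct pairs, one bad pair, one miss.
predicted = {
    ("John W. Jones", "John William Jones"),
    ("Hamid Sistani", "Hamid al-Sistani"),
    ("Joe Thompson", "Joseph Thompson"),
    ("John Jackson", "John Dickerson"),
}

print(score_matches(predicted, expected))
```

Running a scoring routine like this on every change to the matcher, over a much larger test set, is one way to catch accuracy regressions early.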
- Total cost of ownership
Contrary to popular belief, a home-grown solution is not free: the employees implementing it could be working on other tasks that require their expertise. Building a name matcher requires a serious time commitment. It’s not a matter of a few months.
In addition, the upkeep of the system rarely ends. The team would need to monitor the accuracy of the system and reduce false positives and false negatives. Scalability also needs to be monitored regularly to ensure efficient performance in use cases where high-volume or near real-time matching is required.
A home-grown name matcher will not receive the regular upgrades that a good vendor will provide. In all likelihood the home-grown name matcher will not be able to benefit from improvements in the technology, and its performance will gradually decline.
A further risk of a home-grown solution is that the key developers who built it may leave, so as a practical matter it may become impossible to maintain the capability at some point.
Organizations need to assess the true cost when considering building their own name matching solution.
Summary
Because it is an advanced technology relying on sophisticated algorithms, Name Matching is definitely not a commodity technology. The wise buyer, if they are not a specialist, would do well to acquire a well-designed, highly accurate, and robust product in the space.



