Identity Resolution Facilitates Data Warehousing

Identity Resolution, Record Management

Identity Resolution for Data Warehousing

A Data Warehouse Is Essential to a Business

A data warehouse is a concept that has become very popular: it is meant to be a single layer in an organization’s IT environment that contains data from different sources and where the data has been reconciled and normalized so it can serve for downstream applications such as business intelligence or marketing. It is, of course, critical that the data in the warehouse be as consolidated, accurate, and complete as possible, but this can be difficult when the representation of the data in the different sources varies a great deal.

Challenges in Reconciling Customer Databases

Different customer databases may contain records about the same customer that may overlap in some elements (e.g., name) as well as contain elements that differ (e.g., address) or have no counterpart in the other database such as a phone number. A major consequence of this is that a data warehouse that simply aggregates records from different sources without intelligent merging may not be reliable enough to support further analysis.

A major stumbling block in comparing and consolidating customer records is the frequent lack of a common unique identifier. The logical models of two different databases can be very different – one relying on Account Number as a unique identifier, a second selecting a phone number to serve the same function.

In the absence of a common unique identifier, it will be necessary to use other attributes of the record, such as names, addresses, phone numbers, etc. The problem with attributes like these is that they can vary in many ways.

Challenges in Matching Personal Names

Personal names exhibit characteristic variations:

  • Misspellings: John Wallingham vs. John Walingham
  • Use of nicknames: Theodore Franks vs. Ted Franks
  • Presence or absence of middle names/initials: John Francis Tims vs. John F. Tims vs. John Tims
  • Spelling variations: Sean vs. Shaun.
  • Different order of the elements: John Smith vs. Smith John
  • One database may assign the entire name to one field, another may assign each name component to a separate field, while a third may assign two of the name components to one field:
    • Wallace B. Jones
    • Wallace / B. / Jones
    • Wallace B. / Jones

Personal names of different ethnic origins exhibit unique variations:

  • Spanish surnames consist of a patronymic and a matronymic (e.g., Maria Gonzalez Martin). When a surname is shortened, the patronymic is retained and the matronymic is dropped (in our example, “Maria Gonzalez” is a more likely match than “Maria Martin”).
  • Arabic names can consist of many elements, some of which are frequently omitted: Abdullah bin Najib al-Khalid vs. Abdullah al-Khalid vs. Abdullah Khalid.

Another challenging problem with representations of non-Western names is that of transliteration. This is the usual term for the rules for transforming letters in a non-Latin script like Arabic into Latin script. There are different sets of rules currently in use for transliterating Arabic names into Latin script. This results in many different versions of the same name:

  • Khaalid al-Rashiid
  • Khalid el-Rashid
  • Khalid ar-Rashid
  • Khaalid alRashiid
  • etc.

Arabic is among the most complex cases, but names from many languages transliterated into English exhibit similar phenomena. And that can make it very hard to recognize that a name in one database is the same as the name in another.

And these different types of variation can, of course, be combined in any number of ways, so the number of potential variants is very large.

Matching Other Data Fields Is Also Tricky

There are also challenges with other record attributes such as address, employers, email addresses, etc. These attributes exhibit different patterns of variation. For example, corporate names will vary on the presence or absence of corporate designators such as an “Inc.”; some database records will contain the full form of a corporate name and some just an acronym. All of these phenomena, and many others (including, of course, typos), need to be handled in order to match up database records.

Luckily, there’s a technology that can do this: Identity Resolution.

Identity Resolution Helps Create Data Warehouses

Identity Resolution identifies records that are similar and likely to belong to the same individual. The key technology to accomplish this is fuzzy name matching. It employs Machine Learning and AI techniques that generate a different trained matching model for each entity type (i.e., person names, company names, dates, addresses). Each model is trained to match the likely variations in each type. For example, it knows that “Maria Gonzalez” is a more likely variant of “Maria Gonzalez Martin” than “Maria Martin.” In addition, the matching models provide a numerical score that indicates how likely a match is of being good.

In matching database records with multiple attributes, Identity Resolution first matches each attribute separately against its counterpart in another data source. Then these separate attribute scores are combined into a single total score.  In combining, there have to be some business logic rules as well to decide which attributes are more important. Personal names must be very similar, but how about home addresses? Email addresses? People might move or use different email addresses.

In sum, Identity Resolution is critical to building a data warehouse. It employs sophisticated AI and Machine Learning techniques to accomplish this task.