Entity Extraction Helps Guard Sensitive Data


Protecting Sensitive Data Is an Imperative for Organizations

Protecting information privacy is a major priority for all organizations today regardless of industry, whether public or private. Given the escalating number of cyber attacks on corporate networks, it’s critical that companies be able to guard their sensitive information from data breaches. An important step is to enhance the capability to accomplish at least two goals:

  • Locate all sensitive information on the corporate network, including information the company may not be aware of. This includes, for example, text documents that might contain personal information of employees or customers.
  • Scan all incoming and outgoing email and text messages (plus their attachments) to prevent sensitive data from being sent inadvertently.

And in those cases where a data breach has already occurred, it’s important to assess what sensitive data may have been exposed, so that the affected parties can be notified and offered mitigating actions such as credit monitoring.

In all three cases, the very first step is to be able to identify sensitive data.

What Is Sensitive Data?

The term “sensitive data” covers a lot of ground: a company’s financial data, trade secrets, etc. In this post, however, we’ll focus on what is known as Personally Identifiable Information (PII), i.e., information that could cause an individual great harm if it gets into the hands of unscrupulous people. PII includes the following:

  • Names of people
  • Social Security numbers
  • Phone numbers
  • Addresses
  • Email addresses
  • Dates of birth
  • Passport numbers
  • Credit card numbers
  • etc.

Sometimes two data elements have to be associated to be considered truly sensitive PII. For example, a date of birth by itself is not considered sensitive PII, but it becomes so if, say, a full name is joined with it.
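The idea that some elements become sensitive only in combination can be sketched as a small rule check. This is a minimal illustration, not a standard: the entity type labels and the choice of which types count as sensitive on their own are assumptions made for the example; only the name plus date of birth pairing comes from the text above.

```python
# Hypothetical entity type labels; a real extractor would define its own.
# Assumption: these types are treated as sensitive even in isolation.
SENSITIVE_ALONE = {"SSN", "PASSPORT_NUMBER", "CREDIT_CARD_NUMBER"}

# Combinations that become sensitive only when found together,
# e.g. a full name joined with a date of birth.
SENSITIVE_TOGETHER = [{"PERSON_NAME", "DATE_OF_BIRTH"}]


def is_sensitive(entity_types):
    """Return True if the entity types found in one document form sensitive PII."""
    found = set(entity_types)
    if found & SENSITIVE_ALONE:
        return True
    # A combination counts only if every member of it was found.
    return any(combo <= found for combo in SENSITIVE_TOGETHER)
```

Under these assumed rules, a document containing only a date of birth is not flagged, while one containing both a name and a date of birth is.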

Technologies that provide protection against cyber-attacks such as phishing and spam have typically relied on statistical analysis of unstructured data supported by machine learning techniques. These technologies look at an entire text plus metadata (sender/receiver, etc.) to determine whether the item is suspicious.

However, for identifying the PII contained in an otherwise harmless text, a different technology is required: Entity Extraction.

How Does Entity Extraction Recognize PII and Why Is It Hard?

Entity Extraction performs a linguistic analysis of the text to find contextual clues that entities corresponding to PII are present. It has a strong model of what a likely person name looks like, for example, and can recognize such names not only in well-edited text but also in noisy content such as ungrammatical, incomplete, or misspelled text. The main contribution of Entity Extraction is that it can recognize previously unknown examples of PII.

Some PII in unstructured text is quite predictable in format, such as Social Security numbers, and is well differentiated from other numeric expressions, so it is in principle easy to identify. Other numerical data is more variable: the same phone number can appear as (212) 678-4545, 2126784545, or +1-212-678-4545. Although they look easy, phone numbers turn out to be hard to differentiate from other numerical amounts in unstructured text, especially when numbers from a variety of countries are included, and they require very sophisticated extraction techniques to avoid a high number of false positives (i.e., spurious hits).
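To make the contrast concrete, here is a rough pattern-matching sketch. It is not how an Entity Extraction engine works; it is a naive baseline that shows why format alone is not enough. Both patterns are assumptions for illustration: the first expects the hyphenated 3-2-4 digit layout of a Social Security number, and the second covers only US-style phone numbers.

```python
import re

# Assumed format: Social Security numbers written as 3-2-4 digits with hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumed format: US-style numbers with an optional +1 prefix, optional
# parentheses around the area code, and optional separators.
PHONE_PATTERN = re.compile(
    r"(?<!\d)(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?!\d)"
)


def normalize_phone(raw):
    """Reduce a matched phone number to its 10 national digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits


def find_phone_candidates(text):
    """Return normalized digit strings for every phone-like match in text."""
    return [normalize_phone(m.group()) for m in PHONE_PATTERN.finditer(text)]
```

All three spellings of the example number normalize to the same digits, but the same pattern also matches any ten-digit invoice or order number. That is the false-positive problem described above, and it is why contextual analysis is needed on top of format matching.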

Other types of PII are more complex still. For example, person names are not limited to First Name + Middle Name + Last Name, a convention Europe inherited from the Romans. Other cultures structure names very differently. In many East Asian names, the family name typically precedes the given name: Nakamiri Hisao. Spanish names can carry both a paternal and a maternal surname, in effect two surnames: Mario Ramirez y Bolivar. Entity Extraction recognizes all of these and more.


Entity Extraction helps organizations identify the PII in their data. Once Entity Extraction has identified the PII, the organization can decide what to do with it, depending on its business logic: remove the document containing it from the network, redact the data, or substitute fictional data. If a data breach has already occurred, the organization can assess how many parties have been affected and to what degree, and take appropriate mitigating action.
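As one illustration of the redaction option, the sketch below replaces each identified entity with a placeholder. It assumes the extractor returns non-overlapping (start, end, label) character offsets; that output shape and the label names are hypothetical, not the interface of any particular product.

```python
def redact(text, spans):
    """Replace each (start, end, label) span in text with a [LABEL] placeholder.

    Assumes the spans do not overlap.
    """
    # Work from right to left so that earlier offsets stay valid
    # while the text changes length.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


message = "Contact Mario at 212-678-4545."
name_start = message.index("Mario")
phone_start = message.index("212-678-4545")
spans = [
    (name_start, name_start + len("Mario"), "PERSON"),
    (phone_start, phone_start + len("212-678-4545"), "PHONE"),
]
redacted = redact(message, spans)
```

Substituting fictional data would follow the same pattern, with a generated value in place of the bracketed label.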

In sum, Entity Extraction offers an effective way for organizations to keep sensitive data safe in the face of ever-growing threats like data breaches.