Entity Extraction Enables Effective Redaction for Data Protection

Sensitive Internal Organizational Data Is Increasingly Exposed to the Outside

We all know the case of the hapless employee who mistakenly uploads sensitive documents to a public cloud without adequate security safeguards. In another insider threat scenario, a malicious insider gains access to sensitive organizational data and releases it to the world. In addition, cyber attacks by external actors often result in data breaches.

Moreover, as more and more organizations outsource storage and computing to third-party clouds, more individuals potentially gain access to internal organizational data. Organizations need more effective ways of protecting their data.

Governments Try to Safeguard Privacy

In response to these data protection concerns, government bodies have enacted regulations such as the European Union’s GDPR and California’s CCPA, which govern how organizations handle sensitive data. The organizations themselves have focused on setting up more granular restrictions on data access and protecting their network perimeter.

Redaction to the Rescue

There is also one important additional tool available: redaction (aka data masking). Redaction refers to the hiding of sensitive data.

Redaction is necessary in many areas beyond protecting against a breach of sensitive internal data, for example during legal discovery or when a government agency releases documents in response to FOIA requests. In other situations, organizations do want to share information across internal boundaries, but they need to keep some of the sensitive data hidden. For example, data used in a production environment may need to be made available in a development environment to support further system improvement. It may not be wise, however, for the engineers or other personnel in the development environment to see that data (e.g., a document containing customers’ PII).

This is where redaction comes in. A redaction tool can take sensitive data and mask it while preserving data characteristics, such as formatting, that engineers need in order to understand the data. A simple example is replacing every credit card number with a fictitious one that has the same number of digits.
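The credit card example can be sketched in a few lines of Python. This is a minimal illustration rather than a production masking tool: the function name `mask_card_numbers` and the pattern for what counts as a card-like number are assumptions made for this sketch, and the substituted digits are random, so they will not generally pass a checksum such as Luhn.

```python
import random
import re

# Assumed pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def mask_card_numbers(text, seed=None):
    """Replace every digit of a card-like number with a random digit.

    Separators and the number of digits are left unchanged, so the
    masked value keeps the original formatting.
    """
    rng = random.Random(seed)

    def substitute(match):
        return "".join(
            str(rng.randrange(10)) if ch.isdigit() else ch
            for ch in match.group(0)
        )

    return CARD_PATTERN.sub(substitute, text)

sample = "Card on file: 4111-1111-1111-1111, expires 12/27."
masked = mask_card_numbers(sample, seed=0)
print(masked)
```

The surrounding text and the hyphen positions are preserved; only the digits inside the matched number change.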

Redacting Unstructured Data Poses Special Problems

Redaction is fairly straightforward in the case of structured data. It is less so in the case of unstructured data, which refers to ordinary human language such as is found in corporate memos or HR documents containing employee information. The challenge here is that the location of sensitive data in unstructured text is unpredictable. Items such as names, dates of birth, addresses, and social security numbers can show up anywhere in a document. This is quite unlike structured data, where the location and semantic type of each item can be readily determined from the database. Fortunately, there’s a technology, Named Entity Recognition (another name for Entity Extraction), which can help.

How Entity Extraction Helps Protect Sensitive Data

Entity Extraction identifies occurrences of key concepts in text, such as names of people and organizations, and classifies their semantic types (e.g., PERSON, COMPANY, STATE). It does not rely on long lists of known names. Instead, it uses sophisticated algorithms that look in the immediate context of, say, a person name for clues indicating that a name is present. The key contribution of Entity Extraction is that it dynamically identifies, with high accuracy, names that have not been seen before.
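To make the idea of contextual clues concrete, here is a deliberately simple toy in Python. It is not how a real Entity Extraction engine works, since such systems typically rely on trained statistical models. The single clue used here (an honorific followed by capitalized words) and the names `HONORIFIC_CUE` and `find_person_candidates` are assumptions chosen only for illustration.

```python
import re

# Hypothetical context clue: an honorific followed by one to three
# capitalized words suggests a person name, even a never-seen one.
HONORIFIC_CUE = re.compile(
    r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+){0,2})"
)

def find_person_candidates(text):
    """Return the capitalized word sequences that follow an honorific."""
    return [match.group(1) for match in HONORIFIC_CUE.finditer(text)]

text = "Dr. Zorblat Quince met the board. The apple was green."
print(find_person_candidates(text))  # ['Zorblat Quince']
```

The made-up name is found without appearing in any list, purely because of the word that precedes it.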

One of the challenges in finding and redacting names in unstructured text is that a personal name may appear in its full form on first occurrence, e.g., “John Donaldson,” but then in shorter forms on subsequent occurrences in the same document: “John,” “Mr. Donaldson,” or “Donaldson.” In the case of company names, both the full name and an acronym may appear, e.g., “Booz Allen Hamilton, Inc.” vs. “BAH.” These shorter forms need to be identified and redacted along with the full form, since they may identify someone just as effectively. To preserve the fact that the mentions refer to the same entity, the referential chain may also need to be preserved (e.g., by replacing all the occurrences with PERSON-1).
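The PERSON-1 idea can be sketched as follows, assuming the mentions have already been found by an extractor. The linking rule used here (a shorter mention joins an earlier mention if all of its name tokens occur in it) is a simplification chosen for this sketch. It would wrongly merge two people who share a surname, and it does not handle acronyms such as “BAH.” The function names are hypothetical.

```python
import re

TITLES = {"mr", "mrs", "ms", "dr"}

def name_tokens(mention):
    """Lowercased name tokens with honorifics and punctuation removed."""
    tokens = re.findall(r"[A-Za-z]+", mention.lower())
    return [t for t in tokens if t not in TITLES]

def build_placeholders(mentions):
    """Map each mention to a placeholder such as PERSON-1.

    A mention whose tokens all occur in an earlier mention shares
    that earlier mention's placeholder.
    """
    entities = []  # list of (token set, placeholder) pairs
    mapping = {}
    for mention in mentions:
        tokens = set(name_tokens(mention))
        for known_tokens, placeholder in entities:
            if tokens and tokens <= known_tokens:
                mapping[mention] = placeholder
                break
        else:
            placeholder = f"PERSON-{len(entities) + 1}"
            entities.append((tokens, placeholder))
            mapping[mention] = placeholder
    return mapping

def redact(text, mapping):
    """Replace longer mentions first so "John Donaldson" is not split."""
    for mention in sorted(mapping, key=len, reverse=True):
        pattern = rf"\b{re.escape(mention)}(?!\w)"
        text = re.sub(pattern, mapping[mention], text)
    return text

mentions = ["John Donaldson", "Mr. Donaldson", "John", "Donaldson", "Mary White"]
text = ("John Donaldson joined in 2019. Mr. Donaldson now reports to "
        "Mary White, and John leads the audit.")
print(redact(text, build_placeholders(mentions)))
```

Every form of the first name collapses to PERSON-1 and the second person becomes PERSON-2, so a reader can still follow who did what without seeing the names.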

Entity Extraction also handles other difficult phenomena such as ambiguity between an item that could be a name as well as something else: “Apple” as a company vs. “apple” as a fruit. The former needs to be redacted, not the latter. A name like “Mary White” needs to be redacted even though its second element can be a simple color adjective in another, non-name context. Names from some naming traditions pose their own challenges. Arabic names, for example, can be long: “Abdurrahman al-Rashid al-Ayyubi.” The entire name needs to be redacted, not just parts of it.

Entity Extraction Automates Redaction of Unstructured Data

Entity Extraction is able to identify a wide range of concepts, not just person or organization names. For example, its coverage includes many kinds of numeric identifiers that are clearly sensitive:

  • Social security numbers
  • Credit card numbers
  • Account numbers
  • Phone numbers
  • etc.

Other kinds of data include Place of Birth, Date of Birth, and addresses (both physical and email).
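Items with a regular shape, such as the numeric identifiers and email addresses above, can be illustrated with simple pattern matching. Note that this sketch swaps in plain regular expressions for a trained extractor, and the three US-style patterns and the bracketed labels are assumptions made for illustration. Real data would need broader, validated rules.

```python
import re

# Illustrative US-style patterns only; not an exhaustive rule set.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def redact_patterns(text):
    """Replace each pattern match with its bracketed type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "SSN 123-45-6789, phone (555) 010-4477, email j.doe@example.com."
print(redact_patterns(sample))
# SSN [SSN], phone [PHONE], email [EMAIL].
```

Replacing each value with its semantic type, rather than deleting it, keeps the redacted text readable.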

These sensitive items may appear anywhere in an organization’s unstructured text holdings, so effectively redacting them requires a fast and scalable technology such as Entity Extraction. Redaction done purely by humans isn’t up to the task in today’s vast data environments.