Entity Extraction Unlocks the Insights Hidden in Unstructured Data


Entity Extraction for Big Data

“Big Data” Has Become the New Normal for Many Organizations

The current flood of data of all sorts (text, imagery, video, voice, etc.) over the Internet is enormous. In particular, digitization has led to an explosion in the amount of unstructured text data, i.e., ordinary human language such as is found in tweets, email, etc. It’s estimated that unstructured text data constitutes up to 90% of all data in the world today. In addition, the arrival of AI-based chatbots that automatically generate unstructured text in response to queries, along with other applications that use Large Language Models, all but guarantees that unstructured text will become even more predominant, perhaps in the very near future.

The special challenge of unstructured data is that information technology tools have historically not handled it well. Until fairly recently, AI technology could not understand unstructured text with high accuracy: human language was simply too vague and ambiguous for reliable automated processing.

However, that has changed with the introduction of AI-based content-enrichment technologies such as Entity Extraction.

Entity Extraction is the Key to Overcoming the Challenge of Unstructured Data

Entity Extraction can be defined as an AI technology that reads unstructured text and produces structured representations of the key concepts in it. The motivation for Entity Extraction is that structured representations have predictable and precise formats like databases do, and any semantic ambiguities in them are removed. Such structured representations are needed to support many useful end-user applications such as semantic search, trend analysis, link analysis, geospatial analysis, visualizations, etc.

Structured representations of unstructured text are necessary because natural language is highly ambiguous, imprecise, and vague. The simple word “bank” can refer to a financial institution or to the bank of a river. Another challenge of natural language is the multiplicity of ways in which the same concept can be expressed:

  • “JP Morgan acquired the troubled First Republic Bank on Monday.”
  • “JP Morgan bought the troubled First Republic Bank on Monday.”
  • “The troubled bank First Republic Bank was acquired by JP Morgan on Monday.”
  • “JP Morgan carried out its acquisition of the troubled First Republic Bank on Monday.”
  • etc.

Entity Extraction solves these problems by identifying the key concepts in text and producing a structured representation that is identical for the four sample sentences above:

  • CORPORATE_ACQUISITION_EVENT:
    • BUYER: JP MORGAN
    • ACQUISITION: FIRST REPUBLIC BANK
    • TIME_OF_ACQUISITION: MONDAY

Entity Extraction has essentially removed the variability of vocabulary and syntax in the four example sentences and replaced it with a single structured representation that is suitable for further automated processing.
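
As a rough illustration of that idea, here is a minimal Python sketch that maps the four sample sentences to one record. It is a toy: the two regular expressions and the function name extract_acquisition are assumptions made for this example, and a real Entity Extraction engine relies on linguistic analysis rather than hand-written patterns.

```python
import re

# Hypothetical patterns for a toy acquisition event. They only cover the
# four sample sentences and stand in for real linguistic analysis.
PASSIVE = re.compile(
    r"^(?:The troubled bank )?(?P<target>.+?) was acquired by "
    r"(?P<buyer>.+?) on (?P<time>\w+)\.$"
)
ACTIVE = re.compile(
    r"^(?P<buyer>.+?) (?:acquired|bought|carried out its acquisition of) "
    r"(?:the troubled )?(?P<target>.+?) on (?P<time>\w+)\.$"
)

def extract_acquisition(sentence):
    """Return one normalized event record, or None if no pattern applies."""
    # The passive pattern is tried first because the active pattern would
    # otherwise misread "was acquired by" as an active verb.
    match = PASSIVE.match(sentence) or ACTIVE.match(sentence)
    if match is None:
        return None
    return {
        "event": "CORPORATE_ACQUISITION_EVENT",
        "BUYER": match["buyer"].upper(),
        "ACQUISITION": match["target"].upper(),
        "TIME_OF_ACQUISITION": match["time"].upper(),
    }

SENTENCES = [
    "JP Morgan acquired the troubled First Republic Bank on Monday.",
    "JP Morgan bought the troubled First Republic Bank on Monday.",
    "The troubled bank First Republic Bank was acquired by JP Morgan on Monday.",
    "JP Morgan carried out its acquisition of the troubled First Republic Bank on Monday.",
]

records = [extract_acquisition(s) for s in SENTENCES]
```

All four entries in records come out as the same dictionary, which is the point: downstream tools see one event regardless of how the sentence was worded.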

How Entity Extraction Does It

Entity Extraction first recognizes key concepts in unstructured text such as:

  • Personal and organizational names
  • Names of locations
  • Numerical amounts of all kinds
  • Dates/times
  • etc.

It produces structured representations of these concepts, which can be in a variety of output formats such as JSON, RDF, and XML.
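
To make the output formats concrete, the sketch below serializes the same small set of entities as JSON and as XML using only the Python standard library. The field names (type, text) and the element names (entities, entity) are assumptions chosen for illustration, not the schema of any particular product.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative entities; the "type" and "text" field names are assumptions.
entities = [
    {"type": "ORGANIZATION", "text": "JP Morgan"},
    {"type": "ORGANIZATION", "text": "First Republic Bank"},
    {"type": "DATE", "text": "Monday"},
]

def to_json(items):
    """Serialize the entity list as a JSON document."""
    return json.dumps({"entities": items}, indent=2)

def to_xml(items):
    """Serialize the entity list as a small XML document."""
    root = ET.Element("entities")
    for item in items:
        node = ET.SubElement(root, "entity", type=item["type"])
        node.text = item["text"]
    return ET.tostring(root, encoding="unicode")

json_output = to_json(entities)
xml_output = to_xml(entities)
```

Either form carries the same information, so the choice of format can follow whatever the consuming application already expects.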

Entity Extraction does not rely on long lists of known names to do this. It uses the context to recognize these entities and what kind of entities they represent. This is usually termed dynamic recognition, and it is one of the great strengths of Entity Extraction. It recognizes all instances of key concepts, not just the ones previously known to the user. This is a critical leap beyond traditional keyword search.
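
The difference between list lookup and context-based recognition can be sketched in a few lines of Python. Everything here is hypothetical: the names in the sample sentence are invented, and the two context cues (a title before a name, a corporate suffix after one) are a drastic simplification of what a production system would use.

```python
import re

# What a list-based matcher would know in advance.
KNOWN_NAMES = {"JP Morgan"}

# Hypothetical context cues: a title before a personal name,
# or a corporate suffix at the end of an organization name.
CONTEXT_RULES = [
    ("PERSON", re.compile(r"\b(?:Mr|Ms|Dr)\. ((?:[A-Z][a-z]+ ?)+)")),
    ("ORGANIZATION", re.compile(r"\b((?:[A-Z][a-z]+ )+(?:Inc|Corp|Bank))\b")),
]

def recognize(text):
    """Return (label, name) pairs found through context cues alone."""
    found = []
    for label, pattern in CONTEXT_RULES:
        for match in pattern.finditer(text):
            found.append((label, match.group(1).strip()))
    return found

# Invented sample text; none of these names appear in KNOWN_NAMES.
text = "Dr. Alma Reyes joined Northwind Robotics Corp after leaving Harbor Bank."
found = recognize(text)
list_hits = [name for name in KNOWN_NAMES if name in text]
```

The list-based lookup finds nothing in this sentence, while the context rules pick up one person and two organizations that were never listed anywhere.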

Going beyond this level, Entity Extraction recognizes the relationships and events in which the persons, organizations, etc., participate. A relationship is a semantic link between two entities in unstructured text. For example, a person entity would typically have relationships such as the following:

  • Age
  • Place of birth
  • Nationality
  • Spouse
  • Associate
  • Affiliation
  • etc.

An organization might have the following:

  • Founder
  • Headquarters
  • Affiliated person (employee, consultant)
  • Subsidiary
  • etc.

All of these are output in a structured, predictable format.
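
One plausible shape for such output is a flat list of relationship records that can be grouped per entity. The sketch below assumes a hypothetical Relationship record and invented sample values; the relation labels mirror the lists above but are not taken from any specific ontology.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relationship:
    """One extracted relationship (hypothetical record layout)."""
    subject: str
    subject_type: str
    relation: str
    value: str

# Invented sample data for illustration only.
relationships = [
    Relationship("Alma Reyes", "PERSON", "AFFILIATION", "Northwind Robotics Corp"),
    Relationship("Alma Reyes", "PERSON", "NATIONALITY", "Chilean"),
    Relationship("Northwind Robotics Corp", "ORGANIZATION", "HEADQUARTERS", "Oslo"),
    Relationship("Northwind Robotics Corp", "ORGANIZATION", "SUBSIDIARY", "Northwind Labs"),
]

def profile(entity, rels):
    """Collect every relationship for one entity into a single record."""
    record = {"entity": entity}
    for rel in rels:
        if rel.subject == entity:
            record.setdefault(rel.relation, []).append(rel.value)
    return record

person_profile = profile("Alma Reyes", relationships)
```

Because every record has the same fields, building a per-entity profile is a simple grouping step rather than a text-understanding problem.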

An event is more complex than a relationship. It typically involves a verbal element, e.g., “buy,” “sell,” that anchors the event, along with one or more participants whose precise roles in the event are defined.

The relationships and events are specified by a rich ontology that defines a large number of each and covers many different domains.

Complex knowledge extracted in this manner can then be used by analytical tools such as link analysis, semantic search, geospatial analysis, etc. It can even be combined with other, originally structured data.
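
As a small example of what link analysis over extracted knowledge might look like, the following sketch builds an undirected graph from relationship triples and finds the shortest chain connecting two entities. The triples, entity names, and function names are placeholders invented for this example.

```python
from collections import defaultdict, deque

# Placeholder triples of the form (subject, relation, object).
triples = [
    ("Person A", "AFFILIATION", "Organization X"),
    ("Person B", "AFFILIATION", "Organization X"),
    ("Person B", "ASSOCIATE", "Person C"),
]

def build_graph(items):
    """Build an undirected adjacency map from relationship triples."""
    graph = defaultdict(set)
    for subject, _relation, obj in items:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph

def shortest_link(graph, start, goal):
    """Breadth-first search for the shortest chain between two entities."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

graph = build_graph(triples)
chain = shortest_link(graph, "Person A", "Person C")
```

Here Person A and Person C never appear in the same triple, yet the graph connects them through a shared organization and a common associate.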

Entity Extraction is also highly scalable, meaning that it can handle very large quantities of text.

Using Entity Extraction in the Real World

Entity Extraction is a powerful technology that supports many different kinds of organizations. Here are some popular applications:

  • Entity Extraction can identify PII in a collection of documents for data privacy and redaction purposes (e.g., legal and health industries).
  • Entity Extraction can help detect and monitor geopolitical events, such as conflicts, that may impact assets, supply chains, national security, etc., for the purposes of risk assessment, preparedness, and response in both the private and public sectors.
  • An intelligence analyst can use the extraction of relationships to identify previously unknown members of a terrorist or criminal group who are mentioned in an unstructured document. The analyst need not have read the document to gain the knowledge it contains.
  • Event Extraction can identify adverse events reported in media sources (e.g., bribery, lawsuits) that involve companies or their executives. This helps companies better understand the risks of engaging with a particular company or individual (KYC, PEP, AML scenarios).
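
To illustrate the first application above, here is a minimal sketch of redaction driven by extracted entities. It assumes the extractor has already returned character spans with labels; the helper names span and redact, the labels, and the sample text are all hypothetical.

```python
def span(text, phrase, label):
    """Locate a phrase and return a (start, end, label) tuple.

    Stands in for the offsets an extraction engine would report.
    """
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def redact(text, spans):
    """Replace each (start, end, label) span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out

# Invented sample text and entity spans.
text = "Contact Alma Reyes at alma@example.com."
spans = [
    span(text, "Alma Reyes", "PERSON"),
    span(text, "alma@example.com", "EMAIL"),
]
redacted = redact(text, spans)
```

Replacing each span with its label, rather than deleting it, keeps the redacted document readable while removing the identifying details.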

See these links for further information about Entity Extraction, Relationship Extraction, and Event Extraction.