Entity Extraction: What Is It?

May 08, 2019 | Entity Extraction, Geotagging

It is commonly said that about 80% of all data is unstructured: it is not organized in a predefined manner, is typically text-heavy, and is difficult to use in the many applications that rely on semantically labelled, database-ready data, known as structured data. It is, however, possible to turn unstructured data into structured data. That is precisely what Entity Extraction is all about.

What does Entity Extraction do?

Entity Extraction identifies key concepts in unstructured text, such as named entities, then disambiguates them and classifies them into semantic categories, thereby creating structured data. It typically handles semantic concepts such as people, organizations, places, addresses, and time expressions, as well as numeric expressions such as phone numbers, passport numbers, and many others.
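To make the idea of "text in, structured records out" concrete, here is a minimal sketch that pulls two kinds of pattern-like entities (email addresses and phone numbers) out of free text with regular expressions. The Entity record, the label names, and the patterns themselves are illustrative assumptions, not the way any particular product works; real systems use far richer models, especially for people, organizations, and places.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One structured record produced from unstructured text (hypothetical schema)."""
    text: str    # the surface string found in the document
    label: str   # the semantic category assigned to it
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention


# Deliberately simple, illustrative patterns; they will miss many real-world formats.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def extract_entities(text):
    """Return every pattern match as an Entity, ordered by position in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.group(), label, match.start(), match.end()))
    return sorted(entities, key=lambda e: e.start)


sample = "Contact Jane at jane.doe@example.com or +1 555-123-4567."
for entity in extract_entities(sample):
    print(entity.label, entity.text)
```

Each result carries a category and character offsets, which is what makes the output ready to load into a database or an analysis tool.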

Why is Entity Extraction Useful?

Entity Extraction allows users to analyze Big Text Data in a timely and effective manner, saving them from having to wade through enormous amounts of text manually. Suppose you received a large collection of data for analysis, such as web pages, social media posts, email, internal memos, and reports, and you have no idea what is in it or what to search for. Entity Extraction can immediately reveal what people, companies, countries, cities, mailing addresses, email addresses, phone numbers, artifacts, etc. are mentioned in this collection. You can then export this structured entity output to your favorite data visualization and analysis applications (e.g., Elastic Stack, Tableau) for trend analysis, geospatial analysis, link analysis, and more, to discover what is hidden in this vast data.

In contrast, conventional keyword search assumes that users know what they are looking for, and it does not easily allow them to discover unknowns. For instance, it is easy to search for documents about a specific company name, but much harder to get the names of all companies involved in, say, recent mergers and acquisitions, or the names of the executives associated with those companies.
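The difference from keyword search can be sketched in a few lines. Assuming an extraction step has already produced records like the hypothetical ones below (the document ids, names, and field names are made up for illustration), you can ask "which organizations appear at all?" without knowing a single name in advance, and serialize the same records for a downstream tool.

```python
import json
from collections import Counter

# Hypothetical output of an entity extraction run over three documents.
records = [
    {"doc": "memo-1", "text": "Acme Corp", "label": "ORGANIZATION"},
    {"doc": "memo-1", "text": "Jane Doe", "label": "PERSON"},
    {"doc": "post-7", "text": "Acme Corp", "label": "ORGANIZATION"},
    {"doc": "post-7", "text": "Globex", "label": "ORGANIZATION"},
    {"doc": "mail-3", "text": "Toronto", "label": "LOCATION"},
]


def entities_of_type(records, label):
    """Count how often each entity of the given category is mentioned."""
    return Counter(r["text"] for r in records if r["label"] == label)


def to_json_lines(records):
    """Serialize records one JSON object per line, a common interchange format."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Discover organizations without supplying any company name as a query.
print(entities_of_type(records, "ORGANIZATION").most_common())
print(to_json_lines(records))
```

A keyword search would need "Acme Corp" or "Globex" as input; here those names are the output.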

Why is Entity Extraction Hard?

Much of the challenge of automating text analysis comes from the ambiguity and richness of language: the same word can have different meanings, and the same meaning can be expressed in multiple ways.

For example, “Jordan” could be a country name (Kingdom of Jordan), a person name (Michael Jordan), or a brand name (Air Jordan). In addition, not all texts are written in grammatical sentences with correct punctuation or capitalization, especially in informal media such as social media and texting. In such cases, “bill” could be ambiguous among a person name (Bill), a noun (electricity bill), and a verb (bill a customer).
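One simple way to picture disambiguation is to score each candidate reading by the context words that surround the mention. The sketch below does this for "Jordan" with small hand-written cue lists; the cue words, labels, and scoring rule are all assumptions chosen for illustration, and production systems rely on much richer linguistic and statistical evidence.

```python
# Hypothetical context cues for each possible reading of "Jordan".
CONTEXT_CUES = {
    "LOCATION": {"kingdom", "king", "amman", "border", "country"},
    "PERSON": {"michael", "scored", "played", "said"},
    "BRAND": {"air", "sneakers", "shoes", "pair"},
}


def classify_mention(sentence):
    """Pick the reading whose cue words overlap most with the sentence."""
    # Lowercasing also helps with text that lacks reliable capitalization.
    tokens = {tok.strip(".,;:!?\"'()").lower() for tok in sentence.split()}
    scores = {label: len(cues & tokens) for label, cues in CONTEXT_CUES.items()}
    best_label, best_score = max(scores.items(), key=lambda item: item[1])
    # With no supporting context at all, refuse to guess.
    return best_label if best_score > 0 else "UNKNOWN"


print(classify_mention("Michael Jordan scored 40 points last night."))
print(classify_mention("The king of Jordan met diplomats in Amman."))
print(classify_mention("She bought a pair of Air Jordan sneakers."))
```

The same surface string receives a different category in each sentence, which is exactly the decision an extraction system has to make for every ambiguous mention.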

At the same time, the richness of language makes relationship and event extraction (which are discussed in the next section) especially challenging. For instance, an event in which someone leaves an office can be expressed in many different ways, including resign, quit, step down, and so on. Similarly, a spousal relationship can be expressed through different grammatical constructs, such as a verb (“John and Mary are married”) or a noun (“the marriage of John and Mary”, “John is Mary’s husband”, “Mary is John’s wife”).
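A first step toward handling this variety is to map the many surface expressions onto one canonical label. The sketch below uses a small trigger lexicon for the two examples above; the label names (LeaveOffice, SpousalRelation) and the patterns are hypothetical, and spotting a trigger word is only a starting point, since a real system must also confirm the meaning in context and find the participants.

```python
import re

# Hypothetical trigger lexicon: canonical label -> surface patterns that may signal it.
TRIGGERS = {
    "LeaveOffice": [
        r"resign(?:s|ed|ing)?",
        r"quit(?:s|ting)?",
        r"step(?:s|ped|ping)? down",
    ],
    "SpousalRelation": [r"married", r"marriage", r"husband", r"wife"],
}


def find_triggers(text):
    """Return (canonical label, matched phrase) pairs found in the text."""
    hits = []
    for label, patterns in TRIGGERS.items():
        for pattern in patterns:
            # Word boundaries keep "quit" from matching inside "quite".
            for match in re.finditer(rf"\b{pattern}\b", text, flags=re.IGNORECASE):
                hits.append((label, match.group(0)))
    return hits


print(find_triggers("The CEO stepped down on Monday."))
print(find_triggers("John and Mary are married."))
```

Different wordings collapse to the same label, so downstream analysis can treat "resigned" and "stepped down" as the same kind of event.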

Beyond Traditional Entity Extraction

Basic Entity Extraction is limited to named entities.  By contrast, Advanced Entity Extraction builds on these entities and provides greater capabilities:

  • Relationship Extraction. Extracting relationships between entities and specifying the nature of the link: For instance, in “John Smith works for XYZ Corporation,” there’s a relationship between John Smith and XYZ Corporation and its nature is employment. The semantic representation output would contain the link, its label, and both participants as well as their respective roles (Employee, Employer).
  • Event Extraction. Extracting events along with the participants (who and whom) as well as when and where: For instance, in “John Smith was arrested in Toronto on 13 May,” there is an Arrest event where the Person Arrested is John Smith and it happened in the Location Toronto with a Date of 13 May.
  • Geotagging. Identifying place names mentioned in text and assigning latitude/longitude values to them. In the case of ambiguous location names (e.g., is it London in England or London in Canada?), Geotagging analyzes the textual context and determines which location it actually is. Furthermore, Advanced Entity Extraction makes it possible to geolocate traditionally non-geocodable entities such as people, facilities, and events (e.g., a place that a person visited or where a certain event occurred).
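The three capabilities above can be pictured as follows. This is a rough sketch under stated assumptions: the Relationship and Event classes, the role names, and the two-entry gazetteer with its cue words are hypothetical, and the coordinates are approximate. It shows what the structured output for the two example sentences might look like, and how context can decide between the two Londons.

```python
from dataclasses import dataclass


@dataclass
class Relationship:
    label: str   # nature of the link, e.g. "Employment"
    roles: dict  # role name -> participant


@dataclass
class Event:
    event_type: str  # e.g. "Arrest"
    roles: dict      # role name -> participant, place, or time


# "John Smith works for XYZ Corporation"
employment = Relationship(
    "Employment", {"Employee": "John Smith", "Employer": "XYZ Corporation"}
)

# "John Smith was arrested in Toronto on 13 May"
arrest = Event(
    "Arrest",
    {"Person Arrested": "John Smith", "Location": "Toronto", "Date": "13 May"},
)

# Hypothetical mini-gazetteer; real gazetteers hold millions of entries.
GAZETTEER = {
    "London": [
        {"country": "United Kingdom", "lat": 51.51, "lon": -0.13,
         "cues": {"england", "uk", "britain", "thames"}},
        {"country": "Canada", "lat": 42.98, "lon": -81.25,
         "cues": {"canada", "ontario", "toronto"}},
    ],
}


def geotag(place, context):
    """Choose the gazetteer entry whose cue words best match the context."""
    candidates = GAZETTEER.get(place)
    if not candidates:
        return None
    tokens = {tok.strip(".,;:!?").lower() for tok in context.split()}
    # On a tie (including no cues at all), the first listed entry wins.
    return max(candidates, key=lambda c: len(c["cues"] & tokens))


match = geotag("London", "He drove from Toronto to London, Ontario.")
print(match["country"], match["lat"], match["lon"])
```

Falling back to the first entry on a tie is a design choice made here for brevity; a real geotagger would more likely weigh evidence such as population or other places mentioned in the same document.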

Applications of Entity Extraction

Entity Extraction brings benefits to many business activities that require analysis of unstructured data, including business intelligence, intelligence analysis, media monitoring, eDiscovery, regulatory compliance, and many others. It goes beyond search and enables timely knowledge discovery for Big Text Data Analysis.