What is Entity Extraction?
What Does Entity Extraction Do?
It is commonly said that about 80% of all data is unstructured: it is not organized in a predefined manner, is typically text-heavy, and is difficult to use for the many software applications that rely on semantically labelled, database-ready data, known as structured data. It is, however, possible to turn unstructured data into structured data. That is precisely what Entity Extraction is all about.
Entity Extraction, also known as named entity recognition (NER), identifies key concepts, such as named entities, in unstructured text, then disambiguates them and classifies them into semantic categories, thereby creating structured data. Named entity recognition typically handles semantic concepts such as people, organizations, places, addresses, and time expressions, as well as numeric expressions such as phone numbers, passport numbers, and many others.
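As a toy illustration of turning unstructured text into structured records, the sketch below pulls two pattern-like expression types (email addresses and phone numbers) out of a sentence with regular expressions. The patterns, the `Entity` record, and the `extract_entities` function are hypothetical names chosen for this example. Recognizing people, organizations, and places takes far more than regular expressions, so treat this only as a picture of what "structured output" means.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    text: str    # the matched span as it appears in the input
    label: str   # semantic category assigned to the span
    start: int   # character offset where the span begins
    end: int     # character offset just past the span


# Deliberately simple, illustrative patterns (assumptions, not production rules).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}"),
}


def extract_entities(text: str) -> List[Entity]:
    """Return all pattern matches as structured records, ordered by position."""
    found = [
        Entity(m.group(), label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda e: e.start)


if __name__ == "__main__":
    sample = "Reach Jane at jane.doe@example.com or 555-123-4567."
    for entity in extract_entities(sample):
        print(entity)
```

Each record carries the text, its category, and its character offsets, which is the kind of database-ready shape that downstream tools can consume.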
Why Is Entity Extraction Useful?
Entity Extraction allows users to analyze unstructured text data in a timely and effective manner, saving them from having to wade through enormous amounts of text manually. Suppose for instance that you received a large collection of data, such as web pages, social media posts, email, internal memos, or reports for analysis, and you have no idea what is in it or what exactly to search for.
Entity Extraction can immediately enable you to:
- Discover names of people, companies, organizations, countries, and cities as well as mailing addresses, email addresses, phone numbers, etc. that are mentioned in this collection.
- Exploit this structured entity output with your favorite data visualization and analysis applications (e.g., Elastic Stack, Tableau, Esri ArcGIS) for trend analysis, geospatial analysis, and link analysis, among others, to understand what is buried in this vast data.
In contrast, conventional keyword search assumes that users know what they are looking for, and it does not easily allow them to discover unknowns. For instance, given a large collection of documents, it is easy to search for documents about a specific company name. It is not easy, however, to get the names of all companies involved in, say, a recent litigation case, or the names of executives associated with it, during an eDiscovery process.
Why Is Entity Extraction Hard?
Much of the challenge of automating text analysis comes from the ambiguity and richness of human language, often called Natural Language. Unlike programming languages, which are designed to be unambiguous and rigid, Natural Languages have evolved organically to be flexible enough to express a wide range of ideas, emotions, and nuances, and they often rely on context for interpretation. This flexibility and reliance on context make Entity Extraction, and in particular relationship and event extraction (discussed in the next section), especially challenging.
Here are some examples of the challenge of language ambiguity:
- Entity Type Ambiguity: The same name may refer to entities of different semantic types depending on context: “Jordan” could be a country name (Kingdom of Jordan), a person name (Michael Jordan), a body of water (Jordan River), or a brand name (Air Jordan).
- Part of Speech Ambiguity: The same word may have multiple grammatical functions depending on context. For instance, “bill” can be a proper noun (Bill), a common noun (electricity bill), or a verb (bill a customer).
- Semantic Ambiguity: The same word may have multiple meanings depending on context. For instance, the noun “operator” could indicate a person (e.g., crane operator), an organization (e.g., tour operator), or a function in programming (e.g., arithmetic operator), and the verb “fire” can mean to dismiss an employee, to discharge a weapon, to inspire enthusiasm, etc.
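To make entity type ambiguity concrete, here is a minimal sketch that guesses which kind of “Jordan” a sentence mentions by counting overlapping context words. The cue lists, the labels, and the `classify_mention` function are all assumptions made up for this illustration. Real systems model context far more richly than keyword overlap.

```python
# Hypothetical context cues for each reading of the ambiguous name "Jordan".
CONTEXT_CUES = {
    "COUNTRY": {"kingdom", "amman", "border", "country"},
    "PERSON": {"michael", "scored", "played", "said"},
    "WATER_BODY": {"river", "banks", "flows"},
    "BRAND": {"air", "sneakers", "shoes"},
}


def classify_mention(sentence: str) -> str:
    """Pick the entity type whose cue words overlap most with the sentence.

    Ties go to the label listed first; no overlap at all yields "UNKNOWN".
    """
    # Lowercase each whitespace-separated word and trim surrounding punctuation.
    tokens = {word.strip(".,;:!?\"'").lower() for word in sentence.split()}
    scores = {label: len(cues & tokens) for label, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "UNKNOWN"


if __name__ == "__main__":
    for text in (
        "Michael Jordan scored 40 points.",
        "The Jordan River flows south.",
        "She bought new Air Jordan sneakers.",
        "Jordan arrived yesterday.",
    ):
        print(text, "->", classify_mention(text))
```

The last sentence has no cue words at all, so the sketch returns "UNKNOWN". That case is exactly where surface keywords run out and deeper analysis of context becomes necessary.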
At the same time, richness of language allows us to express the same concept in a variety of ways, which is a particular challenge for Advanced Entity Extraction such as relationship and event extraction:
- Multiple Terms: For example, the same event about someone leaving office can be expressed with different verbs, including “resign,” “quit,” “step down,” and so on.
- Different Syntactic Structures: For instance, the same relationship or event can be expressed with different syntactic structures such as:
  - A verb phrase vs. a noun phrase:
    - “John and Mary are married,” “John is Mary’s husband,” “Mary is John’s wife”
    - “the marriage of John and Mary”
  - Active voice vs. passive voice:
    - “Robert Jones runs a hedge fund”
    - “The hedge fund is run by Robert Jones.”
  - A main clause vs. a relative clause:
    - “Robert Jones runs a hedge fund”
    - “Robert Jones, who runs a hedge fund”
- Names vs. Common Noun Phrases vs. Pronouns: The same entity can be referred to using names, common noun phrases, and pronouns. For instance, there are three different references to the same person entity in the example below. Entity Extraction needs to understand that “he” and “the highly successful CEO of ABC Corporation” in the second sentence refer to “Neil Smith” in the previous sentence so that it can identify the employment relationship between Neil Smith and ABC Corporation.
  - “Neil Smith has received a substantial bonus this year. He is the highly successful CEO of ABC Corporation.”
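The active and passive voice examples above can be used to show why differing syntactic structures are a problem for pattern-based approaches. The sketch below uses two hand-written regular expressions that map both sentences to the same relationship record. The `Manages` label, the pattern names, and `extract_manages` are hypothetical and exist only for this illustration.

```python
import re
from typing import Dict, Optional

# A capitalized name such as "Robert Jones" (a simplifying assumption).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# One pattern per syntactic structure; both yield the same named groups.
ACTIVE = re.compile(rf"(?P<agent>{NAME}) runs (?:a|the) (?P<org>[a-z ]+)")
PASSIVE = re.compile(rf"[Tt]he (?P<org>[a-z ]+?) is run by (?P<agent>{NAME})")


def extract_manages(sentence: str) -> Optional[Dict[str, str]]:
    """Return a normalized relationship record, or None if no pattern matches."""
    for pattern in (ACTIVE, PASSIVE):
        match = pattern.search(sentence)
        if match:
            return {
                "relation": "Manages",
                "agent": match.group("agent"),
                "org": match.group("org").strip(),
            }
    return None


if __name__ == "__main__":
    print(extract_manages("Robert Jones runs a hedge fund"))
    print(extract_manages("The hedge fund is run by Robert Jones."))
    # The relative-clause variant is not covered by either pattern.
    print(extract_manages("Robert Jones, who runs a hedge fund"))
```

Both the active and the passive sentence produce the same record, but the relative clause falls through and returns `None`. Every new structure needs another pattern, which is why hand-written rules of this kind scale poorly.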
Additionally, not all texts are written in grammatical, well-edited sentences with proper punctuation and capitalization, as news articles are. In well-edited text, capitalization can help determine the correct part of speech, and punctuation can indicate sentence boundaries. Texts that lack these cues pose further Entity Extraction challenges, for instance:
- Informal Language: Texts written informally, which are common in social media and texting, often lack capitalization and punctuation and contain typos and misspellings. They may also not be written in full sentences and are not always grammatical. In addition, they may include special abbreviations and terms, which need to be recognized to understand the context.
- OCR and ASR Output: Because of OCR (Optical Character Recognition) and ASR (Automatic Speech Recognition) errors, these types of texts often include unusual misspellings as well as capitalization and punctuation mistakes.
Beyond Traditional Entity Extraction
Basic Entity Extraction is limited to named entities. By contrast, Advanced Entity Extraction builds on these entities and provides greater capabilities:
- Relationship Extraction. Extracting relationships between entities and specifying the semantic nature of the link. For instance, in “John Smith works for XYZ Corporation,” there’s a relationship between John Smith and XYZ Corporation and its nature is employment. The semantic representation output would contain the link, its label, and both participants as well as their respective roles (Employee, Employer).
- Event Extraction. Extracting events along with their participants (who and whom) as well as when and where they occurred. For instance, in “John Smith was arrested in Toronto on May 13th,” there is an Arrest event where the Person Arrested is John Smith, the Location is Toronto, and the Date is May 13th.
- Geotagging. Identifying place names mentioned in text and assigning latitude/longitude values to them. In the case of ambiguous location names (e.g., is it London in England or London in Canada?), Geotagging analyzes the textual context and determines which location is meant. Furthermore, Advanced Entity Extraction makes it possible to geolocate traditionally non-geocodable entities such as people, facilities, and events (e.g., a place that a person visited or where a certain event occurred).
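One way to picture the semantic representation described for relationships and events is as a labeled record with a list of role-filling participants. The sketch below encodes the two example sentences that way. The class names, field names, and JSON layout are hypothetical and are not NetOwl's actual output format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Argument:
    role: str   # e.g. "Employee", "Location"
    text: str   # the entity mention filling that role


@dataclass
class Extraction:
    kind: str                  # "relationship" or "event"
    label: str                 # semantic label of the link or event
    arguments: List[Argument]  # participants and their roles

    def by_role(self) -> Dict[str, str]:
        """Map each role name to the text of the entity that fills it."""
        return {arg.role: arg.text for arg in self.arguments}


# "John Smith works for XYZ Corporation"
employment = Extraction(
    kind="relationship",
    label="Employment",
    arguments=[
        Argument("Employee", "John Smith"),
        Argument("Employer", "XYZ Corporation"),
    ],
)

# "John Smith was arrested in Toronto on May 13th"
arrest = Extraction(
    kind="event",
    label="Arrest",
    arguments=[
        Argument("Person Arrested", "John Smith"),
        Argument("Location", "Toronto"),
        Argument("Date", "May 13th"),
    ],
)

if __name__ == "__main__":
    # Serialize both records so another tool could load them.
    print(json.dumps([asdict(employment), asdict(arrest)], indent=2))
```

Keeping roles explicit, instead of storing only a pair of linked names, is what lets later analysis ask questions such as "who was the employer?" or "where did the arrest happen?".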
Applications of Entity Extraction
Entity Extraction brings benefits to many business activities that require analysis of unstructured data. Here are some of them:
- Business intelligence: Keep abreast of current trends in an industry and monitor competitors, e.g., M&A activities, spin-offs, planned product launches, management succession, etc.
- National security intelligence analysis: Track activities of terrorist groups and other malign actors.
- Legal research: Analyze court documents by extracting critical information such as judges, attorneys, cases, and involved parties.
- Social media monitoring: Track mentions of people, locations, and organizations across social media to identify trends and emerging stories, monitor brands and events, etc.
- Semantic faceted search: Enhance search by providing extracted entities as semantic facets so that users can accurately find all documents with a company named “Apple” (not documents about a fruit) or a country named “Jordan” (not documents about a person).
- Link analysis: Find linkages among entities expressed in texts automatically by extracting relationships, events, and their participating entities.
- Adverse media monitoring: Monitor news media and other sources to obtain in real time any adverse information about customers, partners, suppliers, etc. that could affect your business.
- eDiscovery: Identify relevant entities, relationships, and events in large volumes of electronically stored data.
In sum, Entity Extraction is an advanced AI technology that goes beyond conventional search and enables timely knowledge discovery. It enhances the efficiency of knowledge work in any field that needs to analyze large quantities of natural language text.
