Entity Extraction for Faceted Search of Unstructured Data

December 14, 2017 | Enterprise Search, Entity Extraction


Faceted search is a search technique that allows users to set filters to explore and find information. Faceted search is intuitive and has become the prevailing search method for e-commerce websites. For faceted search to be possible, the information must first be classified along potentially multiple dimensions, called facets. For instance, typical facets for e-commerce correspond to product attributes like color, size, and price. Products can then be accessed in multiple ways by navigating and combining facets (multi-faceted queries).
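To make the idea of combining facets concrete, here is a minimal sketch in Python. The product catalog, its field names, and the two helper functions are hypothetical and chosen only for illustration; a production search engine would compute the same things with indexes rather than list scans.

```python
from collections import Counter

# Hypothetical product catalog; field names and values are illustrative only.
PRODUCTS = [
    {"name": "T-shirt", "color": "red", "size": "M", "price": 15},
    {"name": "Hoodie", "color": "red", "size": "L", "price": 40},
    {"name": "Jeans", "color": "blue", "size": "M", "price": 55},
    {"name": "Cap", "color": "blue", "size": "M", "price": 12},
]


def filter_by_facets(items, **selected):
    """Keep only the items that match every selected facet value."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count how many items fall under each value of one facet."""
    return Counter(item[facet] for item in items)


# A multi-faceted query: blue products in size M.
blue_medium = filter_by_facets(PRODUCTS, color="blue", size="M")
print([p["name"] for p in blue_medium])  # ['Jeans', 'Cap']

# Counts like these are what a search interface shows next to each filter.
print(facet_counts(PRODUCTS, "color"))
```

Each additional keyword argument narrows the result set further, which is the behavior users see when they tick a second filter on a shopping site.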

Because faceted search is intuitive and popular, it is also a preferred way of exploring unstructured information such as news, analyst reports, and legal documents. The challenge is that for unstructured data to be amenable to faceted search, it must first be turned into structured information in the form of semantic categories. This challenge is compounded by the sheer size of the data involved: unstructured data is estimated to make up 80% of the world’s digital data, and it keeps growing at an exponential rate.

How to Prepare Unstructured Information for Faceted Search

This is where advanced Entity Extraction technology comes into play.

Entity Extraction transforms unstructured data into structured, semantically labelled data. Once information is labelled with semantic categories, it can be used for faceted search as well as for advanced analytics techniques such as social network graphs, timelines, and complex map visualizations.

For instance, once named entities like person, organization, and place names have been identified in a document set, users can iteratively filter documents down to those that mention specific people and organizations. Identifying named entities requires resolving ambiguity (e.g., “London” in “Jack London” is part of a person name and not a reference to the city of London) as well as being able to recognize previously unseen names that may consist of unknown first and/or last names.
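The iterative filtering described above can be sketched as follows. This assumes that an entity extractor has already annotated each document with typed entities; the document records, the entity names, and the narrow function are hypothetical and do not represent any particular product's output format.

```python
# Hypothetical extractor output: each document carries a set of
# (entity type, entity name) pairs. The data is made up for illustration.
DOCS = [
    {"id": 1, "entities": {("person", "Jack London"), ("place", "San Francisco")}},
    {"id": 2, "entities": {("place", "London"), ("organization", "Acme Corp")}},
    {"id": 3, "entities": {("person", "Jack London"), ("organization", "Acme Corp")}},
]


def narrow(docs, entity_type, name):
    """Keep only the documents that mention the given typed entity."""
    return [doc for doc in docs if (entity_type, name) in doc["entities"]]


# First filter: documents that mention the person Jack London.
step1 = narrow(DOCS, "person", "Jack London")
print([doc["id"] for doc in step1])  # [1, 3]

# Second filter, applied to the previous result: also mention Acme Corp.
step2 = narrow(step1, "organization", "Acme Corp")
print([doc["id"] for doc in step2])  # [3]

# Because entities are typed, the city facet does not match the person name.
print([doc["id"] for doc in narrow(DOCS, "place", "London")])  # [2]
```

The last query shows why disambiguation matters for facets: filtering on the place "London" returns only document 2, even though the string "London" also appears in documents 1 and 3 as part of a person name.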

In addition, the more semantic concepts that Entity Extraction can identify accurately, the more useful it will be for faceted search. It is important that Entity Extraction software identify not only a large set of entity types but also links and events along with their arguments. For instance, if Entity Extraction identifies events and their arguments, it will be possible to use faceted search to narrow down a document set to those documents that mention a hiring or arrest event, or to those that mention specific people being hired or arrested.
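As a rough illustration of event-based facets, the sketch below filters on an event type and, optionally, on the value of an event argument. The event records, the role names (such as "employee" and "arrestee"), and the docs_with_event function are assumptions made for this example, not an actual extraction schema.

```python
# Hypothetical extracted events; types, roles, and names are illustrative only.
EVENTS = [
    {"doc_id": 1, "type": "hire",
     "args": {"employer": "Acme Corp", "employee": "Jane Doe"}},
    {"doc_id": 2, "type": "arrest",
     "args": {"agent": "City Police", "arrestee": "John Roe"}},
    {"doc_id": 3, "type": "hire",
     "args": {"employer": "Globex", "employee": "John Roe"}},
]


def docs_with_event(events, event_type, **args):
    """Return sorted ids of documents with a matching event.

    An event matches when its type equals event_type and every requested
    argument role has the requested value.
    """
    return sorted({
        event["doc_id"]
        for event in events
        if event["type"] == event_type
        and all(event["args"].get(role) == value for role, value in args.items())
    })


# Facet on the event type alone: any hiring event.
print(docs_with_event(EVENTS, "hire"))  # [1, 3]

# Facet on an event argument: documents where John Roe is the one arrested.
print(docs_with_event(EVENTS, "arrest", arrestee="John Roe"))  # [2]
```

Keeping the argument roles explicit is what lets a user distinguish documents in which a person was hired from documents in which the same person was arrested.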

Finally, for extracted information to be properly aggregated for faceted search, it must first be normalized to standardize variations like capitalization, acronyms, and abbreviations.
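The effect of normalization on aggregation can be shown with a small sketch. The alias table and the normalize function below are hypothetical and deliberately crude; real normalization relies on far richer resources than a lookup table and title casing.

```python
from collections import Counter

# Hypothetical alias table mapping lowercase variants to one canonical form.
ALIASES = {
    "ibm": "International Business Machines",
    "i.b.m.": "International Business Machines",
}


def normalize(name):
    """Map a raw mention to a canonical form.

    Whitespace is collapsed and case is ignored. Known acronyms are expanded
    through the alias table; anything else falls back to title case.
    """
    key = " ".join(name.lower().split())
    return ALIASES.get(key, key.title())


mentions = [
    "IBM", "I.B.M.", "ibm", "International Business Machines",
    "Acme Corp", "ACME  CORP",
]

# Without normalization, every spelling variant becomes its own facet value.
print(len(Counter(mentions)))  # 6

# With normalization, the variants collapse into two facet values.
counts = Counter(normalize(m) for m in mentions)
print(counts["International Business Machines"])  # 4
print(counts["Acme Corp"])  # 2
```

Without this step, a user who selects the facet value "IBM" would miss the documents that spell the name differently.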

About NetOwl Extractor

NetOwl’s Entity Extraction product is capable of analyzing and extracting over 100 entity types and over 150 link and event types for a variety of domains including Business, Cyber Security, Finance, Homeland Security, Intelligence, Law Enforcement, Military, National Security, Politics, and Social Media. What’s more, NetOwl’s software performs normalization of extracted entities for proper aggregation and optimal faceted search.

In addition, when NetOwl’s Smart Geotagging is used alongside entity extraction, place names are both disambiguated and normalized.

Advanced Entity Extraction software helps make faceted search possible. For more information on Entity Extraction software for optimal faceted search, visit NetOwl’s product page today.