When 80% of the World's Data Is Unstructured, Entity Extraction is a Must

Entity Extraction

It is commonly said that unstructured data represents 80% of all data currently available in digital format, and consequently it has become the focus of a lot of attention. Traditional structured data as found in relational databases is still very important (and not about to go away), but unstructured data is fast catching up in terms of its importance to many industries.

Unstructured data comes in many forms: news feeds, blogs, audio streams, posts on social media, and many others. Recent eye-popping increases in the ability to produce and capture it have made it a very attractive target for organizations to exploit for better business insight.

As just one example, a consumer-facing firm can now find on social media an enormous, unfiltered sea of customer comments on the quality of its products and services, something it could not do even 10 years ago. Companies can analyze a very large set of extremely fine-grained consumer responses to every aspect of what they offer, with far more comprehensive coverage than was traditionally possible with formal customer surveys.

What’s been holding organizations back from full realization of the value of unstructured data has been the lack of tools and technologies that can extract business insights from it. Fortunately, this situation has changed. In particular, there is one such technology that can be of enormous value: Entity Extraction.

Entity Extraction: How It Works

Entity Extraction automatically analyzes unstructured data and transforms it into structured data. At the basic level, it recognizes the key entities in unstructured data, for example, names of people, organizations, locations, dates/times, and many numerical items. All of this extracted information is expressed in an unambiguous, structured output format such as JSON, RDF, TSV, or XML.
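To make the idea of turning text into structured output concrete, here is a minimal sketch in Python. It is only a toy: the three regular-expression patterns and the type labels DATE, EMAIL, and MONEY are assumptions chosen for illustration, whereas a real Entity Extraction product would rely on linguistic or statistical models and a much richer ontology.

```python
import json
import re

# Hypothetical toy patterns for illustration only; a real extractor
# would not be built from a handful of regular expressions.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
}


def extract_entities(text):
    """Return entity records sorted by their position in the text."""
    entities = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append({
                "type": entity_type,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(entities, key=lambda e: e["start"])


sample = "On 2015-03-02 Acme paid $1,200 to jane.doe@example.com."
# Emit the structured result as JSON, one of the formats named above.
print(json.dumps(extract_entities(sample), indent=2))
```

Each record carries the entity type, the matched text, and its character offsets, so a downstream application can consume the JSON without having to re-read the original text.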

Going beyond this basic entity level, Entity Extraction also recognizes the relationships between these key entities as well as the events in which they participate. These entities, relationships, and events are all defined by a rich ontology that enables Entity Extraction to remove any ambiguities in the original unstructured text and output semantically unambiguous structured data. Information extracted in this way can then be fed to a variety of analytical applications such as semantic search, link visualization and analysis, geospatial analysis, trend analysis, predictive analysis, and so on.
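As a rough illustration of how extracted events can feed link analysis, the Python sketch below groups already-extracted events by the pairs of participants they share. The Event record, its field names, and the sample data are hypothetical and do not reflect the output schema of any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Event:
    event_type: str      # hypothetical ontology label, e.g. "Meeting"
    participants: tuple  # names of the entities that took part
    date: str


def co_participants(events):
    """Map each pair of participants to the events they share."""
    links = defaultdict(list)
    for event in events:
        # Sort the names so each pair has one canonical ordering.
        for pair in combinations(sorted(set(event.participants)), 2):
            links[pair].append(event)
    return dict(links)


events = [
    Event("Meeting", ("Alice Smith", "Bob Jones"), "2015-03-02"),
    Event("Sale", ("Acme Corp", "Bob Jones"), "2015-03-05"),
]

for (a, b), shared in co_participants(events).items():
    print(a, "<->", b, [e.event_type for e in shared])
```

Because every event is a typed record rather than free text, finding who is connected to whom becomes a simple grouping operation instead of a reading task.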

For organizations that want the most insight from unstructured data, Entity Extraction is an ideal technology, which is why it is fast becoming one of the most powerful forms of text analytics in use today.

What to Look for in Entity Extraction Tools

What makes an extraction product good enough to meet the challenges posed by unprecedented quantities of unstructured data? Here are some important qualities to look for:

  • Broad Out-of-Box Entity Ontology. To be useful, an Entity Extraction product should offer a broad ontology of entities that it extracts out-of-the-box. Some entity extraction tools extract only a handful of entity types, such as people, organizations, and places. In reality, users find that they would like a more refined entity ontology that distinguishes, for example, companies from governmental organizations, or countries from cities. Users are also interested in additional entity types, such as addresses, products, phone numbers, email addresses, and so on.
  • Beyond Entities. Extracting basic entities such as people, organizations, and places is useful. But to enable more advanced analytics, relationship and event extraction are necessary. Relationship extraction identifies associations between entities, such as an employment relationship between a person and an organization. Event extraction recognizes dynamic events, e.g., a sales event where the object being sold, the buyer, the seller, and the date of the transaction are all identified. Both relationship and event extraction make possible automated, sophisticated link analysis as well as discovery of crucial associations from very large amounts of unstructured data. They can reveal, for example, that two people not previously known to be associated with each other attended the same meeting, or uncover hidden relationships between persons and organizations. Given a rich ontology of types of relationships and events, relationship and event extraction make possible extensive, detailed knowledge discovery from an ocean of unstructured data.
  • High Accuracy. In addition, the extraction needs to be highly accurate (i.e., it extracts with a minimum of false positives and false negatives) and robust enough to handle a wide range of different styles, formal or casual. A misspelling of a word or an unconventional grammatical expression should not gum up the works. An extractor has to work well both on well-edited material such as news articles and on what is found on social media, where pretty much anything goes grammatically.
  • Multilingual. An extractor also needs to be multilingual. With the rise of the global market, the number of firms operating internationally has skyrocketed. As a consequence, more organizations are interested in foreign languages. From a technical standpoint, an important feature of multilingual extraction is that the output formats are the same regardless of language, which makes it easy to integrate multilingual extraction into a wider solution.
  • Scalable. An Entity Extraction product needs to be highly scalable to be able to process the terabytes, petabytes, and even zettabytes of unstructured data that will be available soon. An extractor needs to be highly parallelizable to function in the new cloud processing environments that have evolved and that will continue to become capable of handling ever larger volumes.
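The accuracy criterion above, stated in terms of false positives and false negatives, is usually quantified as precision and recall. The Python sketch below shows one way to compute both against a hand-labelled reference set. The function name, the (type, text) tuple format, and the sample entities are assumptions made for illustration.

```python
def precision_recall(predicted, gold):
    """Compare predicted entities with a hand-labelled gold set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)   # correctly extracted
    false_pos = len(predicted - gold)  # extracted but wrong
    false_neg = len(gold - predicted)  # missed by the extractor
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    return precision, recall


# Hypothetical data: the extractor mislabels the city "Boston" as an ORG.
gold = {("PERSON", "Alice Smith"), ("ORG", "Acme Corp"), ("CITY", "Boston")}
predicted = {("PERSON", "Alice Smith"), ("ORG", "Acme Corp"), ("ORG", "Boston")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that with a fine-grained ontology a mislabelled type counts twice: once as a false positive (the wrong ORG) and once as a false negative (the missed CITY), which is why finer distinctions raise the bar for accuracy.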

In sum, then, Entity Extraction is a powerful new technology that maximizes the value of unstructured data.