Document Categorization - NetOwl Extractor

NetOwl offers multiple approaches to document categorization, so that each type of text categorization task can use the technique best suited to it and achieve high accuracy.

Multiple Categorization Strategies

The categories that organizations use are highly diverse. Categories can be, for example, general news topics (e.g., politics, business, entertainment), domain-specific (e.g., technologies in the biotech industry), or location-based (e.g., regions, countries, cities). No single categorization approach can address this diversity effectively.

To address this wide range of needs, NetOwl offers a multi-strategy approach to classifying documents. NetOwl’s strategies include robust machine learning-based categorization, topic tagging-based categorization, and semantic entity and event-based categorization. NetOwl supports any combination of these techniques in a single API call, so that each customer challenge can be handled with the most suitable mix.

Machine Learning-based Categorization

NetOwl’s machine learning-based categorization uses a state-of-the-art robust learning algorithm to handle even noisy and incomplete data. Categorization models are automatically created from training data representing each category.

NetOwl’s algorithms are designed to effectively build accurate models with minimal training data. However, if additional training data is available, the underlying models can be easily retrained to become smarter over time. This machine learning categorization is most appropriate when training data is readily available.
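The train-then-retrain workflow described above can be illustrated with a minimal sketch. It is not NetOwl's algorithm or API: it swaps in a plain multinomial Naive Bayes classifier with add-one smoothing, and the class name `NaiveBayesCategorizer`, its methods, and the sample texts are all hypothetical. The point it shows is that a model is built from labeled examples per category, and that calling `train` again with more examples refines the same model.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesCategorizer:
    """Illustrative stand-in for a learned categorizer (not NetOwl's model)."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    def train(self, examples):
        """Add (text, category) pairs; may be called repeatedly to retrain."""
        for text, category in examples:
            self.doc_counts[category] += 1
            for word in tokenize(text):
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def categorize(self, text):
        """Return the most probable category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = [w for w in tokenize(text) if w in self.vocabulary]
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best, best_score = category, score
        return best


model = NaiveBayesCategorizer()
model.train([
    ("the senate passed the budget bill", "politics"),
    ("the studio released a new film", "entertainment"),
])
first_guess = model.categorize("senate vote on the bill")

# Later, when more labeled data arrives, the same model is updated in place.
model.train([("the actor won an award", "entertainment")])
second_guess = model.categorize("award for the actor")
```

A production system would use a far more robust learner and much more data; the sketch only shows the shape of the workflow.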

Topic Tagging-based Categorization

Topic tagging-based categorization uses a variety of concept tagging rules. Users who are familiar with the target domain can develop these rules in a straightforward way, often from existing sources of domain knowledge, and no training data is needed.

Concepts can be defined in several different ways. The simplest concepts are specific words and phrases whose presence alone indicates a specific category. Concepts can also be defined based on word prefixes or suffixes, the character length of words, and the case of individual words. All of these concept features can be combined in any Boolean combination. Topic tagging-based categorization is most suitable when the target categories can be defined using a combination of relatively unambiguous terms and phrases.
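As a rough illustration of how such concept features might compose, the following sketch models each feature (phrase, prefix, suffix, length, case) as a small predicate and combines them with Boolean operators. It does not reflect NetOwl's actual rule syntax: every function name, rule, and category here is a hypothetical chosen for the example.

```python
import re


def tokenize(text):
    """Split text into word tokens, preserving their original case."""
    return re.findall(r"[A-Za-z0-9']+", text)


# Document-level feature: a word or phrase appears (case-insensitive).
def phrase(words):
    target = words.lower().split()

    def check(tokens):
        lowered = [t.lower() for t in tokens]
        n = len(target)
        return any(lowered[i:i + n] == target for i in range(len(lowered) - n + 1))
    return check


# Token-level features: prefix, suffix, character length, and case.
def prefix(p):
    return lambda token: token.lower().startswith(p)


def suffix(s):
    return lambda token: token.lower().endswith(s)


def min_length(n):
    return lambda token: len(token) >= n


def all_caps(token):
    return token.isupper()


# Boolean combinators; they work on token-level and document-level predicates.
def all_of(*preds):
    return lambda x: all(p(x) for p in preds)


def any_of(*preds):
    return lambda x: any(p(x) for p in preds)


def not_(pred):
    return lambda x: not pred(x)


def any_token(token_pred):
    """Lift a token-level predicate to the whole document."""
    return lambda tokens: any(token_pred(t) for t in tokens)


# Hypothetical rules mapping categories to Boolean combinations of features.
RULES = {
    "biotech": any_of(
        phrase("gene therapy"),
        any_token(all_of(prefix("bio"), min_length(8))),
    ),
    "regulatory": all_of(
        any_token(all_of(all_caps, min_length(3))),  # e.g. an agency acronym
        phrase("approval"),
    ),
}


def categorize(text):
    """Return the sorted names of all categories whose rule matches."""
    tokens = tokenize(text)
    return sorted(name for name, rule in RULES.items() if rule(tokens))
```

With these rules, `categorize("The FDA granted approval for a new gene therapy.")` matches both categories, while a sentence with none of the listed terms matches neither. Rules like these work well only when the terms are unambiguous, which is the condition the paragraph above describes.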

Semantic Extraction-based Categorization

NetOwl Extractor’s entity and event extraction is highly applicable to text categorization. For example, users may want documents that mention military or political organizations in the world without listing specific names of organizations. NetOwl’s entity extraction offers such entity categories out-of-the-box. Or users may be interested in documents that describe various types of personnel change events, which NetOwl’s event extraction automatically identifies.

Extraction-based categorization is most appropriate when the target categories are about semantic entities or events that NetOwl’s ontology already covers. This includes the ability to leverage NetOwl’s geotagging to disambiguate place names and categorize documents, for example, about San Francisco, CA versus other cities of the same name in the world.
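To make the idea concrete, the sketch below assigns categories from the types of extracted entities and events rather than from specific names. The `Mention` record, its field names, the type labels, and the place string format are all assumptions made for illustration; they are not NetOwl's output schema or ontology.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    """Hypothetical shape of one extracted entity or event."""
    kind: str                             # "entity" or "event"
    type: str                             # assumed label, e.g. "organization.military"
    text: str                             # the surface string found in the document
    resolved_place: Optional[str] = None  # assumed geotagging result, if any


# Each category is defined by semantic type, not by a list of names.
CATEGORY_RULES = {
    "military-or-political-orgs": lambda m: m.kind == "entity"
    and m.type in {"organization.military", "organization.political"},
    "personnel-change": lambda m: m.kind == "event"
    and m.type.startswith("personnel."),
    # Relies on disambiguated places, so same-named cities elsewhere do not match.
    "san-francisco-ca": lambda m: m.resolved_place
    == "San Francisco, California, United States",
}


def categorize(mentions: List[Mention]) -> List[str]:
    """Return categories for which at least one mention satisfies the rule."""
    return sorted(
        name for name, rule in CATEGORY_RULES.items()
        if any(rule(m) for m in mentions)
    )


doc_mentions = [
    Mention("entity", "organization.military", "NATO"),
    Mention("event", "personnel.appointment", "was named chief executive"),
    Mention("entity", "place.city", "San Francisco",
            resolved_place="San Francisco, Córdoba, Argentina"),
]
doc_categories = categorize(doc_mentions)
```

In this example the document mentions a city named San Francisco, but because the assumed geotagging result points to Argentina, the `san-francisco-ca` category is not assigned.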

Key Features of Categorization

Accurate

The multi-strategy approach offers the best accuracy for each categorization task.

Fast & Scalable

Built for high throughput and scalability for real-time analysis.

Multilingual

Supports multiple languages, including English, Arabic, Chinese (traditional and simplified), French, German, Korean, Persian (Farsi and Dari), Russian, and Spanish.
