Document categorization is useful in a variety of applications, including enterprise search, topic meta tagging, sentiment analysis, customer relationship management (CRM), and compliance monitoring. NetOwl Categorization automatically assigns “categories” to each document with high accuracy so that documents can be organized semantically and users can filter and analyze them faster and more effectively. The categories can be trained or customized to fit an organization’s needs.
Multiple Categorization Techniques
The categories that organizations use are highly diverse. For example, categories can be general news topics (e.g., politics, business, science and technology, entertainment), domain-specific (e.g., technologies in the biotech industry), location-based (e.g., regions, countries, cities), social media-focused (e.g., sentiments and emotions), or entity- and event-centric. No single categorization approach can address this diversity effectively.
To address this wide range of needs, NetOwl offers a multi-strategy approach to classifying documents, so that the most appropriate technique can be applied to each type of categorization task to achieve high accuracy. NetOwl’s strategies include robust machine learning-based categorization, topic tagging-based categorization, and semantic entity and event-based categorization.
NetOwl can perform any combination of these categorization techniques in a single API call to provide the best categorization functionality for a specific customer challenge.
Categorization via Machine Learning
NetOwl’s machine learning-based categorization uses a state-of-the-art robust learning algorithm to handle even noisy and incomplete data. Categorization models are automatically created from training data representing each category. Through these examples, NetOwl is able to learn key features of each category so that as new documents are evaluated against these models, the proper collection of categories can be assigned. NetOwl’s algorithms are designed to effectively build accurate models with minimal training data, but if more training data is available over time, the underlying models can be easily retrained to provide even more accurate categorization. This machine learning categorization is most appropriate when training data is readily available.
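As a rough illustration of learning category models from labeled examples, the sketch below trains a small multinomial naive Bayes classifier in plain Python. Naive Bayes is a stand-in chosen for brevity; it is not NetOwl’s algorithm, and the function names and training sentences are hypothetical. For simplicity it returns only the single best category, whereas a production system may assign several.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Build per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training examples, for illustration only.
TRAINING_DATA = [
    ("the senate passed the budget bill", "politics"),
    ("voters elected a new president", "politics"),
    ("the startup raised venture funding", "business"),
    ("quarterly profits beat market forecasts", "business"),
]

model = train(TRAINING_DATA)
print(classify("the president signed the bill", model))
```

Retraining when more labeled data arrives amounts to calling `train` again on the larger set, which mirrors the retraining workflow described above.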
Categorization via Topic Tagging
Topic tagging-based categorization uses a variety of concept tagging rules. Users who are familiar with the target domain can develop these rules easily, often from existing sources of domain knowledge, and no training data is needed. Concepts can be defined in several ways. The simplest concepts are specific words and phrases whose presence by itself indicates a specific category. Concepts can also be defined based on prefixes or suffixes of words, the character length of words, and the case of individual words. All of these concept feature identifiers can be combined in any Boolean combination. Topic tagging-based categorization is most suitable when the target categories can be defined using a combination of relatively unambiguous terms and phrases.
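The kinds of concept features described above (phrases, prefixes, word length, case) and their Boolean combination can be sketched with small predicate functions. This is a minimal illustration under assumed names; the rule syntax, helper functions, and example categories are hypothetical and do not reflect NetOwl’s actual rule language.

```python
import re
import string


def words(text):
    """Split on whitespace and strip surrounding punctuation."""
    stripped = (w.strip(string.punctuation) for w in text.split())
    return [w for w in stripped if w]


def has_phrase(phrase):
    """Rule: the phrase occurs as whole words, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return lambda text: pattern.search(text) is not None


def any_word(predicate):
    """Rule: at least one word satisfies a per-word predicate."""
    return lambda text: any(predicate(w) for w in words(text))


def all_of(*rules):
    """Boolean AND over rules."""
    return lambda text: all(rule(text) for rule in rules)


def any_of(*rules):
    """Boolean OR over rules."""
    return lambda text: any(rule(text) for rule in rules)


def negate(rule):
    """Boolean NOT of a rule."""
    return lambda text: not rule(text)


# Hypothetical category definitions, for illustration only.
RULES = {
    # A key phrase, OR any word with the prefix "bio".
    "biotech": any_of(
        has_phrase("gene therapy"),
        any_word(lambda w: w.lower().startswith("bio") and len(w) > 3),
    ),
    # A short all-caps word (ticker-like: case + length) AND the word "shares".
    "stock-market": all_of(
        any_word(lambda w: w.isupper() and 2 <= len(w) <= 5),
        has_phrase("shares"),
    ),
}


def categorize(text):
    """Return the sorted names of all categories whose rule matches."""
    return sorted(name for name, rule in RULES.items() if rule(text))


print(categorize("Shares of ACME rose after its biosensor launch."))
```

Because each rule is just a function from text to a Boolean, arbitrary nesting of `all_of`, `any_of`, and `negate` comes for free, and no training data is involved.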
Categorization via Semantic Extraction
NetOwl Extractor’s entity and event extraction can be applied directly to text categorization. For example, users may want documents that mention military or political organizations anywhere in the world without having to list specific organization names. NetOwl’s entity extraction offers such entity categories out of the box. Users may also be interested in documents that describe various types of personnel change events, which NetOwl’s event extraction identifies automatically. Extraction-based categorization is most appropriate when the target categories are about semantic entities or events that NetOwl’s entity and event ontology already covers. This includes using NetOwl’s geotagging to disambiguate place names and categorize documents that are, for example, about San Francisco, CA rather than other cities of the same name.
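To make the idea concrete, the sketch below maps already-extracted entity and event mentions to categories by their semantic type, and uses a resolved place name to separate same-named cities. The `Mention` structure, the type labels, and the resolved-place strings are all hypothetical placeholders; they are not NetOwl’s ontology or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One extracted item (hypothetical shape of an extractor's output)."""
    kind: str           # "entity" or "event"
    type: str           # hypothetical ontology label
    text: str           # surface text found in the document
    resolved: str = ""  # hypothetical geotagging result for places


# Categories defined by semantic type, not by listing specific names.
CATEGORY_TYPES = {
    "military-organizations": {("entity", "organization:military")},
    "personnel-change": {
        ("event", "personnel:hire"),
        ("event", "personnel:resign"),
    },
}

# Categories defined by a disambiguated place.
PLACE_CATEGORIES = {
    "san-francisco-ca": "San Francisco, California, US",
}


def categorize(mentions):
    """Return the sorted categories implied by a document's mentions."""
    categories = set()
    for mention in mentions:
        for category, types in CATEGORY_TYPES.items():
            if (mention.kind, mention.type) in types:
                categories.add(category)
        for category, place in PLACE_CATEGORIES.items():
            if mention.type == "place" and mention.resolved == place:
                categories.add(category)
    return sorted(categories)


# Two hypothetical documents that both mention "San Francisco".
doc_a = [
    Mention("entity", "place", "San Francisco", "San Francisco, California, US"),
    Mention("event", "personnel:hire", "named Jane Doe as CEO"),
]
doc_b = [
    Mention("entity", "place", "San Francisco", "San Francisco, Cordoba, AR"),
]

print(categorize(doc_a))
print(categorize(doc_b))
```

The category definitions never mention a specific organization or person, which is the main appeal of this strategy when the extractor’s ontology already covers the types of interest.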
Multilingual
Supports document categorization in multiple languages.
Language ID
Offers a seamlessly integrated language ID capability where the language of the input text is automatically detected, and the text is processed through the categorization algorithms accordingly. Both microblog and standard document lengths are supported. A mixed language document, where sections of the document are written in multiple languages, can also be handled automatically.
Easy Customization
Easy to customize categorization by any of the three strategies.