What is e-Discovery and Why is it Challenging?
Electronic discovery, or e-Discovery, is the discovery of electronically stored information (ESI), such as emails, documents, databases, social media, and chat logs, for use as evidence in legal proceedings. The e-Discovery process involves several stages, including identification, preservation, collection, processing, and review of information. Each stage presents its own challenges, but throughout the process the two competing forces are cost and effectiveness. On the one hand, the ultimate goal is to produce evidence that meets a burden of proof. On the other hand, e-Discovery often involves massive amounts of ESI, typically unstructured data in various file formats and on the order of millions of documents. Manually reviewing such volumes is prohibitively expensive and impractical on both the production side and the consumption side of the discovery process. In both cases, the challenge is to narrow a large collection of ESI down to a much smaller set of relevant documents for review and analysis.
How does Entity Extraction help e-Discovery?
Entity extraction is an AI technology that plays a critical role within the Technology-Assisted Review (TAR) step, where it helps prepare raw ESI for review and identify relevant information:
- Identify sensitive information on the production side. Entity extraction can be used to identify privileged or sensitive information, which must be protected during the e-Discovery process. Sensitive information such as Social Security numbers must be redacted before ESI is made available for review. Entity extraction can recognize Social Security numbers, account numbers, and other sensitive information with very high accuracy and use the results to produce cleansed or redacted versions of the affected documents.
- Identify metadata-level information on the consumption side. ESI starts as raw data. If the raw data amounts to more than a few hundred items, it should be turned into an indexed, fully searchable collection. ESI may come with metadata (e.g., the to, from, and date fields of email messages). This metadata can play an important part in identifying relevant data and providing evidence for a given case, but it is often insufficient and may not be available at all. Entity extraction can identify relevant information that serves as metadata or augments it, and thus turns unstructured data into searchable information. In its most basic form, entity extraction automatically recognizes names of people, organizations, and places, time expressions, and numeric expressions such as monetary amounts. The metadata associated with an electronic document can then be extended with key concepts such as the names of the companies and people mentioned in it. It is important to realize that entity extraction does not just identify known names. Its true power lies in using linguistic context to identify previously unknown and unseen names, which may play a critical role in the legal case.
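The two uses above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a description of any particular product: the function names `redact_ssns` and `build_entity_index` are hypothetical, the pattern assumes Social Security numbers written as NNN-NN-NNNN, and the entity lists are assumed to come from whatever entity extractor is in use. A real extractor would also rely on linguistic context rather than on patterns alone.

```python
import re
from collections import defaultdict

# Assumed format: US Social Security numbers written as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text, placeholder="[REDACTED-SSN]"):
    """Return the redacted text and the character spans that were replaced."""
    spans = [match.span() for match in SSN_PATTERN.finditer(text)]
    return SSN_PATTERN.sub(placeholder, text), spans


def build_entity_index(doc_entities):
    """Map each (entity_type, entity_text) pair to the documents mentioning it.

    doc_entities: dict of doc_id -> list of (entity_type, entity_text) pairs,
    assumed to be the output of an entity extractor (hypothetical input shape).
    """
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity_type, entity_text in entities:
            index[(entity_type, entity_text)].add(doc_id)
    return index


# Production side: produce a redacted version of a document.
redacted, spans = redact_ssns("Employee SSN 123-45-6789 is on file.")
print(redacted)

# Consumption side: use extracted entities as searchable metadata.
index = build_entity_index({
    "email-001": [("PERSON", "Jane Doe"), ("ORGANIZATION", "Acme Corp")],
    "memo-017": [("PERSON", "Jane Doe"), ("MONEY", "$25,000")],
})
print(sorted(index[("PERSON", "Jane Doe")]))
```

Keeping the replaced spans alongside the redacted text makes it possible to audit what was removed, and the index lets a reviewer go from an entity straight to every document that mentions it.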
What Advanced Entity Extraction Capabilities are Useful for e-Discovery?
There are a number of ways in which advanced entity extraction capabilities offer a unique and critical advantage for e-Discovery:
- Relationship and event extraction. Beyond identifying named entities with state-of-the-art accuracy, advanced entity extraction can identify, out of the box, a broad range of relationships (e.g., person-kinship, person-associate, person-affiliation, organization-subsidiary, organization-owner) and events (e.g., meetings, payments, travel), together with their participants. This capability allows for far more advanced analysis and insight than the simple links afforded by co-occurrence or by the To/From/Cc fields of emails and other internal communications. It enables deeper analysis of documents, such as network link analysis, to reveal clues critical to a given investigation.
- Customization. For specialized domains, entity extraction should be easy to customize to extract additional concepts of interest for that domain (e.g., oil rigs in the oil industry).
- Name normalization. Entity extraction should be able to normalize names so that they can be more easily resolved and aggregated across documents to support semantic search, faceted search, and advanced analysis (e.g., timelines, charts).
- High throughput. Entity extraction should be engineered for high-volume processing of many different data sources, which is critical in investigations and legal proceedings where time is of the essence.
- Ease of integration. Entity extraction should offer an API to support easy integration with existing workflows as well as with databases, document and content management systems, portals, and other sources of electronic content.
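To make name normalization and link analysis more concrete, the following Python sketch aggregates extracted relationships across documents into a simple link graph. It is a rough illustration only: the function names, the honorific list, and the input tuples of (document id, subject, relationship type, object) are all assumptions, and the normalization handles person names only. Production-grade normalization would be considerably more sophisticated.

```python
import re
from collections import defaultdict

# Small illustrative list of titles to ignore; an assumption, not exhaustive.
HONORIFICS = {"mr", "mrs", "ms", "dr"}


def normalize_person_name(name):
    """Reduce a person name to a crude lowercase key for aggregation."""
    # Turn "Last, First" into "First Last".
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    tokens = [t for t in re.split(r"\s+", name.strip()) if t]
    # Drop titles such as "Dr." or "Mr".
    tokens = [t for t in tokens if t.rstrip(".").lower() not in HONORIFICS]
    return " ".join(tokens).casefold()


def build_link_graph(relations):
    """Group relationships by normalized participants.

    relations: iterable of (doc_id, subject, relation_type, object) tuples,
    assumed to be produced by a relationship extractor.
    Returns subject -> (relation_type, object) -> set of supporting doc ids.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for doc_id, subject, relation_type, obj in relations:
        edge = (relation_type, normalize_person_name(obj))
        graph[normalize_person_name(subject)][edge].add(doc_id)
    return graph


relations = [
    ("doc1", "Smith, John", "person-associate", "Dr. Jane Doe"),
    ("doc2", "john smith", "person-associate", "Jane Doe"),
]
graph = build_link_graph(relations)
for edge, doc_ids in graph["john smith"].items():
    print(edge, sorted(doc_ids))
```

Because both mentions normalize to the same key, the two documents end up supporting a single typed edge instead of two unrelated ones, which is what makes aggregation across documents useful for link analysis.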
Summary
Advanced entity extraction provides an accurate, fast, and cost-effective way to identify and analyze electronically stored text content for e-Discovery, from named entities to advanced relationships and events.



