Entity Extraction Helps Discover Critical Information Hidden in the Dark Web

Entity Extraction, Homeland Security, Intelligence Analysis

What Is the Dark Web?

The internet is frequently viewed as a vast, limitless space of information, but what most users interact with is only the surface.

Beneath that surface lies a more obscure part of the internet known as the dark web, whose sites are also called darknet websites. It is a hidden realm that is not indexed by traditional search engines and is accessible only through specialized software such as Tor (The Onion Router).

Because of the anonymity it offers to users through a layered encryption system, the dark web is notorious for its association with illegal activities. There are black markets on it like the now-defunct Silk Road and Hydra Market, where users can buy and sell illicit goods such as drugs, weapons, counterfeit money, and stolen data. Others use the dark web to offer crime as a service, hacking services, identity theft schemes, and child exploitation materials. Terrorists use it to hide because the open web is becoming increasingly risky for them.

To hide these illicit activities further, financial transactions are concealed with cryptocurrencies. The anonymity and encryption of the dark web make it challenging for authorities to track down and prosecute site operators and users.

Luckily, there’s a sophisticated AI technology that can provide valuable insight into this dark web data.

How Entity Extraction Illuminates the Dark Web

The dark web contains a great deal of structured data, for example spreadsheets of sensitive information such as passwords and social security numbers. It also contains unstructured data: text such as documents and email messages, which has no predefined schema or format.

Unlike the contents of tables, spreadsheets, or databases, the information in unstructured data is not labelled. Finding it requires Entity Extraction, an advanced AI technology that analyzes text and extracts and categorizes its key concepts.

An example of unstructured data would be emails stolen from a corporation’s SharePoint archive and offered for sale by criminals on a dark web site. Likewise, bomb-making instructions that a terrorist group posts online in a dark web forum would be in an unstructured, natural language format. Another example is ads for the sale of drugs, guns, or other illicit goods.

In extracting information from unstructured data, Entity Extraction gives that data structure, essentially turning it into a database-like record with predictable semantics and formats.

Another complication is that the language used on the dark web is messy, full of slang and sometimes mixing several languages. Entity Extraction can automatically detect the language and process the input accordingly.

What Information Does Entity Extraction Find on the Dark Web?

Entity Extraction finds mentions of a variety of entity types in the unstructured data that dark web sites offer to sell or buy. For example:

    • Names of people and companies/organizations
    • Addresses
    • Phone numbers
    • Social Security numbers
    • Email addresses
    • Credit card numbers
    • Usernames
    • Weapons (including model numbers and manufacturers)
    • Chemical and biological substances
    • Illegal drugs
    • Malware
    • etc.
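To make the idea of turning free text into typed entities concrete, here is a minimal sketch in Python. It is not how a production Entity Extraction engine works: it swaps in plain regular expressions, which cover only rigidly formatted entities such as email addresses, Social Security numbers, and phone numbers, whereas names, weapons, or drugs need the contextual analysis described in this article. The patterns, the `extract_entities` function, and the sample text are all assumptions made for illustration.

```python
import re

# Illustrative patterns only (assumed, US-style formats). Real extractors
# combine patterns like these with context-aware statistical models.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
}


def extract_entities(text):
    """Return (type, value, start_offset) tuples, sorted by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


# Made-up sample text; the address and numbers are placeholders.
sample = "fresh dump: jdoe@example.com ssn 078-05-1120 call 555-867-5309"
for entity in extract_entities(sample):
    print(entity)
```

Each tuple is already a small structured record: a type label, the matched value, and where it occurred, which is the kind of output downstream tools can index.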

Some of this information could be collected through keyword search using a conventional search engine, if the user is lucky enough to know, for example, a person’s name. But keyword search can only find instances of entities that are already known to the user. The great strength of Entity Extraction is that it recognizes any instance of an entity, whether known or unknown to the user, from the contextual clues that surround it in the unstructured data.

Relationship Extraction is more advanced than extraction of entities alone: it recognizes the relationships between entities in the text. For example, Relationship Extraction associates person names with their Personally Identifiable Information (PII) found in text:

    • Person - Company (employer)
    • Person - Address
    • Person - Phone number
    • Person - Social Security number
    • Person - Credit card number
    • Person - Email address
    • etc.
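The Person-to-PII associations listed above can be illustrated with a deliberately simple heuristic: attach each piece of PII to the nearest person name that precedes it in the text. This is only a sketch under that assumption; a real Relationship Extraction component relies on linguistic context rather than raw proximity. The `link_pii_to_people` function and the sample entity list, including its offsets, are hypothetical.

```python
from collections import defaultdict


def link_pii_to_people(entities):
    """Attach each PII mention to the closest preceding PERSON mention.

    `entities` is a list of (type, value, offset) tuples, such as an upstream
    entity extractor might produce. Proximity is a crude stand-in for the
    contextual analysis a real relationship extractor performs.
    """
    relations = defaultdict(list)
    current_person = None
    for label, value, _offset in sorted(entities, key=lambda e: e[2]):
        if label == "PERSON":
            current_person = value
        elif current_person is not None:
            relations[current_person].append((label, value))
    return dict(relations)


# Made-up entities with made-up character offsets.
entities = [
    ("PERSON", "Jane Roe", 0),
    ("EMAIL", "jroe@example.com", 10),
    ("SSN", "078-05-1120", 31),
    ("PERSON", "John Doe", 50),
    ("PHONE", "555-867-5309", 60),
]
print(link_pii_to_people(entities))
```

PII that appears before any person name is left unlinked rather than guessed at, which keeps the heuristic from inventing associations.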

For more on Relationship Extraction, see here.

Event Extraction goes a step beyond Entity and Relationship Extraction: it identifies the specific events or actions described in text, including the illicit activities in which entities are involved, along with the details of each event. For example, it can extract the key facts from a drug advertisement written in the very non-standard language typical of dark web chatter:

“!!!!!!!!!!!!!!!We have some very high quality 2C-B pills for you@@ 25 mg !!Quantity 60 pills&&& Price: 150 euros******************Very different from run of the mills but Extremli good!!You won’t regret###”

Event Extraction identifies the participants in the event and labels each one with its role, producing a structured record such as:

    • Event Type: Sell Artifact
    • Artifact: 2C-B
    • Dosage: 25 mg
    • Quantity: 60 pills
    • Price: 150 euros
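As a rough illustration of how such a record could be filled in, the sketch below pulls the same fields out of an excerpt of the advertisement with regular expressions and a tiny drug lexicon. Everything here is an assumption for demonstration: the `DRUG_LEXICON`, the cue words, the units, and the `extract_sale_event` function are hypothetical, and a real Event Extraction system would use trained models that tolerate far more variation than these patterns do.

```python
import re

# Hypothetical mini-lexicon and cue words, for illustration only.
DRUG_LEXICON = ("2C-B", "MDMA", "LSD")
DRUG_RE = re.compile(r"\b(" + "|".join(map(re.escape, DRUG_LEXICON)) + r")\b", re.I)
SELL_CUES = re.compile(r"\b(we have|for sale|price)\b", re.I)
DOSAGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g)\b", re.I)
QUANTITY_RE = re.compile(r"Quantity\W*(\d+)\s*(pills|tabs|grams)", re.I)
PRICE_RE = re.compile(r"Price\W*(\d+(?:\.\d+)?)\s*(euros?|dollars?)", re.I)


def _field(match):
    """Format an amount/unit match as 'amount unit', or None if nothing matched."""
    return f"{match.group(1)} {match.group(2)}" if match else None


def extract_sale_event(text):
    """Return a role-labelled sale event, or None if no sale is detected."""
    drug = DRUG_RE.search(text)
    if not (drug and SELL_CUES.search(text)):
        return None
    return {
        "Event Type": "Sell Artifact",
        "Artifact": drug.group(1),
        "Dosage": _field(DOSAGE_RE.search(text)),
        "Quantity": _field(QUANTITY_RE.search(text)),
        "Price": _field(PRICE_RE.search(text)),
    }


ad = ("!!!!!!!!!!!!!!!We have some very high quality 2C-B pills for you@@ "
      "25 mg !!Quantity 60 pills&&& Price: 150 euros******************")
print(extract_sale_event(ad))
```

The noise characters in the ad do no harm because each pattern anchors on a cue word or a number followed by a unit, not on clean sentence structure.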

For more on Event Extraction, see here.

Ways in which Data Extracted by Entity Extraction May Be Used by Downstream Applications

Because Entity Extraction structures the data, the data becomes usable by applications that require structured input. If the output is stored in and indexed by a search engine like Elastic, for example, the user can perform advanced search and discovery on it as well as visualize it.

One very popular application is faceted search. All the entities, relationships, and events in a large collection of dark web documents can be used as independent filters for slicing and examining the data. The user can ask, “Show me the top 20 most frequently sold drugs” or “Report all the person names with social security numbers found in the data collected last week,” and so on.
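The first of those queries can be sketched in a few lines of Python over extracted event records. This is an in-memory toy rather than a search engine: the records, their field names, the "Leak Data" event type, and the `facet_counts` helper are all made up for illustration, and a real deployment would run the equivalent aggregation inside the search index.

```python
from collections import Counter

# Made-up extracted records; field names and values are illustrative.
records = [
    {"event": "Sell Artifact", "artifact": "2C-B", "price_eur": 150},
    {"event": "Sell Artifact", "artifact": "MDMA", "price_eur": 90},
    {"event": "Sell Artifact", "artifact": "2C-B", "price_eur": 140},
    {"event": "Leak Data", "artifact": "credit cards", "price_eur": 300},
]


def facet_counts(records, facet, **filters):
    """Count the values of `facet` among records that match every filter."""
    matching = (
        r for r in records
        if all(r.get(key) == value for key, value in filters.items())
    )
    return Counter(r[facet] for r in matching if facet in r)


# "Show me the top 20 most frequently sold drugs"
top_sold = facet_counts(records, "artifact", event="Sell Artifact").most_common(20)
print(top_sold)
```

Because every extracted field is an independent facet, the same helper answers other questions simply by changing the facet name or the filters.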

Entity Extraction is also a feeder technology for other applications. For example, it enables a Link Analysis tool to uncover the network of associations between people and organizations, and so to determine the leaders or organizers of, say, a drug trafficking gang using the dark web, as well as the various roles of their followers.

Entity Extraction also supports Social Network Analysis. Where Link Analysis provides graphical and quantitative tools to make connections between entities clear, Social Network Analysis uses mathematical and visual approaches to analyze the structure of a network. In both approaches, Entity Extraction, by structuring unstructured data, makes it possible for these tools to exploit that data.
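One of the simplest quantitative measures used in this kind of network analysis is degree centrality: the share of other nodes to which a given node is directly connected. The sketch below computes it over a handful of made-up person-to-person links, of the sort Relationship Extraction might supply. The names, the edges, and the `degree_centrality` function are assumptions for illustration; real Link Analysis and Social Network Analysis tools offer many more measures than this one.

```python
from collections import defaultdict

# Made-up associations between people, e.g. derived from extracted relationships.
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
]


def degree_centrality(edges):
    """Map each node to the fraction of the other nodes it is directly linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


scores = degree_centrality(edges)
most_connected = max(scores, key=scores.get)
print(most_connected, scores[most_connected])
```

A high score marks a well-connected participant, which is one signal, though not proof, that the person plays an organizing role in the network.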

In sum, Entity Extraction is an advanced technology for discovering critical information on the dark web. It enables the automated discovery of information and renders it suitable for processing by downstream applications.