Entity Extraction Sheds Light on the Dark Web

Entity Extraction, Homeland Security, Intelligence Analysis, Risk Management

Entity Extraction is a New Tool for Analyzing the Dark Web

The dark web is a part of the Internet that is not indexed by conventional search engines such as Google and Bing, so its sites cannot be found through them. Its traffic is wrapped in multiple layers of encryption, so reaching sites there requires a specialized tool such as Tor (short for The Onion Router). Entity Extraction, a sophisticated AI technology for understanding natural language, can provide valuable insight into dark web data because much of it comes in unstructured form, in other words, in natural language.

What’s in the Dark Web?

The dark web has become notorious as a home for all sorts of criminal activity. It has become a dumping ground for stolen personal data, as in the well-known Ashley Madison case. There are sites where you can buy names and credit card information. If you want to buy weapons, there are sites that will oblige you. You can purchase all kinds of drugs. You can buy hacking software. You can buy login credentials to people’s bank accounts. You can obtain counterfeit bills in any amount. You can even hire hackers to launch attacks on computers.

Equally ominous, terrorist groups have moved many of their activities to the dark web. Terrorists use it to hide because the open web is becoming increasingly risky for them. Recruitment and radicalization are conducted on the dark web, and propaganda is spread there as well. Instructions for manufacturing weapons are popular on jihadist sites.

How Entity Extraction Sheds Light on the Dark Web

Of course, a great deal of the data available for sale on the dark web is already structured in the form of, for example, spreadsheets containing such sensitive information as passwords, social security numbers, etc.

However, a large amount of dark web information is unstructured, that is, written in natural language. An example would be emails stolen from a SharePoint archive at a corporation and then offered for sale by criminals on a dark web site. Likewise, bomb-making instructions that a jihadi group posts in a dark web forum would be in unstructured, natural language form.

Unlike spreadsheet or database records, unstructured data cannot be readily analyzed by software without the aid of Entity Extraction, an advanced AI technology that analyzes text and extracts key concepts from it.

In particular, Named Entity Extraction finds mentions of a variety of entity types in the unstructured data that dark web sites offer for sale. For example:

  • People
  • Companies and organizations
  • Addresses
  • Social Security numbers
  • Phone numbers
  • Credit card numbers
  • Email addresses
  • User names
  • Weapons (including model numbers and manufacturers)
  • Chemical and biological substances
  • Illegal drugs
  • Malware
  • etc.
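Several of the entity types above, such as email addresses, phone numbers, Social Security numbers, and credit card numbers, follow fixed surface patterns. The minimal Python sketch below illustrates that slice of the problem only. The regular expressions, the US-style number formats, and the function names are assumptions made for illustration and do not describe any particular product; names of people, organizations, weapons, or drugs generally call for statistical or contextual models rather than patterns.

```python
import re

# Hypothetical, simplified patterns (US-style formats assumed).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def luhn_ok(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def extract_entities(text):
    """Return (label, value, offset) tuples sorted by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            # Drop 16-digit strings that cannot be valid card numbers.
            if label == "CREDIT_CARD" and not luhn_ok(value):
                continue
            found.append((label, value, match.start()))
    return sorted(found, key=lambda entity: entity[2])


sample = ("Contact jdoe@example.com or 555-123-4567. "
          "SSN 123-45-6789, card 4111 1111 1111 1111.")
for label, value, offset in extract_entities(sample):
    print(label, value, offset)
```

The Luhn check is one simple way to cut false positives: a random 16-digit string is unlikely to pass it, while well-formed card numbers do.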

Some of this information could be gathered through keyword search, but keyword search can only find entities whose names the user already knows. By contrast, Entity Extraction recognizes instances of entities whether or not they are known to the user, using the contextual clues that surround them in the unstructured data.
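The difference between keyword lookup and context-based recognition can be shown with a toy Python sketch. Everything in it is an assumption made for illustration: the watch list, the single "quantity of X" cue, and the made-up product name "zanfetamine". Real Entity Extraction systems rely on far richer contextual evidence, typically learned from data.

```python
import re

# Hypothetical watch list: the only names a keyword search would know about.
KNOWN_DRUGS = {"heroin", "cocaine"}

# Hypothetical contextual cue: a quantity, then "of", then a product word.
CONTEXT = re.compile(
    r"\b\d+\s?(?:g|mg|kg|grams?)\s+of\s+([A-Za-z][\w-]*)", re.IGNORECASE
)


def keyword_hits(text):
    """Find only the terms that already appear on the watch list."""
    words = re.findall(r"[A-Za-z][\w-]*", text.lower())
    return {word for word in words if word in KNOWN_DRUGS}


def context_hits(text):
    """Find any term that appears in the 'quantity of X' context."""
    return {match.group(1).lower() for match in CONTEXT.finditer(text)}


listing = "Selling 5g of cocaine and 10 grams of zanfetamine, ships worldwide."
print(keyword_hits(listing))
print(context_hits(listing))
```

The keyword lookup returns only the listed term, while the contextual cue also surfaces the unfamiliar product name because of where it appears in the sentence.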

Relationship Extraction, which is more advanced than the extraction of entities, recognizes the relationships between entities in the text. For example, Relationship Extraction associates person names with the Personally Identifiable Information (PII) found in the text:

  • Person – Address
  • Person – Social Security number
  • Person – Phone number
  • Person – Credit card number
  • Person – Email address
  • etc.
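One very simple way to approximate such person-to-PII associations is to pair a person with any PII value that occurs in the same sentence. The Python sketch below does exactly that and nothing more. The naive two-capitalized-words name pattern, the PII patterns, and the function name are assumptions made for illustration; production Relationship Extraction uses linguistic analysis rather than plain co-occurrence.

```python
import re

# Naive stand-in for a real person-name recognizer (assumption).
PERSON = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

# Hypothetical, simplified PII patterns.
PII = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_relations(text):
    """Pair each person with each PII value found in the same sentence."""
    relations = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        people = PERSON.findall(sentence)
        for label, pattern in PII.items():
            for value in pattern.findall(sentence):
                for person in people:
                    relations.append((person, label, value))
    return relations


sample = ("John Smith can be reached at jsmith@example.com. "
          "Mary Jones uses 555-867-5309.")
for relation in extract_relations(sample):
    print(relation)
```

Sentence-level co-occurrence over-generates when several people are named in one sentence, which is one reason real systems look at grammatical structure as well.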

Event Extraction goes even further, identifying the illicit activities in which entities of interest are involved, for example, the selling of certain weapons, drugs, or WMD substances, or the offering of certain cyber-attacks.
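As a rough illustration of the idea, the Python sketch below flags an event whenever a trigger word such as "selling" or "offering" shares a sentence with an item of interest. The trigger table, the item lexicon, and the output format are all hypothetical; actual Event Extraction determines who did what to whom from the structure of the sentence, not from word lists alone.

```python
import re

# Hypothetical trigger words mapped to event types.
TRIGGERS = {"selling": "SALE", "sells": "SALE",
            "offering": "OFFER", "offers": "OFFER"}

# Hypothetical lexicon of items of interest and their entity types.
ITEMS = {"ak-47": "WEAPON", "ransomware": "MALWARE", "ddos": "CYBER_ATTACK"}


def extract_events(text):
    """Emit one event per (trigger, item) pair found in the same sentence."""
    events = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[\w-]+", sentence.lower())
        event_types = [TRIGGERS[t] for t in tokens if t in TRIGGERS]
        items = [(t, ITEMS[t]) for t in tokens if t in ITEMS]
        for event_type in event_types:
            for item, item_type in items:
                events.append(
                    {"event": event_type, "item": item, "item_type": item_type}
                )
    return events


sample = ("Vendor is selling AK-47 rifles. "
          "We are offering DDoS attacks by the hour. "
          "Nice weather today.")
for event in extract_events(sample):
    print(event)
```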

The Output of Entity Extraction Can Be Utilized in Multiple Ways

Once unstructured information has been extracted, it’s available for many different applications. If it is stored in and indexed by a search engine such as Elasticsearch, for example, the user can perform advanced search and discovery as well as visualization of the output. One of the most popular applications is known as faceted search. All the entities, relationships, and events in a large collection of dark web documents can be used as independent filters for slicing and analyzing the data. The user can ask, “Show me the 10 most frequently traded weapons or drugs,” “Report all the phone numbers or person names found in the data collected yesterday,” and so on.
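The two example questions can be sketched in plain Python over a handful of extraction records. The record layout, field names, and sample values below are hypothetical, and the sketch stands in for what a search index would do at scale with aggregations and filters; it is not the API of any particular search product.

```python
from collections import Counter

# Hypothetical extraction output: one record per collected document.
DOCS = [
    {"date": "2024-05-01", "WEAPON": ["AK-47"], "PHONE": ["555-867-5309"]},
    {"date": "2024-05-01", "WEAPON": ["AK-47", "Glock 19"], "PHONE": []},
    {"date": "2024-05-02", "WEAPON": ["AK-47", "Glock 19"],
     "PHONE": ["555-123-4567"]},
]


def top_values(docs, facet, n=10):
    """Most frequent values of one facet across all documents."""
    counts = Counter(value for doc in docs for value in doc.get(facet, []))
    return counts.most_common(n)


def values_on_date(docs, facet, date):
    """All distinct values of one facet in documents collected on `date`."""
    return sorted({value for doc in docs if doc["date"] == date
                   for value in doc.get(facet, [])})


print(top_values(DOCS, "WEAPON"))
print(values_on_date(DOCS, "PHONE", "2024-05-02"))
```

Because every entity type is stored as its own field, each one can serve as an independent filter or be counted on its own, which is the essence of faceted search.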

Entity Extraction is also an enabling technology for other applications. For example, a Link Analysis tool serves to uncover the network of associations between people and between organizations in order to determine the influence of leaders within a network, the direction of flow of information, or other critical aspects. Entity Extraction, by structuring unstructured data, also makes it possible for Social Network Analysis tools to exploit this data.
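To make the hand-off to Link Analysis concrete, the Python sketch below builds a co-occurrence network from extracted person names and ranks each person by degree centrality, one common and deliberately simple proxy for influence. The input format, the sample user names, and the choice of measure are assumptions made for illustration; dedicated Link Analysis and Social Network Analysis tools offer many more measures.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(doc_people):
    """Link every pair of people mentioned in the same document."""
    graph = defaultdict(set)
    for people in doc_people:
        for a, b in combinations(sorted(set(people)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Share of the other nodes that each node is directly linked to."""
    n = len(graph)
    if n < 2:
        return {}
    return {node: len(neighbors) / (n - 1)
            for node, neighbors in graph.items()}


# Hypothetical extraction output: the people named in each document.
doc_people = [
    ["alice", "bob"],
    ["alice", "carol"],
    ["alice", "dave"],
    ["bob", "carol"],
]
scores = degree_centrality(build_graph(doc_people))
for person, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(person, round(score, 2))
```

In this toy network "alice" is linked to everyone else and therefore scores highest.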

In sum, Entity Extraction is an advanced technology for discovering critical data on the dark web. It enables the automated discovery of sensitive information and renders it suitable for processing by downstream applications.