What Is Coreference Resolution?

Coreference Is a Common Phenomenon of Natural Language

We’ve already discussed various forms of text analytics in other blogs: Entity Extraction, Relationship Extraction, and Event Extraction. Here we focus on one linguistic phenomenon, coreference, that needs to be handled well in order to analyze text accurately.

What’s coreference? It’s based on the fact that natural languages tend to avoid excessive repetition. For example, in English the first occurrence of a name in a document is typically something like Lionel Jones, and subsequent mentions of the same individual may be Jones, he, him, or the famous soccer player. It’s highly unlikely that it’s going to be Lionel Jones all the way through. He and him are pronouns, whereas the famous soccer player is a definite noun phrase. When a pronoun or a definite noun phrase refers to the same entity as another mention (whether it’s a name, another pronoun, or a noun phrase), they are said to corefer. In Linguistics, such pronouns and definite noun phrases are called anaphors, the mentions they refer to are called antecedents, and the process of figuring out which antecedent an anaphor refers to is called Coreference Resolution.

Coreference Resolution Is a Critical Piece of Entity Extraction

So what does all this have to do with Entity Extraction? Well, it turns out that information in unstructured text (e.g., whether about an entity, relationship, or event) is often scattered across multiple sentences, so linking up the pronouns, definite noun phrases, and various forms of names that refer to the same entity is critical to accurately capture entity information (e.g., name, aliases, descriptive phrases) and to identify the relationships (Relationship Extraction) and events (Event Extraction) that entities participate in.

To take a simple example:

Former University of Houston guard, NBA player, and coach Anthony Sullivan is alleged to have been involved in a huge illegal gambling scheme. As a result, the FBI arrested him on Thursday in Las Vegas.

As described in our previous “What is Event Extraction?” blog, the goal of Event Extraction is to take an unstructured text like the above and transform it into structured output like:

Event: Arrest
Authority: FBI
Arrested_Person: Anthony Sullivan
Place: Las Vegas
Time: Thursday

In order to obtain this event extraction result, it is crucial to resolve him in the second sentence to the correct entity, Anthony Sullivan, in the first. It is only this pronoun that connects the two.
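
As a rough illustration of why that link matters, the structured output above can be held in a small record whose Arrested_Person slot initially contains only the pronoun. The class name, field names, and the coref_links mapping below are assumptions made for this sketch, not the schema of any particular extraction system.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class ArrestEvent:
    """Hypothetical record mirroring the structured output shown above."""
    authority: Optional[str] = None
    arrested_person: Optional[str] = None
    place: Optional[str] = None
    time: Optional[str] = None


def apply_coreference(event: ArrestEvent, links: Dict[str, str]) -> ArrestEvent:
    """Swap an anaphoric slot filler for its antecedent when a link exists."""
    person = event.arrested_person
    return replace(event, arrested_person=links.get(person, person))


# What the second sentence yields on its own: the arrestee is just a pronoun.
raw_event = ArrestEvent(authority="FBI", arrested_person="him",
                        place="Las Vegas", time="Thursday")

# Anaphor -> antecedent links, assumed to come from a separate resolver.
coref_links = {"him": "Anthony Sullivan"}

resolved_event = apply_coreference(raw_event, coref_links)
print(resolved_event.arrested_person)  # Anthony Sullivan
```

Without the link, the event would record that the FBI arrested "him", which is of little use to anyone querying the extracted data.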

How Coreference Resolution Works

While information may be scattered across multiple sentences, the desired structured output is one consolidated unit, as in the event example above. To that end, Entity Extraction proceeds in stages. At a high level, it first extracts the basic entities and labels what kind of thing each one is, such as:

Names: 
University of Houston (Organization:University)
NBA (Organization:Company)
FBI (Organization:Government)
Anthony Sullivan (Person, gender=male)
Las Vegas (Place)

Noun Phrases: 
Former University of Houston guard, NBA player, and coach (Person)

Pronouns: 
him (Person, gender=male)

Dates: 
Thursday

Event Extraction then identifies the Arrest event. At this point, the system does not know what entity him refers to. Finally, Coreference Resolution resolves the pronoun him to Anthony Sullivan and enables the system to determine that the person that has been arrested is actually Anthony Sullivan.

To get a bit into the weeds of how a pronoun finds its antecedent: most Coreference Resolution algorithms rely, among other things, on two features when dealing with pronouns:

    • Distance: The closer the pronoun is to the preceding name or noun phrase it refers to (i.e., the antecedent), the more likely they refer to the same underlying entity.
    • Syntactic and semantic properties: A definite noun phrase and a pronoun are more likely to refer to the same entity if they share some properties, such as semantic type (e.g., person, company, etc.), gender, number, syntactic role, etc. Minimally, they should not conflict in basic properties like gender and number.

In the above example, the name FBI is actually the closest name preceding the pronoun, so distance alone would favor it. For the second factor, though, we need to back up a bit: when Entity Extraction extracted the entities FBI, Anthony Sullivan, and him, it assigned semantic types, i.e., Organization to FBI and Person to Anthony Sullivan and him. FBI is therefore ruled out, because its semantic type conflicts with that of him. Since him and Anthony Sullivan match on the semantic type Person and also on the gender feature (male), the algorithm can establish a link between Anthony Sullivan and him.
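
The interplay of the two features can be sketched in a few lines of Python. This is a deliberately minimal illustration, not how any production resolver works: the Mention class, its fields, and the rule "pick the closest preceding mention whose properties do not conflict" are all simplifying assumptions, and only the names from the example are included.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    text: str
    sem_type: str                 # e.g. "Person", "Organization", "Place"
    position: int                 # order of appearance in the text
    gender: Optional[str] = None  # None when unknown or not applicable
    number: str = "singular"


def compatible(pronoun: Mention, candidate: Mention) -> bool:
    """Properties must not conflict: same semantic type and number,
    and the same gender whenever both mentions carry one."""
    if pronoun.sem_type != candidate.sem_type:
        return False
    if pronoun.number != candidate.number:
        return False
    if pronoun.gender and candidate.gender and pronoun.gender != candidate.gender:
        return False
    return True


def resolve_pronoun(pronoun: Mention, mentions: List[Mention]) -> Optional[Mention]:
    """Return the closest preceding mention compatible with the pronoun."""
    candidates = [m for m in mentions
                  if m.position < pronoun.position and compatible(pronoun, m)]
    return max(candidates, key=lambda m: m.position, default=None)


mentions = [
    Mention("University of Houston", "Organization", 0),
    Mention("NBA", "Organization", 1),
    Mention("Anthony Sullivan", "Person", 2, gender="male"),
    Mention("FBI", "Organization", 3),
]
him = Mention("him", "Person", 4, gender="male")

antecedent = resolve_pronoun(him, mentions)
print(antecedent.text)  # Anthony Sullivan
```

Here compatibility acts as a hard filter and distance only breaks ties among the survivors. Real systems typically weigh many such features against each other rather than applying them in a fixed order.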

The actual Coreference Resolution processing is much more complex than this in real life (we have glossed over a few issues), and we have not even discussed coreference resolution for definite noun phrases, for short forms of names, or for entities other than people.

There Are Hard Problems in Coreference Resolution

Let us be clear, though, that Coreference Resolution still poses some serious challenges. Here are some cases where other factors play a role in coreference and would need to be handled in an extraction system:

  1. Knowledge of the world

But days later, on 10 November, Mr. Morales stepped down and sought asylum in Mexico following an intervention by the chief of the armed forces calling for his resignation. He denounced the move as a “coup”.

Here there are two potential antecedents for the pronouns his and He: Mr. Morales and the chief of the armed forces. Knowledge of the world, however, allows us to understand both his and He as referring to Mr. Morales. That’s because we know that people call for other people’s resignations, not their own, so his must refer to Mr. Morales. We also know that when one denounces something, it’s usually something done by others, and that a coup is something done by the military, so He too must refer to Mr. Morales.

Naming him prime minister, Mr. Díaz-Canel praised Mr. Marrero in particular for his handling of relationships with foreign investors.

Here too there are two potential antecedents for his: Mr. Díaz-Canel and Mr. Marrero. We understand his as referring to Mr. Marrero because we know that when a person praises another, it’s for something about the person being praised.

Taiwan is under threat from China. Its precarious position represents a real threat to the world’s economy.

Here it’s world knowledge about the relationship of China and Taiwan with Taiwan being in a weaker position that allows us to understand Its as a reference to Taiwan instead of China. Also, the fact that Taiwan is in subject position tips the balance in favor of its being the antecedent of Its.

Franklin Roosevelt and Winston Churchill met in August 1941 during the Atlantic Conference in Placentia Bay, Newfoundland. The prime minister came aboard the USS Augusta for their first meeting.

Here, the prime minister has two possible antecedents: Franklin Roosevelt and Winston Churchill. Only knowing the fact that Churchill was prime minister of Britain makes it possible to resolve the prime minister to Winston Churchill.
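
One simple way to picture this kind of world knowledge is as a lookup table that the resolver consults. The sketch below is only an illustration: the KNOWLEDGE_BASE dictionary and the resolve_title function are hypothetical, and a real system would draw on a far larger knowledge source and would also need to account for when a title was held.

```python
from typing import Dict, List, Optional, Set

# Toy knowledge base (an assumption for this sketch): person -> titles held.
KNOWLEDGE_BASE: Dict[str, Set[str]] = {
    "Franklin Roosevelt": {"president"},
    "Winston Churchill": {"prime minister"},
}


def resolve_title(title: str, candidates: List[str],
                  kb: Dict[str, Set[str]]) -> Optional[str]:
    """Return the only candidate known to hold the title, else None."""
    matches = [name for name in candidates if title in kb.get(name, set())]
    return matches[0] if len(matches) == 1 else None


antecedent = resolve_title("prime minister",
                           ["Franklin Roosevelt", "Winston Churchill"],
                           KNOWLEDGE_BASE)
print(antecedent)  # Winston Churchill
```

The function returns None rather than guessing when zero or several candidates match, since a wrong link is usually worse than a missing one.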

  2. Cataphora

During his trial, Kosta Diamantis testified in his own defense and said the tens of thousands of dollars he took from two contractors was for networking, introducing contractors to other companies.

Cataphora is a type of anaphora in which an anaphor comes before its antecedent. In other words, an anaphor does not always refer to an entity introduced earlier in the text; it can also refer to one introduced later. Here the first his precedes its antecedent, Kosta Diamantis.
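
A resolver that only looks backward would find nothing for that first his. One possible remedy, sketched below under our own simplifying assumptions, is to search the preceding mentions first and fall back to the following ones. The Mention tuple and the function name are hypothetical, and a real system would restrict how far ahead it is willing to look.

```python
from typing import List, NamedTuple, Optional


class Mention(NamedTuple):
    text: str
    sem_type: str
    is_pronoun: bool


def resolve_either_direction(index: int, mentions: List[Mention]) -> Optional[str]:
    """Prefer the nearest compatible non-pronoun mention before the pronoun;
    if there is none, take the nearest one after it (cataphora)."""
    pronoun = mentions[index]

    def is_candidate(m: Mention) -> bool:
        return not m.is_pronoun and m.sem_type == pronoun.sem_type

    before = [m for m in reversed(mentions[:index]) if is_candidate(m)]
    after = [m for m in mentions[index + 1:] if is_candidate(m)]
    ordered = before + after
    return ordered[0].text if ordered else None


# Person mentions from the example sentence, in order of appearance.
mentions = [
    Mention("his", "Person", True),
    Mention("Kosta Diamantis", "Person", False),
    Mention("his", "Person", True),
    Mention("he", "Person", True),
]
print(resolve_either_direction(0, mentions))  # Kosta Diamantis (looks ahead)
print(resolve_either_direction(2, mentions))  # Kosta Diamantis (looks back)
```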

  3. Direct speech

Direct speech is the exact words of someone’s speech marked by quotation marks. Quotations embedded in unstructured text represent a special challenge for Coreference Resolution. Here is a snippet of the transcript of a longer conversation between two business people meeting to discuss the performance of a third:

Hagerty says: “Bill Jones’ sales performance last quarter was phenomenal. He has improved quite a bit. I’m really impressed by it.”

Davidson replies: “You should reward him come bonus time.”

The coreference phenomena here are quite complex:

    • I in the first quotation refers to Hagerty, who is outside the quotation.
    • He in the first quotation refers to Bill Jones. Both are contained within the quotation.
    • You in the second quotation refers to Hagerty.
    • Finally, him in the second quotation refers to Bill Jones in the first quotation.

Whew!
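
The first- and second-person cases, at least, follow a regular pattern: I points to the speaker of the quotation and you to the person addressed. A minimal sketch of that rule follows, assuming a two-party conversation in which speaker attribution has already been done. The function name and the (speaker, quote) representation are our own assumptions.

```python
import re
from typing import List, Tuple


def resolve_speech_pronouns(turns: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Link 'I' to the speaker of each quotation and 'you' to the other party."""
    speakers: List[str] = []
    for speaker, _ in turns:
        if speaker not in speakers:
            speakers.append(speaker)
    if len(speakers) != 2:
        raise ValueError("this sketch only handles two-party conversations")

    links = []
    for speaker, quote in turns:
        addressee = speakers[1] if speaker == speakers[0] else speakers[0]
        # Splitting on letters turns "I'm" into "I" and "m".
        for token in re.findall(r"[A-Za-z]+", quote):
            if token.lower() == "i":
                links.append((token, speaker))
            elif token.lower() == "you":
                links.append((token, addressee))
    return links


turns = [
    ("Hagerty", "Bill Jones' sales performance last quarter was phenomenal. "
                "He has improved quite a bit. I'm really impressed by it."),
    ("Davidson", "You should reward him come bonus time."),
]
print(resolve_speech_pronouns(turns))
# [('I', 'Hagerty'), ('You', 'Hagerty')]
```

Third-person pronouns such as He and him are untouched by this rule; they still have to be resolved in the usual way, including across quotation boundaries.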

  4. Saliency

Fidel Castro led a communist revolution that toppled the Cuban government in 1959, after which he declared himself prime minister. He held the title until 1976, when it was abolished and he became head of the Communist Party and president of the council of state and the council of ministers. With his health failing, Castro handed power to his brother, Raúl, in 2006. He died in 2016.

Here the saliency of Fidel Castro, who is both the first entity mentioned in the paragraph and the subject of the first sentence, makes it clear that all occurrences of he and his in the paragraph refer to him. In particular, He in the last sentence could in principle refer to his brother Raúl, the closest preceding name, but it clearly doesn’t because of the overall saliency of Fidel Castro.
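
Saliency is sometimes approximated with a running score per entity. The sketch below is one such approximation under assumptions of our own: the weights are arbitrary illustrative values, and the input is a hand-built list of the named mentions in the paragraph, each flagged for whether it is a grammatical subject.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Illustrative weights only; a real system would tune or learn these.
SUBJECT_WEIGHT = 2.0
OTHER_WEIGHT = 1.0
FIRST_MENTION_BONUS = 1.0


def salience_scores(mentions: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Score each entity from its mentions, given as (entity, is_subject)
    pairs in order of appearance."""
    scores: Dict[str, float] = defaultdict(float)
    for i, (entity, is_subject) in enumerate(mentions):
        scores[entity] += SUBJECT_WEIGHT if is_subject else OTHER_WEIGHT
        if i == 0:
            # The entity that opens the paragraph gets an extra boost.
            scores[entity] += FIRST_MENTION_BONUS
    return dict(scores)


# Named mentions in the paragraph, already linked to their entities.
mentions = [
    ("Fidel Castro", True),   # "Fidel Castro led a communist revolution ..."
    ("Fidel Castro", True),   # "... Castro handed power ..."
    ("Raúl", False),          # "... to his brother, Raúl ..."
]
scores = salience_scores(mentions)
antecedent = max(scores, key=scores.get)
print(antecedent)  # Fidel Castro
```

With these weights Fidel Castro scores 5.0 against 1.0 for Raúl, so the final He is linked to Fidel Castro even though Raúl is the nearer name.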

  5. Synonyms

Sudan’s new transitional government, brought to power after protesters toppled Omar Al Bashir, has been meeting with rebels who fought for years against their marginalisation by Khartoum under the ousted leader.

Here the ousted leader is a definite noun phrase anaphor, which poses different challenges than pronouns. We know that the ousted leader refers to (toppled) Omar Al Bashir because ousted is a synonym of toppled. A coreference resolution system needs to understand synonyms like this because written English tends to avoid repeating the same words in sentences close to each other.
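
To make the idea concrete, here is a small sketch in which a synonym table connects the modifier of the definite noun phrase to words used about each candidate. The SYNONYMS table, the candidate descriptions, and the function name are all assumptions for illustration. A real system would rely on a much broader lexical resource and would also check that the head noun (leader) fits the candidate.

```python
from typing import Dict, List, Set

# Toy synonym table (an assumption for this sketch).
SYNONYMS: Dict[str, Set[str]] = {
    "ousted": {"toppled", "deposed", "overthrown"},
}


def match_by_synonym(modifier: str,
                     candidates: Dict[str, Set[str]],
                     synonyms: Dict[str, Set[str]]) -> List[str]:
    """Return candidates described by the modifier or one of its synonyms."""
    related = {modifier} | synonyms.get(modifier, set())
    return [entity for entity, words in candidates.items() if related & words]


# Candidate antecedents and words the text uses about each of them.
candidates = {
    "Sudan's new transitional government": {"new", "transitional"},
    "Omar Al Bashir": {"toppled"},
}
print(match_by_synonym("ousted", candidates, SYNONYMS))  # ['Omar Al Bashir']
```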

Current Coreference Resolution approaches of any kind would have a tough time with most of the examples above. But that doesn’t mean that Coreference Resolution algorithms are not effective. It’s just that they haven’t solved all the problems. It might be added in defense of the algorithms, though, that humans too occasionally have real trouble figuring out what entity a pronoun refers to, and this is generally due to sloppy or unclear writing.

Summary

While Coreference Resolution is critical for advanced Entity Extraction and in particular for Relationship Extraction and Event Extraction, it poses some of the hardest challenges in the field.