What is coreference resolution

Coreference Is a Common Phenomenon of Natural Language

We’ve already discussed various forms of text analytics in other blogs: Entity Extraction, Relationship Extraction, and Event Extraction. Here we would like to focus on one linguistic phenomenon in particular that needs to be handled well in order to analyze text accurately.

What’s the phenomenon? It’s simply that natural languages tend to avoid excessive repetition of some elements. For example, in English the first occurrence of a name in a document is typically something like Lionel Messi, and then subsequent mentions of the same individual will be Messi, he, him, or the famous soccer player. It’s highly unlikely that it’s going to be Lionel Messi all the way through. He and him are pronouns whereas the famous soccer player is a definite noun phrase. When a pronoun or a definite noun phrase refers to the same entity as another mention (whether it’s a name, another pronoun, or a noun phrase), they are said to co-refer. In linguistics, such pronouns and definite noun phrases are called anaphors, the mentions they refer to are called antecedents, and the process of figuring out which antecedent an anaphor refers to is called Coreference Resolution.

Coreference Resolution Is a Critical Piece of Entity Extraction

So what does all this have to do with entity extraction? Well, it turns out that linking up pronouns, definite noun phrases, and short forms of names with entities they refer to in the text is critical to being able to perform accurate relationship extraction and event extraction.

To take one example:

Last April Israel’s attorney general investigated Benjamin Levy and charged him with corruption.

As described in the previous “What is Event Extraction?” blog, the goal of Event Extraction is to take an unstructured text like the above and transform it into structured output like:

  • Event: Indict
    • Authority: Israel’s attorney general
    • Party Indicted: Benjamin Levy
    • Offense: corruption
    • Place: Israel
    • Time: Last April

In order to obtain this event extraction result, resolving him to the correct entity is crucial.
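
As a rough illustration of why this matters, the sketch below represents the event as a small Python record and swaps the pronoun for its resolved entity. The class name IndictEvent, the function apply_coreference, and the links dictionary are hypothetical names chosen for this example; they are not the output format of any particular product.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IndictEvent:
    """Hypothetical record mirroring the structured output shown above."""
    authority: str
    party_indicted: str
    offense: str
    place: str
    time: str


def apply_coreference(event, links):
    """Replace the mention in the party_indicted slot with its resolved entity.

    `links` maps a mention (e.g. a pronoun) to the entity it refers to.
    Mentions without a link are left unchanged.
    """
    resolved = links.get(event.party_indicted, event.party_indicted)
    return replace(event, party_indicted=resolved)


# What event extraction sees before the pronoun is resolved.
unresolved = IndictEvent(
    authority="Israel's attorney general",
    party_indicted="him",
    offense="corruption",
    place="Israel",
    time="Last April",
)

# Assumed output of a coreference step for this one sentence.
links = {"him": "Benjamin Levy"}

resolved = apply_coreference(unresolved, links)
print(resolved.party_indicted)
```

Without the link from him to Benjamin Levy, the Party Indicted slot would hold only a pronoun, which is of little use in a structured record.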

How Coreference Resolution Works

In order to get all the information about an entity scattered across multiple sentences in the original unstructured text unified into one structured output as in the example above, Entity Extraction first extracts the basic elements such as Israel’s attorney general, Benjamin Levy, and him. It figures out that the first item is a definite noun phrase, the second is a person name, and the third is a personal pronoun (with male gender).

Event Extraction then identifies the indict event. At this point, the system does not know what entity him refers to. Finally, Coreference Resolution resolves the pronoun him to Benjamin Levy and enables the system to determine that the indicted party is actually Benjamin Levy and not Israel’s attorney general.

To get a bit into the weeds of how a pronoun finds its antecedent: most Coreference Resolution algorithms rely, among other things, on two features when dealing with pronouns:

  • Distance: The closer the pronoun is to the preceding name or noun phrase (i.e., antecedent) it refers to, the more likely they refer to the same underlying entity.
  • Syntactic and semantic properties: A definite noun phrase and a pronoun are more likely to refer to the same entity if they share some properties, such as gender, number, syntactic role, semantic relationships, etc. Minimally, they should not conflict in basic properties like gender and number.

In the above example, the name Benjamin Levy is pretty close to him. For the second factor, though, we need to back up a bit: when Entity Extraction extracted the name Benjamin Levy, it assigned a gender, i.e., masculine, to it based on the first name Benjamin. Since him and Benjamin Levy match on this gender feature (and also the singular number feature for that matter), the algorithm can now establish a link between Benjamin Levy and him.
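
The two features can be sketched in a few lines of Python. This is a toy illustration only, under several assumptions: the Mention class, the token positions, and the GENDER_MATCH_BONUS weight are all invented for this example, and real systems combine many more features with learned weights.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    text: str
    position: int            # token offset where the mention starts
    gender: Optional[str]    # "masc", "fem", or None when unknown
    number: str              # "sing" or "plur"


# Hypothetical weight; a real system would learn this from data.
GENDER_MATCH_BONUS = 5


def compatible(pronoun: Mention, candidate: Mention) -> bool:
    """Candidates must not conflict with the pronoun in number or gender."""
    if pronoun.number != candidate.number:
        return False
    if pronoun.gender and candidate.gender and pronoun.gender != candidate.gender:
        return False
    return True


def score(pronoun: Mention, candidate: Mention) -> int:
    """Closer candidates score higher; an explicit gender match adds a bonus."""
    s = -(pronoun.position - candidate.position)
    if candidate.gender is not None and candidate.gender == pronoun.gender:
        s += GENDER_MATCH_BONUS
    return s


def resolve_pronoun(pronoun: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Pick the best-scoring compatible mention that precedes the pronoun."""
    viable = [c for c in candidates
              if c.position < pronoun.position and compatible(pronoun, c)]
    if not viable:
        return None
    return max(viable, key=lambda c: score(pronoun, c))


# Mentions from the example sentence, with assumed token offsets.
attorney_general = Mention("Israel's attorney general", 2, None, "sing")
levy = Mention("Benjamin Levy", 6, "masc", "sing")
him = Mention("him", 10, "masc", "sing")

antecedent = resolve_pronoun(him, [attorney_general, levy])
print(antecedent.text)
```

Here Benjamin Levy wins on both counts: it is nearer to him than Israel's attorney general is, and it carries a matching masculine gender, whereas the attorney general's gender is unknown.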

The actual Coreference Resolution processing is much more complex than this in real life (we have glossed over a few issues), and we have not even discussed coreference resolution for definite noun phrases, short forms of names, or entities other than people.

There Are Hard Problems in Coreference Resolution

Let us be clear, though, that Coreference Resolution still poses some serious challenges. Here are some hard cases found in news articles:

  1. Knowledge of the world
    • But days later, on 10 November, Mr. Morales stepped down and sought asylum in Mexico following an intervention by the chief of the armed forces calling for his resignation. He denounced the move as a “coup”.

Here there are two potential antecedents for his and He: Mr. Morales and the chief of the armed forces, but knowledge of the world allows us to understand both his and He as referring to Mr. Morales. That’s because we know that people call for other people’s resignations, not their own, so his must refer to Mr. Morales. We also know that when one denounces something, it’s usually something done by others, and that a coup is something carried out by the military, so He too must be Mr. Morales.

    • Naming him prime minister, Mr. Díaz-Canel praised Mr. Marrero in particular for his handling of relationships with foreign investors.

Here too there are two potential antecedents for his: Mr. Díaz-Canel and Mr. Marrero, but we understand his as referring to Mr. Marrero because we know that when a person praises another, it’s for something about the person being praised.

    • Hong Kong no longer enjoys a high degree of autonomy from China — a decision that could result in the loss of Hong Kong’s special trading status with the US and threaten the autonomous region’s standing as an international financial hub.

Here it’s world knowledge about Hong Kong that allows us to understand the autonomous region as a reference to Hong Kong rather than to China or the US.

  2. Ambiguity
    • Wednesday’s visit will mark the third time Cuomo has met with Trump since he took office.

Here, even some human readers may be unsure about who he refers to, since it could technically refer to either Cuomo or Trump.

  3. Cataphora
    • In his first major foreign tour since taking over as leader of the Palestinian Islamist movement in 2017, Ismail Haniya has sought to drum up support from allies and find new ones.

Coreference is not limited to references to an entity introduced earlier. A pronoun can also refer to an entity introduced later in the text, which is known as cataphora. Here his is a reference to Ismail Haniya, who is not named until later in the sentence.

  4. Direct speech
    • President Jair Bolsonaro said he would seek talks with Mr. Trump. “Their economy is not comparable with ours, it’s many times bigger. I don’t see this as retaliation,” Mr. Bolsonaro said in a radio interview with Brazil’s Radio Itatiaia. “I’m going to call him so that he doesn’t penalise us. Our economy basically comes from commodities, it’s what we’ve got,” he said.

In direct speech, I and you refer to different entities depending on who’s speaking. Here I is a reference to the speaker (Bolsonaro), while him and he inside the quotation refer to someone other than the speaker (Mr. Trump).

  5. Saliency
    • Fidel Castro led a communist revolution that toppled the Cuban government in 1959, after which he declared himself prime minister. He held the title until 1976, when it was abolished and he became head of the Communist Party and president of the council of state and the council of ministers. With his health failing, Castro handed power to his brother, Raúl, in 2006. He died in 2016.

Here the final He could technically refer to Fidel Castro or his brother Raúl (who is closer to He), but the saliency of Fidel Castro as the topic of the paragraph makes us understand He as a reference to Fidel Castro.

  6. Synonyms
    • Sudan’s new transitional government, brought to power after protesters toppled Omar Al Bashir, has been meeting with rebels who fought for years against their marginalisation by Khartoum under the ousted leader.

Here we know that the ousted leader refers to (toppled) Omar Al Bashir because ousted is a synonym of toppled.
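
To make two of these cases concrete, here is a minimal Python sketch of narrow heuristics for direct speech and saliency. Everything in it is an assumption made for illustration: the function names, the idea that the speaker of a quotation has already been identified, and the use of mention counts as a crude stand-in for saliency. It handles the two examples above and little else.

```python
from typing import Dict, List, Optional

FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def resolve_quoted_pronoun(pronoun: str, speaker: str,
                           mentioned: List[str]) -> Optional[str]:
    """Resolve a pronoun that occurs inside a quotation.

    First-person singular pronouns map to the speaker. Any other pronoun is
    assumed to refer to the most recently mentioned entity that is not the
    speaker. Gender and number are ignored to keep the sketch short.
    `mentioned` lists entities in order of mention, most recent last.
    """
    if pronoun.lower() in FIRST_PERSON_SINGULAR:
        return speaker
    others = [entity for entity in mentioned if entity != speaker]
    return others[-1] if others else None


def most_salient(chains: Dict[str, List[str]]) -> str:
    """Return the entity with the most mentions so far.

    `chains` maps each entity to the mentions already linked to it.
    Mention count is only a rough proxy for being the topic of a paragraph.
    """
    return max(chains, key=lambda entity: len(chains[entity]))


# Direct speech: the Bolsonaro example.
speaker = "Jair Bolsonaro"
mentioned = ["Jair Bolsonaro", "Mr. Trump"]
print(resolve_quoted_pronoun("I", speaker, mentioned))
print(resolve_quoted_pronoun("him", speaker, mentioned))

# Saliency: mentions linked to each entity before the final "He".
chains = {
    "Fidel Castro": ["Fidel Castro", "he", "himself", "He", "he", "his", "Castro", "his"],
    "Raúl": ["Raúl"],
}
print(most_salient(chains))
```

Rules like these are brittle. The saliency rule, for instance, would pick the wrong antecedent as soon as a paragraph shifted its focus to the less frequently mentioned entity.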

Current approaches of any kind would have a tough time with examples like these. But that doesn’t mean that Coreference Resolution algorithms are not effective. It’s just that they haven’t solved all the problems. It might be added in defense of the algorithms, though, that humans, as we saw in the ambiguity example, occasionally have real trouble figuring out what entity a pronoun refers to.