Do NLP Entailment Benchmarks Measure Faithfully?

Frederick A. Cook’s picture of Ed Barrill atop a peak claimed to be Denali but actually 15,000 ft lower.

Recognizing Textual Entailment

Several NLP benchmarks test for semantic understanding. One such is Recognizing Textual Entailment (RTE): Given two sentences, a premise P and a hypothesis H, decide if we can conclude H given P. This comes in two flavors. For binary RTE, the answer may only be yes or no. For ternary RTE a third possible answer is that P and H are contradictory.
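To make the two flavors concrete, here is a minimal sketch of how an RTE item could be represented. The names `TernaryLabel`, `RTEExample`, and `to_binary` are my own for illustration and do not come from any benchmark's code. The sentence pair is invented, not taken from a dataset.

```python
from dataclasses import dataclass
from enum import Enum


class TernaryLabel(Enum):
    """Possible answers in ternary RTE (label names are illustrative)."""
    ENTAILMENT = "entailment"        # H can be concluded given P
    UNKNOWN = "unknown"              # H neither follows from nor contradicts P
    CONTRADICTION = "contradiction"  # P and H are contradictory


@dataclass(frozen=True)
class RTEExample:
    premise: str
    hypothesis: str
    label: TernaryLabel


def to_binary(label: TernaryLabel) -> bool:
    """Collapse a ternary label to the binary yes/no answer.

    Only ENTAILMENT maps to "yes"; both other labels map to "no".
    """
    return label is TernaryLabel.ENTAILMENT


# An invented pair, used only to show the shape of the data.
example = RTEExample(
    premise="A dog is sleeping on the porch.",
    hypothesis="An animal is on the porch.",
    label=TernaryLabel.ENTAILMENT,
)
```

Binary RTE is then just the ternary task with the "unknown" and "contradictory" answers merged into a single "no".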

My Worry and Argument in a Nutshell

I question the ability of the current benchmarks to evaluate entailment. If the RTE benchmarks don’t faithfully measure performance on the actual RTE problem, we should be cautious of claims such as this one from Sammons et al. (p. 4): “A system that performs well on these corpora could be said to have achieved a good ‘understanding’ of natural language text.”

Why I Write This

I care.

Benchmark Creation is Hard Work

And I am grateful that folks do this. A good set of examples on which we can test models and learn is valuable. What I want from testing a model on a benchmark is a detailed sense of the model’s shortcomings. What I am criticizing is the current practice of extracting just one number from it: the accuracy. Perhaps we do not even get one number; we get a single bit: SOTA-or-Not.
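A small sketch of the difference between one number and a detailed picture. Everything here is made up for illustration: the phenomenon tags ("lexical", "negation"), the gold answers, and the two models' predictions. The point is only that two models can share an accuracy while failing in quite different ways.

```python
from collections import defaultdict


def accuracy(gold, predicted):
    """Fraction of items where the prediction matches the gold answer."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def accuracy_by_tag(tags, gold, predicted):
    """Accuracy computed separately for each phenomenon tag."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tag, g, p in zip(tags, gold, predicted):
        totals[tag] += 1
        hits[tag] += int(g == p)
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Hypothetical four-item benchmark; True means "H follows from P".
tags = ["lexical", "lexical", "negation", "negation"]
gold = [True, False, True, False]

model_a = [True, False, False, True]  # right on lexical items, wrong on negation
model_b = [True, True, True, True]    # always answers "yes"

for name, predictions in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(gold, predictions), accuracy_by_tag(tags, gold, predictions))
```

Both hypothetical models score 0.5 overall, yet the per-tag breakdown tells two different stories, which is the kind of detail a single accuracy figure hides.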


We begin by looking at how meaning is not fully captured by words alone. We then look at how these issues cloud the RTE benchmarks.

Extra-textual Meaning

Inert marks on a page and mere sounds get suffused with meaning as we read and listen. While the words themselves trigger this genesis of meaning, they do not fully determine it. Other contributors are our knowledge, the situational context, the words’ connotations, and the active piecing together we engage in. Just as DNA needs the cytoplasmic environment to be expressed, and just as it produces different proteins depending on the specifics of the surrounding chemical soup, words become meaningful in the right environments, and the same words may mean different things in different contexts and different minds.

The Listener’s Knowledge

I don’t speak Estonian, and an Estonian sentence means nothing to me. More pertinently, I don’t speak Pharmacology, Nephrology, or Eschatology, although I recognize a few phrases. Words that paint a vivid picture for you may mean nothing to me, or sketch something incomplete and misleading.

The Conversational Context

You have uttered this sentence more than once: “Thursday.” What you conveyed each time was not identical, nor even the same “speech act”: sometimes you provided information, sometimes made a suggestion, sometimes corrected. The sentence, divorced from the situation, thus cannot tell the full story of the intended meaning.


The phrase “Roe v. Wade” now means much more than the literal meaning of a lawsuit involving Roe and Wade. It polarizes. It connotes freedom, connotes murder, connotes judicial overreach, connotes necessary protection from the overreach of religious zeal. Connotations make meaning subjective. Are these two sentences paraphrases: “Mike is John’s son” and “Mike is John’s brat”?

Meaning-Creation is Active

I have saved for last what I consider the central ingredient. Words don’t come pre-labeled with their precise sense; our brains piece together the meaning. Watch yourself as you understand this description of the Wild West: it is a place where men are men. The two instances of “men” cannot be identical in meaning, and yet the words do not come marked with the precise sense meant. We add the meaning to them.

The Extra-Textual Trespasses into RTE

I downloaded the RTE data from the SuperGLUE website. (SuperGLUE is the successor to GLUE, the General Language Understanding Evaluation benchmark.)
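For readers who want to poke at the data themselves, here is a rough sketch of reading it. I am assuming the download is a JSON Lines file in which each record has `premise`, `hypothesis`, `label`, and `idx` fields, with `label` being `entailment` or `not_entailment`; check your copy, since the layout may differ. The two records below are invented stand-ins, not benchmark items, and `parse_rte_jsonl` is a helper name of my own.

```python
import json
from collections import Counter


def parse_rte_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts.

    Blank lines are skipped. The field names used later are assumptions
    about the file layout, not something this function enforces.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
    return records


# Invented stand-in records in the assumed layout.
sample = [
    '{"premise": "Ann bought a red bicycle.", "hypothesis": "Ann bought a bicycle.", "label": "entailment", "idx": 0}',
    '{"premise": "Ann bought a red bicycle.", "hypothesis": "Ann sold a car.", "label": "not_entailment", "idx": 1}',
]

records = parse_rte_jsonl(sample)
label_counts = Counter(record["label"] for record in records)
print(label_counts)
```

Because an open file is itself an iterable of lines, the same function can be handed a file object (for example one opened on a hypothetical `train.jsonl`) instead of the `sample` list.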

Objectivity is Attempted …

An attempt was made to get clean, unambiguous data. Dagan et al. report that only examples with 100% rater agreement were kept. This need for objectivity can lead to examples whose conclusions follow trivially from the premises. In several examples, the hypothesis is directly lifted from the premise. For example:

… but Subjectivity Persists

There are many examples where the argument will make sense for some people and be a non sequitur for others. Arguments rely on warrants. The argument “Aristotle was a person, hence he was a mortal” depends on the belief that people are mortal. Without that belief or its equivalent, the conclusion does not follow. Several examples in the benchmark depend on warrants, but it is simply not true that everyone holds the same prior beliefs.
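The dependence on warrants can be sketched with a toy inference procedure. This is my own illustration, not how any RTE system or annotator works: facts are `(predicate, entity)` pairs, a warrant is a `(antecedent, consequent)` pair of predicates, and `follows` is a hypothetical helper that forward-chains until nothing new can be derived.

```python
def follows(facts, hypothesis, warrants):
    """Return True if the hypothesis is derivable from the facts.

    facts:      set of (predicate, entity) pairs taken from the premise
    hypothesis: a single (predicate, entity) pair
    warrants:   set of (antecedent, consequent) predicate pairs, standing
                for beliefs such as "every person is mortal"
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        # Iterate over a copy so that new facts can be added safely.
        for predicate, entity in list(known):
            for antecedent, consequent in warrants:
                derived = (consequent, entity)
                if predicate == antecedent and derived not in known:
                    known.add(derived)
                    changed = True
    return hypothesis in known


premise_facts = {("person", "Aristotle")}
hypothesis = ("mortal", "Aristotle")

reader_with_warrant = {("person", "mortal")}  # believes people are mortal
reader_without_warrant = set()                # holds no such belief

print(follows(premise_facts, hypothesis, reader_with_warrant))
print(follows(premise_facts, hypothesis, reader_without_warrant))
```

The premise and hypothesis are identical in both calls; only the reader's beliefs change, and with them the answer to "does H follow from P?".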

In Conclusion

It is frustrating to me how tenses and modals have been summarily dropped. Meaning resides outside words, yes, but it also resides inside, and connecting words and punctuation and such guide the process of piecing together the meaning. Without such guides we end up with a celebrity who finds inspiration in cooking her family and her dog.



Abhijit Mahabal


I do unsupervised concept discovery at Pinterest (and previously at Google). Twitter: @amahabal