Do NLP Entailment Benchmarks Measure Faithfully?

I suggest they don’t.

Frederick A. Cook’s picture of Ed Barrill atop a peak claimed to be Denali but actually 15,000 ft lower.

This post may be read by itself or as the second installment of “The Emperor’s New Benchmarks”. That post looked at a benchmark of passing curiosity (pun detection), but now we consider a problem that commands the community’s ongoing veneration.

Several NLP benchmarks test for semantic understanding. One such is Recognizing Textual Entailment (RTE): Given two sentences, a premise P and a hypothesis H, decide if we can conclude H given P. This comes in two flavors. For binary RTE, the answer may only be yes or no. For ternary RTE a third possible answer is that P and H are contradictory.

Dagan et al. (2005) introduce RTE and make the case that solving this enables many NLP tasks such as question answering, information extraction and summarization.

I agree: being able to answer that question for any pair of arbitrary sentences in any context is exceedingly useful.

I question the ability of the current benchmarks to evaluate entailment. If the RTE benchmarks don’t faithfully measure strength on the actual RTE problem, we should be cautious of claims such as (Sammons et al., p. 4) “A system that performs well on these corpora could be said to have achieved a good “understanding” of natural language text”.

In a nutshell: the RTE benchmark aims to measure semantic ability of models, yet it only provides them with textual input P and H, divorced from the situation where P or H may have occurred. It is thus tacitly presumed that the words encompass enough meaning to pinpoint the relationship. But meaning depends on other factors, such as who addressed whom in what surrounding. Ascribing words the exclusive credit for meaningfulness is unduly generous. Much of what gives words meaning is outside words and is not spelled out. Were it not so, communication would become laborious, if not impossible.

I am not remotely the first to suggest that meaning is not fully contained by the words, nor am I the first to raise this concern for this task. In fact, Dagan et al. acknowledge part of this issue in their original paper (paragraph World Knowledge in section 4). I think that they grossly underestimate the extent.

I care.

Low fidelity benchmarks result in focusing scientific resources inappropriately. Consider the space of all possible models for a task. The benchmark provides a score for each, and thus represents a fitness landscape: better performing models have higher altitude, and the best performing models will be at the peaks.

In current NLP practice, a long procession of models climbs the hills of this fitness function. As an example, on many benchmarks, loftier heights have been scaled in quick succession by ULMFiT, ELMo, BERT, GPT-2, XLNet and RoBERTa.

A low-fidelity benchmark has peaks not aligned with the peaks of the original problem. These fake peaks form a suitable habitat for a class of models that may well be misfit on the peaks of the real fitness landscape. If we chase higher altitudes on the fake fitness landscape, we waste precious resources.

The lower the fidelity, the worse this becomes.

Is this worry idle? Consider, from that same paper by Dagan et al., this assessment of the submitted models: “Interestingly, system complexity and sophistication of inference did not correlate fully with performance, where some of the best results were obtained by rather naïve lexically-based systems.” This observation is essentially a paraphrase of the observation made many years later about the pun dataset (Miller et al., 2017): “though there exists a considerable body of research in linguistics on phonological models of punning (Hempelman and Miller, 2017) and on semantic theories of humor (Raskin, 2008), little to none of this work appeared to inform the participating systems,”. This is no coincidence.

That benchmarks can be gamed was demonstrated powerfully once more by Poliak et al. (2018). They built systems that did much better than chance on several entailment datasets of “does A imply B” challenges, even though the systems were never shown A at all.
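To make the idea of a hypothesis-only system concrete, here is a minimal sketch. It is not the method of Poliak et al.; it is a tiny Naive Bayes classifier of my own choosing, and the sentence pairs are invented toy data rather than examples from any real dataset. The point is only that the premise never reaches the classifier.

```python
from collections import Counter, defaultdict
import math

# Invented toy pairs for illustration only; not drawn from any real dataset.
# Each item is (premise, hypothesis, label).
TRAIN = [
    ("A man is playing a guitar on stage.", "A man is playing music.", "entailment"),
    ("A woman is reading in the park.", "A woman is outdoors.", "entailment"),
    ("Two kids are running on a beach.", "Kids are outside.", "entailment"),
    ("A man is playing a guitar on stage.", "Nobody is playing music.", "not_entailment"),
    ("A woman is reading in the park.", "The woman is not reading.", "not_entailment"),
    ("Two kids are running on a beach.", "Nobody is outside.", "not_entailment"),
]
TEST = [
    ("A dog is sleeping on a couch.", "A dog is resting.", "entailment"),
    ("A dog is sleeping on a couch.", "Nobody is resting.", "not_entailment"),
]


def tokens(text):
    """Lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()


def train_hypothesis_only(examples):
    """Count hypothesis words per label; the premise is deliberately ignored."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for _premise, hypothesis, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens(hypothesis))
    return word_counts, label_counts


def predict(hypothesis, word_counts, label_counts):
    """Naive Bayes with add-one smoothing over hypothesis words only."""
    vocab = {w for counts in word_counts.values() for w in counts}
    n_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_examples)
        for w in tokens(hypothesis):
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def hypothesis_only_accuracy(train, test):
    """Fraction of test pairs labeled correctly without ever reading a premise."""
    word_counts, label_counts = train_hypothesis_only(train)
    hits = sum(predict(h, word_counts, label_counts) == label for _p, h, label in test)
    return hits / len(test)


if __name__ == "__main__":
    print(hypothesis_only_accuracy(TRAIN, TEST))
```

On this toy data the classifier labels both held-out pairs correctly, simply because words like "nobody" show up more often in non-entailed hypotheses. Real datasets are less blatant, but the mechanism is the same kind of thing.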

Richard Feynman, in The Character of Physical Law (p. 50), states the law of Gravitation in three different ways “all of which are exactly equivalent but sound completely different”. The first version is the familiar “F=GMm/r²”, the second is the “field way” using potential, and the third is a path that minimizes a certain quantity. As he introduces each, he points out how they differ philosophically, for instance, in their reliance on action-at-a-distance. More notably, he points out how they differ psychologically: if we discovered that the law is not precise, we would want to tweak it; but the tweaks suggested by each of these formulations are different, rendering the theories psychologically unalike.

A computer model may be inaccurate and yet point the way. Its errors may highlight shortcomings and suggest specific fixes.

Strangely, the models that thrive at the peaks of the fake fitness landscape are not even wrong in interesting ways. They do not suggest the way forward. End-to-end deep learning models are oft criticized for their opaqueness.

Finally, claims of “super human performance” baffle me. If I can inoculate even a few readers against such highfalutin claims, I will be satisfied.

And I am grateful that folks do this. Having a great set of examples on which we can test models and learn is great. What I want to get from testing a model on a benchmark is a detailed sense of a model’s shortcomings. What I am criticizing is the current practice of just extracting one number from it — the accuracy. Perhaps we do not even get one number — we get a single bit: SOTA-or-Not.

My objections only apply to that setup. If a person is looking at the fit between model and data example by example, they would be able to recognize and skip dubious examples. But if the scores are just averaged and even the average is only seen by another algorithm — grid search — trouble ensues from the discrepancy between the actual task and the benchmark’s version of this task.
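As a small sketch of the difference between one number and a detailed picture, consider the following. The category tags and the per-example outcomes are hypothetical, invented purely for illustration; no benchmark ships with these labels.

```python
from collections import defaultdict

# Hypothetical per-example results: (category tag, was the model correct?).
# The tags and outcomes are invented for illustration.
RESULTS = [
    ("verbatim_substring", True),
    ("verbatim_substring", True),
    ("verbatim_substring", True),
    ("verbatim_substring", True),
    ("needs_world_knowledge", False),
    ("needs_world_knowledge", True),
    ("hedged_premise", False),
    ("hedged_premise", False),
]


def overall_accuracy(results):
    """The single number that usually gets reported."""
    return sum(correct for _tag, correct in results) / len(results)


def accuracy_by_category(results):
    """The same results, broken down by the kind of example."""
    buckets = defaultdict(list)
    for tag, correct in results:
        buckets[tag].append(correct)
    return {tag: sum(flags) / len(flags) for tag, flags in buckets.items()}


if __name__ == "__main__":
    print("overall:", overall_accuracy(RESULTS))
    for tag, acc in accuracy_by_category(RESULTS).items():
        print(tag, acc)
```

With these made-up numbers the overall accuracy is 0.625, which hides the fact that the model is perfect on the trivial examples and fails every hedged one. A grid search that sees only the average cannot tell these situations apart.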

We begin by looking at how meaning is not fully captured by words alone. We then look at how these issues cloud the RTE benchmarks.

Extra-textual Meaning

Inert marks on a page and mere sounds get suffused with meaning as we read and listen. While the words themselves trigger this genesis of meaning, they do not fully determine it. Other contributors are our knowledge, the situational context, the words’ connotations, and the active piecing together we engage in. Just as DNA needs the cytoplasmic environment to be expressed, and just as it produces different proteins depending on the specifics of the surrounding chemical soup, words become meaningful in the right environments, and the same words may mean different things in different contexts and different minds.

Let’s examine four ingredients of meaning making.

I don’t speak Estonian, and an Estonian sentence means nothing to me. More pertinently, I don’t speak Pharmacology, Nephrology, or Eschatology, although I recognize a few phrases. Words that paint a vivid picture for you may mean nothing to me, or sketch something incomplete and misleading.

The golden data for the benchmark comes from human raters who are all too human. We may wish they judge a sentence based on the “real” meaning, but they judge it based on the meaning they see. They easily judge as equivalent the phrases “I live in New York” and “I live in NYC” but may falter at the equivalence of “I take Tylenol daily” and “Every day I take acetaminophen”. Example sentences from RTE datasets dabble in medicine, astronomy, ethics and international diplomacy, and some raters must trip.

Another knowledge-requiring sentence pair for ternary RTE: “In this topology, X is an open set” and “In this topology, X is a closed set”. In topology, a set can be both closed and open but the terminology suggests otherwise — an endless source of confusion to students beginning topology. Most raters will mark those sentences as contradictory.

As a fire prevention engineer, Benjamin Lee Whorf investigated a fire caused by workers throwing a lit match into a gasoline drum labeled Empty. Whorf argued that the label lulled the workers into thinking the drums were free of gasoline fumes. The word open likewise suggests the opposite of closed, though those two words are not in this complementary relationship. Pun intended: in a topology, closed sets are defined to be the set-complement of open sets.

Raters’ knowledge and expertise influence how they judge the relationship among sentences.

You have uttered this sentence more than once: “Thursday.” What you conveyed each time was not identical, nor even the same “speech act”: sometimes you provided information, sometimes made a suggestion, sometimes corrected. The sentence, divorced from the situation, thus cannot tell the full story of the intended meaning.

Some will dismiss this worry by claiming that such sentences are few and that we should not let the perfect be the enemy of the good. But every sentence is like the sentence “Thursday” to varying degrees. Even long sentences elide information and assume background information. They harken back to the prior sentence or paragraph or chapter and to common ground established with the reader.

At the gym, my trainer uttered the same sentence twice, twenty minutes apart. He said: “Lemme get you a small dumbbell”, and returned once with a pair of 10-lb dumbbells, and the other time with a pair of 40-lb ones. The same phrase by the same speaker to the same listener in the same location meant different things, because we were working on different muscle groups. The words themselves give partial clues, and it is through the interaction of those clues with the surroundings that meaning is ascertained, or rather, guessed. Guessed well, typically, but guessed all the same. By the way, are these two sentences paraphrases: “Lemme get you a small dumbbell” and “Lemme get you a small dumbbell”?

This conversational context extends to discourses happening in the society at large. Topics in RTE datasets involve WMD, illegal immigration, death penalty, euro as currency — it is impossible that the raters have been spared the extended societal conversation, and each rater has partaken of a different slice.

The phrase “Roe v. Wade” now means much more than the literal meaning of a lawsuit involving Roe and Wade. It polarizes. It connotes freedom, connotes murder, connotes judicial overreach, connotes necessary protection from overreach of religious zeal. Connotations make meaning subjective. Are these two sentences paraphrases: “Mike is John’s son” and “Mike is John’s brat”?

Connotations live in the subtlest aspects, ripe for harvesting by politicians adept in using dog whistles. George W. Bush chose his words:

Bush appealed to women with sentences that began with “I understand” and he repeated words such as “peace” and “security” and “protecting”. For the military, he used “Never relent” and “Whatever it takes” and “We must not waver” and “Not on my watch”. For Christians, he began sentences with “and”, just as the Bible does: “And in all that is to come, we can know that His purposes are just and true”. (Thank you for Arguing, by Jay Heinrichs, p. 253)

We read between the lines all the time, but each of us hallucinates in our own way, captured well by the cliche “one person’s terrorist is another’s freedom fighter”. The meaning raters suffuse into the phrase “drug legalization” will influence whether they accept as logical the conclusion “Drug legalization is going to be amazing!”, no matter what the premise.

I have saved for last what I consider the central ingredient. Words don’t come pre-labeled with their precise sense — our brains piece together the meaning. Watch yourself as you understand this description of the Wild West: it is a place where men are men. The two instances of men cannot be identical in meaning, and yet the words do not come marked with the precise sense meant — we add the meaning to it.

In making sense, we rummage through meanings until we stumble upon a combination that makes sense. Along the way, we may bend and stretch word meanings, consider the possibility that we misheard and correct accordingly, do blending as suggested in the book The Way We Think (Fauconnier and Turner), and do other violence illustrated by these two personal anecdotes.

Anecdote #1: When my daughter Suhana was 4, we lived near New York and on January 1 were taking the train to the city. Many folks wished us “Happy New Year”, and so Suhana also started wishing folks “Happy New York”. This is a classic case of a capture error: when you use resources at your disposal to make sense. New York was in her repertoire whereas New Year wasn’t, ergo the capture of the latter by the former.

Anecdote #2: When I was six, I did not know Hindi but was exposed to Bollywood songs thanks to my elder brother, Ashish. One song from the movie Yarana (Friendship) has this line “You picked me from ashes (khak) and sat me on a pedestal (phalak)” — complex ideas for a six-year-old non-native. A decade later, when I was a Hindi speaker, I discarded my confident knowledge that the line was “You picked me from the wooden bed (khat) and sat me on a metal bed (palang)”. In India of the time, the wooden bed was rural and the metal bed more urban and modern, and thus the transition represented a promotion, consistent with the tone of the song.

A lack of existing hooks (or manufacturable hooks) on which to tie meaning leads to bafflement, as happens in this Douglas Adams sentence: “In those days spirits were brave, the stakes were high, men were real men, women were real women, and small furry creatures from Alpha Centauri were real small furry creatures from Alpha Centauri.”

I cannot stress enough this process of adding meaning. Raters judging sentence pairs for entailment craft their own meaning. Subjectivity — nemesis of science? — rears its unwelcome head. The listener refuses to take her leave, marring the clean simplicity of a text-only benchmark.

The Extra-Textual Trespasses into RTE

I downloaded the RTE data from the website for SuperGLUE (Super General Language Understanding Evaluation).

Please note: For ease of exposition, I only chose examples where the data claims that the hypothesis is entailed by the premise. I thus label hypotheses as entailed hypotheses.

An attempt was made to get clean, unambiguous data. Dagan et al. report that only examples with 100% rater agreement were kept. This need for objectivity can lead to trivial conclusions from the premises. In several examples, the hypothesis is directly lifted from the premise. For example:


Premise: A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI.

Entailed Hypothesis: Pope John Paul II died.

I gently suggest that answering this correctly as entailment should not be a model’s ticket into the club “achieved a good understanding of natural language text”. The premise is a complex sentence, fully understanding which involves the notions of death, sorrow, Christianity, celebration on installing a new Pope, and so forth.

Several examples, while not direct sub-strings, depend for their answer on a pair of contiguous words in a pattern repeated often enough that an ML algorithm will wise up to it. For example, #829 has a behemoth 100-plus-word premise containing the phrase “Louisville, Kentucky” and the entailed hypothesis is “Louisville is in Kentucky”. These are syntactic transformations that do not deeply probe understanding.
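Both kinds of shortcut are easy to detect mechanically, which is part of why they worry me. The sketch below is my own illustration, not something from the benchmark's tooling. The function names are hypothetical, and the short "Louisville" premise is an invented stand-in, since the real premise of #829 is much longer than I want to reproduce here.

```python
import re

POPE_PREMISE = (
    "A place of sorrow, after Pope John Paul II died, became a place of "
    "celebration, as Roman Catholic faithful gathered in downtown Chicago "
    "to mark the installation of new Pope Benedict XVI."
)
POPE_HYPOTHESIS = "Pope John Paul II died."

# Invented stand-in premise; the real one is a 100-plus-word paragraph.
LOUISVILLE_PREMISE = "The meeting was held in Louisville, Kentucky, last week."
LOUISVILLE_HYPOTHESIS = "Louisville is in Kentucky"


def normalize(text):
    """Lowercase and strip punctuation so comparisons ignore surface details."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def is_lifted(premise, hypothesis):
    """True if the hypothesis tokens occur as a contiguous run in the premise."""
    p, h = normalize(premise), normalize(hypothesis)
    return any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1))


def city_state_rewrite(premise, hypothesis):
    """True if the hypothesis is 'X is in Y' and the premise contains 'X, Y'."""
    m = re.fullmatch(r"(\w+) is in (\w+)\.?", hypothesis.strip())
    return bool(m) and f"{m.group(1)}, {m.group(2)}" in premise


if __name__ == "__main__":
    print(is_lifted(POPE_PREMISE, POPE_HYPOTHESIS))
    print(city_state_rewrite(LOUISVILLE_PREMISE, LOUISVILLE_HYPOTHESIS))
```

Neither function knows anything about popes, death, or geography. If a few lines of string matching can flag an example as entailed, answering that example correctly says little about understanding.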

There are many examples where the argument will make sense for some people and be a non sequitur for others. Arguments rely on warrants. The argument “Aristotle was a person, hence he was a mortal” depends on the belief that people are mortal. Without that belief or its equivalent, the conclusion does not follow. Several examples in the benchmark depend on warrants, but it is simply not true that everyone holds the same prior beliefs.


Premise: If legalization reduced current narcotics enforcement costs by one-third to one-fourth, it might save $6 — $9 billion per year.

Entailed Hypothesis: Drug legalization has benefits.

Note the hedges in the premise (if, might) and the bald-faced simplicity of the entailed hypothesis. Is this entailment or a non sequitur? How would raters judge this if they vividly foresee the deep pain and trauma and lawlessness and debauchery to be unleashed by legalization and therefore conclude that $9 billion is but pocket change?


Premise: The Herald is now pleased that Kennedy can vigorously pursue a re-examination of the Iraq war and how to extricate our nation from it. Speaking seriously, I’ll take this change of heart as a sobering indication that all of us, wherever we stand on the liberal-conservative continuum, are deeply concerned about the damaging impacts of the Iraq war and are looking for ways to work together to resolve it as soon as possible.

Entailed Hypothesis: All are deeply concerned about the damage of the Iraq war and are looking for ways to resolve it.

I am not convinced that everybody is deeply concerned and looking to resolve the crisis. I beseech us to think hard before expanding democratic voting rights to an AI if it is going to conclude X given that “Politician said X”. Most people, when they hear “X said Y”, tend to piece together what may actually be meant, what the motives and knowledgeability of the speaker are, and whether the message was garbled through the grapevine. We don’t always apply this critical mindset to the political domain, much less to the scientific, but we excel at gossip and spotting deception. When a politician speaks, the hedge words may carry more of the message than the “content words”.


Premise: In 1969, more than 500 million people around the world sat in front of television sets, watching grainy images of two men in white bulky spacesuits planting a U.S. flag on the lunar landscape with its black horizon.

Entailed Hypothesis: The Apollo astronauts waved the American flag on the moon in 1969.

No, the premise did not say they waved the flag. And certainly the flag did not wave by itself on the moon. There are enough conspiracy theories involving waving lunar flags without our feeding the trolls. On earth, when you plant a flag, to be sure, it waves; waving and planting flags are thus overlapping activities here; not so on the moon. Word meanings matter.


Premise: For Bechtolsheim, who designed the prototype for the first Sun workstation while he was a Birkenstock-shod Stanford University graduate student in 1982, the new line of computers, code named Galaxy, is a return to the company’s roots.

Entailed Hypothesis: The Sun workstation was created by Bechtolsheim.

“Designed the prototype” does not equal “created the Sun workstation”. Designing the prototype does not even mean building the first prototype. And it does not mean that the design survived into the product. In this particular case, part of the design was replaced by commercial components from 3Com. And Bechtolsheim did not work alone. Oh, and by the way, I created the Google Search Engine.

Our knowledge of how things work informs our conclusions.

One final example:


Premise: Ocean colour satellite remote sensing has developed rapidly within the last five years and satellite imagery is now processed automatically and made available via the WWW.

Entailed Hypothesis: Ocean remote sensing is developed.

I am sorry, but no, that hypothesis makes no sense. Self-driving has rapidly developed, but it is not yet developed. “Developed nation” is different from “developing nation” even if the latter has developed rapidly within the last five years. Also, I hope that developed nations are also developing. Sorry, tense matters. Modals matter. Hedges matter.

It is frustrating to me how tenses and modals have been summarily dropped. Meaning resides outside words, yes, but it also resides inside, and connecting words and punctuation and such guide the process of piecing together the meaning. Without such guides we end up with a celebrity who finds inspiration in cooking her family and her dog.

Almost every example in the dataset has a rich and complex premise, and all sorts of inferences can be and are drawn by readers. The hypothesis, by contrast, tends to be watered down, de-tensed, de-modaled. An outsider, seeing high scores on this dataset and impressed by the complex language in the premise that the models digest so brilliantly, will fall prey to the Eliza effect and read more ability and depth into models that have nothing of the sort.

I don’t believe that this dataset comes close to evaluating a model’s ability at spotting entailment. If you disagree, and you have looked at and thought through several of the actual examples in the dataset, I would love to hear your arguments.

I do unsupervised concept discovery at Pinterest (and previously at Google). Twitter: @amahabal