The Emperor’s New Benchmarks

The Dangers of NLP Benchmark Oversimplification

Abhijit Mahabal
15 min read · Jun 26, 2019


Language is complex. For NLP, we often use simplified benchmarks, relegating phenomena such as metaphor, metonymy, and ungrammatical usage to future work. We also simplify in other ways. Surely an innocuous step — doesn’t science proceed from the simple to the complex?

This post documents the damage simplistic benchmarks cause. While I had grand designs for this post, I will scale down to one particular benchmark, pun detection, and follow up with other posts for other benchmarks.

Methodology: To illustrate the dangers I perceive, I use published concrete examples from reputable venues. Singling out papers is not my style. But stripped of specific examples, arguments are toothless. I do not intend to criticize these in particular. The criticisms apply to the field, and these papers are merely the tip of the zeitgeist.

Outline: I first lay the groundwork with examples of puns and then turn to the NLP version of pun detection. The latter is but a caricature of the former, and I point out the lurking dangers and show with particular citations how they become a reality here. Puns are not the only realm where we NLPers cut corners. I will examine other tasks in later posts, but I touch upon a few here.

Pun Detection

Puns are personal. I wallow in puns, and these my friends have learned to endure. My Indian friends also suffer my cross-language puns. My day job is in the same space: I work with linguistic nuance. And so when I saw in the NAACL 2019 schedule a talk about computational models of puns, of course I went. This post is the direct result.

A few words about puns first. Never hurts to understand the intricacies of what we aspire to model! Puns come in many molds, and The Comic Encyclopedia (Esar, 1978) lists dozens of variations including double puns, bilingual puns, and Freudian puns. Below, I present three puns I came across, one dubious pun, and a question: if we do not spot alternate meanings in a sentence, is it still a pun, or must a pun’s multiple senses be accessible to most native speakers?

Three Puns

Contemplate for a moment the following credit union advertisement I ran into a dozen years ago: “The best bank because it is not a bank.” Pause and let that sink in. What aspects of bank does the first occurrence pick out? And the second?

Each bank picks out different facets of financial banks without spelling out what those facets are. The sentence is nebulous, as advertisements are, but based on discussions of that sentence with dozens of individuals, people are not arbitrary in which facets they associate with each bank. The features attributed to the first bank include utilitarian aspects such as saving and writing cheques. Aspects attributed to the second bank include the impersonal persona of enormous corporations and exorbitant fees.

It is fascinating that we understand that sentence. Understanding how we understand it will illuminate our mental processes and what meaning is. And speaking of understanding, how close are computers?

Nowhere close. Various computer models named after Sesame Street characters have been making solid progress on NLP benchmarks, but this sentence does not offer enough contextual cues for the current Elmos and BERTs to sink their monster teeth into. And WordNet, which many researchers tap for their semantic needs, is too coarse-grained to handle such subtlety: it does not have two separate senses for the two banks above (although it does make the far coarser distinction between the financial bank and a river bank).
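To make that coarseness concrete, here is a minimal sketch (assuming NLTK and its WordNet corpus are installed; the glosses in the comments are abridged) that simply lists WordNet’s noun senses for bank. The inventory separates the river bank from the financial institution, but offers nothing like the two facets the advertisement plays on.

```python
# A minimal sketch, assuming NLTK and its WordNet corpus are available
# (nltk.download("wordnet") may be needed).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank", pos=wn.NOUN):
    print(synset.name(), "::", synset.definition())

# Among the senses printed are (glosses abridged):
#   bank.n.01 :: sloping land beside a body of water
#   depository_financial_institution.n.01 :: a financial institution that accepts deposits ...
# Nothing separates "the place where I keep my savings" from
# "the faceless fee-charging corporation" that the advertisement contrasts.
```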

A second example comes from my thesis adviser, Douglas Hofstadter. Doug has authored books in several languages and translated from a few more: English, French, Italian, Russian, Swedish, and Chinese, at least. Despite this, he calls himself a pilingual, for, you see, he does not fully know each of those languages to its idiomatic core, and the fractions total pi. He sometimes confesses, however, that pilingual overstates his prowess and that he is more of an e-lingual.

Note that e-lingual is two hops away from bilingual: The first hop from bilingual to pilingual is based on similar sounds, and the second hop to e-lingual uses semantic similarity between the two numbers pi and e.

A final example: “He is a man of few words, but he keeps repeating them.” This insult hinges upon two separate meanings not of one word but of an entire phrase, man of few words, one used idiomatically and the other used literally. The literal sense is, curiously, the atypical and surprising usage.

What those illustrations of good puns share is humor and a grain of truth. We have encountered organizations of different sizes, recognize from experience the trickiness of estimating prowess in a foreign tongue, and know repetitive bores.

A Dubious Pun

“My name is John, and I am a plumber.” I hope you agree that that is not the best pun. Is it even a pun for anyone who made it through kindergarten? Someone has called puns the lowest form of humor, but even puns are more dignified than that.

A Question

Is being punny a property of a sentence or of whether people catch the distinct senses? If the multiple meanings of a sentence go unseen, is it a pun? As two examples, consider these: chicken fried rice and end user agreement. Puns or not? Did you notice that one interpretation involves Mr. Chicken standing by the cooking range frying rice? When you encountered this phrase on a restaurant menu, were you aware of the multiple senses? Was it a pun there? As for the end user example, did you notice that end can be a verb with user as its object? Is that a pun? Did you ink such an agreement with an AI?

To my mind, a pun worth the name is intentional (hence the phrase no pun intended, which nixes unintended readings). Contexts typically rule out all but one meaning of a phrase. Such ruling out happens with “My name is John”, where the toilet-fixture sense of the term John is inaccessible because of “My name is”, rendering it a poor example of a pun.

Pun Detection, NLP Edition

NLP tasks are a tad watered down. The 2017 edition of SemEval — a contest where teams submit computer programs for solving a semantic language understanding challenge — featured pun detection (Miller, Hempelmann and Gurevych, 2017).

The organizers scoured joke books to collect puns. Admission into the dataset is restricted: no sentences with two or more puns allowed, nor sentences with multi-word or hyphenated puns, nor punny words absent from WordNet. All severe restrictions.

The “only one pun word” stipulation is baffling because if we change what a word means, we also change what some other words in the sentence mean. The dataset has “The lingerie thief gave the police officers a slip”, where the term slip is the punny word. But if we change slip to mean escaping instead of a piece of lingerie, we also change give from physical giving to a light-verb construction that has nothing to do with transferring ownership. It is rare that a lone word changes meaning while leaving the rest untouched.

These restrictions make the task a simplified version of the original. Moreover, given the joke-book origin of these puns, they are a biased sample.

Also illuminating are the puns that make it to the dataset. Our classic pun makes the cut: “My name is John and I am a plumber.” That, and many more of its ilk, including “My name is Chuck and I am a butcher”, “My name is Emmy and I am a T.V. star”, and the even more roll-on-the-floor-hilarious “My name is Sandy and I am a lifeguard”.

Why simplify? Several practical reasons. Simplified problems can be beneficial if they preserve the spirit of what is challenging in the original problem while stripping it of the peripheral confounds, thus isolating the bits we wish to study. What to consider central and what superfluous is largely an outcome of the theories the researchers hold. But once you name the task with a commonly understood phrase, such as “pun detection”, you forfeit the ability to mold it any which way — it has to conform to some expectations.

And then there are pragmatic reasons based on how to feed this data to computers. To collect data and get it annotated by raters, the task has to be discrete and not too contentious. Can’t let raters get too creative and original here. I phrase this as “Many fine problems have been sacrificed at the altar of inter-rater reliability” — but that is a rant for another time. Discretized data is simpler to feed to computer programs and simpler to quantify. In other words, easier to science.

But by simplifying, we change the problem we undertake. This is no longer pun detection in general. This innocuous simplification brews trouble. I will point out five classes of ill-effects.

  1. The simplification unintentionally adds artifacts that algorithms can exploit. These are NOT features of the original problem but of the simplification, and so they do not generalize to the full problem. Just to be clear: I support exploiting the properties of the original problem. If the original problem had been “break this cipher” and you used the fact that the letters of English ordered by frequency are ETAOINSHRDLU, that is honest. It would be improper, however, if you exploited the incidental property of the test data that each example starts with the word “the”. Unfair and not generalizable. For puns, I will present evidence that this has happened in more than one instance.
  2. These accidental exploitable artifacts can overpower the legitimate and more generalizable signals. As an illustration, in the pun data, the presence of the word “Tom” is a foolproof predictor of punhood, and the pattern “My name is X” accurately pinpoints the pun every single time. Imagine a program that utilizes these features as well as the explicit decrees such as “at most one pun” and “only bother to consider words in WordNet” (a minimal sketch of such a degenerate detector follows this list). Such a program is grotesquely overfit and would perform badly on examples not from the target data. By contrast, a general system will struggle to beat it on this limited dataset. The last paragraph of the paper summarizing the results of the contest (Miller et al., 2017) wonders why, “though there exists a considerable body of research in linguistics on phonological models of punning (Hempelmann and Miller, 2017) and on semantic theories of humor (Raskin, 2008), little to none of this work appeared to inform the participating systems”. Well, duh, the simplification artifacts stacked the deck against systems innocently solving the unabridged problem. Why bother with semantic theories of humor when you could just as fruitfully look for the presence of the word “Tom”?
  3. It lulls us into a deceptive sense of progress. A high score on this benchmark does not equal progress in understanding puns or in understanding how NLP systems should deal with them. A good system might score well here, but a tremendous score can happen for reasons with zilch correlation with performance on the unabridged problem. Well-performing solutions may contribute no insights into the phenomena being studied.
  4. What follows when someone publishes state-of-the-art (SOTA) results on this simplified task? Reviewers will require that later research compare its methods on this dataset and against this SOTA, even if this makes no sense. For one of our papers accepted to ACL 2019, we had to wade through such a situation and had to report results on a surreal dataset. In that light, the last stanza of the poem “The Bridge Builder” by Will Allen Dromgoole loosely fits this scenario in ways unanticipated by the poet (are these words, then, a pun?): “There followed after me to-day / A youth whose feet must pass this way. / This chasm that has been as naught to me / To that fair-haired youth may a pitfall be; / He, too, must cross in the twilight dim; / Good friend, I am building this bridge for him!”
  5. The final point is surreal. A paper at NAACL 2019 treated the puns dataset as THE pun-understanding problem: this was the proximate trigger for this post. The simplified problem has supplanted the fuller original problem. I illustrate this turn of events below with words lifted from that paper. Goodhart’s law applies here: when a measure becomes a target, it ceases to be a good measure. If this phenomenon of misinterpreting a dataset as the real problem is as widespread as I suspect, the implications are not pleasant.
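Here is that minimal sketch. It is my own caricature, not any contestant’s actual code; the fallback at the end is a hypothetical placeholder, and the whole thing never consults meaning, only the surface artifacts listed above.

```python
# A caricature of a pun "detector" built purely from dataset artifacts.
# Not any submitted system; the fallback is a hypothetical placeholder.
def looks_like_a_pun(sentence: str) -> bool:
    tokens = [t.strip('.,!?;"') for t in sentence.split()]
    # Artifact: Tom Swifties are heavily represented, so "Tom" signals a pun.
    if "Tom" in tokens:
        return True
    # Artifact: the "My name is X and I am a Y" template is always a pun.
    if sentence.startswith("My name is"):
        return True
    # Otherwise refuse to call it a pun (a real entry would pile on more
    # surface cues; meaning never enters the picture either way).
    return False

print(looks_like_a_pun("My name is John and I am a plumber."))        # True
print(looks_like_a_pun('"I love pepperoni," said Tom with relish.'))  # True
```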

Evidence for These Phenomena

I will offer evidence for three of these (1, 2, and 5). Point 3 follows logically, and others have remarked on point 4 (e.g., Yoav Goldberg’s fine post subtitled “Or, for fucks sake, DL people, leave language alone and stop saying you solve it”).

Systems Exploit the Artifacts of Simplification

Most of the submissions for the task reveal themselves as instances of teaching to the test. I offer a few illustrations.

  • Puns in jokes occur as punchlines near the end. This property holds in this dataset but won’t hold if multiple puns are present or in longer pieces of humor or whimsy. Each contestant either considered words only in the second half or scored those words higher. All contestants used the one-pun trick and, for two equally punny-seeming words, chose the one closer to the end (a minimal sketch of such a locator follows this list).
  • For each test sentence, the program knew what kind of pun to expect: homographic (one term with multiple senses) or heterographic (a similar-sounding word is the pun target). Too bad Jerry Seinfeld does not preface his quips with the class of joke to come. He makes his audience work unduly hard. Multiple contestants exploited the knowledge of the pun type, consulting a rhyming dictionary in one case but not the other.
  • The contest had three subtasks: detect whether a pun is present, detect where it is, and name the two meanings. For the second subtask, only pun-containing examples were used; for the third, the pun word was provided and only examples with both senses in WordNet were used. Ten teams undertook the second task, but only five attempted the first: there were thus systems great at identifying where the pun was but unable to say whether there was any pun at all. Pertinently, the best systems in the second subtask for each class (heterographic and homographic) did not compete in the first subtask.
  • The most overfit-to-the-test contest entry was IdiomSavant (Doogan et al., 2017), which performed the strongest. First, they (along with the second-best system, Vechtomova (2017)) reward out-of-vocabulary terms as more likely to be puns. In the real world, OOV words are almost always typos or adversarial spellings (e.g., prOn) meant to get past porn filters. Second, regarding a specialized class of pun called the Tom Swifty, which is overrepresented in this data, they say, “As such, our system did not adequately recognize these instances, so we designed a separate procedure for these cases”.
  • Apart from these “acts of commission” adaptations to the dataset, the supervised submissions in the contest (which used part of the data for training) would have overfit to Tom, “My name is”, and other such artifacts.
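To show how little is needed, here is a minimal sketch of a pun locator built only from the artifacts above: prefer words near the end of the sentence and reward out-of-vocabulary tokens. It is my own caricature under those assumptions, not IdiomSavant or any other submission, and it leans on NLTK’s generic word list as a stand-in vocabulary.

```python
# A caricature of a pun locator: positional bias plus an OOV bonus.
# Assumes NLTK's "words" corpus is available (nltk.download("words")).
from nltk.corpus import words as nltk_words

VOCAB = {w.lower() for w in nltk_words.words()}

def locate_pun(sentence: str) -> str:
    tokens = [t.strip('.,!?;"') for t in sentence.split()]
    def score(i: int) -> float:
        positional = i / max(len(tokens) - 1, 1)   # later words score higher
        oov_bonus = 0.5 if tokens[i].lower() not in VOCAB else 0.0
        return positional + oov_bonus
    return tokens[max(range(len(tokens)), key=score)]

print(locate_pun("The lingerie thief gave the police officers a slip."))  # slip
```

A locator like this never consults meaning at all, which is exactly the complaint.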

Exploitable Artifacts Overpower

We saw earlier the quote lamenting how techniques from linguistics did not make an appearance. I attribute that void to how much stronger exploitable signals are, rendering more legitimate signals weak and impotent by comparison. I will add one more data point.

Recall that for the third subtask we know precisely where the pun is and what kind it is, and we have to identify the two senses for the pun. We further know that both senses are in WordNet. This subtask is what anybody would understand by “pun identification.”

Submissions did much better at locating where the pun is (best F1 was 66%) than they did at identifying what that pun meant (best F1 was 16%). I find this very peculiar, mildly reminiscent of The Hitchhiker’s Guide to the Galaxy’s 42 (the answer to the ultimate question of life, the universe, and everything, where the question itself is not known). Here, the systems pinpoint the pun without knowing why. The systems have not understood the pun.

Even more peculiar: doing better at the more central task correlates inversely with performance on the pun-location task. Only three systems beat the naive baselines of “last word” and “most frequent sense” in at least one of these two subtasks. None achieved this triumph on both. The F1 scores on (location, disambiguation) for these contest entries reveal the inverse correlation: (0.663, 0.074), (0.440, 0.148), and (0.278, 0.154).

Dataset Morphs into “The Pun Identification Task”

The dataset starts its life as a surrogate for the problem: an expeditious means to assess progress. Time passes, and somewhere along the way, it becomes the entire task itself. What were once carefully described constraints on the dataset morph into scientifically established dogma that can be supported with citations. I witnessed this surreal passage at NAACL this year.

Zou et al.’s presentation suggested to me that they took “one pun per context” as established fact. Indeed, in the paper they state, with proper citation: “Since each context contains a maximum of one pun (Miller et al., 2017) we design a novel tagging scheme to capture this structural constraint.” and later, “To capture this interesting property, we propose a new tagging scheme…”.

One of the main results of the paper is that adding this extra constraint improves performance. I have no doubt that their conclusion is accurate.

Has this ever happened before? I am uneducated in computer vision and unqualified to speak to the strengths or weaknesses of ImageNet, but it has over 11,000 citations, suggesting that improving performance on it has become the goal, whereas when it started its journey its limitations would have been more present in people’s minds. Only over time do the two, the dataset and the original problem that spawned it, blur together.

Beyond Puns

Are there other NLP domains where benchmarks are starkly simplified versions of the original problem they purportedly measure? I will mention three very central problems in passing.

Sentiment

Consider this movie review culled from the Large Movie Review Dataset (Maas et al., 2011).

When I saw the elaborate DVD box for this and the dreadful Red Queen figurine, I felt certain I was in for a big disappointment, but surprise, surprise, I loved it. Convoluted nonsense of course and unforgivable that such a complicated denouement should be rushed to the point of barely being able to read the subtitles, let alone take in the ridiculous explanation. These quibbles apart, however, the film is a dream. Fabulous ladies in fabulous outfits in wonderful settings and the whole thing constantly on the move and accompanied by a wonderful Bruno Nicolai score. He may not be Morricone but in these lighter pieces he might as well be so. Really enjoyable with lots of colour, plenty of sexiness, some gory kills and minimal police interference. Super.

What a human gleans from that movie review goes way beyond the single bit of information captured by “good” or “bad”. Yet most sentiment tasks are content with this level of classification and are thus highly simplified. There are plenty of exploitable artifacts here, including individual words.

Word Sense Disambiguation and Named Entity Linking

How many senses does the phrase “Harry Potter” have? It could be the character, or the movie, or the book, or the other movies in the series, or the other books, or the entire fantasy universe of Hogwarts, or the kid dressed as Harry on Halloween, and so on. Just as bank had fine gradations of sense in “the best bank because it is not a bank”, most terms have non-discrete senses.

The word “mother” is a canonical example used to describe gradation of senses. Mothers play many roles in our lives: giving us half of their genes, giving us their mitochondrial genes, carrying us for nine months, breastfeeding, nurturing, putting band-aids on skinned knees, and so forth. The same person typically carries out these roles, and this person is the mother. But for various reasons, these roles may fall on different individuals. Subsets of these roles are fulfilled by surrogate mothers, wet nurses, biological mothers, foster mothers, and so forth. That list was enlarged in this century because now the “regular” genes and the mitochondrial DNA can come from separate mothers. Furthermore, even den mothers in scouting and the Mother Superior play some of these roles. All these senses overlap and typically appear as one unified concept.

How many senses the word mother has is thus ill-defined.

In the WSD and NEL tasks, however, things are simplified artificially by stipulating discrete senses where only one holds at a time (an assumption that clearly breaks with puns).

Paraphrase Detection and Natural Language Inference

In these tasks, two sentences are given and the system needs to decide whether they are paraphrases (or, for the second task, whether one implies the other). More about this in later posts. Here I will point out that exploitable artifacts exist, as demonstrated by a recent paper (Poliak et al., 2018) that got reasonable numbers on the NLI task for ten different widely used datasets. Recall that the task is to decide if sentence A implies sentence B. The system did much better than baselines without ever being shown sentence A, suggesting exploitable regularities.
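For the curious, here is a minimal sketch of that hypothesis-only setup in the spirit of Poliak et al. (2018): a classifier fit on hypotheses alone, with the premise (sentence A) never entering the picture. The toy sentences and labels below are invented for illustration, not drawn from any real NLI corpus.

```python
# A hypothesis-only NLI baseline: sentence A is never seen by the model.
# The training examples here are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

hypotheses = [
    "A man is sleeping.",
    "Nobody is wearing a hat.",
    "A man is outdoors.",
    "A woman is playing an instrument.",
]
labels = ["contradiction", "contradiction", "entailment", "neutral"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(hypotheses, labels)  # trained without ever looking at the premises

print(model.predict(["A man is sleeping outdoors."]))
```

That a premise-blind model of this shape beats the baselines on real datasets is only possible if the hypotheses themselves carry giveaway regularities, which is the paper’s point.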

Subsequent Posts

The simplifications in all four tasks described use discrete outcomes (e.g., pun or not; which of the two possible senses of jaguar is intended here; etc.). Underlying these, I suspect, is a simplified theory of meaning. What allows an author to say “this is metaphorical usage, we don’t need to deal with it just yet” is the assumption that words and phrases have literal meanings captured well by a small set of possibilities and that metaphor and metonymy are parasitic on this core meaning. That position makes it hard to understand phrases such as “the Mozart of chess” or to deal adequately with phrases whose practical meaning extends way beyond the mere literal meaning: for most readers, the phrase Roe v. Wade carries more meaning than the literal “a lawsuit between two parties, Roe and Wade”. I plan to tackle one or two more tasks and the implicit theory of meaning in subsequent posts.


Abhijit Mahabal

I do unsupervised concept discovery at Pinterest (and previously at Google). Twitter: @amahabal