The Emperor’s New Benchmarks

The Dangers of NLP Benchmark Oversimplification

Language is complex. For NLP, we often use simplified benchmarks, relegating phenomena such as metaphor, metonymy, and ungrammatical usage to future work. We also simplify in other ways. Surely an innocuous step — doesn’t science proceed from the simple to the complex?

Pun Detection

Puns are personal. I wallow in puns, and these my friends have learned to endure. My Indian friends also suffer my cross-language puns. My day job is in the same space: I work with linguistic nuance. And so when I saw in the NAACL 2019 schedule a talk about computational models of puns, of course I went. This post is the direct result.

Three Puns

Contemplate for a moment the following credit union advertisement I ran into a dozen years ago: “The best bank because it is not a bank.” Pause and let that sink in. What aspects of bank does the first occurrence pick out? And the second?

A Dubious Pun

My name is John, and I am a plumber. I hope you agree that that is not the best pun. Is it even a pun for anyone who made it through kindergarten? Someone has called puns the lowest form of humor, but even puns are more dignified than that.

A Question

Is being punny a property of a sentence or of whether people catch the distinct senses? If the multiple meanings of a sentence go unseen, is it a pun? As two examples, consider these: chicken fried rice and end user agreement. Puns or not? Did you notice that one interpretation involves Mr. Chicken standing by the cooking range frying rice? When you encountered this phrase on a restaurant menu, were you aware of the multiple senses? Was it a pun there? As for the end user example, did you notice that end can be a verb with user as its object? Is that a pun? Did you ink such an agreement with an AI?

Pun Detection, NLP Edition

NLP tasks are a tad watered down. The 2017 edition of SemEval — a contest where teams submit computer programs for solving a semantic language understanding challenge — featured pun detection (Miller, Hempelmann and Gurevych, 2017).

  1. These accidental exploitable artifacts can overpower the legitimate and more generalizable signals. As an illustration, in the pun data, the presence of the word “Tom” is a foolproof predictor of punhood, and the pattern “My name is X” accurately pinpoints the pun every single time. Imagine a program that utilizes these features as well as the explicit decrees such as “at most one pun” and “only bother to consider words in WordNet”. Such a program is grotesquely overfit, and would perform badly on examples not from the target data. By contrast, a general system will struggle to beat it on this limited dataset. The last paragraph of the paper summarizing the results of the contest (Miller et al., 2017) wonders why “though there exists a considerable body of research in linguistics on phonological models of punning (Hempelmann and Miller, 2017) and on semantic theories of humor (Raskin, 2008), little to none of this work appeared to inform the participating systems”. Well, duh, the simplification artifacts stacked the deck against systems innocently solving the unabridged problem. Why bother with semantic theories of humor when you could just as fruitfully look for the presence of the word “Tom”?
  2. It lulls us into a deceptive sense of progress. A high score on this benchmark does not equal progress in understanding puns or in understanding how NLP systems should deal with them. A good system might score well here, but a tremendous score can also happen for reasons with zilch correlation with performance on the unabridged problem. Well-performing solutions may contribute no insights into the phenomena being studied.
  3. What follows when someone publishes state of the art (SOTA) results on this simplified task? Reviewers will require that later research compare its methods on this dataset and against this SOTA, even when doing so makes no sense. For one of our papers accepted to ACL 2019, we had to wade through such a situation and had to report results on a surreal dataset. In that light, the last stanza of the poem “The Bridge Builder” by Will Allen Dromgoole loosely fits this scenario in ways unanticipated by the poet (are these words, then, a pun?). “There followed after me to-day / A youth whose feet must pass this way. / This chasm that has been as naught to me / To that fair-haired youth may a pitfall be; / He, too, must cross in the twilight dim; / Good friend, I am building this bridge for him!”
  4. The final point is surreal. The paper at NAACL 2019 thought the puns dataset was THE pun understanding problem: This was the proximate trigger for this post. The simplified problem has supplanted the fuller original problem. I illustrate this turn of events below with words lifted from that paper. Goodhart’s law applies here: When a measure becomes a target, it ceases to be a good measure. If this phenomenon of misinterpreting a dataset as the real problem is as widespread as I suspect, the implications are not pleasant.
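To make the “Tom” and “My name is X” shortcuts concrete, here is a minimal sketch of a detector that uses nothing else. It is not any contest entry: the function name `shortcut_is_pun` and both regular expressions are assumptions made for illustration. The first and third example sentences come from this post, the second is a classic Tom Swifty, and the last is an invented non-pun.

```python
import re

# Surface patterns named in the post as dataset artifacts. The exact regexes
# are illustrative assumptions, not taken from any SemEval submission.
NAME_PATTERN = re.compile(r"\bmy name is\b", re.IGNORECASE)
TOM_PATTERN = re.compile(r"\bTom\b")


def shortcut_is_pun(sentence: str) -> bool:
    """Flag a sentence as a pun using only surface artifacts.

    No word senses, phonology, or humor theory are consulted.
    """
    return bool(TOM_PATTERN.search(sentence) or NAME_PATTERN.search(sentence))


if __name__ == "__main__":
    examples = [
        "My name is John, and I am a plumber.",                 # from this post
        '"I dropped the toothpaste," said Tom, crestfallen.',   # classic Tom Swifty
        "The best bank because it is not a bank.",              # from this post
        "Tom went to the store to buy milk.",                   # invented non-pun
    ]
    for sentence in examples:
        print(shortcut_is_pun(sentence), "-", sentence)
```

Outside the dataset the shortcut fails in both directions: it misses the credit union pun, which has neither marker, and it flags the milk errand, which has no wordplay at all.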

Evidence for These Phenomena

I will offer evidence for three of these (1, 2, and 5). 3 follows logically, and others have remarked on 4 (e.g., Yoav Goldberg’s fine post subtitled “Or, for fucks sake, DL people, leave language alone and stop saying you solve it”).

Systems Exploit the Artifacts of Simplification

The submissions for the task reveal most as instances of teaching to the test. I offer a few illustrations.

  • For each test sentence, the program knew what kind of pun to expect — homographic (one term with multiple senses) or heterographic (a similar sounding word is the pun target). Too bad Jerry Seinfeld does not preface his quips with the class of joke to come. He makes his audience work unduly hard. Multiple contestants exploited the knowledge of the pun type by consulting a rhyming dictionary in one case but not the other.
  • The contest had three subtasks: to detect whether a pun is present, to detect where it is, and to name the two meanings. For the second subtask, only pun-containing examples were used, and for the third, the pun word was provided and only examples where both senses are in WordNet were used. Ten teams undertook the second task, but only five attempted the first: there were thus systems great at identifying where the pun was, but which could not identify whether there was any pun at all. Pertinently, the best systems in the second subtask for each class (heterographic and homographic) did not compete in the first subtask.
  • The most overfit-to-the-test contest entry was IdiomSavant (Doogan et al., 2017), which was also the strongest performer. First, they (along with the second best system, Vechtomova (2017)) reward out-of-vocabulary terms as more likely to be puns. In the real world, OOV words are almost always typos or adversarial uses (e.g., prOn) to get past porn filters. Second, regarding a specialized class of pun called the Tom Swifty, which is overrepresented in this data, they say “As such, our system did not adequately recognize these instances, so we designed a separate procedure for these cases”.
  • Apart from these “acts of commission” adaptations to the dataset, the supervised submissions in the contest (which used part of the data for training) would have overfit to Tom, “My name is”, and other such.
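Two of the points above can be sketched together: a locator that rewards out-of-vocabulary words, and that has no way of answering “there is no pun here”. This is a toy under stated assumptions, not a reconstruction of IdiomSavant or any other entry; `VOCAB`, `score_word`, and `locate_pun` are hypothetical names, and the tiny vocabulary stands in for a real lexicon.

```python
# Hypothetical toy vocabulary; a real system would consult a large lexicon.
VOCAB = {"the", "a", "i", "is", "was", "cat", "sat", "on", "mat", "said", "my", "dog"}


def score_word(word: str) -> float:
    """Toy score: out-of-vocabulary words get a bonus (the weight is arbitrary)."""
    return 1.0 if word.lower() not in VOCAB else 0.0


def locate_pun(sentence: str) -> str:
    """Return the highest-scoring word; ties go to the earliest word.

    There is deliberately no "no pun" outcome, mirroring a subtask that
    only ever shows sentences known to contain a pun.
    """
    words = [w.strip('.,!?"') for w in sentence.split()]
    words = [w for w in words if w]
    if not words:
        raise ValueError("sentence contains no words")
    scores = [score_word(w) for w in words]
    return words[scores.index(max(scores))]


if __name__ == "__main__":
    print(locate_pun("The cat sat on the matt."))  # a typo, not a pun
    print(locate_pun("The cat sat on the mat."))   # no pun, still returns a word
```

A locator built this way can score well when every input is guaranteed to hold a pun, yet on ordinary text it singles out typos and, failing that, some arbitrary word.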

Exploitable Artifacts Overpower

We saw earlier the quote lamenting how techniques from linguistics did not make an appearance. I attribute that void to how much stronger exploitable signals are, rendering more legitimate signals weak and impotent by comparison. I will add one more data point.

Dataset Morphs into “The Pun Identification Task”

The dataset starts its life as a surrogate for the problem: an expeditious means to assess progress. Time passes, and somewhere along the way, it becomes the entire task itself. What were once the carefully described constraints on the dataset morph into the scientifically established dogma which can be supported with citations. I witnessed this surreal passage at NAACL this year.

Beyond Puns

Are there other NLP domains where benchmarks are starkly simplified versions of the original problem they purportedly measure? I will mention three very central problems in passing.


Consider this movie review culled from the Large Movie Review Dataset (Maas et al., 2011).

Word Sense Disambiguation and Named Entity Linking

How many senses does the phrase “Harry Potter” have? It could be the character, or the movie, or the book, or the other movies in the series, or the other books, or the entire fantasy universe of Hogwarts, or the kid dressed as Harry on Halloween, and so on. Just as bank had fine gradations of sense in “the best bank because it is not a bank”, most terms have non-discrete senses.

Paraphrase Detection and Natural Language Inference

In these tasks, two sentences are given and the system needs to decide whether they are paraphrases (or, for the second task, whether one implies the other). More about this in later posts. Here I will point out that these datasets contain exploitable artifacts, as demonstrated by a recent paper (Poliak et al., 2018) that got reasonable numbers on the NLI task for ten different widely used datasets. Recall that the task is to decide if sentence A implies sentence B. The system did much better than baselines without ever being shown sentence A, suggesting exploitable regularities.
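The idea of predicting the label without sentence A can be sketched in a few lines. This is not the model from Poliak et al. (2018); it is a word-counting toy, and every sentence pair below is made up for illustration rather than drawn from any real dataset. The labels are deliberately correlated with wording in sentence B (negation words) to mimic an annotation artifact.

```python
from collections import Counter, defaultdict

# Made-up (sentence A, sentence B, label) triples, NOT from any real dataset.
TRAIN = [
    ("A man plays guitar.", "A man is not playing music.", "contradiction"),
    ("A dog runs in a park.", "Nobody is outside.", "contradiction"),
    ("Two kids eat lunch.", "The kids are not eating.", "contradiction"),
    ("A woman reads a book.", "A person is reading.", "entailment"),
    ("A boy kicks a ball.", "A person is playing.", "entailment"),
    ("A chef cooks rice.", "A person is cooking.", "entailment"),
]
TEST = [
    ("A girl sings on stage.", "A person is singing.", "entailment"),
    ("A cat sleeps on a sofa.", "The cat is not sleeping.", "contradiction"),
]


def tokenize(text):
    return text.lower().replace(".", "").split()


def train_hypothesis_only(examples):
    """Count how often each word of sentence B co-occurs with each label."""
    counts = defaultdict(Counter)
    for _sentence_a, sentence_b, label in examples:  # sentence A is ignored
        for word in tokenize(sentence_b):
            counts[word][label] += 1
    return counts


def predict(counts, sentence_b, default_label):
    """Sum the per-word label counts and return the label with the most votes."""
    votes = Counter()
    for word in tokenize(sentence_b):
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default_label


def accuracy(counts, examples, default_label):
    hits = sum(predict(counts, b, default_label) == y for _a, b, y in examples)
    return hits / len(examples)


if __name__ == "__main__":
    model = train_hypothesis_only(TRAIN)
    print("accuracy without sentence A:", accuracy(model, TEST, "entailment"))
```

If a classifier that never reads sentence A clears the baseline, the score says something about regularities in how sentence B was written and little about inference.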

Subsequent Posts

The simplifications in all four tasks described use discrete outcomes (e.g., pun or not; which of the two possible senses of jaguar is intended here; etc.). Underlying these, I suspect, is a simplified theory of meaning. An author who says “this is metaphorical usage, we don’t need to deal with it just yet” assumes that words and phrases have a literal meaning captured well by a small set of possibilities, and that metaphor and metonymy are parasitic on this core meaning. That position makes it hard to understand phrases such as “the Mozart of chess” or to deal adequately with phrases whose practical meaning extends way beyond the mere literal meaning: for most readers, the phrase Roe v. Wade carries more meaning than the literal “a lawsuit between two parties, Roe and Wade”. I plan to tackle one or two more tasks and the implicit theory of meaning in subsequent posts.

I do unsupervised concept discovery at Pinterest (and previously at Google). Twitter: @amahabal