Why ChatGPT is bad at math and at facts

Abhijit Mahabal
9 min read · Feb 26, 2023

(I have moved to Substack: https://amahabal.substack.com/. I will cross-post a few here, but if you want to follow my posts, please subscribe there).

ChatGPT learns what it knows by reading reams of text, aided by statistical cues about how text is structured. To the extent that such cues faithfully mirror the world, ChatGPT produces sensible-sounding text.

This post explores one class of statistical signals, probing both its strengths and weaknesses, weaknesses arising sometimes from a lack of signal (too few bars) and sometimes from interference (cross connections). The oft-noted weaknesses of ChatGPT in math and in chess arise from these shortcomings, resulting in nonsensical answers and outright fabrications.

Categories

Many cognitive scientists consider categories key to how we understand and navigate the world. George Lakoff’s Women, Fire, and Dangerous Things, for example, is a study of categories and how they shape understanding. Hofstadter and Sander’s Surfaces and Essences shows how we incessantly build and deploy categories every waking moment, and how categories undergird both our mundane activities and Einstein’s lofty thought as he pondered our mysterious universe. One of my favorite slim books is the sociologist Eviatar Zerubavel’s The Fine Line, a study of categories with sharp boundaries, such as that between an age a few seconds shy of 18 years and an age a few seconds past it, which, legally speaking, are worlds apart, or the many religious and social boundaries that separate kosher food from non-kosher and one caste from another, dictating what we can eat and whom we can marry. Robert Goldstone’s papers reveal how the act of categorization modifies what we consider similar: two disparate objects, once we attach the same label to them, start appearing similar. And there are plenty of social experiments in which participants were divided purely randomly into two teams, and the implicit labels “my team” and “my opponents” were enough to make people nice to one random set of people and mean to another.

If categories are indeed so crucial, ChatGPT would need to somehow master them, and it undeniably does, latching on to linguistic statistical cues that have turned out to be deeply powerful, far more so than I’d have expected, and yet deficient in other areas.

Breakfast Foods

I would like you to think of everything that could fill the blank here: “For breakfast, John had ___”. Don’t stop at the first thing that comes to mind! Many different fillers fit nicely, and I want you to come up with half a dozen diverse examples and to contemplate the space of what’s possible.

What you have just thought through is an example of a category. Note how some things clearly fit into this category: oatmeal, scrambled eggs, orange juice, and so forth. Some things, on the other hand, will be decidedly odd there: democracy, beautiful, or but. And others will be rarer but plausible: leftovers, the Cambodian dish fried tarantulas, or some dinner-appropriate food such as steak.

ChatGPT does well only if it knows not just that set of breakfast foods but also their relative likelihoods. To do well on the task, the model must learn something about the world: what the breakfast foods are and how likely each one is.

Signals hiding in plain sight

Does our language hold clues that somehow group all food items together? Is it possible to make a long list of various foods just by processing a ton of text?

There are some obvious hints, called Hearst patterns, that are useful for categories that bear a name. For example, we see phrases such as “foods such as __”, “__ is a food”, and “foods include __”. These are the tip of the statistical iceberg, plainly visible, while a far greater mass of hidden clues lies submerged.
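To make this concrete, here is a minimal sketch of Hearst-style extraction using regular expressions. The tiny corpus and the three templates are invented for illustration; a production system would use far more robust parsing over billions of sentences.

```python
import re
from collections import Counter

# A tiny, invented "corpus". A real system would stream billions of sentences.
corpus = [
    "She loves foods such as oatmeal and scrambled eggs.",
    "Oatmeal is a food rich in fiber.",
    "Popular breakfast foods include pancakes, waffles, and orange juice.",
]

# A few Hearst-style templates for the named category "food".
hearst_templates = [
    r"foods such as ([\w ,]+?)(?:\.|$)",
    r"([\w ]+?) is a food",
    r"foods include ([\w ,]+?)(?:\.|$)",
]

members = Counter()
for sentence in corpus:
    for template in hearst_templates:
        for match in re.finditer(template, sentence, flags=re.IGNORECASE):
            # Split coordinated lists like "pancakes, waffles, and orange juice".
            for item in re.split(r",| and ", match.group(1)):
                item = item.strip().lower()
                if item:
                    members[item] += 1

print(members.most_common())
# e.g. [('oatmeal', 2), ('scrambled eggs', 1), ('pancakes', 1), ('waffles', 1), ('orange juice', 1)]
```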

Thousands of patterns suggest food without using the word food, and some of the short-distance patterns include delicious __, gourmet __, nourishing __, uncooked __, spicy __, morsel of __, cup of __, eat __, consume __, digest __, taste __, sumptuous __. Huge models such as ChatGPT are also capable of exploiting longer-distance patterns such as “for breakfast … had __”, “dinner was ready and … had __”.

The idea of distributional similarity is that similar entities show up in similar contexts, and in many of them. Things that are central to the category food will present themselves in many of the contexts listed above. More obscure foods will appear in fewer contexts. Note how each of the patterns above, by itself, is non-diagnostic, hardly a litmus test for food. We encounter, for instance, delicious irony, gourmet diner, digest the news, and consume entertainment. Despite the fallibility of individual patterns, taken together, thankfully, the patterns are nearly flawless. Although irony gets one food vote from the idiomatic delicious irony, since we don’t encounter uncooked irony or digest irony or sumptuous irony, our faith in its foodness is minuscule.
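Here is a toy illustration of this “many weak votes” idea; the co-occurrence data below is entirely invented.

```python
# Invented co-occurrence data: which food-suggesting contexts each word was seen in.
# A real system would tabulate these from billions of web pages.
contexts_seen = {
    "oatmeal": {"delicious __", "nourishing __", "cup of __", "eat __", "for breakfast ... had __"},
    "steak":   {"delicious __", "gourmet __", "eat __", "taste __"},
    "irony":   {"delicious __"},   # only the idiom "delicious irony"
    "news":    {"digest __"},      # "digest the news"
}

def foodness(word):
    """Score = number of distinct food-suggesting contexts the word appears in.

    Any single context is a weak, fallible cue; agreement across many
    independent contexts is what makes the combined signal reliable.
    """
    return len(contexts_seen.get(word, set()))

for word in contexts_seen:
    print(word, foodness(word))
# oatmeal 5, steak 4, irony 1, news 1 -- 'irony' gets one spurious vote, not many.
```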

Category Builder is an open-source system I built at Google that uses such regularities. It enables us to easily describe a category with just a couple of examples. To get a long list of presidents, we can “expand” the set {Ford, Nixon}. Under the hood, using data preprocessed from two billion pages, it identifies patterns such as “__ signed bill” and “__’s vice president”, generating a clean list. But if we expand a different set, such as {Ford, Chevy}, we obtain a long list of cars, not tripped up by the ambiguity of Ford.
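The sketch below captures the flavor of that kind of set expansion, though not Category Builder’s actual data structures or scoring; the pattern-to-member index is invented.

```python
from collections import Counter

# Invented pattern -> members index, standing in for data mined from two billion pages.
pattern_to_members = {
    "__ signed bill":      {"Ford", "Nixon", "Obama", "Reagan"},
    "__'s vice president": {"Ford", "Nixon", "Obama", "Reagan"},
    "__ pardoned":         {"Ford", "Nixon"},
    "drove a __":          {"Ford", "Chevy", "Toyota"},
    "__ pickup truck":     {"Ford", "Chevy", "Toyota"},
}

def expand(seeds, top_k=5):
    # Keep only the patterns that cover *every* seed; this is what disambiguates
    # {Ford, Nixon} (presidents) from {Ford, Chevy} (cars).
    shared = [members for members in pattern_to_members.values() if seeds <= members]
    votes = Counter()
    for members in shared:
        votes.update(members - seeds)
    return votes.most_common(top_k)

print(expand({"Ford", "Nixon"}))   # [('Obama', 2), ('Reagan', 2)]
print(expand({"Ford", "Chevy"}))   # [('Toyota', 2)]
```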

Language is indeed replete with rich statistical clues.

What can (and does) go wrong?

There are three major modes of failure concerning these simple categories. First, categories may overlap in their patterns, leading ChatGPT to slide inappropriately into the wrong category. Second, some very important categories lack syntactic cues of the sort just described; this second failure is, I believe, the leading reason for ChatGPT’s abysmal performance in math and chess. Finally, although an item may belong to some category in one context, the “same item”, linguistically speaking, can belong to many different categories when you consider a corpus of billions of web pages, leading to dilution of signals and the introduction of noise.

Let’s look at these three now.

Categories with overlapping syntactic cues

Some categories are very similar to each other, with shared members and shared contexts. Consider these three categories: US presidents, US vice presidents, and famous US political figures. Presidents are famous political figures, and many were previously vice presidents, senators, and so forth. They all give speeches, lead rallies, get involved in scandals, break promises, meet constituents, and so forth. In such cases it is easy to slide from one category to the other, to believe that some politician checks enough presidential boxes to belong to the category of president.

Take Hillary Clinton. Not only does she share all the syntactic cues common to presidents and politicians, she ticks even more boxes. She has run a presidential campaign. She has even lived in the White House! As measured by linguistic evidence, she is very presidential. And this is partly the cause of such failures as this:

(tweet from Cristian Georgescu)

Many of ChatGPT’s factual errors have this flavor of over-stretching some category. Such stretching can feel creative, but not when the categories correspond to something real. This is exactly the class of errors that some LLM-based search systems committed when they awarded Turing Awards (the Nobel of computer science) to many computer scientists (such as to Yoav Goldberg, for his work on, if I recall right, Microsoft Word). What a Turing Awardee does and what an accomplished computer scientist does are very similar, and this understandable blunder is widespread.

Other errors involve the same boundary confusion, but based on “structure”, a topic for next time. For example, concluding from “a woman can produce a baby in nine months” that “nine women can get that task done in a month” displays a similar confusion, applying a strategy far outside its jurisdiction.

Categories with missing syntactic cues

The category “Natural Number” offers a rich panoply of syntactic cues. We see, for instance, these patterns: has __ pencils, # __, it costs $__, There were __ people ahead, a table for __ please, and on and on. Number is a category ChatGPT has no trouble with. Likewise, the category chess move, such as Rxf3+, is a piece of cake for ChatGPT, supported by patterns such as Kasparov then played __ and White responded with __ and on and on.

But subsets of these categories are an altogether different kettle of fish. Think of the subcategories even number, odd number, prime number, or legal chess move. Try to think of a few patterns that apply to odd numbers but not to even numbers, or that apply to prime numbers but not to composite numbers. There are a few, such as “__ is even”, but only a tiny fraction of the number of patterns that apply to both.
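One way to see the problem: membership in such subcategories is a computation on the item itself and is nearly invisible in the surrounding text. A toy sketch, with invented context sets:

```python
# Invented context sets for two numbers. Nearly every pattern that 17 appears in
# also fits 18; only rare, explicit patterns like "__ is odd" separate the two.
contexts = {
    17: {"has __ pencils", "it costs $__", "a table for __ please", "__ is odd"},
    18: {"has __ pencils", "it costs $__", "a table for __ please", "__ is even"},
}

shared = contexts[17] & contexts[18]
distinguishing = contexts[17] ^ contexts[18]
print(len(shared), "shared contexts vs.", len(distinguishing), "distinguishing ones")
# 3 shared contexts vs. 2 distinguishing ones -- and in real text the ratio is far more lopsided.

# Membership itself is arithmetic on the numeral, not a fact about its neighbors:
print(17 % 2 == 1, 18 % 2 == 1)   # True False
```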

Just as ChatGPT overextends categories such as “US president” to members that show up in overlapping syntactic contexts, it overextends categories that apply to one situation into strange contexts. As an example, consider this correct “expansion” followed by the strange, and completely bogus, “explanation”:

A similar problem happens with chess moves. ChatGPT absolutely learns the category “chess move”, but even the category “legal chess move” stumps it, as shown by this amazing video of ChatGPT playing Stockfish, where it appears unconstrained by legality, happy even to take its own material. In the video, ChatGPT plays brilliantly for the first four moves of the Ruy Lopez opening. Then it castles, often a good move at this point, provided the bishop has moved out of the way. In this specific case, the move is not just strategically bad but blatantly illegal: its rook jumps onto its own bishop, obliterating it (scary, isn’t it? In its quest for world dominion, ChatGPT will stop at nothing, even taking its own material). But it is not cold calculation that leads it to this sacrifice: it is its inability to learn the category “legal move”.
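Legality, like parity or primality, is a computation, here over the state of the board rather than over the spelling of the move. Below is a sketch using the python-chess library showing one way a premature castling attempt in the Ruy Lopez gets rejected; it does not reproduce the exact game from the video.

```python
import chess  # pip install python-chess

# Reach the position after 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Ba4 Nf6 5.O-O,
# a standard Ruy Lopez in which White has just castled (legally).
board = chess.Board()
for san in ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6", "O-O"]:
    board.push_san(san)

# Black now tries to castle kingside, but Black's f8 bishop is still in the way:
# the rook would land on its own piece, so the move is illegal.
try:
    board.push_san("O-O")
except ValueError as err:  # illegal SAN raises a ValueError (IllegalMoveError in recent versions)
    print("Rejected:", err)

# Legality is a fact about the board state, not about the move's spelling.
print(sorted(board.san(m) for m in board.legal_moves))
```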

Beyond legality, there is also the question of a “good move”. Moves belong to different categories based on why they were played: maybe they are escaping a check, preventing a fork, occupying a file, supporting a pawn, or eliminating an outpost. Or it could be several of these! What shows up in text might be “33. Re1+”, but this move could belong to any of several categories not spelled out. These categories have virtually no “syntactic markings”, and although ChatGPT will trivially learn the category “a chess move”, it won’t learn any of the meaningful subcategories.

Items belonging to multiple categories

Category Builder has some trouble with expanding the set {Aries, Gemini}. It effortlessly identifies 11 of the 12 star signs, but has trouble with the final one. Can you guess which?

The troublesome star sign is, of course, Cancer. The word cancer shows up in dozens of star-sign contexts (such as “born under __”) but also in thousands of disease-related contexts, greatly weakening the signal.
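A toy way to see the dilution, with invented counts:

```python
# Invented occurrence counts: how often each token shows up in star-sign contexts
# ("born under __", "__ horoscope") versus in all contexts combined.
star_sign_hits = {"Aries": 900, "Gemini": 870, "Cancer": 950}
total_hits     = {"Aries": 1_000, "Gemini": 1_050, "Cancer": 60_000}  # disease senses dominate

for word in star_sign_hits:
    # Fraction of a word's occurrences that look star-sign-like: a crude proxy
    # for how strongly the corpus "votes" for the star-sign reading.
    purity = star_sign_hits[word] / total_hits[word]
    print(f"{word:7s} {purity:6.2%}")
# Aries   90.00%
# Gemini  82.86%
# Cancer   1.58%
```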

This is a very common phenomenon, applying to all sorts of categories. It shows up very forcefully when we are dealing with numbers.

Consider 514229. What category is it? Well, it is an odd number. A prime. The 29th Fibonacci number. The zip code of the Changzi township in China. As ChatGPT puts it,

It could be any category! Had it been a 4-digit number such as 1987, it could have been a year, and more specifically, depending on where in the text it appeared, a year of death, a year of publication, the year of passage of a bill, or any number of things. This ambiguity makes it much harder to single out the relevant category.
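Incidentally, the claims above about 514229 check out, and verifying them takes computation on the number itself rather than on its textual neighborhood, exactly the kind of work that context statistics cannot do. A quick sketch (the zip-code claim is a textual fact and is not checked here):

```python
def is_prime(n):
    if n < 2:
        return False
    f = 2
    while f * f <= n:
        if n % f == 0:
            return False
        f += 1
    return True

def fibonacci(k):
    """k-th Fibonacci number, with F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

n = 514229
print(n % 2 == 1)           # True: odd
print(is_prime(n))          # True: prime
print(fibonacci(29) == n)   # True: the 29th Fibonacci number
```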

Next Posts

Here, we focused on simple categories and saw how even here ChatGPT runs into trouble with numbers and falsehoods. No “reasoning” was involved here. Next time, we will start looking at the effect of “structure”, which leads to a lot more fun.


Abhijit Mahabal

I do unsupervised concept discovery at Pinterest (and previously at Google). Twitter: @amahabal