Halfway between Marcus and Bengio

Dec 29, 2019

What symbols are for Marcus differs from what symbols are for Bengio — and in between those extremes exists another possibility that offers a compromise.

My research lives midway between the two extremes. I want symbols, but I don’t wish them handcrafted: my work has been in obtaining symbols through statistical techniques applied to tons of data (web pages while at Google, and text and graphs at Pinterest). My symbols have a distributed representation and graded evocation, and I am after compositionality and higher-level structures. Finally, one key ambition of the cognitive architecture I built for my Ph.D. was avoiding brute force.

In this post, I summarize some of my work. I am indebted to many for their ideas, and the word “my” in the preceding paragraph is intended to encompass all the folks I adopted and stole ideas from, including Douglas Hofstadter, Melanie Mitchell, and John M. Ellis, to name just three.

Contours of the Issues

During the debate, Bengio presented a slide that offers a good launching point (at time 46:30 in this video; also the top image for this post). It describes the pitfalls to avoid and desirable features worth keeping from symbolic models.

We need efficient large-scale learning

Symbolic methods involve intense manual labor: think CYC or Knowledge Graph, each with many person-years’ worth of manual curation. Despite the heroic efforts, the result is still brittle because of the need to have discrete chunks. Trying to squeeze fluid concepts into discrete pigeonholes leads to frustrating or bizarre artifacts, since the number of relationships has to be of a manageable size so human operators can navigate it for data entry.

I will mention just three such artifacts here. First, consider the task of reading a sentence and inserting the discovered facts into the Knowledge Graph. Say you encounter the sentence “M42 is in the constellation Orion”. Curiously, KG does not have any relation called “X is in constellation Y”. It does have half a dozen related relations: “Star X is in Constellation Y”, “Nebula X is in constellation Y”, “Galaxy X is in constellation Y”, etc. Presumably, this is so because of the need to exclude wanderers: there is no “Planet X is in constellation Y” as that is a transient fact. But this strange splintering of a semantically simple relation makes it frustratingly hard to place a new fact. It is an artifact of the graph design, and all such designed graphs have artifacts.

Second example. The Knowledge Graph has a property called “Sexual Orientation”, along with additional information such as “Sexual Orientation Start Date”. The world is more complex: people don’t change orientation on one specific date. But such nuances are hard to capture as nodes in a graph.

Final example. Many curated sets of topics are represented as ontologies, in practice as trees. But a topic such as “classroom activities in geology” has at least two parents: “classroom activities” and “geology”. A structure that complex is hard to curate without having an army of ontologists who agree with each other.

Manual curation is expensive and error-prone. We need methods to learn at scale.

We need semantic grounding in system 1

Neural Networks excel at System 1: fast, reflexive decisions. A robot that has the symbol for “cup” also needs the ability to identify cups in its surroundings. NNs can identify cups blazingly fast; such identification using symbolic methods would be ploddingly slow, more like deliberate thought, and unlike how people recognize (typical) cups.

We need distributed representations for generalization

Having seen five items that answer to a symbol such as “cup”, how do we find more instances or recognize other cups? If “cup” is a monolithic concept, it is hard to see how this can be done, but if the notion of a cup is seen to have dozens or millions of features, other items that share most of these features can be seen as cups. Symbolic systems tend to have atomic symbols: “New York” and “Sexual Orientation” are complex concepts, yet each is represented by a single node in the Knowledge Graph, limiting their utility and creating the artificial problem of how to evoke them.
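To make the contrast with atomic symbols concrete, here is a minimal Python sketch of graded membership through shared features. The feature names, the three example cups, and the 0.6 threshold are all invented for illustration; real features would be learned and far more numerous.

```python
from collections import Counter

# Hypothetical, hand-picked features for three observed cups (illustration only)
cup_examples = [
    {"holds_liquid", "has_handle", "ceramic", "hand_sized", "open_top"},
    {"holds_liquid", "paper", "hand_sized", "open_top"},
    {"holds_liquid", "has_handle", "glass", "hand_sized", "open_top"},
]


def build_prototype(examples, min_share=0.6):
    """Keep the features that appear in at least min_share of the examples."""
    counts = Counter(f for ex in examples for f in ex)
    return {f for f, c in counts.items() if c / len(examples) >= min_share}


def feature_overlap(candidate, prototype):
    """Graded membership: fraction of prototype features the candidate shares."""
    return len(candidate & prototype) / len(prototype)


prototype = build_prototype(cup_examples)

steel_mug = {"holds_liquid", "has_handle", "steel", "hand_sized", "open_top"}
brick = {"ceramic", "hand_sized"}

print(feature_overlap(steel_mug, prototype))  # high: shares the common features
print(feature_overlap(brick, prototype))      # low: shares only one of them
```

The never-before-seen steel mug scores as a cup because it shares the features the examples have in common, even though no single example matches it exactly.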

Efficient Search Based on System 1

For my Ph.D., I wrote a computer program that extends integer sequences in a human-like way. An example from that domain will highlight the need for efficient search. In the following sequence, what integers come next? This will be quick. Please watch yourself as you understand it:

7, 1, 105, 8, 2, 106, 9, 3, 107, 10, 4, 108, …

I wager that you saw this as an interlacing of three sequences (or perhaps a sequence of 3-integer-sized chunks). You never tried seeing this as an interlacing of four or two sequences: the 3-ness jumped out at you, despite such 3-ness not having a high prior probability. You did not methodically work through the millions of possible ways of parsing the sequence. You did not brute force your way to the solution.
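For readers who want the three-strand reading spelled out, here is a small Python sketch. Note what it does not do: it is told the number of strands, and calling it for every possible count would be exactly the brute-force search the paragraph above argues people avoid. The function names are my own and are not taken from Seqsee.

```python
def split_strands(seq, k):
    """Split seq into k interlaced strands: strand i holds seq[i], seq[i+k], ..."""
    return [seq[i::k] for i in range(k)]


def constant_step(strand):
    """Return the common difference if the strand is arithmetic, else None."""
    steps = {b - a for a, b in zip(strand, strand[1:])}
    return steps.pop() if len(steps) == 1 else None


def extend_interlaced(seq, k, n_more):
    """Extend seq by n_more terms, assuming k interlaced arithmetic strands.

    Returns None when that reading does not fit the sequence.
    """
    strands = split_strands(seq, k)
    steps = [constant_step(s) for s in strands]
    if any(step is None for step in steps):
        return None
    extension = []
    for pos in range(len(seq), len(seq) + n_more):
        strand, index = pos % k, pos // k
        extension.append(strands[strand][0] + steps[strand] * index)
    return extension


seq = [7, 1, 105, 8, 2, 106, 9, 3, 107, 10, 4, 108]
print(split_strands(seq, 3))         # three tidy ascending strands
print(extend_interlaced(seq, 3, 3))  # the next three terms under that reading
print(extend_interlaced(seq, 2, 3))  # None: the two-strand reading fails
```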

We need to deal with uncertainty

Yes, but I think this is a red herring. As Marcus said, he is happy to stick probabilities on his symbols; that is the easy part. Things get a lot more fun with what Hofstadter calls slippage. Consider the well-known quip that the state bird of Florida is the mosquito. How do we understand that sentence, which we all do effortlessly? We can “see a mosquito as a bird” and draw inferences such as that large mosquitoes trouble Floridians.

Like to keep: Systematic Generalization

Practice with a certain class of problems makes us quicker at solving other problems in the same class. Below, you see four sequences annotated with ovals to make them easier to parse. The solid part is the “input” and the grayed-out part is the desired extension. Does prior experience with any of the three sequences a, b, or c speed up subsequent understanding of the target sequence? Note that the ovals are NOT part of the input; they are just a convenient way of printing out the sequence.

In all these sequences, each “block” is an ascending group with a “blemish” (a duplication or triplication). In a and b, the blemish moves rightward (as also in the target sequence), and experience with those sequences should make it easier to see the target sequence, more so than experience with c, where the blemish is static.

Such transfer of skills is more typical of symbolic systems since memories of prior sequences can be expressed in terms of symbols such as ascending, blemish, and more complex structures made of these, and accessed when such structures are seen again in a different sequence. The “reification” of these concepts aids the transfer.

Like to keep: Variables, Bindings, etc

Consider the sequence below, shown at a stage when the program has figured out the ovals.

What is the relationship between “4, 5, 6” and “7, 8, 9”? The program can recognize a mapping, shown below.

This has the following skeleton, which when “applied” to “7, 8, 9” will yield “10, 11, 12”. I can’t imagine current neural networks keeping track of the many bits and not mixing them up — they even have trouble learning the identity function, as Marcus has shown.
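The idea of a skeleton with variables that gets bound and then applied can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class names and the choice of attributes (start and length) are hypothetical and are not Seqsee's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AscendingGroup:
    """A run of successive integers, described by two bound variables."""
    start: int
    length: int

    def items(self):
        return list(range(self.start, self.start + self.length))


@dataclass(frozen=True)
class Mapping:
    """A skeleton: how each attribute changes from one group to the next."""
    start_delta: int
    length_delta: int

    def apply(self, group):
        return AscendingGroup(group.start + self.start_delta,
                              group.length + self.length_delta)


def describe(items):
    """Bind items to an AscendingGroup, or return None if they are not a run."""
    if not items or any(b - a != 1 for a, b in zip(items, items[1:])):
        return None
    return AscendingGroup(items[0], len(items))


def find_mapping(a, b):
    """Record how b differs from a, attribute by attribute."""
    return Mapping(b.start - a.start, b.length - a.length)


first, second = describe([4, 5, 6]), describe([7, 8, 9])
mapping = find_mapping(first, second)
print(mapping)                        # start shifts by 3, length unchanged
print(mapping.apply(second).items())  # the next group under this mapping
```

Because the mapping is stored separately from the groups it was found on, the same skeleton can be applied to any group without the bound values getting mixed up.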

Seqsee

Seqsee is a cognitive architecture created along with Douglas Hofstadter and is descended from Melanie Mitchell’s Copycat. You can watch an 8-minute YouTube video about what it does and how it works. A PDF of the dissertation is here.

In broad brushstrokes: Seqsee takes as input the first few terms of an integer sequence and finds subsequent terms. These are “pattern sequences” rather than mathematical sequences such as Fibonacci or Triangular numbers. Patterns are all around us, and these are cognitively prior and historically prior to their mathematical cousins. Seqsee understands sequences by creating islands of meaning that it tries to extend outward.

Seqsee is symbolic and it also produces additional symbols via composition as needed. Along the way, it annotates pieces of the sequence with categories. Thus, a (1, 2, 3) may be annotated as an ascending group. More interestingly, even a ((1, 1), (2, 2), (3, 3)) can be seen, by squinting, as an ascending group, as can (1, 2, 3, 4, (5, 5, 5), 6), where this has a blemish. Furthermore, the following two can be seen as ascending groups of ascending groups: ((2, 3), (2, 3, 4), (2, 3, 4, 5)) and, differently, ((2, 3), (3, 4, 5), (4, 5, 6, 7)).
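A toy version of this kind of annotation, written in Python, might look as follows. It is a sketch under my own assumptions, not Seqsee's code: the labels and function names are invented, it handles only one level of squinting, and it does not attempt the ascending groups of ascending groups mentioned above.

```python
def squint(item):
    """Collapse a group of identical items, e.g. (5, 5, 5), to the single item 5."""
    if isinstance(item, tuple) and len(item) > 1 and len(set(item)) == 1:
        return item[0]
    return item


def is_ascending(group):
    """True if the group is a run of successive integers, read literally."""
    return (len(group) > 1
            and all(isinstance(x, int) for x in group)
            and all(b - a == 1 for a, b in zip(group, group[1:])))


def annotate(group):
    """Return a category label for the group, or None if none applies."""
    if is_ascending(group):
        return "ascending"
    squinted = tuple(squint(x) for x in group)
    if not is_ascending(squinted):
        return None
    collapsed = sum(1 for x, s in zip(group, squinted) if x != s)
    if collapsed == len(group):
        return "ascending-by-squinting"   # every item had to be collapsed
    return "ascending-with-blemish"       # only some items had to be collapsed


print(annotate((1, 2, 3)))
print(annotate(((1, 1), (2, 2), (3, 3))))
print(annotate((1, 2, 3, 4, (5, 5, 5), 6)))
```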

The screenshot below shows Seqsee “seeing a ((1 1) 2 3) as (1 2 3)-with-a-blemish”.

Brute force is not an option, as the following example shows. That Seqsee understands the following sequence thrills me.

In this sequence, there are two things worth noting: the initial (1, 1, 1) is a garden path, and the program must realize that each “1” there plays a distinct role. The role of the initial “1”, in fact, is an ascending group of ascending groups! That is a bizarre and rare situation that Seqsee must never test for without a reasonable belief that such an attempt would succeed. Only motivated tests are made, and such motivation comes from symbols being “active”: just as door knobs have affordances that allow us to see what to do with them, so also mental structures in Seqsee have affordances that want groups to extend to either side, for instance. Section 1.2 of the dissertation is titled “Brute-force vs Concept-centered strategies” and elaborates on these issues.

Seqsee uses spreading activation, but it bases the spread of activation on current conscious attention (section 7.1), which is very reminiscent of what Bengio called the “Consciousness Prior”.
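As a rough illustration of attention-driven spreading activation, consider the Python sketch below. Everything in it is an assumption made for the example: the concept names, the link weights, and the decay and spread constants are invented, and the update rule is a generic one rather than the mechanism described in section 7.1 of the dissertation.

```python
def spread_from_focus(links, activation, focus, boost=1.0, spread=0.5, decay=0.9):
    """One step of spreading activation driven by the attended concept.

    Every concept decays a little, the attended concept is boosted, and
    activation flows only from that concept to its linked neighbors.
    """
    new = {concept: level * decay for concept, level in activation.items()}
    new[focus] = min(1.0, new.get(focus, 0.0) + boost)
    for neighbor, weight in links.get(focus, {}).items():
        new[neighbor] = min(1.0, new.get(neighbor, 0.0) + spread * weight)
    return new


# Hypothetical concept network with link weights in [0, 1]
links = {
    "ascending": {"successor": 1.0, "descending": 0.5},
    "blemish": {"duplication": 1.0},
}

activation = spread_from_focus(links, {}, focus="ascending")
print(activation)
# Concepts unrelated to what is being attended to (here, "duplication")
# receive nothing, so tests involving them are not motivated.
```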

What’s wrong with Seqsee?

Plenty, of course. The building blocks for the symbols were handcrafted. Seqsee can combine these in a million ways to produce more intricate structures, but it cannot produce new building blocks. Also, I would like it to work with millions of concepts and work with language. Section 8.1 of the dissertation is titled “A few deficiencies in Seqsee”, and 8.2 takes a hard look at granularity. Section 8.3 suggests some ways forward using Mental Spaces, and these are reminiscent of the Microtheories from CYC that Marcus mentioned.

My subsequent work has been with language, trying to address these shortcomings in a different space. I have not worked with Seqsee-like architectures recently, but I am producing the building blocks for such work.

Category Builder

Concepts are complex. A concept such as milk has many facets: it is a liquid, a nutritious food, an animal product, white, an allergen, and so forth. A word like Nike has several overlapping meanings: it is a company, a shoe, a ticker symbol, a logo, and many things besides, such as this cow, thanks to the white tick mark on its side.

Manual curation of such facets is well-nigh impossible. How can we create these concepts automatically? Category Builder is a system I built at Google and open sourced that can do such expansion in a polysemy-resistant way and get at nuances (Paper, GitHub).

Category Builder works by building statistical descriptions of words in a million-plus dimensional space. Each dimension corresponds, in that specific instantiation of Category Builder, to a parse-tree context as gleaned from two billion web pages. Milk thus has features such as “object of spill”, “object of drink”, “object of sip”, “object of slurp”, “object of buy” and so forth. In other instantiations (not yet published) the features can come from social graphs or images.

An interesting test bed for whether we capture a facet is afforded by the set expansion problem. What else belongs to the set {Ford, Nixon}? That is different from what belongs to the set {Ford, Chevy}. You could think of this as generalization with two training examples.

In that case, the two Fords were different concepts. But now consider what else belongs to these sets: {Milk, Kerosene}, {Milk, pasta} or {Milk, pollen, ragweed}.
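To see why sparse context features make set expansion sensitive to the facet the seeds share, here is a toy Python sketch. It is not the Category Builder algorithm: the feature names and counts are invented, and scoring by the contexts common to all seeds is a simplification I am assuming for illustration.

```python
# Hypothetical parse-tree context counts (illustration only)
features = {
    "Ford":   {"subject of veto": 3, "object of elect": 2,
               "object of drive": 5, "object of test-drive": 4},
    "Nixon":  {"subject of veto": 6, "object of elect": 3, "subject of resign": 5},
    "Chevy":  {"object of drive": 6, "object of test-drive": 5, "object of tow": 2},
    "Carter": {"subject of veto": 4, "object of elect": 3},
    "Toyota": {"object of drive": 5, "object of test-drive": 4},
}


def expand(seeds, features, top_n=1):
    """Rank non-seed words by their weight on the contexts all seeds share."""
    # Contexts common to every seed pick out the facet the seeds have in common
    shared = set.intersection(*(set(features[s]) for s in seeds))
    scores = {}
    for word, contexts in features.items():
        if word in seeds:
            continue
        score = sum(contexts.get(f, 0) for f in shared)
        if score > 0:
            scores[word] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(expand(["Ford", "Nixon"], features))  # the presidential facet of Ford
print(expand(["Ford", "Chevy"], features))  # the automotive facet of Ford
```

The same word, Ford, lands in different categories depending on its companion seed, because only the contexts both seeds share contribute to the score.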

Building on this, we created a way to do text classification with very little training data (sometimes as few as two or three examples per class) by creating concepts from the training data. In a nutshell: for a text classification problem such as 20 newsgroups, even if we have only five training examples of each class, we may still note that many words in the sci.space group co-occurred, in our 2-billion-sized corpus, with the term “Saturn V”. Then, for an unseen document with the never-seen-in-training term Apollo, we can infer that it is related to sci.space.

Using the limited training data, we identify categories relevant to each class. Suppose we end up with 288 categories. For each document to be classified, we produce a 288-dimension vector that not only does NOT have the word order, it does not even have the words themselves. And yet, we could beat CNNs with this, and with fewer than a few hundred training examples, we beat BERT-large (Paper, Video of talk). We can also combine this vector with other vectors from CNN or BERT, getting the best of both worlds.
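The shape of such a category vector can be shown with a toy Python sketch. The three categories below are written by hand purely for illustration, whereas in the work described they are built automatically from the training data and a large corpus; the counting scheme is likewise an assumption of mine, not the method from the paper.

```python
# Hypothetical categories; in practice these would be expanded automatically,
# so that "apollo" joins the spaceflight category without appearing in training
categories = {
    "spaceflight": {"saturn", "apollo", "rocket", "orbit", "launch"},
    "hockey": {"puck", "goalie", "rink", "playoff"},
    "cryptography": {"cipher", "key", "encrypt", "decrypt"},
}


def category_vector(text, categories):
    """Count how many tokens fall into each category (categories sorted by name).

    The result keeps neither word order nor the words themselves.
    """
    tokens = text.lower().split()
    return [sum(tok in categories[name] for tok in tokens)
            for name in sorted(categories)]


vector = category_vector("The Apollo crew reached orbit", categories)
print(sorted(categories))  # the dimension names, in vector order
print(vector)              # one count per category
```

A downstream classifier would then be trained on these vectors, on their own or concatenated with vectors from another model.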

A third strand of work (not yet published) creates concepts as semi-cliques in certain graphs. This is yet another way to produce robust symbols, with graded membership, at various granularities, with distributed representations and with semantic grounding.

A long road ahead!

These are exciting days, even for someone like me who doesn’t buy into the hype. I think we are taking very tiny steps, but I think we will come closer to somewhat deeper understanding by computers.

I recognize that this post is skimpy on detail. I am happy to elaborate on any of these aspects; please leave a comment or reach me on Twitter.