Halfway between Marcus and Bengio

What symbols mean to Marcus differs from what they mean to Bengio, and between those extremes lies another possibility that offers a compromise.

My research lives midway between the two extremes. I want symbols, but I don’t wish them handcrafted: my work has been in obtaining symbols by statistical techniques, reading them off tons of data (from web pages while at Google, and from text and graphs at Pinterest). My symbols have distributed representations and graded evocation, and I am after compositionality and higher-level structures. Finally, one key ambition of the cognitive architecture I built for my Ph.D. was avoiding brute force.

In this post, I summarize some of my work. I am indebted to many for their ideas, and the word “my” in the preceding paragraph is intended to encompass all the people I adopted and stole ideas from, including Douglas Hofstadter, Melanie Mitchell, and John M. Ellis, to name just three.

Contours of the Issues

We need efficient large-scale learning

I will mention just three examples here. First, consider the task of reading a sentence and inserting the discovered facts into the Knowledge Graph. Say you encounter the sentence “M42 is in the constellation Orion”. Curiously, the KG does not have any relation called “X is in constellation Y”. It does have half a dozen related relations: “Star X is in constellation Y”, “Nebula X is in constellation Y”, “Galaxy X is in constellation Y”, and so on. Presumably this is to exclude wanderers: there is no “Planet X is in constellation Y”, as that would be a transient fact. But this strange splintering of a semantically simple relation makes it frustratingly hard to place a new fact. It is an artifact of the graph design, and all such designed graphs have artifacts.
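To make the artifact concrete, here is a toy sketch: the relation names, the type table, and the function are all mine, invented for illustration, not the actual KG schema. The point is just that a semantically simple fact cannot be placed without first consulting an entity-type table.

```python
# Hypothetical sketch: the splintered "in constellation" relations force a
# type lookup before a simple fact can be placed in the graph.
SPLINTERED_RELATIONS = {
    "Star": "star_in_constellation",
    "Nebula": "nebula_in_constellation",
    "Galaxy": "galaxy_in_constellation",
}

# Toy type table; a real graph would hold millions of typed entities.
ENTITY_TYPES = {"M42": "Nebula", "Betelgeuse": "Star"}

def place_fact(subject, constellation):
    """Map 'X is in constellation Y' onto one of the typed relations."""
    entity_type = ENTITY_TYPES.get(subject)
    relation = SPLINTERED_RELATIONS.get(entity_type)
    if relation is None:
        # There is no relation for, e.g., planets: an artifact of graph design.
        raise ValueError(f"no relation fits {subject!r}")
    return (subject, relation, constellation)
```

A single untyped relation would make `place_fact` a one-liner; the splintering is what forces the detour through types.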

Second example. The Knowledge Graph has a property called “Sexual Orientation”, along with additional information such as “Sexual Orientation Start Date”. The world is more complex: people don’t change orientation on one specific date. But such nuances are hard to capture as nodes in a graph.

Final example. Many curated sets of topics are represented as ontologies, in practice as trees. But a topic such as “classroom activities in geology” has at least two parents: “classroom activities” and “geology”. A structure richer than a tree is hard to curate without an army of ontologists who agree with each other.

Manual curation is expensive and error prone. We need methods to learn at scale.

We need semantic grounding in system 1

We need distributed representations for generalization

Efficient Search Based on System 1

7, 1, 105, 8, 2, 106, 9, 3, 107, 10, 4, 108, …

I wager that you saw this as an interlacing of three sequences (or perhaps as a sequence of three-integer chunks). You never tried seeing it as an interlacing of four or two sequences: the 3-ness jumped out at you, despite such 3-ness not having a high prior probability. You did not methodically work through the millions of possible ways of parsing the sequence. You did not brute force your way to the solution.
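For contrast, here is what the brute-force route looks like: a small sketch (mine, of course, not how any person solves this) that tries every interlacing width and keeps the ones whose streams are regular.

```python
def deinterlace(seq, k):
    """Split seq into k interleaved streams: element i goes to stream i % k."""
    return [seq[i::k] for i in range(k)]

def is_arithmetic(stream):
    """True if the stream ascends/descends by a constant step.

    Streams of length 0 or 1 have no differences and count as trivially regular.
    """
    diffs = {b - a for a, b in zip(stream, stream[1:])}
    return len(diffs) <= 1

seq = [7, 1, 105, 8, 2, 106, 9, 3, 107, 10, 4, 108]

# Try every width; widths of 6 and above qualify trivially because their
# streams are too short to constrain, so the smallest plausible width matters.
plausible = [k for k in range(1, len(seq) + 1)
             if all(is_arithmetic(s) for s in deinterlace(seq, k))]
```

The smallest plausible width is 3, which a person perceives instantly without ever enumerating the alternatives that this loop grinds through.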

We need to deal with uncertainty

Like to keep: Systematic Generalization

In all these sequences, each “block” is an ascending group with a “blemish” — a duplication or triplication. In a and b, the blemish moves rightward (as also in the target sequence), and experience with those sequences should make it easier to see the target sequence, more so than experience with c, where the blemish is static.

Such transfer of skills is more typical of symbolic systems since memories of prior sequences can be expressed in terms of symbols such as ascending, blemish, and more complex structures made of these, and accessed when such structures are seen again in a different sequence. The “reification” of these concepts aids the transfer.

Like to keep: Variables, Bindings, etc

What is the relationship between “4, 5, 6” and “7, 8, 9”? The program can recognize a mapping, shown below.

This has the following skeleton, which, when “applied” to “7, 8, 9”, will yield “10, 11, 12”. I can’t imagine current neural networks keeping track of the many bindings without mixing them up; they even have trouble learning the identity function, as Marcus has shown.
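I cannot show Seqsee’s actual mapping structures here, but a toy sketch conveys the idea of a reusable skeleton whose variables are bound afresh on each application. The names and representation below are mine, invented for illustration, not Seqsee’s.

```python
def extract_skeleton(group):
    """Abstract an ascending group into its reusable skeleton.

    The concrete start value is thrown away; only the structure remains.
    """
    assert all(b - a == 1 for a, b in zip(group, group[1:])), "not ascending"
    return {"length": len(group)}

def successor(group):
    """Apply the 'successor' mapping: an ascending group maps to the next one.

    'start' is a variable bound per application -- binding it to the wrong
    value, or reusing a stale binding, is exactly the mixing-up to avoid.
    """
    skeleton = extract_skeleton(group)
    start = group[-1] + 1
    return list(range(start, start + skeleton["length"]))
```

Applying the mapping to (4, 5, 6) gives (7, 8, 9), and applying it again to (7, 8, 9) gives (10, 11, 12): same skeleton, different bindings each time.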


In broad brushstrokes: Seqsee takes as input the first few terms of an integer sequence and finds subsequent terms. These are “pattern sequences” rather than mathematical sequences such as the Fibonacci or triangular numbers. Patterns are all around us, and they are cognitively and historically prior to their mathematical cousins. Seqsee understands sequences by creating islands of meaning that it tries to extend outward.

Seqsee is symbolic, and it also produces additional symbols via composition as needed. Along the way, it annotates pieces of the sequence with categories. Thus, a (1, 2, 3) may be annotated as an ascending group. More interestingly, even a ((1, 1), (2, 2), (3, 3)) can be seen, by squinting, as an ascending group, as can (1, 2, 3, 4, (5, 5, 5), 6), which has a blemish. Furthermore, the following two can be seen as ascending groups of ascending groups: ((2, 3), (2, 3, 4), (2, 3, 4, 5)) and, differently, ((2, 3), (3, 4, 5), (4, 5, 6, 7)).
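A minimal sketch of this “squinting”, handling just the repetition-style blemishes above. The function names and representation are mine, not Seqsee’s, and this toy deliberately ignores the harder ascending-group-of-ascending-groups cases.

```python
def squint(chunks):
    """Collapse each chunk to a representative value.

    A chunk that is a pure repetition, like (5, 5, 5), collapses to 5;
    a bare integer stands for itself; anything else defeats this toy.
    """
    out = []
    for chunk in chunks:
        values = set(chunk) if isinstance(chunk, tuple) else {chunk}
        if len(values) != 1:
            return None  # not a simple repetition; give up
        out.append(values.pop())
    return out

def is_ascending(seq):
    """True if successive values rise by exactly 1."""
    return all(b - a == 1 for a, b in zip(seq, seq[1:]))
```

Under this squint, ((1, 1), (2, 2), (3, 3)) collapses to (1, 2, 3) and is seen as ascending, as is (1, 2, 3, 4, (5, 5, 5), 6) with its blemish absorbed.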

The screenshot below shows Seqsee “seeing a ((1 1) 2 3) as (1 2 3)-with-a-blemish”.

Brute force is not an option, as the next example shows. That Seqsee understands this sequence thrills me.

In this sequence, there are two things worth noting: the initial (1, 1, 1) is a garden path, and the program must realize that each “1” there plays a distinct role. The role of the initial “1”, in fact, is as an ascending group of ascending groups! That is a bizarre and rare situation that Seqsee must never test for without a reasonable belief that such an attempt would succeed. Only motivated tests are made, and such motivation comes from symbols being “active”: just as door knobs have affordances that let us see what to do with them, so also mental structures in Seqsee have affordances; groups want to extend to either side, for instance. Section 1.2 of the dissertation is titled “Brute-force vs Concept-centered strategies” and elaborates on these issues.

Seqsee uses spreading activation, but I base the activation spread on current conscious attention (section 7.1), which is very reminiscent of what Bengio calls the “Consciousness Prior”.
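A minimal sketch of the distinction, under my own simplified assumptions (plain dictionaries, a single spreading step): activation flows only out of the nodes currently in the attentional focus, not out of every active node in the network.

```python
def spread(activation, edges, focus, rate=0.5):
    """One step of attention-gated spreading activation.

    activation: node -> current activation level
    edges: node -> list of (neighbor, weight) pairs
    focus: the set of nodes currently attended to; only these spread
    """
    new = dict(activation)
    for node in focus:
        for neighbor, weight in edges.get(node, []):
            new[neighbor] = (new.get(neighbor, 0.0)
                             + rate * weight * activation.get(node, 0.0))
    return new
```

With an empty focus, nothing spreads at all, however active the rest of the network is; that gating is the attention-based twist.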

What’s wrong with Seqsee?

Plenty, of course. The building blocks for the symbols were handcrafted. Seqsee can combine these in a million ways to produce more intricate structures, but it cannot produce new building blocks. Also, I would like it to work with millions of concepts and to work with language. Section 8.1 of the dissertation is titled “A few deficiencies in Seqsee”, and 8.2 takes a hard look at granularity. 8.3 suggests some ways forward using Mental Spaces, and these are reminiscent of the Microtheories from CYC that Marcus mentioned.

My subsequent work has been with language, trying to address these shortcomings in a different space. I have not worked with Seqsee-like architectures recently, but I am producing the building blocks for such work.

Category Builder

Manual curation of such facets is well-nigh impossible. How can we create these concepts automatically? Category Builder is a system I built at Google, and open sourced, that can do such expansion in a polysemy-resistant way and get at nuances (Paper, GitHub).

Category Builder works by building statistical descriptions of words in a space of over a million dimensions. In that specific instantiation of Category Builder, each dimension corresponds to a parse-tree context gleaned from two billion web pages. Milk thus has features such as “object of spill”, “object of drink”, “object of sip”, “object of slurp”, “object of buy”, and so forth. In other instantiations (not yet published), the features can come from social graphs or images.

An interesting test bed for whether we capture a facet is afforded by the set expansion problem. What else belongs to the set {Ford, Nixon}? That is different from what belongs to the set {Ford, Chevy}. You could think of this as generalization with two training examples.

In those two cases, “Ford” denotes different concepts. Now consider what else belongs to these sets: {Milk, Kerosene}, {Milk, pasta}, or {Milk, pollen, ragweed}.
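Here is a toy sketch of polysemy-resistant expansion: a candidate is scored only against contexts shared by all seeds, so the two senses of “Ford” pull candidates in different directions. The contexts below are invented, and this is not Category Builder’s actual scoring, which the paper describes in full.

```python
# Invented context sets standing in for the million-dimensional feature space.
CONTEXTS = {
    "Ford":   {"was president", "issued pardon", "makes trucks", "has dealerships"},
    "Nixon":  {"was president", "issued pardon", "resigned office"},
    "Chevy":  {"makes trucks", "has dealerships"},
    "Carter": {"was president", "issued pardon"},
    "Toyota": {"makes trucks", "has dealerships", "makes hybrids"},
}

def expand(seeds):
    """Return the non-seed word sharing the most of the seeds' COMMON contexts.

    Intersecting the seeds' contexts first isolates the intended facet,
    which is what resists polysemy.
    """
    facet = set.intersection(*(CONTEXTS[s] for s in seeds))
    scores = {w: len(ctx & facet)
              for w, ctx in CONTEXTS.items() if w not in seeds}
    return max(scores, key=scores.get)
```

Seeding with {Ford, Nixon} isolates the presidential contexts and surfaces Carter; seeding with {Ford, Chevy} isolates the automotive contexts and surfaces Toyota.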

Building on this, we created a way to do text classification with very little training data (sometimes as few as two or three examples per class) by creating concepts from the training data. In a nutshell: for a text classification problem such as 20 newsgroups, if we only have five training examples of each class, one may still note that many words in the sci.space group co-occurred, in our two-billion-page corpus, with the term “Saturn V”. Then, for an unseen document with the never-seen-in-training term Apollo, we can believe that it is related to sci.space.

Using the limited training data, we identify categories relevant to each class. Suppose we end up with 288 categories. For each document to be classified, we produce a 288-dimensional vector which not only lacks the word order, it lacks the words themselves. And yet we could beat CNNs with this, and with fewer than a few hundred training examples we beat BERT-large (Paper, Video of talk). We can also combine this vector with vectors from a CNN or BERT, getting the best of both worlds.
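A toy sketch of the representation, with invented categories and words: the vector records only which categories fire in a document, discarding both word order and word identity.

```python
# Invented categories standing in for the 288 learned from training data.
CATEGORIES = {
    "space_hardware": {"saturn", "apollo", "booster", "orbit"},
    "hockey":         {"puck", "goalie", "rink"},
}

def category_vector(doc_tokens):
    """A bag-of-categories vector, one dimension per category (sorted by name).

    A dimension is 1 if any token of the document falls in that category.
    """
    tokens = set(doc_tokens)
    return [int(bool(tokens & members))
            for _, members in sorted(CATEGORIES.items())]
```

Note that “apollo” fires the space category even if it never appeared in training, so long as the category itself was learned from the corpus; that is the source of the generalization.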

A third strand of work (not yet published) creates concepts as semi-cliques in certain graphs. This is yet another way to produce robust symbols with graded membership, at various granularities, with distributed representations, and with semantic grounding.

A long road ahead!

I recognize that this post is skimpy on detail. I am happy to elaborate on any of these aspects: please leave a comment or reach me on Twitter.

I do unsupervised concept discovery at Pinterest (and previously at Google). Twitter: @amahabal