A few years ago, I read a paper making a startling argument. Its title was “The Derivative of a Regular Type is its Type of One-Hole Contexts.” I’m not entirely sure how I found it or why I started reading it, but by the time I was halfway though it, I was slapping myself on the forehead: it was so brilliant, and yet so obvious as to be almost trivial — how could nobody have thought of it before?
The idea was that if you take an algebraic data type — say something simple like a plain old list — and poke a hole in it, you get a new data type that looks like a derivative of the first data type. That probably sounds very abstract and uninteresting at first, especially if you aren’t very familiar with algebraic data types. But once you understand them, the idea is simple. And I think it has ramifications for people interested in natural language: it tells us something about the relationship between meaning and context. In particular, it could give us a new way to think about how semantic analysis programs like
word2vec function: they perform a kind of calculus on language.
What’s an Algebraic Data Type?
An algebraic data type is really just a regular data type — seen through a particular lens. So it’s worth taking a moment to think through what a regular data type is, even if you’re already familiar with the concept.
A regular data type is just an abstract representation of the kind of data being stored or manipulated by a computer. Let’s consider, as a first example, the simplest possible item of data to be found in a modern computer: a bit. It can take just two values, zero and one. Most programming languages provide a type describing this kind of value — a boolean type. And values of this type are like computational atoms: every piece of data that modern computers store or manipulate is made of combinations of boolean values.1
So how do we combine boolean values? There are many ways, but here are two simple ones that cover a lot of ground. In the first way, we take two boolean values and declare that they are joined, but can vary independently. So if the first boolean value is equal to zero, the second boolean value may be equal to either zero or one, and if the first boolean value is equal to one, the second boolean value may still be equal to either zero or one. And vice versa. They’re joined, but they don’t pay attention to each other at all. Like this:
1) 0 0
2) 0 1
3) 1 0
4) 1 1
In the second way, we take two boolean values and declare that they are joined, but cannot vary independently — only one of them can be active at a given time. The others are disabled (represented by “*” here) — like this:
1) * 0
2) * 1
3) 0 *
4) 1 *
Now suppose we wanted to combine three boolean values. In the first case, we get this:
1) 0 0 0
2) 0 0 1
3) 0 1 0
4) 0 1 1
5) 1 0 0
6) 1 0 1
7) 1 1 0
8) 1 1 1
And in the second case, we get this:
1) * * 0
2) * * 1
3) * 0 *
4) * 1 *
5) 0 * *
6) 1 * *
At this point, you might start to notice a pattern. Using the first way of combining boolean values, every additional bit doubles the number of possible combined values. Using the second way, every additional bit only adds two to the number of possible combined values.
This observation is the basis of algebraic type theory. In the language of algebraic data types, the first way of combining bits produces a product type, and the second way of combining bits produces a sum type. To get the number of possible combined values of a product type, you simply multiply together the number of possible values that each component type can take. For two bits, that’s ; for three bits, that’s ; and for bits, that’s . And to get the number of possible combined values of a sum type, you simply add together the number of possible values that each component type can take: , or , or .
Pretty much all of the numerical types that a computer stores are product types of bits. For example, a 32-bit integer is just the product type of 32 boolean values. It can represent numbers between and (inclusive), for a total of values. Larger numbers require more binary digits to store.
Sum types are a little more esoteric, but they become useful when you want to represent things that can come in multiple categories. For example, suppose you want to represent a garden plot, and you want to have a variable that stores the number of plants in the plot. But you also want to indicate what kind of plants are in the plot — basil or thyme, say. A sum type provides a compact way to represent this state of affairs. A plot can be either a basil plot, or a thyme plot, but not both. If there can be up to twenty thyme plants in a thyme plot, and up to sixteen basil plants in a basil plot, the final sum type has a total of thirty-six possible values:
If you’ve done any programming, product types probably look pretty familiar, but sum types might be new to you. If you want to try a language that explicitly includes sum types, take a look at Haskell.
Poking Holes in Types
Given this understanding of algebraic data types, here’s what it means to “poke a hole” in a type. Suppose you have a list of five two-bit integers:
| a | b | c | d | e |
This new variable’s type is a product type. It can take this value:
| 0 | 0 | 0 | 0 | 0 |
Or this value:
| 0 | 0 | 0 | 0 | 1 |
Or this value…
| 0 | 0 | 0 | 0 | 2 |
And so on, continuing through values like these:
| 0 | 0 | 0 | 0 | 3 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 2 |
| 0 | 0 | 0 | 1 | 3 |
| 0 | 0 | 0 | 2 | 0 |
All the way to these final values:
| 3 | 3 | 3 | 2 | 3 |
| 3 | 3 | 3 | 3 | 0 |
| 3 | 3 | 3 | 3 | 1 |
| 3 | 3 | 3 | 3 | 2 |
| 3 | 3 | 3 | 3 | 3 |
The total number of possible values for a variable of this type is .
Now think about what happens if you disable a single slot. The result is a list type with a hole in it. What does a variable with this new type look like? Well, say you disable the first slot. You get something that looks like this — a list with a hole in the first slot:
| * | 0 | 0 | 0 | 0 |
Now that there’s a hole in the first slot, there are only possible values this variable can take.
| * | 0 | 0 | 0 | 1 |
| * | 0 | 0 | 0 | 2 |
We could also poke a hole in any of the other slots:
| 0 | * | 0 | 0 | 0 |
| 0 | 0 | * | 0 | 0 |
| 0 | 0 | 0 | * | 0 |
| 0 | 0 | 0 | 0 | * |
But the hole can only be in one of the slots at any given time. Sound familiar? You could say that this is the “sum type” of each of those one-hole types. Each possible placement of the hole adds to (rather than multiplying) the number of possible values. There are five different places for the hole to go, and for each of those, there are possible values. That’s a total of .
Now suppose we make this a little more general, by replacing that with an . This way we can do this same thing with slots of any size — slots that can hold ten values, or 256 values, or 65536 values, or whatever. For the plain list type, that’s a total of possible values. And for the list-with-a-hole, that’s possible values. And now let’s go even further and replace the number of slots with an . That way we can have any number of slots — five, ten, fifty, whatever you like. Now, the plain list has a total of possible values, and the list-with-a-hole has possible values.
If you ever took calculus, that probably looks very familiar. It’s just the power rule! This means you can take data types and do calculus with them.
I think that’s pretty wild, and it kept my head spinning for a couple of years when I first read about it. But I moved on. Fast forward to a few weeks ago when I started really reading about
word2vec and learning how it works: it finds vectors that are good at predicting the word in the middle of an n-gram, or are good at predicting an n-gram given a word in the middle. Let’s start with the first case — say you have this incomplete 5-gram:
I went _ the store
What word is likely to be in there? Well, “to” is a good candidate — I’d venture a guess that it was the first word that popped into your mind. But there are other possibilities. “Past” might work. If the n-gram is part of a longer sentence, “from” is a possibility. And there are many others. So
word2vec will group those words together in its vector space, because they all fit nicely in this context, and in many others.
But look again at that sentence — it’s a sequence of words with a hole in it. So if in this model, a word’s meaning is defined by the n-grams in which it may be embedded, then the type of a word’s meaning is a list-with-a-hole — just like the derivative of the list type that we were looking at above.
The basic idea of
word2vec isn’t extremely surprising; from a certain perspective, this is an elaboration of the concept of a “minimal pair” in linguistics. If you know that “bat” means a thing you hit a ball with, and “mat” means a thing you wipe your feet on, then you can tell that /b/ and /m/ are different phonemes, even though they both involve putting your lips together: the difference between their sounds makes a difference in meaning. Conversely, if the difference between two sounds doesn’t make a difference in meaning, then the sounds must represent the same phoneme. “Photograph” pronounced with a distinct “oh” sound in the second syllable is the same word as “photograph” pronounced with an “uh” sound in the second syllable. They’re different sounds, but they both make the same word here, so they represent the same phoneme.
The thinking behind
word2vec resembles the second case. If you can put one word in place of another in many different contexts and get similar meanings, then the two words must be relatively close together in meaning themselves.
What is surprising is that by pairing these two concepts, we can link the idea of derivatives in calculus to the idea of meaning. We might even be able to develop a model of meaning in which the meaning of a sentence has the same type as the derivative of that sentence’s type. Or — echoing the language of the paper I began with — the meaning type of a sentence type is its type of one-hole contexts.