Meaning, Context, and Algebraic Data Types

A few years ago, I read a paper making a startling argument. Its title was “The Derivative of a Regular Type is its Type of One-Hole Contexts.” I’m not entirely sure how I found it or why I started reading it, but by the time I was halfway though it, I was slapping myself on the forehead: it was so brilliant, and yet so obvious as to be almost trivial — how could nobody have thought of it before?

The idea was that if you take an algebraic data type — say something simple like a plain old list — and poke a hole in it, you get a new data type that looks like a derivative of the first data type. That probably sounds very abstract and uninteresting at first, especially if you aren’t very familiar with algebraic data types. But once you understand them, the idea is simple. And I think it has ramifications for people interested in natural language: it tells us something about the relationship between meaning and context. In particular, it could give us a new way to think about how semantic analysis programs like word2vec function: they perform a kind of calculus on language.

What’s an Algebraic Data Type?

An algebraic data type is really just a regular data type — seen through a particular lens. So it’s worth taking a moment to think through what a regular data type is, even if you’re already familiar with the concept.

A regular data type is just an abstract representation of the kind of data being stored or manipulated by a computer. Let’s consider, as a first example, the simplest possible item of data to be found in a modern computer: a bit. It can take just two values, zero and one. Most programming languages provide a type describing this kind of value — a boolean type. And values of this type are like computational atoms: every piece of data that modern computers store or manipulate is made of combinations of boolean values.1

So how do we combine boolean values? There are many ways, but here are two simple ones that cover a lot of ground. In the first way, we take two boolean values and declare that they are joined, but can vary independently. So if the first boolean value is equal to zero, the second boolean value may be equal to either zero or one, and if the first boolean value is equal to one, the second boolean value may still be equal to either zero or one. And vice versa. They’re joined, but they don’t pay attention to each other at all. Like this:

1) 0 0
2) 0 1
3) 1 0
4) 1 1

In the second way, we take two boolean values and declare that they are joined, but cannot vary independently — only one of them can be active at a given time. The others are disabled (represented by “*” here) — like this:

1) * 0
2) * 1
3) 0 *
4) 1 *

Now suppose we wanted to combine three boolean values. In the first case, we get this:

1) 0 0 0
2) 0 0 1
3) 0 1 0
4) 0 1 1
5) 1 0 0 
6) 1 0 1
7) 1 1 0
8) 1 1 1

And in the second case, we get this:

1) * * 0
2) * * 1
3) * 0 *
4) * 1 *
5) 0 * *
6) 1 * *

At this point, you might start to notice a pattern. Using the first way of combining boolean values, every additional bit doubles the number of possible combined values. Using the second way, every additional bit only adds two to the number of possible combined values.

This observation is the basis of algebraic type theory. In the language of algebraic data types, the first way of combining bits produces a product type, and the second way of combining bits produces a sum type. To get the number of possible combined values of a product type, you simply multiply together the number of possible values that each component type can take. For two bits, that’s 2 \times 2; for three bits, that’s 2 \times 2 \times 2; and for n bits, that’s 2 ^ n. And to get the number of possible combined values of a sum type, you simply add together the number of possible values that each component type can take: 2 + 2, or 2 + 2 + 2, or 2 \times n.

Pretty much all of the numerical types that a computer stores are product types of bits. For example, a 32-bit integer is just the product type of 32 boolean values. It can represent numbers between 0 and 2 ^{32} - 1 (inclusive), for a total of 2 ^{32} values. Larger numbers require more binary digits to store.

Sum types are a little more esoteric, but they become useful when you want to represent things that can come in multiple categories. For example, suppose you want to represent a garden plot, and you want to have a variable that stores the number of plants in the plot. But you also want to indicate what kind of plants are in the plot — basil or thyme, say. A sum type provides a compact way to represent this state of affairs. A plot can be either a basil plot, or a thyme plot, but not both. If there can be up to twenty thyme plants in a thyme plot, and up to sixteen basil plants in a basil plot, the final sum type has a total of thirty-six possible values:

Basil   Thyme
*       1
*       2
...     ...
*       19    
*       20
1       *
2       *
...     ...
15      *
16      *

If you’ve done any programming, product types probably look pretty familiar, but sum types might be new to you. If you want to try a language that explicitly includes sum types, take a look at Haskell.

Poking Holes in Types

Given this understanding of algebraic data types, here’s what it means to “poke a hole” in a type. Suppose you have a list of five two-bit integers:

| a | b | c | d | e |

This new variable’s type is a product type. It can take this value:

| 0 | 0 | 0 | 0 | 0 |

Or this value:

| 0 | 0 | 0 | 0 | 1 |

Or this value…

| 0 | 0 | 0 | 0 | 2 |

And so on, continuing through values like these:

| 0 | 0 | 0 | 0 | 3 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 2 |
| 0 | 0 | 0 | 1 | 3 |
| 0 | 0 | 0 | 2 | 0 |

All the way to these final values:

| 3 | 3 | 3 | 2 | 3 |
| 3 | 3 | 3 | 3 | 0 |
| 3 | 3 | 3 | 3 | 1 |
| 3 | 3 | 3 | 3 | 2 |
| 3 | 3 | 3 | 3 | 3 |

The total number of possible values for a variable of this type is (2 ^ 2) ^ 5 = 4 ^ 5 = 4 * 4 * 4 * 4 * 4 = 1024.

Now think about what happens if you disable a single slot. The result is a list type with a hole in it. What does a variable with this new type look like? Well, say you disable the first slot. You get something that looks like this — a list with a hole in the first slot:

| * | 0 | 0 | 0 | 0 |

Now that there’s a hole in the first slot, there are only 4 * 4 * 4 * 4 = 256 possible values this variable can take.

| * | 0 | 0 | 0 | 1 |
| * | 0 | 0 | 0 | 2 |

We could also poke a hole in any of the other slots:

| 0 | * | 0 | 0 | 0 |
| 0 | 0 | * | 0 | 0 |
| 0 | 0 | 0 | * | 0 |
| 0 | 0 | 0 | 0 | * |

But the hole can only be in one of the slots at any given time. Sound familiar? You could say that this is the “sum type” of each of those one-hole types. Each possible placement of the hole adds to (rather than multiplying) the number of possible values. There are five different places for the hole to go, and for each of those, there are 256 possible values. That’s a total of 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 = 5 \times 4 ^ 4.

Now suppose we make this a little more general, by replacing that 4 with an x. This way we can do this same thing with slots of any size — slots that can hold ten values, or 256 values, or 65536 values, or whatever. For the plain list type, that’s a total of x ^ 5 possible values. And for the list-with-a-hole, that’s 5 x^4 possible values. And now let’s go even further and replace the number of slots with an n. That way we can have any number of slots — five, ten, fifty, whatever you like. Now, the plain list has a total of x ^ n possible values, and the list-with-a-hole has n x ^ {(n - 1)} possible values.

If you ever took calculus, that probably looks very familiar. It’s just the power rule! This means you can take data types and do calculus with them.

One-Hole Contexts

I think that’s pretty wild, and it kept my head spinning for a couple of years when I first read about it. But I moved on. Fast forward to a few weeks ago when I started really reading about word2vec and learning how it works: it finds vectors that are good at predicting the word in the middle of an n-gram, or are good at predicting an n-gram given a word in the middle. Let’s start with the first case — say you have this incomplete 5-gram:

I went _ the store

What word is likely to be in there? Well, “to” is a good candidate — I’d venture a guess that it was the first word that popped into your mind. But there are other possibilities. “Past” might work. If the n-gram is part of a longer sentence, “from” is a possibility. And there are many others. So word2vec will group those words together in its vector space, because they all fit nicely in this context, and in many others.

But look again at that sentence — it’s a sequence of words with a hole in it. So if in this model, a word’s meaning is defined by the n-grams in which it may be embedded, then the type of a word’s meaning is a list-with-a-hole — just like the derivative of the list type that we were looking at above.

The basic idea of word2vec isn’t extremely surprising; from a certain perspective, this is an elaboration of the concept of a “minimal pair” in linguistics. If you know that “bat” means a thing you hit a ball with, and “mat” means a thing you wipe your feet on, then you can tell that /b/ and /m/ are different phonemes, even though they both involve putting your lips together: the difference between their sounds makes a difference in meaning. Conversely, if the difference between two sounds doesn’t make a difference in meaning, then the sounds must represent the same phoneme. “Photograph” pronounced with a distinct “oh” sound in the second syllable is the same word as “photograph” pronounced with an “uh” sound in the second syllable. They’re different sounds, but they both make the same word here, so they represent the same phoneme.

The thinking behind word2vec resembles the second case. If you can put one word in place of another in many different contexts and get similar meanings, then the two words must be relatively close together in meaning themselves.

What is surprising is that by pairing these two concepts, we can link the idea of derivatives in calculus to the idea of meaning. We might even be able to develop a model of meaning in which the meaning of a sentence has the same type as the derivative of that sentence’s type. Or — echoing the language of the paper I began with — the meaning type of a sentence type is its type of one-hole contexts.

  1. But don’t take this for granted. There have been — and perhaps will be in the future — ternary computers

An Interactive Citation Simulator

In my last post on citation rates, I described the output of a citation simulator, and I promised an interactive version. As it turns out, creating computation-heavy JavaScript that doesn’t completely bog down browsers is hard; hence the delay in posting.1 But here it is!

I’ve added a few presets that illustrate the phenomena I talked about in my last post. The order-of-magnitude changes I talked about overwhelm even Chrome’s JavaScript engine, but these somewhat smaller changes show similar behavior.2

  • The default has all settings at their lowest value except for the number of journals (100), the bias (1), and the decay (75 percent). Lower the number of journals to 50 if the simulator runs too slowly; the results are similar, though somewhat inflated.

  • The first preset has the same settings as the default, but increases the number of outgoing citations per article to 40.

  • The second preset has the same settings as the default, but increases the number of issues per year per journal to 20.

If you run the simulator on these three presets, you’ll see that increasing the number of issues per year produces a 100 percent jump in the top H5 index. Increasing the number of outgoing citations per article produces a 35 percent jump. If you combine both increases, there’s a 250 percent jump; these two settings are mutually reinforcing. But the relative effect of increasing journal output is still stronger, as you can see if you hold the number of outgoing citations at 40 and increase the number of issues per year from 5 to 20.

I’ve also included three additional presets that illustrate the effect of the bias parameter.

  • The third preset has the same settings as the default, but reduces the bias term to 0, and reduces the decay term to 25 percent.
  • The fourth preset has the same settings as preset three, but increases the number of outgoing citations per article to 40.

  • The fifth preset has the same settings as preset three, but increases the number of issues per year per journal to 20.

These three presets show that the relative impact of the outgoing citations and issues per year parameters depends highly on the bias parameter. Because the bias is 0, every active article is equally likely to be cited.3 Under these circumstances, increasing the number of outgoing citations produces a much larger jump in H5 indices.

Please play with this and let me know if you run into any problems!

DefaultPreset 1Preset 2Preset 3Preset 4Preset 5

Number of Journals: 
Outgoing Citations Per Article: 
Articles Per Issue: 
Issues Per Year: 
Bias Exponent: 
Decay Probability: 


Current year:
Average H5 score for top journal:
Average H5 score for median journal:

Top H Scores:

Maximum incoming citations per article within each decile of active articles (log scale):

  1. I clearly need to learn about web workers, because this still bogs down the page on high settings; use with caution! 
  2. The values jump around a little at the beginning; this is because of "edge effects" early in the simulation. After the first eight or ten years, the values often stabilize -- but for some settings they continue to jump around in a way that suggests chaotic behavior. In most cases, the averages tend to converge to a stable point after about thirty years. 
  3. The exponential decay you see in the second chart is due to the decay term, which steadily removes papers from the pool of citable papers. Since older papers are both more likely to be highly cited and more likely to be removed from the pool, most papers are cited only a few times.