# Meaning, Context, and Algebraic Data Types

A few years ago, I read a paper making a startling argument. Its title was “The Derivative of a Regular Type is its Type of One-Hole Contexts.” I’m not entirely sure how I found it or why I started reading it, but by the time I was halfway though it, I was slapping myself on the forehead: it was so brilliant, and yet so obvious as to be almost trivial — how could nobody have thought of it before?

The idea was that if you take an algebraic data type — say something simple like a plain old list — and poke a hole in it, you get a new data type that looks like a derivative of the first data type. That probably sounds very abstract and uninteresting at first, especially if you aren’t very familiar with algebraic data types. But once you understand them, the idea is simple. And I think it has ramifications for people interested in natural language: it tells us something about the relationship between meaning and context. In particular, it could give us a new way to think about how semantic analysis programs like word2vec function: they perform a kind of calculus on language.

### What’s an Algebraic Data Type?

An algebraic data type is really just a regular data type — seen through a particular lens. So it’s worth taking a moment to think through what a regular data type is, even if you’re already familiar with the concept.

A regular data type is just an abstract representation of the kind of data being stored or manipulated by a computer. Let’s consider, as a first example, the simplest possible item of data to be found in a modern computer: a bit. It can take just two values, zero and one. Most programming languages provide a type describing this kind of value — a boolean type. And values of this type are like computational atoms: every piece of data that modern computers store or manipulate is made of combinations of boolean values.1

So how do we combine boolean values? There are many ways, but here are two simple ones that cover a lot of ground. In the first way, we take two boolean values and declare that they are joined, but can vary independently. So if the first boolean value is equal to zero, the second boolean value may be equal to either zero or one, and if the first boolean value is equal to one, the second boolean value may still be equal to either zero or one. And vice versa. They’re joined, but they don’t pay attention to each other at all. Like this:

1) 0 0
2) 0 1
3) 1 0
4) 1 1


In the second way, we take two boolean values and declare that they are joined, but cannot vary independently — only one of them can be active at a given time. The others are disabled (represented by “*” here) — like this:

1) * 0
2) * 1
3) 0 *
4) 1 *


Now suppose we wanted to combine three boolean values. In the first case, we get this:

1) 0 0 0
2) 0 0 1
3) 0 1 0
4) 0 1 1
5) 1 0 0
6) 1 0 1
7) 1 1 0
8) 1 1 1


And in the second case, we get this:

1) * * 0
2) * * 1
3) * 0 *
4) * 1 *
5) 0 * *
6) 1 * *


At this point, you might start to notice a pattern. Using the first way of combining boolean values, every additional bit doubles the number of possible combined values. Using the second way, every additional bit only adds two to the number of possible combined values.

This observation is the basis of algebraic type theory. In the language of algebraic data types, the first way of combining bits produces a product type, and the second way of combining bits produces a sum type. To get the number of possible combined values of a product type, you simply multiply together the number of possible values that each component type can take. For two bits, that’s $2 \times 2$; for three bits, that’s $2 \times 2 \times 2$; and for $n$ bits, that’s $2 ^ n$. And to get the number of possible combined values of a sum type, you simply add together the number of possible values that each component type can take: $2 + 2$, or $2 + 2 + 2$, or $2 \times n$.

Pretty much all of the numerical types that a computer stores are product types of bits. For example, a 32-bit integer is just the product type of 32 boolean values. It can represent numbers between $0$ and $2 ^{32} - 1$ (inclusive), for a total of $2 ^{32}$ values. Larger numbers require more binary digits to store.

Sum types are a little more esoteric, but they become useful when you want to represent things that can come in multiple categories. For example, suppose you want to represent a garden plot, and you want to have a variable that stores the number of plants in the plot. But you also want to indicate what kind of plants are in the plot — basil or thyme, say. A sum type provides a compact way to represent this state of affairs. A plot can be either a basil plot, or a thyme plot, but not both. If there can be up to twenty thyme plants in a thyme plot, and up to sixteen basil plants in a basil plot, the final sum type has a total of thirty-six possible values:

Basil   Thyme
*       1
*       2
...     ...
*       19
*       20
1       *
2       *
...     ...
15      *
16      *


If you’ve done any programming, product types probably look pretty familiar, but sum types might be new to you. If you want to try a language that explicitly includes sum types, take a look at Haskell.

### Poking Holes in Types

Given this understanding of algebraic data types, here’s what it means to “poke a hole” in a type. Suppose you have a list of five two-bit integers:

| a | b | c | d | e |


This new variable’s type is a product type. It can take this value:

| 0 | 0 | 0 | 0 | 0 |


Or this value:

| 0 | 0 | 0 | 0 | 1 |


Or this value…

| 0 | 0 | 0 | 0 | 2 |


And so on, continuing through values like these:

| 0 | 0 | 0 | 0 | 3 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 2 |
| 0 | 0 | 0 | 1 | 3 |
| 0 | 0 | 0 | 2 | 0 |
...


All the way to these final values:

...
| 3 | 3 | 3 | 2 | 3 |
| 3 | 3 | 3 | 3 | 0 |
| 3 | 3 | 3 | 3 | 1 |
| 3 | 3 | 3 | 3 | 2 |
| 3 | 3 | 3 | 3 | 3 |


The total number of possible values for a variable of this type is $(2 ^ 2) ^ 5 = 4 ^ 5 = 4 * 4 * 4 * 4 * 4 = 1024$.

Now think about what happens if you disable a single slot. The result is a list type with a hole in it. What does a variable with this new type look like? Well, say you disable the first slot. You get something that looks like this — a list with a hole in the first slot:

| * | 0 | 0 | 0 | 0 |


Now that there’s a hole in the first slot, there are only $4 * 4 * 4 * 4 = 256$ possible values this variable can take.

| * | 0 | 0 | 0 | 1 |
| * | 0 | 0 | 0 | 2 |
...


We could also poke a hole in any of the other slots:

| 0 | * | 0 | 0 | 0 |
| 0 | 0 | * | 0 | 0 |
| 0 | 0 | 0 | * | 0 |
| 0 | 0 | 0 | 0 | * |


But the hole can only be in one of the slots at any given time. Sound familiar? You could say that this is the “sum type” of each of those one-hole types. Each possible placement of the hole adds to (rather than multiplying) the number of possible values. There are five different places for the hole to go, and for each of those, there are $256$ possible values. That’s a total of $4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 = 5 \times 4 ^ 4$.

Now suppose we make this a little more general, by replacing that $4$ with an $x$. This way we can do this same thing with slots of any size — slots that can hold ten values, or 256 values, or 65536 values, or whatever. For the plain list type, that’s a total of $x ^ 5$ possible values. And for the list-with-a-hole, that’s $5 x^4$ possible values. And now let’s go even further and replace the number of slots with an $n$. That way we can have any number of slots — five, ten, fifty, whatever you like. Now, the plain list has a total of $x ^ n$ possible values, and the list-with-a-hole has $n x ^ {(n - 1)}$ possible values.

If you ever took calculus, that probably looks very familiar. It’s just the power rule! This means you can take data types and do calculus with them.

### One-Hole Contexts

I think that’s pretty wild, and it kept my head spinning for a couple of years when I first read about it. But I moved on. Fast forward to a few weeks ago when I started really reading about word2vec and learning how it works: it finds vectors that are good at predicting the word in the middle of an n-gram, or are good at predicting an n-gram given a word in the middle. Let’s start with the first case — say you have this incomplete 5-gram:

I went _ the store


What word is likely to be in there? Well, “to” is a good candidate — I’d venture a guess that it was the first word that popped into your mind. But there are other possibilities. “Past” might work. If the n-gram is part of a longer sentence, “from” is a possibility. And there are many others. So word2vec will group those words together in its vector space, because they all fit nicely in this context, and in many others.

But look again at that sentence — it’s a sequence of words with a hole in it. So if in this model, a word’s meaning is defined by the n-grams in which it may be embedded, then the type of a word’s meaning is a list-with-a-hole — just like the derivative of the list type that we were looking at above.

The basic idea of word2vec isn’t extremely surprising; from a certain perspective, this is an elaboration of the concept of a “minimal pair” in linguistics. If you know that “bat” means a thing you hit a ball with, and “mat” means a thing you wipe your feet on, then you can tell that /b/ and /m/ are different phonemes, even though they both involve putting your lips together: the difference between their sounds makes a difference in meaning. Conversely, if the difference between two sounds doesn’t make a difference in meaning, then the sounds must represent the same phoneme. “Photograph” pronounced with a distinct “oh” sound in the second syllable is the same word as “photograph” pronounced with an “uh” sound in the second syllable. They’re different sounds, but they both make the same word here, so they represent the same phoneme.

The thinking behind word2vec resembles the second case. If you can put one word in place of another in many different contexts and get similar meanings, then the two words must be relatively close together in meaning themselves.

What is surprising is that by pairing these two concepts, we can link the idea of derivatives in calculus to the idea of meaning. We might even be able to develop a model of meaning in which the meaning of a sentence has the same type as the derivative of that sentence’s type. Or — echoing the language of the paper I began with — the meaning type of a sentence type is its type of one-hole contexts.

1. But don’t take this for granted. There have been — and perhaps will be in the future — ternary computers

# An Interactive Citation Simulator

In my last post on citation rates, I described the output of a citation simulator, and I promised an interactive version. As it turns out, creating computation-heavy JavaScript that doesn’t completely bog down browsers is hard; hence the delay in posting.1 But here it is!

I’ve added a few presets that illustrate the phenomena I talked about in my last post. The order-of-magnitude changes I talked about overwhelm even Chrome’s JavaScript engine, but these somewhat smaller changes show similar behavior.2

• The default has all settings at their lowest value except for the number of journals (100), the bias (1), and the decay (75 percent). Lower the number of journals to 50 if the simulator runs too slowly; the results are similar, though somewhat inflated.

• The first preset has the same settings as the default, but increases the number of outgoing citations per article to 40.

• The second preset has the same settings as the default, but increases the number of issues per year per journal to 20.

If you run the simulator on these three presets, you’ll see that increasing the number of issues per year produces a 100 percent jump in the top H5 index. Increasing the number of outgoing citations per article produces a 35 percent jump. If you combine both increases, there’s a 250 percent jump; these two settings are mutually reinforcing. But the relative effect of increasing journal output is still stronger, as you can see if you hold the number of outgoing citations at 40 and increase the number of issues per year from 5 to 20.

I’ve also included three additional presets that illustrate the effect of the bias parameter.

• The third preset has the same settings as the default, but reduces the bias term to 0, and reduces the decay term to 25 percent.
• The fourth preset has the same settings as preset three, but increases the number of outgoing citations per article to 40.

• The fifth preset has the same settings as preset three, but increases the number of issues per year per journal to 20.

These three presets show that the relative impact of the outgoing citations and issues per year parameters depends highly on the bias parameter. Because the bias is 0, every active article is equally likely to be cited.3 Under these circumstances, increasing the number of outgoing citations produces a much larger jump in H5 indices.

Please play with this and let me know if you run into any problems!

Number of Journals:
Outgoing Citations Per Article:
Articles Per Issue:
Issues Per Year:
H-Index:
Bias Exponent:
Decay Probability:

Current year:
Average H5 score for top journal:
Average H5 score for median journal:

Top H Scores:

Maximum incoming citations per article within each decile of active articles (log scale):

1. I clearly need to learn about web workers, because this still bogs down the page on high settings; use with caution!
2. The values jump around a little at the beginning; this is because of "edge effects" early in the simulation. After the first eight or ten years, the values often stabilize -- but for some settings they continue to jump around in a way that suggests chaotic behavior. In most cases, the averages tend to converge to a stable point after about thirty years.
3. The exponential decay you see in the second chart is due to the decay term, which steadily removes papers from the pool of citable papers. Since older papers are both more likely to be highly cited and more likely to be removed from the pool, most papers are cited only a few times.

# Simulating Citation Networks

A few months ago, Patrick Dunleavy published a post on the London School of Economics Impact Blog describing “a huge gulf between many STEM scientists… and scholars in other disciplines in how often they cite other people’s research.” After providing some statistics in support of this claim, some of which placed citation rates in the humanities an entire order of magnitude below those in the natural and life sciences, he offered some advice:

Those social science and humanities academics who go through life anxious to prove themselves a ‘source’ (of citations) but not a ‘hub’ (of cites to others) are not practicing the reputation-building skills that they fondly imagine… Their practice is self-harming not just to themselves, but to all who read their works or follow after them as students at any level. Others who perhaps accept such attitudes without practicing them themselves – for example, as departmental colleagues or journal reviewers – are also failing in their scholarly obligations, albeit in a minor way.

Initially I felt chastened by Dunleavy’s article. “It’s a shame,” I thought, “that we in the humanities operate in such backwards ways that our citation rates are an entire order of magnitude lower than the citation rates of researchers in the sciences. We’d really better start citing each other more — being ‘hubs’ instead of ‘sources’!” But after giving some careful thought to what that means, I’ve concluded that this issue deserves more investigation than it received in Dunleavy’s post.

My own informal investigations suggest conclusions that are quite different from those that Dunleavy drew. The global point he made — that scholars in the humanities and social sciences ought to reform their citation practices — may still hold. But if it does, I believe it will be for different reasons than he suggested in his post. And those reasons will not be fully clear without much more systematic research into the relationship between research practices in the humanities and the statistical measurements he relied on. In the end, that relationship will probably have less to do with the number of citations in bibliographies than with the age of articles being cited, and the number of articles being published. Dunleavy summarily rejected the second claim, but there’s a strong line of reasoning suggesting that he was wrong to do so. The number of publications in a field — and especially the number of articles published per journal per year — could significantly affect both of the statistics he uses. To support that argument I’ve created a citation simulator that’s publicly available as a gist. I’ve also posted an interactive JavaScript version, and I’ll talk below about the assumptions it makes and the citation patterns it produces.

But I want to begin by describing my initial thought process upon reading Dunleavy’s post. I think it’s important to take seriously the kind of informal reasoning that might lead to skepticism in cases like this. Humanists aren’t often trained to make complex mathematical arguments, but that training isn’t always necessary to see when those arguments have problems. We don’t all need more mathematical training. We just need to get more practice subjecting mathematical arguments to sniff tests. These tests often involve paying attention to the order of magnitude of a value. If the number you get from an argument is in the low five digits, and the number you expect has at least six digits, then something’s probably wrong.

### Off by Inches or by Feet?

What first bothered me about Dunleavy’s argument was the mismatch between incoming and outgoing citation rates. Although the first chart he displayed shows an entire order of magnitude difference between incoming citation rates in the humanities and the sciences, I had never noticed such a dramatic difference in the number of outgoing citations per article. Being in a self-critical mood, my first instinct was to consider my own work. I thought about how many secondary sources I cited in my most recent publication — just about twenty out of a total of fifty including primary sources1. That didn’t seem great — about average for the field at best, and probably even below average. But as I continued to think about it, that also seemed about average for most of the computer science papers I had read. In fact, a lot of those papers seemed to cite between fifteen and twenty other papers. But that was a vague hunch — it called for investigation. So I thought of a computer science paper that I know has been cited many times: the original paper describing Latent Dirichlet Allocation by David Blei, Andrew Ng, and Michael Jordan.2

It only cites seven papers!

Then I started to feel a little better about my own citation practices. At least I was doing better than rockstars like David Blei and Andrew Ng. (And that was back before they were rockstars.) What about people who cited their paper? I did a couple of random checks. One cited twenty-five papers; one cited eighteen. One cited 332 — that threw me for a loop until I realized it was a book-length document. But then it seemed about in line with the secondary bibliographies of most humanities monographs that I’ve seen — higher than average, perhaps, but certainly not by an order of magnitude.

Even based on such a small and unsystematically collected sample, I think it’s reasonable to conclude that humanists and scientists probably aren’t citing wildly different numbers of fellow scholars and researchers. Clearly this is an assertion that demands a much more thorough investigation. But we can expect that if humanists were generating an entire order of magnitude fewer citations, it would be obvious at first glance. And it’s not obvious. Humanists seem to be generating a roughly similar number of outgoing citations per article on average — optimistically, about the same number, and pessimistically, perhaps two thirds or three fifths as many. But not a tenth as many.

So what are the citation practices that we need to change? Dunleavy’s argument was that if we work harder at being ‘hubs,’ we’ll also have more success — potentially an order of magnitude more — as ‘sources.’ In other words, we should include far more secondary citations in our bibliographies. This question about outgoing citations is a pretty good test of that claim, and it didn’t do very well.3

That suggests that outgoing citations by humanists are being lost by these statistics, or incoming citations by scientists are being magnified somehow. Or perhaps the data is outright biased. At worst we need to be citing different articles — not ten times as many. Later in his article, Dunleavy suggested that we need to be doing more thorough literature reviews. Could that be causing the problem somehow? Maybe it’s a matter of citing more recent work, or more obscure work. Or perhaps we’re citing people outside our own field too frequently — perhaps we should be keeping citations inside our own circles.4 Or maybe something else is happening.

It’s difficult to tell exactly why there’s such a dramatic difference, but Dunleavy’s article suggested one intriguing explanation — only to reject it. After talking about low incoming citation rates, Dunleavy went on to talk about the h5 scores of journals in various fields. Then he gave us this graph:

Even now, I look at that and have to quash little twinges of insecurity. My field is all the way at the bottom! Panic! But cooler heads prevail. Dunleavy followed the chart with a sliver of analysis before moving towards conclusions, asking “What can or should be done?” Only after asking that question did he address the possibility that publication volume might have something to do with the discrepancy: “The greater volume of STEM science publications is cited as if it explains things — it doesn’t, most individual STEM science papers are not often cited.”

As I read that sentence, my first thought was “that sentence belongs way earlier in the post.” It’s part of the analysis, not a conclusion. And what’s the logic behind it? Dunleavy didn’t explain. That means we have to do some math.

### Scale Models

Let’s begin by considering the statistics that Dunleavy discussed: citation rate and h5 index. The h5 index is easier to specify, so I’ll start with that. Google offers this definition: “h5-index is the h-index for articles published in the last 5 complete years. It is the largest number h such that h articles published in 2009-2013 have at least h citations each.” Articles older than five years effectively expire for the purpose of this statistic, and their citations are no longer counted.

Citation rate is a bit harder to work out. Dunleavy didn’t explicitly state what kind of citations the citation rate statistic counts, but if it were counting outgoing citations, then the chart he begins his post with would be hard to take seriously. I am quite confident that there are more than four outgoing citations per publication in all of the fields listed on that chart. So it must count incoming citations. Dunleavy also didn’t specify which citations it counts. For consistency, I’ll take five years as the cutoff for this statistic as well: articles older than five years expire.

Now let’s test Dunleavy’s claims on a simple, artificial example. Say you have two fields, A and B. Field A produces one thousand articles per year; field B produces ten thousand. Dunleavy’s first claim was that given similar citation practices, the increase in volume will not significantly affect the citation statistics he’s talking about.5 And his second was that most of the articles in either field will not be cited.

“Most” is a little vague, so let’s say that in either field, all outgoing citations in a particular year will be evenly distributed among the top ten percent of articles from the previous year. Dunleavy also didn’t remark on the number of journals in each field, so let’s suppose that field A has fifty journals and field B has two hundred. And to keep things simple, let’s say the top articles are distributed evenly across all journals. We’ll also assume that all articles cite just ten articles from the previous year. These are unrealistic assumptions, but they aren’t totally outlandish, and they should at least help us learn some things about what Dunleavy has claimed.

Let’s start with field A. For any given year, there will be one thousand articles published in the field. They will generate ten citations each, for a total of ten thousand citations. Those citations will be evenly distributed across the top ten percent of articles from the previous year — one hundred articles receiving one hundred citations each. Those articles will be evenly distributed across all fifty journals — two each. So for that year, there will be just two articles per journal with at least two citations; they will both count towards the journal’s h5 score. Over five years, there will be four such pairs of articles (because the articles from the most recent year won’t be cited until next year). Written out numerically, there are $1000 \times 10 / 100 = 100$ citations per cited article, and there are $4 \times 100 / 50 = 8$ cited articles per journal. So that’s a total of eight articles published per journal that received at least eight citations in the last five years in field A, giving an h5 score of eight for every journal in the field.

The calculation for the citation rate is slightly different. Every year, a set of ten thousand citations are generated and distributed evenly among last year’s journals, and four sets will be produced that count for a given five-year span. That’s a total of forty thousand citations, divided evenly among fifty journals. Those journals together produce five thousand articles over five years, and so the citation rate is eight.

On to field B. For any given year, there will be ten thousand articles in this field, each citing ten papers, for a total of one hundred thousand citations. Those citations will be divided among the previous year’s top ten percent of papers — one thousand papers this time, each receiving one hundred citations. This time, those thousand articles are divided among two hundred journals. That’s five articles per journal, and four sets of five per five-year period. Numerically, there are $10000 \times 10 / 1000 = 100$ citations per cited article, and $4 \times 1000 / 200 = 20$ cited articles per journal. That gives us twenty articles published per journal with at least twenty citations each, for an h5 score of twenty for every journal in the field.

And now for the citation rate: every year, one hundred thousand citations are generated, with four sets produced over five years. That’s a total of four hundred thousand citations, divided evenly among two hundred journals. Those journals will produce fifty thousand articles in total, and so the citation rate is again eight.

So publication volume has indeed affected the h5 statistic, though perhaps in a slightly different way than Dunleavy was talking about. The change in the number of articles published per year had no effect. But the change in the number of articles published per journal had a dramatic effect. Had the number of journals in field B also gone up by a full order of magnitude, to five hundred, there would have been no difference in either statistic; had the number of journals in field B only doubled to one hundred, the difference in their h5 statistics would have been even more noticeable. This might seem a bit like cheating: I didn’t scale all the values equally. But that’s arguably more realistic. A larger field will support — and may even require — larger journals that publish more frequently.

Now consider the fact that whereas Nature publishes weekly issues that each contain between ten and twenty articles and “letters” (with full bibliographies), even a very large, respected humanities journal like PMLA might publish only four or five issues a year, each containing between ten and twenty articles and other papers with full bibliographies. That’s a minimum of five hundred articles per year from Nature, compared to a minimum of fifty per year from PMLA. An order of magnitude difference.

Once you’ve worked through the mathematics, it’s not surprising. Journals that publish more articles will naturally capture more citations, all else being equal. And it’s a pattern that you can see in real data. Consider SCIMAGO‘s list of top journals. The correlation between the “Total Docs” statistic and the “H index” statistic is immediately noticeable. Try sorting the output by “H index” — the first journal with fewer than five hundred publications over three years is ranked fifty-ninth. Sixty three of the first hundred have more than a thousand citable documents over three years, and many have more than three thousand. Most humanities journals have fewer than two hundred. In total, the SCIMAGO database contains more than a thousand journals with a thousand citable documents over three years. None of them are dedicated to the humanities.6

At one point, Dunleavy wrote “the gulf charted here isn’t the product of just a few science super-journals.” What about a thousand science super-journals?

### Simulating Citation Networks

But let’s assume that’s all just a coincidence. It might not hold up to further scrutiny. And recall that the assumptions I made for the simple calculations above are highly artificial. What would happen if we used a more realistic set of assumptions? I decided to try creating a citation simulator to see. Rather than trying to work out some kind of probabilistic closed-form h5 equation, I wrote a script that simulates thirty years of publication, displaying h5 values and citation rates for each year. I found that its behavior was unpredictable, and sensitive in complex ways to various inputs. But the results also seemed reasonable — they looked like the kinds of statistics one sees browsing through Google Scholar.

It still makes simplifying assumptions that are not realistic, but it does a much better job imitating the particular kind of rich-get-richer power law behavior of citation networks. There are no arbitrary values determining which articles will be cited and which will not be, but articles that already have citations will be more likely to receive additional citations. Here’s an enumeration of the assumptions the simulator makes, and the values it allows you to tune:

1. A set number of journals publish articles in a given field. The number is tunable.
2. Each journal publishes a set number of issues per year. The number is tunable.
3. Each journal publishes a set number of articles per issue. The number is tunable.
4. Each article cites a set number of other articles. The number is tunable.
5. Citations for each article are chosen randomly, but with a bias towards articles that have already received citations. The probability that a given article will receive a new citation is proportional to the number of citations it has already received, plus one. The code provides a tunable skew parameter that strengthens or weakens the bias towards oft-cited articles.7 Articles become available for citation in the issue cycle after they are published.
6. Between each issue cycle, some articles are forgotten or randomly superseded by others, and expire for the purpose of citation. The probability that an article will expire in a given cycle is the same for all articles, and is tunable.

To the degree that these parameters correspond to actual scholarly practices, a number of them are likely to vary widely between disciplines. For example, the speed with which articles expire in the humanities will probably be lower, and so older articles will be cited more often. And the number of issues published per year will often be lower. As it happens, those are two values that the h5 index is very sensitive to. It’s often less sensitive to the number of citations per article. For example, given the simulator’s initial default settings, if you multiply the number of issues per year by ten, the top journal’s h5 index increases almost threefold, for an average increase of about four percent per additional issue. But given those same initial defaults, if you multiply the number of outgoing citations per article by ten, the top journal’s h5 index changes by just thirty-five percent — an average increase of about four tenths of a percent per additional citation.8 A field that wanted to double its h5 numbers under these circumstances could publish twenty or twenty-five more issues of each journal per year — or cite two hundred more sources per article on average.

The sensitivity of the index to the number of outgoing citations depends partially on the bias parameter; when the bias towards famous articles is lower, increasing the number of outgoing citations has a greater effect. But the bias has to be quite low — distributing citations almost evenly among articles that haven’t been forgotten or superseded — before changes in outgoing citation rate are as significant as changes in the number of issues per journal. This pattern also makes sense in light of the calculations above. The citations were far too concentrated on the top articles; had they been spread out among other articles, the resulting h5 scores would have been higher. The bias and decay values also influence the relationship between outgoing citations and the field-wide citation rate; for some values, the field-wide citation rate can be as low as five percent of the outgoing citation rate, because so many of the outgoing citations are going to older articles that have expired for the purpose of the calculation.

There are a number of phenomena the simulator does not try to model at all. For example, it assumes that there is no particular bias in favor of one journal over another. Arguably even a mild bias could skew the results dramatically. In its current form, the simulator tends to produce fields that balance citations relatively evenly across all journals. A more realistic simulation might distribute the majority of citations over the top thirty or forty percent of journals; this would probably drive those journals’ h5 indices even higher.

But my goal is not to produce a perfectly realistic simulator. My goal is to show that a simulator that approaches even a moderate level of realism produces complex, unpredictable, nonlinear relationships between many different variables. Suppose we assume for the moment that the numbers that Google Scholar produces for humanities journals are as reliable as the numbers it produces for the sciences.9 And suppose we assume that we really should want the h5 indices of our journals to go up. We can’t expect to get a straightforward linear response by citing more articles and crossing our fingers. Given some very reasonable assumptions about publication conventions in the humanities, there’s a good chance that citing more articles will have only a small effect. The effect will certainly not be large enough to address the score gap between the humanities and the sciences. Other decisions about citation will matter more: which articles we cite, how recently they were published, which journals they were published in, and the number of articles those journals publish.

The assumptions that lead to those conclusions are not based on any evidence. This simulator can’t tell us anything about the actual state of the humanities. Perhaps a field-wide increase in outgoing citation rates would dramatically boost incoming citation rates and h5 scores. We can’t be certain without more careful investigation.

However, that means being doubly skeptical of hasty conclusions that reinforce popular stereotypes about the humanities and the sciences. I was troubled at times while reading Dunleavy’s post — especially when he implied that fields with lower citation rates are more likely to harbor scholars who are “ignoring other views, perspectives and contra-indications from the evidence.” The humanists I know make special effort to do just the opposite, because they know that the kind of research we do is often more vulnerable to ideological bias than research in the sciences. And we can’t shift the burden of objectivity onto our methods; we can’t pretend to be passive spectators, as some scientists might. To do good intellectual work, we have to confront our bias directly, paying careful attention to conflicting evidence from multiple perspectives. That’s challenging, certainly, but I’m not at all convinced that we draw our conclusions in more biased ways than scientists do.

It also troubled me when Dunleavy cited an op-ed by Mark Bauerlein suggesting that literary scholars should give up doing research altogether. Bauerlein is building his reputation as an ivory tower provocateur, and some have used even stronger language to describe his recent output — words like “trolling” and “clickbait” come to mind. The fact that he and Dunleavy might agree about this doesn’t exactly give me confidence that Dunleavy’s perspective is unbiased.

Despite those issues, I remain sympathetic to his call for citation reform. I would not have written this post if his hadn’t called for a thoughtful response. His suggestion of adopting systematic review deserves serious consideration; it might even address some of the factors that could be leading to the apparent dearth of incoming citations in the humanities, because it concerns not only the number of outgoing citations, but also their distribution over time and across the field. Although the literary scholars I know conduct secondary research in thorough and systematic ways, they each do it a little differently. It would be helpful to articulate a clearer set of field-wide standards for secondary research and citation practices.

But if we choose to do that, we shouldn’t worry about increasing the h5 statistics of our journals. We shouldn’t worry about the impact our work has within some arbitrary time frame. We should worry about creating better literary scholarship.

1. “Common Knowledge: Epistemology and the Beginnings of Copyright Law,” forthcoming in PMLA
2. It has been cited exactly 11111 times as of this writing, reports Google Scholar.
3. Unless, that is, it turns out that by citing secondary sources fifty percent more, we could increase our citation rate by four or five times. That seems unlikely.
4. I cited several historians of philosophy in my paper — those citations were “lost” for my field. When this occurred to me I thought “Oops! Well, C’est la vie.” Never mind that this directly contradicts Dunleavy’s advice to avoid “discipline-siloed” citation practices.
5. To be perfectly explicit, I am interpreting Dunleavy’s claim as entailing the contrapositive of the following premise: if higher publication volume significantly increases h5 statistics, then higher publication volume explains at least part of the gap between the humanities and the sciences.
6. The size of these journals is surely related to the publication incentive structures at work in the sciences. And at least one Nobel-winning scientist, Randy Schekman, has argued that scientific incentive structures are broken: “Mine is a professional world that achieves great things for humanity. But it is disfigured by inappropriate incentives.”
7. The citation sampler selects papers using a bin-based sampling process that’s fast and works intuitively, but that has zero theoretical justification. The papers are placed in an array of bins; papers with more citations appear towards the beginning of the array, and get more bins. Then a random number between zero and one is chosen and multiplied by the number of bins. That number is used as an index into the array of bins. The bias parameter is applied in one of two ways. If it’s greater than one, then the random number is raised to the power of the bias parameter before being multiplied by the number of bins. So if the bias parameter is two, then the square of the number is used. This pushes the values downwards — recall that the square of one half is one quarter — towards the most often cited papers. If the bias parameter is less than one, then the number of bins allocated to each paper is raised to the power of the bias parameter. So if the bias parameter is one half, then a paper that already has sixteen citations gets only four bins. This strategy produces a smooth transition between the two kinds of bias, and produces fewer bins than citations when possible, but never more.
8. The script is set with the following defaults: five hundred journals, five issues per year, ten articles per issue, and ten citations per article. The bias parameter is set to one (a standard rich-get-richer bias), and the decay parameter is set to seventy-five percent.
9. There are strong reasons to doubt this. Google Scholar has some fairly specific requirements for inclusion. Do as many humanities journals as science journals worry about meeting those requirements? Almost certainly not. And Google’s coverage for journals in my field looks incredibly dodgy. I don’t blame Google for that — but it certainly should have some bearing on the kind of reform we aim for. Let’s work on getting our journals properly indexed before we start overhauling our entire field.

# Model Corpora for Distant Reading Research

There has been some recent discussion on Twitter in response to Alan Liu’s call for model corpora for students. It’s an extremely exciting conversation that’s long overdue. We need better model corpora. And I think it’s important for DH scholars to recognize that this is not just a pedagogical requirement — it’s a requirement for research at every level.1

Consider the field of machine learning. Almost anyone who has ever done research in the field has heard of the MNIST database. It’s a collection of 60,000 handwritten numerical digits, labeled with their actual values, preprocessed for ease of use, and carefully designed to pose a substantial challenge for learning algorithms. Achieving a decent error rate isn’t too hard, but achieving a very low error rate seems to require things like 6-layer feed-forward neural networks with thousands of neurons in each layer.

What’s great about the MNIST database is that many, many people have used it to test their work. Now, whenever you try a new learning algorithm, you can toss the MNIST data at it. If your algorithm achieves a reasonable error rate, you can feel pretty confident that you’re not making some kind of gross error, and that your algorithm really is recognizing handwriting, rather than noticing and generalizing from irrelevant details. Unfamiliar data could have surprising biases or unexpected patterns that throw off the performance of your algorithm, but the MNIST database is carefully curated to avoid many such problems. That’s vital for teaching, but it’s just as vital for research.

Andrew Ng compares this to the use of model organisms in biology. Mice, for example, consume few resources, develop and reproduce quickly, and share fundamental biological traits with many mammals, including humans. For those reasons, people have been studying them for a very long time and their biology is well understood. Researchers who want to study mammals, and who are already committed to making the kinds of ethical trade-offs that animal experimentation entails, will almost certainly start with mice or other related model organisms. The epistemological, practical, and ethical benefits are manifold. There will be fewer ways for the work to go wrong in unexpected ways, the research will proceed more efficiently, and fewer animals will suffer overall.

Fortunately, digital humanists don’t face the same ethical questions as biologists. Our “mouse models” can consist of nothing but data. But we still don’t have enough of them.2

I found the absence particularly frustrating as I sat down to play with Syuzhet a couple of weeks ago. I was initially at a loss about what data to use. It quickly occurred to me that I ought to start with Romeo and Juliet because that’s what other researchers had used, and for good reason. It’s familiar to a broad range of audiences, so it’s easy to get data about it from actual human beings. It has large variations in emotional valence with relatively clear trajectories. And well, you know — it’s Shakespeare. But one play by one author isn’t really good enough. What we need is a model corpus — or rather, many model corpora from different periods, in different genres, different languages, different registers, and so on.

There has been some discussion about how to construct corpora that are representative, but in these discussions, the question is often about whether the corpus gives us access to some kind of ground truth about the culture from which it is sampled.3 That’s an extremely important question — one of the most important in the digital humanities. But I worry that we’re not quite ready to begin answering it. We don’t know whether corpora are representative, but we also don’t know for sure what tools to use in our investigations. And it’s going to be difficult to refine our tools and build new representative corpora at the same time. In our rush to take advantage of the sudden ubiquity of literary and historical data, we might be skipping a step. We need to understand the tools we use, and to understand the tools we use, we need to test them on corpora that we understand.4

From one perspective, this is a matter of validation — of checking new results against what we already know.5 But it’s a different kind of validation than many of us are used to — where by “us” I mean mostly literary scholars, but possibly other humanists as well. It doesn’t ask “is this a claim that comports with known evidence?” It asks “what is the precise meaning of this claim?” This second question becomes important when we use an unfamiliar tool to make a claim; we need to understand the claim itself before saying whether it comports with known evidence.

From another perspective, this is a matter of theorization — of clarifying assumptions, developing conceptual structures, and evaluating argumentative protocols. But it’s a different kind of theory than many of us are used to. It doesn’t ask “what can we learn by troubling the unspoken assumptions that make this or that interpretation seem obvious?” It asks “how can we link the representations these tools produce to familiar concepts?” Literary theory has often involved questioning the familiar by setting it at a distance. But distant reading already does that — there are few obvious interpretations in the field right now, and theory may need to play a slightly different role than it has in previous decades.

From either perspective, the point of a model corpus would not be to learn about the texts in the corpus, or about the culture that produced those texts. It would be to learn about the tools that we use on that corpus, about the claims those tools might support, and about the claims they cannot support.

But what should a model corpus look like? This is where I become less certain. My first instinct is to say “let’s look at what corpus linguists do.” But the kinds of questions linguists are likely to ask are very different from the ones that literary scholars are likely to ask. Still, there are some great starting points, including a remarkably comprehensive list from Richard Xiao. Among those, the ARCHER corpus seems particularly promising. (Thanks to Heather Froelich for these suggestions!)

But in the long run, we’ll want to produce our own corpora. Fortunately, Alan Liu has already given this some thought! His focus is on pedagogical issues, but again, the kinds of model corpora he talks about are vital for research as well. On Twitter, he offered a brilliant enumeration of desirable qualities such corpora would have. I’m reproducing it here, lightly paraphrased:

Imagine what a ready-for-student-use corpus of literary materials would look like. Specs include the following:

1. Free and public domain.
2. Of manageable size (e.g., low thousands and not hundreds of thousands of items).
3. Modular by nation, genre, period, language.
4. Socioculturally diverse.
6. Pre-cleaned and chunked (or packaged with easy-to-use processing scripts).
7. Compatible in format with similar corpora of materials in history and other fields (to encourage cross-domain experiments in analysis).
8. Matched by period and nation to available linguistic corpora that can be used as reference corpora.

I think this is a terrific set of initial goals for model corpora, both for researchers and students. We’ll need to keep having conversations about requirements, and of course no one corpus will serve all our needs. Different disciplines within the humanities will have different requirements. But it’s clear to me that if digital humanists can develop a “canon” of familiar corpora useful for validating new tools, the field will have taken a significant step forward.

Let’s get started!

##### Appendix

Since there are already several links to helpful resources for thinking about humanistic corpora, I’m going to start a corpus creation and curation bibliography here. This will probably graduate into its own page or post.

1. Update: Since posting this, I’ve learned that Laura Mandell, in collaboration with the NovelTM partnership, is working on a proposal for a journal dedicated to digital humanities corpora. I think this will be a fantastic development for the field!
2. There are some examples — such as the much-studied Federalist papers, which might be a good dataset to consider for testing new approaches to authorship attribution. And of course there are numerous standard corpora for use by corpus linguists — the Brown, AP, and Wall Street Journal corpora come to mind for American English, and there are many others. But these corpora have not been selected with literary studies in mind! This is where I parade my ignorance in the hope that someone will enlighten me: are there other corpora designed to be of specific methodological interest to literary scholars?
3. Most recently, Scott Weingart took up the question of corpus representativeness in his discussion of the Great Tsundoku, focusing on questions about what was written, what was published, and what was read. He also noted a conversation from a few years ago that Ted Underwood documented, and drew attention to Heather Froelich, who does lots of careful thinking about issues like this. And needless to say, Franco Moretti was thinking about this long ago. I think we’ll be asking this question in different ways for another fifty years.
4. Initially I said “We need to understand the tools we use first,” but that’s not quite right either. There’s a cyclical dependency here!
5. This has been a topic of widespread interest after the recent Syuzhet conversation, and I think the kinds of collective validation that Andrew Piper and others have called for would be made vastly easier by a somewhat standardized set of model corpora familiar to many researchers.

# Directed Probabilistic Topic Networks

Suppose you’re standing in front of your bookcase, feeling a little bored. You pick up a book at random, and read a few pages on a random topic. It piques your curiosity, so you put the first book away and pick another one that has something to say about the same topic. You read a few pages from it and notice a second topic that interests you. So you pick up a third book on that topic, and that book draws your attention to yet another topic. And you continue moving from book to topic to book to topic — forever.

Wouldn’t be interesting if we could describe that process mathematically?

For the for the last few months I’ve been thinking about the best way to create useful networks with topic models. People have been creating network visualizations of topic models for a long time now, but they sometimes feel a bit like window dressing.1 The problem is that we don’t know what these networks actually represent. The topics are just blobs linked together and floating in a mysterious, abstract space. But what if we could create a network with a clear and concrete interpretation in terms of a physical process that we understand? What if we could create a network that represents the process of browsing through the books on a bookshelf?

I have struck upon a formula that I think does just that — it describes the probability of moving from one topic to another while browsing through a corpus. Remarkably, the formula is very similar to the formula for cosine similarity, which is one of the more popular ways of measuring the similarity between topics. But it differs in crucial ways, and it creates a kind of topic network that I haven’t seen before.2 I’d like to hear what others think about it.

I’ve developed two different theoretical arguments that suggest that the networks this formula creates are more useful than the networks that cosine similarity creates. One argument is related to the theory of bimodal networks, and the other is related to the theory of Markov chains. I have several posts queued up that go into the details, but I’ve decided not to post them just yet. Instead, I’m going to let the method speak for itself on practical grounds. I’ll post more once I feel confident that the result is worth the cognitive investment. However, if you’re interested in the fundamental math, I’ve posted a derivation.

For now, I’ll assume that most readers are already familiar with — or else are profoundly indifferent to — a few background ideas about topic modeling, cosine similarity, and topic networks.3 I hope that won’t exclude too many people from the conversation, because my core argument will be mostly visual and practical: I think that visualizations of these networks look better, and I think the idea of a “browsing similarity” between topics sounds useful — do you?

So feel free to skip past the wonkish bits to the diagrams!

In my own experimentation and research, I’ve found that browsing similarity creates topic networks that differ in several ways from those that cosine similarity creates. First, they distribute links more uniformly between nodes. It’s desirable to simplify topic networks by cutting links with a flat threshold, because the result is easy to reason about. When you do that with this new kind of network, most of the nodes stick together in one loose clump with lots of internal clustering. Second, they invite a probabilistic interpretation with some interesting and well understood theoretical properties.4 Those properties ensure that even some of the more abstract network-theoretical measures, like eigenvalue centrality, have concrete interpretations. And third, they are directed — which says some important things about the relationships between topics in a corpus.

Below are three network diagrams based on a topic model of two thousand eighteenth-century books published between 1757 and 1795.5 Each has 150 nodes (one for each topic in the model). The strengths of the links between each of the nodes are calculated using either cosine similarity or browsing similarity between topic vectors. Each vector is a sequence of book proportions for a given topic. To usefully visualize a topic model as a network, you must cut some links, and the easiest approach is to apply some kind of threshold. Links stronger than some value stay, and the rest are cut. In all the network diagrams below, I’ve selected threshold values that produce exactly 225 links. For layout, I’ve used D3’s force-directed layout routine, so the diagrams will look a little different each time you reload this page.6

In the first diagram, I’ve used cosine similarity with a simple flat threshold. The result is a hairball with a lot of little islands floating around it:

To deal with this problem, Ted Underwood came up with a really clever link-cutting heuristic that produces much cleaner network diagrams. However, it’s a little ad-hoc; it involves retaining at least one link from every node, and then retaining additional links if they’re strong enough. It’s like a compromise between a flat threshold (take all links stronger than $x$) and a rank-based threshold (take the strongest $n$ links from each node).

In the second diagram, I’ve used cosine similarity again, but applied a variation on Underwood’s heuristic with a tunable base threshold.7

The result is much more coherent, and there’s even a bit of suggestive clustering in places. There are a few isolated archipelagos, but there are no singleton islands, because this method guarantees that each node will link to at least one other node.

Now for the browsing similarity approach. In the third diagram, I’ve used browsing similarity with a simple flat threshold:

Although this diagram has both singleton islands and archipelagos, it’s far more connected than the first, and it has almost as many mainland connections as the second. It also shows a bit more clustering behavior than diagram two does. But what I find most interesting about it is that it represents the concrete browsing process I described above: each of the edges represents a probability that while browsing randomly through the corpus, you will happen upon one topic, given that you are currently reading about another.8 That’s why the edges are directed — you won’t be as likely to move from topic A to topic B as from topic B to topic A. This makes perfect sense: it ought to be harder to move from common topics to rare topics than to move from rare topics to common topics.

Because I wanted to show the shapes of these graphs clearly, I’ve removed the topic labels. But you can also see full-screen versions of the cosine, Underwood, and browsing graphs, complete with topic labels that show more about the kinds of relationships that each of them preserve.

Here’s everything you need to play with the browsing similarity formula. First, the mathematical formula itself:

$\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}$

You can think of $b$ as standing for “book,” and $X$ and $Y$ as two different topics. $x_1$ is the proportion of book 1 that is labeled as being about topic $X$, and so on. The formula is very similar to the forumla for cosine similarity, and the tops are identical. Both calculate the dot product of two topic-book vectors. The difference between them is on the bottom. For browsing similarity, it’s simply the sum of the values in $Y$, but for cosine similarity, it’s the product of the lengths of the two vectors:

$\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}$

Here a bit of jargon is actually helpful: cosine similarity uses the “euclidean norm” of both $X$ and $Y$, while browsing similarity uses only the “manhattan norm” of $Y$, where “norm” is just a ten dollar word for length. Thinking about these as norms helps clarify the relationship between the two formulas: they both do the same thing, but browsing similarity uses a different concept of length, and ignores the length of $X$. These changes turn the output of the formula into a probability.

Next, some tools. I’ve written a script that generates Gephi-compatible or D3-compatible graphs from MALLET output. It can use cosine or browsing similarity, and can apply flat, Underwood-style, or rank-based cutoff thresholds. It’s available at GitHub, and it requires numpy and networkx. To use it, simply run MALLET on your corpus of choice, and pass the output to tmtk.py on the command line like so:

./tmtk.py network --remove-self-loops \
--threshold-value 0.05 \
--threshold-function flat \
--similarity-function browsing \
--output-type gexf \
--write-network-file browsing_sim_flat \
mallet_output.composition


It should be possible to cut and paste the above command into any bash terminal — including Terminal in OS X under default settings. If you have any difficulties, though, let me know! It may require some massaging to work with Windows. The command should generate a file that can be directly opened by Gephi. I hope the option names are obvious enough; more detailed information about options is available via the --help option.

In case you’d prefer to work this formula into your own code, here is a simplified version of the browsing_similarity function that appears in the above Python script. Here, A is a matrix of topic row vectors. The code here is vectorized to calculate every possible topic combination at once and put them all into a new matrix. You can therefore interpret the output as the weighted adjacency matrix of a fully-connected topic network.

def browsing_similarity(A):
A = numpy.asarray(A)
norm = A.T.sum(axis=0)
return numpy.dot(A, A.T) / norm


And here’s the same thing in R9:

browsingsim norm = rowSums(A)
dot = A %*% t(A)
return(t(dot / norm))
}


Matrix calculations like this are a dream when the shapes are right, and a nightmare when they’re wrong. So to be ridiculously explicit, the matrix A should have number_of_topics rows and number_of_books columns.

I have lots more to say about bimodal networks, conditional probability, Markov chains, and — at my most speculative — about the questions we ought to ask as we adapt more sophisticated mathematical techniques for use in the digital humanities.

But until then, comments are open!

1. Ted Underwood has written that “it’s probably best to view network visualization as a convenience,” and there seems to be an implicit consensus that topic networks are more visually stunning than useful. My hope is that by creating networks with more concrete interpretations, we can use them to produce evidence that supports interesting arguments. There are sure to be many details to work through and test before that’s possible, but I think it’s a research program worth developing further.
2. I’ve never seen anything quite like this formula or the networks it produces. But I’d love to hear about other work that people have done along these lines — it would make the theoretical burden much lighter. Please let me know if there’s something I’ve missed. See also a few near-misses in the first footnote to my post on the formula’s derivation. [Update: I found a description of the Markov Cluster Algorithm, which uses a matrix that is similar to the one that browsing similarity produces, but that is created in a slightly different way. I’m investigating this further, and I’ll discuss it when I post on Markov chains.]
3. If you’d like to read some background material, and you don’t already have a reading list, I propose the following sequence: Matt Jockers, The LDA Buffet Is Now Open (very introductory); Ted Underwood, Topic Modeling Made Just Simple Enough (simple enough but no simpler); Miriam Posner and Andy Wallace, Very Basic Strategies for Interpreting Results from the Topic Modeling Tool (practical approaches for quick bootstrapping); Scott Weingart, Topic Modeling and Network Analysis (introduction to topic networks); Ted Underwood, Visualizing Topic Models (additional theorization of topic visualization).
4. Specifically, their links can be interpreted as transition probabilities in an irreducible, aperiodic Markov chain. That’s true of many networks, strictly speaking. But in this case, the probabilities are not derived from the network itself, but from the definition of the browsing process.
5. I have a ton of stuff to say about this corpus in the future. It’s part of a collaborative project that Mae Capozzi and I have been working on.
6. Because the force-directed layout strategy is purely heuristic, the layout itself is less important than the way the nodes are interconnected. But the visual intuition that the force-directed layout provides is still helpful. I used the WPD3 WordPress plugin to embed these. It’s a little finicky, so please let me know if something has gone wrong.
7. Underwood’s original method kept the first link, the second link if it was stronger than 0.2, and the third link if it was stronger than 0.38. This variation takes a base threshold $t$, which is multiplied by the rank of a given link to determine the threshold it must meet. So if the $n$th strongest link from a node is stronger than $t * (n - 1)$, then it stays.
8. Because some links have been cut, the diagram doesn’t represent a full set of probabilities. It only represents the strongest links — that is, the topic transitions that the network is most biased towards. But the base network retains all that information, and standard network measurements apply to it in ways that have concrete meanings.
9. I had to do some odd transpositions to ensure that the R function generates the same output as the Python function. I’m not sure I used the best method — the additional transpositions make the ideas behind the R code seem less obvious to me. (The Python transpositions might seem odd to others — I guess they look normal to me because I’m used to Python.) Please let me know if there’s a more conventional way to manage that calculation.

# The Markov Chains of La Grande Jatte A Short Introduction to Gibbs Sampling

Topic modeling has been attracting the attention of scholars in the digital humanities for several years now, and quite a few substantive introductions to the subject have been written. Ben Schmidt offered a brief overview of the genre in 2012, and the list he provided is still fairly comprehensive, as far as I can tell.1 My current favorite is an entry from Miriam Posner and Andy Wallace that emphasizes the practical side of topic modeling — it’s great for bootstrapping if you’re new to the subject.

This post will cover something slightly different. When I started to delve into the details of topic modeling, I quickly realized that I needed to create my own implementation of Latent Dirichlet Allocation (LDA) to begin understanding how it worked. I eventually did, but even with all the terrific resources available, I ran into several significant roadblocks.2 The biggest one for me was figuring out Gibbs sampling. A lot of introductions to topic modeling don’t spend much time on Gibbs sampling, for understandable reasons. It’s not part of LDA properly speaking, so you don’t need to understand how it works to understand the fundamentals of LDA. In fact, in his original description of LDA, David Blei didn’t even talk about Gibbs sampling — he used a thing called “variational inference,” which is a wall of abstraction that I still haven’t managed to scale.

Fortunately, Gibbs sampling yielded to my efforts more readily. And although it’s not strictly necessary to understand Gibbs sampling to understand LDA, I think it’s worth understanding for other reasons. In fact, I’ve come to believe that Gibbs sampling is a wonderful introduction to the rapidly evolving world of machine learning — a world that I think at least a subset of digital humanists should have much broader knowledge of.

### What is Gibbs sampling?

Here’s my attempt at a definition: Gibbs sampling is a way to build a picture of a global probability distribution when you only have local information about that distribution. That’s more of a description than a definition; other techniques do that too. But I like it because it shows what Gibbs sampling is good at. You can use it to take lots of little bits of information — like individual word counts — and construct a global view of those bits.

Suppose that you are temporarily Georges Seurat, but you couldn’t make it to the Island of La Grande Jatte today. Instead of seeing it for yourself or looking at someone else’s picture, you decide to consult with Sam, your omniscient imaginary friend. Sam supplies you with some probabilities like so:

Given that you have just put a green dot here (Sam points at a spot on the canvas):

• The probability is $\mu$ that your next dot will be an orange dot there.
• The probability is $\eta$ that your next dot will be a blue dot over there.
• The probability is … [more tiny numbers]

This list goes on until every possible location on the canvas and every possible color has a probability associated with it. It turns out they all add up to one. (They’re probabilities, after all!) Then Sam gives you another list that starts with another location and possible color. You get lists from every possible point and color on the canvas, to every possible point and color on the canvas. Now, at any moment while painting, you can look up the dot you’ve just painted in the table. You can then use that dot’s transition probabilities to decide how to paint the next one.

So you just start painting dots. And lo and behold, after a really long time, you’re looking at a picture of La Grande Jatte.

What I’ve just described is called a Markov chain.3 Gibbs sampling adds just one more little twist. But before I get to that, I want to explain why this is possible. Sam’s table of probabilities has to meet three conditions for this to work. The first two dictate the kinds of movements between points and colors that the table of probabilities must allow. First, the table of probabilities must allow you to get from any point and color in the painting to any other. It doesn’t have to allow you to get from one to the other in a single step, but it has to allow you to get there eventually. And second, the table of probabilities must allow you to get from one point and color to another at irregular intervals. So if it aways takes you two, or four, or eight steps to get from node A to node B, and never any other number of steps, then the table doesn’t satisfy this condition, because the number of steps required to get from A to B is always a multiple of two.

Together, these conditions tell us that the Markov chain has what’s called a stationary distribution.4 It’s a probability distribution over every point and possible color on the canvas. It tells you how often you will paint a particular dot, on average, if you keep painting forever. If Sam’s table meets these first two conditions, then we can prove that it has a stationary distribution, and we can even prove that its stationary distribution is unique. At that point, it only has to meet one more condition: its stationary distribution must be a painting of La Grande Jatte.

What’s neat about this is that none of the individual transition probabilities know much about the painting. It’s only when they get together and “talk” to one another for a while that they start to realize what’s actually going on.5 That’s what Gibbs sampling allows.

### The Catch

The difficult part of using Markov chains this way is figuring out the transition probabilities. How many coordinates and color codes would you need to create an adequate representation of a Seurat painting? I’m not sure, but I bet it’s a number with a lot of zeros at the end. Call it $N$. And to create the full transition table, you’d have to calculate and store probabilities from each of those values to each of those values. That’s a big square table with $N$ rows and columns. These numbers get mind-bogglingly huge for even relatively simple problems.

Gibbs sampling uses a clever trick to get around that issue. It’s based on the simple insight that you don’t have to change every dimension at once. Instead of jumping directly from one point and color to another — from $(x_1, y_1, c_1)$ to $(x_2, y_2, c_2)$ — you can move along one dimension at a time, jumping from $(x_1, y_1, c_1)$ to $(x_2, y_1, c_1)$ to $(x_2, y_2, c_1)$ to $(x_2, y_2, c_2)$, and so on. It turns out that calculating probabilities for those transitions is often much easier and faster — and the stationary distribution stays the same.

In effect, this means that although you might not be able to calculate all the transition probabilities in the table, you can calculate all the relevant translation probabilities pretty easily. This makes almost no practical difference to you as you paint La Grande Jatte. It just means you do three lookups instead of one before painting the next dot. (It also might mean you don’t paint the dot every time, but only every fifth or tenth time, so that your dots aren’t too tightly correlated with one another, and come closer to being genuinely independent samples from the stationary distribution.)

In the context of the LDA model, this means that you don’t have to leap from one set of hypothetical topic labels to an entirely different one. That makes a huge difference now, because instead of working with a canvas, we’re working with a giant topic hypercube with a dimension for every single word in the corpus. Given that every word is labeled provisionally with a topic, we can just change each topic label individually, over and over, using transition probabilities from this formula that some really smart people have helpfully derived for us. And every time we save a set of topic labels, we’ve painted a single dot on the canvas of La Grande Jatte.6

### So What?

I began this post with a promise that you’d get something valuable out of this explanation of Gibbs sampling, even though it isn’t part of the core of LDA. I’m going to offer three brief payoffs now, which I hope to expand in later posts.

First, most implementations of LDA use Gibbs sampling, and at least some of the difficulties that LDA appears to have — including some identified by Ben Schmidt — are probably more related to issues with Gibbs sampling than with LDA. Think back to the requirement that to have a stationary distribution, a Markov chain has to be able to reach every possible state from every other possible state. That’s strictly true in LDA, because the LDA model assumes that every word has a nonzero probability of appearing in every topic, and every topic has a nonzero probability of appearing in every document. But in some cases, those probabilities are extremely small. This is particularly true for word distributions in topics, which tend to be very sparse. That suggests that although the Markov chain has a stationary distribution, it may be hard to approximate quickly, because it will take a very long time for the chain to move from one set of states to another. For all we know, it could take only hours to reach a result that looks plausible, but years to reach a result that’s close to the actual stationary distribution. Returning to the Grand Jatte example, this would be a bit like getting a really clear picture of the trees in the upper-right-hand corner of the canvas and concluding that the rest must be a picture of a forest. The oddly conjoined and split topics that Schmidt and others have identified in their models seem a little less mysterious once you understand the quirks of Gibbs sampling.

Second, Gibbs sampling could be very useful for solving other kinds of problems. For some time now, I’ve had an eccentric obsession with encoding text into prime numbers and back into text again. The source of this obsession has to do with copyright law and some of the strange loopholes that the idea-expression dichotomy creates.7 I’m going to leave that somewhat mysterious for now, and jump to the point: part of my obsession has involved trying to figure out how to automatically break simple substitution cyphers. I’ve found that Gibbs sampling is surprisingly good at it. This is, I’ll admit, a somewhat peripheral concern. But I can’t get rid of the sense that there are other interesting things that Gibbs sampling could do that are more directly relevant to digital humanists. It’s a surprisingly powerful and flexible technique, and I think its power comes from that ability to take little bits of fragmentary information and assemble them into a gestalt.8

Third, I think Gibbs sampling is — or should be — theoretically interesting for humanists of all stripes. The theoretical vistas opened up by LDA are fairly narrow because there’s something a little bit single-purpose about it. Although it’s remarkably flexible in some ways, it makes strong assumptions about the structure of the data that it analyzes. Those assumptions limit its possible uses as a model for more speculative thinking. Gibbs sampling makes fewer such assumptions; or to be more precise, it accommodates a wider range of possible assumptions. MALLET is a tool for pounding, and it does a great job at it. But Gibbs sampling is more like the handle of a bit-driver. It’s only half-complete — assembly is required to get it to do something interesting — but it’s the foundation of a million different possible tools.

It’s the kind of tool a bricoleur ought to own.

1. If you know of new or notable entries that are missing, let me know and I’ll add them to a list here.
2. You can take a look here. Caveat emptor! I called it ldazy for a reason — it stands for “LDA implementation by someone who is too lazy” to make further improvements. It’s poorly-commented, inefficient, and bad at estimating hyperparameters. (At least it tries!) Its only strength is that it is short and written in pure Python, which means that its code is somewhat legible without additional comment.
3. After writing this, I did some Googling to see if anybody else had thought about Markov chains in terms of pointillism. I didn’t find anything that takes quite the same approach, but I did find an article describing a way to use Markov chains to model brushstrokes for the purpose of attribution!
4. In case you want to talk to math people about this, these conditions are respectively called “irreducibility” and “aperiodicity.”
5. Sorry, I couldn’t resist.
6. I’m risking just a bit of confusion by extending this analogy so far, because it’s tempting to liken colors to topics. But that’s not quite right. To perfect this analogy, expand the canvas into a three-dimensional space in which all green dots occupy one plane, all orange dots occupy another, and so on. In this scheme, the dots are only present or absent — they are themselves “colorless,” and only take on a color insofar as one of the dimensions is interpreted as a color dimension. And suppose the $x$, $y$, and $c$ variables can take values between 1 and 50. Now each dimension could just as easily represent a single word in a three-word corpus, and each dot in this three-dimensional space could represent a sequence of topic assignments for a fifty-topic model — with a value between 1 and 50 for each word in the corpus.
7. “In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work” 17 USC 102
8. For another way of thinking about the possibilities of Gibbs sampling and other so-called Monte Carlo Markov chain (MCMC) methods, see the wonderful sub-subroutine post on using MCMC to learn about bread prices during the napoleonic wars.

# A Sentimental Derivative

Ben Schmidt’s terrific insight into the assumptions that the Fourier transform imposes on sentiment data has been sinking in, and I have a left-field suggestion for anyone who cares to check it out. I plan to investigate it myself when I have the time, but I’ve decided to broadcast it now.

In the imaginary universe of Fourier land, all texts start and end at the same sentiment amplitude. This is clearly incorrect, as I see it.1 But what could we say about the beginning and end of texts that might hold up?

One possibility is that all texts might start and end with a flat sentiment curve. That is, at the very beginning and end of a text, we can assume that the valence of words won’t shift dramatically. That’s not clearly incorrect. I think it’s even plausible.

Now consider how we talk about plot most of the time: we speak of rising action (slope positive), falling action (slope negative), and climaxes (local and global maxima). That’s first derivative talk! And the first derivative of a flat curve is always zero. So if the first derivative of a sentiment curve always starts and ends at zero, then at least one objection to the Fourier transform approach can be worked around. For example, we could simply take the first finite difference of a text’s sentiment time series, perform a DFT and low-pass filter, do a reverse transform, and then do a cumulative sum (i.e. a discrete integration) of the result.2

What would that look like?

1. Nonetheless, I think there’s some value to remaining agnostic about this for some time still — even now, after the dust has settled a bit.
2. You might be able to skip a step or two.

# What’s a Sine Wave of Sentiment?

Over the last month a fascinating series of debates has unfolded over Matt Jockers’ Syuzhet package. The debates have focused on whether Syuzhet’s smoothing strategy, which involves using a Fourier transform and low-pass filter, is appropriate. Annie Swafford has produced several compelling arguments that this strategy doesn’t work. And Ted Underwood has responded with what is probably the most accurate assessment of our current state of knowledge: we don’t know enough yet to say anything at all.

I have something to add to these debates, but I’ll begin by admitting that I haven’t used Syuzhet. I’m only just now starting to learn R. (I’ve been using Python / Numpy / Scipy / Pandas in my DH work so far.) My objection is not based on data or validation, statistical or otherwise. It’s based on a more theoretical line of reasoning.1

I broadly agree with Annie Swafford’s assessment: it looks to me like this strategy is producing unwanted ringing artifacts.2 But Matt Jockers’ counterargument troubles her line of reasoning — he suggests that ringing artifacts are what we want. That doesn’t sound right to me, but that argumentative move shows what’s really at stake here. The question is not whether ringing artifacts distort the data relative to some ground truth. There’s no doubt about that — this is, after all, a way of distorting rough data to make it smooth. The question is whether we want this particular kind of distortion. My issue with using Fourier transforms to represent sentiment time series data is that we have no clear theoretical justification to do so. We have no theoretical reason to want the kind of distortion it produces.

If we hope to use data mining tools to produce evidence, we need to think about ways to model data that are suited to our own fields. This is a point Ted Underwood made early on in the conversation about LDA, well before much had been published by literary scholars on the subject. The point he made is as important now as then: we should do our best to ensure that the mathematical models we use have clear and concrete interpretations in terms of the physical processes that we study and seek to understand: reading, writing, textual distribution, influence, and so on. This is what Syuzhet fails to do at the smoothing and filtering stage right now. I think the overall program of Syuzhet is a promising one (though there may be other important aspects of the thing-that-is-not-fabula that it ignores). But I think the choice of Fourier analysis for smoothing is not a good choice — not yet.

A Fourier transform models time series data as a weighted sum of sine waves of different frequencies. But I can think of no compelling reason to represent a sequence of sentiment measurements as a sum of sine waves. Consider LDA as a point of comparison (as Jockers has). There’s a clear line of reasoning that supports our using the Dirichlet distribution as a prior. One could certainly argue that the Dirichlet density has the wrong shape, but its support — the set of values over which it is defined — has the right shape.3 It’s a set of N distinct real-valued variables that always sum to one. (In other words, it’s a distribution over the ways to break a stick into N parts.) Since we have good reasons to think about language as being made of distinct words, thinking in terms of categorical probability distributions over those words makes sense. And the Dirichlet distribution is a reasonable prior for categorical distributions because its support consists entirely of categorical probability distributions (ways to break a stick). Even if we were to decide that we need a different prior distribution, we would probably still choose a distribution defined over the same support.4

But the support of the function produced by a Fourier transform is the frequency domain of a sinusoidal curve. Is that the right support for this purpose? Setting aside the fact that we’re no longer talking about a probability distribution, I think it’s still important to ask that question. If we could have confidence that it makes sense to represent a sentiment time series as a sum of sinusoidal curves, then we might be able to get somewhere with this approach. The support would be correct, even if the shape of the curve over the frequency domain weren’t quite right. But why should we accept that? Why shouldn’t we be looking at functions over domains of wavelets or chirplets or any number of other possibilities? Why should the sentimental valence of the words in a novel be best represented by sine waves in particular?

I think this is a bit like using a Gaussian mixture model (GMM) to do topic modeling. You can use Gaussian distributions as priors for topic models. It might even be a good idea to do so, because it could allow us to get good results faster. But it’s not going to help us understand how topic modeling works in the first place. The Gaussian prior obscures what’s really going on under the hood.5 Even if we all moved over to Gaussian priors in our topic models, we’d probably still use classic LDA to get a handle on the algorithm. In this case, I think the GMM is best understood as a way to approximate LDA.

And now, notice that we can use a Fourier transform to approximate any function at all. But what does doing so tell us about the function? Does it tell us what we want to know? I have no idea in this case, and I don’t think anyone else does either. It’s anyone’s guess whether the sine waves that this transform uses will correspond to anything at all familiar to us.

I think this is a crucial issue, and it’s one we can frame in terms of disciplinary continuity. Whenever we do any kind of informal reasoning based on word counts alone, we’re essentially thinking in terms of categorical distributions. And I think literary scholars would have paid attention to a well-reasoned argument based on word counts thirty years ago. If that’s right, then LDA simply gives us a way to accelerate modes of reasoning about language that we already understand. But if thirty years ago someone had argued that the movement of sentiment in a novel should be understood through sinusoidal shapes, I don’t think they would have been taken very seriously.

Admittedly, I have no strong justification for this claim, and if there’s widespread disagreement about it, then this debate will probably continue for some time. But either way, we need to start thinking very concretely about what it means to represent sentiment specifically as a sine wave. We will then be able to trust our intuitions about our own field of study to guide us.

1. This means that to a certain degree, I’m not taking Syuzhet in the spirit with which it was offered. Jockers writes that his “primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories.” I’ve not done that, even though I think it’s a sound method. We can’t depend only on statistical measurements; our conclusions need intuitive support. But I also think the theoretical questions I’m asking here are necessary to build that kind of support.
2. I suspected as much the moment I read about the package, though I’m certain I couldn’t have articulated my concerns without Swafford’s help. Update: And I hope it’s clear to everyone, now that the dust has settled, that Swafford has principal investigator status in this case. If she hadn’t started it, the conversation probably wouldn’t have happened at all.
3. The support of a function is the set of inputs that it maps to nonzero outputs.
4. The logic of this argument is closely related to the theory of types in computer programming. One could say that a categorical sampling algorithm accepts variables of the “broken stick” type and samples from them; and one could say that when we sample from a Dirichlet distribution, the output is a variable of the “broken stick” type.
5. The truth of this is strongly suggested to me by the fact that the above cited paper on GMM-based topic modeling initially proposes a model based on “cut points” — a move I will admit that I understand only in vague terms as a way of getting discrete output from a continuous function. That paper looks to me like an attempt to use a vector space model for topic modeling. But as I’ll discuss in a later post, I don’t find vector space models of language especially compelling because I can’t develop a concrete interpretation of them in terms of authors, texts, and readers.

# A Random Entry

There’s a way of telling a history of the digital humanities that does not follow the well known trajectory from Father Busa’s Index Thomisticus, Mosteller and Wallace’s study of the Federalist Papers, and the Text Encoding Initiative — to Distant Reading, data mining, and the present day. It does not describe the slow transformation of a once-peripheral field into an increasingly mainstream one. Instead, it describes a series of missed opportunities.

It’s a polemical history that inverts many unspoken assumptions about the relationship between the humanities and the sciences. I’m not sure I entirely believe it myself. But I think it’s worth telling.

It starts like this: there once was a guy named Frank Rosenblatt. In 1957, Rosenblatt created the design for a device he called the perceptron. It was an early attempt at simulating the behavior of networks of biological neurons, and it initially provoked a frenzy of interest, including the following New York Times report:

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

Needless to say, the perceptron never managed to do any of those things. But what it did do was exceptional in its own way. It used to great practical effect the following insight: that many small, inaccurate rules can be combined in simple ways to create a larger, more accurate rule. This insight is now central to statistical learning theory.1 But it was not understood as particularly important at the time.

In fact, when people began to realize that the perceptron in its simplest form was limited, a backlash ensued. Marvin Minsky and Seymour Papert wrote a book called Perceptrons that enumerated the limits of simple two-layer perceptrons2; people misunderstood the book’s arguments as applying to all neural networks; and the early promise of perceptrons was forgotten.

This turn of events may have delayed the emergence of interesting machine learning technologies by a couple of decades. After the perceptron backlash, artificial intelligence researchers focused on using logic to model thought — creating ever more complex sets of logical rules that could be combined to generate new rules within a unified and coherent system of concepts. This approach was closely related to the kinds of transformational grammars that Noam Chomsky has been exploring since the 1950s, and it largely displaced statistical approaches — with a few exceptions — until the 1990s.

Unsurprisingly, Chomsky remains hostile to statistical and probabilistic approaches to machine learning and artificial intelligence. Nonetheless, there does seem to be some evidence that those approaches have gotten something right. Peter Norvig offers the following summary:

Chomsky said words to the effect that statistical language models have had some limited success in some application areas. Let’s look at computer systems that deal with language, and at the notion of “success” defined by “making accurate predictions about the world.” First, the major application areas:

• Search engines: 100% of major players are trained and probabilistic. Their operation cannot be described by a simple function.
• Speech recognition: 100% of major systems are trained and probabilistic…
• Machine translation: 100% of top competitors in competitions such as NIST use statistical methods…
• Question answering: this application is less well-developed, and many systems build heavily on the statistical and probabilistic approach used by search engines…

Now let’s look at some components that are of interest only to the computational linguist, not to the end user:

• Word sense disambiguation: 100% of top competitors at the SemEval-2 competition used statistical techniques; most are probabilistic…
• Coreference resolution: The majority of current systems are statistical…
• Part of speech tagging: Most current systems are statistical…
• Parsing: There are many parsing systems, using multiple approaches. Almost all of the most successful are statistical, and the majority are probabilistic…

Clearly, it is inaccurate to say that statistical models (and probabilistic models) have achieved limited success; rather they have achieved a dominant (although not exclusive) position.

In the past fifteen years, these approaches to machine learning have produced a number of substantial leaps forward — consider Google’s famous creation of a neural network that (in at least some sense) reinvented the concept of “cat,” or this recurrent neural network capable of imitating various styles of human handwriting. These extraordinary successes have been made possible by a dramatic increase in computing power. But without an equally dramatic shift in ways of thinking about what constitutes knowledge, that increase in computing power would have accomplished far less. What has changed is that the people doing the math have stopped trying to find logical models of knowledge by hand, and have started trying to find probabilistic models of knowledge — models that embrace heterogeneity, invite contradiction, and tolerate or even seek out ambiguity and uncertainty. As machine learning researchers have discovered, the forms these models take can be defined with mathematical precision, but the models themselves tolerate inconsistencies in ways that appear to be unbound by rigid logic.3

I’d like to suggest that by embracing that kind of knowledge, computer scientists have started walking down a trail that humanists were blazing fifty years ago.

The kind of knowledge that these machines have does not take the form of a rich, highly structured network of immutable concepts and relations with precise and predictable definitions. It takes the form of a loose assembly of inconsistent and mutually incompatible half-truths, always open to revision and transformation, and definable only by the particular distinctions it can make or elide at any given moment. It’s the kind of knowledge that many literary scholars and humanists have found quite interesting for the last few decades.

Since the decline of structuralism, humanists have been driven by a conviction that the loosely or multiply structured behaviors that constitute human culture produce important knowledge that cannot be produced in more structured ways. Those humanities scholars who remained interested in structured ways of producing knowledge — like many of the early practitioners of humanities computing — were often excluded from conversations in the humanistic mainstream.

Now something has changed. The change has certainly brought computational methods closer to the mainstream of the humanities. But we mustn’t mistake the change by imagining that humanists have somehow adopted a new scientism. A better explanation of this change is that computer scientists, as they have learned to embrace the kinds of knowledge produced by randomness, have reached a belated understanding of the value of — dare I say it? — post-structuralist ways of knowing.

It’s a shame it didn’t happen earlier.

1. I first encountered the above formulation of this idea in the first video in Geoffrey Hinton’s online course on neural networks. But you can see it being used by other researchers (p. 21) working in machine learning on a regular basis.
2. In machine learning lingo, it could not learn nonlinear decision boundaries. It didn’t even have the ability to calculate the logical XOR operation on two inputs, which at the time probably made logic-based approaches look far more promising.
3. I say “appear” because it’s not entirely clear what it would mean to be unbound by rigid logic. The mathematical formulation of machine learning models is itself perfectly strict and internally consistent, and if it weren’t, it would be irreparably broken. Why don’t the statistical models represented by that formulation break in the same way? I suspect that it has something to do with the curse blessing of dimensionality. They don’t break because every time a contradiction appears, a new dimension appears to accommodate it in an ad-hoc way — at least until the model’s “capacity” for such adjustments is exhausted. I’m afraid I’m venturing a bit beyond my existential pay grade with these questions — but I hope this sliver of uncertainty doesn’t pierce to the core of my argument.

# Deriving Browsing Similarity

The following is an extremely deliberate, step-by-step derivation of what I think is a novel vector similarity measure.1 I discuss motivations for it here; this focuses on the math. Describing this as a “vector” similarity measure might be a little deceptive because part of the point is to get away from the vector model that we’ve inherited via cosine similarity. But in its final form, the measure is similar enough to cosine similarity that it’s helpful to hold on to that way of thinking for now.

The initial aim of this similarity measure was to create better topic model graphs, and I think it really does. But as I’ve thought it through, I’ve realized that it has applications to any bipartite graph that can be interpreted in probabilistic terms. More on that later!

The derivation is pure conditional probability manipulation; it doesn’t require anything but algebra and a few identities. But if you’re not familiar with the concepts of conditional probability and marginalization (in the mathematical sense!) you may want to read up on them a bit first. Also, I’m not certain that I’m using conventional notation here — please let me know if I’ve done something odd or confusing. But I’m confident the reasoning itself is correct.

Informally, the aim of this measure is to determine the probability of happening upon one topic while browsing through a corpus for another one. Imagine for a moment that the corpus is a collection of physical books on a bookshelf in front of you. You’re interested in crop circles, and you are looking through the books on the bookshelf to find information about them. What is the probability that in the process, you’ll happen upon something about clock gear ratios?2

Formally, suppose we have random variables $X$ and $Y$, each representing identical categorical distributions over topics in a model, and suppose we have $B$, a random variable representing a uniform distribution over all books in the model’s corpus. We’re interested in the probability of happening upon topic $x$ given that we selected a book $b$ based on the proportion of the book discussing topic $y$. To be as precise as possible, I’ll assume that we use the following process to browse through the corpus:

1. Pick a topic $y$ of interest from $Y$.
2. Pick a book at random, with uniform probability.
3. Pick a word from the book at random, with uniform probability.
4. If the word is not labeled $y$, put the book back on the shelf and repeat steps 2 and 3 until the word you choose is labeled $y$.
5. Pick another word from the book at random, again with uniform probability.
6. Use that word’s topic label $x$ as your new topic of interest.
7. Repeat steps 2 through 6 indefinitely.

This is roughly equivalent to the less structured process I described above in the informal statement of the problem, and there’s at least some reason to believe that the probabilities will turn out similarly. Suppose, for example, that you know in advance the proportion of each book devoted to each topic, and choose a book based on that information; and that you then choose your new topic using that book’s topic distribution. The probabilities in that case should work out to be the same as if you use the above process. But specifying the above process ensures that we know exactly where to begin our derivation and how to proceed.

In short, what we have here is a generative model — but this time, instead of being a generative model of writing, such as LDA, it’s a generative model of reading.

Now for the derivation. First, some basic identities. The first two give different versions of the definition of conditional probability. The third shows the relationship between the conditional probabilities of three variables and their joint probability; it’s a version of the first identity generalized to three variables. And the fourth gives the definition of marginalization over the joint probability of three variables — which simply eliminates one of them by summing over the probabilities for all its possible values. (You can do the same thing with any number of variables, given a joint distribution.) I’m using the convention that an uppercase variable represents a probability distribution over a support (set of possible values) and a lowercase variable represents one possible value from that distribution. To avoid clutter, I’ve silently elided the $=x$ in $P(X=x)$ unless clarity requires otherwise.

$1) \quad p(X,Y) = p(X|Y) \times p(Y)$

$2) \quad p(X|Y) = p(X,Y) / p(Y)$

$3) \quad p(X,Y,B) = p(Y) \times p(B|Y) \times p(X|B,Y)$

$4) \quad p(X,Y) = \sum \limits_{b \in B} p(X,Y,B=b)$

Now for the derivation. First, substitute (3) into (4):

$5) \quad p(X,Y) = \sum \limits_{b \in B} (p(Y) \times p(B=b|Y) \times p(X|B=b,Y))$

The prior probability of $Y$, $p(Y)$, is constant with respect to $b$, so we can move it outside the sum:

$6) \quad p(X,Y) = p(Y) \times \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))$

Divide both sides:

$7) \quad p(X,Y) / p(Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))$

And by (2):

$8) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))$

Finally, for any given book $b$, we can simplify $p(X|B=b,Y)$ to $p(X|B=b)$ because the probability of picking a word labeled with topic $x$ depends only on the given book. The topic that led us to choose that book becomes irrelevant once it has been chosen.

$9) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b))$

So now we have a formula for $p(X|Y)$ in terms of $p(B|Y)$ and $p(X|B)$. And we know $p(X|B)$ — it’s just the probability of finding topic $x$ in book $b$, which is part of the output from our topic model. But we don’t yet know $p(B|Y)$.

Here’s how we can determine that value. We’ll introduce a combined version of equations 1 and 2 with the variables swapped as 10 — also known as Bayes’ Theorem:

$10) \quad p(B|Y) = p(Y|B) \times p(B) / p(Y)$

As well as a combination of the two-variable versions of equations 3 and 4 as 11:

$11) \quad p(Y) = \sum \limits_{b \in B} (p(B=b) \times p(Y|B=b))$

Starting with 11, we note that for any given $b$, $p(B=b) = 1 / N$ where $N$ is the number of books. (This is because $B$ is uniformly distributed across all books.) That means $p(B=b)$ is a constant, and so we can move it outside the sum:

$12) \quad p(Y) = p(B) \times \sum \limits_{b \in B} p(Y|B=b)$

Substituting that into equation 10 we get:3

$13) \quad p(B|Y) = p(Y|B) \times p(B) / (p(B) \times \sum \limits_{b \in B} p(Y|B=b))$

Conveniently, the $p(B)$ terms now cancel out:

$14) \quad p(B|Y) = p(Y|B) / \sum \limits_{b \in B} p(Y|B=b)$

And we substitute that into the our previous result (9) above:

$15) \quad p(X|Y) = \sum \limits_{b \in B}(p(X|B=b) \times p(Y|B=b) / \sum \limits_{b \in B} p(Y|B=b))$

Now we can simplify again by noticing that $\sum \limits_{b \in B} p(Y|B=b)$ is constant with respect to the outer sum, because all the changes in the values of $p(Y|B=b)$ in the outer sum are subsumed by the inner sum. So we can move that out of the sum.4

$16) \quad p(X|Y) = \sum \limits_{b \in B} (p(X|B=b) \times p(Y|B=b)) / \sum \limits_{b \in B} p(Y|B=b)$

This is a very interesting result, because it looks suspiciously like the formula for cosine similarity. To see that more clearly, suppose that instead of looking at $p(X|B=b)$ and $p(Y|B=b)$ as conditional probabilities, we looked at them as vectors with a dimension for every value of $b$ — that is, as $X = [x_1\ x_2\ x_3\ ...\ x_n]$ and $Y = [y_1\ y_2\ y_3\ ...\ y_n]$. We’d get this formula:

$17) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}$

And here’s the formula for cosine similarity for comparison:

$18) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}$

As you can see, the sum on top is identical in both formulas. It’s a dot product of the vectors $X$ and $Y$. The difference is on the bottom. In the cosine similarity formula, the bottom is the multiple of the lengths of the two vectors — that is, the Euclidean norm. But in the new formula we derived, it’s a simple sum! In fact, we could even describe it as the Manhattan norm, which makes the relationship between the two formulas even clearer. To convert between them, we can simply replace the Euclidean norm of both vectors with the Manhattan norm of the second vector $Y$ — that is, the conditioning probability distribution in $p(X|Y)$.

So at this point, you might be thinking “That was a lot of trouble for a result that hardly differs from cosine similarity. What was the point again?”

The first emphasizes the familiar. Although we just derived something that looks almost identical to cosine similarity, we derived it using a specific and well-defined statistical model that invites a new and more concrete set of interpretations. Now we don’t have to think about vectors in an abstract space; we can think about physical people holding physical books. So we’ve come to a better way to theorize what cosine similarity measures in this case, even if we don’t seem to have discovered anything new.

The second emphasizes the unfamiliar. Although what we have derived looks similar to cosine similarity, it is dramatically different in one respect. It’s an asymetric operator.

If you’re not sure what that means, look at the diagrams to the right. They illustrate the vector space model that cosine similarity uses, this time with two simple 2-d vectors. Think of the vectors as representing topics and the axes as representing documents. In that case, the cosine similarity corresponds roughly to the similarity between topics — assuming, that is, that topics that appear together frequently must be similar.

The cosine similarity is simply the cosine of the angle between them, theta. As the angle gets larger, the cosine goes down; as it gets smaller, the cosine goes up. If the vectors both point in exactly the same direction, then the cosine is 1. And if they are at right angles, it’s 0. But notice that if you swap them, the value doesn’t change. The cosine similarity of A with respect to B is the same as the cosine similarity of B with respect to A. That’s what it means to say that this is a symmetric operator.

But browsing similarity is not a symmetric operator. Why? Because the denominator only includes the norm of the conditioning variable (the $Y$ in $p(X|Y)$). The conditioned variable ($X$ in $p(X|Y)$) isn’t included in the denominator. This means that if the sum of $X$ is different from the sum of $Y$, the result will be different depending on the order in which the operator is applied. This means that the graph this measure produces will be a directed graph. Look — see the arrows?

The arrows indicate the order in which the similarity operator has been applied to the given topic vectors. If the arrow is pointing towards topic $X$, then it represents the value $p(X|Y)$ — otherwise, it represents the value $p(Y|X)$.

What’s exciting is that this has a physical interpretation. Concretely, the probability that people who are interested in topic $Y$ will be exposed to topic $X$ is different from the probability that people who are interested in topic $X$ will be exposed to topic $Y$. This shouldn’t be too surprising; obviously a very popular topic will be more likely to “attract” readers from other topics than an unpopular topic. So this is really how it ought to be: if $Y$ is an unpopular topic, and $X$ is a very popular topic, then $p(Y|X)$ should be much lower than $P(X|Y)$.

This makes me think that we’ve been using the wrong similarity measure to create our topic graphs. We’ve been assuming that any two topics are equally related to one another, but that’s not really a sound assumption, except under a model that’s so abstract that we can’t pin a physical interpretation to it at all.

Furthermore, this measurement isn’t limited to use on topic models. It can be used on any pair of types related to each other by categorical distributions; or, to put it another way, it can be used to collapse any bipartite graph into a non-bipartite graph in a probabilistically sound way. This opens up lots of interesting possibilities that I’ll discuss in later posts. But to give you just one example, consider this: instead of thinking of topics and books, think of books and recommenders. Suppose you have a community of recommenders, each of whom is disposed to recommend a subset of books with different probabilities. The recommenders form one set of nodes; the books form the other. The browsing process would look like this: you talk to someone likely to recommend a book you already like and ask for another recommendation. Then you talk to someone else likely to recommend that book, and so on. This could be a new way to implement recommender systems like those used by Amazon.

1. I hope someone will tell me it’s not! In that case, my task is easier, because it means I won’t have to do all the theory. I’ve found a few near misses. In a paper on topic model diagnostics, Chuang, et. al. propose a refinement of cosine similarity for finding matches between topics across multiple independently-run models. Unlike the measure I’m proposing, theirs uses topic-word vectors; is not based on a generative model; and is a symmetric operator. However, theirs was better at predicting human judgments than cosine similarity was, and is worth investigating further for that reason. See also this wonderful post by Brendan O’Connor, which lays out a fascinating set of relationships between various measures that all boil down to normed dot products using different norms. This measure could be added to that list.
2. Here’s the graph-theoretic interpretation: given that the topics and books together form a complete bipartite graph, in which the edges are weighted by the proportion of the book that is about the linked topic, this is equivalent to building a non-bipartite directed graph describing the probability of moving from one topic to another while browsing. This takes up ideas that I first encountered through Scott Weingart
3. This is almost the same as what Wikipedia calls the “extended version” of Bayes Theorem. The only difference is a small modification that resulted from the step we took to generate equation 12, which was possible because $p(B)$ is uniform.
4. If you’re anything like me, you probably start to feel a little anxious every time a variable moves inside or outside the summation operator. To give myself a bit more confidence that I wasn’t doing something wrong, I wrote this script, which repeatedly simulates the 7-step process specified above a million times on an imaginary corpus, and compares the resulting topic transition probability to the probability calculated by the derived formula. The match is close, with a tight standard deviation, and very similar results for several different averaging methods. The script is a little bit sloppy — my apologies — but I encourage anyone interested in verifying this result to look at the script and check my work. Let me know if you see a potential problem!