The Markov Chains of La Grande Jatte A Short Introduction to Gibbs Sampling

Topic modeling has been attracting the attention of scholars in the digital humanities for several years now, and quite a few substantive introductions to the subject have been written. Ben Schmidt offered a brief overview of the genre in 2012, and the list he provided is still fairly comprehensive, as far as I can tell.1 My current favorite is an entry from Miriam Posner and Andy Wallace that emphasizes the practical side of topic modeling — it’s great for bootstrapping if you’re new to the subject.

This post will cover something slightly different. When I started to delve into the details of topic modeling, I quickly realized that I needed to create my own implementation of Latent Dirichlet Allocation (LDA) to begin understanding how it worked. I eventually did, but even with all the terrific resources available, I ran into several significant roadblocks.2 The biggest one for me was figuring out Gibbs sampling. A lot of introductions to topic modeling don’t spend much time on Gibbs sampling, for understandable reasons. It’s not part of LDA properly speaking, so you don’t need to understand how it works to understand the fundamentals of LDA. In fact, in his original description of LDA, David Blei didn’t even talk about Gibbs sampling — he used a thing called “variational inference,” which is a wall of abstraction that I still haven’t managed to scale.

Fortunately, Gibbs sampling yielded to my efforts more readily. And although it’s not strictly necessary to understand Gibbs sampling to understand LDA, I think it’s worth understanding for other reasons. In fact, I’ve come to believe that Gibbs sampling is a wonderful introduction to the rapidly evolving world of machine learning — a world that I think at least a subset of digital humanists should have much broader knowledge of.

What is Gibbs sampling?

Here’s my attempt at a definition: Gibbs sampling is a way to build a picture of a global probability distribution when you only have local information about that distribution. That’s more of a description than a definition; other techniques do that too. But I like it because it shows what Gibbs sampling is good at. You can use it to take lots of little bits of information — like individual word counts — and construct a global view of those bits.

Suppose that you are temporarily Georges Seurat, but you couldn’t make it to the Island of La Grande Jatte today. Instead of seeing it for yourself or looking at someone else’s picture, you decide to consult with Sam, your omniscient imaginary friend. Sam supplies you with some probabilities like so:

An image of Georges Seruat's La Grande Jatte, at four different levels of detail.
What a Markov chain knows at first (top). What it knows after a long time (bottom).

Given that you have just put a green dot here (Sam points at a spot on the canvas):

  • The probability is \mu that your next dot will be an orange dot there.
  • The probability is \eta that your next dot will be a blue dot over there.
  • The probability is … [more tiny numbers]

This list goes on until every possible location on the canvas and every possible color has a probability associated with it. It turns out they all add up to one. (They’re probabilities, after all!) Then Sam gives you another list that starts with another location and possible color. You get lists from every possible point and color on the canvas, to every possible point and color on the canvas. Now, at any moment while painting, you can look up the dot you’ve just painted in the table. You can then use that dot’s transition probabilities to decide how to paint the next one.

So you just start painting dots. And lo and behold, after a really long time, you’re looking at a picture of La Grande Jatte.

What I’ve just described is called a Markov chain.3 Gibbs sampling adds just one more little twist. But before I get to that, I want to explain why this is possible. Sam’s table of probabilities has to meet three conditions for this to work. The first two dictate the kinds of movements between points and colors that the table of probabilities must allow. First, the table of probabilities must allow you to get from any point and color in the painting to any other. It doesn’t have to allow you to get from one to the other in a single step, but it has to allow you to get there eventually. And second, the table of probabilities must allow you to get from one point and color to another at irregular intervals. So if it aways takes you two, or four, or eight steps to get from node A to node B, and never any other number of steps, then the table doesn’t satisfy this condition, because the number of steps required to get from A to B is always a multiple of two.

Together, these conditions tell us that the Markov chain has what’s called a stationary distribution.4 It’s a probability distribution over every point and possible color on the canvas. It tells you how often you will paint a particular dot, on average, if you keep painting forever. If Sam’s table meets these first two conditions, then we can prove that it has a stationary distribution, and we can even prove that its stationary distribution is unique. At that point, it only has to meet one more condition: its stationary distribution must be a painting of La Grande Jatte.

What’s neat about this is that none of the individual transition probabilities know much about the painting. It’s only when they get together and “talk” to one another for a while that they start to realize what’s actually going on.5 That’s what Gibbs sampling allows.

The Catch

The difficult part of using Markov chains this way is figuring out the transition probabilities. How many coordinates and color codes would you need to create an adequate representation of a Seurat painting? I’m not sure, but I bet it’s a number with a lot of zeros at the end. Call it N. And to create the full transition table, you’d have to calculate and store probabilities from each of those values to each of those values. That’s a big square table with N rows and columns. These numbers get mind-bogglingly huge for even relatively simple problems.

Gibbs sampling uses a clever trick to get around that issue. It’s based on the simple insight that you don’t have to change every dimension at once. Instead of jumping directly from one point and color to another — from (x_1, y_1, c_1) to (x_2, y_2, c_2) — you can move along one dimension at a time, jumping from (x_1, y_1, c_1) to (x_2, y_1, c_1) to (x_2, y_2, c_1) to (x_2, y_2, c_2), and so on. It turns out that calculating probabilities for those transitions is often much easier and faster — and the stationary distribution stays the same.

The skeleton of a ten-dimensional hypercube.
See this crazy incomprehensible rat’s nest? It’s the skeleton of a topic hypercube for a corpus just ten words long.

In effect, this means that although you might not be able to calculate all the transition probabilities in the table, you can calculate all the relevant translation probabilities pretty easily. This makes almost no practical difference to you as you paint La Grande Jatte. It just means you do three lookups instead of one before painting the next dot. (It also might mean you don’t paint the dot every time, but only every fifth or tenth time, so that your dots aren’t too tightly correlated with one another, and come closer to being genuinely independent samples from the stationary distribution.)

In the context of the LDA model, this means that you don’t have to leap from one set of hypothetical topic labels to an entirely different one. That makes a huge difference now, because instead of working with a canvas, we’re working with a giant topic hypercube with a dimension for every single word in the corpus. Given that every word is labeled provisionally with a topic, we can just change each topic label individually, over and over, using transition probabilities from this formula that some really smart people have helpfully derived for us. And every time we save a set of topic labels, we’ve painted a single dot on the canvas of La Grande Jatte.6

So What?

I began this post with a promise that you’d get something valuable out of this explanation of Gibbs sampling, even though it isn’t part of the core of LDA. I’m going to offer three brief payoffs now, which I hope to expand in later posts.

First, most implementations of LDA use Gibbs sampling, and at least some of the difficulties that LDA appears to have — including some identified by Ben Schmidt — are probably more related to issues with Gibbs sampling than with LDA. Think back to the requirement that to have a stationary distribution, a Markov chain has to be able to reach every possible state from every other possible state. That’s strictly true in LDA, because the LDA model assumes that every word has a nonzero probability of appearing in every topic, and every topic has a nonzero probability of appearing in every document. But in some cases, those probabilities are extremely small. This is particularly true for word distributions in topics, which tend to be very sparse. That suggests that although the Markov chain has a stationary distribution, it may be hard to approximate quickly, because it will take a very long time for the chain to move from one set of states to another. For all we know, it could take only hours to reach a result that looks plausible, but years to reach a result that’s close to the actual stationary distribution. Returning to the Grand Jatte example, this would be a bit like getting a really clear picture of the trees in the upper-right-hand corner of the canvas and concluding that the rest must be a picture of a forest. The oddly conjoined and split topics that Schmidt and others have identified in their models seem a little less mysterious once you understand the quirks of Gibbs sampling.

Second, Gibbs sampling could be very useful for solving other kinds of problems. For some time now, I’ve had an eccentric obsession with encoding text into prime numbers and back into text again. The source of this obsession has to do with copyright law and some of the strange loopholes that the idea-expression dichotomy creates.7 I’m going to leave that somewhat mysterious for now, and jump to the point: part of my obsession has involved trying to figure out how to automatically break simple substitution cyphers. I’ve found that Gibbs sampling is surprisingly good at it. This is, I’ll admit, a somewhat peripheral concern. But I can’t get rid of the sense that there are other interesting things that Gibbs sampling could do that are more directly relevant to digital humanists. It’s a surprisingly powerful and flexible technique, and I think its power comes from that ability to take little bits of fragmentary information and assemble them into a gestalt.8

Third, I think Gibbs sampling is — or should be — theoretically interesting for humanists of all stripes. The theoretical vistas opened up by LDA are fairly narrow because there’s something a little bit single-purpose about it. Although it’s remarkably flexible in some ways, it makes strong assumptions about the structure of the data that it analyzes. Those assumptions limit its possible uses as a model for more speculative thinking. Gibbs sampling makes fewer such assumptions; or to be more precise, it accommodates a wider range of possible assumptions. MALLET is a tool for pounding, and it does a great job at it. But Gibbs sampling is more like the handle of a bit-driver. It’s only half-complete — assembly is required to get it to do something interesting — but it’s the foundation of a million different possible tools.

It’s the kind of tool a bricoleur ought to own.

  1. If you know of new or notable entries that are missing, let me know and I’ll add them to a list here. 
  2. You can take a look here. Caveat emptor! I called it ldazy for a reason — it stands for “LDA implementation by someone who is too lazy” to make further improvements. It’s poorly-commented, inefficient, and bad at estimating hyperparameters. (At least it tries!) Its only strength is that it is short and written in pure Python, which means that its code is somewhat legible without additional comment. 
  3. After writing this, I did some Googling to see if anybody else had thought about Markov chains in terms of pointillism. I didn’t find anything that takes quite the same approach, but I did find an article describing a way to use Markov chains to model brushstrokes for the purpose of attribution! 
  4. In case you want to talk to math people about this, these conditions are respectively called “irreducibility” and “aperiodicity.” 
  5. Sorry, I couldn’t resist. 
  6. I’m risking just a bit of confusion by extending this analogy so far, because it’s tempting to liken colors to topics. But that’s not quite right. To perfect this analogy, expand the canvas into a three-dimensional space in which all green dots occupy one plane, all orange dots occupy another, and so on. In this scheme, the dots are only present or absent — they are themselves “colorless,” and only take on a color insofar as one of the dimensions is interpreted as a color dimension. And suppose the x, y, and c variables can take values between 1 and 50. Now each dimension could just as easily represent a single word in a three-word corpus, and each dot in this three-dimensional space could represent a sequence of topic assignments for a fifty-topic model — with a value between 1 and 50 for each word in the corpus. 
  7. “In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work” 17 USC 102
  8. For another way of thinking about the possibilities of Gibbs sampling and other so-called Monte Carlo Markov chain (MCMC) methods, see the wonderful sub-subroutine post on using MCMC to learn about bread prices during the napoleonic wars.

Leave a Reply