The Markov Chains of La Grande Jatte A Short Introduction to Gibbs Sampling

Topic modeling has been attracting the attention of scholars in the digital humanities for several years now, and quite a few substantive introductions to the subject have been written. Ben Schmidt offered a brief overview of the genre in 2012, and the list he provided is still fairly comprehensive, as far as I can tell.1 My current favorite is an entry from Miriam Posner and Andy Wallace that emphasizes the practical side of topic modeling — it’s great for bootstrapping if you’re new to the subject.

This post will cover something slightly different. When I started to delve into the details of topic modeling, I quickly realized that I needed to create my own implementation of Latent Dirichlet Allocation (LDA) to begin understanding how it worked. I eventually did, but even with all the terrific resources available, I ran into several significant roadblocks.2 The biggest one for me was figuring out Gibbs sampling. A lot of introductions to topic modeling don’t spend much time on Gibbs sampling, for understandable reasons. It’s not part of LDA properly speaking, so you don’t need to understand how it works to understand the fundamentals of LDA. In fact, in his original description of LDA, David Blei didn’t even talk about Gibbs sampling — he used a thing called “variational inference,” which is a wall of abstraction that I still haven’t managed to scale.

Fortunately, Gibbs sampling yielded to my efforts more readily. And although it’s not strictly necessary to understand Gibbs sampling to understand LDA, I think it’s worth understanding for other reasons. In fact, I’ve come to believe that Gibbs sampling is a wonderful introduction to the rapidly evolving world of machine learning — a world that I think at least a subset of digital humanists should have much broader knowledge of.

What is Gibbs sampling?

Here’s my attempt at a definition: Gibbs sampling is a way to build a picture of a global probability distribution when you only have local information about that distribution. That’s more of a description than a definition; other techniques do that too. But I like it because it shows what Gibbs sampling is good at. You can use it to take lots of little bits of information — like individual word counts — and construct a global view of those bits.

Suppose that you are temporarily Georges Seurat, but you couldn’t make it to the Island of La Grande Jatte today. Instead of seeing it for yourself or looking at someone else’s picture, you decide to consult with Sam, your omniscient imaginary friend. Sam supplies you with some probabilities like so:

An image of Georges Seruat's La Grande Jatte, at four different levels of detail.
What a Markov chain knows at first (top). What it knows after a long time (bottom).

Given that you have just put a green dot here (Sam points at a spot on the canvas):

  • The probability is \mu that your next dot will be an orange dot there.
  • The probability is \eta that your next dot will be a blue dot over there.
  • The probability is … [more tiny numbers]

This list goes on until every possible location on the canvas and every possible color has a probability associated with it. It turns out they all add up to one. (They’re probabilities, after all!) Then Sam gives you another list that starts with another location and possible color. You get lists from every possible point and color on the canvas, to every possible point and color on the canvas. Now, at any moment while painting, you can look up the dot you’ve just painted in the table. You can then use that dot’s transition probabilities to decide how to paint the next one.

So you just start painting dots. And lo and behold, after a really long time, you’re looking at a picture of La Grande Jatte.

What I’ve just described is called a Markov chain.3 Gibbs sampling adds just one more little twist. But before I get to that, I want to explain why this is possible. Sam’s table of probabilities has to meet three conditions for this to work. The first two dictate the kinds of movements between points and colors that the table of probabilities must allow. First, the table of probabilities must allow you to get from any point and color in the painting to any other. It doesn’t have to allow you to get from one to the other in a single step, but it has to allow you to get there eventually. And second, the table of probabilities must allow you to get from one point and color to another at irregular intervals. So if it aways takes you two, or four, or eight steps to get from node A to node B, and never any other number of steps, then the table doesn’t satisfy this condition, because the number of steps required to get from A to B is always a multiple of two.

Together, these conditions tell us that the Markov chain has what’s called a stationary distribution.4 It’s a probability distribution over every point and possible color on the canvas. It tells you how often you will paint a particular dot, on average, if you keep painting forever. If Sam’s table meets these first two conditions, then we can prove that it has a stationary distribution, and we can even prove that its stationary distribution is unique. At that point, it only has to meet one more condition: its stationary distribution must be a painting of La Grande Jatte.

What’s neat about this is that none of the individual transition probabilities know much about the painting. It’s only when they get together and “talk” to one another for a while that they start to realize what’s actually going on.5 That’s what Gibbs sampling allows.

The Catch

The difficult part of using Markov chains this way is figuring out the transition probabilities. How many coordinates and color codes would you need to create an adequate representation of a Seurat painting? I’m not sure, but I bet it’s a number with a lot of zeros at the end. Call it N. And to create the full transition table, you’d have to calculate and store probabilities from each of those values to each of those values. That’s a big square table with N rows and columns. These numbers get mind-bogglingly huge for even relatively simple problems.

Gibbs sampling uses a clever trick to get around that issue. It’s based on the simple insight that you don’t have to change every dimension at once. Instead of jumping directly from one point and color to another — from (x_1, y_1, c_1) to (x_2, y_2, c_2) — you can move along one dimension at a time, jumping from (x_1, y_1, c_1) to (x_2, y_1, c_1) to (x_2, y_2, c_1) to (x_2, y_2, c_2), and so on. It turns out that calculating probabilities for those transitions is often much easier and faster — and the stationary distribution stays the same.

The skeleton of a ten-dimensional hypercube.
See this crazy incomprehensible rat’s nest? It’s the skeleton of a topic hypercube for a corpus just ten words long.

In effect, this means that although you might not be able to calculate all the transition probabilities in the table, you can calculate all the relevant translation probabilities pretty easily. This makes almost no practical difference to you as you paint La Grande Jatte. It just means you do three lookups instead of one before painting the next dot. (It also might mean you don’t paint the dot every time, but only every fifth or tenth time, so that your dots aren’t too tightly correlated with one another, and come closer to being genuinely independent samples from the stationary distribution.)

In the context of the LDA model, this means that you don’t have to leap from one set of hypothetical topic labels to an entirely different one. That makes a huge difference now, because instead of working with a canvas, we’re working with a giant topic hypercube with a dimension for every single word in the corpus. Given that every word is labeled provisionally with a topic, we can just change each topic label individually, over and over, using transition probabilities from this formula that some really smart people have helpfully derived for us. And every time we save a set of topic labels, we’ve painted a single dot on the canvas of La Grande Jatte.6

So What?

I began this post with a promise that you’d get something valuable out of this explanation of Gibbs sampling, even though it isn’t part of the core of LDA. I’m going to offer three brief payoffs now, which I hope to expand in later posts.

First, most implementations of LDA use Gibbs sampling, and at least some of the difficulties that LDA appears to have — including some identified by Ben Schmidt — are probably more related to issues with Gibbs sampling than with LDA. Think back to the requirement that to have a stationary distribution, a Markov chain has to be able to reach every possible state from every other possible state. That’s strictly true in LDA, because the LDA model assumes that every word has a nonzero probability of appearing in every topic, and every topic has a nonzero probability of appearing in every document. But in some cases, those probabilities are extremely small. This is particularly true for word distributions in topics, which tend to be very sparse. That suggests that although the Markov chain has a stationary distribution, it may be hard to approximate quickly, because it will take a very long time for the chain to move from one set of states to another. For all we know, it could take only hours to reach a result that looks plausible, but years to reach a result that’s close to the actual stationary distribution. Returning to the Grand Jatte example, this would be a bit like getting a really clear picture of the trees in the upper-right-hand corner of the canvas and concluding that the rest must be a picture of a forest. The oddly conjoined and split topics that Schmidt and others have identified in their models seem a little less mysterious once you understand the quirks of Gibbs sampling.

Second, Gibbs sampling could be very useful for solving other kinds of problems. For some time now, I’ve had an eccentric obsession with encoding text into prime numbers and back into text again. The source of this obsession has to do with copyright law and some of the strange loopholes that the idea-expression dichotomy creates.7 I’m going to leave that somewhat mysterious for now, and jump to the point: part of my obsession has involved trying to figure out how to automatically break simple substitution cyphers. I’ve found that Gibbs sampling is surprisingly good at it. This is, I’ll admit, a somewhat peripheral concern. But I can’t get rid of the sense that there are other interesting things that Gibbs sampling could do that are more directly relevant to digital humanists. It’s a surprisingly powerful and flexible technique, and I think its power comes from that ability to take little bits of fragmentary information and assemble them into a gestalt.8

Third, I think Gibbs sampling is — or should be — theoretically interesting for humanists of all stripes. The theoretical vistas opened up by LDA are fairly narrow because there’s something a little bit single-purpose about it. Although it’s remarkably flexible in some ways, it makes strong assumptions about the structure of the data that it analyzes. Those assumptions limit its possible uses as a model for more speculative thinking. Gibbs sampling makes fewer such assumptions; or to be more precise, it accommodates a wider range of possible assumptions. MALLET is a tool for pounding, and it does a great job at it. But Gibbs sampling is more like the handle of a bit-driver. It’s only half-complete — assembly is required to get it to do something interesting — but it’s the foundation of a million different possible tools.

It’s the kind of tool a bricoleur ought to own.

  1. If you know of new or notable entries that are missing, let me know and I’ll add them to a list here. 
  2. You can take a look here. Caveat emptor! I called it ldazy for a reason — it stands for “LDA implementation by someone who is too lazy” to make further improvements. It’s poorly-commented, inefficient, and bad at estimating hyperparameters. (At least it tries!) Its only strength is that it is short and written in pure Python, which means that its code is somewhat legible without additional comment. 
  3. After writing this, I did some Googling to see if anybody else had thought about Markov chains in terms of pointillism. I didn’t find anything that takes quite the same approach, but I did find an article describing a way to use Markov chains to model brushstrokes for the purpose of attribution! 
  4. In case you want to talk to math people about this, these conditions are respectively called “irreducibility” and “aperiodicity.” 
  5. Sorry, I couldn’t resist. 
  6. I’m risking just a bit of confusion by extending this analogy so far, because it’s tempting to liken colors to topics. But that’s not quite right. To perfect this analogy, expand the canvas into a three-dimensional space in which all green dots occupy one plane, all orange dots occupy another, and so on. In this scheme, the dots are only present or absent — they are themselves “colorless,” and only take on a color insofar as one of the dimensions is interpreted as a color dimension. And suppose the x, y, and c variables can take values between 1 and 50. Now each dimension could just as easily represent a single word in a three-word corpus, and each dot in this three-dimensional space could represent a sequence of topic assignments for a fifty-topic model — with a value between 1 and 50 for each word in the corpus. 
  7. “In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work” 17 USC 102
  8. For another way of thinking about the possibilities of Gibbs sampling and other so-called Monte Carlo Markov chain (MCMC) methods, see the wonderful sub-subroutine post on using MCMC to learn about bread prices during the napoleonic wars.

A Sentimental Derivative

Ben Schmidt’s terrific insight into the assumptions that the Fourier transform imposes on sentiment data has been sinking in, and I have a left-field suggestion for anyone who cares to check it out. I plan to investigate it myself when I have the time, but I’ve decided to broadcast it now.

In the imaginary universe of Fourier land, all texts start and end at the same sentiment amplitude. This is clearly incorrect, as I see it.1 But what could we say about the beginning and end of texts that might hold up?

One possibility is that all texts might start and end with a flat sentiment curve. That is, at the very beginning and end of a text, we can assume that the valence of words won’t shift dramatically. That’s not clearly incorrect. I think it’s even plausible.

Now consider how we talk about plot most of the time: we speak of rising action (slope positive), falling action (slope negative), and climaxes (local and global maxima). That’s first derivative talk! And the first derivative of a flat curve is always zero. So if the first derivative of a sentiment curve always starts and ends at zero, then at least one objection to the Fourier transform approach can be worked around. For example, we could simply take the first finite difference of a text’s sentiment time series, perform a DFT and low-pass filter, do a reverse transform, and then do a cumulative sum (i.e. a discrete integration) of the result.2

What would that look like?

  1. Nonetheless, I think there’s some value to remaining agnostic about this for some time still — even now, after the dust has settled a bit. 
  2. You might be able to skip a step or two. 

Deriving Browsing Similarity

The following is an extremely deliberate, step-by-step derivation of what I think is a novel vector similarity measure.1 I discuss motivations for it here; this focuses on the math. Describing this as a “vector” similarity measure might be a little deceptive because part of the point is to get away from the vector model that we’ve inherited via cosine similarity. But in its final form, the measure is similar enough to cosine similarity that it’s helpful to hold on to that way of thinking for now.

The initial aim of this similarity measure was to create better topic model graphs, and I think it really does. But as I’ve thought it through, I’ve realized that it has applications to any bipartite graph that can be interpreted in probabilistic terms. More on that later!

The derivation is pure conditional probability manipulation; it doesn’t require anything but algebra and a few identities. But if you’re not familiar with the concepts of conditional probability and marginalization (in the mathematical sense!) you may want to read up on them a bit first. Also, I’m not certain that I’m using conventional notation here — please let me know if I’ve done something odd or confusing. But I’m confident the reasoning itself is correct.

Informally, the aim of this measure is to determine the probability of happening upon one topic while browsing through a corpus for another one. Imagine for a moment that the corpus is a collection of physical books on a bookshelf in front of you. You’re interested in crop circles, and you are looking through the books on the bookshelf to find information about them. What is the probability that in the process, you’ll happen upon something about clock gear ratios?2

Formally, suppose we have random variables X and Y, each representing identical categorical distributions over topics in a model, and suppose we have B, a random variable representing a uniform distribution over all books in the model’s corpus. We’re interested in the probability of happening upon topic x given that we selected a book b based on the proportion of the book discussing topic y. To be as precise as possible, I’ll assume that we use the following process to browse through the corpus:

  1. Pick a topic y of interest from Y.
  2. Pick a book at random, with uniform probability.
  3. Pick a word from the book at random, with uniform probability.
  4. If the word is not labeled y, put the book back on the shelf and repeat steps 2 and 3 until the word you choose is labeled y.
  5. Pick another word from the book at random, again with uniform probability.
  6. Use that word’s topic label x as your new topic of interest.
  7. Repeat steps 2 through 6 indefinitely.

This is roughly equivalent to the less structured process I described above in the informal statement of the problem, and there’s at least some reason to believe that the probabilities will turn out similarly. Suppose, for example, that you know in advance the proportion of each book devoted to each topic, and choose a book based on that information; and that you then choose your new topic using that book’s topic distribution. The probabilities in that case should work out to be the same as if you use the above process. But specifying the above process ensures that we know exactly where to begin our derivation and how to proceed.

In short, what we have here is a generative model — but this time, instead of being a generative model of writing, such as LDA, it’s a generative model of reading.

Now for the derivation. First, some basic identities. The first two give different versions of the definition of conditional probability. The third shows the relationship between the conditional probabilities of three variables and their joint probability; it’s a version of the first identity generalized to three variables. And the fourth gives the definition of marginalization over the joint probability of three variables — which simply eliminates one of them by summing over the probabilities for all its possible values. (You can do the same thing with any number of variables, given a joint distribution.) I’m using the convention that an uppercase variable represents a probability distribution over a support (set of possible values) and a lowercase variable represents one possible value from that distribution. To avoid clutter, I’ve silently elided the =x in P(X=x) unless clarity requires otherwise.

1) \quad p(X,Y) = p(X|Y) \times p(Y)

2) \quad p(X|Y) = p(X,Y) / p(Y)

3) \quad p(X,Y,B) = p(Y) \times p(B|Y) \times p(X|B,Y)

4) \quad p(X,Y) = \sum \limits_{b \in B} p(X,Y,B=b)

Now for the derivation. First, substitute (3) into (4):

5) \quad p(X,Y) = \sum \limits_{b \in B} (p(Y) \times p(B=b|Y) \times p(X|B=b,Y))

The prior probability of Y, p(Y), is constant with respect to b, so we can move it outside the sum:

6) \quad p(X,Y) = p(Y) \times \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

Divide both sides:

7) \quad p(X,Y) / p(Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

And by (2):

8) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

Finally, for any given book b, we can simplify p(X|B=b,Y) to p(X|B=b) because the probability of picking a word labeled with topic x depends only on the given book. The topic that led us to choose that book becomes irrelevant once it has been chosen.

9) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b))

So now we have a formula for p(X|Y) in terms of p(B|Y) and p(X|B). And we know p(X|B) — it’s just the probability of finding topic x in book b, which is part of the output from our topic model. But we don’t yet know p(B|Y).

Here’s how we can determine that value. We’ll introduce a combined version of equations 1 and 2 with the variables swapped as 10 — also known as Bayes’ Theorem:

10) \quad p(B|Y) = p(Y|B) \times p(B) / p(Y)

As well as a combination of the two-variable versions of equations 3 and 4 as 11:

11) \quad p(Y) = \sum \limits_{b \in B} (p(B=b) \times p(Y|B=b))

Starting with 11, we note that for any given b, p(B=b) = 1 / N where N is the number of books. (This is because B is uniformly distributed across all books.) That means p(B=b) is a constant, and so we can move it outside the sum:

12) \quad p(Y) = p(B) \times \sum \limits_{b \in B} p(Y|B=b)

Substituting that into equation 10 we get:3

13) \quad p(B|Y) = p(Y|B) \times p(B) / (p(B) \times \sum \limits_{b \in B} p(Y|B=b))

Conveniently, the p(B) terms now cancel out:

14) \quad p(B|Y) = p(Y|B) / \sum \limits_{b \in B} p(Y|B=b)

And we substitute that into the our previous result (9) above:

15) \quad p(X|Y) = \sum \limits_{b \in B}(p(X|B=b) \times p(Y|B=b) / \sum \limits_{b \in B} p(Y|B=b))

Now we can simplify again by noticing that \sum \limits_{b \in B} p(Y|B=b) is constant with respect to the outer sum, because all the changes in the values of p(Y|B=b) in the outer sum are subsumed by the inner sum. So we can move that out of the sum.4

16) \quad p(X|Y) = \sum \limits_{b \in B} (p(X|B=b) \times p(Y|B=b)) / \sum \limits_{b \in B} p(Y|B=b)

This is a very interesting result, because it looks suspiciously like the formula for cosine similarity. To see that more clearly, suppose that instead of looking at p(X|B=b) and p(Y|B=b) as conditional probabilities, we looked at them as vectors with a dimension for every value of b — that is, as X = [x_1\ x_2\ x_3\ ...\ x_n] and Y = [y_1\ y_2\ y_3\ ...\ y_n]. We’d get this formula:

17) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}

And here’s the formula for cosine similarity for comparison:

18) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}

As you can see, the sum on top is identical in both formulas. It’s a dot product of the vectors X and Y. The difference is on the bottom. In the cosine similarity formula, the bottom is the multiple of the lengths of the two vectors — that is, the Euclidean norm. But in the new formula we derived, it’s a simple sum! In fact, we could even describe it as the Manhattan norm, which makes the relationship between the two formulas even clearer. To convert between them, we can simply replace the Euclidean norm of both vectors with the Manhattan norm of the second vector Y — that is, the conditioning probability distribution in p(X|Y).

So at this point, you might be thinking “That was a lot of trouble for a result that hardly differs from cosine similarity. What was the point again?”

There are two answers.

A diagram illustrating cosine similarity
The cosine similarity between two topics. Topic One accounts for three words in Book One and four words in Book Two. Topic Two accounts for five words in Book One and four words in Book Two. The cosine similarity is equal to the cosine of the angle between them, Theta. If you’re not confused by this, then I think you should be. It’s weird that we’re still modeling texts as vectors.

The first emphasizes the familiar. Although we just derived something that looks almost identical to cosine similarity, we derived it using a specific and well-defined statistical model that invites a new and more concrete set of interpretations. Now we don’t have to think about vectors in an abstract space; we can think about physical people holding physical books. So we’ve come to a better way to theorize what cosine similarity measures in this case, even if we don’t seem to have discovered anything new.

The second emphasizes the unfamiliar. Although what we have derived looks similar to cosine similarity, it is dramatically different in one respect. It’s an asymetric operator.

If you’re not sure what that means, look at the diagrams to the right. They illustrate the vector space model that cosine similarity uses, this time with two simple 2-d vectors. Think of the vectors as representing topics and the axes as representing documents. In that case, the cosine similarity corresponds roughly to the similarity between topics — assuming, that is, that topics that appear together frequently must be similar.

A graph of the cosine function.
Cosine similarity is symmetric because the cosine function is symmetric around 0. Swapping the order of application negates theta, but the cosine of negative theta is the same as the cosine of theta.

The cosine similarity is simply the cosine of the angle between them, theta. As the angle gets larger, the cosine goes down; as it gets smaller, the cosine goes up. If the vectors both point in exactly the same direction, then the cosine is 1. And if they are at right angles, it’s 0. But notice that if you swap them, the value doesn’t change. The cosine similarity of A with respect to B is the same as the cosine similarity of B with respect to A. That’s what it means to say that this is a symmetric operator.

But browsing similarity is not a symmetric operator. Why? Because the denominator only includes the norm of the conditioning variable (the Y in p(X|Y)). The conditioned variable (X in p(X|Y)) isn’t included in the denominator. This means that if the sum of X is different from the sum of Y, the result will be different depending on the order in which the operator is applied. This means that the graph this measure produces will be a directed graph. Look — see the arrows?

The arrows indicate the order in which the similarity operator has been applied to the given topic vectors. If the arrow is pointing towards topic X, then it represents the value p(X|Y) — otherwise, it represents the value p(Y|X).

What’s exciting is that this has a physical interpretation. Concretely, the probability that people who are interested in topic Y will be exposed to topic X is different from the probability that people who are interested in topic X will be exposed to topic Y. This shouldn’t be too surprising; obviously a very popular topic will be more likely to “attract” readers from other topics than an unpopular topic. So this is really how it ought to be: if Y is an unpopular topic, and X is a very popular topic, then p(Y|X) should be much lower than P(X|Y).

This makes me think that we’ve been using the wrong similarity measure to create our topic graphs. We’ve been assuming that any two topics are equally related to one another, but that’s not really a sound assumption, except under a model that’s so abstract that we can’t pin a physical interpretation to it at all.

Furthermore, this measurement isn’t limited to use on topic models. It can be used on any pair of types related to each other by categorical distributions; or, to put it another way, it can be used to collapse any bipartite graph into a non-bipartite graph in a probabilistically sound way. This opens up lots of interesting possibilities that I’ll discuss in later posts. But to give you just one example, consider this: instead of thinking of topics and books, think of books and recommenders. Suppose you have a community of recommenders, each of whom is disposed to recommend a subset of books with different probabilities. The recommenders form one set of nodes; the books form the other. The browsing process would look like this: you talk to someone likely to recommend a book you already like and ask for another recommendation. Then you talk to someone else likely to recommend that book, and so on. This could be a new way to implement recommender systems like those used by Amazon.

  1. I hope someone will tell me it’s not! In that case, my task is easier, because it means I won’t have to do all the theory. I’ve found a few near misses. In a paper on topic model diagnostics, Chuang, et. al. propose a refinement of cosine similarity for finding matches between topics across multiple independently-run models. Unlike the measure I’m proposing, theirs uses topic-word vectors; is not based on a generative model; and is a symmetric operator. However, theirs was better at predicting human judgments than cosine similarity was, and is worth investigating further for that reason. See also this wonderful post by Brendan O’Connor, which lays out a fascinating set of relationships between various measures that all boil down to normed dot products using different norms. This measure could be added to that list. 
  2. Here’s the graph-theoretic interpretation: given that the topics and books together form a complete bipartite graph, in which the edges are weighted by the proportion of the book that is about the linked topic, this is equivalent to building a non-bipartite directed graph describing the probability of moving from one topic to another while browsing. This takes up ideas that I first encountered through Scott Weingart
  3. This is almost the same as what Wikipedia calls the “extended version” of Bayes Theorem. The only difference is a small modification that resulted from the step we took to generate equation 12, which was possible because p(B) is uniform. 
  4. If you’re anything like me, you probably start to feel a little anxious every time a variable moves inside or outside the summation operator. To give myself a bit more confidence that I wasn’t doing something wrong, I wrote this script, which repeatedly simulates the 7-step process specified above a million times on an imaginary corpus, and compares the resulting topic transition probability to the probability calculated by the derived formula. The match is close, with a tight standard deviation, and very similar results for several different averaging methods. The script is a little bit sloppy — my apologies — but I encourage anyone interested in verifying this result to look at the script and check my work. Let me know if you see a potential problem!