Math

The Volume and Surface Area of Computer Programs

A cube.

I’ve been refactoring some software. It’s not a fun task, exactly, but there’s something strangely satisfying about it — it’s a bit like folding socks. In the past, I have found the concepts of “coupling” (bad) and “cohesion” (good) useful for this kind of work. But the definitions of those concepts have always seemed a bit vague to me. The Wikipedia page linked above offers the following definition of cohesion: “the degree to which the elements of a module belong together.” The definition of coupling isn’t much better: “a measure of how closely connected two routines or modules are.”

What does that mean? Belong together in what sense? Closely connected how? Aren’t those essentially synonyms?

If I were in a less lazy mood, I’d look up the sources cited for those definitions, and find other, more reliable (?) sources than Wikipedia. But I’ve done that before, and couldn’t find anything better. And as I’ve gained experience, I’ve developed a sense that, yes, some code is tightly coupled — and just awful to maintain — and some code is cohesive without being tightly coupled — and a joy to work with. I still can’t give good definitions of coupling or cohesion. I just know them when I see them. Actually, I’m not even sure I always know them when I see them. So I’ve spent a lot of time trying to figure out a more precise way to describe these two concepts. And recently, I’ve been thinking about an analogy that might help explain the difference — it links the relationship between cohesion and coupling to the relationship between volume and surface area.

The analogy begins with the idea of interfaces. Programmers often spend long hours thinking about how to define precise channels of communication between pieces of software. And many programming styles — object-oriented development, for example — emphasize the distinction between private and public components of a computer program. Programs that don’t respect that distinction are often troublesome because there are many different ways to modify the behavior of those programs. If there are too many ways, and if all those ways are used, then it becomes much more difficult to predict the final behavior that will result.

The net of a cube.
The net (unfolded surface) of a cube.

Suppose we think of interfaces and public variables as the surface of a program, and think of private variables and methods as being part of the program’s interior — as contributing to its volume. In a very complex program, enforcing this distinction between public and private becomes like minimizing the surface area of the program. As the program gets more and more complex, it becomes more and more important to hide that complexity behind a much simpler interface, and if the interface is to remain simple, its complexity must increase more slowly than the complexity of the overall program.

What if the relationship between these two rates — the rate of increase in complexity of the interface and the rate of increase in complexity of the overall program — is governed by a law similar to the law governing the relationship between surface area and volume? Consider a cube. As the cube grows in size, its volume grows faster than its surface area. To be precise (and forgive me if this seems too obvious to be worth stating) its volume is x^3, while its surface area is 6 \times x^2 for a given edge-length x.

Biologists have used this pattern to explain why the cells of organisms rarely grow beyond a certain size. As they grow larger, the nutrient requirements of the interior of the cell increase more quickly than the surface’s capacity to transmit nutrients; if the cell keeps growing, its interior will eventually starve. The cell can avoid that fate for a while by changing its shape to expand its surface area. But the cell can only do that for so long, because the expanded surface area costs more and more to maintain. Eventually, it must divide into two smaller cells or die.

Might this not also give us a model that explains why it’s difficult to develop large, complex programs without splitting them into smaller parts? On one hand, we have pressure to minimize interface complexity; on the other, we have pressure to transmit information into the program more efficiently. As the program grows, the amount of information it needs increases, but if the number of information inlets increases proportionally, then soon, it becomes too complex to understand or maintain. For a while, we can increase the complexity of the program while keeping the interface simple enough just by being clever. But eventually, the program’s need for external information overwhelms even the most clever design. The program has to be divided into smaller parts.

So what do “coupling” and “cohesion” correspond to in this model? I’m not sure exactly; the terms might not be defined clearly enough to have precise analogs. But I think coupling is closely related to — returning to the cell analogy now — the nutrient demand of the interior of the cell. If that demand goes unchecked, the cell will keep expanding its surface area. It will wrinkle its outer membrane into ever more complex and convoluted shapes, attempting to expose more of its interior to external nutrient sources. At this point in the analogy, the underlying concept becomes visible; here, the cell is tightly coupled to its environment. After this coupling exceeds some threshold, the cell’s outer membrane becomes too large and complex to maintain.A drop of water.

In turn, cohesion is closely related to the contrary impulse — the impulse to reduce surface area.

Imagine a drop of water on a smooth surface. It rests lightly on the surface seeming almost to lift itself up. Why? It turns out that surfaces take more energy to maintain than volumes; they cost more. Molecules in the interior of the drop can take configurations that molecules on the surface can’t. Some of those configurations have lower energy than any possible configuration on the surface, so molecules on the surface will tend to “fall” into the interior. The drop will try to minimize its surface area, in much the way a marble in a bowl will roll to the bottom. And the shape with the lowest surface-area-to-volume ratio is a sphere.

We have different words for this depending on context. This phenomenon is the same phenomenon we sometimes name “surface tension.” Water striders can glide across the surface of a pond because the water wants to minimize its surface area. The water does not adhere to their limbs; it coheres; it remains decoupled. These ways of thinking about coherence and coupling make the concepts seem a bit less mysterious to me.

Meaning, Context, and Algebraic Data Types

A few years ago, I read a paper making a startling argument. Its title was “The Derivative of a Regular Type is its Type of One-Hole Contexts.” I’m not entirely sure how I found it or why I started reading it, but by the time I was halfway though it, I was slapping myself on the forehead: it was so brilliant, and yet so obvious as to be almost trivial — how could nobody have thought of it before?

The idea was that if you take an algebraic data type — say something simple like a plain old list — and poke a hole in it, you get a new data type that looks like a derivative of the first data type. That probably sounds very abstract and uninteresting at first, especially if you aren’t very familiar with algebraic data types. But once you understand them, the idea is simple. And I think it has ramifications for people interested in natural language: it tells us something about the relationship between meaning and context. In particular, it could give us a new way to think about how semantic analysis programs like word2vec function: they perform a kind of calculus on language.

What’s an Algebraic Data Type?

An algebraic data type is really just a regular data type — seen through a particular lens. So it’s worth taking a moment to think through what a regular data type is, even if you’re already familiar with the concept.

A regular data type is just an abstract representation of the kind of data being stored or manipulated by a computer. Let’s consider, as a first example, the simplest possible item of data to be found in a modern computer: a bit. It can take just two values, zero and one. Most programming languages provide a type describing this kind of value — a boolean type. And values of this type are like computational atoms: every piece of data that modern computers store or manipulate is made of combinations of boolean values.1

So how do we combine boolean values? There are many ways, but here are two simple ones that cover a lot of ground. In the first way, we take two boolean values and declare that they are joined, but can vary independently. So if the first boolean value is equal to zero, the second boolean value may be equal to either zero or one, and if the first boolean value is equal to one, the second boolean value may still be equal to either zero or one. And vice versa. They’re joined, but they don’t pay attention to each other at all. Like this:

1) 0 0
2) 0 1
3) 1 0
4) 1 1

In the second way, we take two boolean values and declare that they are joined, but cannot vary independently — only one of them can be active at a given time. The others are disabled (represented by “*” here) — like this:

1) * 0
2) * 1
3) 0 *
4) 1 *

Now suppose we wanted to combine three boolean values. In the first case, we get this:

1) 0 0 0
2) 0 0 1
3) 0 1 0
4) 0 1 1
5) 1 0 0 
6) 1 0 1
7) 1 1 0
8) 1 1 1

And in the second case, we get this:

1) * * 0
2) * * 1
3) * 0 *
4) * 1 *
5) 0 * *
6) 1 * *

At this point, you might start to notice a pattern. Using the first way of combining boolean values, every additional bit doubles the number of possible combined values. Using the second way, every additional bit only adds two to the number of possible combined values.

This observation is the basis of algebraic type theory. In the language of algebraic data types, the first way of combining bits produces a product type, and the second way of combining bits produces a sum type. To get the number of possible combined values of a product type, you simply multiply together the number of possible values that each component type can take. For two bits, that’s 2 \times 2; for three bits, that’s 2 \times 2 \times 2; and for n bits, that’s 2 ^ n. And to get the number of possible combined values of a sum type, you simply add together the number of possible values that each component type can take: 2 + 2, or 2 + 2 + 2, or 2 \times n.

Pretty much all of the numerical types that a computer stores are product types of bits. For example, a 32-bit integer is just the product type of 32 boolean values. It can represent numbers between 0 and 2 ^{32} - 1 (inclusive), for a total of 2 ^{32} values. Larger numbers require more binary digits to store.

Sum types are a little more esoteric, but they become useful when you want to represent things that can come in multiple categories. For example, suppose you want to represent a garden plot, and you want to have a variable that stores the number of plants in the plot. But you also want to indicate what kind of plants are in the plot — basil or thyme, say. A sum type provides a compact way to represent this state of affairs. A plot can be either a basil plot, or a thyme plot, but not both. If there can be up to twenty thyme plants in a thyme plot, and up to sixteen basil plants in a basil plot, the final sum type has a total of thirty-six possible values:

Basil   Thyme
*       1
*       2
...     ...
*       19    
*       20
1       *
2       *
...     ...
15      *
16      *

If you’ve done any programming, product types probably look pretty familiar, but sum types might be new to you. If you want to try a language that explicitly includes sum types, take a look at Haskell.

Poking Holes in Types

Given this understanding of algebraic data types, here’s what it means to “poke a hole” in a type. Suppose you have a list of five two-bit integers:

| a | b | c | d | e |

This new variable’s type is a product type. It can take this value:

| 0 | 0 | 0 | 0 | 0 |

Or this value:

| 0 | 0 | 0 | 0 | 1 |

Or this value…

| 0 | 0 | 0 | 0 | 2 |

And so on, continuing through values like these:

| 0 | 0 | 0 | 0 | 3 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 2 |
| 0 | 0 | 0 | 1 | 3 |
| 0 | 0 | 0 | 2 | 0 |
...

All the way to these final values:

...
| 3 | 3 | 3 | 2 | 3 |
| 3 | 3 | 3 | 3 | 0 |
| 3 | 3 | 3 | 3 | 1 |
| 3 | 3 | 3 | 3 | 2 |
| 3 | 3 | 3 | 3 | 3 |

The total number of possible values for a variable of this type is (2 ^ 2) ^ 5 = 4 ^ 5 = 4 * 4 * 4 * 4 * 4 = 1024.

Now think about what happens if you disable a single slot. The result is a list type with a hole in it. What does a variable with this new type look like? Well, say you disable the first slot. You get something that looks like this — a list with a hole in the first slot:

| * | 0 | 0 | 0 | 0 |

Now that there’s a hole in the first slot, there are only 4 * 4 * 4 * 4 = 256 possible values this variable can take.

| * | 0 | 0 | 0 | 1 |
| * | 0 | 0 | 0 | 2 |
...

We could also poke a hole in any of the other slots:

| 0 | * | 0 | 0 | 0 |
| 0 | 0 | * | 0 | 0 |
| 0 | 0 | 0 | * | 0 |
| 0 | 0 | 0 | 0 | * |

But the hole can only be in one of the slots at any given time. Sound familiar? You could say that this is the “sum type” of each of those one-hole types. Each possible placement of the hole adds to (rather than multiplying) the number of possible values. There are five different places for the hole to go, and for each of those, there are 256 possible values. That’s a total of 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 = 5 \times 4 ^ 4.

Now suppose we make this a little more general, by replacing that 4 with an x. This way we can do this same thing with slots of any size — slots that can hold ten values, or 256 values, or 65536 values, or whatever. For the plain list type, that’s a total of x ^ 5 possible values. And for the list-with-a-hole, that’s 5 x^4 possible values. And now let’s go even further and replace the number of slots with an n. That way we can have any number of slots — five, ten, fifty, whatever you like. Now, the plain list has a total of x ^ n possible values, and the list-with-a-hole has n x ^ {(n - 1)} possible values.

If you ever took calculus, that probably looks very familiar. It’s just the power rule! This means you can take data types and do calculus with them.

One-Hole Contexts

I think that’s pretty wild, and it kept my head spinning for a couple of years when I first read about it. But I moved on. Fast forward to a few weeks ago when I started really reading about word2vec and learning how it works: it finds vectors that are good at predicting the word in the middle of an n-gram, or are good at predicting an n-gram given a word in the middle. Let’s start with the first case — say you have this incomplete 5-gram:

I went _ the store

What word is likely to be in there? Well, “to” is a good candidate — I’d venture a guess that it was the first word that popped into your mind. But there are other possibilities. “Past” might work. If the n-gram is part of a longer sentence, “from” is a possibility. And there are many others. So word2vec will group those words together in its vector space, because they all fit nicely in this context, and in many others.

But look again at that sentence — it’s a sequence of words with a hole in it. So if in this model, a word’s meaning is defined by the n-grams in which it may be embedded, then the type of a word’s meaning is a list-with-a-hole — just like the derivative of the list type that we were looking at above.

The basic idea of word2vec isn’t extremely surprising; from a certain perspective, this is an elaboration of the concept of a “minimal pair” in linguistics. If you know that “bat” means a thing you hit a ball with, and “mat” means a thing you wipe your feet on, then you can tell that /b/ and /m/ are different phonemes, even though they both involve putting your lips together: the difference between their sounds makes a difference in meaning. Conversely, if the difference between two sounds doesn’t make a difference in meaning, then the sounds must represent the same phoneme. “Photograph” pronounced with a distinct “oh” sound in the second syllable is the same word as “photograph” pronounced with an “uh” sound in the second syllable. They’re different sounds, but they both make the same word here, so they represent the same phoneme.

The thinking behind word2vec resembles the second case. If you can put one word in place of another in many different contexts and get similar meanings, then the two words must be relatively close together in meaning themselves.

What is surprising is that by pairing these two concepts, we can link the idea of derivatives in calculus to the idea of meaning. We might even be able to develop a model of meaning in which the meaning of a sentence has the same type as the derivative of that sentence’s type. Or — echoing the language of the paper I began with — the meaning type of a sentence type is its type of one-hole contexts.


  1. But don’t take this for granted. There have been — and perhaps will be in the future — ternary computers

Directed Probabilistic Topic Networks

Suppose you’re standing in front of your bookcase, feeling a little bored. You pick up a book at random, and read a few pages on a random topic. It piques your curiosity, so you put the first book away and pick another one that has something to say about the same topic. You read a few pages from it and notice a second topic that interests you. So you pick up a third book on that topic, and that book draws your attention to yet another topic. And you continue moving from book to topic to book to topic — forever.

Wouldn’t be interesting if we could describe that process mathematically?

Books from Yale's Beinecke LibraryFor the for the last few months I’ve been thinking about the best way to create useful networks with topic models. People have been creating network visualizations of topic models for a long time now, but they sometimes feel a bit like window dressing.1 The problem is that we don’t know what these networks actually represent. The topics are just blobs linked together and floating in a mysterious, abstract space. But what if we could create a network with a clear and concrete interpretation in terms of a physical process that we understand? What if we could create a network that represents the process of browsing through the books on a bookshelf?

I have struck upon a formula that I think does just that — it describes the probability of moving from one topic to another while browsing through a corpus. Remarkably, the formula is very similar to the formula for cosine similarity, which is one of the more popular ways of measuring the similarity between topics. But it differs in crucial ways, and it creates a kind of topic network that I haven’t seen before.2 I’d like to hear what others think about it.

I’ve developed two different theoretical arguments that suggest that the networks this formula creates are more useful than the networks that cosine similarity creates. One argument is related to the theory of bimodal networks, and the other is related to the theory of Markov chains. I have several posts queued up that go into the details, but I’ve decided not to post them just yet. Instead, I’m going to let the method speak for itself on practical grounds. I’ll post more once I feel confident that the result is worth the cognitive investment. However, if you’re interested in the fundamental math, I’ve posted a derivation.

A snippet of a topic network diagram.
A snippet from a network diagram of a topic model of PMLA issues by Ted Underwood — part of the preliminary research that led to Goldstone and Underwood’s “The Quiet Transformations of Literary Studies.”

For now, I’ll assume that most readers are already familiar with — or else are profoundly indifferent to — a few background ideas about topic modeling, cosine similarity, and topic networks.3 I hope that won’t exclude too many people from the conversation, because my core argument will be mostly visual and practical: I think that visualizations of these networks look better, and I think the idea of a “browsing similarity” between topics sounds useful — do you?

So feel free to skip past the wonkish bits to the diagrams!

In my own experimentation and research, I’ve found that browsing similarity creates topic networks that differ in several ways from those that cosine similarity creates. First, they distribute links more uniformly between nodes. It’s desirable to simplify topic networks by cutting links with a flat threshold, because the result is easy to reason about. When you do that with this new kind of network, most of the nodes stick together in one loose clump with lots of internal clustering. Second, they invite a probabilistic interpretation with some interesting and well understood theoretical properties.4 Those properties ensure that even some of the more abstract network-theoretical measures, like eigenvalue centrality, have concrete interpretations. And third, they are directed — which says some important things about the relationships between topics in a corpus.

Below are three network diagrams based on a topic model of two thousand eighteenth-century books published between 1757 and 1795.5 Each has 150 nodes (one for each topic in the model). The strengths of the links between each of the nodes are calculated using either cosine similarity or browsing similarity between topic vectors. Each vector is a sequence of book proportions for a given topic. To usefully visualize a topic model as a network, you must cut some links, and the easiest approach is to apply some kind of threshold. Links stronger than some value stay, and the rest are cut. In all the network diagrams below, I’ve selected threshold values that produce exactly 225 links. For layout, I’ve used D3’s force-directed layout routine, so the diagrams will look a little different each time you reload this page.6

In the first diagram, I’ve used cosine similarity with a simple flat threshold. The result is a hairball with a lot of little islands floating around it:

To deal with this problem, Ted Underwood came up with a really clever link-cutting heuristic that produces much cleaner network diagrams. However, it’s a little ad-hoc; it involves retaining at least one link from every node, and then retaining additional links if they’re strong enough. It’s like a compromise between a flat threshold (take all links stronger than x) and a rank-based threshold (take the strongest n links from each node).

In the second diagram, I’ve used cosine similarity again, but applied a variation on Underwood’s heuristic with a tunable base threshold.7

The result is much more coherent, and there’s even a bit of suggestive clustering in places. There are a few isolated archipelagos, but there are no singleton islands, because this method guarantees that each node will link to at least one other node.

Now for the browsing similarity approach. In the third diagram, I’ve used browsing similarity with a simple flat threshold:

Although this diagram has both singleton islands and archipelagos, it’s far more connected than the first, and it has almost as many mainland connections as the second. It also shows a bit more clustering behavior than diagram two does. But what I find most interesting about it is that it represents the concrete browsing process I described above: each of the edges represents a probability that while browsing randomly through the corpus, you will happen upon one topic, given that you are currently reading about another.8 That’s why the edges are directed — you won’t be as likely to move from topic A to topic B as from topic B to topic A. This makes perfect sense: it ought to be harder to move from common topics to rare topics than to move from rare topics to common topics.

Because I wanted to show the shapes of these graphs clearly, I’ve removed the topic labels. But you can also see full-screen versions of the cosine, Underwood, and browsing graphs, complete with topic labels that show more about the kinds of relationships that each of them preserve.

Here’s everything you need to play with the browsing similarity formula. First, the mathematical formula itself:

\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}

You can think of b as standing for “book,” and X and Y as two different topics. x_1 is the proportion of book 1 that is labeled as being about topic X, and so on. The formula is very similar to the forumla for cosine similarity, and the tops are identical. Both calculate the dot product of two topic-book vectors. The difference between them is on the bottom. For browsing similarity, it’s simply the sum of the values in Y, but for cosine similarity, it’s the product of the lengths of the two vectors:

\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}

Here a bit of jargon is actually helpful: cosine similarity uses the “euclidean norm” of both X and Y, while browsing similarity uses only the “manhattan norm” of Y, where “norm” is just a ten dollar word for length. Thinking about these as norms helps clarify the relationship between the two formulas: they both do the same thing, but browsing similarity uses a different concept of length, and ignores the length of X. These changes turn the output of the formula into a probability.

Next, some tools. I’ve written a script that generates Gephi-compatible or D3-compatible graphs from MALLET output. It can use cosine or browsing similarity, and can apply flat, Underwood-style, or rank-based cutoff thresholds. It’s available at GitHub, and it requires numpy and networkx. To use it, simply run MALLET on your corpus of choice, and pass the output to tmtk.py on the command line like so:

./tmtk.py network --remove-self-loops \
    --threshold-value 0.05 \
    --threshold-function flat \
    --similarity-function browsing \
    --output-type gexf \
    --write-network-file browsing_sim_flat \
    --topic-metadata topic_names.csv \
    mallet_output.composition

It should be possible to cut and paste the above command into any bash terminal — including Terminal in OS X under default settings. If you have any difficulties, though, let me know! It may require some massaging to work with Windows. The command should generate a file that can be directly opened by Gephi. I hope the option names are obvious enough; more detailed information about options is available via the --help option.

In case you’d prefer to work this formula into your own code, here is a simplified version of the browsing_similarity function that appears in the above Python script. Here, A is a matrix of topic row vectors. The code here is vectorized to calculate every possible topic combination at once and put them all into a new matrix. You can therefore interpret the output as the weighted adjacency matrix of a fully-connected topic network.

def browsing_similarity(A):
    A = numpy.asarray(A)
    norm = A.T.sum(axis=0)
    return numpy.dot(A, A.T) / norm

And here’s the same thing in R9:

browsingsim norm = rowSums(A)
    dot = A %*% t(A)
    return(t(dot / norm))
}

Matrix calculations like this are a dream when the shapes are right, and a nightmare when they’re wrong. So to be ridiculously explicit, the matrix A should have number_of_topics rows and number_of_books columns.

I have lots more to say about bimodal networks, conditional probability, Markov chains, and — at my most speculative — about the questions we ought to ask as we adapt more sophisticated mathematical techniques for use in the digital humanities.

But until then, comments are open!


  1. Ted Underwood has written that “it’s probably best to view network visualization as a convenience,” and there seems to be an implicit consensus that topic networks are more visually stunning than useful. My hope is that by creating networks with more concrete interpretations, we can use them to produce evidence that supports interesting arguments. There are sure to be many details to work through and test before that’s possible, but I think it’s a research program worth developing further. 
  2. I’ve never seen anything quite like this formula or the networks it produces. But I’d love to hear about other work that people have done along these lines — it would make the theoretical burden much lighter. Please let me know if there’s something I’ve missed. See also a few near-misses in the first footnote to my post on the formula’s derivation. [Update: I found a description of the Markov Cluster Algorithm, which uses a matrix that is similar to the one that browsing similarity produces, but that is created in a slightly different way. I’m investigating this further, and I’ll discuss it when I post on Markov chains.] 
  3. If you’d like to read some background material, and you don’t already have a reading list, I propose the following sequence: Matt Jockers, The LDA Buffet Is Now Open (very introductory); Ted Underwood, Topic Modeling Made Just Simple Enough (simple enough but no simpler); Miriam Posner and Andy Wallace, Very Basic Strategies for Interpreting Results from the Topic Modeling Tool (practical approaches for quick bootstrapping); Scott Weingart, Topic Modeling and Network Analysis (introduction to topic networks); Ted Underwood, Visualizing Topic Models (additional theorization of topic visualization). 
  4. Specifically, their links can be interpreted as transition probabilities in an irreducible, aperiodic Markov chain. That’s true of many networks, strictly speaking. But in this case, the probabilities are not derived from the network itself, but from the definition of the browsing process. 
  5. I have a ton of stuff to say about this corpus in the future. It’s part of a collaborative project that Mae Capozzi and I have been working on. 
  6. Because the force-directed layout strategy is purely heuristic, the layout itself is less important than the way the nodes are interconnected. But the visual intuition that the force-directed layout provides is still helpful. I used the WPD3 WordPress plugin to embed these. It’s a little finicky, so please let me know if something has gone wrong. 
  7. Underwood’s original method kept the first link, the second link if it was stronger than 0.2, and the third link if it was stronger than 0.38. This variation takes a base threshold t, which is multiplied by the rank of a given link to determine the threshold it must meet. So if the nth strongest link from a node is stronger than t * (n - 1), then it stays. 
  8. Because some links have been cut, the diagram doesn’t represent a full set of probabilities. It only represents the strongest links — that is, the topic transitions that the network is most biased towards. But the base network retains all that information, and standard network measurements apply to it in ways that have concrete meanings. 
  9. I had to do some odd transpositions to ensure that the R function generates the same output as the Python function. I’m not sure I used the best method — the additional transpositions make the ideas behind the R code seem less obvious to me. (The Python transpositions might seem odd to others — I guess they look normal to me because I’m used to Python.) Please let me know if there’s a more conventional way to manage that calculation. 

The Markov Chains of La Grande Jatte A Short Introduction to Gibbs Sampling

Topic modeling has been attracting the attention of scholars in the digital humanities for several years now, and quite a few substantive introductions to the subject have been written. Ben Schmidt offered a brief overview of the genre in 2012, and the list he provided is still fairly comprehensive, as far as I can tell.1 My current favorite is an entry from Miriam Posner and Andy Wallace that emphasizes the practical side of topic modeling — it’s great for bootstrapping if you’re new to the subject.

This post will cover something slightly different. When I started to delve into the details of topic modeling, I quickly realized that I needed to create my own implementation of Latent Dirichlet Allocation (LDA) to begin understanding how it worked. I eventually did, but even with all the terrific resources available, I ran into several significant roadblocks.2 The biggest one for me was figuring out Gibbs sampling. A lot of introductions to topic modeling don’t spend much time on Gibbs sampling, for understandable reasons. It’s not part of LDA properly speaking, so you don’t need to understand how it works to understand the fundamentals of LDA. In fact, in his original description of LDA, David Blei didn’t even talk about Gibbs sampling — he used a thing called “variational inference,” which is a wall of abstraction that I still haven’t managed to scale.

Fortunately, Gibbs sampling yielded to my efforts more readily. And although it’s not strictly necessary to understand Gibbs sampling to understand LDA, I think it’s worth understanding for other reasons. In fact, I’ve come to believe that Gibbs sampling is a wonderful introduction to the rapidly evolving world of machine learning — a world that I think at least a subset of digital humanists should have much broader knowledge of.

What is Gibbs sampling?

Here’s my attempt at a definition: Gibbs sampling is a way to build a picture of a global probability distribution when you only have local information about that distribution. That’s more of a description than a definition; other techniques do that too. But I like it because it shows what Gibbs sampling is good at. You can use it to take lots of little bits of information — like individual word counts — and construct a global view of those bits.

Suppose that you are temporarily Georges Seurat, but you couldn’t make it to the Island of La Grande Jatte today. Instead of seeing it for yourself or looking at someone else’s picture, you decide to consult with Sam, your omniscient imaginary friend. Sam supplies you with some probabilities like so:

An image of Georges Seruat's La Grande Jatte, at four different levels of detail.
What a Markov chain knows at first (top). What it knows after a long time (bottom).

Given that you have just put a green dot here (Sam points at a spot on the canvas):

  • The probability is \mu that your next dot will be an orange dot there.
  • The probability is \eta that your next dot will be a blue dot over there.
  • The probability is … [more tiny numbers]

This list goes on until every possible location on the canvas and every possible color has a probability associated with it. It turns out they all add up to one. (They’re probabilities, after all!) Then Sam gives you another list that starts with another location and possible color. You get lists from every possible point and color on the canvas, to every possible point and color on the canvas. Now, at any moment while painting, you can look up the dot you’ve just painted in the table. You can then use that dot’s transition probabilities to decide how to paint the next one.

So you just start painting dots. And lo and behold, after a really long time, you’re looking at a picture of La Grande Jatte.

What I’ve just described is called a Markov chain.3 Gibbs sampling adds just one more little twist. But before I get to that, I want to explain why this is possible. Sam’s table of probabilities has to meet three conditions for this to work. The first two dictate the kinds of movements between points and colors that the table of probabilities must allow. First, the table of probabilities must allow you to get from any point and color in the painting to any other. It doesn’t have to allow you to get from one to the other in a single step, but it has to allow you to get there eventually. And second, the table of probabilities must allow you to get from one point and color to another at irregular intervals. So if it aways takes you two, or four, or eight steps to get from node A to node B, and never any other number of steps, then the table doesn’t satisfy this condition, because the number of steps required to get from A to B is always a multiple of two.

Together, these conditions tell us that the Markov chain has what’s called a stationary distribution.4 It’s a probability distribution over every point and possible color on the canvas. It tells you how often you will paint a particular dot, on average, if you keep painting forever. If Sam’s table meets these first two conditions, then we can prove that it has a stationary distribution, and we can even prove that its stationary distribution is unique. At that point, it only has to meet one more condition: its stationary distribution must be a painting of La Grande Jatte.

What’s neat about this is that none of the individual transition probabilities know much about the painting. It’s only when they get together and “talk” to one another for a while that they start to realize what’s actually going on.5 That’s what Gibbs sampling allows.

The Catch

The difficult part of using Markov chains this way is figuring out the transition probabilities. How many coordinates and color codes would you need to create an adequate representation of a Seurat painting? I’m not sure, but I bet it’s a number with a lot of zeros at the end. Call it N. And to create the full transition table, you’d have to calculate and store probabilities from each of those values to each of those values. That’s a big square table with N rows and columns. These numbers get mind-bogglingly huge for even relatively simple problems.

Gibbs sampling uses a clever trick to get around that issue. It’s based on the simple insight that you don’t have to change every dimension at once. Instead of jumping directly from one point and color to another — from (x_1, y_1, c_1) to (x_2, y_2, c_2) — you can move along one dimension at a time, jumping from (x_1, y_1, c_1) to (x_2, y_1, c_1) to (x_2, y_2, c_1) to (x_2, y_2, c_2), and so on. It turns out that calculating probabilities for those transitions is often much easier and faster — and the stationary distribution stays the same.

The skeleton of a ten-dimensional hypercube.
See this crazy incomprehensible rat’s nest? It’s the skeleton of a topic hypercube for a corpus just ten words long.

In effect, this means that although you might not be able to calculate all the transition probabilities in the table, you can calculate all the relevant translation probabilities pretty easily. This makes almost no practical difference to you as you paint La Grande Jatte. It just means you do three lookups instead of one before painting the next dot. (It also might mean you don’t paint the dot every time, but only every fifth or tenth time, so that your dots aren’t too tightly correlated with one another, and come closer to being genuinely independent samples from the stationary distribution.)

In the context of the LDA model, this means that you don’t have to leap from one set of hypothetical topic labels to an entirely different one. That makes a huge difference now, because instead of working with a canvas, we’re working with a giant topic hypercube with a dimension for every single word in the corpus. Given that every word is labeled provisionally with a topic, we can just change each topic label individually, over and over, using transition probabilities from this formula that some really smart people have helpfully derived for us. And every time we save a set of topic labels, we’ve painted a single dot on the canvas of La Grande Jatte.6

So What?

I began this post with a promise that you’d get something valuable out of this explanation of Gibbs sampling, even though it isn’t part of the core of LDA. I’m going to offer three brief payoffs now, which I hope to expand in later posts.

First, most implementations of LDA use Gibbs sampling, and at least some of the difficulties that LDA appears to have — including some identified by Ben Schmidt — are probably more related to issues with Gibbs sampling than with LDA. Think back to the requirement that to have a stationary distribution, a Markov chain has to be able to reach every possible state from every other possible state. That’s strictly true in LDA, because the LDA model assumes that every word has a nonzero probability of appearing in every topic, and every topic has a nonzero probability of appearing in every document. But in some cases, those probabilities are extremely small. This is particularly true for word distributions in topics, which tend to be very sparse. That suggests that although the Markov chain has a stationary distribution, it may be hard to approximate quickly, because it will take a very long time for the chain to move from one set of states to another. For all we know, it could take only hours to reach a result that looks plausible, but years to reach a result that’s close to the actual stationary distribution. Returning to the Grand Jatte example, this would be a bit like getting a really clear picture of the trees in the upper-right-hand corner of the canvas and concluding that the rest must be a picture of a forest. The oddly conjoined and split topics that Schmidt and others have identified in their models seem a little less mysterious once you understand the quirks of Gibbs sampling.

Second, Gibbs sampling could be very useful for solving other kinds of problems. For some time now, I’ve had an eccentric obsession with encoding text into prime numbers and back into text again. The source of this obsession has to do with copyright law and some of the strange loopholes that the idea-expression dichotomy creates.7 I’m going to leave that somewhat mysterious for now, and jump to the point: part of my obsession has involved trying to figure out how to automatically break simple substitution cyphers. I’ve found that Gibbs sampling is surprisingly good at it. This is, I’ll admit, a somewhat peripheral concern. But I can’t get rid of the sense that there are other interesting things that Gibbs sampling could do that are more directly relevant to digital humanists. It’s a surprisingly powerful and flexible technique, and I think its power comes from that ability to take little bits of fragmentary information and assemble them into a gestalt.8

Third, I think Gibbs sampling is — or should be — theoretically interesting for humanists of all stripes. The theoretical vistas opened up by LDA are fairly narrow because there’s something a little bit single-purpose about it. Although it’s remarkably flexible in some ways, it makes strong assumptions about the structure of the data that it analyzes. Those assumptions limit its possible uses as a model for more speculative thinking. Gibbs sampling makes fewer such assumptions; or to be more precise, it accommodates a wider range of possible assumptions. MALLET is a tool for pounding, and it does a great job at it. But Gibbs sampling is more like the handle of a bit-driver. It’s only half-complete — assembly is required to get it to do something interesting — but it’s the foundation of a million different possible tools.

It’s the kind of tool a bricoleur ought to own.


  1. If you know of new or notable entries that are missing, let me know and I’ll add them to a list here. 
  2. You can take a look here. Caveat emptor! I called it ldazy for a reason — it stands for “LDA implementation by someone who is too lazy” to make further improvements. It’s poorly-commented, inefficient, and bad at estimating hyperparameters. (At least it tries!) Its only strength is that it is short and written in pure Python, which means that its code is somewhat legible without additional comment. 
  3. After writing this, I did some Googling to see if anybody else had thought about Markov chains in terms of pointillism. I didn’t find anything that takes quite the same approach, but I did find an article describing a way to use Markov chains to model brushstrokes for the purpose of attribution! 
  4. In case you want to talk to math people about this, these conditions are respectively called “irreducibility” and “aperiodicity.” 
  5. Sorry, I couldn’t resist. 
  6. I’m risking just a bit of confusion by extending this analogy so far, because it’s tempting to liken colors to topics. But that’s not quite right. To perfect this analogy, expand the canvas into a three-dimensional space in which all green dots occupy one plane, all orange dots occupy another, and so on. In this scheme, the dots are only present or absent — they are themselves “colorless,” and only take on a color insofar as one of the dimensions is interpreted as a color dimension. And suppose the x, y, and c variables can take values between 1 and 50. Now each dimension could just as easily represent a single word in a three-word corpus, and each dot in this three-dimensional space could represent a sequence of topic assignments for a fifty-topic model — with a value between 1 and 50 for each word in the corpus. 
  7. “In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work” 17 USC 102
  8. For another way of thinking about the possibilities of Gibbs sampling and other so-called Monte Carlo Markov chain (MCMC) methods, see the wonderful sub-subroutine post on using MCMC to learn about bread prices during the napoleonic wars.
     

A Sentimental Derivative

Ben Schmidt’s terrific insight into the assumptions that the Fourier transform imposes on sentiment data has been sinking in, and I have a left-field suggestion for anyone who cares to check it out. I plan to investigate it myself when I have the time, but I’ve decided to broadcast it now.

In the imaginary universe of Fourier land, all texts start and end at the same sentiment amplitude. This is clearly incorrect, as I see it.1 But what could we say about the beginning and end of texts that might hold up?

One possibility is that all texts might start and end with a flat sentiment curve. That is, at the very beginning and end of a text, we can assume that the valence of words won’t shift dramatically. That’s not clearly incorrect. I think it’s even plausible.

Now consider how we talk about plot most of the time: we speak of rising action (slope positive), falling action (slope negative), and climaxes (local and global maxima). That’s first derivative talk! And the first derivative of a flat curve is always zero. So if the first derivative of a sentiment curve always starts and ends at zero, then at least one objection to the Fourier transform approach can be worked around. For example, we could simply take the first finite difference of a text’s sentiment time series, perform a DFT and low-pass filter, do a reverse transform, and then do a cumulative sum (i.e. a discrete integration) of the result.2

What would that look like?


  1. Nonetheless, I think there’s some value to remaining agnostic about this for some time still — even now, after the dust has settled a bit. 
  2. You might be able to skip a step or two. 

Deriving Browsing Similarity

The following is an extremely deliberate, step-by-step derivation of what I think is a novel vector similarity measure.1 I discuss motivations for it here; this focuses on the math. Describing this as a “vector” similarity measure might be a little deceptive because part of the point is to get away from the vector model that we’ve inherited via cosine similarity. But in its final form, the measure is similar enough to cosine similarity that it’s helpful to hold on to that way of thinking for now.

The initial aim of this similarity measure was to create better topic model graphs, and I think it really does. But as I’ve thought it through, I’ve realized that it has applications to any bipartite graph that can be interpreted in probabilistic terms. More on that later!

The derivation is pure conditional probability manipulation; it doesn’t require anything but algebra and a few identities. But if you’re not familiar with the concepts of conditional probability and marginalization (in the mathematical sense!) you may want to read up on them a bit first. Also, I’m not certain that I’m using conventional notation here — please let me know if I’ve done something odd or confusing. But I’m confident the reasoning itself is correct.

Informally, the aim of this measure is to determine the probability of happening upon one topic while browsing through a corpus for another one. Imagine for a moment that the corpus is a collection of physical books on a bookshelf in front of you. You’re interested in crop circles, and you are looking through the books on the bookshelf to find information about them. What is the probability that in the process, you’ll happen upon something about clock gear ratios?2

Formally, suppose we have random variables X and Y, each representing identical categorical distributions over topics in a model, and suppose we have B, a random variable representing a uniform distribution over all books in the model’s corpus. We’re interested in the probability of happening upon topic x given that we selected a book b based on the proportion of the book discussing topic y. To be as precise as possible, I’ll assume that we use the following process to browse through the corpus:

  1. Pick a topic y of interest from Y.
  2. Pick a book at random, with uniform probability.
  3. Pick a word from the book at random, with uniform probability.
  4. If the word is not labeled y, put the book back on the shelf and repeat steps 2 and 3 until the word you choose is labeled y.
  5. Pick another word from the book at random, again with uniform probability.
  6. Use that word’s topic label x as your new topic of interest.
  7. Repeat steps 2 through 6 indefinitely.

This is roughly equivalent to the less structured process I described above in the informal statement of the problem, and there’s at least some reason to believe that the probabilities will turn out similarly. Suppose, for example, that you know in advance the proportion of each book devoted to each topic, and choose a book based on that information; and that you then choose your new topic using that book’s topic distribution. The probabilities in that case should work out to be the same as if you use the above process. But specifying the above process ensures that we know exactly where to begin our derivation and how to proceed.

In short, what we have here is a generative model — but this time, instead of being a generative model of writing, such as LDA, it’s a generative model of reading.

Now for the derivation. First, some basic identities. The first two give different versions of the definition of conditional probability. The third shows the relationship between the conditional probabilities of three variables and their joint probability; it’s a version of the first identity generalized to three variables. And the fourth gives the definition of marginalization over the joint probability of three variables — which simply eliminates one of them by summing over the probabilities for all its possible values. (You can do the same thing with any number of variables, given a joint distribution.) I’m using the convention that an uppercase variable represents a probability distribution over a support (set of possible values) and a lowercase variable represents one possible value from that distribution. To avoid clutter, I’ve silently elided the =x in P(X=x) unless clarity requires otherwise.

1) \quad p(X,Y) = p(X|Y) \times p(Y)

2) \quad p(X|Y) = p(X,Y) / p(Y)

3) \quad p(X,Y,B) = p(Y) \times p(B|Y) \times p(X|B,Y)

4) \quad p(X,Y) = \sum \limits_{b \in B} p(X,Y,B=b)

Now for the derivation. First, substitute (3) into (4):

5) \quad p(X,Y) = \sum \limits_{b \in B} (p(Y) \times p(B=b|Y) \times p(X|B=b,Y))

The prior probability of Y, p(Y), is constant with respect to b, so we can move it outside the sum:

6) \quad p(X,Y) = p(Y) \times \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

Divide both sides:

7) \quad p(X,Y) / p(Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

And by (2):

8) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

Finally, for any given book b, we can simplify p(X|B=b,Y) to p(X|B=b) because the probability of picking a word labeled with topic x depends only on the given book. The topic that led us to choose that book becomes irrelevant once it has been chosen.

9) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b))

So now we have a formula for p(X|Y) in terms of p(B|Y) and p(X|B). And we know p(X|B) — it’s just the probability of finding topic x in book b, which is part of the output from our topic model. But we don’t yet know p(B|Y).

Here’s how we can determine that value. We’ll introduce a combined version of equations 1 and 2 with the variables swapped as 10 — also known as Bayes’ Theorem:

10) \quad p(B|Y) = p(Y|B) \times p(B) / p(Y)

As well as a combination of the two-variable versions of equations 3 and 4 as 11:

11) \quad p(Y) = \sum \limits_{b \in B} (p(B=b) \times p(Y|B=b))

Starting with 11, we note that for any given b, p(B=b) = 1 / N where N is the number of books. (This is because B is uniformly distributed across all books.) That means p(B=b) is a constant, and so we can move it outside the sum:

12) \quad p(Y) = p(B) \times \sum \limits_{b \in B} p(Y|B=b)

Substituting that into equation 10 we get:3

13) \quad p(B|Y) = p(Y|B) \times p(B) / (p(B) \times \sum \limits_{b \in B} p(Y|B=b))

Conveniently, the p(B) terms now cancel out:

14) \quad p(B|Y) = p(Y|B) / \sum \limits_{b \in B} p(Y|B=b)

And we substitute that into the our previous result (9) above:

15) \quad p(X|Y) = \sum \limits_{b \in B}(p(X|B=b) \times p(Y|B=b) / \sum \limits_{b \in B} p(Y|B=b))

Now we can simplify again by noticing that \sum \limits_{b \in B} p(Y|B=b) is constant with respect to the outer sum, because all the changes in the values of p(Y|B=b) in the outer sum are subsumed by the inner sum. So we can move that out of the sum.4

16) \quad p(X|Y) = \sum \limits_{b \in B} (p(X|B=b) \times p(Y|B=b)) / \sum \limits_{b \in B} p(Y|B=b)

This is a very interesting result, because it looks suspiciously like the formula for cosine similarity. To see that more clearly, suppose that instead of looking at p(X|B=b) and p(Y|B=b) as conditional probabilities, we looked at them as vectors with a dimension for every value of b — that is, as X = [x_1\ x_2\ x_3\ ...\ x_n] and Y = [y_1\ y_2\ y_3\ ...\ y_n]. We’d get this formula:

17) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}

And here’s the formula for cosine similarity for comparison:

18) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}

As you can see, the sum on top is identical in both formulas. It’s a dot product of the vectors X and Y. The difference is on the bottom. In the cosine similarity formula, the bottom is the multiple of the lengths of the two vectors — that is, the Euclidean norm. But in the new formula we derived, it’s a simple sum! In fact, we could even describe it as the Manhattan norm, which makes the relationship between the two formulas even clearer. To convert between them, we can simply replace the Euclidean norm of both vectors with the Manhattan norm of the second vector Y — that is, the conditioning probability distribution in p(X|Y).

So at this point, you might be thinking “That was a lot of trouble for a result that hardly differs from cosine similarity. What was the point again?”

There are two answers.

A diagram illustrating cosine similarity
The cosine similarity between two topics. Topic One accounts for three words in Book One and four words in Book Two. Topic Two accounts for five words in Book One and four words in Book Two. The cosine similarity is equal to the cosine of the angle between them, Theta. If you’re not confused by this, then I think you should be. It’s weird that we’re still modeling texts as vectors.

The first emphasizes the familiar. Although we just derived something that looks almost identical to cosine similarity, we derived it using a specific and well-defined statistical model that invites a new and more concrete set of interpretations. Now we don’t have to think about vectors in an abstract space; we can think about physical people holding physical books. So we’ve come to a better way to theorize what cosine similarity measures in this case, even if we don’t seem to have discovered anything new.

The second emphasizes the unfamiliar. Although what we have derived looks similar to cosine similarity, it is dramatically different in one respect. It’s an asymetric operator.

If you’re not sure what that means, look at the diagrams to the right. They illustrate the vector space model that cosine similarity uses, this time with two simple 2-d vectors. Think of the vectors as representing topics and the axes as representing documents. In that case, the cosine similarity corresponds roughly to the similarity between topics — assuming, that is, that topics that appear together frequently must be similar.

A graph of the cosine function.
Cosine similarity is symmetric because the cosine function is symmetric around 0. Swapping the order of application negates theta, but the cosine of negative theta is the same as the cosine of theta.

The cosine similarity is simply the cosine of the angle between them, theta. As the angle gets larger, the cosine goes down; as it gets smaller, the cosine goes up. If the vectors both point in exactly the same direction, then the cosine is 1. And if they are at right angles, it’s 0. But notice that if you swap them, the value doesn’t change. The cosine similarity of A with respect to B is the same as the cosine similarity of B with respect to A. That’s what it means to say that this is a symmetric operator.

But browsing similarity is not a symmetric operator. Why? Because the denominator only includes the norm of the conditioning variable (the Y in p(X|Y)). The conditioned variable (X in p(X|Y)) isn’t included in the denominator. This means that if the sum of X is different from the sum of Y, the result will be different depending on the order in which the operator is applied. This means that the graph this measure produces will be a directed graph. Look — see the arrows?

The arrows indicate the order in which the similarity operator has been applied to the given topic vectors. If the arrow is pointing towards topic X, then it represents the value p(X|Y) — otherwise, it represents the value p(Y|X).

What’s exciting is that this has a physical interpretation. Concretely, the probability that people who are interested in topic Y will be exposed to topic X is different from the probability that people who are interested in topic X will be exposed to topic Y. This shouldn’t be too surprising; obviously a very popular topic will be more likely to “attract” readers from other topics than an unpopular topic. So this is really how it ought to be: if Y is an unpopular topic, and X is a very popular topic, then p(Y|X) should be much lower than P(X|Y).

This makes me think that we’ve been using the wrong similarity measure to create our topic graphs. We’ve been assuming that any two topics are equally related to one another, but that’s not really a sound assumption, except under a model that’s so abstract that we can’t pin a physical interpretation to it at all.

Furthermore, this measurement isn’t limited to use on topic models. It can be used on any pair of types related to each other by categorical distributions; or, to put it another way, it can be used to collapse any bipartite graph into a non-bipartite graph in a probabilistically sound way. This opens up lots of interesting possibilities that I’ll discuss in later posts. But to give you just one example, consider this: instead of thinking of topics and books, think of books and recommenders. Suppose you have a community of recommenders, each of whom is disposed to recommend a subset of books with different probabilities. The recommenders form one set of nodes; the books form the other. The browsing process would look like this: you talk to someone likely to recommend a book you already like and ask for another recommendation. Then you talk to someone else likely to recommend that book, and so on. This could be a new way to implement recommender systems like those used by Amazon.


  1. I hope someone will tell me it’s not! In that case, my task is easier, because it means I won’t have to do all the theory. I’ve found a few near misses. In a paper on topic model diagnostics, Chuang, et. al. propose a refinement of cosine similarity for finding matches between topics across multiple independently-run models. Unlike the measure I’m proposing, theirs uses topic-word vectors; is not based on a generative model; and is a symmetric operator. However, theirs was better at predicting human judgments than cosine similarity was, and is worth investigating further for that reason. See also this wonderful post by Brendan O’Connor, which lays out a fascinating set of relationships between various measures that all boil down to normed dot products using different norms. This measure could be added to that list. 
  2. Here’s the graph-theoretic interpretation: given that the topics and books together form a complete bipartite graph, in which the edges are weighted by the proportion of the book that is about the linked topic, this is equivalent to building a non-bipartite directed graph describing the probability of moving from one topic to another while browsing. This takes up ideas that I first encountered through Scott Weingart
  3. This is almost the same as what Wikipedia calls the “extended version” of Bayes Theorem. The only difference is a small modification that resulted from the step we took to generate equation 12, which was possible because p(B) is uniform. 
  4. If you’re anything like me, you probably start to feel a little anxious every time a variable moves inside or outside the summation operator. To give myself a bit more confidence that I wasn’t doing something wrong, I wrote this script, which repeatedly simulates the 7-step process specified above a million times on an imaginary corpus, and compares the resulting topic transition probability to the probability calculated by the derived formula. The match is close, with a tight standard deviation, and very similar results for several different averaging methods. The script is a little bit sloppy — my apologies — but I encourage anyone interested in verifying this result to look at the script and check my work. Let me know if you see a potential problem!