Topic Modeling

Using generative models to identify statistically interesting clusters of words in texts, and my thoughts thereon.

Directed Probabilistic Topic Networks

Suppose you’re standing in front of your bookcase, feeling a little bored. You pick up a book at random, and read a few pages on a random topic. It piques your curiosity, so you put the first book away and pick another one that has something to say about the same topic. You read a few pages from it and notice a second topic that interests you. So you pick up a third book on that topic, and that book draws your attention to yet another topic. And you continue moving from book to topic to book to topic — forever.

Wouldn’t be interesting if we could describe that process mathematically?

Books from Yale's Beinecke LibraryFor the for the last few months I’ve been thinking about the best way to create useful networks with topic models. People have been creating network visualizations of topic models for a long time now, but they sometimes feel a bit like window dressing.1 The problem is that we don’t know what these networks actually represent. The topics are just blobs linked together and floating in a mysterious, abstract space. But what if we could create a network with a clear and concrete interpretation in terms of a physical process that we understand? What if we could create a network that represents the process of browsing through the books on a bookshelf?

I have struck upon a formula that I think does just that — it describes the probability of moving from one topic to another while browsing through a corpus. Remarkably, the formula is very similar to the formula for cosine similarity, which is one of the more popular ways of measuring the similarity between topics. But it differs in crucial ways, and it creates a kind of topic network that I haven’t seen before.2 I’d like to hear what others think about it.

I’ve developed two different theoretical arguments that suggest that the networks this formula creates are more useful than the networks that cosine similarity creates. One argument is related to the theory of bimodal networks, and the other is related to the theory of Markov chains. I have several posts queued up that go into the details, but I’ve decided not to post them just yet. Instead, I’m going to let the method speak for itself on practical grounds. I’ll post more once I feel confident that the result is worth the cognitive investment. However, if you’re interested in the fundamental math, I’ve posted a derivation.

A snippet of a topic network diagram.
A snippet from a network diagram of a topic model of PMLA issues by Ted Underwood — part of the preliminary research that led to Goldstone and Underwood’s “The Quiet Transformations of Literary Studies.”

For now, I’ll assume that most readers are already familiar with — or else are profoundly indifferent to — a few background ideas about topic modeling, cosine similarity, and topic networks.3 I hope that won’t exclude too many people from the conversation, because my core argument will be mostly visual and practical: I think that visualizations of these networks look better, and I think the idea of a “browsing similarity” between topics sounds useful — do you?

So feel free to skip past the wonkish bits to the diagrams!

In my own experimentation and research, I’ve found that browsing similarity creates topic networks that differ in several ways from those that cosine similarity creates. First, they distribute links more uniformly between nodes. It’s desirable to simplify topic networks by cutting links with a flat threshold, because the result is easy to reason about. When you do that with this new kind of network, most of the nodes stick together in one loose clump with lots of internal clustering. Second, they invite a probabilistic interpretation with some interesting and well understood theoretical properties.4 Those properties ensure that even some of the more abstract network-theoretical measures, like eigenvalue centrality, have concrete interpretations. And third, they are directed — which says some important things about the relationships between topics in a corpus.

Below are three network diagrams based on a topic model of two thousand eighteenth-century books published between 1757 and 1795.5 Each has 150 nodes (one for each topic in the model). The strengths of the links between each of the nodes are calculated using either cosine similarity or browsing similarity between topic vectors. Each vector is a sequence of book proportions for a given topic. To usefully visualize a topic model as a network, you must cut some links, and the easiest approach is to apply some kind of threshold. Links stronger than some value stay, and the rest are cut. In all the network diagrams below, I’ve selected threshold values that produce exactly 225 links. For layout, I’ve used D3’s force-directed layout routine, so the diagrams will look a little different each time you reload this page.6

In the first diagram, I’ve used cosine similarity with a simple flat threshold. The result is a hairball with a lot of little islands floating around it:

To deal with this problem, Ted Underwood came up with a really clever link-cutting heuristic that produces much cleaner network diagrams. However, it’s a little ad-hoc; it involves retaining at least one link from every node, and then retaining additional links if they’re strong enough. It’s like a compromise between a flat threshold (take all links stronger than x) and a rank-based threshold (take the strongest n links from each node).

In the second diagram, I’ve used cosine similarity again, but applied a variation on Underwood’s heuristic with a tunable base threshold.7

The result is much more coherent, and there’s even a bit of suggestive clustering in places. There are a few isolated archipelagos, but there are no singleton islands, because this method guarantees that each node will link to at least one other node.

Now for the browsing similarity approach. In the third diagram, I’ve used browsing similarity with a simple flat threshold:

Although this diagram has both singleton islands and archipelagos, it’s far more connected than the first, and it has almost as many mainland connections as the second. It also shows a bit more clustering behavior than diagram two does. But what I find most interesting about it is that it represents the concrete browsing process I described above: each of the edges represents a probability that while browsing randomly through the corpus, you will happen upon one topic, given that you are currently reading about another.8 That’s why the edges are directed — you won’t be as likely to move from topic A to topic B as from topic B to topic A. This makes perfect sense: it ought to be harder to move from common topics to rare topics than to move from rare topics to common topics.

Because I wanted to show the shapes of these graphs clearly, I’ve removed the topic labels. But you can also see full-screen versions of the cosine, Underwood, and browsing graphs, complete with topic labels that show more about the kinds of relationships that each of them preserve.

Here’s everything you need to play with the browsing similarity formula. First, the mathematical formula itself:

\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}

You can think of b as standing for “book,” and X and Y as two different topics. x_1 is the proportion of book 1 that is labeled as being about topic X, and so on. The formula is very similar to the forumla for cosine similarity, and the tops are identical. Both calculate the dot product of two topic-book vectors. The difference between them is on the bottom. For browsing similarity, it’s simply the sum of the values in Y, but for cosine similarity, it’s the product of the lengths of the two vectors:

\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}

Here a bit of jargon is actually helpful: cosine similarity uses the “euclidean norm” of both X and Y, while browsing similarity uses only the “manhattan norm” of Y, where “norm” is just a ten dollar word for length. Thinking about these as norms helps clarify the relationship between the two formulas: they both do the same thing, but browsing similarity uses a different concept of length, and ignores the length of X. These changes turn the output of the formula into a probability.

Next, some tools. I’ve written a script that generates Gephi-compatible or D3-compatible graphs from MALLET output. It can use cosine or browsing similarity, and can apply flat, Underwood-style, or rank-based cutoff thresholds. It’s available at GitHub, and it requires numpy and networkx. To use it, simply run MALLET on your corpus of choice, and pass the output to tmtk.py on the command line like so:

./tmtk.py network --remove-self-loops \
    --threshold-value 0.05 \
    --threshold-function flat \
    --similarity-function browsing \
    --output-type gexf \
    --write-network-file browsing_sim_flat \
    --topic-metadata topic_names.csv \
    mallet_output.composition

It should be possible to cut and paste the above command into any bash terminal — including Terminal in OS X under default settings. If you have any difficulties, though, let me know! It may require some massaging to work with Windows. The command should generate a file that can be directly opened by Gephi. I hope the option names are obvious enough; more detailed information about options is available via the --help option.

In case you’d prefer to work this formula into your own code, here is a simplified version of the browsing_similarity function that appears in the above Python script. Here, A is a matrix of topic row vectors. The code here is vectorized to calculate every possible topic combination at once and put them all into a new matrix. You can therefore interpret the output as the weighted adjacency matrix of a fully-connected topic network.

def browsing_similarity(A):
    A = numpy.asarray(A)
    norm = A.T.sum(axis=0)
    return numpy.dot(A, A.T) / norm

And here’s the same thing in R9:

browsingsim norm = rowSums(A)
    dot = A %*% t(A)
    return(t(dot / norm))
}

Matrix calculations like this are a dream when the shapes are right, and a nightmare when they’re wrong. So to be ridiculously explicit, the matrix A should have number_of_topics rows and number_of_books columns.

I have lots more to say about bimodal networks, conditional probability, Markov chains, and — at my most speculative — about the questions we ought to ask as we adapt more sophisticated mathematical techniques for use in the digital humanities.

But until then, comments are open!


  1. Ted Underwood has written that “it’s probably best to view network visualization as a convenience,” and there seems to be an implicit consensus that topic networks are more visually stunning than useful. My hope is that by creating networks with more concrete interpretations, we can use them to produce evidence that supports interesting arguments. There are sure to be many details to work through and test before that’s possible, but I think it’s a research program worth developing further. 
  2. I’ve never seen anything quite like this formula or the networks it produces. But I’d love to hear about other work that people have done along these lines — it would make the theoretical burden much lighter. Please let me know if there’s something I’ve missed. See also a few near-misses in the first footnote to my post on the formula’s derivation. [Update: I found a description of the Markov Cluster Algorithm, which uses a matrix that is similar to the one that browsing similarity produces, but that is created in a slightly different way. I’m investigating this further, and I’ll discuss it when I post on Markov chains.] 
  3. If you’d like to read some background material, and you don’t already have a reading list, I propose the following sequence: Matt Jockers, The LDA Buffet Is Now Open (very introductory); Ted Underwood, Topic Modeling Made Just Simple Enough (simple enough but no simpler); Miriam Posner and Andy Wallace, Very Basic Strategies for Interpreting Results from the Topic Modeling Tool (practical approaches for quick bootstrapping); Scott Weingart, Topic Modeling and Network Analysis (introduction to topic networks); Ted Underwood, Visualizing Topic Models (additional theorization of topic visualization). 
  4. Specifically, their links can be interpreted as transition probabilities in an irreducible, aperiodic Markov chain. That’s true of many networks, strictly speaking. But in this case, the probabilities are not derived from the network itself, but from the definition of the browsing process. 
  5. I have a ton of stuff to say about this corpus in the future. It’s part of a collaborative project that Mae Capozzi and I have been working on. 
  6. Because the force-directed layout strategy is purely heuristic, the layout itself is less important than the way the nodes are interconnected. But the visual intuition that the force-directed layout provides is still helpful. I used the WPD3 WordPress plugin to embed these. It’s a little finicky, so please let me know if something has gone wrong. 
  7. Underwood’s original method kept the first link, the second link if it was stronger than 0.2, and the third link if it was stronger than 0.38. This variation takes a base threshold t, which is multiplied by the rank of a given link to determine the threshold it must meet. So if the nth strongest link from a node is stronger than t * (n - 1), then it stays. 
  8. Because some links have been cut, the diagram doesn’t represent a full set of probabilities. It only represents the strongest links — that is, the topic transitions that the network is most biased towards. But the base network retains all that information, and standard network measurements apply to it in ways that have concrete meanings. 
  9. I had to do some odd transpositions to ensure that the R function generates the same output as the Python function. I’m not sure I used the best method — the additional transpositions make the ideas behind the R code seem less obvious to me. (The Python transpositions might seem odd to others — I guess they look normal to me because I’m used to Python.) Please let me know if there’s a more conventional way to manage that calculation. 

Deriving Browsing Similarity

The following is an extremely deliberate, step-by-step derivation of what I think is a novel vector similarity measure.1 I discuss motivations for it here; this focuses on the math. Describing this as a “vector” similarity measure might be a little deceptive because part of the point is to get away from the vector model that we’ve inherited via cosine similarity. But in its final form, the measure is similar enough to cosine similarity that it’s helpful to hold on to that way of thinking for now.

The initial aim of this similarity measure was to create better topic model graphs, and I think it really does. But as I’ve thought it through, I’ve realized that it has applications to any bipartite graph that can be interpreted in probabilistic terms. More on that later!

The derivation is pure conditional probability manipulation; it doesn’t require anything but algebra and a few identities. But if you’re not familiar with the concepts of conditional probability and marginalization (in the mathematical sense!) you may want to read up on them a bit first. Also, I’m not certain that I’m using conventional notation here — please let me know if I’ve done something odd or confusing. But I’m confident the reasoning itself is correct.

Informally, the aim of this measure is to determine the probability of happening upon one topic while browsing through a corpus for another one. Imagine for a moment that the corpus is a collection of physical books on a bookshelf in front of you. You’re interested in crop circles, and you are looking through the books on the bookshelf to find information about them. What is the probability that in the process, you’ll happen upon something about clock gear ratios?2

Formally, suppose we have random variables X and Y, each representing identical categorical distributions over topics in a model, and suppose we have B, a random variable representing a uniform distribution over all books in the model’s corpus. We’re interested in the probability of happening upon topic x given that we selected a book b based on the proportion of the book discussing topic y. To be as precise as possible, I’ll assume that we use the following process to browse through the corpus:

  1. Pick a topic y of interest from Y.
  2. Pick a book at random, with uniform probability.
  3. Pick a word from the book at random, with uniform probability.
  4. If the word is not labeled y, put the book back on the shelf and repeat steps 2 and 3 until the word you choose is labeled y.
  5. Pick another word from the book at random, again with uniform probability.
  6. Use that word’s topic label x as your new topic of interest.
  7. Repeat steps 2 through 6 indefinitely.

This is roughly equivalent to the less structured process I described above in the informal statement of the problem, and there’s at least some reason to believe that the probabilities will turn out similarly. Suppose, for example, that you know in advance the proportion of each book devoted to each topic, and choose a book based on that information; and that you then choose your new topic using that book’s topic distribution. The probabilities in that case should work out to be the same as if you use the above process. But specifying the above process ensures that we know exactly where to begin our derivation and how to proceed.

In short, what we have here is a generative model — but this time, instead of being a generative model of writing, such as LDA, it’s a generative model of reading.

Now for the derivation. First, some basic identities. The first two give different versions of the definition of conditional probability. The third shows the relationship between the conditional probabilities of three variables and their joint probability; it’s a version of the first identity generalized to three variables. And the fourth gives the definition of marginalization over the joint probability of three variables — which simply eliminates one of them by summing over the probabilities for all its possible values. (You can do the same thing with any number of variables, given a joint distribution.) I’m using the convention that an uppercase variable represents a probability distribution over a support (set of possible values) and a lowercase variable represents one possible value from that distribution. To avoid clutter, I’ve silently elided the =x in P(X=x) unless clarity requires otherwise.

1) \quad p(X,Y) = p(X|Y) \times p(Y)

2) \quad p(X|Y) = p(X,Y) / p(Y)

3) \quad p(X,Y,B) = p(Y) \times p(B|Y) \times p(X|B,Y)

4) \quad p(X,Y) = \sum \limits_{b \in B} p(X,Y,B=b)

Now for the derivation. First, substitute (3) into (4):

5) \quad p(X,Y) = \sum \limits_{b \in B} (p(Y) \times p(B=b|Y) \times p(X|B=b,Y))

The prior probability of Y, p(Y), is constant with respect to b, so we can move it outside the sum:

6) \quad p(X,Y) = p(Y) \times \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

Divide both sides:

7) \quad p(X,Y) / p(Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

And by (2):

8) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

Finally, for any given book b, we can simplify p(X|B=b,Y) to p(X|B=b) because the probability of picking a word labeled with topic x depends only on the given book. The topic that led us to choose that book becomes irrelevant once it has been chosen.

9) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b))

So now we have a formula for p(X|Y) in terms of p(B|Y) and p(X|B). And we know p(X|B) — it’s just the probability of finding topic x in book b, which is part of the output from our topic model. But we don’t yet know p(B|Y).

Here’s how we can determine that value. We’ll introduce a combined version of equations 1 and 2 with the variables swapped as 10 — also known as Bayes’ Theorem:

10) \quad p(B|Y) = p(Y|B) \times p(B) / p(Y)

As well as a combination of the two-variable versions of equations 3 and 4 as 11:

11) \quad p(Y) = \sum \limits_{b \in B} (p(B=b) \times p(Y|B=b))

Starting with 11, we note that for any given b, p(B=b) = 1 / N where N is the number of books. (This is because B is uniformly distributed across all books.) That means p(B=b) is a constant, and so we can move it outside the sum:

12) \quad p(Y) = p(B) \times \sum \limits_{b \in B} p(Y|B=b)

Substituting that into equation 10 we get:3

13) \quad p(B|Y) = p(Y|B) \times p(B) / (p(B) \times \sum \limits_{b \in B} p(Y|B=b))

Conveniently, the p(B) terms now cancel out:

14) \quad p(B|Y) = p(Y|B) / \sum \limits_{b \in B} p(Y|B=b)

And we substitute that into the our previous result (9) above:

15) \quad p(X|Y) = \sum \limits_{b \in B}(p(X|B=b) \times p(Y|B=b) / \sum \limits_{b \in B} p(Y|B=b))

Now we can simplify again by noticing that \sum \limits_{b \in B} p(Y|B=b) is constant with respect to the outer sum, because all the changes in the values of p(Y|B=b) in the outer sum are subsumed by the inner sum. So we can move that out of the sum.4

16) \quad p(X|Y) = \sum \limits_{b \in B} (p(X|B=b) \times p(Y|B=b)) / \sum \limits_{b \in B} p(Y|B=b)

This is a very interesting result, because it looks suspiciously like the formula for cosine similarity. To see that more clearly, suppose that instead of looking at p(X|B=b) and p(Y|B=b) as conditional probabilities, we looked at them as vectors with a dimension for every value of b — that is, as X = [x_1\ x_2\ x_3\ ...\ x_n] and Y = [y_1\ y_2\ y_3\ ...\ y_n]. We’d get this formula:

17) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}

And here’s the formula for cosine similarity for comparison:

18) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}

As you can see, the sum on top is identical in both formulas. It’s a dot product of the vectors X and Y. The difference is on the bottom. In the cosine similarity formula, the bottom is the multiple of the lengths of the two vectors — that is, the Euclidean norm. But in the new formula we derived, it’s a simple sum! In fact, we could even describe it as the Manhattan norm, which makes the relationship between the two formulas even clearer. To convert between them, we can simply replace the Euclidean norm of both vectors with the Manhattan norm of the second vector Y — that is, the conditioning probability distribution in p(X|Y).

So at this point, you might be thinking “That was a lot of trouble for a result that hardly differs from cosine similarity. What was the point again?”

There are two answers.

A diagram illustrating cosine similarity
The cosine similarity between two topics. Topic One accounts for three words in Book One and four words in Book Two. Topic Two accounts for five words in Book One and four words in Book Two. The cosine similarity is equal to the cosine of the angle between them, Theta. If you’re not confused by this, then I think you should be. It’s weird that we’re still modeling texts as vectors.

The first emphasizes the familiar. Although we just derived something that looks almost identical to cosine similarity, we derived it using a specific and well-defined statistical model that invites a new and more concrete set of interpretations. Now we don’t have to think about vectors in an abstract space; we can think about physical people holding physical books. So we’ve come to a better way to theorize what cosine similarity measures in this case, even if we don’t seem to have discovered anything new.

The second emphasizes the unfamiliar. Although what we have derived looks similar to cosine similarity, it is dramatically different in one respect. It’s an asymetric operator.

If you’re not sure what that means, look at the diagrams to the right. They illustrate the vector space model that cosine similarity uses, this time with two simple 2-d vectors. Think of the vectors as representing topics and the axes as representing documents. In that case, the cosine similarity corresponds roughly to the similarity between topics — assuming, that is, that topics that appear together frequently must be similar.

A graph of the cosine function.
Cosine similarity is symmetric because the cosine function is symmetric around 0. Swapping the order of application negates theta, but the cosine of negative theta is the same as the cosine of theta.

The cosine similarity is simply the cosine of the angle between them, theta. As the angle gets larger, the cosine goes down; as it gets smaller, the cosine goes up. If the vectors both point in exactly the same direction, then the cosine is 1. And if they are at right angles, it’s 0. But notice that if you swap them, the value doesn’t change. The cosine similarity of A with respect to B is the same as the cosine similarity of B with respect to A. That’s what it means to say that this is a symmetric operator.

But browsing similarity is not a symmetric operator. Why? Because the denominator only includes the norm of the conditioning variable (the Y in p(X|Y)). The conditioned variable (X in p(X|Y)) isn’t included in the denominator. This means that if the sum of X is different from the sum of Y, the result will be different depending on the order in which the operator is applied. This means that the graph this measure produces will be a directed graph. Look — see the arrows?

The arrows indicate the order in which the similarity operator has been applied to the given topic vectors. If the arrow is pointing towards topic X, then it represents the value p(X|Y) — otherwise, it represents the value p(Y|X).

What’s exciting is that this has a physical interpretation. Concretely, the probability that people who are interested in topic Y will be exposed to topic X is different from the probability that people who are interested in topic X will be exposed to topic Y. This shouldn’t be too surprising; obviously a very popular topic will be more likely to “attract” readers from other topics than an unpopular topic. So this is really how it ought to be: if Y is an unpopular topic, and X is a very popular topic, then p(Y|X) should be much lower than P(X|Y).

This makes me think that we’ve been using the wrong similarity measure to create our topic graphs. We’ve been assuming that any two topics are equally related to one another, but that’s not really a sound assumption, except under a model that’s so abstract that we can’t pin a physical interpretation to it at all.

Furthermore, this measurement isn’t limited to use on topic models. It can be used on any pair of types related to each other by categorical distributions; or, to put it another way, it can be used to collapse any bipartite graph into a non-bipartite graph in a probabilistically sound way. This opens up lots of interesting possibilities that I’ll discuss in later posts. But to give you just one example, consider this: instead of thinking of topics and books, think of books and recommenders. Suppose you have a community of recommenders, each of whom is disposed to recommend a subset of books with different probabilities. The recommenders form one set of nodes; the books form the other. The browsing process would look like this: you talk to someone likely to recommend a book you already like and ask for another recommendation. Then you talk to someone else likely to recommend that book, and so on. This could be a new way to implement recommender systems like those used by Amazon.


  1. I hope someone will tell me it’s not! In that case, my task is easier, because it means I won’t have to do all the theory. I’ve found a few near misses. In a paper on topic model diagnostics, Chuang, et. al. propose a refinement of cosine similarity for finding matches between topics across multiple independently-run models. Unlike the measure I’m proposing, theirs uses topic-word vectors; is not based on a generative model; and is a symmetric operator. However, theirs was better at predicting human judgments than cosine similarity was, and is worth investigating further for that reason. See also this wonderful post by Brendan O’Connor, which lays out a fascinating set of relationships between various measures that all boil down to normed dot products using different norms. This measure could be added to that list. 
  2. Here’s the graph-theoretic interpretation: given that the topics and books together form a complete bipartite graph, in which the edges are weighted by the proportion of the book that is about the linked topic, this is equivalent to building a non-bipartite directed graph describing the probability of moving from one topic to another while browsing. This takes up ideas that I first encountered through Scott Weingart
  3. This is almost the same as what Wikipedia calls the “extended version” of Bayes Theorem. The only difference is a small modification that resulted from the step we took to generate equation 12, which was possible because p(B) is uniform. 
  4. If you’re anything like me, you probably start to feel a little anxious every time a variable moves inside or outside the summation operator. To give myself a bit more confidence that I wasn’t doing something wrong, I wrote this script, which repeatedly simulates the 7-step process specified above a million times on an imaginary corpus, and compares the resulting topic transition probability to the probability calculated by the derived formula. The match is close, with a tight standard deviation, and very similar results for several different averaging methods. The script is a little bit sloppy — my apologies — but I encourage anyone interested in verifying this result to look at the script and check my work. Let me know if you see a potential problem!