Suppose you’re standing in front of your bookcase, feeling a little bored. You pick up a book at random, and read a few pages on a random topic. It piques your curiosity, so you put the first book away and pick another one that has something to say about the same topic. You read a few pages from it and notice a second topic that interests you. So you pick up a third book on that topic, and that book draws your attention to yet another topic. And you continue moving from book to topic to book to topic — forever.
Wouldn’t be interesting if we could describe that process mathematically?
For the for the last few months I’ve been thinking about the best way to create useful networks with topic models. People have been creating network visualizations of topic models for a long time now, but they sometimes feel a bit like window dressing.^{1} The problem is that we don’t know what these networks actually represent. The topics are just blobs linked together and floating in a mysterious, abstract space. But what if we could create a network with a clear and concrete interpretation in terms of a physical process that we understand? What if we could create a network that represents the process of browsing through the books on a bookshelf?
I have struck upon a formula that I think does just that — it describes the probability of moving from one topic to another while browsing through a corpus. Remarkably, the formula is very similar to the formula for cosine similarity, which is one of the more popular ways of measuring the similarity between topics. But it differs in crucial ways, and it creates a kind of topic network that I haven’t seen before.^{2} I’d like to hear what others think about it.
I’ve developed two different theoretical arguments that suggest that the networks this formula creates are more useful than the networks that cosine similarity creates. One argument is related to the theory of bimodal networks, and the other is related to the theory of Markov chains. I have several posts queued up that go into the details, but I’ve decided not to post them just yet. Instead, I’m going to let the method speak for itself on practical grounds. I’ll post more once I feel confident that the result is worth the cognitive investment. However, if you’re interested in the fundamental math, I’ve posted a derivation.
For now, I’ll assume that most readers are already familiar with — or else are profoundly indifferent to — a few background ideas about topic modeling, cosine similarity, and topic networks.^{3} I hope that won’t exclude too many people from the conversation, because my core argument will be mostly visual and practical: I think that visualizations of these networks look better, and I think the idea of a “browsing similarity” between topics sounds useful — do you?
So feel free to skip past the wonkish bits to the diagrams!
In my own experimentation and research, I’ve found that browsing similarity creates topic networks that differ in several ways from those that cosine similarity creates. First, they distribute links more uniformly between nodes. It’s desirable to simplify topic networks by cutting links with a flat threshold, because the result is easy to reason about. When you do that with this new kind of network, most of the nodes stick together in one loose clump with lots of internal clustering. Second, they invite a probabilistic interpretation with some interesting and well understood theoretical properties.^{4} Those properties ensure that even some of the more abstract network-theoretical measures, like eigenvalue centrality, have concrete interpretations. And third, they are directed — which says some important things about the relationships between topics in a corpus.
Below are three network diagrams based on a topic model of two thousand eighteenth-century books published between 1757 and 1795.^{5} Each has 150 nodes (one for each topic in the model). The strengths of the links between each of the nodes are calculated using either cosine similarity or browsing similarity between topic vectors. Each vector is a sequence of book proportions for a given topic. To usefully visualize a topic model as a network, you must cut some links, and the easiest approach is to apply some kind of threshold. Links stronger than some value stay, and the rest are cut. In all the network diagrams below, I’ve selected threshold values that produce exactly 225 links. For layout, I’ve used D3’s force-directed layout routine, so the diagrams will look a little different each time you reload this page.^{6}
In the first diagram, I’ve used cosine similarity with a simple flat threshold. The result is a hairball with a lot of little islands floating around it:
To deal with this problem, Ted Underwood came up with a really clever link-cutting heuristic that produces much cleaner network diagrams. However, it’s a little ad-hoc; it involves retaining at least one link from every node, and then retaining additional links if they’re strong enough. It’s like a compromise between a flat threshold (take all links stronger than ) and a rank-based threshold (take the strongest links from each node).
In the second diagram, I’ve used cosine similarity again, but applied a variation on Underwood’s heuristic with a tunable base threshold.^{7}
The result is much more coherent, and there’s even a bit of suggestive clustering in places. There are a few isolated archipelagos, but there are no singleton islands, because this method guarantees that each node will link to at least one other node.
Now for the browsing similarity approach. In the third diagram, I’ve used browsing similarity with a simple flat threshold:
Although this diagram has both singleton islands and archipelagos, it’s far more connected than the first, and it has almost as many mainland connections as the second. It also shows a bit more clustering behavior than diagram two does. But what I find most interesting about it is that it represents the concrete browsing process I described above: each of the edges represents a probability that while browsing randomly through the corpus, you will happen upon one topic, given that you are currently reading about another.^{8} That’s why the edges are directed — you won’t be as likely to move from topic A to topic B as from topic B to topic A. This makes perfect sense: it ought to be harder to move from common topics to rare topics than to move from rare topics to common topics.
Because I wanted to show the shapes of these graphs clearly, I’ve removed the topic labels. But you can also see full-screen versions of the cosine, Underwood, and browsing graphs, complete with topic labels that show more about the kinds of relationships that each of them preserve.
Here’s everything you need to play with the browsing similarity formula. First, the mathematical formula itself:
You can think of as standing for “book,” and and as two different topics. is the proportion of book 1 that is labeled as being about topic , and so on. The formula is very similar to the forumla for cosine similarity, and the tops are identical. Both calculate the dot product of two topic-book vectors. The difference between them is on the bottom. For browsing similarity, it’s simply the sum of the values in , but for cosine similarity, it’s the product of the lengths of the two vectors:
Here a bit of jargon is actually helpful: cosine similarity uses the “euclidean norm” of both and , while browsing similarity uses only the “manhattan norm” of , where “norm” is just a ten dollar word for length. Thinking about these as norms helps clarify the relationship between the two formulas: they both do the same thing, but browsing similarity uses a different concept of length, and ignores the length of . These changes turn the output of the formula into a probability.
Next, some tools. I’ve written a script that generates Gephi-compatible or D3-compatible graphs from MALLET output. It can use cosine or browsing similarity, and can apply flat, Underwood-style, or rank-based cutoff thresholds. It’s available at GitHub, and it requires numpy
and networkx
. To use it, simply run MALLET on your corpus of choice, and pass the output to tmtk.py
on the command line like so:
./tmtk.py network --remove-self-loops \
--threshold-value 0.05 \
--threshold-function flat \
--similarity-function browsing \
--output-type gexf \
--write-network-file browsing_sim_flat \
--topic-metadata topic_names.csv \
mallet_output.composition
It should be possible to cut and paste the above command into any bash terminal — including Terminal in OS X under default settings. If you have any difficulties, though, let me know! It may require some massaging to work with Windows. The command should generate a file that can be directly opened by Gephi. I hope the option names are obvious enough; more detailed information about options is available via the --help
option.
In case you’d prefer to work this formula into your own code, here is a simplified version of the browsing_similarity
function that appears in the above Python script. Here, A
is a matrix of topic row vectors. The code here is vectorized to calculate every possible topic combination at once and put them all into a new matrix. You can therefore interpret the output as the weighted adjacency matrix of a fully-connected topic network.
def browsing_similarity(A):
A = numpy.asarray(A)
norm = A.T.sum(axis=0)
return numpy.dot(A, A.T) / norm
And here’s the same thing in R^{9}:
browsingsim norm = rowSums(A)
dot = A %*% t(A)
return(t(dot / norm))
}
Matrix calculations like this are a dream when the shapes are right, and a nightmare when they’re wrong. So to be ridiculously explicit, the matrix A
should have number_of_topics
rows and number_of_books
columns.
I have lots more to say about bimodal networks, conditional probability, Markov chains, and — at my most speculative — about the questions we ought to ask as we adapt more sophisticated mathematical techniques for use in the digital humanities.
But until then, comments are open!
- Ted Underwood has written that “it’s probably best to view network visualization as a convenience,” and there seems to be an implicit consensus that topic networks are more visually stunning than useful. My hope is that by creating networks with more concrete interpretations, we can use them to produce evidence that supports interesting arguments. There are sure to be many details to work through and test before that’s possible, but I think it’s a research program worth developing further. ↩
- I’ve never seen anything quite like this formula or the networks it produces. But I’d love to hear about other work that people have done along these lines — it would make the theoretical burden much lighter. Please let me know if there’s something I’ve missed. See also a few near-misses in the first footnote to my post on the formula’s derivation. [Update: I found a description of the Markov Cluster Algorithm, which uses a matrix that is similar to the one that browsing similarity produces, but that is created in a slightly different way. I’m investigating this further, and I’ll discuss it when I post on Markov chains.] ↩
- If you’d like to read some background material, and you don’t already have a reading list, I propose the following sequence: Matt Jockers, The LDA Buffet Is Now Open (very introductory); Ted Underwood, Topic Modeling Made Just Simple Enough (simple enough but no simpler); Miriam Posner and Andy Wallace, Very Basic Strategies for Interpreting Results from the Topic Modeling Tool (practical approaches for quick bootstrapping); Scott Weingart, Topic Modeling and Network Analysis (introduction to topic networks); Ted Underwood, Visualizing Topic Models (additional theorization of topic visualization). ↩
- Specifically, their links can be interpreted as transition probabilities in an irreducible, aperiodic Markov chain. That’s true of many networks, strictly speaking. But in this case, the probabilities are not derived from the network itself, but from the definition of the browsing process. ↩
- I have a ton of stuff to say about this corpus in the future. It’s part of a collaborative project that Mae Capozzi and I have been working on. ↩
- Because the force-directed layout strategy is purely heuristic, the layout itself is less important than the way the nodes are interconnected. But the visual intuition that the force-directed layout provides is still helpful. I used the WPD3 WordPress plugin to embed these. It’s a little finicky, so please let me know if something has gone wrong. ↩
- Underwood’s original method kept the first link, the second link if it was stronger than 0.2, and the third link if it was stronger than 0.38. This variation takes a base threshold , which is multiplied by the rank of a given link to determine the threshold it must meet. So if the th strongest link from a node is stronger than , then it stays. ↩
- Because some links have been cut, the diagram doesn’t represent a full set of probabilities. It only represents the strongest links — that is, the topic transitions that the network is most biased towards. But the base network retains all that information, and standard network measurements apply to it in ways that have concrete meanings. ↩
- I had to do some odd transpositions to ensure that the R function generates the same output as the Python function. I’m not sure I used the best method — the additional transpositions make the ideas behind the R code seem less obvious to me. (The Python transpositions might seem odd to others — I guess they look normal to me because I’m used to Python.) Please let me know if there’s a more conventional way to manage that calculation. ↩