# Directed Probabilistic Topic Networks

Suppose you’re standing in front of your bookcase, feeling a little bored. You pick up a book at random, and read a few pages on a random topic. It piques your curiosity, so you put the first book away and pick another one that has something to say about the same topic. You read a few pages from it and notice a second topic that interests you. So you pick up a third book on that topic, and that book draws your attention to yet another topic. And you continue moving from book to topic to book to topic — forever.

Wouldn’t it be interesting if we could describe that process mathematically?

For the last few months I’ve been thinking about the best way to create useful networks with topic models. People have been creating network visualizations of topic models for a long time now, but they sometimes feel a bit like window dressing.1 The problem is that we don’t know what these networks actually represent. The topics are just blobs linked together and floating in a mysterious, abstract space. But what if we could create a network with a clear and concrete interpretation in terms of a physical process that we understand? What if we could create a network that represents the process of browsing through the books on a bookshelf?

I have struck upon a formula that I think does just that — it describes the probability of moving from one topic to another while browsing through a corpus. Remarkably, the formula is very similar to the formula for cosine similarity, which is one of the more popular ways of measuring the similarity between topics. But it differs in crucial ways, and it creates a kind of topic network that I haven’t seen before.2 I’d like to hear what others think about it.

I’ve developed two different theoretical arguments that suggest that the networks this formula creates are more useful than the networks that cosine similarity creates. One argument is related to the theory of bimodal networks, and the other is related to the theory of Markov chains. I have several posts queued up that go into the details, but I’ve decided not to post them just yet. Instead, I’m going to let the method speak for itself on practical grounds. I’ll post more once I feel confident that the result is worth the cognitive investment. However, if you’re interested in the fundamental math, I’ve posted a derivation.

For now, I’ll assume that most readers are already familiar with — or else are profoundly indifferent to — a few background ideas about topic modeling, cosine similarity, and topic networks.3 I hope that won’t exclude too many people from the conversation, because my core argument will be mostly visual and practical: I think that visualizations of these networks look better, and I think the idea of a “browsing similarity” between topics sounds useful — do you?

So feel free to skip past the wonkish bits to the diagrams!

In my own experimentation and research, I’ve found that browsing similarity creates topic networks that differ in several ways from those that cosine similarity creates. First, they distribute links more uniformly between nodes. It’s desirable to simplify topic networks by cutting links with a flat threshold, because the result is easy to reason about. When you do that with this new kind of network, most of the nodes stick together in one loose clump with lots of internal clustering. Second, they invite a probabilistic interpretation with some interesting and well understood theoretical properties.4 Those properties ensure that even some of the more abstract network-theoretical measures, like eigenvector centrality, have concrete interpretations. And third, they are directed, which says some important things about the relationships between topics in a corpus.

Below are three network diagrams based on a topic model of two thousand eighteenth-century books published between 1757 and 1795.5 Each has 150 nodes (one for each topic in the model). The strengths of the links between each of the nodes are calculated using either cosine similarity or browsing similarity between topic vectors. Each vector is a sequence of book proportions for a given topic. To usefully visualize a topic model as a network, you must cut some links, and the easiest approach is to apply some kind of threshold. Links stronger than some value stay, and the rest are cut. In all the network diagrams below, I’ve selected threshold values that produce exactly 225 links. For layout, I’ve used D3’s force-directed layout routine, so the diagrams will look a little different each time you reload this page.6
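The flat-threshold step can be sketched in a few lines. This is only an illustration, not the code behind the diagrams: the function names and the toy matrix are hypothetical, and the sketch assumes a square weighted adjacency matrix in which self-loops are ignored and every off-diagonal entry counts as one directed link.

```python
import numpy as np

def flat_threshold_for_k_links(weights, k):
    """Return the cutoff that keeps the k strongest off-diagonal links."""
    weights = np.asarray(weights, dtype=float)
    off_diagonal = ~np.eye(weights.shape[0], dtype=bool)   # ignore self-loops
    strengths = np.sort(weights[off_diagonal])[::-1]       # strongest first
    return strengths[k - 1]

def apply_flat_threshold(weights, cutoff):
    """Zero out self-loops and every link weaker than the cutoff."""
    weights = np.asarray(weights, dtype=float)
    kept = np.where(weights >= cutoff, weights, 0.0)
    np.fill_diagonal(kept, 0.0)
    return kept

# Hypothetical 3-topic similarity matrix.
toy = np.array([[1.0, 0.5, 0.1],
                [0.4, 1.0, 0.3],
                [0.2, 0.6, 1.0]])
cutoff = flat_threshold_for_k_links(toy, 3)
kept = apply_flat_threshold(toy, cutoff)
```

If several links tie at the cutoff value, slightly more than `k` links survive, so real data may need a small adjustment to hit an exact count.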

In the first diagram, I’ve used cosine similarity with a simple flat threshold. The result is a hairball with a lot of little islands floating around it:

To deal with this problem, Ted Underwood came up with a really clever link-cutting heuristic that produces much cleaner network diagrams. However, it’s a little ad-hoc; it involves retaining at least one link from every node, and then retaining additional links if they’re strong enough. It’s like a compromise between a flat threshold (take all links stronger than $x$) and a rank-based threshold (take the strongest $n$ links from each node).

In the second diagram, I’ve used cosine similarity again, but applied a variation on Underwood’s heuristic with a tunable base threshold.7

The result is much more coherent, and there’s even a bit of suggestive clustering in places. There are a few isolated archipelagos, but there are no singleton islands, because this method guarantees that each node will link to at least one other node.
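The rank-scaled variation described in footnote 7 can be sketched as follows. The function name and the toy matrix are hypothetical, and the sketch assumes that each row of the matrix holds the links leaving one node (transpose first if your matrix is laid out the other way).

```python
import numpy as np

def rank_scaled_threshold(weights, t):
    """Keep the n-th strongest link from each node if it exceeds t * (n - 1)."""
    weights = np.array(weights, dtype=float)
    np.fill_diagonal(weights, 0.0)                 # drop self-loops
    kept = np.zeros_like(weights)
    for node, row in enumerate(weights):
        order = np.argsort(row)[::-1]              # link targets, strongest first
        for rank, target in enumerate(order, start=1):
            if row[target] > t * (rank - 1):
                kept[node, target] = row[target]
    return kept

# Hypothetical 4-topic similarity matrix.
toy = np.array([[0.0, 0.5, 0.1, 0.3],
                [0.5, 0.0, 0.05, 0.02],
                [0.1, 0.05, 0.0, 0.45],
                [0.3, 0.02, 0.45, 0.0]])
kept = rank_scaled_threshold(toy, t=0.2)
```

Because the threshold for the strongest link is zero, every node with at least one positive link keeps it, which is why this kind of cutoff leaves no singleton islands.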

Now for the browsing similarity approach. In the third diagram, I’ve used browsing similarity with a simple flat threshold:

Although this diagram has both singleton islands and archipelagos, it’s far more connected than the first, and it has almost as many mainland connections as the second. It also shows a bit more clustering behavior than diagram two does. But what I find most interesting about it is that it represents the concrete browsing process I described above: each of the edges represents a probability that while browsing randomly through the corpus, you will happen upon one topic, given that you are currently reading about another.8 That’s why the edges are directed — you won’t be as likely to move from topic A to topic B as from topic B to topic A. This makes perfect sense: it ought to be harder to move from common topics to rare topics than to move from rare topics to common topics.

Because I wanted to show the shapes of these graphs clearly, I’ve removed the topic labels. But you can also see full-screen versions of the cosine, Underwood, and browsing graphs, complete with topic labels that show more about the kinds of relationships that each of them preserves.

Here’s everything you need to play with the browsing similarity formula. First, the mathematical formula itself:

$\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}$

You can think of $b$ as standing for “book,” and $X$ and $Y$ as two different topics. $x_1$ is the proportion of book 1 that is labeled as being about topic $X$, and so on. The formula is very similar to the formula for cosine similarity, and the tops are identical. Both calculate the dot product of two topic-book vectors. The difference between them is on the bottom. For browsing similarity, it’s simply the sum of the values in $Y$, but for cosine similarity, it’s the product of the lengths of the two vectors:

$\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}$

Here a bit of jargon is actually helpful: cosine similarity uses the “euclidean norm” of both $X$ and $Y$, while browsing similarity uses only the “manhattan norm” of $Y$, where “norm” is just a ten dollar word for length. Thinking about these as norms helps clarify the relationship between the two formulas: they both do the same thing, but browsing similarity uses a different concept of length, and ignores the length of $X$. These changes turn the output of the formula into a probability.
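A small worked example may make the difference concrete. The two vectors below are made-up proportions for three books, not values from the corpus, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical proportions of three books devoted to two topics.
x = np.array([0.6, 0.3, 0.1])     # a common topic
y = np.array([0.1, 0.0, 0.05])    # a rare topic

dot = np.dot(x, y)                # shared numerator of both formulas

# Cosine similarity: divide by the Euclidean norms of both vectors.
cosine = dot / (np.linalg.norm(x) * np.linalg.norm(y))

# Browsing similarity: divide by the Manhattan norm of Y only.
browsing_x_given_y = dot / y.sum()
# Swapping the roles of X and Y gives a different number.
browsing_y_given_x = dot / x.sum()
```

Cosine similarity gives one symmetric score for the pair. Browsing similarity gives two: on the browsing reading, arriving at the common topic from the rare one is much more likely than the reverse.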

Next, some tools. I’ve written a script that generates Gephi-compatible or D3-compatible graphs from MALLET output. It can use cosine or browsing similarity, and can apply flat, Underwood-style, or rank-based cutoff thresholds. It’s available at GitHub, and it requires numpy and networkx. To use it, simply run MALLET on your corpus of choice, and pass the output to tmtk.py on the command line like so:

    ./tmtk.py network --remove-self-loops \
        --threshold-value 0.05 \
        --threshold-function flat \
        --similarity-function browsing \
        --output-type gexf \
        --write-network-file browsing_sim_flat \
        --topic-metadata topic_names.csv \
        mallet_output.composition


It should be possible to cut and paste the above command into any bash terminal — including Terminal in OS X under default settings. If you have any difficulties, though, let me know! It may require some massaging to work with Windows. The command should generate a file that can be directly opened by Gephi. I hope the option names are obvious enough; more detailed information about options is available via the --help option.

In case you’d prefer to work this formula into your own code, here is a simplified version of the browsing_similarity function that appears in the above Python script. Here, A is a matrix of topic row vectors. The code here is vectorized to calculate every possible topic combination at once and put them all into a new matrix. You can therefore interpret the output as the weighted adjacency matrix of a fully-connected topic network.

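    import numpy

    def browsing_similarity(A):
        A = numpy.asarray(A)              # rows are topics, columns are books
        norm = A.T.sum(axis=0)            # Manhattan norm of each topic row
        return numpy.dot(A, A.T) / norm   # entry [x, y]: dot(x, y) / sum(y)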


And here’s the same thing in R9:

    browsingsim <- function(A) {
        norm = rowSums(A)
        dot = A %*% t(A)
        return(t(dot / norm))
    }


Matrix calculations like this are a dream when the shapes are right, and a nightmare when they’re wrong. So to be ridiculously explicit, the matrix A should have number_of_topics rows and number_of_books columns.
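Here is a quick shape check on made-up data. The toy matrix is hypothetical; the function is repeated so that the snippet runs on its own. It assumes that the topic proportions within each book sum to one, in which case every column of the output should also sum to one.

```python
import numpy as np

def browsing_similarity(A):
    A = np.asarray(A)
    norm = A.T.sum(axis=0)
    return np.dot(A, A.T) / norm

# Hypothetical toy model: 3 topics (rows) by 4 books (columns).
# Each column sums to one, as the topic proportions of a single book should.
A = np.array([[0.7, 0.2, 0.1, 0.5],
              [0.2, 0.5, 0.1, 0.4],
              [0.1, 0.3, 0.8, 0.1]])

B = browsing_similarity(A)        # 3 x 3 weighted adjacency matrix
column_sums = B.sum(axis=0)       # each should be 1.0
is_symmetric = np.allclose(B, B.T)
```

Under that assumption, column `y` of `B` can be read as a full set of transition probabilities out of topic `y`, and the matrix is not symmetric, which is what makes the network directed.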

I have lots more to say about bimodal networks, conditional probability, Markov chains, and — at my most speculative — about the questions we ought to ask as we adapt more sophisticated mathematical techniques for use in the digital humanities.

But until then, comments are open!

1. Ted Underwood has written that “it’s probably best to view network visualization as a convenience,” and there seems to be an implicit consensus that topic networks are more visually stunning than useful. My hope is that by creating networks with more concrete interpretations, we can use them to produce evidence that supports interesting arguments. There are sure to be many details to work through and test before that’s possible, but I think it’s a research program worth developing further.
2. I’ve never seen anything quite like this formula or the networks it produces. But I’d love to hear about other work that people have done along these lines — it would make the theoretical burden much lighter. Please let me know if there’s something I’ve missed. See also a few near-misses in the first footnote to my post on the formula’s derivation. [Update: I found a description of the Markov Cluster Algorithm, which uses a matrix that is similar to the one that browsing similarity produces, but that is created in a slightly different way. I’m investigating this further, and I’ll discuss it when I post on Markov chains.]
3. If you’d like to read some background material, and you don’t already have a reading list, I propose the following sequence: Matt Jockers, The LDA Buffet Is Now Open (very introductory); Ted Underwood, Topic Modeling Made Just Simple Enough (simple enough but no simpler); Miriam Posner and Andy Wallace, Very Basic Strategies for Interpreting Results from the Topic Modeling Tool (practical approaches for quick bootstrapping); Scott Weingart, Topic Modeling and Network Analysis (introduction to topic networks); Ted Underwood, Visualizing Topic Models (additional theorization of topic visualization).
4. Specifically, their links can be interpreted as transition probabilities in an irreducible, aperiodic Markov chain. That’s true of many networks, strictly speaking. But in this case, the probabilities are not derived from the network itself, but from the definition of the browsing process.
5. I have a ton of stuff to say about this corpus in the future. It’s part of a collaborative project that Mae Capozzi and I have been working on.
6. Because the force-directed layout strategy is purely heuristic, the layout itself is less important than the way the nodes are interconnected. But the visual intuition that the force-directed layout provides is still helpful. I used the WPD3 WordPress plugin to embed these. It’s a little finicky, so please let me know if something has gone wrong.
7. Underwood’s original method kept the first link, the second link if it was stronger than 0.2, and the third link if it was stronger than 0.38. This variation takes a base threshold $t$, which is multiplied by the rank of a given link to determine the threshold it must meet. So if the $n$th strongest link from a node is stronger than $t * (n - 1)$, then it stays.
8. Because some links have been cut, the diagram doesn’t represent a full set of probabilities. It only represents the strongest links — that is, the topic transitions that the network is most biased towards. But the base network retains all that information, and standard network measurements apply to it in ways that have concrete meanings.
9. I had to do some odd transpositions to ensure that the R function generates the same output as the Python function. I’m not sure I used the best method — the additional transpositions make the ideas behind the R code seem less obvious to me. (The Python transpositions might seem odd to others — I guess they look normal to me because I’m used to Python.) Please let me know if there’s a more conventional way to manage that calculation.

# The Markov Chains of La Grande Jatte: A Short Introduction to Gibbs Sampling

Topic modeling has been attracting the attention of scholars in the digital humanities for several years now, and quite a few substantive introductions to the subject have been written. Ben Schmidt offered a brief overview of the genre in 2012, and the list he provided is still fairly comprehensive, as far as I can tell.1 My current favorite is an entry from Miriam Posner and Andy Wallace that emphasizes the practical side of topic modeling — it’s great for bootstrapping if you’re new to the subject.

This post will cover something slightly different. When I started to delve into the details of topic modeling, I quickly realized that I needed to create my own implementation of Latent Dirichlet Allocation (LDA) to begin understanding how it worked. I eventually did, but even with all the terrific resources available, I ran into several significant roadblocks.2 The biggest one for me was figuring out Gibbs sampling. A lot of introductions to topic modeling don’t spend much time on Gibbs sampling, for understandable reasons. It’s not part of LDA properly speaking, so you don’t need to understand how it works to understand the fundamentals of LDA. In fact, in his original description of LDA, David Blei didn’t even talk about Gibbs sampling — he used a thing called “variational inference,” which is a wall of abstraction that I still haven’t managed to scale.

Fortunately, Gibbs sampling yielded to my efforts more readily. And although it’s not strictly necessary to understand Gibbs sampling to understand LDA, I think it’s worth understanding for other reasons. In fact, I’ve come to believe that Gibbs sampling is a wonderful introduction to the rapidly evolving world of machine learning — a world that I think at least a subset of digital humanists should have much broader knowledge of.

### What is Gibbs sampling?

Here’s my attempt at a definition: Gibbs sampling is a way to build a picture of a global probability distribution when you only have local information about that distribution. That’s more of a description than a definition; other techniques do that too. But I like it because it shows what Gibbs sampling is good at. You can use it to take lots of little bits of information — like individual word counts — and construct a global view of those bits.

Suppose that you are temporarily Georges Seurat, but you couldn’t make it to the Island of La Grande Jatte today. Instead of seeing it for yourself or looking at someone else’s picture, you decide to consult with Sam, your omniscient imaginary friend. Sam supplies you with some probabilities like so:

Given that you have just put a green dot here (Sam points at a spot on the canvas):

• The probability is $\mu$ that your next dot will be an orange dot there.
• The probability is $\eta$ that your next dot will be a blue dot over there.
• The probability is … [more tiny numbers]

This list goes on until every possible location on the canvas and every possible color has a probability associated with it. It turns out they all add up to one. (They’re probabilities, after all!) Then Sam gives you another list that starts with another location and possible color. You get lists from every possible point and color on the canvas, to every possible point and color on the canvas. Now, at any moment while painting, you can look up the dot you’ve just painted in the table. You can then use that dot’s transition probabilities to decide how to paint the next one.

So you just start painting dots. And lo and behold, after a really long time, you’re looking at a picture of La Grande Jatte.

What I’ve just described is called a Markov chain.3 Gibbs sampling adds just one more little twist. But before I get to that, I want to explain why this is possible. Sam’s table of probabilities has to meet three conditions for this to work. The first two dictate the kinds of movements between points and colors that the table of probabilities must allow. First, the table of probabilities must allow you to get from any point and color in the painting to any other. It doesn’t have to allow you to get from one to the other in a single step, but it has to allow you to get there eventually. And second, the table of probabilities must allow you to get from one point and color to another at irregular intervals. So if it always takes you two, or four, or eight steps to get from node A to node B, and never any other number of steps, then the table doesn’t satisfy this condition, because the number of steps required to get from A to B is always a multiple of two.
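The second condition is easy to see on a toy table. Both two-state tables below are invented for illustration; entry `[i, j]` is taken to be the probability of moving from state `i` to state `j`.

```python
import numpy as np

# "flip" always switches state, so it can only return home in an even number of steps.
flip = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
# "lazy" sometimes stays put, which breaks the regular rhythm.
lazy = np.array([[0.5, 0.5],
                 [1.0, 0.0]])

def return_probabilities(P, steps):
    """Probability of being back at state 0 after 1, 2, ..., steps moves."""
    return [np.linalg.matrix_power(P, n)[0, 0] for n in range(1, steps + 1)]

flip_returns = return_probabilities(flip, 6)
lazy_returns = return_probabilities(lazy, 6)
```

The `flip` table returns to its starting state only after 2, 4, or 6 steps, so it fails the condition; the `lazy` table can return after any number of steps.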

Together, these conditions tell us that the Markov chain has what’s called a stationary distribution.4 It’s a probability distribution over every point and possible color on the canvas. It tells you how often you will paint a particular dot, on average, if you keep painting forever. If Sam’s table meets these first two conditions, then we can prove that it has a stationary distribution, and we can even prove that its stationary distribution is unique. At that point, it only has to meet one more condition: its stationary distribution must be a painting of La Grande Jatte.
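As a rough illustration of that claim, the sketch below uses an invented three-state table in place of Sam’s. It computes the stationary distribution by repeatedly applying the table, then checks that a long random walk visits each state about that often. The table, function names, step counts, and seed are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 3-state table; row i lists the probabilities of moving from state i.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

def stationary_by_power_iteration(P, iterations=200):
    """Push a starting distribution through the table until it stops changing."""
    dist = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iterations):
        dist = dist @ P
    return dist

def visit_frequencies(P, steps, seed=0):
    """Walk the chain and record how often each state is visited."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(P.shape[0])
    state = 0
    for _ in range(steps):
        state = rng.choice(P.shape[0], p=P[state])
        counts[state] += 1
    return counts / steps

pi = stationary_by_power_iteration(P)
freq = visit_frequencies(P, 20000)
```

The two vectors should agree closely: `pi` is the painting, and `freq` is what you get by actually painting dots for a long time.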

What’s neat about this is that none of the individual transition probabilities know much about the painting. It’s only when they get together and “talk” to one another for a while that they start to realize what’s actually going on.5 That’s what Gibbs sampling allows.

### The Catch

The difficult part of using Markov chains this way is figuring out the transition probabilities. How many coordinates and color codes would you need to create an adequate representation of a Seurat painting? I’m not sure, but I bet it’s a number with a lot of zeros at the end. Call it $N$. And to create the full transition table, you’d have to calculate and store probabilities from each of those values to each of those values. That’s a big square table with $N$ rows and columns. These numbers get mind-bogglingly huge for even relatively simple problems.

Gibbs sampling uses a clever trick to get around that issue. It’s based on the simple insight that you don’t have to change every dimension at once. Instead of jumping directly from one point and color to another — from $(x_1, y_1, c_1)$ to $(x_2, y_2, c_2)$ — you can move along one dimension at a time, jumping from $(x_1, y_1, c_1)$ to $(x_2, y_1, c_1)$ to $(x_2, y_2, c_1)$ to $(x_2, y_2, c_2)$, and so on. It turns out that calculating probabilities for those transitions is often much easier and faster — and the stationary distribution stays the same.

In effect, this means that although you might not be able to calculate all the transition probabilities in the table, you can calculate all the relevant transition probabilities pretty easily. This makes almost no practical difference to you as you paint La Grande Jatte. It just means you do three lookups instead of one before painting the next dot. (It also might mean you don’t paint the dot every time, but only every fifth or tenth time, so that your dots aren’t too tightly correlated with one another, and come closer to being genuinely independent samples from the stationary distribution.)
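A minimal sketch of the one-dimension-at-a-time trick, on a tiny invented “painting” with two dimensions instead of three. For convenience the toy reads its conditional probabilities off a full table; the point of the real technique is that the conditionals can usually be computed without ever building that table. The function name, grid, thinning interval, and seed are assumptions.

```python
import numpy as np

# Hypothetical target: a joint distribution over a 3 x 3 grid.
target = np.array([[1.0, 2.0, 1.0],
                   [2.0, 8.0, 2.0],
                   [1.0, 2.0, 1.0]])
target = target / target.sum()

def gibbs_sample(target, n_samples, thin=5, seed=0):
    rng = np.random.default_rng(seed)
    n_rows, n_cols = target.shape
    x, y = 0, 0                                   # arbitrary starting dot
    counts = np.zeros_like(target)
    for step in range(n_samples * thin):
        # Move along x only: sample a row given the current column.
        p_x = target[:, y] / target[:, y].sum()
        x = rng.choice(n_rows, p=p_x)
        # Move along y only: sample a column given the current row.
        p_y = target[x, :] / target[x, :].sum()
        y = rng.choice(n_cols, p=p_y)
        if (step + 1) % thin == 0:                # paint only every thin-th dot
            counts[x, y] += 1
    return counts / n_samples

estimate = gibbs_sample(target, n_samples=5000)
```

After enough dots, `estimate` should look like `target`, even though no single step ever consulted more than one row or one column.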

In the context of the LDA model, this means that you don’t have to leap from one set of hypothetical topic labels to an entirely different one. That makes a huge difference now, because instead of working with a canvas, we’re working with a giant topic hypercube with a dimension for every single word in the corpus. Given that every word is labeled provisionally with a topic, we can just change each topic label individually, over and over, using transition probabilities from this formula that some really smart people have helpfully derived for us. And every time we save a set of topic labels, we’ve painted a single dot on the canvas of La Grande Jatte.6
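Here is a minimal sketch of that loop, assuming the widely used collapsed Gibbs update for LDA with symmetric hyperparameters `alpha` and `beta`. It may differ in detail from the formula linked above, and it is not the code of any particular package; the function name, defaults, and toy corpus are hypothetical.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))      # words in doc d labeled k
    topic_word = np.zeros((n_topics, vocab_size))    # times word w is labeled k
    topic_total = np.zeros(n_topics)                 # all words labeled k
    labels = []
    # Label every word provisionally with a random topic.
    for d, doc in enumerate(docs):
        z = rng.integers(n_topics, size=len(doc))
        labels.append(z)
        for w, k in zip(doc, z):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = labels[d][i]
                # Remove this word's current label from the counts.
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this one word.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + vocab_size * beta))
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new label: one move along one dimension.
                labels[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1
    return labels, doc_topic, topic_word

# Hypothetical corpus: four tiny documents over a four-word vocabulary.
docs = [[0, 1, 0, 1, 0, 1], [2, 3, 2, 3, 2, 3],
        [0, 1, 1, 0, 0, 1], [3, 2, 2, 3, 3, 2]]
labels, doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

Each pass through the innermost loop changes a single topic label, and the full set of labels at any moment is one dot on the canvas.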

### So What?

I began this post with a promise that you’d get something valuable out of this explanation of Gibbs sampling, even though it isn’t part of the core of LDA. I’m going to offer three brief payoffs now, which I hope to expand in later posts.

First, most implementations of LDA use Gibbs sampling, and at least some of the difficulties that LDA appears to have, including some identified by Ben Schmidt, are probably more related to issues with Gibbs sampling than with LDA. Think back to the requirement that to have a stationary distribution, a Markov chain has to be able to reach every possible state from every other possible state. That’s strictly true in LDA, because the LDA model assumes that every word has a nonzero probability of appearing in every topic, and every topic has a nonzero probability of appearing in every document. But in some cases, those probabilities are extremely small. This is particularly true for word distributions in topics, which tend to be very sparse. That suggests that although the Markov chain has a stationary distribution, it may be hard to approximate quickly, because it will take a very long time for the chain to move from one set of states to another. For all we know, it could take only hours to reach a result that looks plausible, but years to reach a result that’s close to the actual stationary distribution. Returning to the Grande Jatte example, this would be a bit like getting a really clear picture of the trees in the upper-right-hand corner of the canvas and concluding that the rest must be a picture of a forest. The oddly conjoined and split topics that Schmidt and others have identified in their models seem a little less mysterious once you understand the quirks of Gibbs sampling.

Second, Gibbs sampling could be very useful for solving other kinds of problems. For some time now, I’ve had an eccentric obsession with encoding text into prime numbers and back into text again. The source of this obsession has to do with copyright law and some of the strange loopholes that the idea-expression dichotomy creates.7 I’m going to leave that somewhat mysterious for now, and jump to the point: part of my obsession has involved trying to figure out how to automatically break simple substitution cyphers. I’ve found that Gibbs sampling is surprisingly good at it. This is, I’ll admit, a somewhat peripheral concern. But I can’t get rid of the sense that there are other interesting things that Gibbs sampling could do that are more directly relevant to digital humanists. It’s a surprisingly powerful and flexible technique, and I think its power comes from that ability to take little bits of fragmentary information and assemble them into a gestalt.8

Third, I think Gibbs sampling is — or should be — theoretically interesting for humanists of all stripes. The theoretical vistas opened up by LDA are fairly narrow because there’s something a little bit single-purpose about it. Although it’s remarkably flexible in some ways, it makes strong assumptions about the structure of the data that it analyzes. Those assumptions limit its possible uses as a model for more speculative thinking. Gibbs sampling makes fewer such assumptions; or to be more precise, it accommodates a wider range of possible assumptions. MALLET is a tool for pounding, and it does a great job at it. But Gibbs sampling is more like the handle of a bit-driver. It’s only half-complete — assembly is required to get it to do something interesting — but it’s the foundation of a million different possible tools.

It’s the kind of tool a bricoleur ought to own.

1. If you know of new or notable entries that are missing, let me know and I’ll add them to a list here.
2. You can take a look here. Caveat emptor! I called it ldazy for a reason — it stands for “LDA implementation by someone who is too lazy” to make further improvements. It’s poorly-commented, inefficient, and bad at estimating hyperparameters. (At least it tries!) Its only strength is that it is short and written in pure Python, which means that its code is somewhat legible without additional comment.
3. After writing this, I did some Googling to see if anybody else had thought about Markov chains in terms of pointillism. I didn’t find anything that takes quite the same approach, but I did find an article describing a way to use Markov chains to model brushstrokes for the purpose of attribution!
4. In case you want to talk to math people about this, these conditions are respectively called “irreducibility” and “aperiodicity.”
5. Sorry, I couldn’t resist.
6. I’m risking just a bit of confusion by extending this analogy so far, because it’s tempting to liken colors to topics. But that’s not quite right. To perfect this analogy, expand the canvas into a three-dimensional space in which all green dots occupy one plane, all orange dots occupy another, and so on. In this scheme, the dots are only present or absent — they are themselves “colorless,” and only take on a color insofar as one of the dimensions is interpreted as a color dimension. And suppose the $x$, $y$, and $c$ variables can take values between 1 and 50. Now each dimension could just as easily represent a single word in a three-word corpus, and each dot in this three-dimensional space could represent a sequence of topic assignments for a fifty-topic model — with a value between 1 and 50 for each word in the corpus.
7. “In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work” 17 USC 102
8. For another way of thinking about the possibilities of Gibbs sampling and other so-called Markov chain Monte Carlo (MCMC) methods, see the wonderful sub-subroutine post on using MCMC to learn about bread prices during the Napoleonic Wars.

# A Sentimental Derivative

Ben Schmidt’s terrific insight into the assumptions that the Fourier transform imposes on sentiment data has been sinking in, and I have a left-field suggestion for anyone who cares to check it out. I plan to investigate it myself when I have the time, but I’ve decided to broadcast it now.

In the imaginary universe of Fourier land, all texts start and end at the same sentiment amplitude. This is clearly incorrect, as I see it.1 But what could we say about the beginning and end of texts that might hold up?

One possibility is that all texts might start and end with a flat sentiment curve. That is, at the very beginning and end of a text, we can assume that the valence of words won’t shift dramatically. That’s not clearly incorrect. I think it’s even plausible.

Now consider how we talk about plot most of the time: we speak of rising action (slope positive), falling action (slope negative), and climaxes (local and global maxima). That’s first derivative talk! And the first derivative of a flat curve is always zero. So if the first derivative of a sentiment curve always starts and ends at zero, then at least one objection to the Fourier transform approach can be worked around. For example, we could simply take the first finite difference of a text’s sentiment time series, perform a DFT and low-pass filter, do a reverse transform, and then do a cumulative sum (i.e. a discrete integration) of the result.2
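The pipeline just described can be sketched with NumPy. This is an untested idea rendered as code, not a validated method: the function names, the number of retained frequencies, and the synthetic “sentiment” series are all assumptions. A direct low-pass filter is included for contrast.

```python
import numpy as np

def direct_lowpass(sentiment, keep=3):
    """Low-pass filter the series itself."""
    spectrum = np.fft.rfft(sentiment)
    spectrum[keep:] = 0.0                          # drop high frequencies
    return np.fft.irfft(spectrum, n=len(sentiment))

def derivative_lowpass(sentiment, keep=3):
    """Low-pass filter the first difference, then integrate it back up."""
    sentiment = np.asarray(sentiment, dtype=float)
    diff = np.diff(sentiment)                      # first finite difference
    spectrum = np.fft.rfft(diff)                   # DFT of the differences
    spectrum[keep:] = 0.0                          # low-pass filter
    smooth_diff = np.fft.irfft(spectrum, n=len(diff))   # reverse transform
    # Cumulative sum (discrete integration), restoring the starting value.
    return np.concatenate(([sentiment[0]], sentiment[0] + np.cumsum(smooth_diff)))

# Synthetic series that ends well above where it starts, plus noise.
rng = np.random.default_rng(0)
raw = np.linspace(-1.0, 2.0, 200) + rng.normal(0.0, 0.5, 200)

direct = direct_lowpass(raw)
via_derivative = derivative_lowpass(raw)
```

In this sketch the direct filter pulls the two ends of the curve toward the same level, while the derivative version preserves the overall rise from first value to last, because the zero-frequency term of the differences is kept. What the filter now ties together at the two ends is the slope rather than the level.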

What would that look like?

1. Nonetheless, I think there’s some value to remaining agnostic about this for some time still — even now, after the dust has settled a bit.
2. You might be able to skip a step or two.

# What’s a Sine Wave of Sentiment?

Over the last month a fascinating series of debates has unfolded over Matt Jockers’ Syuzhet package. The debates have focused on whether Syuzhet’s smoothing strategy, which involves using a Fourier transform and low-pass filter, is appropriate. Annie Swafford has produced several compelling arguments that this strategy doesn’t work. And Ted Underwood has responded with what is probably the most accurate assessment of our current state of knowledge: we don’t know enough yet to say anything at all.

I have something to add to these debates, but I’ll begin by admitting that I haven’t used Syuzhet. I’m only just now starting to learn R. (I’ve been using Python / Numpy / Scipy / Pandas in my DH work so far.) My objection is not based on data or validation, statistical or otherwise. It’s based on a more theoretical line of reasoning.1

I broadly agree with Annie Swafford’s assessment: it looks to me like this strategy is producing unwanted ringing artifacts.2 But Matt Jockers’ counterargument troubles her line of reasoning — he suggests that ringing artifacts are what we want. That doesn’t sound right to me, but that argumentative move shows what’s really at stake here. The question is not whether ringing artifacts distort the data relative to some ground truth. There’s no doubt about that — this is, after all, a way of distorting rough data to make it smooth. The question is whether we want this particular kind of distortion. My issue with using Fourier transforms to represent sentiment time series data is that we have no clear theoretical justification to do so. We have no theoretical reason to want the kind of distortion it produces.
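For readers who haven’t met ringing artifacts before, here is a generic illustration on a made-up step signal. It does not use Syuzhet or reproduce its filter settings; the function name and the number of retained frequencies are arbitrary choices.

```python
import numpy as np

def lowpass(series, keep):
    """Keep only the lowest `keep` frequency bins of a real-valued series."""
    spectrum = np.fft.rfft(series)
    spectrum[keep:] = 0.0
    return np.fft.irfft(spectrum, n=len(series))

# A signal that sits at 0 for its first half and at 1 for its second half.
step = np.concatenate((np.zeros(100), np.ones(100)))
smoothed = lowpass(step, keep=5)

overshoot = smoothed.max() - 1.0     # how far the smooth curve rises above 1
undershoot = 0.0 - smoothed.min()    # how far it dips below 0
```

The original signal never leaves the range from 0 to 1, but the filtered curve does: it swings above 1 and below 0 in places where nothing in the data changed. Whether swings of that kind are wanted is exactly the question at issue.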

If we hope to use data mining tools to produce evidence, we need to think about ways to model data that are suited to our own fields. This is a point Ted Underwood made early on in the conversation about LDA, well before much had been published by literary scholars on the subject. The point he made is as important now as then: we should do our best to ensure that the mathematical models we use have clear and concrete interpretations in terms of the physical processes that we study and seek to understand: reading, writing, textual distribution, influence, and so on. This is what Syuzhet fails to do at the smoothing and filtering stage right now. I think the overall program of Syuzhet is a promising one (though there may be other important aspects of the thing-that-is-not-fabula that it ignores). But I think the choice of Fourier analysis for smoothing is not a good choice — not yet.

A Fourier transform models time series data as a weighted sum of sine waves of different frequencies. But I can think of no compelling reason to represent a sequence of sentiment measurements as a sum of sine waves. Consider LDA as a point of comparison (as Jockers has). There’s a clear line of reasoning that supports our using the Dirichlet distribution as a prior. One could certainly argue that the Dirichlet density has the wrong shape, but its support (the set of values over which it is defined) has the right shape.3 It’s a set of N distinct, non-negative real-valued variables that always sum to one. (In other words, it’s a distribution over the ways to break a stick into N parts.) Since we have good reasons to think about language as being made of distinct words, thinking in terms of categorical probability distributions over those words makes sense. And the Dirichlet distribution is a reasonable prior for categorical distributions because its support consists entirely of categorical probability distributions (ways to break a stick). Even if we were to decide that we need a different prior distribution, we would probably still choose a distribution defined over the same support.4
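Here is a small sketch of the "broken stick" idea, using only Python's standard library. The four-word vocabulary, the concentration value of 0.5, and the helper names are assumptions for illustration, and this is not how any particular LDA implementation is written. A Dirichlet draw (built here by normalizing gamma draws) always yields non-negative pieces that sum to one, so it can be handed directly to a categorical sampler.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one point from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(stick, rng):
    """Pick an index with probability equal to the length of its piece of the stick."""
    return rng.choices(range(len(stick)), weights=stick, k=1)[0]

rng = random.Random(0)
vocabulary = ["whale", "ship", "sea", "captain"]  # hypothetical toy vocabulary

# One "topic": a way of breaking a stick into len(vocabulary) pieces.
topic = sample_dirichlet([0.5] * len(vocabulary), rng)

# Words drawn from that topic, one categorical sample at a time.
words = [vocabulary[sample_categorical(topic, rng)] for _ in range(10)]

print("pieces of the stick:", topic)
print("sum of the pieces:  ", sum(topic))
print("sampled words:      ", words)
```

Whatever the concentration parameters, the output of `sample_dirichlet` is always a valid input to `sample_categorical`. That is the sense in which the support has the right shape, even if the density over it does not.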

But the support of the function produced by a Fourier transform is the frequency domain of a sinusoidal curve. Is that the right support for this purpose? Setting aside the fact that we’re no longer talking about a probability distribution, I think it’s still important to ask that question. If we could have confidence that it makes sense to represent a sentiment time series as a sum of sinusoidal curves, then we might be able to get somewhere with this approach. The support would be correct, even if the shape of the curve over the frequency domain weren’t quite right. But why should we accept that? Why shouldn’t we be looking at functions over domains of wavelets or chirplets or any number of other possibilities? Why should the sentimental valence of the words in a novel be best represented by sine waves in particular?

I think this is a bit like using a Gaussian mixture model (GMM) to do topic modeling. You can use Gaussian distributions as priors for topic models. It might even be a good idea to do so, because it could allow us to get good results faster. But it’s not going to help us understand how topic modeling works in the first place. The Gaussian prior obscures what’s really going on under the hood.5 Even if we all moved over to Gaussian priors in our topic models, we’d probably still use classic LDA to get a handle on the algorithm. In this case, I think the GMM is best understood as a way to approximate LDA.

And now, notice that we can use a Fourier transform to approximate any function at all. But what does doing so tell us about the function? Does it tell us what we want to know? I have no idea in this case, and I don’t think anyone else does either. It’s anyone’s guess whether the sine waves that this transform uses will correspond to anything at all familiar to us.

I think this is a crucial issue, and it’s one we can frame in terms of disciplinary continuity. Whenever we do any kind of informal reasoning based on word counts alone, we’re essentially thinking in terms of categorical distributions. And I think literary scholars would have paid attention to a well-reasoned argument based on word counts thirty years ago. If that’s right, then LDA simply gives us a way to accelerate modes of reasoning about language that we already understand. But if thirty years ago someone had argued that the movement of sentiment in a novel should be understood through sinusoidal shapes, I don’t think they would have been taken very seriously.

Admittedly, I have no strong justification for this claim, and if there’s widespread disagreement about it, then this debate will probably continue for some time. But either way, we need to start thinking very concretely about what it means to represent sentiment specifically as a sine wave. We will then be able to trust our intuitions about our own field of study to guide us.

1. This means that to a certain degree, I’m not taking Syuzhet in the spirit with which it was offered. Jockers writes that his “primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories.” I’ve not done that, even though I think it’s a sound method. We can’t depend only on statistical measurements; our conclusions need intuitive support. But I also think the theoretical questions I’m asking here are necessary to build that kind of support.
2. I suspected as much the moment I read about the package, though I’m certain I couldn’t have articulated my concerns without Swafford’s help. Update: And I hope it’s clear to everyone, now that the dust has settled, that Swafford has principal investigator status in this case. If she hadn’t started it, the conversation probably wouldn’t have happened at all.
3. The support of a function is the set of inputs that it maps to nonzero outputs.
4. The logic of this argument is closely related to the theory of types in computer programming. One could say that a categorical sampling algorithm accepts variables of the “broken stick” type and samples from them; and one could say that when we sample from a Dirichlet distribution, the output is a variable of the “broken stick” type.
5. The truth of this is strongly suggested to me by the fact that the above cited paper on GMM-based topic modeling initially proposes a model based on “cut points” — a move I will admit that I understand only in vague terms as a way of getting discrete output from a continuous function. That paper looks to me like an attempt to use a vector space model for topic modeling. But as I’ll discuss in a later post, I don’t find vector space models of language especially compelling because I can’t develop a concrete interpretation of them in terms of authors, texts, and readers.

# A Random Entry

There’s a way of telling a history of the digital humanities that does not follow the well-known trajectory from Father Busa’s Index Thomisticus, Mosteller and Wallace’s study of the Federalist Papers, and the Text Encoding Initiative to Distant Reading, data mining, and the present day. It does not describe the slow transformation of a once-peripheral field into an increasingly mainstream one. Instead, it describes a series of missed opportunities.

It’s a polemical history that inverts many unspoken assumptions about the relationship between the humanities and the sciences. I’m not sure I entirely believe it myself. But I think it’s worth telling.

It starts like this: there once was a guy named Frank Rosenblatt. In 1957, Rosenblatt created the design for a device he called the perceptron. It was an early attempt at simulating the behavior of networks of biological neurons, and it initially provoked a frenzy of interest, including the following New York Times report:

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

Needless to say, the perceptron never managed to do any of those things. But what it did do was exceptional in its own way. It used to great practical effect the following insight: that many small, inaccurate rules can be combined in simple ways to create a larger, more accurate rule. This insight is now central to statistical learning theory.1 But it was not understood as particularly important at the time.
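A toy simulation can illustrate that insight. This is not Rosenblatt's learning procedure; it is an equal-weight majority vote, and it assumes (unrealistically) that each small rule errs independently of the others. The 60% accuracy, the 101 rules, and the function names are all hypothetical choices.

```python
import random

def weak_rule(truth, accuracy, rng):
    """Return the true label with probability `accuracy`, otherwise the wrong one."""
    return truth if rng.random() < accuracy else -truth

def majority_vote(votes):
    """Combine +1/-1 votes with equal weights and a threshold at zero."""
    return 1 if sum(votes) > 0 else -1

rng = random.Random(42)
n_rules, n_trials, accuracy = 101, 2000, 0.6  # odd n_rules, so votes never tie

single_correct = 0
combined_correct = 0
for _ in range(n_trials):
    truth = rng.choice([-1, 1])
    votes = [weak_rule(truth, accuracy, rng) for _ in range(n_rules)]
    single_correct += votes[0] == truth                 # one rule on its own
    combined_correct += majority_vote(votes) == truth   # all rules together

single_rate = single_correct / n_trials
combined_rate = combined_correct / n_trials
print("one rule alone:   ", single_rate)
print("101 rules, voting:", combined_rate)
```

Each rule on its own is barely better than a coin flip, but the vote is right far more often. Real rules are rarely independent, so the gain in practice is smaller, but the direction of the effect is the same.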

In fact, when people began to realize that the perceptron in its simplest form was limited, a backlash ensued. Marvin Minsky and Seymour Papert wrote a book called Perceptrons that enumerated the limits of simple two-layer perceptrons2; people misunderstood the book’s arguments as applying to all neural networks; and the early promise of perceptrons was forgotten.
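The limit in question is easy to reproduce. The sketch below trains a single threshold unit with the classic perceptron update rule; the function names, the 100-epoch cap, and the zero initial weights are assumptions for illustration. It learns logical AND, which a straight line can separate, but can never get all four XOR cases right.

```python
def train_perceptron(examples, epochs=100):
    """Classic perceptron rule on one threshold unit with two inputs and a bias."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in examples:
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - predicted
            if error:
                # Nudge the weights toward the correct answer.
                w[0] += error * x1
                w[1] += error * x2
                b += error
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: training has converged
            break
    return w, b

def accuracy(examples, w, b):
    """Fraction of examples the trained unit classifies correctly."""
    hits = sum((1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == target
               for (x1, x2), target in examples)
    return hits / len(examples)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, *train_perceptron(AND))
xor_accuracy = accuracy(XOR, *train_perceptron(XOR))
print("AND accuracy:", and_accuracy)
print("XOR accuracy:", xor_accuracy)
```

No amount of extra training fixes the XOR case, because no single line separates its two classes. Adding a hidden layer does, which is the part of the story that the backlash overlooked.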

This turn of events may have delayed the emergence of interesting machine learning technologies by a couple of decades. After the perceptron backlash, artificial intelligence researchers focused on using logic to model thought — creating ever more complex sets of logical rules that could be combined to generate new rules within a unified and coherent system of concepts. This approach was closely related to the kinds of transformational grammars that Noam Chomsky has been exploring since the 1950s, and it largely displaced statistical approaches — with a few exceptions — until the 1990s.

Unsurprisingly, Chomsky remains hostile to statistical and probabilistic approaches to machine learning and artificial intelligence. Nonetheless, there does seem to be some evidence that those approaches have gotten something right. Peter Norvig offers the following summary:

Chomsky said words to the effect that statistical language models have had some limited success in some application areas. Let’s look at computer systems that deal with language, and at the notion of “success” defined by “making accurate predictions about the world.” First, the major application areas:

• Search engines: 100% of major players are trained and probabilistic. Their operation cannot be described by a simple function.
• Speech recognition: 100% of major systems are trained and probabilistic…
• Machine translation: 100% of top competitors in competitions such as NIST use statistical methods…
• Question answering: this application is less well-developed, and many systems build heavily on the statistical and probabilistic approach used by search engines…

Now let’s look at some components that are of interest only to the computational linguist, not to the end user:

• Word sense disambiguation: 100% of top competitors at the SemEval-2 competition used statistical techniques; most are probabilistic…
• Coreference resolution: The majority of current systems are statistical…
• Part of speech tagging: Most current systems are statistical…
• Parsing: There are many parsing systems, using multiple approaches. Almost all of the most successful are statistical, and the majority are probabilistic…

Clearly, it is inaccurate to say that statistical models (and probabilistic models) have achieved limited success; rather they have achieved a dominant (although not exclusive) position.

In the past fifteen years, these approaches to machine learning have produced a number of substantial leaps forward — consider Google’s famous creation of a neural network that (in at least some sense) reinvented the concept of “cat,” or this recurrent neural network capable of imitating various styles of human handwriting. These extraordinary successes have been made possible by a dramatic increase in computing power. But without an equally dramatic shift in ways of thinking about what constitutes knowledge, that increase in computing power would have accomplished far less. What has changed is that the people doing the math have stopped trying to find logical models of knowledge by hand, and have started trying to find probabilistic models of knowledge — models that embrace heterogeneity, invite contradiction, and tolerate or even seek out ambiguity and uncertainty. As machine learning researchers have discovered, the forms these models take can be defined with mathematical precision, but the models themselves tolerate inconsistencies in ways that appear to be unbound by rigid logic.3

I’d like to suggest that by embracing that kind of knowledge, computer scientists have started walking down a trail that humanists were blazing fifty years ago.

The kind of knowledge that these machines have does not take the form of a rich, highly structured network of immutable concepts and relations with precise and predictable definitions. It takes the form of a loose assembly of inconsistent and mutually incompatible half-truths, always open to revision and transformation, and definable only by the particular distinctions it can make or elide at any given moment. It’s the kind of knowledge that many literary scholars and humanists have found quite interesting for the last few decades.

Since the decline of structuralism, humanists have been driven by a conviction that the loosely or multiply structured behaviors that constitute human culture produce important knowledge that cannot be produced in more structured ways. Those humanities scholars who remained interested in structured ways of producing knowledge — like many of the early practitioners of humanities computing — were often excluded from conversations in the humanistic mainstream.

Now something has changed. The change has certainly brought computational methods closer to the mainstream of the humanities. But we mustn’t mistake the change by imagining that humanists have somehow adopted a new scientism. A better explanation of this change is that computer scientists, as they have learned to embrace the kinds of knowledge produced by randomness, have reached a belated understanding of the value of — dare I say it? — post-structuralist ways of knowing.

It’s a shame it didn’t happen earlier.

1. I first encountered the above formulation of this idea in the first video in Geoffrey Hinton’s online course on neural networks. But you can see it being used by other researchers (p. 21) working in machine learning on a regular basis.
2. In machine learning lingo, the simple perceptron could not learn nonlinear decision boundaries. It didn’t even have the ability to calculate the logical XOR operation on two inputs, which at the time probably made logic-based approaches look far more promising.
3. I say “appear” because it’s not entirely clear what it would mean to be unbound by rigid logic. The mathematical formulation of machine learning models is itself perfectly strict and internally consistent, and if it weren’t, it would be irreparably broken. Why don’t the statistical models represented by that formulation break in the same way? I suspect that it has something to do with the curse (or rather, the blessing) of dimensionality. They don’t break because every time a contradiction appears, a new dimension appears to accommodate it in an ad-hoc way, at least until the model’s “capacity” for such adjustments is exhausted. I’m afraid I’m venturing a bit beyond my existential pay grade with these questions, but I hope this sliver of uncertainty doesn’t pierce to the core of my argument.