A Sentimental Derivative

Ben Schmidt’s terrific insight into the assumptions that the Fourier transform imposes on sentiment data has been sinking in, and I have a left-field suggestion for anyone who cares to check it out. I plan to investigate it myself when I have the time, but I’ve decided to broadcast it now.

In the imaginary universe of Fourier land, all texts start and end at the same sentiment amplitude. This is clearly incorrect, as I see it.[1] But what could we say about the beginning and end of texts that might hold up?
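To see the Fourier-land assumption concretely, here is a minimal sketch in Python with NumPy. The `fourier_lowpass` helper, the cutoff of 4 components, and the steadily rising “text” are all illustrative assumptions, not Syuzhet’s code or settings. Because the DFT treats the series as one period of a repeating signal, the smoothed curve is pulled back toward its starting value at the end, even though the raw series ends far from where it began.

```python
import numpy as np

def fourier_lowpass(series, keep):
    """Keep only the `keep` lowest-frequency DFT components of a real series."""
    spectrum = np.fft.rfft(series)
    spectrum[keep:] = 0  # discard every component above the cutoff
    return np.fft.irfft(spectrum, n=len(series))

# Hypothetical "text" whose sentiment rises steadily from -1 to +1.
ramp = np.linspace(-1.0, 1.0, 200)
smoothed = fourier_lowpass(ramp, keep=4)

# Distance between the first and last values, before and after smoothing.
gap_raw = abs(ramp[-1] - ramp[0])
gap_smoothed = abs(smoothed[-1] - smoothed[0])
print(gap_raw, gap_smoothed)
```

With these made-up inputs the raw endpoints are 2.0 apart, while the smoothed endpoints nearly coincide.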

One possibility is that all texts might start and end with a flat sentiment curve. That is, at the very beginning and end of a text, we can assume that the valence of words won’t shift dramatically. That’s not clearly incorrect. I think it’s even plausible.

Now consider how we talk about plot most of the time: we speak of rising action (slope positive), falling action (slope negative), and climaxes (local and global maxima). That’s first derivative talk! And the first derivative of a flat curve is always zero. So if the first derivative of a sentiment curve always starts and ends at zero, then at least one objection to the Fourier transform approach can be worked around. For example, we could simply take the first finite difference of a text’s sentiment time series, perform a DFT and low-pass filter, do an inverse transform, and then do a cumulative sum (i.e. a discrete integration) of the result.[2]
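The steps above can be sketched in a few lines of Python with NumPy. This is only one reading of the proposal: the function name `smooth_via_derivative` and the cutoff `keep` are hypothetical, and the toy series is random noise on a ramp, not real sentiment data.

```python
import numpy as np

def smooth_via_derivative(sentiment, keep=4):
    """Difference, low-pass filter in the frequency domain, then re-integrate."""
    sentiment = np.asarray(sentiment, dtype=float)
    diffs = np.diff(sentiment)                # first finite difference
    spectrum = np.fft.rfft(diffs)             # DFT of the differences
    spectrum[keep:] = 0                       # low-pass; keep >= 1 retains the mean slope
    smooth_diffs = np.fft.irfft(spectrum, n=len(diffs))  # inverse transform
    # Cumulative sum (discrete integration), restarted from the original first value.
    return sentiment[0] + np.concatenate(([0.0], np.cumsum(smooth_diffs)))

# Toy input: a rising trend plus noise (an assumption for illustration only).
rng = np.random.default_rng(0)
raw = np.linspace(-1.0, 1.0, 200) + 0.3 * rng.standard_normal(200)
smoothed = smooth_via_derivative(raw, keep=4)
print(raw[0], smoothed[0], raw[-1], smoothed[-1])
```

Because the zero-frequency component of the differences survives the filter, the cumulative sum lands on the original final value, so the smoothed curve is free to start and end at different amplitudes. What the filter now treats as periodic is the slope, which is assumed to match at both ends, and not the sentiment itself.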

What would that look like?

  1. Nonetheless, I think there’s some value to remaining agnostic about this for some time still — even now, after the dust has settled a bit. 
  2. You might be able to skip a step or two. 

What’s a Sine Wave of Sentiment?

Over the last month a fascinating series of debates has unfolded over Matt Jockers’ Syuzhet package. The debates have focused on whether Syuzhet’s smoothing strategy, which involves using a Fourier transform and low-pass filter, is appropriate. Annie Swafford has produced several compelling arguments that this strategy doesn’t work. And Ted Underwood has responded with what is probably the most accurate assessment of our current state of knowledge: we don’t know enough yet to say anything at all.

I have something to add to these debates, but I’ll begin by admitting that I haven’t used Syuzhet. I’m only just now starting to learn R. (I’ve been using Python / Numpy / Scipy / Pandas in my DH work so far.) My objection is not based on data or validation, statistical or otherwise. It’s based on a more theoretical line of reasoning.[1]

I broadly agree with Annie Swafford’s assessment: it looks to me like this strategy is producing unwanted ringing artifacts.[2] But Matt Jockers’ counterargument troubles her line of reasoning: he suggests that ringing artifacts are what we want. That doesn’t sound right to me, but that argumentative move shows what’s really at stake here. The question is not whether ringing artifacts distort the data relative to some ground truth. There’s no doubt about that; this is, after all, a way of distorting rough data to make it smooth. The question is whether we want this particular kind of distortion. My issue with using Fourier transforms to represent sentiment time series data is that we have no clear theoretical justification to do so. We have no theoretical reason to want the kind of distortion it produces.
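Ringing is easy to reproduce on a toy example. The sketch below, in Python with NumPy, low-pass filters an abrupt step; the step series and the cutoff of 10 components are assumptions chosen for illustration and say nothing about any particular novel.

```python
import numpy as np

# Hypothetical series: sentiment sits at -1, then jumps abruptly to +1.
step = np.concatenate([np.full(100, -1.0), np.full(100, 1.0)])

# Low-pass filter: keep only the 10 lowest-frequency DFT components.
spectrum = np.fft.rfft(step)
spectrum[10:] = 0
smoothed = np.fft.irfft(spectrum, n=len(step))

# How far the smoothed curve rises above, and dips below, anything in the input.
overshoot = smoothed.max() - step.max()
undershoot = step.min() - smoothed.min()
print(overshoot, undershoot)
```

The smoothed curve overshoots the plateau on both sides of the jump, producing peaks and dips that are not in the input. Whether that counts as a defect or a feature is exactly the question at issue.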

If we hope to use data mining tools to produce evidence, we need to think about ways to model data that are suited to our own fields. This is a point Ted Underwood made early on in the conversation about LDA, well before much had been published by literary scholars on the subject. The point he made is as important now as then: we should do our best to ensure that the mathematical models we use have clear and concrete interpretations in terms of the physical processes that we study and seek to understand: reading, writing, textual distribution, influence, and so on. This is what Syuzhet fails to do at the smoothing and filtering stage right now. I think the overall program of Syuzhet is a promising one (though there may be other important aspects of the thing-that-is-not-fabula that it ignores). But I think the choice of Fourier analysis for smoothing is not a good choice — not yet.

Figure: The Dirichlet distribution for three categories is defined over values of X, Y, and Z adding up to 1.

A Fourier transform models time series data as a weighted sum of sine waves of different frequencies. But I can think of no compelling reason to represent a sequence of sentiment measurements as a sum of sine waves. Consider LDA as a point of comparison (as Jockers has). There’s a clear line of reasoning that supports our using the Dirichlet distribution as a prior. One could certainly argue that the Dirichlet density has the wrong shape, but its support (the set of values over which it is defined) has the right shape.[3] It’s a set of N nonnegative real-valued variables that always sum to one. (In other words, it’s a distribution over the ways to break a stick into N parts.) Since we have good reasons to think about language as being made of distinct words, thinking in terms of categorical probability distributions over those words makes sense. And the Dirichlet distribution is a reasonable prior for categorical distributions because its support consists entirely of categorical probability distributions (ways to break a stick). Even if we were to decide that we need a different prior distribution, we would probably still choose a distribution defined over the same support.[4]
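The “broken stick” picture can be checked directly. Here is a minimal sketch in Python with NumPy; the three-word vocabulary and the symmetric concentration parameters are made up for illustration and are not drawn from any real topic model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical three-word vocabulary and symmetric concentration parameters.
vocab = ["joy", "grief", "calm"]
alpha = np.ones(len(vocab))

# Each row is one draw from the Dirichlet: one way to break a stick into 3 parts.
sticks = rng.dirichlet(alpha, size=5)
print(sticks.sum(axis=1))  # every row sums to 1, up to floating-point error

# Because each row is a categorical distribution, we can sample words from it.
words = [vocab[rng.choice(len(vocab), p=p)] for p in sticks]
print(words)
```

Every draw is nonnegative and sums to one, so it can be handed straight to a categorical sampler. That is the sense in which the support has the right shape, whatever one thinks of the density.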

Figure: Wavelets and chirplets (via Wikimedia Commons)

But the support of the function produced by a Fourier transform is the frequency domain of a sinusoidal curve. Is that the right support for this purpose? Setting aside the fact that we’re no longer talking about a probability distribution, I think it’s still important to ask that question. If we could have confidence that it makes sense to represent a sentiment time series as a sum of sinusoidal curves, then we might be able to get somewhere with this approach. The support would be correct, even if the shape of the curve over the frequency domain weren’t quite right. But why should we accept that? Why shouldn’t we be looking at functions over domains of wavelets or chirplets or any number of other possibilities? Why should the sentimental valence of the words in a novel be best represented by sine waves in particular?

I think this is a bit like using a Gaussian mixture model (GMM) to do topic modeling. You can use Gaussian distributions as priors for topic models. It might even be a good idea to do so, because it could allow us to get good results faster. But it’s not going to help us understand how topic modeling works in the first place. The Gaussian prior obscures what’s really going on under the hood.[5] Even if we all moved over to Gaussian priors in our topic models, we’d probably still use classic LDA to get a handle on the algorithm. In this case, I think the GMM is best understood as a way to approximate LDA.

Figure: Chirplets are good at modeling objects in perspective (via Wikimedia Commons)

And now, notice that we can use a Fourier transform to approximate any function at all. But what does doing so tell us about the function? Does it tell us what we want to know? I have no idea in this case, and I don’t think anyone else does either. It’s anyone’s guess whether the sine waves that this transform uses will correspond to anything at all familiar to us.
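For a finite series this point can be made exact. The sketch below, in Python with NumPy, rebuilds a series of pure random noise as an explicit weighted sum of sinusoids, using the amplitudes and phases that the DFT reports. The noise series is an assumption chosen precisely because it has no structure to find.

```python
import numpy as np

rng = np.random.default_rng(1)
noise = rng.standard_normal(128)  # structureless input, for illustration only
N = len(noise)
n = np.arange(N)

coeffs = np.fft.rfft(noise)

# Start from the mean, then add one cosine per frequency.
rebuilt = np.full(N, coeffs[0].real / N)
for k in range(1, len(coeffs)):
    # Interior frequencies occur twice in the full spectrum (k and N - k);
    # for even N, the Nyquist frequency N/2 occurs only once.
    weight = 1.0 if (N % 2 == 0 and k == N // 2) else 2.0
    amplitude = weight * np.abs(coeffs[k]) / N
    phase = np.angle(coeffs[k])
    rebuilt += amplitude * np.cos(2 * np.pi * k * n / N + phase)

max_error = np.max(np.abs(rebuilt - noise))
print(max_error)
```

The reconstruction matches the noise to within floating-point error. A perfect fit by sinusoids is therefore guaranteed for any series and, on its own, is no evidence that the sinusoids mean anything.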

I think this is a crucial issue, and it’s one we can frame in terms of disciplinary continuity. Whenever we do any kind of informal reasoning based on word counts alone, we’re essentially thinking in terms of categorical distributions. And I think literary scholars would have paid attention to a well-reasoned argument based on word counts thirty years ago. If that’s right, then LDA simply gives us a way to accelerate modes of reasoning about language that we already understand. But if thirty years ago someone had argued that the movement of sentiment in a novel should be understood through sinusoidal shapes, I don’t think they would have been taken very seriously.

Admittedly, I have no strong justification for this claim, and if there’s widespread disagreement about it, then this debate will probably continue for some time. But either way, we need to start thinking very concretely about what it means to represent sentiment specifically as a sine wave. We will then be able to trust our intuitions about our own field of study to guide us.

  1. This means that to a certain degree, I’m not taking Syuzhet in the spirit with which it was offered. Jockers writes that his “primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories.” I’ve not done that, even though I think it’s a sound method. We can’t depend only on statistical measurements; our conclusions need intuitive support. But I also think the theoretical questions I’m asking here are necessary to build that kind of support. 
  2. I suspected as much the moment I read about the package, though I’m certain I couldn’t have articulated my concerns without Swafford’s help. Update: And I hope it’s clear to everyone, now that the dust has settled, that Swafford has principal investigator status in this case. If she hadn’t started it, the conversation probably wouldn’t have happened at all.
  3. The support of a function is the set of inputs that it maps to nonzero outputs. 
  4. The logic of this argument is closely related to the theory of types in computer programming. One could say that a categorical sampling algorithm accepts variables of the “broken stick” type and samples from them; and one could say that when we sample from a Dirichlet distribution, the output is a variable of the “broken stick” type. 
  5. The truth of this is strongly suggested to me by the fact that the above cited paper on GMM-based topic modeling initially proposes a model based on “cut points” — a move I will admit that I understand only in vague terms as a way of getting discrete output from a continuous function. That paper looks to me like an attempt to use a vector space model for topic modeling. But as I’ll discuss in a later post, I don’t find vector space models of language especially compelling because I can’t develop a concrete interpretation of them in terms of authors, texts, and readers.