Theory, of both the literary and mathematical varieties.

Brownian Noise and Plot Arcs

A plot of Brownian noise.
A tenth of a second of Brownian noise in PCM format. Large samples of Brownian noise give results similar to those reported by Jockers and Reagan et. al.

A couple of months ago, a research group released a paper on the ArXiv titled “The emotional arcs of stories are dominated by six basic shapes.” In it, they replicate results similar to those first described by Matt Jockers, using a somewhat different technique.

I’ve written a jupyter notebook that raises doubts about their argument. They claim that their work has shown that there are “basic shapes” that dominate human stories, but the results they’ve described provide no basis for such generalizations. Given what we know so far, it’s much more likely that the emotional arcs that these techniques reveal are, in general, noise. The notebook is available for examination and reuse as a github repository.

What does it mean to say that these plot shapes are “noise”? The notebook linked above focuses on technical issues, but I want to write a few brief words here about the broader implications my argument has — if it’s correct. Initially, it may seem that if sentiment data is “noise,” then these measurements must be entirely meaningless. And yet many researchers have now done at least some preliminary work validating these measurements against responses from human readers. Jockers’ most recent work shows fairly strong correlations between human sentiment assessments and those produced by his Syuzhet package. If these sentiment measurements are meaningless, does that mean that human assessments are meaningless as well?

That conclusion does not sit well with me, and I think it is based on an incorrect understanding of the relationship between noise and meaning. In fact, according to one point of view, the most “meaningful” data is precisely the most random data, since maximally random data is in some sense maximally dense with information — provided one can extract the information in a coherent way. Should we find that sentiment data from novels does indeed amount to “mere noise,” literary critics will have some very difficult questions to ask themselves about the conditions under which noise signifies.

The Radical Potential of RDF Dimension Reduction?

Note (2016-06-03): This revises an earlier post by removing some dated information and expanding the conclusion.

This brief post was inspired by the abstract for a talk by Hanna Wallach at the fourth IPAM Culture Analytics workshop. I didn’t even attend it, so take the first part of this with a grain of salt! In her talk, Wallach discussed a new Bayesian dimension-reduction technique that operates over timestamped triples. In other words — as she and her co-authors put it in the paper summarizing their research — “records of the form ‘country i took action a toward country j at time t’ — known as dyadic events.”

The abstract as well as the paper framed this as a way of analyzing international relations. But what struck me immediately about this model is that it works with data that could be represented very naturally as RDF triples (if you add in a timestamp, that is). That means that this method might be able to do for RDF triples what topic modeling does for texts.

This probably seems like an odd thing to care about to people who haven’t read Miriam Posner‘s keynote on the radical potential of DH together with Matthew Lincoln‘s response. In her keynote, Posner poses a question to practitioners of DH: why do we so willingly accept data models that rely on simplistic categories? She observes, for example, that the Getty’s Union List of Artist Names relies on a purely binary model of gender. But this is strangely regressive in the wider context of the humanities:

no self-respecting humanities scholar would ever get away with such a crude representation of gender… So why do we allow widely shared, important databases like ULAN to deal so naively with identity?”

She elaborates on this point using the example of context-dependent racial categories:

a useful data model for race would have to be time- and place-dependent, so that as a person moved from Brazil to the United States, she might move from white to black. Or perhaps the categories themselves would be time- and place-dependent, so that certain categories would edge into whiteness over time. Or! Perhaps you could contrast the racial makeup of a place as the Census understands it with the way it’s articulated by the people who live there.

Matt Lincoln’s brilliant response takes this idea and gives it a concrete computational structure: RDF. Rather than having fixed categories of race, we can represent multiple different conceptualizations of race within the same data structure. The records of these conceptualizations take the form of {Subject, Verb, Object} triples, which can then form a network:

A diagram of a network of perceived racial categories.

Given that Posner’s initial model included time as well, adding timestamps to these verbs seems natural, even if it’s not, strictly speaking, included in the RDF standard. (Or is it? I don’t know RDF that well!) But once we have actors, timestamped verbs, and objects, then I think we can probably use this new dimension reduction technique on networks of this kind.1

What would be the result? Think about what topic modeling does with language: it finds clusters of words that appear together in ways that seem coherent to human readers. But it does so in a way that is not predictable from the outset; it produces different clusters for different sets of texts, and those differences are what make it so valuable. They allow us to pick out the most salient concepts and discourses within a particular corpus, which might differ case-by-case. This technique appears to do the very same thing, but with relationships between groups of people over time. We might be able to capture local variations in models of identity within different communities.

I am not entirely certain that this would work, and I’d love to hear any feedback about difficulties that this approach might face! I also doubt I’ll get around to exploring the possibilities more thoroughly right now. But I would really like to see more humanists actively seeking out collaborators in statistics and computer science to work on projects like this. We have an opportunity in the next decade to actively influence research in those fields, which will have widespread influence in turn over political structures in the future. By abstaining from that kind of collaboration, we are reinforcing existing power structures. Let’s not pretend otherwise.

  1. With a few small adjustments. Since actors and objects are of the same kind in the model, the verbs would need to have a slightly different structure — possibly linking individuals through identity perceptions or acts of self-identification. 

Neoliberal Citations (and Singularities)

A chart showing the use of the word postmodernism over time.
Ever wonder where “postmodernism” went? I suspect “digital humanities” is headed there too. (Both Google Ngrams and COCA show the same pattern. COCA even lets you limit your search to academic prose!)

My first impulse upon reading last week’s essay in the LA Review of Books was to pay no attention. Nobody I know especially likes the name “digital humanities.” Many people are already adopting backlash-avoidant stances against it. “‘Digital Humanities’ means nothing” (Moretti); “I avoid the phrase when I can” (Underwood). As far as I can tell, the main advantage of “digital humanities” has been that it sounds better than “humanities computing.” Is it really worth defending? It’s an umbrella term, and umbrella terms are fairly easy to jettison when they become the targets of umbrella critiques.

Still, we don’t have a replacement for it. I hope we’ll find in a few years that we don’t need one. In the meanwhile we’re stuck in terminological limbo, which is all the more reason to walk away from debates like this. Daniel Allington, Sarah Brouillette, and David Golumbia (ABG hereafter) have not really written an essay about the digital humanities, because no single essay could ever be about something so broadly defined.

That’s what I told myself last week. But something about the piece has continued to nag at me. To figure out what it was, I did what any self-respecting neoliberal apologist would do: I created a dataset.

A number of responses to the essay have discussed its emphasis on scholars associated with the University of Virginia (Alan Jacobs), on its focus on English departments (Brian Greenspan), and on its strangely dismissive attitude towards collaboration and librarianship (Stewart Varner, Ted Underwood). Building on comments from Schuyler Esprit, Roopika Risam has drawn attention to a range of other occlusions — of scholars of color, scholars outside the US, and scholars at regional institutions — that obscure the very kinds of work that ABG want to see more of. To get an overview of these occlusions, I created a list of all the scholars ABG mention in the course of their essay, along with the scholars’ gender, field, last degree or first academic publication, year of degree or publication, and granting institution or current affiliation.1

The list isn’t long, and you might ask what the point is of creating such a dataset, given that we can all just — you know — read the article. But I found that in addition to supporting all the critiques described above, this list reveals another occlusion: ABG almost entirely ignore early-career humanists. With the welcome exception of Miriam Posner, they do not cite a single humanities scholar who received a PhD after 2004. Instead, they cite three scholars trained in the sciences:

Two, Erez Aiden and Jean-Baptiste Michel, are biostatisticians who were involved with the “culturomics” paper that — well, let’s just say it has some problems. The other, Michael Dalvean, is a political scientist who seems to claim that when used to train a regression algorithm, the subjective beliefs of anthology editors suddenly become objective facts about poetic value.2 Are these really the most representative examples of DH work by scholars entering the field?

I’m still not entirely certain what to make of this final occlusion. Given the polemical character of their essay, I’m not surprised that ABG emphasize scholars from prestigious universities, and that, given the position of those scholars, most of them wind up being white men. ABG offer a rationale for their exclusions:

Exceptions too easily function as alibis. “Look, not everyone committed to Digital Humanities is a white man.” “Look, there are Digital Humanities projects committed to politically engaged scholarly methods and questions.” We are not negating the value of these exceptions when we ask: What is the dominant current supported even by the invocation of these exceptions?

I disagree with their strategy, because I don’t think invoking exceptions inevitably supports an exclusionary mainstream. But I see the logic behind it.

When it comes to early-career scholars, I no longer see the logic. Among all the humanities scholars who received a PhD in the last ten years, I would expect there to be representatives of the dominant current. I would also expect them to feature prominently in an essay like this, since they would be likely to play important roles directing that current in the future. The fact that they are almost entirely absent casts some doubt on one of the essay’s central arguments. Where there are only exceptions, no dominant current exists.

I share with ABG the institutional concerns they discuss in their essay. I do not believe that all value can be reduced to monetary value, and I am not interested in the digital humanities because it increases ROI. Universities and colleges are changing in ways that worry me. I just think those changes have little to do with technology in particular — they are fundamentally social and political changes. Reading the essay with that in mind, the absence of early-career humanists looks like a symptom of a more global problem. Let’s provisionally accept the limited genealogy that ABG offer, despite all it leaves out. Should we then assume that the field’s future will follow a linear trajectory determined only by its past? That a field that used to create neoliberal tools will mechanically continue to do so in spite of all efforts to the contrary? That would be a terrible failure of imagination.

In 1993, Vernor Vinge wrote a short paper called “The Coming Technological Singularity,” which argued that the rate of technological change will eventually outstrip our ability to predict or control that change. In it, he offers a quotation from Stanislaw Ulam’s “Tribute to John von Neumann” in which Ulam describes a conversation with von Neuman that

centered on the ever accelerating progress of technology and changes in the mode of human life, which gives the appearance of approaching some essential singularity in the history of the race beyond which human affairs, as we know them, could not continue.

Over the last decade, many optimistic technologists have described this concept with an evangelistic fervor that the staunchest Marxist revolutionary could admire. And yet this transformation is always framed as a technological transformation, rather than a social or political one. The governing fantasy of the singularity is the fantasy of an apolitical revolution, a transformation of society that requires no social intervention at all.3 In this fantasy, the realms of technology and politics are separate, and remain so even after all recognizable traces of the preceding social order have been erased. “Neoliberal Tools (and Archives)” seems to work with the same fantasy, transformed into a nightmare. In both versions, to embrace technology is to leave conscious political engagement behind.

But the singularity only looks like a singularity to technologists because they are used to being able to predict the behavior of technology. From the perspective of this humanist, things look very different: the singularity already happened, and we call it human society. When in the last five-hundred years has it ever been possible for human affairs, as known at a given moment, to continue? What have the last five centuries of human history been if not constant, turbulent, unpredictable change? If a technological singularity arrives, it will arrive because our technological lives will have become as complex and unpredictable as our social and political lives already are. If that time comes, the work of technologists and humanists will be the same.

That might be an unsettling prospect. But we can’t resist it by refusing to build tools, or assuming that the politics of tool-building are predetermined from the outset. Instead, we should embrace building and using tools as inherently complex, unpredictable social and political acts. If they aren’t already, they will be soon.

  1. Please let me know if you see any errors in this data. 
  2. This would be like claiming that when carefully measured, an electorate’s subjective beliefs about a candidate become objective truths about that candidate. To be fair, I’m sure Dalvean would recognize the error when framed that way. I actually like the paper otherwise! 
  3. The genius of The 100 (yes, that show on CW) is that it breaks through this mindset to show what a technological singularity looks like when you add the politics of human societies back in. 

The Size of the Signifier

What is a feature?

It’s worth thinking about the way machine learning researchers use the word “feature”. They speak of “feature selection,” “feature engineering,” and even “automated feature learning.” These processes generally produce a structured body of input — a set of “feature vectors” that machine learning algorithms use to make predictions. But the definition of “feature” is remarkably loose. It amounts to something like “a value we can measure.”

Convolutional Neural Network feature visualization.
Three levels of features learned by a convolutional neural network. From Lee, Grosse, Raganath, Ng, “Convolutional Deep Belief Networks”

People unfamiliar with machine learning might imagine that in the context of (say) face recognition, a “feature” would be the computational equivalent of “high cheekbones” or “curly hair.” And they wouldn’t be far off the mark to think so, in a way — but they might be surprised to learn that often, the features used by image recognition software are nothing more than the raw pixels of an image, in no particular order. It’s hard to imagine recognizing something in an image based on pixels alone. But for many machine learning algorithms, the measurements that count as features can be so small as to seem almost insignificant.

This is possible because researchers have found ways to train computers to assemble higher-level features from lower-level ones. And we do the same thing ourselves, in our own way — we just don’t have conscious access to the raw data from our retinas. The features that we recognize as recognizable are higher-level features. But those had to be assembled too, and that’s part of what our brain does.1 So features at one level are composites of smaller features. Those composite features might be composed to form even larger features, and so on.

When do things stop being features?

One answer is that they stop when we stop caring about their feature-ness, and start caring about what they mean. For example, let’s change our domain of application and talk about word counts. Word counts are vaguely interesting to humanists, but they aren’t very interesting until we start to build arguments with them. And even then, they’re just features. For example, suppose we have an authorship attribution program. We feed passages labeled with their authors into the program; it counts the words in the passages and does some calculations. Now the program can examine texts it hasn’t seen before and guess the author. In this context, that looks like an interpretive move: the word counts are just features, but guess the program makes is an interpretation.

We can imagine another context in which the author of a text is a feature of that text. In such a context, we would be interested in some other property of the text: a higher-level property that depends somehow on the identity of the text’s author. But that property would not itself be a feature — unless we keep moving up the feature hierarchy. This is an anthropocentric definition of the word “feature”: it depends on what humans do, and so gives up some generality. But for that same reason, it’s useful: it shows that we determine by our actions what counts as a feature. If we try to force ourselves into a non-anthropocentric perspective, it might start to look like everything is a feature, which would render the word altogether useless.

I think this line of reasoning is useful for thinking through this moment in the humanities. What does it mean for the humanities to be “digital,” and when will the distinction fade? I would guess that it will have something to do with coming shifts in the things that humanists consider to be features.

In my examples above, I described a hierarchy of features, and without saying so directly, suggested that we should want to move up that hierarchy. But I don’t actually think we should always want that — quite the opposite. It may become necessary to move back down a level or two at times, and I think this is one of those times. Things that used to be just features are starting to feel more like interpretations again. This is how I’m inclined to think of the idea of “Surface Reading” as articulated by Stephen Best and Sharon Marcus a few years ago. The metaphor of surface and depth is useful; pixels are surface features, and literary pixels have become interesting again.

Why should that be so? When a learning algorithm isn’t able to answer a given question, sometimes it makes sense to keep using the same algorithm, but return to the data to look for more useful features. I suspect that this approach is as useful to humans as to computers; in fact, I think many literary scholars have adopted it in the last few years. And I don’t think it’s the first time this has happened. Consider this observation that Best and Marcus make about surface reading in that article:

This valorization of surface reading as willed, sustained proximity to the text recalls the aims of New Criticism, which insisted that the key to understanding a text’s meaning lay within the text itself, particularly in its formal properties.

From a twenty-first century perspective, it’s sometimes tempting to see the New Critics as conservative or even reactionary — as rejecting social and political questions as unsuited to the discipline. But if the analogy I’m drawing is sound, then there’s a case to be made that the New Critics were laying the ground necessary to ask and answer those very questions.2

In his recent discussion of the new sociologies of literature, Ted Underwood expresses concern that “if social questions can only be addressed after you solve all the linguistic ones, you never get to any social questions.” I agree with Underwood’s broader point — that machine learning techniques are allowing us to ask and answer social questions more effectively. But I would flip the argument on its head. The way we were answering linguistic questions before was no longer helping us answer social questions of interest. By reexamining the things we consider features — by reframing the linguistic surfaces we study — we are enabling new social questions to be asked and answered.

  1. Although neural networks can recognize images from unordered pixels, it does help to impose some order on them. One way to do that is to use a convolutional neural network. There is evidence that the human visual cortex, too, has a convolutional structure
  2. In the early twentieth century, the relationships between physics, chemistry, biology, and psychology were under investigation. The question at hand was whether such distinctions were indeed meaningful — were chemists and biologists effectively “physicists in disguise”? A group of philosophers sometimes dubbed the “British Emergentists” developed a set of arguments defending the distinction, arguing that some kinds of properties are emergent, and cannot be meaningfully investigated at the atomic or chemical level. It seems to me that linguistics, literary study, and sociology have a similar relationship; linguists are the physicists, literary scholars the chemists, and sociologists the biologists. We all study language as a social object, differing more in the levels at which we divide things into features and interpretations — morpheme, text, or field. And in this schema, I think the emergent relationships do not “stack” on top of psychology and go up from there. That would suggest that sociology is more distant from psychology than linguistics. But I don’t think that’s true! All three fields seem to me to depend on psychology in similar ways. The emergent relationships between them are orthogonal to emergent relationships in the sciences. 

Old English Phonotactic Constraints?

Something interesting happens when you train a neural network to predict the next character given a text string. Or at least I think it might be interesting. Whether it’s actually interesting depends on whether the following lines of text obey the phonotactic constraints of Old English:

amancour of whad sorn on thabenval ty are orid ingcowes puth lee sonlilte te ther ars iufud it ead irco side mureh

It’s gibberish, mind you — totally meaningless babble that the network produces before it has learned to produce proper English words. Later in the training process, the network tends to produce words that are not actually English words, but that look like English words — “foppion” or “ondish.” Or phrases like this:

so of memmed the coutled

That looks roughly like modern English, even though it isn’t. But the earlier lines are clearly (to me) not even pseudo-English. Could they be pseudo-Old-English (the absence of thorns and eths notwithstanding)? Unfortunately I don’t know a thing about Old English, so I am uncertain how one might test this vague hunch.

Nonetheless, it seems plausible to me that the network might be picking up on the rudiments of Old English lingering in modern (-but-still-weird) English orthography. And it makes sense — of at least the dream-logic kind — that the oldest phonotactic constraints might be the ones the network learns first. Perhaps they are in some sense more fully embedded in the written language, and so are more predictable than other patterns that are distinctive to modern English.

It might be possible to test this hypothesis by looking at which phonotactic constraints the network learns first. If it happened to learn “wrong” constraints that are “right” in Old English — if such constraints even exist — that might provide evidence in favor of this hypothesis.

If you’d like to investigate this and see the kind of output the network produces, I’ve put all the necessary tools online. I’ve only tested this code on OS X; if you have a Mac, you should be able to get this up and running pretty quickly. All the below commands can be copied and pasted directly into Terminal. (Look for it in Applications/Utilities if you’re not sure where to find it.)

  1. Install homebrew — their site has instructions, or you can just trust me:
    ruby -e "$(curl -fsSL"
  2. Use homebrew to install hdf5:
    brew tap homebrew/science
    echo 'Now you are a scientist!'
    brew install hdf5
  3. Use pip to install PyTables:
    echo 'To avoid `sudo pip`, get to know'
    echo 'virtualenv & virtualenvmanager.'
    echo 'But this will do for now.'
    sudo -H pip install tables
  4. Get pann:
    git clone
  5. Generate the training data from text files of your choice (preferably at least 2000k of text):
    cd pann
    ./ default_alphabet new_table.npy
    ./ -t new_table.npy \
        -a default_alphabet -c 101 \
        -s new_features.h5 \ 
        SOME_TEXT.txt SOME_MORE_TEXT.txt
  6. Train the neural network:
    ./ -I new_features.h5 \ 
        -C new_features.h5 -s 4000 1200 300 40 \
        -S new_theta.npy -g 0.1 -b 100 -B 1000 \
        -i 1 -o 500 -v markov
  7. Watch it learn!

It mesmerizes me to watch it slowly deduce new rules. It never quite gets to the level of sophistication that a true recurrent neural network might, but it gets close. If you don’t get interesting results after a good twenty-four or forty-eight hours of training, play with the settings — or contact me!

Meaning, Context, and Algebraic Data Types

A few years ago, I read a paper making a startling argument. Its title was “The Derivative of a Regular Type is its Type of One-Hole Contexts.” I’m not entirely sure how I found it or why I started reading it, but by the time I was halfway though it, I was slapping myself on the forehead: it was so brilliant, and yet so obvious as to be almost trivial — how could nobody have thought of it before?

The idea was that if you take an algebraic data type — say something simple like a plain old list — and poke a hole in it, you get a new data type that looks like a derivative of the first data type. That probably sounds very abstract and uninteresting at first, especially if you aren’t very familiar with algebraic data types. But once you understand them, the idea is simple. And I think it has ramifications for people interested in natural language: it tells us something about the relationship between meaning and context. In particular, it could give us a new way to think about how semantic analysis programs like word2vec function: they perform a kind of calculus on language.

What’s an Algebraic Data Type?

An algebraic data type is really just a regular data type — seen through a particular lens. So it’s worth taking a moment to think through what a regular data type is, even if you’re already familiar with the concept.

A regular data type is just an abstract representation of the kind of data being stored or manipulated by a computer. Let’s consider, as a first example, the simplest possible item of data to be found in a modern computer: a bit. It can take just two values, zero and one. Most programming languages provide a type describing this kind of value — a boolean type. And values of this type are like computational atoms: every piece of data that modern computers store or manipulate is made of combinations of boolean values.1

So how do we combine boolean values? There are many ways, but here are two simple ones that cover a lot of ground. In the first way, we take two boolean values and declare that they are joined, but can vary independently. So if the first boolean value is equal to zero, the second boolean value may be equal to either zero or one, and if the first boolean value is equal to one, the second boolean value may still be equal to either zero or one. And vice versa. They’re joined, but they don’t pay attention to each other at all. Like this:

1) 0 0
2) 0 1
3) 1 0
4) 1 1

In the second way, we take two boolean values and declare that they are joined, but cannot vary independently — only one of them can be active at a given time. The others are disabled (represented by “*” here) — like this:

1) * 0
2) * 1
3) 0 *
4) 1 *

Now suppose we wanted to combine three boolean values. In the first case, we get this:

1) 0 0 0
2) 0 0 1
3) 0 1 0
4) 0 1 1
5) 1 0 0 
6) 1 0 1
7) 1 1 0
8) 1 1 1

And in the second case, we get this:

1) * * 0
2) * * 1
3) * 0 *
4) * 1 *
5) 0 * *
6) 1 * *

At this point, you might start to notice a pattern. Using the first way of combining boolean values, every additional bit doubles the number of possible combined values. Using the second way, every additional bit only adds two to the number of possible combined values.

This observation is the basis of algebraic type theory. In the language of algebraic data types, the first way of combining bits produces a product type, and the second way of combining bits produces a sum type. To get the number of possible combined values of a product type, you simply multiply together the number of possible values that each component type can take. For two bits, that’s 2 \times 2; for three bits, that’s 2 \times 2 \times 2; and for n bits, that’s 2 ^ n. And to get the number of possible combined values of a sum type, you simply add together the number of possible values that each component type can take: 2 + 2, or 2 + 2 + 2, or 2 \times n.

Pretty much all of the numerical types that a computer stores are product types of bits. For example, a 32-bit integer is just the product type of 32 boolean values. It can represent numbers between 0 and 2 ^{32} - 1 (inclusive), for a total of 2 ^{32} values. Larger numbers require more binary digits to store.

Sum types are a little more esoteric, but they become useful when you want to represent things that can come in multiple categories. For example, suppose you want to represent a garden plot, and you want to have a variable that stores the number of plants in the plot. But you also want to indicate what kind of plants are in the plot — basil or thyme, say. A sum type provides a compact way to represent this state of affairs. A plot can be either a basil plot, or a thyme plot, but not both. If there can be up to twenty thyme plants in a thyme plot, and up to sixteen basil plants in a basil plot, the final sum type has a total of thirty-six possible values:

Basil   Thyme
*       1
*       2
...     ...
*       19    
*       20
1       *
2       *
...     ...
15      *
16      *

If you’ve done any programming, product types probably look pretty familiar, but sum types might be new to you. If you want to try a language that explicitly includes sum types, take a look at Haskell.

Poking Holes in Types

Given this understanding of algebraic data types, here’s what it means to “poke a hole” in a type. Suppose you have a list of five two-bit integers:

| a | b | c | d | e |

This new variable’s type is a product type. It can take this value:

| 0 | 0 | 0 | 0 | 0 |

Or this value:

| 0 | 0 | 0 | 0 | 1 |

Or this value…

| 0 | 0 | 0 | 0 | 2 |

And so on, continuing through values like these:

| 0 | 0 | 0 | 0 | 3 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 2 |
| 0 | 0 | 0 | 1 | 3 |
| 0 | 0 | 0 | 2 | 0 |

All the way to these final values:

| 3 | 3 | 3 | 2 | 3 |
| 3 | 3 | 3 | 3 | 0 |
| 3 | 3 | 3 | 3 | 1 |
| 3 | 3 | 3 | 3 | 2 |
| 3 | 3 | 3 | 3 | 3 |

The total number of possible values for a variable of this type is (2 ^ 2) ^ 5 = 4 ^ 5 = 4 * 4 * 4 * 4 * 4 = 1024.

Now think about what happens if you disable a single slot. The result is a list type with a hole in it. What does a variable with this new type look like? Well, say you disable the first slot. You get something that looks like this — a list with a hole in the first slot:

| * | 0 | 0 | 0 | 0 |

Now that there’s a hole in the first slot, there are only 4 * 4 * 4 * 4 = 256 possible values this variable can take.

| * | 0 | 0 | 0 | 1 |
| * | 0 | 0 | 0 | 2 |

We could also poke a hole in any of the other slots:

| 0 | * | 0 | 0 | 0 |
| 0 | 0 | * | 0 | 0 |
| 0 | 0 | 0 | * | 0 |
| 0 | 0 | 0 | 0 | * |

But the hole can only be in one of the slots at any given time. Sound familiar? You could say that this is the “sum type” of each of those one-hole types. Each possible placement of the hole adds to (rather than multiplying) the number of possible values. There are five different places for the hole to go, and for each of those, there are 256 possible values. That’s a total of 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 = 5 \times 4 ^ 4.

Now suppose we make this a little more general, by replacing that 4 with an x. This way we can do this same thing with slots of any size — slots that can hold ten values, or 256 values, or 65536 values, or whatever. For the plain list type, that’s a total of x ^ 5 possible values. And for the list-with-a-hole, that’s 5 x^4 possible values. And now let’s go even further and replace the number of slots with an n. That way we can have any number of slots — five, ten, fifty, whatever you like. Now, the plain list has a total of x ^ n possible values, and the list-with-a-hole has n x ^ {(n - 1)} possible values.

If you ever took calculus, that probably looks very familiar. It’s just the power rule! This means you can take data types and do calculus with them.

One-Hole Contexts

I think that’s pretty wild, and it kept my head spinning for a couple of years when I first read about it. But I moved on. Fast forward to a few weeks ago when I started really reading about word2vec and learning how it works: it finds vectors that are good at predicting the word in the middle of an n-gram, or are good at predicting an n-gram given a word in the middle. Let’s start with the first case — say you have this incomplete 5-gram:

I went _ the store

What word is likely to be in there? Well, “to” is a good candidate — I’d venture a guess that it was the first word that popped into your mind. But there are other possibilities. “Past” might work. If the n-gram is part of a longer sentence, “from” is a possibility. And there are many others. So word2vec will group those words together in its vector space, because they all fit nicely in this context, and in many others.

But look again at that sentence — it’s a sequence of words with a hole in it. So if in this model, a word’s meaning is defined by the n-grams in which it may be embedded, then the type of a word’s meaning is a list-with-a-hole — just like the derivative of the list type that we were looking at above.

The basic idea of word2vec isn’t extremely surprising; from a certain perspective, this is an elaboration of the concept of a “minimal pair” in linguistics. If you know that “bat” means a thing you hit a ball with, and “mat” means a thing you wipe your feet on, then you can tell that /b/ and /m/ are different phonemes, even though they both involve putting your lips together: the difference between their sounds makes a difference in meaning. Conversely, if the difference between two sounds doesn’t make a difference in meaning, then the sounds must represent the same phoneme. “Photograph” pronounced with a distinct “oh” sound in the second syllable is the same word as “photograph” pronounced with an “uh” sound in the second syllable. They’re different sounds, but they both make the same word here, so they represent the same phoneme.

The thinking behind word2vec resembles the second case. If you can put one word in place of another in many different contexts and get similar meanings, then the two words must be relatively close together in meaning themselves.

What is surprising is that by pairing these two concepts, we can link the idea of derivatives in calculus to the idea of meaning. We might even be able to develop a model of meaning in which the meaning of a sentence has the same type as the derivative of that sentence’s type. Or — echoing the language of the paper I began with — the meaning type of a sentence type is its type of one-hole contexts.

  1. But don’t take this for granted. There have been — and perhaps will be in the future — ternary computers

Directed Probabilistic Topic Networks

Suppose you’re standing in front of your bookcase, feeling a little bored. You pick up a book at random, and read a few pages on a random topic. It piques your curiosity, so you put the first book away and pick another one that has something to say about the same topic. You read a few pages from it and notice a second topic that interests you. So you pick up a third book on that topic, and that book draws your attention to yet another topic. And you continue moving from book to topic to book to topic — forever.

Wouldn’t be interesting if we could describe that process mathematically?

Books from Yale's Beinecke LibraryFor the for the last few months I’ve been thinking about the best way to create useful networks with topic models. People have been creating network visualizations of topic models for a long time now, but they sometimes feel a bit like window dressing.1 The problem is that we don’t know what these networks actually represent. The topics are just blobs linked together and floating in a mysterious, abstract space. But what if we could create a network with a clear and concrete interpretation in terms of a physical process that we understand? What if we could create a network that represents the process of browsing through the books on a bookshelf?

I have struck upon a formula that I think does just that — it describes the probability of moving from one topic to another while browsing through a corpus. Remarkably, the formula is very similar to the formula for cosine similarity, which is one of the more popular ways of measuring the similarity between topics. But it differs in crucial ways, and it creates a kind of topic network that I haven’t seen before.2 I’d like to hear what others think about it.

I’ve developed two different theoretical arguments that suggest that the networks this formula creates are more useful than the networks that cosine similarity creates. One argument is related to the theory of bimodal networks, and the other is related to the theory of Markov chains. I have several posts queued up that go into the details, but I’ve decided not to post them just yet. Instead, I’m going to let the method speak for itself on practical grounds. I’ll post more once I feel confident that the result is worth the cognitive investment. However, if you’re interested in the fundamental math, I’ve posted a derivation.

A snippet of a topic network diagram.
A snippet from a network diagram of a topic model of PMLA issues by Ted Underwood — part of the preliminary research that led to Goldstone and Underwood’s “The Quiet Transformations of Literary Studies.”

For now, I’ll assume that most readers are already familiar with — or else are profoundly indifferent to — a few background ideas about topic modeling, cosine similarity, and topic networks.3 I hope that won’t exclude too many people from the conversation, because my core argument will be mostly visual and practical: I think that visualizations of these networks look better, and I think the idea of a “browsing similarity” between topics sounds useful — do you?

So feel free to skip past the wonkish bits to the diagrams!

In my own experimentation and research, I’ve found that browsing similarity creates topic networks that differ in several ways from those that cosine similarity creates. First, they distribute links more uniformly between nodes. It’s desirable to simplify topic networks by cutting links with a flat threshold, because the result is easy to reason about. When you do that with this new kind of network, most of the nodes stick together in one loose clump with lots of internal clustering. Second, they invite a probabilistic interpretation with some interesting and well understood theoretical properties.4 Those properties ensure that even some of the more abstract network-theoretical measures, like eigenvalue centrality, have concrete interpretations. And third, they are directed — which says some important things about the relationships between topics in a corpus.

Below are three network diagrams based on a topic model of two thousand eighteenth-century books published between 1757 and 1795.5 Each has 150 nodes (one for each topic in the model). The strengths of the links between each of the nodes are calculated using either cosine similarity or browsing similarity between topic vectors. Each vector is a sequence of book proportions for a given topic. To usefully visualize a topic model as a network, you must cut some links, and the easiest approach is to apply some kind of threshold. Links stronger than some value stay, and the rest are cut. In all the network diagrams below, I’ve selected threshold values that produce exactly 225 links. For layout, I’ve used D3’s force-directed layout routine, so the diagrams will look a little different each time you reload this page.6

In the first diagram, I’ve used cosine similarity with a simple flat threshold. The result is a hairball with a lot of little islands floating around it:

To deal with this problem, Ted Underwood came up with a really clever link-cutting heuristic that produces much cleaner network diagrams. However, it’s a little ad-hoc; it involves retaining at least one link from every node, and then retaining additional links if they’re strong enough. It’s like a compromise between a flat threshold (take all links stronger than x) and a rank-based threshold (take the strongest n links from each node).

In the second diagram, I’ve used cosine similarity again, but applied a variation on Underwood’s heuristic with a tunable base threshold.7

The result is much more coherent, and there’s even a bit of suggestive clustering in places. There are a few isolated archipelagos, but there are no singleton islands, because this method guarantees that each node will link to at least one other node.

Now for the browsing similarity approach. In the third diagram, I’ve used browsing similarity with a simple flat threshold:

Although this diagram has both singleton islands and archipelagos, it’s far more connected than the first, and it has almost as many mainland connections as the second. It also shows a bit more clustering behavior than diagram two does. But what I find most interesting about it is that it represents the concrete browsing process I described above: each of the edges represents a probability that while browsing randomly through the corpus, you will happen upon one topic, given that you are currently reading about another.8 That’s why the edges are directed — you won’t be as likely to move from topic A to topic B as from topic B to topic A. This makes perfect sense: it ought to be harder to move from common topics to rare topics than to move from rare topics to common topics.

Because I wanted to show the shapes of these graphs clearly, I’ve removed the topic labels. But you can also see full-screen versions of the cosine, Underwood, and browsing graphs, complete with topic labels that show more about the kinds of relationships that each of them preserve.

Here’s everything you need to play with the browsing similarity formula. First, the mathematical formula itself:

\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}

You can think of b as standing for “book,” and X and Y as two different topics. x_1 is the proportion of book 1 that is labeled as being about topic X, and so on. The formula is very similar to the forumla for cosine similarity, and the tops are identical. Both calculate the dot product of two topic-book vectors. The difference between them is on the bottom. For browsing similarity, it’s simply the sum of the values in Y, but for cosine similarity, it’s the product of the lengths of the two vectors:

\frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}

Here a bit of jargon is actually helpful: cosine similarity uses the “euclidean norm” of both X and Y, while browsing similarity uses only the “manhattan norm” of Y, where “norm” is just a ten dollar word for length. Thinking about these as norms helps clarify the relationship between the two formulas: they both do the same thing, but browsing similarity uses a different concept of length, and ignores the length of X. These changes turn the output of the formula into a probability.

Next, some tools. I’ve written a script that generates Gephi-compatible or D3-compatible graphs from MALLET output. It can use cosine or browsing similarity, and can apply flat, Underwood-style, or rank-based cutoff thresholds. It’s available at GitHub, and it requires numpy and networkx. To use it, simply run MALLET on your corpus of choice, and pass the output to on the command line like so:

./ network --remove-self-loops \
    --threshold-value 0.05 \
    --threshold-function flat \
    --similarity-function browsing \
    --output-type gexf \
    --write-network-file browsing_sim_flat \
    --topic-metadata topic_names.csv \

It should be possible to cut and paste the above command into any bash terminal — including Terminal in OS X under default settings. If you have any difficulties, though, let me know! It may require some massaging to work with Windows. The command should generate a file that can be directly opened by Gephi. I hope the option names are obvious enough; more detailed information about options is available via the --help option.

In case you’d prefer to work this formula into your own code, here is a simplified version of the browsing_similarity function that appears in the above Python script. Here, A is a matrix of topic row vectors. The code here is vectorized to calculate every possible topic combination at once and put them all into a new matrix. You can therefore interpret the output as the weighted adjacency matrix of a fully-connected topic network.

def browsing_similarity(A):
    A = numpy.asarray(A)
    norm = A.T.sum(axis=0)
    return, A.T) / norm

And here’s the same thing in R9:

browsingsim norm = rowSums(A)
    dot = A %*% t(A)
    return(t(dot / norm))

Matrix calculations like this are a dream when the shapes are right, and a nightmare when they’re wrong. So to be ridiculously explicit, the matrix A should have number_of_topics rows and number_of_books columns.

I have lots more to say about bimodal networks, conditional probability, Markov chains, and — at my most speculative — about the questions we ought to ask as we adapt more sophisticated mathematical techniques for use in the digital humanities.

But until then, comments are open!

  1. Ted Underwood has written that “it’s probably best to view network visualization as a convenience,” and there seems to be an implicit consensus that topic networks are more visually stunning than useful. My hope is that by creating networks with more concrete interpretations, we can use them to produce evidence that supports interesting arguments. There are sure to be many details to work through and test before that’s possible, but I think it’s a research program worth developing further. 
  2. I’ve never seen anything quite like this formula or the networks it produces. But I’d love to hear about other work that people have done along these lines — it would make the theoretical burden much lighter. Please let me know if there’s something I’ve missed. See also a few near-misses in the first footnote to my post on the formula’s derivation. [Update: I found a description of the Markov Cluster Algorithm, which uses a matrix that is similar to the one that browsing similarity produces, but that is created in a slightly different way. I’m investigating this further, and I’ll discuss it when I post on Markov chains.] 
  3. If you’d like to read some background material, and you don’t already have a reading list, I propose the following sequence: Matt Jockers, The LDA Buffet Is Now Open (very introductory); Ted Underwood, Topic Modeling Made Just Simple Enough (simple enough but no simpler); Miriam Posner and Andy Wallace, Very Basic Strategies for Interpreting Results from the Topic Modeling Tool (practical approaches for quick bootstrapping); Scott Weingart, Topic Modeling and Network Analysis (introduction to topic networks); Ted Underwood, Visualizing Topic Models (additional theorization of topic visualization). 
  4. Specifically, their links can be interpreted as transition probabilities in an irreducible, aperiodic Markov chain. That’s true of many networks, strictly speaking. But in this case, the probabilities are not derived from the network itself, but from the definition of the browsing process. 
  5. I have a ton of stuff to say about this corpus in the future. It’s part of a collaborative project that Mae Capozzi and I have been working on. 
  6. Because the force-directed layout strategy is purely heuristic, the layout itself is less important than the way the nodes are interconnected. But the visual intuition that the force-directed layout provides is still helpful. I used the WPD3 WordPress plugin to embed these. It’s a little finicky, so please let me know if something has gone wrong. 
  7. Underwood’s original method kept the first link, the second link if it was stronger than 0.2, and the third link if it was stronger than 0.38. This variation takes a base threshold t, which is multiplied by the rank of a given link to determine the threshold it must meet. So if the nth strongest link from a node is stronger than t * (n - 1), then it stays. 
  8. Because some links have been cut, the diagram doesn’t represent a full set of probabilities. It only represents the strongest links — that is, the topic transitions that the network is most biased towards. But the base network retains all that information, and standard network measurements apply to it in ways that have concrete meanings. 
  9. I had to do some odd transpositions to ensure that the R function generates the same output as the Python function. I’m not sure I used the best method — the additional transpositions make the ideas behind the R code seem less obvious to me. (The Python transpositions might seem odd to others — I guess they look normal to me because I’m used to Python.) Please let me know if there’s a more conventional way to manage that calculation. 

A Sentimental Derivative

Ben Schmidt’s terrific insight into the assumptions that the Fourier transform imposes on sentiment data has been sinking in, and I have a left-field suggestion for anyone who cares to check it out. I plan to investigate it myself when I have the time, but I’ve decided to broadcast it now.

In the imaginary universe of Fourier land, all texts start and end at the same sentiment amplitude. This is clearly incorrect, as I see it.1 But what could we say about the beginning and end of texts that might hold up?

One possibility is that all texts might start and end with a flat sentiment curve. That is, at the very beginning and end of a text, we can assume that the valence of words won’t shift dramatically. That’s not clearly incorrect. I think it’s even plausible.

Now consider how we talk about plot most of the time: we speak of rising action (slope positive), falling action (slope negative), and climaxes (local and global maxima). That’s first derivative talk! And the first derivative of a flat curve is always zero. So if the first derivative of a sentiment curve always starts and ends at zero, then at least one objection to the Fourier transform approach can be worked around. For example, we could simply take the first finite difference of a text’s sentiment time series, perform a DFT and low-pass filter, do a reverse transform, and then do a cumulative sum (i.e. a discrete integration) of the result.2

What would that look like?

  1. Nonetheless, I think there’s some value to remaining agnostic about this for some time still — even now, after the dust has settled a bit. 
  2. You might be able to skip a step or two. 

What’s a Sine Wave of Sentiment?

Over the last month a fascinating series of debates has unfolded over Matt Jockers’ Syuzhet package. The debates have focused on whether Syuzhet’s smoothing strategy, which involves using a Fourier transform and low-pass filter, is appropriate. Annie Swafford has produced several compelling arguments that this strategy doesn’t work. And Ted Underwood has responded with what is probably the most accurate assessment of our current state of knowledge: we don’t know enough yet to say anything at all.

I have something to add to these debates, but I’ll begin by admitting that I haven’t used Syuzhet. I’m only just now starting to learn R. (I’ve been using Python / Numpy / Scipy / Pandas in my DH work so far.) My objection is not based on data or validation, statistical or otherwise. It’s based on a more theoretical line of reasoning.1

I broadly agree with Annie Swafford’s assessment: it looks to me like this strategy is producing unwanted ringing artifacts.2 But Matt Jockers’ counterargument troubles her line of reasoning — he suggests that ringing artifacts are what we want. That doesn’t sound right to me, but that argumentative move shows what’s really at stake here. The question is not whether ringing artifacts distort the data relative to some ground truth. There’s no doubt about that — this is, after all, a way of distorting rough data to make it smooth. The question is whether we want this particular kind of distortion. My issue with using Fourier transforms to represent sentiment time series data is that we have no clear theoretical justification to do so. We have no theoretical reason to want the kind of distortion it produces.

If we hope to use data mining tools to produce evidence, we need to think about ways to model data that are suited to our own fields. This is a point Ted Underwood made early on in the conversation about LDA, well before much had been published by literary scholars on the subject. The point he made is as important now as then: we should do our best to ensure that the mathematical models we use have clear and concrete interpretations in terms of the physical processes that we study and seek to understand: reading, writing, textual distribution, influence, and so on. This is what Syuzhet fails to do at the smoothing and filtering stage right now. I think the overall program of Syuzhet is a promising one (though there may be other important aspects of the thing-that-is-not-fabula that it ignores). But I think the choice of Fourier analysis for smoothing is not a good choice — not yet.

The Dirichlet Distribution
The Dirichlet distribution for three categories is defined over values of X, Y, and Z adding up to 1

A Fourier transform models time series data as a weighted sum of sine waves of different frequencies. But I can think of no compelling reason to represent a sequence of sentiment measurements as a sum of sine waves. Consider LDA as a point of comparison (as Jockers has). There’s a clear line of reasoning that supports our using the Dirichlet distribution as a prior. One could certainly argue that the Dirichlet density has the wrong shape, but its support — the set of values over which it is defined — has the right shape.3 It’s a set of N distinct real-valued variables that always sum to one. (In other words, it’s a distribution over the ways to break a stick into N parts.) Since we have good reasons to think about language as being made of distinct words, thinking in terms of categorical probability distributions over those words makes sense. And the Dirichlet distribution is a reasonable prior for categorical distributions because its support consists entirely of categorical probability distributions (ways to break a stick). Even if we were to decide that we need a different prior distribution, we would probably still choose a distribution defined over the same support.4

Wavelets and Chirplets
Wavelets and Chirplets — via Wikimedia Commons

But the support of the function produced by a Fourier transform is the frequency domain of a sinusoidal curve. Is that the right support for this purpose? Setting aside the fact that we’re no longer talking about a probability distribution, I think it’s still important to ask that question. If we could have confidence that it makes sense to represent a sentiment time series as a sum of sinusoidal curves, then we might be able to get somewhere with this approach. The support would be correct, even if the shape of the curve over the frequency domain weren’t quite right. But why should we accept that? Why shouldn’t we be looking at functions over domains of wavelets or chirplets or any number of other possibilities? Why should the sentimental valence of the words in a novel be best represented by sine waves in particular?

I think this is a bit like using a Gaussian mixture model (GMM) to do topic modeling. You can use Gaussian distributions as priors for topic models. It might even be a good idea to do so, because it could allow us to get good results faster. But it’s not going to help us understand how topic modeling works in the first place. The Gaussian prior obscures what’s really going on under the hood.5 Even if we all moved over to Gaussian priors in our topic models, we’d probably still use classic LDA to get a handle on the algorithm. In this case, I think the GMM is best understood as a way to approximate LDA.

Chirplets are good at modeling objects in perspective
Chirplets are good at modeling objects in perspective — via Wikimedia Commons

And now, notice that we can use a Fourier transform to approximate any function at all. But what does doing so tell us about the function? Does it tell us what we want to know? I have no idea in this case, and I don’t think anyone else does either. It’s anyone’s guess whether the sine waves that this transform uses will correspond to anything at all familiar to us.

I think this is a crucial issue, and it’s one we can frame in terms of disciplinary continuity. Whenever we do any kind of informal reasoning based on word counts alone, we’re essentially thinking in terms of categorical distributions. And I think literary scholars would have paid attention to a well-reasoned argument based on word counts thirty years ago. If that’s right, then LDA simply gives us a way to accelerate modes of reasoning about language that we already understand. But if thirty years ago someone had argued that the movement of sentiment in a novel should be understood through sinusoidal shapes, I don’t think they would have been taken very seriously.

Admittedly, I have no strong justification for this claim, and if there’s widespread disagreement about it, then this debate will probably continue for some time. But either way, we need to start thinking very concretely about what it means to represent sentiment specifically as a sine wave. We will then be able to trust our intuitions about our own field of study to guide us.

  1. This means that to a certain degree, I’m not taking Syuzhet in the spirit with which it was offered. Jockers writes that his “primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories.” I’ve not done that, even though I think it’s a sound method. We can’t depend only on statistical measurements; our conclusions need intuitive support. But I also think the theoretical questions I’m asking here are necessary to build that kind of support. 
  2. I suspected as much the moment I read about the package, though I’m certain I couldn’t have articulated my concerns without Swafford’s help. Update: And I hope it’s clear to everyone, now that the dust has settled, that Swafford has principal investigator status in this case. If she hadn’t started it, the conversation probably wouldn’t have happened at all.
  3. The support of a function is the set of inputs that it maps to nonzero outputs. 
  4. The logic of this argument is closely related to the theory of types in computer programming. One could say that a categorical sampling algorithm accepts variables of the “broken stick” type and samples from them; and one could say that when we sample from a Dirichlet distribution, the output is a variable of the “broken stick” type. 
  5. The truth of this is strongly suggested to me by the fact that the above cited paper on GMM-based topic modeling initially proposes a model based on “cut points” — a move I will admit that I understand only in vague terms as a way of getting discrete output from a continuous function. That paper looks to me like an attempt to use a vector space model for topic modeling. But as I’ll discuss in a later post, I don’t find vector space models of language especially compelling because I can’t develop a concrete interpretation of them in terms of authors, texts, and readers. 

A Random Entry

There’s a way of telling a history of the digital humanities that does not follow the well known trajectory from Father Busa’s Index Thomisticus, Mosteller and Wallace’s study of the Federalist Papers, and the Text Encoding Initiative — to Distant Reading, data mining, and the present day. It does not describe the slow transformation of a once-peripheral field into an increasingly mainstream one. Instead, it describes a series of missed opportunities.

It’s a polemical history that inverts many unspoken assumptions about the relationship between the humanities and the sciences. I’m not sure I entirely believe it myself. But I think it’s worth telling.

It starts like this: there once was a guy named Frank Rosenblatt. In 1957, Rosenblatt created the design for a device he called the perceptron. It was an early attempt at simulating the behavior of networks of biological neurons, and it initially provoked a frenzy of interest, including the following New York Times report:

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

Needless to say, the perceptron never managed to do any of those things. But what it did do was exceptional in its own way. It used to great practical effect the following insight: that many small, inaccurate rules can be combined in simple ways to create a larger, more accurate rule. This insight is now central to statistical learning theory.1 But it was not understood as particularly important at the time.

In fact, when people began to realize that the perceptron in its simplest form was limited, a backlash ensued. Marvin Minsky and Seymour Papert wrote a book called Perceptrons that enumerated the limits of simple two-layer perceptrons2; people misunderstood the book’s arguments as applying to all neural networks; and the early promise of perceptrons was forgotten.

This turn of events may have delayed the emergence of interesting machine learning technologies by a couple of decades. After the perceptron backlash, artificial intelligence researchers focused on using logic to model thought — creating ever more complex sets of logical rules that could be combined to generate new rules within a unified and coherent system of concepts. This approach was closely related to the kinds of transformational grammars that Noam Chomsky has been exploring since the 1950s, and it largely displaced statistical approaches — with a few exceptions — until the 1990s.

Unsurprisingly, Chomsky remains hostile to statistical and probabilistic approaches to machine learning and artificial intelligence. Nonetheless, there does seem to be some evidence that those approaches have gotten something right. Peter Norvig offers the following summary:

Chomsky said words to the effect that statistical language models have had some limited success in some application areas. Let’s look at computer systems that deal with language, and at the notion of “success” defined by “making accurate predictions about the world.” First, the major application areas:

  • Search engines: 100% of major players are trained and probabilistic. Their operation cannot be described by a simple function.
  • Speech recognition: 100% of major systems are trained and probabilistic…
  • Machine translation: 100% of top competitors in competitions such as NIST use statistical methods…
  • Question answering: this application is less well-developed, and many systems build heavily on the statistical and probabilistic approach used by search engines…

Now let’s look at some components that are of interest only to the computational linguist, not to the end user:

  • Word sense disambiguation: 100% of top competitors at the SemEval-2 competition used statistical techniques; most are probabilistic…
  • Coreference resolution: The majority of current systems are statistical…
  • Part of speech tagging: Most current systems are statistical…
  • Parsing: There are many parsing systems, using multiple approaches. Almost all of the most successful are statistical, and the majority are probabilistic…

Clearly, it is inaccurate to say that statistical models (and probabilistic models) have achieved limited success; rather they have achieved a dominant (although not exclusive) position.

In the past fifteen years, these approaches to machine learning have produced a number of substantial leaps forward — consider Google’s famous creation of a neural network that (in at least some sense) reinvented the concept of “cat,” or this recurrent neural network capable of imitating various styles of human handwriting. These extraordinary successes have been made possible by a dramatic increase in computing power. But without an equally dramatic shift in ways of thinking about what constitutes knowledge, that increase in computing power would have accomplished far less. What has changed is that the people doing the math have stopped trying to find logical models of knowledge by hand, and have started trying to find probabilistic models of knowledge — models that embrace heterogeneity, invite contradiction, and tolerate or even seek out ambiguity and uncertainty. As machine learning researchers have discovered, the forms these models take can be defined with mathematical precision, but the models themselves tolerate inconsistencies in ways that appear to be unbound by rigid logic.3

I’d like to suggest that by embracing that kind of knowledge, computer scientists have started walking down a trail that humanists were blazing fifty years ago.

The kind of knowledge that these machines have does not take the form of a rich, highly structured network of immutable concepts and relations with precise and predictable definitions. It takes the form of a loose assembly of inconsistent and mutually incompatible half-truths, always open to revision and transformation, and definable only by the particular distinctions it can make or elide at any given moment. It’s the kind of knowledge that many literary scholars and humanists have found quite interesting for the last few decades.

Since the decline of structuralism, humanists have been driven by a conviction that the loosely or multiply structured behaviors that constitute human culture produce important knowledge that cannot be produced in more structured ways. Those humanities scholars who remained interested in structured ways of producing knowledge — like many of the early practitioners of humanities computing — were often excluded from conversations in the humanistic mainstream.

Now something has changed. The change has certainly brought computational methods closer to the mainstream of the humanities. But we mustn’t mistake the change by imagining that humanists have somehow adopted a new scientism. A better explanation of this change is that computer scientists, as they have learned to embrace the kinds of knowledge produced by randomness, have reached a belated understanding of the value of — dare I say it? — post-structuralist ways of knowing.

It’s a shame it didn’t happen earlier.

  1. I first encountered the above formulation of this idea in the first video in Geoffrey Hinton’s online course on neural networks. But you can see it being used by other researchers (p. 21) working in machine learning on a regular basis. 
  2. In machine learning lingo, it could not learn nonlinear decision boundaries. It didn’t even have the ability to calculate the logical XOR operation on two inputs, which at the time probably made logic-based approaches look far more promising. 
  3. I say “appear” because it’s not entirely clear what it would mean to be unbound by rigid logic. The mathematical formulation of machine learning models is itself perfectly strict and internally consistent, and if it weren’t, it would be irreparably broken. Why don’t the statistical models represented by that formulation break in the same way? I suspect that it has something to do with the curse blessing of dimensionality. They don’t break because every time a contradiction appears, a new dimension appears to accommodate it in an ad-hoc way — at least until the model’s “capacity” for such adjustments is exhausted. I’m afraid I’m venturing a bit beyond my existential pay grade with these questions — but I hope this sliver of uncertainty doesn’t pierce to the core of my argument.