Digital Humanities

A Random Entry

There’s a way of telling a history of the digital humanities that does not follow the well-known trajectory from Father Busa’s Index Thomisticus, through Mosteller and Wallace’s study of the Federalist Papers and the Text Encoding Initiative, to Distant Reading, data mining, and the present day. It does not describe the slow transformation of a once-peripheral field into an increasingly mainstream one. Instead, it describes a series of missed opportunities.

It’s a polemical history that inverts many unspoken assumptions about the relationship between the humanities and the sciences. I’m not sure I entirely believe it myself. But I think it’s worth telling.

It starts like this: there once was a guy named Frank Rosenblatt. In 1957, Rosenblatt created the design for a device he called the perceptron. It was an early attempt at simulating the behavior of networks of biological neurons, and it initially provoked a frenzy of interest, including the following New York Times report:

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

Needless to say, the perceptron never managed to do any of those things. But what it did do was exceptional in its own way: it put to practical use the insight that many small, inaccurate rules can be combined in simple ways to create a larger, more accurate rule. This insight is now central to statistical learning theory.1 But it was not understood as particularly important at the time.
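
To see that insight in miniature, here is a toy version of the learning rule in its simplest modern form. This is my own sketch, not Rosenblatt’s machine (which was analog hardware): each weight acts as a small, unreliable rule of thumb, and the weighted sum combines those rules into a single, more reliable decision.

    import numpy as np

    def train_perceptron(inputs, targets, epochs=20, lr=0.1):
        # Each weight is a small, unreliable rule of thumb; the weighted sum
        # combines them into one decision that is better than any weight alone.
        weights = np.zeros(inputs.shape[1])
        bias = 0.0
        for _ in range(epochs):
            for x, target in zip(inputs, targets):
                prediction = 1 if weights @ x + bias > 0 else 0
                error = target - prediction        # -1, 0, or +1
                weights += lr * error * x          # nudge each little rule
                bias += lr * error
        return weights, bias

    # A linearly separable toy problem: logical OR.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 1])
    w, b = train_perceptron(X, y)
    print([(1 if w @ x + b > 0 else 0) for x in X])   # prints [0, 1, 1, 1]

Each weight on its own is only a weak hint; on a linearly separable problem like OR, a few passes through the data are enough for the combined, weighted decision to get every case right.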

In fact, when people began to realize that the perceptron in its simplest form was limited, a backlash ensued. Marvin Minsky and Seymour Papert wrote a book called Perceptrons that enumerated the limits of simple two-layer perceptrons2; people misunderstood the book’s arguments as applying to all neural networks; and the early promise of perceptrons was forgotten.
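
The limitation they identified is easy to see in code as well. Here is a toy demonstration of my own (not Minsky and Papert’s analysis, which was mathematical), using the same learning rule as the sketch above: XOR is not linearly separable, so a single layer of adjustable weights has nowhere to settle, and the four cases are never all classified correctly, no matter how long you train.

    import numpy as np

    # XOR targets: true exactly when the two inputs differ.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])

    w, b = np.zeros(2), 0.0
    for _ in range(1000):                  # far more passes than OR needed above
        for x, t in zip(X, y):
            pred = 1 if w @ x + b > 0 else 0
            update = 0.1 * (t - pred)      # same perceptron update as before
            w, b = w + update * x, b + update

    # Whatever w and b end up being, they define a single straight line,
    # so the printed predictions can never equal [0, 1, 1, 0].
    print([(1 if w @ x + b > 0 else 0) for x in X])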

This turn of events may have delayed the emergence of interesting machine learning technologies by a couple of decades. After the perceptron backlash, artificial intelligence researchers focused on using logic to model thought — creating ever more complex sets of logical rules that could be combined to generate new rules within a unified and coherent system of concepts. This approach was closely related to the kinds of transformational grammars that Noam Chomsky has been exploring since the 1950s, and it largely displaced statistical approaches — with a few exceptions — until the 1990s.

Unsurprisingly, Chomsky remains hostile to statistical and probabilistic approaches to machine learning and artificial intelligence. Nonetheless, there does seem to be some evidence that those approaches have gotten something right. Peter Norvig offers the following summary:

Chomsky said words to the effect that statistical language models have had some limited success in some application areas. Let’s look at computer systems that deal with language, and at the notion of “success” defined by “making accurate predictions about the world.” First, the major application areas:

  • Search engines: 100% of major players are trained and probabilistic. Their operation cannot be described by a simple function.
  • Speech recognition: 100% of major systems are trained and probabilistic…
  • Machine translation: 100% of top competitors in competitions such as NIST use statistical methods…
  • Question answering: this application is less well-developed, and many systems build heavily on the statistical and probabilistic approach used by search engines…

Now let’s look at some components that are of interest only to the computational linguist, not to the end user:

  • Word sense disambiguation: 100% of top competitors at the SemEval-2 competition used statistical techniques; most are probabilistic…
  • Coreference resolution: The majority of current systems are statistical…
  • Part of speech tagging: Most current systems are statistical…
  • Parsing: There are many parsing systems, using multiple approaches. Almost all of the most successful are statistical, and the majority are probabilistic…

Clearly, it is inaccurate to say that statistical models (and probabilistic models) have achieved limited success; rather they have achieved a dominant (although not exclusive) position.

In the past fifteen years, these approaches to machine learning have produced a number of substantial leaps forward — consider Google’s famous creation of a neural network that (in at least some sense) reinvented the concept of “cat,” or this recurrent neural network capable of imitating various styles of human handwriting. These extraordinary successes have been made possible by a dramatic increase in computing power. But without an equally dramatic shift in ways of thinking about what constitutes knowledge, that increase in computing power would have accomplished far less. What has changed is that the people doing the math have stopped trying to find logical models of knowledge by hand, and have started trying to find probabilistic models of knowledge — models that embrace heterogeneity, invite contradiction, and tolerate or even seek out ambiguity and uncertainty. As machine learning researchers have discovered, the forms these models take can be defined with mathematical precision, but the models themselves tolerate inconsistencies in ways that appear to be unbound by rigid logic.3
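
The flavor of that shift is easy to show on a small scale. What follows is a toy illustration of my own (not anything built by Google or described by Norvig): a word-bigram model with smoothing, about the simplest statistical language model there is. Instead of hand-written rules declaring which sequences of words are legal, it counts what it has seen and assigns every possibility a probability, so that nothing is forbidden outright, only more or less unlikely.

    from collections import Counter, defaultdict

    # Count how often each word follows each other word in a tiny corpus.
    corpus = "the cat sat on the mat . the dog sat on the cat .".split()
    counts = defaultdict(Counter)
    for prev, word in zip(corpus, corpus[1:]):
        counts[prev][word] += 1

    vocab = set(corpus)

    def prob(word, prev, alpha=0.1):
        # Add-alpha smoothing: pairs the model has never seen still get a
        # small, nonzero probability instead of being ruled out entirely.
        total = sum(counts[prev].values()) + alpha * len(vocab)
        return (counts[prev][word] + alpha) / total

    print(round(prob("cat", "the"), 3))  # seen twice after "the": about 0.45
    print(round(prob("dog", "the"), 3))  # seen once: about 0.23
    print(round(prob("sat", "the"), 3))  # never seen after "the": about 0.02, not zero

The model never says “impossible”; it only says “unlikely,” and it revises those judgments the moment it sees more text. Scaled up enormously, and with far cleverer ways of generalizing across contexts, this is the sort of probabilistic knowledge the systems in Norvig’s list rely on.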

I’d like to suggest that by embracing that kind of knowledge, computer scientists have started walking down a trail that humanists were blazing fifty years ago.

The kind of knowledge that these machines have does not take the form of a rich, highly structured network of immutable concepts and relations with precise and predictable definitions. It takes the form of a loose assembly of inconsistent and mutually incompatible half-truths, always open to revision and transformation, and definable only by the particular distinctions it can make or elide at any given moment. It’s the kind of knowledge that many literary scholars and humanists have found quite interesting for the last few decades.

Since the decline of structuralism, humanists have been driven by a conviction that the loosely or multiply structured behaviors that constitute human culture produce important knowledge that cannot be produced in more structured ways. Those humanities scholars who remained interested in structured ways of producing knowledge — like many of the early practitioners of humanities computing — were often excluded from conversations in the humanistic mainstream.

Now something has changed. The change has certainly brought computational methods closer to the mainstream of the humanities. But we mustn’t misread the change as a sign that humanists have somehow adopted a new scientism. A better explanation is that computer scientists, as they have learned to embrace the kinds of knowledge produced by randomness, have reached a belated understanding of the value of — dare I say it? — post-structuralist ways of knowing.

It’s a shame it didn’t happen earlier.


  1. I first encountered the above formulation of this idea in the first video of Geoffrey Hinton’s online course on neural networks. But you can see other machine learning researchers using it on a regular basis (p. 21). 
  2. In machine learning lingo, it could not learn nonlinear decision boundaries. It couldn’t even compute the logical XOR of two inputs, because no single straight line can separate XOR’s true cases from its false ones, which at the time probably made logic-based approaches look far more promising. 
  3. I say “appear” because it’s not entirely clear what it would mean to be unbound by rigid logic. The mathematical formulation of machine learning models is itself perfectly strict and internally consistent, and if it weren’t, it would be irreparably broken. Why don’t the statistical models represented by that formulation break in the same way? I suspect it has something to do with the so-called curse of dimensionality, which in this case works more like a blessing. They don’t break because every time a contradiction appears, a new dimension appears to accommodate it in an ad hoc way — at least until the model’s “capacity” for such adjustments is exhausted. I’m afraid I’m venturing a bit beyond my existential pay grade with these questions — but I hope this sliver of uncertainty doesn’t pierce to the core of my argument.