Brownian Noise and Plot Arcs

A plot of Brownian noise.
A tenth of a second of Brownian noise in PCM format. Large samples of Brownian noise give results similar to those reported by Jockers and by Reagan et al.

A couple of months ago, a research group released a paper on the arXiv titled “The emotional arcs of stories are dominated by six basic shapes.” In it, they replicate results similar to those first described by Matt Jockers, using a somewhat different technique.

I’ve written a jupyter notebook that raises doubts about their argument. They claim that their work has shown that there are “basic shapes” that dominate human stories, but the results they’ve described provide no basis for such generalizations. Given what we know so far, it’s much more likely that the emotional arcs that these techniques reveal are, in general, noise. The notebook is available for examination and reuse as a github repository.
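
For readers who want the gist without opening the notebook, here is the kind of toy demonstration I have in mind — not the notebook’s code, just a few lines of numpy. Brownian noise is what you get by cumulatively summing uncorrelated random values, and once it’s smoothed, it looks uncannily like an emotional arc:

import numpy as np

rng = np.random.default_rng(0)

# Brownian noise: the cumulative sum of uncorrelated Gaussian noise.
white = rng.normal(size=10000)
brownian = np.cumsum(white)

# Smooth it with a moving average, roughly the way a sentiment time series
# gets smoothed, and the result looks like a plausible "emotional arc."
window = 500
arc = np.convolve(brownian, np.ones(window) / window, mode="valid")

Plot a few runs of this and you get a family of rises, falls, and rise-and-fall shapes with no story behind them at all.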

What does it mean to say that these plot shapes are “noise”? The notebook linked above focuses on technical issues, but I want to write a few brief words here about the broader implications my argument has — if it’s correct. Initially, it may seem that if sentiment data is “noise,” then these measurements must be entirely meaningless. And yet many researchers have now done at least some preliminary work validating these measurements against responses from human readers. Jockers’ most recent work shows fairly strong correlations between human sentiment assessments and those produced by his Syuzhet package. If these sentiment measurements are meaningless, does that mean that human assessments are meaningless as well?

That conclusion does not sit well with me, and I think it is based on an incorrect understanding of the relationship between noise and meaning. In fact, according to one point of view, the most “meaningful” data is precisely the most random data, since maximally random data is in some sense maximally dense with information — provided one can extract the information in a coherent way. Should we find that sentiment data from novels does indeed amount to “mere noise,” literary critics will have some very difficult questions to ask themselves about the conditions under which noise signifies.

The Radical Potential of RDF Dimension Reduction?

Note (2016-06-03): This revises an earlier post by removing some dated information and expanding the conclusion.

This brief post was inspired by the abstract for a talk by Hanna Wallach at the fourth IPAM Culture Analytics workshop. I didn’t even attend it, so take the first part of this with a grain of salt! In her talk, Wallach discussed a new Bayesian dimension-reduction technique that operates over timestamped triples. In other words — as she and her co-authors put it in the paper summarizing their research — “records of the form ‘country i took action a toward country j at time t’ — known as dyadic events.”

The abstract as well as the paper framed this as a way of analyzing international relations. But what struck me immediately about this model is that it works with data that could be represented very naturally as RDF triples (if you add in a timestamp, that is). That means that this method might be able to do for RDF triples what topic modeling does for texts.

This probably seems like an odd thing to care about to people who haven’t read Miriam Posner’s keynote on the radical potential of DH together with Matthew Lincoln’s response. In her keynote, Posner poses a question to practitioners of DH: why do we so willingly accept data models that rely on simplistic categories? She observes, for example, that the Getty’s Union List of Artist Names relies on a purely binary model of gender. But this is strangely regressive in the wider context of the humanities:

no self-respecting humanities scholar would ever get away with such a crude representation of gender… So why do we allow widely shared, important databases like ULAN to deal so naively with identity?

She elaborates on this point using the example of context-dependent racial categories:

a useful data model for race would have to be time- and place-dependent, so that as a person moved from Brazil to the United States, she might move from white to black. Or perhaps the categories themselves would be time- and place-dependent, so that certain categories would edge into whiteness over time. Or! Perhaps you could contrast the racial makeup of a place as the Census understands it with the way it’s articulated by the people who live there.

Matt Lincoln’s brilliant response takes this idea and gives it a concrete computational structure: RDF. Rather than having fixed categories of race, we can represent multiple different conceptualizations of race within the same data structure. The records of these conceptualizations take the form of {Subject, Verb, Object} triples, which can then form a network:

A diagram of a network of perceived racial categories.

Given that Posner’s initial model included time as well, adding timestamps to these verbs seems natural, even if it’s not, strictly speaking, included in the RDF standard. (Or is it? I don’t know RDF that well!) But once we have actors, timestamped verbs, and objects, then I think we can probably use this new dimension reduction technique on networks of this kind.1
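
To make the data structure concrete, here is a toy sketch in Python — not the model Wallach and her co-authors actually use (theirs is a Bayesian tensor factorization), just an illustration of how hypothetical timestamped {Subject, Verb, Object} triples could be packed into a count tensor, with an ordinary SVD standing in for the real dimension-reduction step:

import numpy as np

# Hypothetical timestamped triples: (subject, verb, object, time bin).
triples = [
    ("person_a", "identifies_as", "category_x", 0),
    ("person_b", "perceives_as", "category_y", 0),
    ("person_a", "identifies_as", "category_y", 1),
]

subjects = sorted({s for s, _, _, _ in triples})
verbs = sorted({v for _, v, _, _ in triples})
objects = sorted({o for _, _, o, _ in triples})
times = sorted({t for _, _, _, t in triples})

# A four-way count tensor: subject x verb x object x time.
counts = np.zeros((len(subjects), len(verbs), len(objects), len(times)))
for s, v, o, t in triples:
    counts[subjects.index(s), verbs.index(v),
           objects.index(o), times.index(t)] += 1

# Crude stand-in for the real model: unfold along the subject mode and
# take a truncated SVD to get a low-dimensional representation.
unfolded = counts.reshape(len(subjects), -1)
u, sing, vt = np.linalg.svd(unfolded, full_matrices=False)

The names and the decomposition here are placeholders; the point is only that once the triples are in this shape, techniques of this family can be brought to bear.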

What would be the result? Think about what topic modeling does with language: it finds clusters of words that appear together in ways that seem coherent to human readers. But it does so in a way that is not predictable from the outset; it produces different clusters for different sets of texts, and those differences are what make it so valuable. They allow us to pick out the most salient concepts and discourses within a particular corpus, which might differ case-by-case. This technique appears to do the very same thing, but with relationships between groups of people over time. We might be able to capture local variations in models of identity within different communities.

I am not entirely certain that this would work, and I’d love to hear any feedback about difficulties that this approach might face! I also doubt I’ll get around to exploring the possibilities more thoroughly right now. But I would really like to see more humanists actively seeking out collaborators in statistics and computer science to work on projects like this. We have an opportunity in the next decade to actively influence research in those fields, which will have widespread influence in turn over political structures in the future. By abstaining from that kind of collaboration, we are reinforcing existing power structures. Let’s not pretend otherwise.


  1. With a few small adjustments. Since actors and objects are of the same kind in the model, the verbs would need to have a slightly different structure — possibly linking individuals through identity perceptions or acts of self-identification. 

Neoliberal Citations (and Singularities)

A chart showing the use of the word postmodernism over time.
Ever wonder where “postmodernism” went? I suspect “digital humanities” is headed there too. (Both Google Ngrams and COCA show the same pattern. COCA even lets you limit your search to academic prose!)

My first impulse upon reading last week’s essay in the LA Review of Books was to pay no attention. Nobody I know especially likes the name “digital humanities.” Many people are already adopting backlash-avoidant stances against it. “‘Digital Humanities’ means nothing” (Moretti); “I avoid the phrase when I can” (Underwood). As far as I can tell, the main advantage of “digital humanities” has been that it sounds better than “humanities computing.” Is it really worth defending? It’s an umbrella term, and umbrella terms are fairly easy to jettison when they become the targets of umbrella critiques.

Still, we don’t have a replacement for it. I hope we’ll find in a few years that we don’t need one. In the meanwhile we’re stuck in terminological limbo, which is all the more reason to walk away from debates like this. Daniel Allington, Sarah Brouillette, and David Golumbia (ABG hereafter) have not really written an essay about the digital humanities, because no single essay could ever be about something so broadly defined.

That’s what I told myself last week. But something about the piece has continued to nag at me. To figure out what it was, I did what any self-respecting neoliberal apologist would do: I created a dataset.

A number of responses to the essay have discussed its emphasis on scholars associated with the University of Virginia (Alan Jacobs), on its focus on English departments (Brian Greenspan), and on its strangely dismissive attitude towards collaboration and librarianship (Stewart Varner, Ted Underwood). Building on comments from Schuyler Esprit, Roopika Risam has drawn attention to a range of other occlusions — of scholars of color, scholars outside the US, and scholars at regional institutions — that obscure the very kinds of work that ABG want to see more of. To get an overview of these occlusions, I created a list of all the scholars ABG mention in the course of their essay, along with the scholars’ gender, field, last degree or first academic publication, year of degree or publication, and granting institution or current affiliation.1

The list isn’t long, and you might ask what the point is of creating such a dataset, given that we can all just — you know — read the article. But I found that in addition to supporting all the critiques described above, this list reveals another occlusion: ABG almost entirely ignore early-career humanists. With the welcome exception of Miriam Posner, they do not cite a single humanities scholar who received a PhD after 2004. Instead, they cite three scholars trained in the sciences:

Two, Erez Aiden and Jean-Baptiste Michel, are biostatisticians who were involved with the “culturomics” paper that — well, let’s just say it has some problems. The other, Michael Dalvean, is a political scientist who seems to claim that the subjective beliefs of anthology editors, once they are used to train a regression algorithm, suddenly become objective facts about poetic value.2 Are these really the most representative examples of DH work by scholars entering the field?

I’m still not entirely certain what to make of this final occlusion. Given the polemical character of their essay, I’m not surprised that ABG emphasize scholars from prestigious universities, and that, given the position of those scholars, most of them wind up being white men. ABG offer a rationale for their exclusions:

Exceptions too easily function as alibis. “Look, not everyone committed to Digital Humanities is a white man.” “Look, there are Digital Humanities projects committed to politically engaged scholarly methods and questions.” We are not negating the value of these exceptions when we ask: What is the dominant current supported even by the invocation of these exceptions?

I disagree with their strategy, because I don’t think invoking exceptions inevitably supports an exclusionary mainstream. But I see the logic behind it.

When it comes to early-career scholars, I no longer see the logic. Among all the humanities scholars who received a PhD in the last ten years, I would expect there to be representatives of the dominant current. I would also expect them to feature prominently in an essay like this, since they would be likely to play important roles directing that current in the future. The fact that they are almost entirely absent casts some doubt on one of the essay’s central arguments. Where there are only exceptions, no dominant current exists.

I share with ABG the institutional concerns they discuss in their essay. I do not believe that all value can be reduced to monetary value, and I am not interested in the digital humanities because it increases ROI. Universities and colleges are changing in ways that worry me. I just think those changes have little to do with technology in particular — they are fundamentally social and political changes. Reading the essay with that in mind, the absence of early-career humanists looks like a symptom of a more global problem. Let’s provisionally accept the limited genealogy that ABG offer, despite all it leaves out. Should we then assume that the field’s future will follow a linear trajectory determined only by its past? That a field that used to create neoliberal tools will mechanically continue to do so in spite of all efforts to the contrary? That would be a terrible failure of imagination.

In 1993, Vernor Vinge wrote a short paper called “The Coming Technological Singularity,” which argued that the rate of technological change will eventually outstrip our ability to predict or control that change. In it, he offers a quotation from Stanislaw Ulam’s “Tribute to John von Neumann” in which Ulam describes a conversation with von Neumann that

centered on the ever accelerating progress of technology and changes in the mode of human life, which gives the appearance of approaching some essential singularity in the history of the race beyond which human affairs, as we know them, could not continue.

Over the last decade, many optimistic technologists have described this concept with an evangelistic fervor that the staunchest Marxist revolutionary could admire. And yet this transformation is always framed as a technological transformation, rather than a social or political one. The governing fantasy of the singularity is the fantasy of an apolitical revolution, a transformation of society that requires no social intervention at all.3 In this fantasy, the realms of technology and politics are separate, and remain so even after all recognizable traces of the preceding social order have been erased. “Neoliberal Tools (and Archives)” seems to work with the same fantasy, transformed into a nightmare. In both versions, to embrace technology is to leave conscious political engagement behind.

But the singularity only looks like a singularity to technologists because they are used to being able to predict the behavior of technology. From the perspective of this humanist, things look very different: the singularity already happened, and we call it human society. When in the last five hundred years has it ever been possible for human affairs, as known at a given moment, to continue? What have the last five centuries of human history been if not constant, turbulent, unpredictable change? If a technological singularity arrives, it will arrive because our technological lives will have become as complex and unpredictable as our social and political lives already are. If that time comes, the work of technologists and humanists will be the same.

That might be an unsettling prospect. But we can’t resist it by refusing to build tools, or assuming that the politics of tool-building are predetermined from the outset. Instead, we should embrace building and using tools as inherently complex, unpredictable social and political acts. If they aren’t already, they will be soon.


  1. Please let me know if you see any errors in this data. 
  2. This would be like claiming that when carefully measured, an electorate’s subjective beliefs about a candidate become objective truths about that candidate. To be fair, I’m sure Dalvean would recognize the error when framed that way. I actually like the paper otherwise! 
  3. The genius of The 100 (yes, that show on CW) is that it breaks through this mindset to show what a technological singularity looks like when you add the politics of human societies back in. 

To Conquer All Mysteries by Rule and Line

I was very excited last week to read a preprint from Tal Yarkoni and Jacob Westfall on using predictive modeling to validate results in psychology. Their main target is a practice that they refer to — at first — as p-hacking. Some people might be more familiar with the practice under a different name, data dredging. In short, to p-hack is to manipulate data after seeing at least some portion of the results, with the aim (conscious or unconscious) of inflating significance scores.1

In their paper, Yarkoni and Westfall argue that the methodological apparatus of machine learning provides a solution to the problem that underlies p-hacking. In machine learning lingo, that problem is called overfitting. I find the argument very persuasive, in part based on my own experience. After a brief description of the paper, I’d like to share a negative result from my research that perfectly illustrates the overlap they describe between overfitting by machines and overfitting by humans.

Paranoid Android

An overfit dataset.
The green line tries too hard to fit the data perfectly; the black line makes more errors, but comes closer to the truth. Via Wikimedia Commons.

As an alternative to p-hacking, Yarkoni and Westfall offer a new term: procedural overfitting. This term is useful because it draws attention to the symmetry between the research practices of humans and the learning processes of machines. When a powerful machine learning algorithm trains on noisy data, it may assign too much significance to the noise. As a result, it invents a hypothesis that is far more complicated than the data really justifies. The hypothesis appears to explain that first set of data perfectly, but when tested against new data, it falters.
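
The familiar toy demonstration of the machine side — a generic numpy sketch, nothing from Yarkoni and Westfall’s paper — fits the same noisy data with a simple model and a wildly flexible one, then checks both against data neither has seen:

import numpy as np

rng = np.random.default_rng(1)

# Noisy observations of a simple underlying trend (y = x plus noise).
x = np.linspace(0, 1, 20)
y = x + rng.normal(scale=0.2, size=x.size)

# A second, held-out sample from the same process.
y_new = x + rng.normal(scale=0.2, size=x.size)

# A straight line versus a degree-15 polynomial: the flexible fit hugs the
# training points and usually does worse on the data it has never seen.
for degree in (1, 15):
    coefs = np.polyfit(x, y, degree)
    train_error = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_error = np.mean((np.polyval(coefs, x) - y_new) ** 2)
    print(degree, round(train_error, 4), round(test_error, 4))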

After laying out the above ideas in more detail, Yarkoni and Westfall make this claim: when researchers p-hack, they do exactly the same thing. They take noisy data and misappropriate the noise, building a theory more complex or nuanced than the evidence really justifies.

If that claim is true, then the tools machine learning researchers have developed to deal with overfitting can be reused in other fields. Some of those fields might have nothing to do with predicting categories based on features; they may be more concerned with explaining physical or biological phenomena, or with interpreting texts. But insofar as they are hampered by procedural overfitting, researchers in those fields can benefit from predictive methods, even if they throw out the predictions afterwards.

Others have articulated similar ideas before, framed in narrower ways. But the paper’s illustration of the cross-disciplinary potential of these ideas is quite wonderful, and it explains the fundamental concepts from machine learning lucidly, without requiring too much prior knowledge of any of the underlying algorithms.

A Cautionary Tale

This is all especially relevant to me because I was recently both a perpetrator and a victim of inadvertent procedural overfitting. Fortunately, using the exact techniques Yarkoni and Westfall talk about, I caught the error before reporting it triumphantly as a positive result. I’m sharing this now because I think it might be useful as a concrete example of procedural overfitting, and as a demonstration that it can indeed happen even if you think you are being careful.

At the beginning of the year, I started tinkering with Ted Underwood and Jordan Sellers’ pace-of-change dataset, which contains word frequency counts for 720 volumes of nineteenth-century poetry. Half of them were sampled from a pool of books that were taken seriously enough to receive reviews — whether positive or negative — from influential periodicals. The other half were sampled from a much larger pool of works from HathiTrust. Underwood and Sellers found that those word frequency counts provide enough evidence to predict, with almost 80% accuracy, whether or not a given volume was in the reviewed subset. They used a Logistic Regression algorithm that incorporated a regularization method similar to those Yarkoni and Westfall describe in their paper. You can read more about the corpus and the associated project on Ted Underwood’s blog.

Inspired by Andrew Goldstone’s replication of their model, I started playing with the model’s regularization parameters. Underwood and Sellers had used an L2 regularization penalty.2 In the briefest possible terms, this penalty measures the model’s distance from zero, where distance is defined in a space of possible models, and each dimension of the space corresponds to a feature used by the model. Models that are further from zero on a particular dimension put more predictive weight on the corresponding feature. The larger the model’s total distance from zero, the higher the regularization penalty.

Goldstone observed that there might be a good reason to use a different penalty, the L1 penalty.3 This measures the model’s distance from zero too, but it does so using a different concept of distance. Whereas the L2 distance is plain old Euclidean distance, the L1 distance is a simple sum of the distances along each dimension.4 What’s nice about L1 regularization is that it produces sparse models. That simply means that the model learns to ignore many features, focusing only on the most useful ones. Goldstone’s sparse model of the pace-of-change corpus does indeed learn to throw out many of the word frequency counts, focusing on a subset that does a pretty good job at predicting whether a volume was reviewed. However, it’s not quite as accurate as the model based on L2 regularization.
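
In generic scikit-learn terms — this is a toy illustration on made-up data, not the pace-of-change code itself — the contrast looks like this:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up stand-in data: 200 "volumes" with 500 word-frequency-like features.
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)

l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=5000).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

# L2 shrinks every weight toward zero; L1 sets many weights to exactly zero.
print("nonzero L2 weights:", np.count_nonzero(l2.coef_))
print("nonzero L1 weights:", np.count_nonzero(l1.coef_))

The L2 model keeps every feature with some small weight; the L1 model zeroes most of them out, which is what makes it so much easier to read.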

I wondered if it would be possible to improve on that result.5 A model that pays attention to fewer features is easier to interpret, but if it’s less accurate, we might still prefer to pay attention to the weights from the more accurate model. Additionally, it seemed to me that if we want to look at the weights produced by the model to make interpretations, we should also look at the weights produced by the model at different regularization settings. The regularization penalty can be turned up or down; as you turn it up, the overall distance of the model from zero goes down. What happens to the individual word weights as you do so?

It turns out that for many of the word weights, the result is uninteresting. They just go down. As the L2 regularization goes up, they always go down, and as the L1 regularization goes up, they always go down. But a few words do something different occasionally: as the L1 regularization goes up, they go up too. This is surprising at first, because we’re penalizing higher values. When those weights go up, they are pushing the model further away from zero, not closer to zero, as expected. This is a bit like watching a movie, turning down the volume, and finding that some of the voices get louder instead of softer.
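
Here is roughly what that bookkeeping looks like — again a generic sketch on synthetic data rather than my actual code, which runs over the pace-of-change features:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=100, n_informative=15,
                           random_state=0)

# Fit an L1-penalized model at a range of strengths; smaller C means a
# stronger penalty, so this walks from weak to strong regularization.
c_values = np.logspace(1, -2, 10)
path = np.array([
    LogisticRegression(penalty="l1", C=c, solver="liblinear").fit(X, y).coef_.ravel()
    for c in c_values
])

# Most absolute weights shrink step by step as the penalty rises; flag any
# feature whose absolute weight grows at some step instead.
grows_somewhere = np.any(np.diff(np.abs(path), axis=0) > 0, axis=0)
print("features that get louder at some point:", np.flatnonzero(grows_somewhere))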

For these steps away from zero to be worthwhile, the model must be taking even larger steps towards zero for some other set of words. That suggests that these words might have especially high explanatory power (at least for the given dataset). And when you collect a set of words like this together, they seem not to correspond to the specific words picked out by any single L1-regularized model. So what happens if we train a model using just those words? I settled on an automated way to select words like this, and I whittled it down to a 400-word list. Then I ran a new model using just those words as features. And after some tinkering, I found that the model was able to successfully classify 97.5% of the volumes:

We have 8 volumes missing in metadata, and
0 volumes missing in the directory.

We have 360 positive, and
360 negative instances.
Beginning multiprocessing.
0
[...]
700
Multiprocessing concluded.

If we divide the dataset with a horizontal line at 0.5, accuracy is: 0.975
Divided with a line fit to the data trend, it's 0.976388888889

I was astonished. This test was based on the original code from Underwood and Sellers, which does a very rigorous version of k-fold cross-validation. For every prediction it makes, it trains a new model, holding out the single data point it’s trying to predict, as well as data points corresponding to other works in the corpus by the same author. That seems rock-solid, right? So this can’t possibly be overfitting the data, I thought to myself.

Then, a couple of hours later, I remembered a conversation I had seen about a controversial claim that epigenetic markers are associated with sexual orientation in men. The study behind the claim was being criticized for using faulty methods: the researchers had modified their model based on information from their test set. And that means it’s no longer a test set.

I realized I had just done the same thing.

Fortunately for me, I had done it in a slightly different way: I had used an automated feature selection process, which meant I could go back and test the process in a way that did not violate that rule. So I wrote a script that followed the same steps, but used only the training data to select features. I ran that script using the same rigorous cross-validation strategy. And the amazing performance went away:

If we divide the dataset with a horizontal line at 0.5, accuracy is: 0.7916666666666666
Divided with a line fit to the data trend, it's 0.798611111111

This is a scanty improvement on the standard model from Underwood and Sellers — so scanty that it could mean nothing. That incredible performance boost was only possible because the feature selector could see all the answers in advance.
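
For anyone who wants the general pattern rather than my particular script — which builds on Underwood and Sellers’ own evaluation harness — the fix amounts to doing the selection inside each fold, for instance by wrapping it in a scikit-learn pipeline and grouping the folds by author (synthetic data below, with hypothetical numbers of volumes and authors):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 200 "volumes" by 50 "authors," 500 word-count features.
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)
authors = np.repeat(np.arange(50), 4)

# The selector lives inside the pipeline, so each fold re-selects its
# features using only that fold's training volumes; grouping by author also
# keeps a writer's other books out of the training set for that fold.
model = Pipeline([("select", SelectKBest(f_classif, k=50)),
                  ("classify", LogisticRegression(max_iter=5000))])
scores = cross_val_score(model, X, y, groups=authors, cv=GroupKFold(n_splits=5))
print(scores.mean())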

I still think this result is a little bit interesting because it uses a dramatically reduced feature set for prediction — it throws out roughly 90% of the available word frequency data, retaining a similar (but slightly different) subset for each prediction it makes. That should give us some more confidence that we aren’t throwing out very much important information by focusing on reduced feature sets in our interpretations. Furthermore, the reduced sets this approach produces aren’t the same as the reduced sets you’d get by simply taking the highest-weighted features from any one run. This approach is probably providing some kind of distinctive insight into the data. But there’s nothing earth-shaking about it — it’s just a run-of-the-mill iterative improvement.6

My Intentions Are Good, I Use My Intuition

In this case, my error wasn’t too hard to catch. To find that first, magical feature set, I had used an automated process, which meant I could test it in an automated way — I could tell the computer to start from scratch. Computers are good at unlearning things that way. But humans aren’t.

Suppose that instead of using an automated process to pick out those features, I had used intuition. Intuition can overfit too — that’s the essence of the concept of procedural overfitting. So I would have needed to cross-validate it — but I wouldn’t have been able to replicate the intuitive process exactly. I don’t know how my intuition works, so I couldn’t have guaranteed that I was repeating the exact same steps. And what’s worse, I can’t unsee results the way a machine can. The patterns I had noticed in those first results would have unavoidably biased my perception of future results. To rigorously test my intuitive feature selections, I would have needed to get completely new data.

I’ve had conversations with people who are skeptical of the need to maintain strict separation between training, cross-validation, and test data. It’s an especially onerous restriction when you have a small dataset; what do you do when you have rigorously cross-validated your approach, and then find that it still does poorly on the test set? If you respect the distinction between cross-validation and test data, then you can’t change your model based on the results from the test without demoting the test data. If you use that data for tuning, it’s just not test data anymore — it’s a second batch of cross-validation data.

Skeptics may insist that this is overly restrictive. But my experiences with this model make me certain that it’s vital if you aren’t tuning your model in a precisely specified way. Even now, I feel a little uneasy about this feature selection method. It’s automated, and it performs as well as the full model in the tests I’ve run, but I developed it after looking at all the data. There was still a small flash of intuition that led me to notice that some words were getting louder instead of softer as I turned down the volume. And it remains possible that future data won’t behave the same way! In an ideal world, I would have held out a final 20% for testing, just to be safe. But at least in this case, I have the comfort of knowing that even without having done that, I was able to find and correct a significant overfitting problem, because I used an automated feature selection method.


  1. Discussions of this practice pop up occasionally in debates between Bayesians and advocates of classical significance testing. Bayesians sometimes argue that p-values can be manipulated more easily (or at least with less transparency) than tests based on Bayesian techniques. As someone who had long been suspicious of certain kinds of scientific results, I found my preconceptions flattered by that argument when I first heard it, and I became very excited about Bayesian methods for a while. But that excitement wore off, for reasons I could not have articulated before. 
  2. A lot of people call this ridge regression, but I always think — ridge of what? L2 refers to something specific, a measure of distance. 
  3. Also known as the LASSO. But again — huh? 
  4. This one has a nickname I understand: the Manhattan distance. Like most city blocks in Manhattan, it disallows diagonal shortcuts. 
  5. After writing this, I realized that Ben Schmidt has also been thinking about similar questions. 
  6. I’m still getting the code that does this into shape, and once I have managed that, I will write more about the small positive result buried in this large negative result. But adventurous tinkerers can find the code, such as it is, on github, under the “feature/reg-and-penalty” branch of my fork of the original paceofchange repository. Caveat lector! It’s terribly documented, and there are several hard-coded parameters that require tweaking. The final result will have at least a primitive argparse-based console interface. 

The Volume and Surface Area of Computer Programs

A cube.

I’ve been refactoring some software. It’s not a fun task, exactly, but there’s something strangely satisfying about it — it’s a bit like folding socks. In the past, I have found the concepts of “coupling” (bad) and “cohesion” (good) useful for this kind of work. But the definitions of those concepts have always seemed a bit vague to me. The Wikipedia page on cohesion offers the following definition: “the degree to which the elements of a module belong together.” The page on coupling isn’t much better: “a measure of how closely connected two routines or modules are.”

What does that mean? Belong together in what sense? Closely connected how? Aren’t those essentially synonyms?

If I were in a less lazy mood, I’d look up the sources cited for those definitions, and find other, more reliable (?) sources than Wikipedia. But I’ve done that before, and couldn’t find anything better. And as I’ve gained experience, I’ve developed a sense that, yes, some code is tightly coupled — and just awful to maintain — and some code is cohesive without being tightly coupled — and a joy to work with. I still can’t give good definitions of coupling or cohesion. I just know them when I see them. Actually, I’m not even sure I always know them when I see them. So I’ve spent a lot of time trying to figure out a more precise way to describe these two concepts. And recently, I’ve been thinking about an analogy that might help explain the difference — it links the relationship between cohesion and coupling to the relationship between volume and surface area.

The analogy begins with the idea of interfaces. Programmers often spend long hours thinking about how to define precise channels of communication between pieces of software. And many programming styles — object-oriented development, for example — emphasize the distinction between private and public components of a computer program. Programs that don’t respect that distinction are often troublesome because there are many different ways to modify the behavior of those programs. If there are too many ways, and if all those ways are used, then it becomes much more difficult to predict the final behavior that will result.

The net of a cube.
The net (unfolded surface) of a cube.

Suppose we think of interfaces and public variables as the surface of a program, and think of private variables and methods as being part of the program’s interior — as contributing to its volume. In a very complex program, enforcing this distinction between public and private becomes like minimizing the surface area of the program. As the program gets more and more complex, it becomes more and more important to hide that complexity behind a much simpler interface, and if the interface is to remain simple, its complexity must increase more slowly than the complexity of the overall program.
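
In Python-ish terms, the picture I have in mind is something as simple as this toy sketch (not anything from the code I’ve actually been refactoring):

class Smoother:
    """A small public surface wrapped around a larger private interior."""

    def __init__(self, window=100):
        # Interior: implementation details that callers shouldn't depend on.
        self._window = window

    def smooth(self, values):
        # Surface: the one public channel into and out of the object.
        w = self._window
        return [sum(values[i:i + w]) / w
                for i in range(len(values) - w + 1)]

Everything reachable only through smooth() is volume; smooth() itself is surface.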

What if the relationship between these two rates — the rate of increase in complexity of the interface and the rate of increase in complexity of the overall program — is governed by a law similar to the law governing the relationship between surface area and volume? Consider a cube. As the cube grows in size, its volume grows faster than its surface area. To be precise (and forgive me if this seems too obvious to be worth stating), its volume is x^3, while its surface area is 6 \times x^2 for a given edge length x.
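
Written as a ratio, that comparison becomes \frac{6 x^2}{x^3} = \frac{6}{x}, which shrinks toward zero as the edge length grows: the interior inexorably outpaces the surface.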

Biologists have used this pattern to explain why the cells of organisms rarely grow beyond a certain size. As they grow larger, the nutrient requirements of the interior of the cell increase more quickly than the surface’s capacity to transmit nutrients; if the cell keeps growing, its interior will eventually starve. The cell can avoid that fate for a while by changing its shape to expand its surface area. But the cell can only do that for so long, because the expanded surface area costs more and more to maintain. Eventually, it must divide into two smaller cells or die.

Might this not also give us a model that explains why it’s difficult to develop large, complex programs without splitting them into smaller parts? On one hand, we have pressure to minimize interface complexity; on the other, we have pressure to transmit information into the program more efficiently. As the program grows, the amount of information it needs increases, but if the number of information inlets increases proportionally, then soon, it becomes too complex to understand or maintain. For a while, we can increase the complexity of the program while keeping the interface simple enough just by being clever. But eventually, the program’s need for external information overwhelms even the most clever design. The program has to be divided into smaller parts.

So what do “coupling” and “cohesion” correspond to in this model? I’m not sure exactly; the terms might not be defined clearly enough to have precise analogs. But I think coupling is closely related to — returning to the cell analogy now — the nutrient demand of the interior of the cell. If that demand goes unchecked, the cell will keep expanding its surface area. It will wrinkle its outer membrane into ever more complex and convoluted shapes, attempting to expose more of its interior to external nutrient sources. At this point in the analogy, the underlying concept becomes visible; here, the cell is tightly coupled to its environment. After this coupling exceeds some threshold, the cell’s outer membrane becomes too large and complex to maintain.

A drop of water.

In turn, cohesion is closely related to the contrary impulse — the impulse to reduce surface area.

Imagine a drop of water on a smooth surface. It rests lightly on the surface, seeming almost to lift itself up. Why? It turns out that surfaces take more energy to maintain than volumes; they cost more. Molecules in the interior of the drop can take configurations that molecules on the surface can’t. Some of those configurations have lower energy than any possible configuration on the surface, so molecules on the surface will tend to “fall” into the interior. The drop will try to minimize its surface area, in much the way a marble in a bowl will roll to the bottom. And the shape with the lowest surface-area-to-volume ratio is a sphere.

We have different words for this depending on context; it is the phenomenon we sometimes name “surface tension.” Water striders can glide across the surface of a pond because the water wants to minimize its surface area. The water does not adhere to their limbs; it coheres; it remains decoupled. These ways of thinking about cohesion and coupling make the concepts seem a bit less mysterious to me.

The Size of the Signifier

What is a feature?

It’s worth thinking about the way machine learning researchers use the word “feature”. They speak of “feature selection,” “feature engineering,” and even “automated feature learning.” These processes generally produce a structured body of input — a set of “feature vectors” that machine learning algorithms use to make predictions. But the definition of “feature” is remarkably loose. It amounts to something like “a value we can measure.”

Convolutional Neural Network feature visualization.
Three levels of features learned by a convolutional neural network. From Lee, Grosse, Ranganath, Ng, “Convolutional Deep Belief Networks”

People unfamiliar with machine learning might imagine that in the context of (say) face recognition, a “feature” would be the computational equivalent of “high cheekbones” or “curly hair.” And they wouldn’t be far off the mark to think so, in a way — but they might be surprised to learn that often, the features used by image recognition software are nothing more than the raw pixels of an image, in no particular order. It’s hard to imagine recognizing something in an image based on pixels alone. But for many machine learning algorithms, the measurements that count as features can be so small as to seem almost insignificant.
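
If that sounds implausible, a stock scikit-learn exercise (nothing to do with any particular face-recognition system) makes the point: hand a plain logistic regression nothing but flattened pixel values, and it still classifies handwritten digits quite well.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Each 8x8 image becomes a flat list of 64 raw pixel values; those pixels,
# in no richer arrangement than that, are the features.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # usually well above 0.9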

This is possible because researchers have found ways to train computers to assemble higher-level features from lower-level ones. And we do the same thing ourselves, in our own way — we just don’t have conscious access to the raw data from our retinas. The features that we consciously recognize are higher-level features. But those had to be assembled too, and that’s part of what our brain does.1 So features at one level are composites of smaller features. Those composite features might be composed to form even larger features, and so on.

When do things stop being features?

One answer is that they stop when we stop caring about their feature-ness, and start caring about what they mean. For example, let’s change our domain of application and talk about word counts. Word counts are vaguely interesting to humanists, but they aren’t very interesting until we start to build arguments with them. And even then, they’re just features. Suppose we have an authorship attribution program. We feed passages labeled with their authors into the program; it counts the words in the passages and does some calculations. Now the program can examine texts it hasn’t seen before and guess the author. In this context, that looks like an interpretive move: the word counts are just features, but the guess the program makes is an interpretation.
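
The skeleton of such a program is short enough to sketch — a toy version, with a few famous opening lines standing in for real training passages:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny labeled "passages"; real work would use far more text per author.
passages = ["it is a truth universally acknowledged that a single man ...",
            "call me ishmael some years ago never mind how long precisely ...",
            "the family of dashwood had long been settled in sussex ...",
            "whenever i find myself growing grim about the mouth ..."]
authors = ["austen", "melville", "austen", "melville"]

# Word counts are the features; the author label is what the program guesses.
model = Pipeline([("counts", CountVectorizer()),
                  ("classify", MultinomialNB())])
model.fit(passages, authors)
print(model.predict(["there is nothing so soothing as the sea ..."]))  # its best guess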

We can imagine another context in which the author of a text is a feature of that text. In such a context, we would be interested in some other property of the text: a higher-level property that depends somehow on the identity of the text’s author. But that property would not itself be a feature — unless we keep moving up the feature hierarchy. This is an anthropocentric definition of the word “feature”: it depends on what humans do, and so gives up some generality. But for that same reason, it’s useful: it shows that we determine by our actions what counts as a feature. If we try to force ourselves into a non-anthropocentric perspective, it might start to look like everything is a feature, which would render the word altogether useless.

I think this line of reasoning is useful for thinking through this moment in the humanities. What does it mean for the humanities to be “digital,” and when will the distinction fade? I would guess that it will have something to do with coming shifts in the things that humanists consider to be features.

In my examples above, I described a hierarchy of features, and without saying so directly, suggested that we should want to move up that hierarchy. But I don’t actually think we should always want that — quite the opposite. It may become necessary to move back down a level or two at times, and I think this is one of those times. Things that used to be just features are starting to feel more like interpretations again. This is how I’m inclined to think of the idea of “Surface Reading” as articulated by Stephen Best and Sharon Marcus a few years ago. The metaphor of surface and depth is useful; pixels are surface features, and literary pixels have become interesting again.

Why should that be so? When a learning algorithm isn’t able to answer a given question, sometimes it makes sense to keep using the same algorithm, but return to the data to look for more useful features. I suspect that this approach is as useful to humans as to computers; in fact, I think many literary scholars have adopted it in the last few years. And I don’t think it’s the first time this has happened. Consider this observation that Best and Marcus make about surface reading in that article:

This valorization of surface reading as willed, sustained proximity to the text recalls the aims of New Criticism, which insisted that the key to understanding a text’s meaning lay within the text itself, particularly in its formal properties.

From a twenty-first century perspective, it’s sometimes tempting to see the New Critics as conservative or even reactionary — as rejecting social and political questions as unsuited to the discipline. But if the analogy I’m drawing is sound, then there’s a case to be made that the New Critics were laying the ground necessary to ask and answer those very questions.2

In his recent discussion of the new sociologies of literature, Ted Underwood expresses concern that “if social questions can only be addressed after you solve all the linguistic ones, you never get to any social questions.” I agree with Underwood’s broader point — that machine learning techniques are allowing us to ask and answer social questions more effectively. But I would flip the argument on its head. The way we were answering linguistic questions before was no longer helping us answer social questions of interest. By reexamining the things we consider features — by reframing the linguistic surfaces we study — we are enabling new social questions to be asked and answered.


  1. Although neural networks can recognize images from unordered pixels, it does help to impose some order on them. One way to do that is to use a convolutional neural network. There is evidence that the human visual cortex, too, has a convolutional structure. 
  2. In the early twentieth century, the relationships between physics, chemistry, biology, and psychology were under investigation. The question at hand was whether such distinctions were indeed meaningful — were chemists and biologists effectively “physicists in disguise”? A group of philosophers sometimes dubbed the “British Emergentists” developed a set of arguments defending the distinction, arguing that some kinds of properties are emergent, and cannot be meaningfully investigated at the atomic or chemical level. It seems to me that linguistics, literary study, and sociology have a similar relationship; linguists are the physicists, literary scholars the chemists, and sociologists the biologists. We all study language as a social object, differing more in the levels at which we divide things into features and interpretations — morpheme, text, or field. And in this schema, I think the emergent relationships do not “stack” on top of psychology and go up from there. That would suggest that sociology is more distant from psychology than linguistics. But I don’t think that’s true! All three fields seem to me to depend on psychology in similar ways. The emergent relationships between them are orthogonal to emergent relationships in the sciences. 

On Thomas Bayes

This post is part of a larger piece on Thomas Bayes, and two other nonconformist ministers, Richard Price and William Godwin (!). It started life as a footnote and grew, so I gave it a new home, with a bit more room to stretch its legs. The first part is a brief introduction to Bayesian probability, and the second part talks about Bayes himself and (very briefly) about Price, Bayes’ literary executor.

Bayesian Probability

A lot of people have been talking about Bayesian probability over the last decade or so. I don’t think this is a new thing, but it seems to have taken on a new kind of urgency, perhaps because a lot of important ideas in machine learning and text mining are Bayesian, or at least Bayesianish. If you’re not familiar with Bayesianism, it boils down to the assertion that probability measures our ignorance about the world, rather than the world’s own irreducible uncertainty. This is sometimes called a “subjectivist” way of thinking about probability, although I’ve never found that an especially useful term. What does it mean to say that ignorance is subjective?

I like the way Thomas Bayes himself put it, at least once it’s rephrased into straightforward language. It has to do with expectation values — say, the expected weight of a randomly chosen apple. According to an ordinary way of thinking about probability, if we wanted to find out the weight of “the average apple,” we’d start with a probability distribution over apple weights, and then calculate the weighted mean. Let’s reduce the size of the problem and say we’re bobbing for apples. There are five eight-ounce apples and three four-ounce apples in the barrel; that’s a total of 5 \times 8 + 3 \times 4 = 52 ounces, distributed over eight apples, for an expected value of six-and-a-half ounces. Assuming you have no particular apple-bobbing skills, but that you don’t stop until you get an apple, you’d probably get a good one five out of eight times. That means the expected value of one of the good ones is \frac{5}{8} \times 8 = 5, and the expected value of one of the bad ones is \frac{3}{8} \times 4 = 1.5. This sums to 6.5, the weight of “the average apple” in the barrel.

Bayes turned that on its head. He suggested that we start not with probabilities, but with the expectations themselves:

The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon it’s [sic] happening.

In other words, Bayes defined probability by starting with expected values, and calculated the resulting probabilities based on them. If you are used to being offered probabilities as givens, this may look like a strange way of doing things. But mathematically, it’s no different from the more familiar approach. To find out the probability of getting a good apple, we just take the expected value of a good apple — the proportion of apple we can expect, in the long run, to come from good apples if we keep on bobbing1 — and divide it by the actual value of that good apple we just got: \frac{5}{8}. The result is the same as the probability we started with above.
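
If it helps to see the two directions side by side, here they are as a few lines of Python (the same numbers as above, nothing new):

# Five eight-ounce apples and three four-ounce apples in the barrel.
weights = [8] * 5 + [4] * 3

# The ordinary route: probabilities first, expectation second.
expected_weight = sum(weights) / len(weights)        # 52 / 8 = 6.5 ounces

# Bayes' route: expectation first, probability second. Of the 6.5 ounces
# you expect per bob, 5 ounces come from good apples in the long run;
# divide that by the 8 ounces a good apple is actually worth.
expectation_from_good_apples = 5.0
probability_of_good_apple = expectation_from_good_apples / 8   # 0.625 = 5/8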

A skeptic might now ask “but how do we know what the expected value is if we don’t already know the probability?” This is a great question. To answer it, I’m going to use a different example: a fiery plane crash.

We are told all the time that the probability of dying in a fiery plane crash is very low. “You’re more likely to be struck by lightning!” people say. Fair enough. But do we really know that? Isn’t that just hearsay? Somebody told somebody else told somebody else some numbers, and they did some math and told somebody else, and so on. I didn’t see it with my own eyes; why should I trust those people? When I step into an airplane cabin and momentarily imagine myself and all the people around me screaming in terror as we plunge towards our doom, you’ll forgive me if I have a moment of doubt.

For Bayes, the question is this: how much doubt? And the answer, in my case, is “not much.” I’ve ridden on planes many times now and nothing has gone wrong. I’m still alive. Everyone I know has done the same — I’ve never known someone who has died in a plane crash. I’ve never even known someone who knows someone who has died in a plane crash, which is probably why I’m being so insensitive to plane crash victims and their close relations right now. (Apologies!) I am strongly plane-crash averse, but I keep riding planes, and I don’t worry about it very much. This tells my inner Bayesian that plane crashes aren’t very probable. In fact,

\text{probability of a plane crash} = \frac{\text{how worried I am about dying in a fiery plane crash}}{\text{how unpleasant it would be to actually die that way}}

According to this point of view, we can tell plane crashes aren’t very probable because I’m not worried about dying in one, even though it would be a truly awful way to die.

This way of thinking might seem bonkers, epistemologically speaking. But it’s mathematically identical to the other way of thinking about probability. And to tell the truth, I’ve made it sound more bonkers than it is by suggesting that my level of worry directly influences the probability of a plane crash.

The reason it sounds that way is that I left out a word: “ought.” If you look back at Bayes’ definition, you’ll notice that he didn’t talk about the expectation that we speculate is correct — he talked about the expectation that “ought to be computed.” That is a wonderful sleight of hand, isn’t it? Look what happens if I encounter someone who is extremely worried about dying in a plane crash. Armed with Bayes’ definition, I simply say “you’re more worried than you ought to be.”

\text{probability of a plane crash} = \frac{\text{how worried } \textbf{you ought to be} \text{ about dying in a fiery plane crash}}{\text{how unpleasant it would be to actually die that way}}

How do we know how worried we ought to be about dying in a plane crash? One way might be to add up all the pain and suffering caused by plane crashes, and divide it by the number of times everyone on earth has ever ridden a plane. Then we could compare that to the pain and suffering caused by lightning strikes, automobile accidents, the scratching of one’s finger, and so on. This again requires us to believe in some hearsay, but now it’s not so abstract. We’re talking about more than numbers now.

On the other hand, it requires us to quantify some very hard-to-quantify things, like human suffering. Maybe that’s another reason why people call this a “subjectivist” interpretation of probability. But a value isn’t subjective just because it’s hard to quantify, and an interpretation of probability isn’t epistemologically faulty just because it’s hard to act on. As someone who believes in the reality of human suffering, and who is interested in reducing it, I find this way of thinking about probability quite appealing, because it tells me something specific about the stakes.

Note also that the haziness of this “subjective” factor divides out of the equation. It ought to affect both the top of the fraction — our worry — and the bottom of the fraction — our actual suffering — with equal proportion. So in other words, if you know with great certainty that you won’t suffer in a crash because you will just pass out the instant the turbulence gets bad, then you ought to be proportionally less worried about a crash. (You might then be more worried about other aspects of flying though!)

In a short post like this, it’s hard to cover all the disagreements that arise over Bayesian interpretations of probability, and I have used a different starting point than most introductions. They generally start with Bayes’ rule,

p(C|E) = \frac{p(E|C) \cdot p(C)}{p(E)}

which I left until now, because it tells you less about Bayesian probability than many people seem to think. This formula just tells us how to combine prior beliefs about a claim (p(C)) with new evidence (E and its probabilities p(E) and p(E|C)) to get an updated belief (p(C|E)). It’s just as useful to non-Bayesians as Bayesians — as long as you have a prior belief. The disagreement between Bayesians and non-Bayesians is over what makes a good prior belief. Thinking back to the plane crash example, this is the belief you have before you ever step on a plane. And here, Bayes’ “ought” comes in again; Bayesians are generally committed to the point of view that we can choose a correct (or good enough) prior belief without direct evidence, and let repeated applications of the above rule take care of the rest. You have to have a prior belief that allows you to step onto the plane at all. But once you’ve taken that first step, you can adjust your level of worry about clalefactostratospheric mortality in a way that’s governed by the evidence you’ve collected.
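
For the record, a single application of the rule looks like this — the numbers are purely illustrative:

# Illustrative numbers only: a prior belief in some claim C, plus how likely
# the evidence E would be if C were true and if it were false.
p_c = 0.01             # prior probability of the claim
p_e_given_c = 0.9      # probability of the evidence if C is true
p_e_given_not_c = 0.1  # probability of the evidence if C is false

# Total probability of the evidence, then Bayes' rule for the update.
p_e = p_e_given_c * p_c + p_e_given_not_c * (1 - p_c)
p_c_given_e = p_e_given_c * p_c / p_e
print(round(p_c_given_e, 3))   # about 0.083: stronger than before, still small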

But we may not ever agree about the choice of prior — it’s “subjective,” at least if you think that anything we can’t agree on is subjective. This means that if you are really, really, really worried about dying in a plane crash, there’s nothing I can do to persuade you to change your mind.

But then — that’s the simple truth. There isn’t anything I can do to change your mind. Non-Bayesians want a more principled way of choosing a prior distribution, one that isn’t “subjective.” They want to be able to change your mind about plane crashes by reasoning with you. But since there’s disagreement about what principles to use, and there probably always will be, I’m not sure anything is gained by that approach. At least the Bayesian framework formalizes the disagreement, showing us what we can and what we can’t agree about, and why.

So that’s Bayesian probability in brief — probably too brief, but so be it.

Thomas Bayes and Richard Price

Thomas Bayes himself has a strange position in history. On Wikipedia, the list of things named after Bayes is fully half as long as the list of things named after Einstein, and almost as long as the list of things named after Newton. I’m not quite sure what that measures, but it must not be productivity: Bayes published just two works in his lifetime, and two short papers of his were published posthumously. One of those, “An Essay towards solving a Problem in the Doctrine of Chances,” is the source of all his fame, and of the quotation about expectations and oughts above. It contained a proof of the above formula for conditional probability, and used it to give a mathematically precise way to reason about the conclusions that a given body of evidence can support. His other works have been almost entirely forgotten.

That might be because Bayes was not really a mathematician, so much as a minister with a challenging hobby. One of the two works published while he was alive was a learned treatise on God’s Benevolence — not a genre that has much currency today. There has been some speculation that the second work, an explication and defense of Newton’s formulation of calculus, was well-received, and may have led to his election to the Royal Society2, but I haven’t found much primary reception evidence.

Even that work appears to have been motivated to some degree by Bayes’ religious views. The text is nominally a defense of Newton’s formulation of calculus against George Berkeley’s criticisms in The Analyst. But remarkably (at least to me), the book begins with a discussion of theology. In his preface, Bayes attempts to disentangle questions about mathematics and questions about religion. Berkeley, he says, had inappropriately linked the two:

Had it been his only design to bring this point to a fair issue, whether a demonstration by the method of Fluxions be truly scientific or not, I should have heartily applauded his conduct… but the invidious light in which he has put this debate, by representing it as of consequence to the interests of religion, is, I think, truly unjustifiable, as well as highly imprudent.

In Bayes’ account, Berkeley argued that because calculus was based on mysterious infinitesimal values, belief in the truth of calculus is no different from religious belief. This account of Berkeley’s argument strikes me as oversimplified, or at least a bit glib. Many of Berkeley’s basic mathematical objections were substantive, and would not be fully addressed by mathematicians until more than a century later. Bayes largely ignored these objections. But the broader argument that he made in the preface — that the mathematical questions at hand have no bearing on religious questions at all — has a modern ring. It reminds me a bit of Stephen Jay Gould’s famous characterization of science and religion as “non-overlapping magisteria.”

Thirty years after he published his work on calculus, Bayes’ most famous work appeared in the Transactions of the Royal Society, with a preface by Richard Price, Bayes’ literary executor, fellow minister, and long-time friend. Price’s preface to the essay made the remarkable claim that Bayes’ proof could be the basis of a probabilistic argument for the existence of a creator. The goal of such an argument would be this:

to shew what reason we have for believing that there are in the constitution of things fixt laws according to which things happen, and that, therefore, the frame of the world must be the effect of the wisdom and power of an intelligent cause; and thus to confirm the argument taken from final causes for the existence of the Deity.

Bayes had disentangled mathematical and theological questions; Price re-entangled them by arguing from intelligent design to the existence of an intelligent creator.

Was Price speaking for Bayes? It’s difficult to believe that Bayes would have chosen a literary executor who he knew would distort or misrepresent his argument. But if Bayes really did change his mind this way, it would be a bit like Gould dropping out of the evolutionary theory business, and setting up shop as a creationist. It would not have been quite as extreme, since Bayes was already a minister. But it nonetheless would represent a significant philosophical shift.

Three possible ways of explaining this tension occur to me. The first is that Bayes’ views did indeed shift, and that he came to the conclusion that mathematical questions really can have theological relevance. The second is that Price is simply using the publication of Bayes’ paper as an opportunity to push his own philosophical agenda. And the third is that Bayes’ argument in the preface to his defense of Newton’s calculus was not a global epistemological claim, as Gould’s was, but a local intervention without bearing on other possible links between mathematical and theological debates.

That third possibility intrigues me the most. I have no particular interest in developing an argument for the existence of a creator deity. But suppose Bayes did feel that while questions about the foundation of calculus had no relevance to theology, his work on probability did. That suggests to me that even when it was first proposed, Bayesian probability had a kind of philosophical oomph that mathematical theorems don’t always have.

Price himself would go on to write a number of influential works, including one on the statistics of annuities and life insurance. Apparently Edmund Burke thought this was an unseemly topic for a member of the clergy, and gave Price a nickname: “the calculating divine.” Some of the actuarial tables Price created, the so-called Northampton tables, were in wide use for almost a century. At the same time, Price was an influential moral philosopher; his Review of the Principal Questions and Difficulties in Morals probably influenced William Godwin’s Enquiry Concerning Political Justice. This leaves us with a final puzzle: one might expect an actuary and statistician to be an empiricist — a skeptic like Hume for whom no evidence is fully persuasive. But Price argued against the mainstay of empiricist moral philosophy, moral sense theory. Like Godwin, Price was a moral rationalist.


  1. This requires you to replace the apples, so bite gingerly. Also, as a non-apple-bobber, I have no idea how the size of an apple affects its… um… bobbability. I take it for granted here that large and small apples are equally bobbable. 
  2. See Bellhouse, “The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of His Birth,” Statistical Science 2004, 19:1, 3–43. The work is attributed to James Hodgson in one Google Books copy, but I know of no reason to doubt the usual attribution represented here. However, if you want to actually read it, I recommend the misattributed copy; the scan quality is much higher. 

Old English Phonotactic Constraints?

Something interesting happens when you train a neural network to predict the next character given a text string. Or at least I think it might be interesting. Whether it’s actually interesting depends on whether the following lines of text obey the phonotactic constraints of Old English:

amancour of whad sorn on thabenval ty are orid ingcowes puth lee sonlilte te ther ars iufud it ead irco side mureh

It’s gibberish, mind you — totally meaningless babble that the network produces before it has learned to produce proper English words. Later in the training process, the network tends to produce words that are not actually English words, but that look like English words — “foppion” or “ondish.” Or phrases like this:

so of memmed the coutled

That looks roughly like modern English, even though it isn’t. But the earlier lines are clearly (to me) not even pseudo-English. Could they be pseudo-Old-English (the absence of thorns and eths notwithstanding)? Unfortunately I don’t know a thing about Old English, so I am uncertain how one might test this vague hunch.

Nonetheless, it seems plausible to me that the network might be picking up on the rudiments of Old English lingering in modern (-but-still-weird) English orthography. And it makes sense — of at least the dream-logic kind — that the oldest phonotactic constraints might be the ones the network learns first. Perhaps they are in some sense more fully embedded in the written language, and so are more predictable than other patterns that are distinctive to modern English.

It might be possible to test this hypothesis by looking at which phonotactic constraints the network learns first. If it happened to learn “wrong” constraints that are “right” in Old English — if such constraints even exist — that might provide evidence in favor of this hypothesis.

If you’d like to investigate this and see the kind of output the network produces, I’ve put all the necessary tools online. I’ve only tested this code on OS X; if you have a Mac, you should be able to get this up and running pretty quickly. All the commands below can be copied and pasted directly into Terminal. (Look for it in Applications/Utilities if you’re not sure where to find it.)

  1. Install homebrew — their site has instructions, or you can just trust me:
    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    
  2. Use homebrew to install hdf5:
    brew tap homebrew/science
    echo 'Now you are a scientist!'
    brew install hdf5
    
  3. Use pip to install PyTables:
    echo 'To avoid `sudo pip`, get to know'
    echo 'virtualenv & virtualenvmanager.'
    echo 'But this will do for now.'
    sudo -H pip install tables
    
  4. Get pann:
    git clone https://github.com/senderle/pann
    
  5. Generate the training data from text files of your choice (preferably at least 2000 KB of text):
    cd pann
    ./gen_table.py default_alphabet new_table.npy
    ./gen_features.py -t new_table.npy \
        -a default_alphabet -c 101 \
        -s new_features.h5 \
        SOME_TEXT.txt SOME_MORE_TEXT.txt
    
  6. Train the neural network:
    ./pann.py -I new_features.h5 \
        -C new_features.h5 -s 4000 1200 300 40 \
        -S new_theta.npy -g 0.1 -b 100 -B 1000 \
        -i 1 -o 500 -v markov
    
  7. Watch it learn!

It mesmerizes me to watch it slowly deduce new rules. It never quite gets to the level of sophistication that a true recurrent neural network might, but it gets close. If you don’t get interesting results after a good twenty-four or forty-eight hours of training, play with the settings — or contact me!

Log-Scale Reading

When I started this blog, I promised myself that I’d post once every two weeks. Recently, that has begun to feel like a challenge. According to that schedule, dear reader, I owed you a post approximately eleven days ago. Unfortunately, I have only a massive backlog of half-baked ideas.1

But the show must go on!2 I’ve decided to take inspiration from Andrew Piper, who recently introduced “Weird Idea Wednesday” with this astute observation:

Our field is in its infancy and there is no road map. Weird ideas have an important role to play, even if they help profile the good ideas more clearly.

Tis wrote only for the curious and inquisitive.
Sound advice from the first volume of Tristram Shandy (1760).

I couldn’t agree more. In fact, I thought the inaugural weird idea was quite wonderful! It involved using a “market basket” algorithm on texts — the amazon.com approach to diction analysis. My philosophy is, why shouldn’t a sentence be like a shopping cart? I have no idea whether this approach could be useful, but then — that’s the point.

I will forgive the skeptics in my audience for thinking that this is just a highfalutin justification for writing filler to meet an arbitrary publication schedule.3 You others: read on.

My half-baked idea for this Friday is that we should come up with a new kind of reading in addition to close and distant reading: log-scale reading. I’m not certain this is a good idea; I’m not even totally certain what it means. But think for a moment about why people use log scales: they reveal patterns that only become obvious when you use a scale that shows your data at many different levels of resolution.

For example, consider this chart:

An exponential-looking chart.

It’s a line chart of the following function:

y = 10 ** x + 0.5 * 6 ** x * np.sin(10 * x)

Now, you can see from the equation that there’s a lot of complexity here; but if you had only seen the graph, you’d notice nothing but the pattern of exponential growth. We’re zoomed way too far out. What happens when we zoom in?

A weird, squiggly chart.

Now we get a much better sense of the low-level behavior of the function. But we get no information about what happens to the right. Does it keep going up? Does it level off? We have no idea. Log scale to the rescue:

The same function plotted on a log scale.

This doesn’t make smaller patterns invisible, nor does it cut off our global view. It’s a much better representation of the function.
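If you want to reproduce charts along these lines, a minimal matplotlib sketch like the following should do; the axis ranges here are guesses rather than the ones used for the images above.

# Sketch: the function above at three scales: zoomed out, zoomed in,
# and with a log-scaled y axis. Axis ranges are guesses.
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 10 ** x + 0.5 * 6 ** x * np.sin(10 * x)

fig, (ax_out, ax_in, ax_log) = plt.subplots(1, 3, figsize=(12, 3))

x_wide = np.linspace(0, 6, 2000)
ax_out.plot(x_wide, f(x_wide))        # exponential growth swamps the squiggles
ax_out.set_title('zoomed out')

x_narrow = np.linspace(0, 2, 2000)
ax_in.plot(x_narrow, f(x_narrow))     # the low-level wiggles become visible
ax_in.set_title('zoomed in')

ax_log.plot(x_wide, f(x_wide))
ax_log.set_yscale('log')              # both scales visible at once
ax_log.set_title('log scale')

plt.tight_layout()
plt.show()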

Now, this may seem dreadfully obvious. It’s data visualization 101: choose the right scale for your data. But I find myself wondering whether there’s a way of representing texts that does the same thing. When discussing distant reading and macroanalysis, people talk a lot about “zooming out” and “zooming in,” but in other fields, that’s frequently not the best way to deal with data at differing scales. It’s often better to view your data at multiple scales simultaneously. I’d like to be able to do that with text.

So that illustrates, in a very hazy and metaphorical way, what log-scale reading would be. But that’s all I’ve got so far. In some future Filler Friday4 post, I’ll explore some possibilities, probably with no useful outcome. I’ll try to make some pretty pictures though.

One final question: Am I missing something? Does log-scale reading already exist? I’d love to see examples.5


  1. This is partially the result of personal circumstances; I spent the last month moving from Saratoga Springs to Philadelphia. But that’s no excuse! 
  2. This is not literally true. There’s no reason for the show to go on. 
  3. I decided not to call this feature of my blog “Filler Friday,” but I won’t object if you do. 
  4. OK, actually, it’s a pretty good name. 
  5. Since I posted this, there have been some interesting developments along these lines. In the Winter 2016 issue of Critical Inquiry, Hoyt Long and Richard Jean So make some persuasive arguments for this kind of multi-scale reading, although the task of visualizing it remains elusive, as does (I would argue) the task of developing arguments across multiple scales in fully theorized ways. 

Meaning, Context, and Algebraic Data Types

A few years ago, I read a paper making a startling argument. Its title was “The Derivative of a Regular Type is its Type of One-Hole Contexts.” I’m not entirely sure how I found it or why I started reading it, but by the time I was halfway through it, I was slapping myself on the forehead: it was so brilliant, and yet so obvious as to be almost trivial — how could nobody have thought of it before?

The idea was that if you take an algebraic data type — say something simple like a plain old list — and poke a hole in it, you get a new data type that looks like a derivative of the first data type. That probably sounds very abstract and uninteresting at first, especially if you aren’t very familiar with algebraic data types. But once you understand them, the idea is simple. And I think it has ramifications for people interested in natural language: it tells us something about the relationship between meaning and context. In particular, it could give us a new way to think about how semantic analysis programs like word2vec function: they perform a kind of calculus on language.

What’s an Algebraic Data Type?

An algebraic data type is really just a regular data type — seen through a particular lens. So it’s worth taking a moment to think through what a regular data type is, even if you’re already familiar with the concept.

A regular data type is just an abstract representation of the kind of data being stored or manipulated by a computer. Let’s consider, as a first example, the simplest possible item of data to be found in a modern computer: a bit. It can take just two values, zero and one. Most programming languages provide a type describing this kind of value — a boolean type. And values of this type are like computational atoms: every piece of data that modern computers store or manipulate is made of combinations of boolean values.1

So how do we combine boolean values? There are many ways, but here are two simple ones that cover a lot of ground. In the first way, we take two boolean values and declare that they are joined, but can vary independently. So if the first boolean value is equal to zero, the second boolean value may be equal to either zero or one, and if the first boolean value is equal to one, the second boolean value may still be equal to either zero or one. And vice versa. They’re joined, but they don’t pay attention to each other at all. Like this:

1) 0 0
2) 0 1
3) 1 0
4) 1 1

In the second way, we take two boolean values and declare that they are joined, but cannot vary independently — only one of them can be active at a given time. The other is disabled (represented by “*” here) — like this:

1) * 0
2) * 1
3) 0 *
4) 1 *

Now suppose we wanted to combine three boolean values. In the first case, we get this:

1) 0 0 0
2) 0 0 1
3) 0 1 0
4) 0 1 1
5) 1 0 0 
6) 1 0 1
7) 1 1 0
8) 1 1 1

And in the second case, we get this:

1) * * 0
2) * * 1
3) * 0 *
4) * 1 *
5) 0 * *
6) 1 * *

At this point, you might start to notice a pattern. Using the first way of combining boolean values, every additional bit doubles the number of possible combined values. Using the second way, every additional bit only adds two to the number of possible combined values.

This observation is the basis of algebraic type theory. In the language of algebraic data types, the first way of combining bits produces a product type, and the second way of combining bits produces a sum type. To get the number of possible combined values of a product type, you simply multiply together the number of possible values that each component type can take. For two bits, that’s 2 \times 2; for three bits, that’s 2 \times 2 \times 2; and for n bits, that’s 2 ^ n. And to get the number of possible combined values of a sum type, you simply add together the number of possible values that each component type can take: 2 + 2, or 2 + 2 + 2, or 2 \times n.
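If you’d rather see the counting than take it on faith, here’s a quick brute-force sketch in Python (the function names are mine):

# Count the combined values of n booleans under each scheme.
from itertools import product

def product_type_values(n):
    # all independent combinations: 2 * 2 * ... * 2 = 2 ** n
    return list(product([0, 1], repeat=n))

def sum_type_values(n):
    # exactly one slot active at a time: 2 + 2 + ... + 2 = 2 * n
    values = []
    for active in range(n):
        for bit in (0, 1):
            values.append(tuple(bit if i == active else '*' for i in range(n)))
    return values

for n in (2, 3, 4):
    print(n, len(product_type_values(n)), len(sum_type_values(n)))
# prints: 2 4 4, then 3 8 6, then 4 16 8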

Pretty much all of the numerical types that a computer stores are product types of bits. For example, a 32-bit unsigned integer is just the product type of 32 boolean values. It can represent numbers between 0 and 2 ^{32} - 1 (inclusive), for a total of 2 ^{32} values. Larger numbers require more binary digits to store.

Sum types are a little more esoteric, but they become useful when you want to represent things that can come in multiple categories. For example, suppose you want to represent a garden plot, and you want to have a variable that stores the number of plants in the plot. But you also want to indicate what kind of plants are in the plot — basil or thyme, say. A sum type provides a compact way to represent this state of affairs. A plot can be either a basil plot, or a thyme plot, but not both. If there can be up to twenty thyme plants in a thyme plot, and up to sixteen basil plants in a basil plot, the final sum type has a total of thirty-six possible values:

Basil   Thyme
*       1
*       2
...     ...
*       19    
*       20
1       *
2       *
...     ...
15      *
16      *

If you’ve done any programming, product types probably look pretty familiar, but sum types might be new to you. If you want to try a language that explicitly includes sum types, take a look at Haskell.
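Python doesn’t have sum types built in the way Haskell does, but you can approximate the basil-or-thyme plot with a pair of small classes; a rough sketch (assuming Python 3.10 or later for the union syntax; the class names are mine):

# The garden-plot sum type, approximated in Python. A plot is *either*
# a basil plot (1-16 plants) *or* a thyme plot (1-20 plants), never both,
# so the number of distinct values is 16 + 20 = 36.
from dataclasses import dataclass

@dataclass(frozen=True)
class BasilPlot:
    count: int  # 1 through 16

@dataclass(frozen=True)
class ThymePlot:
    count: int  # 1 through 20

GardenPlot = BasilPlot | ThymePlot  # the sum type itself

all_values = ([BasilPlot(n) for n in range(1, 17)] +
              [ThymePlot(n) for n in range(1, 21)])
print(len(all_values))  # 36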

Poking Holes in Types

Given this understanding of algebraic data types, here’s what it means to “poke a hole” in a type. Suppose you have a list of five two-bit integers:

| a | b | c | d | e |

This list’s type is a product type. A variable of that type can take this value:

| 0 | 0 | 0 | 0 | 0 |

Or this value:

| 0 | 0 | 0 | 0 | 1 |

Or this value…

| 0 | 0 | 0 | 0 | 2 |

And so on, continuing through values like these:

| 0 | 0 | 0 | 0 | 3 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 2 |
| 0 | 0 | 0 | 1 | 3 |
| 0 | 0 | 0 | 2 | 0 |
...

All the way to these final values:

...
| 3 | 3 | 3 | 2 | 3 |
| 3 | 3 | 3 | 3 | 0 |
| 3 | 3 | 3 | 3 | 1 |
| 3 | 3 | 3 | 3 | 2 |
| 3 | 3 | 3 | 3 | 3 |

The total number of possible values for a variable of this type is (2 ^ 2) ^ 5 = 4 ^ 5 = 4 * 4 * 4 * 4 * 4 = 1024.

Now think about what happens if you disable a single slot. The result is a list type with a hole in it. What does a variable with this new type look like? Well, say you disable the first slot. You get something that looks like this — a list with a hole in the first slot:

| * | 0 | 0 | 0 | 0 |

Now that there’s a hole in the first slot, there are only 4 * 4 * 4 * 4 = 256 possible values this variable can take.

| * | 0 | 0 | 0 | 1 |
| * | 0 | 0 | 0 | 2 |
...

We could also poke a hole in any of the other slots:

| 0 | * | 0 | 0 | 0 |
| 0 | 0 | * | 0 | 0 |
| 0 | 0 | 0 | * | 0 |
| 0 | 0 | 0 | 0 | * |

But the hole can only be in one of the slots at any given time. Sound familiar? You could say that this is the “sum type” of each of those one-hole types. Each possible placement of the hole adds to (rather than multiplying) the number of possible values. There are five different places for the hole to go, and for each of those, there are 256 possible values. That’s a total of 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 = 5 \times 4 ^ 4.

Now suppose we make this a little more general, by replacing that 4 with an x. This way we can do this same thing with slots of any size — slots that can hold ten values, or 256 values, or 65536 values, or whatever. For the plain list type, that’s a total of x ^ 5 possible values. And for the list-with-a-hole, that’s 5 x^4 possible values. And now let’s go even further and replace the number of slots with an n. That way we can have any number of slots — five, ten, fifty, whatever you like. Now, the plain list has a total of x ^ n possible values, and the list-with-a-hole has n x ^ {(n - 1)} possible values.

If you ever took calculus, that probably looks very familiar. It’s just the power rule! This means you can take data types and do calculus with them.
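You can check the power rule numerically with a brute-force count; a small sketch of my own:

# Confirm that the number of one-hole contexts of a length-n list over an
# alphabet of size x is n * x ** (n - 1), the derivative of x ** n.
from itertools import product

def count_lists(x, n):
    return len(list(product(range(x), repeat=n)))            # x ** n

def count_one_hole_lists(x, n):
    holes = set()
    for hole in range(n):
        # put the hole at the chosen position and fill the rest freely
        for rest in product(range(x), repeat=n - 1):
            holes.add(rest[:hole] + ('*',) + rest[hole:])
    return len(holes)

x, n = 4, 5
print(count_lists(x, n), x ** n)                    # 1024 1024
print(count_one_hole_lists(x, n), n * x ** (n - 1)) # 1280 1280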

One-Hole Contexts

I think that’s pretty wild, and it kept my head spinning for a couple of years when I first read about it. But I moved on. Fast forward to a few weeks ago when I started really reading about word2vec and learning how it works: it finds vectors that are good at predicting the word in the middle of an n-gram, or are good at predicting an n-gram given a word in the middle. Let’s start with the first case — say you have this incomplete 5-gram:

I went _ the store

What word is likely to be in there? Well, “to” is a good candidate — I’d venture a guess that it was the first word that popped into your mind. But there are other possibilities. “Past” might work. If the n-gram is part of a longer sentence, “from” is a possibility. And there are many others. So word2vec will group those words together in its vector space, because they all fit nicely in this context, and in many others.
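Here’s roughly what that looks like in code, a minimal sketch using the gensim library (assuming gensim 4.x; the tiny corpus is a placeholder, and you would need far more text for sensible results):

# Train a tiny CBOW word2vec model and ask it to fill the hole in
# "I went _ the store". With a real corpus, "to" should rank near the top.
from gensim.models import Word2Vec

corpus = [                       # placeholder; substitute real tokenized text
    ['i', 'went', 'to', 'the', 'store'],
    ['i', 'went', 'past', 'the', 'store'],
    ['she', 'went', 'to', 'the', 'market'],
    # ... many, many more sentences ...
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # two words of context on each side
    min_count=1,
    sg=0,             # CBOW: predict the middle word from its context
)

# predict_output_word takes the context (the list with a hole in it) and
# returns candidate words for the hole, with scores.
print(model.predict_output_word(['i', 'went', 'the', 'store'], topn=5))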

But look again at that sentence — it’s a sequence of words with a hole in it. So if in this model, a word’s meaning is defined by the n-grams in which it may be embedded, then the type of a word’s meaning is a list-with-a-hole — just like the derivative of the list type that we were looking at above.

The basic idea of word2vec isn’t extremely surprising; from a certain perspective, this is an elaboration of the concept of a “minimal pair” in linguistics. If you know that “bat” means a thing you hit a ball with, and “mat” means a thing you wipe your feet on, then you can tell that /b/ and /m/ are different phonemes, even though they both involve putting your lips together: the difference between their sounds makes a difference in meaning. Conversely, if the difference between two sounds doesn’t make a difference in meaning, then the sounds must represent the same phoneme. “Photograph” pronounced with a distinct “oh” sound in the second syllable is the same word as “photograph” pronounced with an “uh” sound in the second syllable. They’re different sounds, but they both make the same word here, so they represent the same phoneme.

The thinking behind word2vec resembles the second case. If you can put one word in place of another in many different contexts and get similar meanings, then the two words must be relatively close together in meaning themselves.

What is surprising is that by pairing these two concepts, we can link the idea of derivatives in calculus to the idea of meaning. We might even be able to develop a model of meaning in which the meaning of a sentence has the same type as the derivative of that sentence’s type. Or — echoing the language of the paper I began with — the meaning type of a sentence type is its type of one-hole contexts.


  1. But don’t take this for granted. There have been — and perhaps will be in the future — ternary computers.