Digital Humanities

To Conquer All Mysteries by Rule and Line

I was very excited last week to read a preprint from Tal Yarkoni and Jacob Westfall on using predictive modeling to validate results in psychology. Their main target is a practice that they refer to — at first — as p-hacking. Some people might be more familiar with the practice under a different name, data dredging. In short, to p-hack is to manipulate data after seeing at least some portion of the results, with the aim (conscious or unconscious) of inflating significance scores.1

In their paper, Yarkoni and Westfall argue that the methodological apparatus of machine learning provides a solution to the problem that allows p-hacking. In machine learning lingo, the name of that problem is overfitting. I find the argument very persuasive, in part based on my own experience. After a brief description of the paper, I’d like to share a negative result from my research that perfectly illustrates the overlap they describe between overfitting by machines and overfitting by humans.

Paranoid Android

An overfit dataset: the green line tries too hard to fit the data perfectly; the black line makes more errors, but comes closer to the truth. Via Wikimedia Commons.

As an alternative to p-hacking, Yarkoni and Westfall offer a new term: procedural overfitting. This term is useful because it draws attention to the symmetry between the research practices of humans and the learning processes of machines. When a powerful machine learning algorithm trains on noisy data, it may assign too much significance to the noise. As a result, it invents a hypothesis that is far more complicated than the data really justifies. The hypothesis appears to explain that first set of data perfectly, but when tested against new data, it falters.

After laying out the above ideas in more detail, Yarkoni and Westfall make this claim: when researchers p-hack, they do exactly the same thing. They take noisy data and misappropriate the noise, building a theory more complex or nuanced than the evidence really justifies.

If that claim is true, then the tools machine learning researchers have developed to deal with overfitting can be reused in other fields. Some of those fields might have nothing to do with predicting categories based on features; they may be more concerned with explaining physical or biological phenomena, or with interpreting texts. But insofar as they are hampered by procedural overfitting, researchers in those fields can benefit from predictive methods, even if they throw out the predictions afterwards.

Others have articulated similar ideas before, framed in narrower ways. But the paper’s illustration of the cross-disciplinary potential of these ideas is quite wonderful, and it explains the fundamental concepts from machine learning lucidly, without requiring too much prior knowledge of any of the underlying algorithms.

A Cautionary Tale

This is all especially relevant to me because I was recently both a perpetrator and a victim of inadvertent procedural overfitting. Fortunately, using the exact techniques Yarkoni and Westfall talk about, I caught the error before reporting it triumphantly as a positive result. I’m sharing this now because I think it might be useful as a concrete example of procedural overfitting, and as a demonstration that it can indeed happen even if you think you are being careful.

At the beginning of the year, I started tinkering with Ted Underwood and Jordan Sellers’ pace-of-change dataset, which contains word frequency counts for 720 volumes of nineteenth-century poetry. Half of them were sampled from a pool of books that were taken seriously enough to receive reviews — whether positive or negative — from influential periodicals. The other half were sampled from a much larger pool of works from HathiTrust. Underwood and Sellers found that those word frequency counts provide enough evidence to predict, with almost 80% accuracy, whether or not a given volume was in the reviewed subset. They used a logistic regression model with a regularization method similar to the one Yarkoni and Westfall describe in their paper. You can read more about the corpus and the associated project on Ted Underwood’s blog.

Inspired by Andrew Goldstone’s replication of their model, I started playing with the model’s regularization parameters. Underwood and Sellers had used an L2 regularization penalty.2 In the briefest possible terms, this penalty measures the model’s distance from zero, where distance is defined in a space of possible models, and each dimension of the space corresponds to a feature used by the model. Models that are further from zero on a particular dimension put more predictive weight on the corresponding feature. The larger the model’s total distance from zero, the higher the regularization penalty.

Goldstone observed that there might be a good reason to use a different penalty, the L1 penalty.3 This measures the model’s distance from zero too, but it does so using a different concept of distance. Whereas the L2 distance is plain old Euclidean distance, the L1 distance is a simple sum of the absolute distances along each dimension.4 What’s nice about L1 regularization is that it produces sparse models. That simply means that the model learns to ignore many features, focusing only on the most useful ones. Goldstone’s sparse model of the pace-of-change corpus does indeed learn to throw out many of the word frequency counts, focusing on a subset that does a pretty good job at predicting whether a volume was reviewed. However, it’s not quite as accurate as the model based on L2 regularization.
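
For readers who want to see what the two penalties look like in practice, here is a minimal sketch using scikit-learn. The word-frequency matrix and labels below are random stand-ins, not the pace-of-change data, so the numbers it prints mean nothing; the point is only the shape of the comparison.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins: X is a (volumes x vocabulary) matrix of word
# frequencies, y marks whether each volume was reviewed (1) or not (0).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(720, 1000)).astype(float)
y = rng.integers(0, 2, size=720)

# L2 penalty: shrinks all weights toward zero, but rarely to exactly zero.
l2_model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

# L1 penalty: drives many weights to exactly zero, giving a sparse model.
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

print("L2 accuracy:", cross_val_score(l2_model, X, y, cv=5).mean())
print("L1 accuracy:", cross_val_score(l1_model, X, y, cv=5).mean())
print("Nonzero L1 weights:", np.count_nonzero(l1_model.fit(X, y).coef_))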

I wondered if it would be possible to improve on that result.5 A model that pays attention to fewer features is easier to interpret, but if it’s less accurate, we might still prefer to pay attention to the weights from the more accurate model. Additionally, it seemed to me that if we want to use the model’s weights to make interpretations, we should also look at how those weights change at different regularization settings. The regularization penalty can be turned up or down; as you turn it up, the overall distance of the model from zero goes down. What happens to the individual word weights as you do so?

It turns out that for many of the word weights, the result is uninteresting. They just go down. As the L2 regularization goes up, they always go down, and as the L1 regularization goes up, they always go down. But a few words occasionally do something different: as the L1 regularization goes up, they go up too. This is surprising at first, because we’re penalizing higher values. When those weights go up, they are pushing the model further away from zero, not closer to it as we would expect. This is a bit like watching a movie, turning down the volume, and finding that some of the voices get louder instead of softer.
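
To make that concrete, here is a rough sketch of the kind of sweep I mean, again with scikit-learn and random stand-in data rather than the actual pace-of-change pipeline. It fits an L1-regularized model at several penalty strengths and flags any word whose weight moves away from zero as the penalty tightens.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the real word-frequency matrix and labels.
rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(720, 500)).astype(float)
y = rng.integers(0, 2, size=720)

# Smaller C means a stronger L1 penalty, i.e. the volume turned further down.
c_values = [1.0, 0.5, 0.25, 0.1, 0.05]
paths = []
for c in c_values:
    model = LogisticRegression(penalty="l1", C=c, solver="liblinear")
    model.fit(X, y)
    paths.append(np.abs(model.coef_[0]))
paths = np.array(paths)  # rows: penalty settings (weak to strong); columns: words

# Flag words whose absolute weight ever grows as the penalty gets stronger.
louder = (np.diff(paths, axis=0) > 0).any(axis=0)
print("Words that get louder as the volume goes down:", np.flatnonzero(louder))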

For these steps away from zero to be worthwhile, the model must be taking even larger steps towards zero for some other set of words. That suggests that these words might have especially high explanatory power (at least for the given dataset). And when you collect a set of words like this together, they seem not to correspond to the specific words picked out by any single L1-regularized model. So what happens if we train a model using just those words? I settled on an automated way to select words like this and whittled the resulting set down to a 400-word list. Then I ran a new model using just those words as features. And after some tinkering, I found that the model was able to classify 97.5% of the volumes correctly:

We have 8 volumes missing in metadata, and
0 volumes missing in the directory.

We have 360 positive, and
360 negative instances.
Beginning multiprocessing.
0
[...]
700
Multiprocessing concluded.

If we divide the dataset with a horizontal line at 0.5, accuracy is: 0.975
Divided with a line fit to the data trend, it's 0.976388888889

I was astonished. This test was based on the original code from Underwood and Sellers, which does a very rigorous, author-aware form of leave-one-out cross-validation. For every prediction it makes, it trains a new model, holding out the single data point it’s trying to predict, as well as data points corresponding to other works in the corpus by the same author. That seems rock-solid, right? So this can’t possibly be overfitting the data, I thought to myself.
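
In outline, that protocol looks something like the sketch below. This is my own scikit-learn paraphrase, not the actual paceofchange code, and the data and author labels are hypothetical stand-ins; it just shows how each prediction comes from a model that never saw the target volume or any other work by its author.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins: X holds word frequencies, y marks reviewed
# volumes, and authors[i] identifies the author of volume i.
rng = np.random.default_rng(2)
X = rng.poisson(1.0, size=(720, 500)).astype(float)
y = rng.integers(0, 2, size=720)
authors = rng.integers(0, 300, size=720)

correct = 0
for i in range(len(y)):
    # Train on every volume except the target and its author's other works.
    train = authors != authors[i]
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    model.fit(X[train], y[train])
    correct += int(model.predict(X[i:i + 1])[0] == y[i])

print("Leave-one-out accuracy:", correct / len(y))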

Then, a couple of hours later, I remembered a conversation I had seen about a controversial claim regarding epigenetic markers associated with sexual orientation in men. The study behind the claim was being criticized for using faulty methods: the researchers had modified their model based on information from their test set. And that means it’s no longer a test set.

I realized I had just done the same thing.

Fortunately for me, I had done it in a slightly different way: I had used an automated feature selection process, which meant I could go back and test the process in a way that did not violate that rule. So I wrote a script that followed the same steps, but used only the training data to select features. I ran that script using the same rigorous cross-validation strategy. And the amazing performance went away:

If we divide the dataset with a horizontal line at 0.5, accuracy is: 0.7916666666666666
Divided with a line fit to the data trend, it's 0.798611111111

This is a scanty improvement on the standard model from Underwood and Sellers — so scanty that it could mean nothing. That incredible performance boost was only possible because the feature selector could see all the answers in advance.
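
Schematically, the fix just moves feature selection inside the cross-validation loop, so that it never sees the held-out volumes. Here is a minimal sketch in the same spirit as the ones above, with a hypothetical select_features helper standing in for my actual selection procedure and random stand-in data in place of the real corpus.

import numpy as np
from sklearn.linear_model import LogisticRegression

def select_features(X_train, y_train, n_keep=400):
    # Hypothetical stand-in for the real selection step: keep the n_keep
    # words with the largest absolute L1-regularized weights.
    model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
    model.fit(X_train, y_train)
    return np.argsort(np.abs(model.coef_[0]))[-n_keep:]

rng = np.random.default_rng(3)
X = rng.poisson(1.0, size=(720, 1000)).astype(float)
y = rng.integers(0, 2, size=720)
authors = rng.integers(0, 300, size=720)

correct = 0
for i in range(len(y)):
    train = authors != authors[i]
    # The crucial point: candidate words are chosen from training data alone.
    keep = select_features(X[train], y[train])
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    model.fit(X[train][:, keep], y[train])
    correct += int(model.predict(X[i:i + 1, keep])[0] == y[i])

print("Leakage-free accuracy:", correct / len(y))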

I still think this result is a little bit interesting because it uses a dramatically reduced feature set for prediction — it throws out roughly 90% of the available word frequency data, retaining a similar (but slightly different) subset for each prediction it makes. That should give us some more confidence that we aren’t throwing out very much important information by focusing on reduced feature sets in our interpretations. Furthermore, the reduced sets this approach produces aren’t the same as the reduced sets you’d get by simply taking the highest-weighted features from any one run. This approach is probably providing some kind of distinctive insight into the data. But there’s nothing earth-shaking about it — it’s just a run-of-the-mill iterative improvement.6

My Intentions Are Good, I Use My Intuition

In this case, my error wasn’t too hard to catch. To find that first, magical feature set, I had used an automated process, which meant I could test it in an automated way — I could tell the computer to start from scratch. Computers are good at unlearning things that way. But humans aren’t.

Suppose that instead of using an automated process to pick out those features, I had used intuition. Intuition can overfit too — that’s the essence of the concept of procedural overfitting. So I would have needed to cross-validate it — but I wouldn’t have been able to replicate the intuitive process exactly. I don’t know how my intuition works, so I couldn’t have guaranteed that I was repeating the exact same steps. And what’s worse, I can’t unsee results the way a machine can. The patterns I had noticed in those first results would have unavoidably biased my perception of future results. To rigorously test my intuitive feature selections, I would have needed to get completely new data.

I’ve had conversations with people who are skeptical of the need to maintain strict separation between training, cross-validation, and test data. It’s an especially onerous restriction when you have a small dataset; what do you do when you have rigorously cross-validated your approach, and then find that it still does poorly on the test set? If you respect the distinction between cross-validation and test data, then you can’t change your model based on the results from the test without demoting the test data. If you use that data for tuning, it’s just not test data anymore — it’s a second batch of cross-validation data.
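
For what it’s worth, the discipline being debated is simple to state in code. A minimal sketch of the three-way split, with random stand-in data; only the protocol matters here.

import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data; the split proportions are illustrative.
rng = np.random.default_rng(4)
X = rng.normal(size=(720, 50))
y = rng.integers(0, 2, size=720)

# 60% training, 20% cross-validation (tuning), 20% test.
X_work, X_test, y_work, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_work, y_work, test_size=0.25, random_state=0)

# Tune against X_val as often as you like. Evaluate on X_test exactly once,
# after every tuning decision is frozen; tuning against X_test again quietly
# turns it into a second batch of cross-validation data.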

Skeptics may insist that this is overly restrictive. But my experiences with this model make me certain that it’s vital if you aren’t tuning your model in a precisely specified way. Even now, I feel a little uneasy about this feature selection method. It’s automated, and it performs as well as the full model in the tests I’ve run, but I developed it after looking at all the data. There was still a small flash of intuition that led me to notice that some words were getting louder instead of softer as I turned down the volume. And it remains possible that future data won’t behave the same way! In an ideal world, I would have held out a final 20% for testing, just to be safe. But at least in this case, I have the comfort of knowing that even without having done that, I was able to find and correct a significant overfitting problem, because I used an automated feature selection method.


  1. Discussions of this practice pop up occasionally in debates between Bayesians and advocates of classical significance testing. Bayesians sometimes argue that p-values can be manipulated more easily (or at least with less transparency) than tests based on Bayesian techniques. As someone who had long been suspicious of certain kinds of scientific results, I found my preconceptions flattered by that argument when I first heard it, and I became very excited about Bayesian methods for a while. But that excitement wore off, for reasons I could not have articulated before. 
  2. A lot of people call this ridge regression, but I always think — ridge of what? L2 refers to something specific, a measure of distance. 
  3. Also known as the LASSO. But again — huh? 
  4. This one has a nickname I understand: the Manhattan distance. Like most city blocks in Manhattan, it disallows diagonal shortcuts. 
  5. After writing this, I realized that Ben Schmidt has also been thinking about similar questions. 
  6. I’m still getting the code that does this into shape, and once I have managed that, I will write more about the small positive result buried in this large negative result. But adventurous tinkerers can find the code, such as it is, on github, under the “feature/reg-and-penalty” branch of my fork of the original paceofchange repository. Caveat lector! It’s terribly documented, and there are several hard-coded parameters that require tweaking. The final result will have at least a primitive argparse-based console interface. 

Model Corpora for Distant Reading Research

There has been some recent discussion on Twitter in response to Alan Liu’s call for model corpora for students. It’s an extremely exciting conversation that’s long overdue. We need better model corpora. And I think it’s important for DH scholars to recognize that this is not just a pedagogical requirement — it’s a requirement for research at every level.1

Consider the field of machine learning. Almost anyone who has ever done research in the field has heard of the MNIST database. It’s a collection of 70,000 images of handwritten digits (60,000 for training and 10,000 for testing), labeled with their actual values, preprocessed for ease of use, and carefully designed to pose a substantial challenge for learning algorithms. Achieving a decent error rate isn’t too hard, but achieving a very low error rate seems to require things like 6-layer feed-forward neural networks with thousands of neurons in each layer.

What’s great about the MNIST database is that many, many people have used it to test their work. Now, whenever you try a new learning algorithm, you can toss the MNIST data at it. If your algorithm achieves a reasonable error rate, you can feel pretty confident that you’re not making some kind of gross error, and that your algorithm really is recognizing handwriting, rather than noticing and generalizing from irrelevant details. Unfamiliar data could have surprising biases or unexpected patterns that throw off the performance of your algorithm, but the MNIST database is carefully curated to avoid many such problems. That’s vital for teaching, but it’s just as vital for research.
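
To give a sense of how low the barrier is, here is a minimal sketch of that sanity-check workflow in Python, using the small MNIST-like digits set that ships with scikit-learn; the full MNIST data can be fetched with fetch_openml("mnist_784"), but the miniature version is enough to show the shape of the exercise.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small, bundled MNIST-like benchmark: 1,797 8x8 images of handwritten digits.
digits = load_digits()

# Swap in whatever new classifier you want to sanity-check.
model = LogisticRegression(max_iter=2000)
scores = cross_val_score(model, digits.data, digits.target, cv=5)
print("Mean accuracy on the digits benchmark:", scores.mean())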

Andrew Ng compares this to the use of model organisms in biology. Mice, for example, consume few resources, develop and reproduce quickly, and share fundamental biological traits with many mammals, including humans. For those reasons, people have been studying them for a very long time and their biology is well understood. Researchers who want to study mammals, and who are already committed to making the kinds of ethical trade-offs that animal experimentation entails, will almost certainly start with mice or other related model organisms. The epistemological, practical, and ethical benefits are manifold. There will be fewer ways for the work to go wrong in unexpected ways, the research will proceed more efficiently, and fewer animals will suffer overall.

Fortunately, digital humanists don’t face the same ethical questions as biologists. Our “mouse models” can consist of nothing but data. But we still don’t have enough of them.2

I found the absence particularly frustrating as I sat down to play with Syuzhet a couple of weeks ago. I was initially at a loss about what data to use. It quickly occurred to me that I ought to start with Romeo and Juliet because that’s what other researchers had used, and for good reason. It’s familiar to a broad range of audiences, so it’s easy to get data about it from actual human beings. It has large variations in emotional valence with relatively clear trajectories. And well, you know — it’s Shakespeare. But one play by one author isn’t really good enough. What we need is a model corpus — or rather, many model corpora from different periods, in different genres, different languages, different registers, and so on.

There has been some discussion about how to construct corpora that are representative, but in these discussions, the question is often about whether the corpus gives us access to some kind of ground truth about the culture from which it is sampled.3 That’s an extremely important question — one of the most important in the digital humanities. But I worry that we’re not quite ready to begin answering it. We don’t know whether corpora are representative, but we also don’t know for sure what tools to use in our investigations. And it’s going to be difficult to refine our tools and build new representative corpora at the same time. In our rush to take advantage of the sudden ubiquity of literary and historical data, we might be skipping a step. We need to understand the tools we use, and to understand the tools we use, we need to test them on corpora that we understand.4

From one perspective, this is a matter of validation — of checking new results against what we already know.5 But it’s a different kind of validation than many of us are used to — where by “us” I mean mostly literary scholars, but possibly other humanists as well. It doesn’t ask “is this a claim that comports with known evidence?” It asks “what is the precise meaning of this claim?” This second question becomes important when we use an unfamiliar tool to make a claim; we need to understand the claim itself before saying whether it comports with known evidence.

From another perspective, this is a matter of theorization — of clarifying assumptions, developing conceptual structures, and evaluating argumentative protocols. But it’s a different kind of theory than many of us are used to. It doesn’t ask “what can we learn by troubling the unspoken assumptions that make this or that interpretation seem obvious?” It asks “how can we link the representations these tools produce to familiar concepts?” Literary theory has often involved questioning the familiar by setting it at a distance. But distant reading already does that — there are few obvious interpretations in the field right now, and theory may need to play a slightly different role than it has in previous decades.

From either perspective, the point of a model corpus would not be to learn about the texts in the corpus, or about the culture that produced those texts. It would be to learn about the tools that we use on that corpus, about the claims those tools might support, and about the claims they cannot support.

But what should a model corpus look like? This is where I become less certain. My first instinct is to say “let’s look at what corpus linguists do.” But the kinds of questions linguists are likely to ask are very different from the ones that literary scholars are likely to ask. Still, there are some great starting points, including a remarkably comprehensive list from Richard Xiao. Among those, the ARCHER corpus seems particularly promising. (Thanks to Heather Froelich for these suggestions!)

But in the long run, we’ll want to produce our own corpora. Fortunately, Alan Liu has already given this some thought! His focus is on pedagogical issues, but again, the kinds of model corpora he talks about are vital for research as well. On Twitter, he offered a brilliant enumeration of desirable qualities such corpora would have. I’m reproducing it here, lightly paraphrased:

Imagine what a ready-for-student-use corpus of literary materials would look like. Specs include the following:

  1. Free and public domain.
  2. Of manageable size (e.g., low thousands and not hundreds of thousands of items).
  3. Modular by nation, genre, period, language.
  4. Socioculturally diverse.
  5. Richly annotated with metadata.
  6. Pre-cleaned and chunked (or packaged with easy-to-use processing scripts).
  7. Compatible in format with similar corpora of materials in history and other fields (to encourage cross-domain experiments in analysis).
  8. Matched by period and nation to available linguistic corpora that can be used as reference corpora.

I think this is a terrific set of initial goals for model corpora, both for researchers and students. We’ll need to keep having conversations about requirements, and of course no one corpus will serve all our needs. Different disciplines within the humanities will have different requirements. But it’s clear to me that if digital humanists can develop a “canon” of familiar corpora useful for validating new tools, the field will have taken a significant step forward.

Let’s get started!


Appendix

Since there are already several links to helpful resources for thinking about humanistic corpora, I’m going to start a corpus creation and curation bibliography here. This will probably graduate into its own page or post.


  1. Update: Since posting this, I’ve learned that Laura Mandell, in collaboration with the NovelTM partnership, is working on a proposal for a journal dedicated to digital humanities corpora. I think this will be a fantastic development for the field! 
  2. There are some examples — such as the much-studied Federalist papers, which might be a good dataset to consider for testing new approaches to authorship attribution. And of course there are numerous standard corpora for use by corpus linguists — the Brown, AP, and Wall Street Journal corpora come to mind for American English, and there are many others. But these corpora have not been selected with literary studies in mind! This is where I parade my ignorance in the hope that someone will enlighten me: are there other corpora designed to be of specific methodological interest to literary scholars? 
  3. Most recently, Scott Weingart took up the question of corpus representativeness in his discussion of the Great Tsundoku, focusing on questions about what was written, what was published, and what was read. He also noted a conversation from a few years ago that Ted Underwood documented, and drew attention to Heather Froelich, who does lots of careful thinking about issues like this. And needless to say, Franco Moretti was thinking about this long ago. I think we’ll be asking this question in different ways for another fifty years. 
  4. Initially I said “We need to understand the tools we use first,” but that’s not quite right either. There’s a cyclical dependency here! 
  5. This has been a topic of widespread interest after the recent Syuzhet conversation, and I think the kinds of collective validation that Andrew Piper and others have called for would be made vastly easier by a somewhat standardized set of model corpora familiar to many researchers.