To Conquer All Mysteries by Rule and Line

I was very excited last week to read a preprint from Tal Yarkoni and Jacob Westfall on using predictive modeling to validate results in psychology. Their main target is a practice that they refer to — at first — as p-hacking. Some people might be more familiar with the practice under a different name, data dredging. In short, to p-hack is to adjust your data or analysis after seeing at least some portion of the results, with the aim (conscious or unconscious) of inflating the apparent statistical significance of those results.1

In their paper, Yarkoni and Westfall argue that the methodological apparatus of machine learning provides a solution to the problem that makes p-hacking possible. In machine learning lingo, that problem is called overfitting. I find the argument very persuasive, in part based on my own experience. After a brief description of the paper, I’d like to share a negative result from my research that perfectly illustrates the overlap they describe between overfitting by machines and overfitting by humans.

Paranoid Android

A model overfit to its data: the green line tries too hard to fit the data perfectly; the black line makes more errors, but comes closer to the truth. Via Wikimedia Commons.

As an alternative to p-hacking, Yarkoni and Westfall offer a new term: procedural overfitting. This term is useful because it draws attention to the symmetry between the research practices of humans and the learning processes of machines. When a powerful machine learning algorithm trains on noisy data, it may assign too much significance to the noise. As a result, it invents a hypothesis that is far more complicated than the data really justifies. The hypothesis appears to explain that first set of data perfectly, but when tested against new data, it falters.
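To see the machine version in miniature, here is a small sketch of my own (not an example from the paper): a high-degree polynomial chases the noise in a tiny sample, and a plain straight line, which fits the training points less closely, typically does better on fresh data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x_train = np.sort(rng.uniform(0, 1, 15)).reshape(-1, 1)
y_train = 2 * x_train.ravel() + rng.normal(scale=0.3, size=15)   # noisy linear "truth"
x_test = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y_test = 2 * x_test.ravel() + rng.normal(scale=0.3, size=100)

for degree in (1, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    # R-squared on the training sample vs. on held-out data
    print(degree, model.score(x_train, y_train), model.score(x_test, y_test))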

After laying out the above ideas in more detail, Yarkoni and Westfall make this claim: when researchers p-hack, they do exactly the same thing. They take noisy data and misappropriate the noise, building a theory more complex or nuanced than the evidence really justifies.

If that claim is true, then the tools machine learning researchers have developed to deal with overfitting can be reused in other fields. Some of those fields might have nothing to do with predicting categories based on features; they may be more concerned with explaining physical or biological phenomena, or with interpreting texts. But insofar as they are hampered by procedural overfitting, researchers in those fields can benefit from predictive methods, even if they throw out the predictions afterwards.

Others have articulated similar ideas before, framed in narrower ways. But the paper’s illustration of the cross-disciplinary potential of these ideas is quite wonderful, and it explains the fundamental concepts from machine learning lucidly, without requiring too much prior knowledge of any of the underlying algorithms.

A Cautionary Tale

This is all especially relevant to me because I was recently both a perpetrator and a victim of inadvertent procedural overfitting. Fortunately, using the exact techniques Yarkoni and Westfall talk about, I caught the error before reporting it triumphantly as a positive result. I’m sharing this now because I think it might be useful as a concrete example of procedural overfitting, and as a demonstration that it can indeed happen even if you think you are being careful.

At the beginning of the year, I started tinkering with Ted Underwood and Jordan Sellers’ pace-of-change dataset, which contains word frequency counts for 720 volumes of nineteenth-century poetry. Half of them were sampled from a pool of books that were taken seriously enough to receive reviews — whether positive or negative — from influential periodicals. The other half were sampled from a much larger pool of works from HathiTrust. Underwood and Sellers found that those word frequency counts provide enough evidence to predict, with almost 80% accuracy, whether or not a given volume was in the reviewed subset. They used a logistic regression algorithm that incorporated regularization of the kind Yarkoni and Westfall describe in their paper. You can read more about the corpus and the associated project on Ted Underwood’s blog.

Inspired by Andrew Goldstone’s replication of their model, I started playing with the model’s regularization parameters. Underwood and Sellers had used an L2 regularization penalty.2 In the briefest possible terms, this penalty measures the model’s distance from zero, where distance is defined in a space of possible models, and each dimension of the space corresponds to a feature used by the model. Models that are further from zero on a particular dimension put more predictive weight on the corresponding feature. The larger the model’s total distance from zero, the higher the regularization penalty.
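To make that concrete, here is a tiny sketch with a made-up weight vector (the numbers are purely illustrative):

weights = [0.8, -1.3, 0.0, 2.4]                      # hypothetical per-feature weights
l2_distance = sum(w ** 2 for w in weights) ** 0.5    # plain euclidean distance from zero

In practice it is usually the squared distance that gets added to the training loss, scaled by a strength parameter; in scikit-learn’s LogisticRegression, for instance, that strength is set (inversely) by the C parameter.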

Goldstone observed that there might be a good reason to use a different penalty, the L1 penalty.3 This measures the model’s distance from zero too, but it does so using a different concept of distance. Whereas the L2 distance is plain old euclidean distance, the L1 distance is a simple sum of the absolute distances along each dimension.4 What’s nice about L1 regularization is that it produces sparse models. That simply means that the model learns to ignore many features, focusing only on the most useful ones. Goldstone’s sparse model of the pace-of-change corpus does indeed learn to throw out many of the word frequency counts, focusing on a subset that does a pretty good job at predicting whether a volume was reviewed. However, it’s not quite as accurate as the model based on L2 regularization.
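In scikit-learn terms, the two penalties look roughly like this; X and y stand in for the word-frequency matrix and the reviewed/random labels, and the C values are arbitrary rather than the settings Goldstone or Underwood and Sellers actually used:

from sklearn.linear_model import LogisticRegression

# X: (volumes x word-frequency features), y: 1 = reviewed, 0 = random sample
dense_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
sparse_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# the L1 penalty (the sum of the absolute weights) drives many coefficients to exactly zero
print("nonzero weights, L2:", (dense_model.coef_ != 0).sum())
print("nonzero weights, L1:", (sparse_model.coef_ != 0).sum())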

I wondered if it would be possible to improve on that result.5 A model that pays attention to fewer features is easier to interpret, but if it’s less accurate, we might still prefer to pay attention to the weights from the more accurate model. Additionally, it seemed to me that if we want to interpret the weights the model produces, we should also look at how those weights change at different regularization settings. The regularization penalty can be turned up or down; as you turn it up, the overall distance of the model from zero goes down. What happens to the individual word weights as you do so?

It turns out that for many of the word weights, the result is uninteresting. They just go down. As the L2 regularization goes up, they always go down, and as the L1 regularization goes up, they always go down. But a few words do something different occasionally: as the L1 regularization goes up, they go up too. This is surprising at first, because we’re penalizing higher values. When those weights go up, they push the model further away from zero, even though the penalty should be pushing it closer. This is a bit like watching a movie, turning down the volume, and finding that some of the voices get louder instead of softer.
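Here is a rough sketch of the kind of sweep I mean (not my actual code; X, y, and vocabulary are placeholders, and the C values are arbitrary). In scikit-learn, C is the inverse of the regularization strength, so stepping C downward turns the penalty up:

import numpy as np
from sklearn.linear_model import LogisticRegression

c_values = [1.0, 0.3, 0.1, 0.03]          # descending C means an increasing L1 penalty
paths = []
for c in c_values:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=c).fit(X, y)
    paths.append(np.abs(model.coef_.ravel()))
paths = np.array(paths)                   # shape: (penalty settings, words)

# flag words whose absolute weight ever increases as the penalty goes up
gets_louder = (np.diff(paths, axis=0) > 0).any(axis=0)
surprising_words = [w for w, flag in zip(vocabulary, gets_louder) if flag]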

For these steps away from zero to be worthwhile, the model must be taking even larger steps towards zero for some other set of words. That suggests that these words might have especially high explanatory power (at least for the given dataset). And when you collect a set of words like this together, they seem not to correspond to the specific words picked out by any single L1-regularized model. So what happens if we train a model using just those words? I settled on an automated way to select words like this, and whittled the selection down to a 400-word list. Then I ran a new model using just those words as features. And after some tinkering, I found that the model was able to successfully classify 97.5% of the volumes:

We have 8 volumes missing in metadata, and
0 volumes missing in the directory.

We have 360 positive, and
360 negative instances.
Beginning multiprocessing.
0
[...]
700
Multiprocessing concluded.

If we divide the dataset with a horizontal line at 0.5, accuracy is: 0.975
Divided with a line fit to the data trend, it's 0.976388888889

I was astonished. This test was based on the original code from Underwood and Sellers, which does a very rigorous version of k-fold cross-validation. For every prediction it makes, it trains a new model, holding out the single data point it’s trying to predict, as well as data points corresponding to other works in the corpus by the same author. That seems rock-solid, right? So this can’t possibly be overfitting the data, I thought to myself.
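The scheme looks roughly like this (a sketch of the idea rather than their code, with X, y, and authors as placeholder arrays and an arbitrary C value); it computes only the simpler of the two accuracy figures in the output above, the one based on a horizontal line at 0.5:

import numpy as np
from sklearn.linear_model import LogisticRegression

predictions = np.zeros(len(y))
for i in range(len(y)):
    train = authors != authors[i]          # hold out this volume and all others by its author
    model = LogisticRegression(penalty="l2", C=1.0).fit(X[train], y[train])
    predictions[i] = model.predict_proba(X[i:i+1])[0, 1]

print("accuracy with a horizontal line at 0.5:", ((predictions > 0.5) == y).mean())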

Then, a couple of hours later, I remembered a conversation I had seen about a controversial claim regarding epigenetic markers associated with sexual orientation in men. The study was drawing criticism for using faulty methods: the researchers had modified their model based on information from their test set. And that means it’s no longer a test set.

I realized I had just done the same thing.

Fortunately for me, I had done it in a slightly different way: I had used an automated feature selection process, which meant I could go back and test the process in a way that did not violate that rule. So I wrote a script that followed the same steps, but used only the training data to select features. I ran that script using the same rigorous cross-validation strategy. And the amazing performance went away:

If we divide the dataset with a horizontal line at 0.5, accuracy is: 0.7916666666666666
Divided with a line fit to the data trend, it's 0.798611111111

This is a scanty improvement on the standard model from Underwood and Sellers — so scanty that it could mean nothing. That incredible performance boost was only possible because the feature selector could see all the answers in advance.
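The corrected pipeline, in sketch form, moves the selection step inside the cross-validation loop, so that it sees only the training volumes for each split. This continues the placeholders from the earlier sketches, and select_surprising_words is a hypothetical stand-in for my actual selection procedure, wrapping the kind of sweep sketched above:

def select_surprising_words(X_train, y_train, c_values=(1.0, 0.3, 0.1, 0.03)):
    # hypothetical stand-in: flag words whose absolute L1 weight increases as the
    # penalty tightens, using only the training volumes this function is given
    paths = np.array([np.abs(LogisticRegression(penalty="l1", solver="liblinear", C=c)
                             .fit(X_train, y_train).coef_.ravel()) for c in c_values])
    return (np.diff(paths, axis=0) > 0).any(axis=0)

predictions = np.zeros(len(y))
for i in range(len(y)):
    train = authors != authors[i]
    selected = select_surprising_words(X[train], y[train])   # selection sees training data only
    model = LogisticRegression(penalty="l2", C=1.0).fit(X[train][:, selected], y[train])
    predictions[i] = model.predict_proba(X[i:i+1][:, selected])[0, 1]

print("accuracy:", ((predictions > 0.5) == y).mean())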

I still think this result is a little bit interesting because it uses a dramatically reduced feature set for prediction — it throws out roughly 90% of the available word frequency data, retaining a similar (but slightly different) subset for each prediction it makes. That should give us some more confidence that we aren’t throwing out very much important information by focusing on reduced feature sets in our interpretations. Furthermore, the reduced sets this approach produces aren’t the same as the reduced sets you’d get by simply taking the highest-weighted features from any one run. This approach is probably providing some kind of distinctive insight into the data. But there’s nothing earth-shaking about it — it’s just a run-of-the-mill iterative improvement.6

My Intentions Are Good, I Use My Intuition

In this case, my error wasn’t too hard to catch. To find that first, magical feature set, I had used an automated process, which meant I could test it in an automated way — I could tell the computer to start from scratch. Computers are good at unlearning things that way. But humans aren’t.

Suppose that instead of using an automated process to pick out those features, I had used intuition. Intuition can overfit too — that’s the essence of the concept of procedural overfitting. So I would have needed to cross-validate it — but I wouldn’t have been able to replicate the intuitive process exactly. I don’t know how my intuition works, so I couldn’t have guaranteed that I was repeating the exact same steps. And what’s worse, I can’t unsee results the way a machine can. The patterns I had noticed in those first results would have unavoidably biased my perception of future results. To rigorously test my intuitive feature selections, I would have needed to get completely new data.

I’ve had conversations with people who are skeptical of the need to maintain strict separation between training, cross-validation, and test data. It’s an especially onerous restriction when you have a small dataset; what do you do when you have rigorously cross-validated your approach, and then find that it still does poorly on the test set? If you respect the distinction between cross-validation and test data, then you can’t change your model based on the test results without demoting the test data. If you use that data for tuning, it’s just not test data anymore — it’s a second batch of cross-validation data.
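In code, the discipline amounts to something like this (a sketch with arbitrary split proportions; X and y are the usual placeholders):

from sklearn.model_selection import train_test_split

# carve off the test set first and leave it untouched while tuning
X_work, X_test, y_work, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_work, y_work, test_size=0.25,
                                                  random_state=0)

# tune against (X_val, y_val) as often as you like; the moment you adjust the
# model in response to results on (X_test, y_test), that data has become a
# second validation set, and a real test requires fresh data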

Skeptics may insist that this is overly restrictive. But my experiences with this model make me certain that it’s vital whenever you aren’t tuning your model in a precisely specified, automated way. Even now, I feel a little uneasy about this feature selection method. It’s automated, and it performs as well as the full model in the tests I’ve run, but I developed it after looking at all the data. There was still a small flash of intuition that led me to notice that some words were getting louder instead of softer as I turned down the volume. And it remains possible that future data won’t behave the same way! In an ideal world, I would have held out a final 20% for testing, just to be safe. But at least in this case, I have the comfort of knowing that even without having done that, I was able to find and correct a significant overfitting problem, because I used an automated feature selection method.


  1. Discussions of this practice pop up occasionally in debates between Bayesians and advocates of classical significance testing. Bayesians sometimes argue that p-values can be manipulated more easily (or at least with less transparency) than tests based on Bayesian techniques. As someone who had long been suspicious of certain kinds of scientific results, I found my preconceptions flattered by that argument when I first heard it, and I became very excited about Bayesian methods for a while. But that excitement wore off, for reasons I could not have articulated before. 
  2. A lot of people call this ridge regression, but I always think — ridge of what? L2 refers to something specific, a measure of distance. 
  3. Also known as the LASSO. But again — huh? 
  4. This one has a nickname I understand: the manhattan distance. Like most city blocks in Manhattan, it disallows diagonal shortcuts. 
  5. After writing this, I realized that Ben Schmidt has also been thinking about similar questions. 
  6. I’m still getting the code that does this into shape, and once I have managed that, I will write more about the small positive result buried in this large negative result. But adventurous tinkerers can find the code, such as it is, on github, under the “feature/reg-and-penalty” branch of my fork of the original paceofchange repository. Caveat lector! It’s terribly documented, and there are several hard-coded parameters that require tweaking. The final result will have at least a primitive argparse-based console interface.