Simulating Citation Networks

Chart showing citation rates for life sciences, natural sciences, social sciences, engineering, communication, and humanities
Centre for Science and Technology Studies (2007), via Dunleavy’s blog post.

A few months ago, Patrick Dunleavy published a post on the London School of Economics Impact Blog describing “a huge gulf between many STEM scientists… and scholars in other disciplines in how often they cite other people’s research.” After providing some statistics in support of this claim, some of which placed citation rates in the humanities an entire order of magnitude below those in the natural and life sciences, he offered some advice:

Those social science and humanities academics who go through life anxious to prove themselves a ‘source’ (of citations) but not a ‘hub’ (of cites to others) are not practicing the reputation-building skills that they fondly imagine… Their practice is self-harming not just to themselves, but to all who read their works or follow after them as students at any level. Others who perhaps accept such attitudes without practicing them themselves – for example, as departmental colleagues or journal reviewers – are also failing in their scholarly obligations, albeit in a minor way.

Initially I felt chastened by Dunleavy’s article. “It’s a shame,” I thought, “that we in the humanities operate in such backwards ways that our citation rates are an entire order of magnitude lower than the citation rates of researchers in the sciences. We’d really better start citing each other more — being ‘hubs’ instead of ‘sources’!” But after giving some careful thought to what that means, I’ve concluded that this issue deserves more investigation than it received in Dunleavy’s post.

My own informal investigations suggest conclusions that are quite different from those that Dunleavy drew. The global point he made — that scholars in the humanities and social sciences ought to reform their citation practices — may still hold. But if it does, I believe it will be for different reasons than he suggested in his post. And those reasons will not be fully clear without much more systematic research into the relationship between research practices in the humanities and the statistical measurements he relied on. In the end, that relationship will probably have less to do with the number of citations in bibliographies than with the age of articles being cited, and the number of articles being published. Dunleavy summarily rejected the second claim, but there’s a strong line of reasoning suggesting that he was wrong to do so. The number of publications in a field — and especially the number of articles published per journal per year — could significantly affect both of the statistics he uses. To support that argument I’ve created a citation simulator that’s publicly available as a gist. I’ve also posted an interactive JavaScript version, and I’ll talk below about the assumptions it makes and the citation patterns it produces.

But I want to begin by describing my initial thought process upon reading Dunleavy’s post. I think it’s important to take seriously the kind of informal reasoning that might lead to skepticism in cases like this. Humanists aren’t often trained to make complex mathematical arguments, but that training isn’t always necessary to see when those arguments have problems. We don’t all need more mathematical training. We just need to get more practice subjecting mathematical arguments to sniff tests. These tests often involve paying attention to the order of magnitude of a value. If the number you get from an argument is in the low five digits, and the number you expect has at least six digits, then something’s probably wrong.

Off by Inches or by Feet?

What first bothered me about Dunleavy’s argument was the mismatch between incoming and outgoing citation rates. Although the first chart he displayed shows an entire order of magnitude difference between incoming citation rates in the humanities and the sciences, I had never noticed such a dramatic difference in the number of outgoing citations per article. Being in a self-critical mood, my first instinct was to consider my own work. I thought about how many secondary sources I cited in my most recent publication — just about twenty out of a total of fifty including primary sources1. That didn’t seem great — about average for the field at best, and probably even below average. But as I continued to think about it, that also seemed about average for most of the computer science papers I had read. In fact, a lot of those papers seemed to cite between fifteen and twenty other papers. But that was a vague hunch — it called for investigation. So I thought of a computer science paper that I know has been cited many times: the original paper describing Latent Dirichlet Allocation by David Blei, Andrew Ng, and Michael Jordan.2

It only cites seven papers!

Then I started to feel a little better about my own citation practices. At least I was doing better than rockstars like David Blei and Andrew Ng. (And that was back before they were rockstars.) What about people who cited their paper? I did a couple of random checks. One cited twenty-five papers; one cited eighteen. One cited 332 — that threw me for a loop until I realized it was a book-length document. But then it seemed about in line with the secondary bibliographies of most humanities monographs that I’ve seen — higher than average, perhaps, but certainly not by an order of magnitude.

Even based on such a small and unsystematically collected sample, I think it’s reasonable to conclude that humanists and scientists probably aren’t citing wildly different numbers of fellow scholars and researchers. Clearly this is an assertion that demands a much more thorough investigation. But we can expect that if humanists were generating an entire order of magnitude fewer citations, it would be obvious at first glance. And it’s not obvious. Humanists seem to be generating a roughly similar number of outgoing citations per article on average — optimistically, about the same number, and pessimistically, perhaps two thirds or three fifths as many. But not a tenth as many.

So what are the citation practices that we need to change? Dunleavy’s argument was that if we work harder at being ‘hubs,’ we’ll also have more success — potentially an order of magnitude more — as ‘sources.’ In other words, we should include far more secondary citations in our bibliographies. This question about outgoing citations is a pretty good test of that claim, and the claim didn’t fare very well.3

That suggests that outgoing citations by humanists are being lost in these statistics, or that incoming citations by scientists are somehow being magnified. Or perhaps the data is outright biased. At worst we need to be citing different articles — not ten times as many. Later in his article, Dunleavy suggested that we need to be doing more thorough literature reviews. Could that be causing the problem somehow? Maybe it’s a matter of citing more recent work, or more obscure work. Or perhaps we’re citing people outside our own field too frequently — perhaps we should be keeping citations inside our own circles.4 Or maybe something else is happening.

It’s difficult to tell exactly why there’s such a dramatic difference, but Dunleavy’s article suggested one intriguing explanation — only to reject it. After talking about low incoming citation rates, Dunleavy went on to talk about the h5 scores of journals in various fields. Then he gave us this graph:

A chart showing the h5 scores of journals across multiple disciplines.

Even now, I look at that and have to quash little twinges of insecurity. My field is all the way at the bottom! Panic! But cooler heads prevail. Dunleavy followed the chart with a sliver of analysis before moving towards conclusions, asking “What can or should be done?” Only after asking that question did he address the possibility that publication volume might have something to do with the discrepancy: “The greater volume of STEM science publications is cited as if it explains things — it doesn’t, most individual STEM science papers are not often cited.”

As I read that sentence, my first thought was “that sentence belongs way earlier in the post.” It’s part of the analysis, not a conclusion. And what’s the logic behind it? Dunleavy didn’t explain. That means we have to do some math.

Scale Models

Let’s begin by considering the statistics that Dunleavy discussed: citation rate and h5 index. The h5 index is easier to specify, so I’ll start with that. Google offers this definition: “h5-index is the h-index for articles published in the last 5 complete years. It is the largest number h such that h articles published in 2009-2013 have at least h citations each.” Articles older than five years effectively expire for the purpose of this statistic, and their citations are no longer counted.
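Google’s definition translates directly into a few lines of code. Here’s a minimal sketch of my own (not Google’s implementation): sort the citation counts in descending order and count how far down the list each article still has at least as many citations as its rank. For h5, the input would be restricted to articles published in the last five complete years.

```python
def h_index(citation_counts):
    """Largest h such that h articles have at least h citations each.

    For h5, citation_counts should include only articles published
    in the last five complete years.
    """
    counts = sorted(citation_counts, reverse=True)
    h = 0
    # counts[h] is the (h+1)-th most-cited article; stop as soon as
    # it falls below its own rank.
    while h < len(counts) and counts[h] >= h + 1:
        h += 1
    return h
```

For example, a journal whose recent articles have 10, 8, 5, 4, and 3 citations has an h5 index of 4: four articles have at least four citations each, but there aren’t five with at least five.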

Citation rate is a bit harder to work out. Dunleavy didn’t explicitly state what kind of citations the citation rate statistic counts, but if it were counting outgoing citations, then the chart he begins his post with would be hard to take seriously. I am quite confident that there are more than four outgoing citations per publication in all of the fields listed on that chart. So it must count incoming citations. Dunleavy also didn’t specify which citations it counts. For consistency, I’ll take five years as the cutoff for this statistic as well: articles older than five years expire.

Now let’s test Dunleavy’s claims on a simple, artificial example. Say you have two fields, A and B. Field A produces one thousand articles per year; field B produces ten thousand. Dunleavy’s first claim was that given similar citation practices, the increase in volume will not significantly affect the citation statistics he’s talking about.5 And his second was that most of the articles in either field will not be cited.

“Most” is a little vague, so let’s say that in either field, all outgoing citations in a particular year will be evenly distributed among the top ten percent of articles from the previous year. Dunleavy also didn’t remark on the number of journals in each field, so let’s suppose that field A has fifty journals and field B has two hundred. And to keep things simple, let’s say the top articles are distributed evenly across all journals. We’ll also assume that all articles cite just ten articles from the previous year. These are unrealistic assumptions, but they aren’t totally outlandish, and they should at least help us learn some things about what Dunleavy has claimed.

Let’s start with field A. For any given year, there will be one thousand articles published in the field. They will generate ten citations each, for a total of ten thousand citations. Those citations will be evenly distributed across the top ten percent of articles from the previous year — one hundred articles receiving one hundred citations each. Those articles will be evenly distributed across all fifty journals — two each. So for that year, there will be just two articles per journal with at least two citations; they will both count towards the journal’s h5 score. Over five years, there will be four such pairs of articles (because the articles from the most recent year won’t be cited until next year). Written out numerically, there are 1,000 × 10 / 100 = 100 citations per cited article, and 4 × 100 / 50 = 8 cited articles per journal. So that’s a total of eight articles published per journal that received at least eight citations in the last five years in field A, giving an h5 score of eight for every journal in the field.

The calculation for the citation rate is slightly different. Every year, a set of ten thousand citations is generated and distributed evenly among the previous year’s journals, and four such sets count for any given five-year span. That’s a total of forty thousand citations, divided evenly among fifty journals. Those journals together produce five thousand articles over five years, and so the citation rate is 40,000 / 5,000 = eight.

On to field B. For any given year, there will be ten thousand articles in this field, each citing ten papers, for a total of one hundred thousand citations. Those citations will be divided among the previous year’s top ten percent of papers — one thousand papers this time, each receiving one hundred citations. This time, those thousand articles are divided among two hundred journals. That’s five articles per journal, and four sets of five per five-year period. Numerically, there are 10,000 × 10 / 1,000 = 100 citations per cited article, and 4 × 1,000 / 200 = 20 cited articles per journal. That gives us twenty articles published per journal with at least twenty citations each, for an h5 score of twenty for every journal in the field.

And now for the citation rate: every year, one hundred thousand citations are generated, with four sets produced over five years. That’s a total of four hundred thousand citations, divided evenly among two hundred journals. Those journals will produce fifty thousand articles in total, and so the citation rate is again eight.
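Both back-of-envelope calculations can be packaged into one function and checked. The parameter names here are mine; the function just restates the toy model above:

```python
def toy_field(articles_per_year, journals, cites_per_article=10,
              top_fraction=0.10, years=5):
    """h5 and citation rate for the artificial model above: all
    citations go evenly to the previous year's top articles, which
    are spread evenly across the field's journals."""
    yearly_citations = articles_per_year * cites_per_article
    cited_articles = int(articles_per_year * top_fraction)
    cites_per_cited = yearly_citations // cited_articles
    # Only four of the five yearly cohorts have been cited: the
    # newest articles won't be cited until next year.
    cohorts = years - 1
    cited_per_journal = cohorts * cited_articles // journals
    # Each cited article has far more citations than there are cited
    # articles per journal, so h5 is capped by the latter.
    h5 = min(cited_per_journal, cites_per_cited)
    rate = cohorts * yearly_citations / (years * articles_per_year)
    return h5, rate

# Field A: toy_field(1000, 50)   -> (8, 8.0)
# Field B: toy_field(10000, 200) -> (20, 8.0)
```

Same inputs, same conclusions: the citation rate is identical in both fields, while field B’s h5 is two and a half times field A’s.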

So publication volume has indeed affected the h5 statistic, though perhaps in a slightly different way than Dunleavy was talking about. The change in the number of articles published per year had no effect on its own. But the change in the number of articles published per journal had a dramatic effect. Had the number of journals in field B also gone up by a full order of magnitude, to five hundred, there would have been no difference in either statistic; had field B’s journal count merely doubled relative to field A’s fifty, to one hundred, the difference in their h5 statistics would have been even more pronounced. This might seem a bit like cheating: I didn’t scale all the values equally. But that’s arguably more realistic. A larger field will support — and may even require — larger journals that publish more frequently.

Now consider the fact that whereas Nature publishes weekly issues that each contain between ten and twenty articles and “letters” (with full bibliographies), even a very large, respected humanities journal like PMLA might publish only four or five issues a year, each containing between ten and twenty articles and other papers with full bibliographies. That’s a minimum of five hundred articles per year from Nature, compared to a minimum of fifty per year from PMLA. An order of magnitude difference.

Once you’ve worked through the mathematics, it’s not surprising. Journals that publish more articles will naturally capture more citations, all else being equal. And it’s a pattern that you can see in real data. Consider SCIMAGO’s list of top journals. The correlation between the “Total Docs” statistic and the “H index” statistic is immediately noticeable. Try sorting the output by “H index” — the first journal with fewer than five hundred publications over three years is ranked fifty-ninth. Sixty-three of the first hundred have more than a thousand citable documents over three years, and many have more than three thousand. Most humanities journals have fewer than two hundred. In total, the SCIMAGO database contains more than a thousand journals with a thousand citable documents over three years. None of them are dedicated to the humanities.6

At one point, Dunleavy wrote “the gulf charted here isn’t the product of just a few science super-journals.” What about a thousand science super-journals?

Simulating Citation Networks

But let’s assume that’s all just a coincidence. It might not hold up to further scrutiny. And recall that the assumptions I made for the simple calculations above are highly artificial. What would happen if we used a more realistic set of assumptions? I decided to try creating a citation simulator to see. Rather than trying to work out some kind of probabilistic closed-form h5 equation, I wrote a script that simulates thirty years of publication, displaying h5 values and citation rates for each year. I found that its behavior was unpredictable, and sensitive in complex ways to various inputs. But the results also seemed reasonable — they looked like the kinds of statistics one sees browsing through Google Scholar.

It still makes simplifying assumptions that are not realistic, but it does a much better job imitating the particular kind of rich-get-richer power law behavior of citation networks. There are no arbitrary values determining which articles will be cited and which will not be, but articles that already have citations will be more likely to receive additional citations. Here’s an enumeration of the assumptions the simulator makes, and the values it allows you to tune:

  1. A set number of journals publish articles in a given field. The number is tunable.
  2. Each journal publishes a set number of issues per year. The number is tunable.
  3. Each journal publishes a set number of articles per issue. The number is tunable.
  4. Each article cites a set number of other articles. The number is tunable.
  5. Citations for each article are chosen randomly, but with a bias towards articles that have already received citations. The probability that a given article will receive a new citation is proportional to the number of citations it has already received, plus one. The code provides a tunable skew parameter that strengthens or weakens the bias towards oft-cited articles.7 Articles become available for citation in the issue cycle after they are published.
  6. Between each issue cycle, some articles are forgotten or randomly superseded by others, and expire for the purpose of citation. The probability that an article will expire in a given cycle is the same for all articles, and is tunable.

To the degree that these parameters correspond to actual scholarly practices, a number of them are likely to vary widely between disciplines. For example, the speed with which articles expire in the humanities will probably be lower, and so older articles will be cited more often. And the number of issues published per year will often be lower. As it happens, those are two values that the h5 index is very sensitive to. It’s often less sensitive to the number of citations per article. For example, given the simulator’s initial default settings, if you multiply the number of issues per year by ten, the top journal’s h5 index increases almost threefold, for an average increase of about four percent per additional issue. But given those same initial defaults, if you multiply the number of outgoing citations per article by ten, the top journal’s h5 index changes by just thirty-five percent — an average increase of about four tenths of a percent per additional citation.8 A field that wanted to double its h5 numbers under these circumstances could publish twenty or twenty-five more issues of each journal per year — or cite two hundred more sources per article on average.

The sensitivity of the index to the number of outgoing citations depends partially on the bias parameter; when the bias towards famous articles is lower, increasing the number of outgoing citations has a greater effect. But the bias has to be quite low — distributing citations almost evenly among articles that haven’t been forgotten or superseded — before changes in outgoing citation rate are as significant as changes in the number of issues per journal. This pattern also makes sense in light of the calculations above. The citations were far too concentrated on the top articles; had they been spread out among other articles, the resulting h5 scores would have been higher. The bias and decay values also influence the relationship between outgoing citations and the field-wide citation rate; for some values, the field-wide citation rate can be as low as five percent of the outgoing citation rate, because so many of the outgoing citations are going to older articles that have expired for the purpose of the calculation.

There are a number of phenomena the simulator does not try to model at all. For example, it assumes that there is no particular bias in favor of one journal over another. Arguably even a mild bias could skew the results dramatically. In its current form, the simulator tends to produce fields that balance citations relatively evenly across all journals. A more realistic simulation might distribute the majority of citations over the top thirty or forty percent of journals; this would probably drive those journals’ h5 indices even higher.

But my goal is not to produce a perfectly realistic simulator. My goal is to show that a simulator that approaches even a moderate level of realism produces complex, unpredictable, nonlinear relationships between many different variables. Suppose we assume for the moment that the numbers that Google Scholar produces for humanities journals are as reliable as the numbers it produces for the sciences.9 And suppose we assume that we really should want the h5 indices of our journals to go up. We can’t expect to get a straightforward linear response by citing more articles and crossing our fingers. Given some very reasonable assumptions about publication conventions in the humanities, there’s a good chance that citing more articles will have only a small effect. The effect will certainly not be large enough to address the score gap between the humanities and the sciences. Other decisions about citation will matter more: which articles we cite, how recently they were published, which journals they were published in, and the number of articles those journals publish.

The assumptions that lead to those conclusions are not based on any evidence. This simulator can’t tell us anything about the actual state of the humanities. Perhaps a field-wide increase in outgoing citation rates would dramatically boost incoming citation rates and h5 scores. We can’t be certain without more careful investigation.

However, that means being doubly skeptical of hasty conclusions that reinforce popular stereotypes about the humanities and the sciences. I was troubled at times while reading Dunleavy’s post — especially when he implied that fields with lower citation rates are more likely to harbor scholars who are “ignoring other views, perspectives and contra-indications from the evidence.” The humanists I know make a special effort to do just the opposite, because they know that the kind of research we do is often more vulnerable to ideological bias than research in the sciences. And we can’t shift the burden of objectivity onto our methods; we can’t pretend to be passive spectators, as some scientists might. To do good intellectual work, we have to confront our bias directly, paying careful attention to conflicting evidence from multiple perspectives. That’s challenging, certainly, but I’m not at all convinced that we draw our conclusions in more biased ways than scientists do.

It also troubled me when Dunleavy cited an op-ed by Mark Bauerlein suggesting that literary scholars should give up doing research altogether. Bauerlein is building his reputation as an ivory tower provocateur, and some have used even stronger language to describe his recent output — words like “trolling” and “clickbait” come to mind. The fact that he and Dunleavy might agree about this doesn’t exactly give me confidence that Dunleavy’s perspective is unbiased.

Despite those issues, I remain sympathetic to his call for citation reform. I would not have written this post if his post hadn’t called for a thoughtful response. His suggestion of adopting systematic review deserves serious consideration; it might even address some of the factors that could be leading to the apparent dearth of incoming citations in the humanities, because it concerns not only the number of outgoing citations, but also their distribution over time and across the field. Although the literary scholars I know conduct secondary research in thorough and systematic ways, they each do it a little differently. It would be helpful to articulate a clearer set of field-wide standards for secondary research and citation practices.

But if we choose to do that, we shouldn’t worry about increasing the h5 statistics of our journals. We shouldn’t worry about the impact our work has within some arbitrary time frame. We should worry about creating better literary scholarship.

  1. “Common Knowledge: Epistemology and the Beginnings of Copyright Law,” forthcoming in PMLA. 
  2. It has been cited exactly 11111 times as of this writing, reports Google Scholar. 
  3. Unless, that is, it turns out that by citing secondary sources fifty percent more, we could increase our citation rate by four or five times. That seems unlikely. 
  4. I cited several historians of philosophy in my paper — those citations were “lost” for my field. When this occurred to me I thought “Oops! Well, C’est la vie.” Never mind that this directly contradicts Dunleavy’s advice to avoid “discipline-siloed” citation practices. 
  5. To be perfectly explicit, I am interpreting Dunleavy’s claim as entailing the contrapositive of the following premise: if higher publication volume significantly increases h5 statistics, then higher publication volume explains at least part of the gap between the humanities and the sciences. 
  6. The size of these journals is surely related to the publication incentive structures at work in the sciences. And at least one Nobel-winning scientist, Randy Schekman, has argued that scientific incentive structures are broken: “Mine is a professional world that achieves great things for humanity. But it is disfigured by inappropriate incentives.” 
  7. The citation sampler selects papers using a bin-based sampling process that’s fast and works intuitively, but that has zero theoretical justification. The papers are placed in an array of bins; papers with more citations appear towards the beginning of the array, and get more bins. Then a random number between zero and one is chosen and multiplied by the number of bins. That number is used as an index into the array of bins. The bias parameter is applied in one of two ways. If it’s greater than one, then the random number is raised to the power of the bias parameter before being multiplied by the number of bins. So if the bias parameter is two, then the square of the number is used. This pushes the values downwards — recall that the square of one half is one quarter — towards the most often cited papers. If the bias parameter is less than one, then the number of bins allocated to each paper is raised to the power of the bias parameter. So if the bias parameter is one half, then a paper that already has sixteen citations gets only four bins. This strategy produces a smooth transition between the two kinds of bias, and produces fewer bins than citations when possible, but never more. 
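As a sketch, the bin strategy might look like this in Python. This is my paraphrase of the description above, not the gist’s actual code; the allocation of citations + 1 bins per paper follows assumption 5 in the main text.

```python
import random

def sample_citation(citation_counts, bias=1.0):
    """Pick a paper to cite, favoring already-cited papers.

    citation_counts[i] is paper i's current citation count. Papers
    with more citations sit at the front of the bin array and get
    more bins (citations + 1, so uncited papers stay reachable).
    """
    bins = []
    for paper, count in sorted(enumerate(citation_counts),
                               key=lambda pc: -pc[1]):
        n = count + 1
        if bias < 1:
            # Sublinear bins: a paper with sixteen citations and
            # bias 0.5 gets only about four bins.
            n = max(1, round(n ** bias))
        bins += [paper] * n
    r = random.random()
    if bias > 1:
        # Raising r to a power above one pushes the draw towards
        # index zero, where the most-cited papers live.
        r **= bias
    return bins[int(r * len(bins))]
```

With bias above one the exponent is applied to the random draw; with bias below one it’s applied to the bin counts instead, which is what keeps the number of bins from exceeding the number of citations.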
  8. The script is set with the following defaults: five hundred journals, five issues per year, ten articles per issue, and ten citations per article. The bias parameter is set to one (a standard rich-get-richer bias), and the decay parameter is set to seventy-five percent. 
  9. There are strong reasons to doubt this. Google Scholar has some fairly specific requirements for inclusion. Do as many humanities journals as science journals worry about meeting those requirements? Almost certainly not. And Google’s coverage for journals in my field looks incredibly dodgy. I don’t blame Google for that — but it certainly should have some bearing on the kind of reform we aim for. Let’s work on getting our journals properly indexed before we start overhauling our entire field. 

Model Corpora for Distant Reading Research

There has been some recent discussion on Twitter in response to Alan Liu’s call for model corpora for students. It’s an extremely exciting conversation that’s long overdue. We need better model corpora. And I think it’s important for DH scholars to recognize that this is not just a pedagogical requirement — it’s a requirement for research at every level.1

Consider the field of machine learning. Almost anyone who has ever done research in the field has heard of the MNIST database. It’s a collection of 70,000 handwritten numerical digits (60,000 for training and 10,000 for testing), labeled with their actual values, preprocessed for ease of use, and carefully designed to pose a substantial challenge for learning algorithms. Achieving a decent error rate isn’t too hard, but achieving a very low error rate seems to require things like 6-layer feed-forward neural networks with thousands of neurons in each layer.

What’s great about the MNIST database is that many, many people have used it to test their work. Now, whenever you try a new learning algorithm, you can toss the MNIST data at it. If your algorithm achieves a reasonable error rate, you can feel pretty confident that you’re not making some kind of gross error, and that your algorithm really is recognizing handwriting, rather than noticing and generalizing from irrelevant details. Unfamiliar data could have surprising biases or unexpected patterns that throw off the performance of your algorithm, but the MNIST database is carefully curated to avoid many such problems. That’s vital for teaching, but it’s just as vital for research.

Andrew Ng compares this to the use of model organisms in biology. Mice, for example, consume few resources, develop and reproduce quickly, and share fundamental biological traits with many mammals, including humans. For those reasons, people have been studying them for a very long time and their biology is well understood. Researchers who want to study mammals, and who are already committed to making the kinds of ethical trade-offs that animal experimentation entails, will almost certainly start with mice or other related model organisms. The epistemological, practical, and ethical benefits are manifold. There will be fewer ways for the work to go wrong in unexpected ways, the research will proceed more efficiently, and fewer animals will suffer overall.

Fortunately, digital humanists don’t face the same ethical questions as biologists. Our “mouse models” can consist of nothing but data. But we still don’t have enough of them.2

I found the absence particularly frustrating as I sat down to play with Syuzhet a couple of weeks ago. I was initially at a loss about what data to use. It quickly occurred to me that I ought to start with Romeo and Juliet because that’s what other researchers had used, and for good reason. It’s familiar to a broad range of audiences, so it’s easy to get data about it from actual human beings. It has large variations in emotional valence with relatively clear trajectories. And well, you know — it’s Shakespeare. But one play by one author isn’t really good enough. What we need is a model corpus — or rather, many model corpora from different periods, in different genres, different languages, different registers, and so on.

There has been some discussion about how to construct corpora that are representative, but in these discussions, the question is often about whether the corpus gives us access to some kind of ground truth about the culture from which it is sampled.3 That’s an extremely important question — one of the most important in the digital humanities. But I worry that we’re not quite ready to begin answering it. We don’t know whether corpora are representative, but we also don’t know for sure what tools to use in our investigations. And it’s going to be difficult to refine our tools and build new representative corpora at the same time. In our rush to take advantage of the sudden ubiquity of literary and historical data, we might be skipping a step. We need to understand the tools we use, and to understand the tools we use, we need to test them on corpora that we understand.4

From one perspective, this is a matter of validation — of checking new results against what we already know.5 But it’s a different kind of validation than many of us are used to — where by “us” I mean mostly literary scholars, but possibly other humanists as well. It doesn’t ask “is this a claim that comports with known evidence?” It asks “what is the precise meaning of this claim?” This second question becomes important when we use an unfamiliar tool to make a claim; we need to understand the claim itself before saying whether it comports with known evidence.

From another perspective, this is a matter of theorization — of clarifying assumptions, developing conceptual structures, and evaluating argumentative protocols. But it’s a different kind of theory than many of us are used to. It doesn’t ask “what can we learn by troubling the unspoken assumptions that make this or that interpretation seem obvious?” It asks “how can we link the representations these tools produce to familiar concepts?” Literary theory has often involved questioning the familiar by setting it at a distance. But distant reading already does that — there are few obvious interpretations in the field right now, and theory may need to play a slightly different role than it has in previous decades.

From either perspective, the point of a model corpus would not be to learn about the texts in the corpus, or about the culture that produced those texts. It would be to learn about the tools that we use on that corpus, about the claims those tools might support, and about the claims they cannot support.

But what should a model corpus look like? This is where I become less certain. My first instinct is to say “let’s look at what corpus linguists do.” But the kinds of questions linguists are likely to ask are very different from those literary scholars tend to ask. Still, there are some great starting points, including a remarkably comprehensive list from Richard Xiao. Among those, the ARCHER corpus seems particularly promising. (Thanks to Heather Froelich for these suggestions!)

But in the long run, we’ll want to produce our own corpora. Fortunately, Alan Liu has already given this some thought! His focus is on pedagogical issues, but again, the kinds of model corpora he talks about are vital for research as well. On Twitter, he offered a brilliant enumeration of desirable qualities such corpora would have. I’m reproducing it here, lightly paraphrased:

Imagine what a ready-for-student-use corpus of literary materials would look like. Specs include the following:

  1. Free and public domain.
  2. Of manageable size (e.g., low thousands and not hundreds of thousands of items).
  3. Modular by nation, genre, period, language.
  4. Socioculturally diverse.
  5. Richly annotated with metadata.
  6. Pre-cleaned and chunked (or packaged with easy-to-use processing scripts).
  7. Compatible in format with similar corpora of materials in history and other fields (to encourage cross-domain experiments in analysis).
  8. Matched by period and nation to available linguistic corpora that can be used as reference corpora.

I think this is a terrific set of initial goals for model corpora, both for researchers and students. We’ll need to keep having conversations about requirements, and of course no one corpus will serve all our needs. Different disciplines within the humanities will have different requirements. But it’s clear to me that if digital humanists can develop a “canon” of familiar corpora useful for validating new tools, the field will have taken a significant step forward.

Let’s get started!


Since this post already links to several helpful resources for thinking about humanistic corpora, I’m going to start a corpus creation and curation bibliography here. It will probably graduate into its own page or post.

  1. Update: Since posting this, I’ve learned that Laura Mandell, in collaboration with the NovelTM partnership, is working on a proposal for a journal dedicated to digital humanities corpora. I think this will be a fantastic development for the field! 
  2. There are some examples — such as the much-studied Federalist papers, which might be a good dataset to consider for testing new approaches to authorship attribution. And of course there are numerous standard corpora for use by corpus linguists — the Brown, AP, and Wall Street Journal corpora come to mind for American English, and there are many others. But these corpora have not been selected with literary studies in mind! This is where I parade my ignorance in the hope that someone will enlighten me: are there other corpora designed to be of specific methodological interest to literary scholars? 
  3. Most recently, Scott Weingart took up the question of corpus representativeness in his discussion of the Great Tsundoku, focusing on questions about what was written, what was published, and what was read. He also noted a conversation from a few years ago that Ted Underwood documented, and drew attention to Heather Froelich, who does lots of careful thinking about issues like this. And needless to say, Franco Moretti was thinking about this long ago. I think we’ll be asking this question in different ways for another fifty years. 
  4. Initially I said “We need to understand the tools we use first,” but that’s not quite right either. There’s a cyclical dependency here! 
  5. This has been a topic of widespread interest after the recent Syuzhet conversation, and I think the kinds of collective validation that Andrew Piper and others have called for would be made vastly easier by a somewhat standardized set of model corpora familiar to many researchers.