Model Corpora for Distant Reading Research

There has been some recent discussion on Twitter in response to Alan Liu’s call for model corpora for students. It’s an extremely exciting conversation that’s long overdue. We need better model corpora. And I think it’s important for DH scholars to recognize that this is not just a pedagogical requirement — it’s a requirement for research at every level.1

Consider the field of machine learning. Almost anyone who has ever done research in the field has heard of the MNIST database. It’s a collection of 70,000 handwritten digits (60,000 for training and 10,000 for testing), labeled with their actual values, preprocessed for ease of use, and carefully designed to pose a substantial challenge for learning algorithms. Achieving a decent error rate isn’t too hard, but achieving a very low error rate seems to require things like 6-layer feed-forward neural networks with thousands of neurons in each layer.

What’s great about the MNIST database is that many, many people have used it to test their work. Now, whenever you try a new learning algorithm, you can toss the MNIST data at it. If your algorithm achieves a reasonable error rate, you can feel pretty confident that you’re not making some kind of gross error, and that your algorithm really is recognizing handwriting, rather than noticing and generalizing from irrelevant details. Unfamiliar data could have surprising biases or unexpected patterns that throw off the performance of your algorithm, but the MNIST database is carefully curated to avoid many such problems. That’s vital for teaching, but it’s just as vital for research.
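To make that sanity check concrete, here is a minimal sketch in Python. It is not tied to MNIST or to any particular library; the function names (`error_rate`, `majority_baseline_error`, `beats_baseline`), the `margin` parameter, and the toy parity data are all assumptions invented for illustration. The idea is simply that a well-understood labeled dataset lets you compare a new algorithm's error rate against a trivial baseline before trusting it on unfamiliar data.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def error_rate(predict: Callable, examples: Sequence, labels: Sequence[Hashable]) -> float:
    """Fraction of examples whose predicted label differs from the true label."""
    if not labels or len(examples) != len(labels):
        raise ValueError("examples and labels must be non-empty and of equal length")
    wrong = sum(1 for x, y in zip(examples, labels) if predict(x) != y)
    return wrong / len(labels)


def majority_baseline_error(labels: Sequence[Hashable]) -> float:
    """Error rate of a 'classifier' that always guesses the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 1 - most_common_count / len(labels)


def beats_baseline(predict: Callable, examples: Sequence, labels: Sequence[Hashable],
                   margin: float = 0.05) -> bool:
    """True if the model beats the trivial baseline by at least `margin` (hypothetical threshold)."""
    return error_rate(predict, examples, labels) + margin <= majority_baseline_error(labels)


# Toy stand-in for a benchmark: the integers 0-9, labeled by parity.
examples = list(range(10))
labels = ["even" if n % 2 == 0 else "odd" for n in examples]


def parity(n: int) -> str:
    """A 'model' that has actually learned the pattern."""
    return "even" if n % 2 == 0 else "odd"


def always_even(n: int) -> str:
    """A 'model' that has learned nothing and always gives the same answer."""
    return "even"


print(beats_baseline(parity, examples, labels))
print(beats_baseline(always_even, examples, labels))
```

On a shared benchmark, the interesting comparison is usually not against a trivial baseline but against the error rates other people have already published for the same data. The baseline check is only the first, cheapest filter.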

Andrew Ng compares this to the use of model organisms in biology. Mice, for example, consume few resources, develop and reproduce quickly, and share fundamental biological traits with many mammals, including humans. For those reasons, people have been studying them for a very long time and their biology is well understood. Researchers who want to study mammals, and who are already committed to making the kinds of ethical trade-offs that animal experimentation entails, will almost certainly start with mice or other related model organisms. The epistemological, practical, and ethical benefits are manifold. The work will have fewer opportunities to go wrong in unexpected ways, the research will proceed more efficiently, and fewer animals will suffer overall.

Fortunately, digital humanists don’t face the same ethical questions as biologists. Our “mouse models” can consist of nothing but data. But we still don’t have enough of them.2

I found the absence particularly frustrating as I sat down to play with Syuzhet a couple of weeks ago. I was initially at a loss about what data to use. It quickly occurred to me that I ought to start with Romeo and Juliet because that’s what other researchers had used, and for good reason. It’s familiar to a broad range of audiences, so it’s easy to get data about it from actual human beings. It has large variations in emotional valence with relatively clear trajectories. And well, you know — it’s Shakespeare. But one play by one author isn’t really good enough. What we need is a model corpus — or rather, many model corpora from different periods, in different genres, different languages, different registers, and so on.

There has been some discussion about how to construct corpora that are representative, but in these discussions, the question is often about whether the corpus gives us access to some kind of ground truth about the culture from which it is sampled.3 That’s an extremely important question — one of the most important in the digital humanities. But I worry that we’re not quite ready to begin answering it. We don’t know whether corpora are representative, but we also don’t know for sure what tools to use in our investigations. And it’s going to be difficult to refine our tools and build new representative corpora at the same time. In our rush to take advantage of the sudden ubiquity of literary and historical data, we might be skipping a step. We need to understand the tools we use, and to understand the tools we use, we need to test them on corpora that we understand.4

From one perspective, this is a matter of validation — of checking new results against what we already know.5 But it’s a different kind of validation than many of us are used to — where by “us” I mean mostly literary scholars, but possibly other humanists as well. It doesn’t ask “is this a claim that comports with known evidence?” It asks “what is the precise meaning of this claim?” This second question becomes important when we use an unfamiliar tool to make a claim; we need to understand the claim itself before saying whether it comports with known evidence.

From another perspective, this is a matter of theorization — of clarifying assumptions, developing conceptual structures, and evaluating argumentative protocols. But it’s a different kind of theory than many of us are used to. It doesn’t ask “what can we learn by troubling the unspoken assumptions that make this or that interpretation seem obvious?” It asks “how can we link the representations these tools produce to familiar concepts?” Literary theory has often involved questioning the familiar by setting it at a distance. But distant reading already does that — there are few obvious interpretations in the field right now, and theory may need to play a slightly different role than it has in previous decades.

From either perspective, the point of a model corpus would not be to learn about the texts in the corpus, or about the culture that produced those texts. It would be to learn about the tools that we use on that corpus, about the claims those tools might support, and about the claims they cannot support.

But what should a model corpus look like? This is where I become less certain. My first instinct is to say “let’s look at what corpus linguists do.” But the kinds of questions linguists are likely to ask are very different from the ones that literary scholars are likely to ask. Still, there are some great starting points, including a remarkably comprehensive list from Richard Xiao. Among those, the ARCHER corpus seems particularly promising. (Thanks to Heather Froehlich for these suggestions!)

But in the long run, we’ll want to produce our own corpora. Fortunately, Alan Liu has already given this some thought! His focus is on pedagogical issues, but again, the kinds of model corpora he talks about are vital for research as well. On Twitter, he offered a brilliant enumeration of desirable qualities such corpora would have. I’m reproducing it here, lightly paraphrased:

Imagine what a ready-for-student-use corpus of literary materials would look like. Specs include the following:

  1. Free and public domain.
  2. Of manageable size (e.g., low thousands and not hundreds of thousands of items).
  3. Modular by nation, genre, period, language.
  4. Socioculturally diverse.
  5. Richly annotated with metadata.
  6. Pre-cleaned and chunked (or packaged with easy-to-use processing scripts).
  7. Compatible in format with similar corpora of materials in history and other fields (to encourage cross-domain experiments in analysis).
  8. Matched by period and nation to available linguistic corpora that can be used as reference corpora.

I think this is a terrific set of initial goals for model corpora, both for researchers and students. We’ll need to keep having conversations about requirements, and of course no one corpus will serve all our needs. Different disciplines within the humanities will have different requirements. But it’s clear to me that if digital humanists can develop a “canon” of familiar corpora useful for validating new tools, the field will have taken a significant step forward.
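To give a rough sense of what a few of these specs might mean in practice (rich metadata, pre-chunked text, and modularity by nation, genre, period, and language), here is a small Python sketch. Everything in it is an assumption for illustration: the `CorpusItem` fields, the `chunk_words` and `select` helpers, and the two-item toy corpus, whose texts are just the opening words of each work. A real model corpus would need a far more carefully considered schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CorpusItem:
    """One text plus the metadata a reader would need to filter on (hypothetical schema)."""
    item_id: str
    title: str
    author: str
    year: int       # year of first publication
    nation: str
    genre: str
    language: str
    text: str


def chunk_words(text: str, size: int) -> List[str]:
    """Split a text into consecutive chunks of at most `size` whitespace-separated words."""
    if size < 1:
        raise ValueError("size must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def select(items: List[CorpusItem], **criteria) -> List[CorpusItem]:
    """Return the items whose metadata matches every keyword given, e.g. genre='drama'."""
    return [item for item in items
            if all(getattr(item, field) == value for field, value in criteria.items())]


# A two-item toy corpus; the texts are only the opening words of each work.
corpus = [
    CorpusItem("rj", "Romeo and Juliet", "William Shakespeare", 1597,
               "England", "drama", "English",
               "Two households, both alike in dignity"),
    CorpusItem("je", "Jane Eyre", "Charlotte Brontë", 1847,
               "England", "novel", "English",
               "There was no possibility of taking a walk that day."),
]

drama = select(corpus, genre="drama")
print([item.title for item in drama])
print(chunk_words(drama[0].text, 4))
```

The point of the `select` helper is the modularity spec: if every item carries the same metadata fields, a student or researcher can carve out a sub-corpus by nation, genre, or period without touching the texts themselves.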

Let’s get started!


Since there are already several links to helpful resources for thinking about humanistic corpora, I’m going to start a corpus creation and curation bibliography here. This will probably graduate into its own page or post.

  1. Update: Since posting this, I’ve learned that Laura Mandell, in collaboration with the NovelTM partnership, is working on a proposal for a journal dedicated to digital humanities corpora. I think this will be a fantastic development for the field! 
  2. There are some examples, such as the much-studied Federalist Papers, which might be a good dataset to consider for testing new approaches to authorship attribution. And of course there are numerous standard corpora for use by corpus linguists; the Brown, AP, and Wall Street Journal corpora come to mind for American English, and there are many others. But these corpora have not been selected with literary studies in mind! This is where I parade my ignorance in the hope that someone will enlighten me: are there other corpora designed to be of specific methodological interest to literary scholars?
  3. Most recently, Scott Weingart took up the question of corpus representativeness in his discussion of the Great Tsundoku, focusing on questions about what was written, what was published, and what was read. He also noted a conversation from a few years ago that Ted Underwood documented, and drew attention to Heather Froehlich, who does lots of careful thinking about issues like this. And needless to say, Franco Moretti was thinking about this long ago. I think we’ll be asking this question in different ways for another fifty years.
  4. Initially I said “We need to understand the tools we use first,” but that’s not quite right either. There’s a cyclical dependency here! 
  5. This has been a topic of widespread interest after the recent Syuzhet conversation, and I think the kinds of collective validation that Andrew Piper and others have called for would be made vastly easier by a somewhat standardized set of model corpora familiar to many researchers. 

8 Responses

  1. I’m very sympathetic to Alan’s list of requirements for literary corpora; he’s completely right that there is no such thing as a standard corpus for learning in what I think is best described as Quantitative Literary-historical Stylistics. But that’s not for a dearth of completely acceptable, very carefully curated corpora which cover literary-historical texts (and an increasing number of digitised corpora of well-recognised textual collections such as EEBO or ECCO). The problem, it seems to me, is not that there are no acceptable teaching corpora, but that the definition of a “literary corpus” hasn’t been decided upon yet.

    For example, in Early Modern studies, the Corpus of Early English Dialogues, which includes dramatic comedies, language teaching dialogues, trial proceedings and witness depositions, is in theory a “literary corpus” because it contains “literary text”. But it also includes records of spoken language which we would now read, but which are not considered “literary” by virtue of their sociohistorical context. Similarly, the EEBO-TCP phase 1 corpus contains religious pamphlets, scientific writing, prose, verse, dramatic writing (plays, masques, interludes) and much more, but only a small subsection of that may be considered “literary” in the sense that it is meant for “literary” reading.

    Likewise, very standard teaching corpora in corpus linguistics, such as the BNC, include newspapers, periodicals, journals, academic books, popular fiction, letters, and essays. All of these texts are meant for reading, but they are not strictly “literary”. Again the Corpus of Contemporary American English and Corpus of Historical American English both include the category of “Fiction”, covering short stories and plays from literary magazines, children’s magazines, popular magazines, selections from newspapers from across the US, peer reviewed academic journals from the Library of Congress classification system, Project Gutenberg texts, non-fiction, movie and play scripts, etc. As for Alan’s point about sociocultural diversity, the Wikipedia corpus and Global Web-Based English may be suitable starting points for widening the discussion (something I agree is completely necessary).

    It’s also worth pointing out that corpus construction is a subfield of corpus linguistics in its own right: here’s a sampling of some widely cited examples; and the gold standard of talking about corpus construction remains Biber’s 1993 paper, “Representativeness in Corpus Design”.

    With all that said, Alan is totally correct that a lot of linguistic corpora are not designed for literary study, in that they either are not full-text, recognisable ‘fun’ texts that students care about, or even accessible as a full-text corpus which is acceptable, but it strikes me as especially curious as there already is an accepted canonicity of literary study which seems ripe for the taking. Some authors, such as Shakespeare, are exceptionally easy to find corpora for; the Project Gutenberg texts make many canonical authors and their contemporaries quite accessible.

    Quantitative literary-historical stylistics needs to decide what is “literary” in the data deluge and think about how to make meaningful connections into established corpora. It would not be a stretch, for example, to make a class project be compiling a companion corpus to the Brontës’ oeuvre and practicing quantitative and qualitative skills as part of building into a larger canonical corpus of literary-historical texts with an established and widely-used set of guidelines, centrally hosted and accepting a growing set of contributions to build a suitable corpus for ongoing literary historical stylistic research.

    But perhaps I dream of a day when the early corpus linguistic contributions such as the Lancaster-Oslo-Bergen corpus and the Brown corpus, both from the 1960s and both of which paved the way for us present-day digital scholars, are more widely recognised as models for non-linguists to continue to improve upon, rather than brushed aside as “not good enough”.

    1. Thank you for this! I think you’re absolutely right that there’s a ton of work in corpus linguistics that people trained in English departments are woefully ignorant of. And — speaking as one of those woefully ignorant people — it might even be that many of those corpora would be useful for literary scholars. The challenge is that those judgements will have to come from a disciplinarily-informed perspective. It’s not so much that these corpora aren’t “good enough” — it’s that most literary scholars just haven’t thought through these particular problems. (Update: And upon reflection, I think that here I’m essentially restating your claim that “the definition of a “literary corpus” hasn’t been decided upon yet.” in different words. But the following question remains.)

      Will it be easier for us to think through these problems by drawing from work in other fields, or will it be easier for us to do so by creating our own corpora?

      My personal tendency is to need to recreate things to understand them. I just can’t use certain computational tools without recreating them myself first. For example, when I was teaching myself sorting algorithms — I had to rewrite the sorting algorithms. I’ll never use that code because it’s slow and wrong in many ways, but the process of writing the code was indispensable. For example, I would be foggy about which algorithms were stable and which unstable if I hadn’t taken the time to rewrite them and think about how they work. Ideally that wouldn’t be necessary, but for me it is.

      I think this might be a case where that logic applies at the level of the entire discipline. By the time we’ve worked out the details for ourselves, we might just conclude that we should use a corpus somebody else already created. But the process of creating a corpus for ourselves might be necessary.

      There are a couple of other points I want to bring out as well. First, you say there “already is an accepted canonicity of literary study which seems ripe for the taking.” But I’m not sure that’s true, if you take the word “accepted” really seriously. I would say there is a “contested canonicity” — a canon part of which should be included as part of any model corpus we develop for literary scholarship. But there will need to be discussion and debate about which part.

      And second, you rightly point out that many texts that would certainly be included are readily available via Project Gutenberg. But in a sense, that’s actually the problem. It’s far too easy for us to create ad-hoc corpora for whatever we happen to be working on. But that’s not as useful as a standard model corpus because it’s not stable. The value of the MNIST database is that thousands of people have worked with and become familiar with the exact same data. I think that’s true of corpora like the Brown corpus too — but what corpus has that status among literary scholars? I don’t know of one. And I’m not sure that it will be easy to convince literary scholars to use one particular corpus en masse that doesn’t have high-profile backing from scholars in the field.

      Finally, I think it’s worth distinguishing the field of literary study not only by the texts we use as evidence (a body that is in rapid and unpredictable flux), but also by the questions we ask. Those questions may well lead us to choose different texts than linguists would, but not for reasons that necessarily have to do with their “literariness” in any obvious sense.

      But all of these points are essentially pragmatic. I think it’s vital for anyone thinking about these issues to become much more familiar with the resources you’ve drawn attention to, and I, for one, have a lot more reading to do now!

      1. There’s a well-rehearsed argument from the literary text miners (or whatever they choose to call themselves these days) in modern digital humanities discourse which says “we don’t want representative corpora; we have increasingly large collections of data at our fingertips and we don’t care about questions of balance. We want to KNOW MORE” (here is one such example of this, from 2013; but the issue keeps cropping up again: more recently in Scott Weingart’s Not Enough Perspectives blog post). As far as I can tell, their thing seems to be that we don’t need representativeness, except when we do.

        To a certain extent, they are completely correct: all the linguistic corpora I enumerated above are representative, balanced corpora for linguistic inquiry, not just Data Because It’s There. Linguistic corpora are not completely appropriate for making sense of the myth of digital everything, because they were designed without the myth of the digital everything in mind: early projects such as the LOB/FLOB/Brown corpora were designed to show the widest swath of language in use at the time that computers could handle. As we increasingly have terabytes of storage at our disposal, perhaps we don’t need to start carefully sampling and curating corpora; we have the space and the scope for it. In hopes of transparency, I’m completely on board with this model: wild untamed corpora are in some ways far more interesting than highly curated corpora for a specific purpose.

        However, the issue of representativeness keeps cropping up, which is why I keep motioning towards linguistic corpora. It may be, as you say, siloing and ignorance between highly compatible fields, and it almost certainly is true that linguists and DH folks have entirely different end goals in mind. But it’s also worth pointing out that early corpus work was about putting literary texts through computers in search of patterns (Ben Ross Schneider Jr.’s brilliant Travels in Computerland; or, Incompatibilities and Interfaces, 1974, seems to be out of print but is wonderfully illustrative of this early work). Convergence in approaches is likely a good thing, but the DH community is repeatedly turning back towards questions which have long been established elsewhere. Turning away from highly established models when looking for one does not seem to be a wise approach to me, but disciplinarity may be rearing its ugly head here in that I have always been between literary studies and linguistics and have always been thinking about how the two overlap. I can’t separate them.

        As for the issue of replicability, a well documented corpus (and there are many to be found on the CoRD page) will tell you exactly what’s in it, how it has or has not been annotated, what the sampling techniques were, file naming system, word counts, curators, and much more. This is not so different from GitHub documentation on a script or a CRAN package for R. Writing documentation is something that is an excellent teaching technique: how would someone replicate this study? who does what kind of work?

        And you’re correct that there’s an education in contributing to a corpus; I’m not discouraging the idea of doing it yourselves. I do think, however, there’s a lesson to be learned in balance and sampling for a curated teaching corpus: what percentage of the corpus should be Dead White Men? What percentage should be women’s writing, what percentage should be people of colour, what issues of socioeconomic class can we account for in our study? These immediately will beget research questions, such as who uses what kinds of language and register and how? What intersecting features can contribute to what kind of understanding of printed literary language? These are useful questions which would nicely serve an iterative course which centres on contributing to a larger database for someone (or many someones) to use. And it’s doable in 10 weeks. I believe Katherine D Harris (San Jose State) does a course with this underlying principle; I’ve taught on one in the UK which addresses Shakespeare’s language.

        You’re also right to raise the question of canonicity: at risk of going on too long here, I’ll just say that in some senses it’s much more interesting to see that the texts Jockers used to illustrate Syuzhet, as one example, or more broadly the texts that most projects seem to start with, are the ones we already know a lot about. Surely that’s an acceptance of literary canonicity in that they’re easily accessible and widely understood? The promise of the myth of digital everything is that we can find information about thousands if not millions of “lost” or unread texts, but we start with the ones we understand best to understand our tools and that seems unlikely to be accidental!

        1. Quite so! Regarding your last comment — it’s definitely not accidental. That’s also the approach I’ve used when looking at new tools, and I think that gets at something central here: It might make sense to think of the canon as a kind of vernacular! Upon reflection, that might be more surprising to me than it is to you.

          I strongly agree with this as well: “Turning away from highly established models when looking for one does not seem to be a wise approach to me.” And your following remark about disciplinarity is spot on — I’m taking a slightly more conservative perspective on doing interdisciplinary work. I constantly have Jonathan Kramnick’s “Against Literary Darwinism” (Critical Inquiry, Winter 2011) running in the back of my mind during conversations like these. (You can download it and a scathing follow-up at his faculty page). It looks at first like he’s engaging in border warfare, but I think he’s actually carving out space for himself and others to do peaceful and mutually beneficial interdisciplinary work.

          Anyway, there’s lots to think about here. This will be extremely helpful to me — and I hope to others as well; thanks again.

  2. I am interested in your question: “whether the corpus gives us access to some kind of ground truth about the culture from which it is sampled.”

    I have seen scholars classify data according to categories that I think are misguided because they are ahistorical and then draw conclusions based on that categorization. Perhaps practicing with model corpora would be one way to instill rigor in both scholars’ and students’ thinking about these issues.

    To construct these model corpora, collaboration between disciplines is important, not only for the skills brought from multiple fields, but because it would create opportunities to “naively” raise questions about areas that are sometimes treated as settled.

    1. I couldn’t agree more that model corpora could help scholars think more critically about other corpora — that’s a really important insight. For example, it would be very interesting to see whether certain approaches to clustering create coherent results in one period, while creating incoherent results in another — we could potentially learn a lot about the history of genre, and it would also help us understand how to construct new corpora for individual projects. This strikes me as the kind of sub-disciplinary cross-validation that’s not very easy right now.

      I also think that your point about interdisciplinary conversation is really important — as Heather Froehlich’s contributions have already shown! I’m looking forward to seeing more naive and not-so-naive questions about this problem.

  3. Thanks, Scott, for creating a home for this conversation and also to everyone who is participating in it.

    Over the past couple of decades, libraries have created huge collections of digitized text. Now we are looking for ways to make those collections available to scholars who want to do more than simply read a digital version of a text. Whether that be computational linguistics, topic modeling or just basic word clouds, I just want the collections to be helpful.

    Heather Froehlich’s comments here and on Twitter show that there are several places where scholars can find things to study, but I’m wondering if these are either too disparate or too disciplinarily focused for the question Alan Liu has raised. What he seemed to be pointing to is ready-made corpora for using computational methods to study literature. (Some of the resources linked to above may meet that need; I just need to spend a bit more time looking at them.)

    I’d be really interested to hear from people reading this what they would like to see. Prof. Liu’s list gives us a good place to start and I’d like to see more comments about it.
    – Would a “library of libraries” be enough to link all of these collections?
    – If it were to be gathered together, where would scholars want to go to look for this kind of data: an existing digital library? A professional organization/association? Which one?
    – What formats and affordances would be most useful?
    – Are there useful models we could use as examples?

    Some librarians are already talking amongst ourselves about how to join forces and make our collections more discoverable and interoperable. Any ideas would be greatly appreciated.

    1. I think you’re absolutely right. There’s suddenly such a huge amount of material that we risk creating a thousand different projects that each splinters off into its own discursive universe.

      I think Liu’s call for a moderately sized corpus is key here: a corpus that isn’t too hard to develop some “close reading” familiarity with. People will still be assembling corpora for their own projects, but a model corpus would ensure that they can test out any new techniques and display the results for public validation and critique. Others can use both “close” and “distant” familiarity with the corpus to assess these new methods.

      To offer a specific example, I’ve been interested in Ben Schmidt’s experiments with topic modeling across narrative time. But I recently learned of some researchers in computer science who have incorporated sequential information into their statistical model, and the diagrams that it produces look smoother.

      The question is this: are they better? It’s hard to tell. If a lot of us knew one corpus really well, we could compare results and discuss the question in a more informed way. This is sort of what we’re already doing with canonical works (see Heather Froehlich’s comments!), but we’re doing it in a way that’s not very systematic or reproducible.
