The Radical Potential of RDF Dimension Reduction? (original version)

Note (2016-05-25): The event described below has happened, and any event-specific information here is no longer up-to-date. An updated version is available here.

In about eight hours, Hanna Wallach will give a talk at today’s morning session of this week’s Culture Analytics workshop. I really wish I could be there. In it, she’ll be talking about a new Bayesian dimension-reduction technique that operates over timestamped triples. In other words — as she and her co-authors put it in their paper — “records of the form ‘country i took action a toward country j at time t’ — known as dyadic events.”

The talk as well as the paper frame this as a way of analyzing international relations. But what struck me immediately about this model is that it works with data that could be represented very naturally as RDF triples (if you add in a timestamp, that is). That means that this method might be able to do for RDF triples what topic modeling does for texts.

This probably seems like an odd thing to care about to people who haven’t read Miriam Posner‘s keynote on the radical potential of DH together with Matthew Lincoln‘s response. In her keynote, Posner poses a question to practitioners of DH: why do we so willingly accept data models that rely on simplistic categories? She observes, for example, that the Getty’s Union List of Artist Names relies on a purely binary model of gender. But this is strangely regressive in the wider context of the humanities:

no self-respecting humanities scholar would ever get away with such a crude representation of gender… So why do we allow widely shared, important databases like ULAN to deal so naively with identity?”

She elaborates on this point using the example of context-dependent racial categories:

a useful data model for race would have to be time- and place-dependent, so that as a person moved from Brazil to the United States, she might move from white to black. Or perhaps the categories themselves would be time- and place-dependent, so that certain categories would edge into whiteness over time. Or! Perhaps you could contrast the racial makeup of a place as the Census understands it with the way it’s articulated by the people who live there.

Matt Lincoln’s brilliant response takes this idea and gives it a concrete computational structure: RDF. Rather than having fixed categories of race, we can represent multiple different conceptualizations of race within the same data structure. The records of these conceptualizations take the form of {Subject, Verb, Object} triples, which can then form a network:

A diagram of a network of perceived racial categories.

Given that Posner’s initial model included time as well, adding timestamps to these verbs seems natural, even if it’s not, strictly speaking, included in the RDF standard. (Or is it? I don’t know RDF that well!) But once we have actors, timestamped verbs, and objects, then I think we can probably use this new dimension reduction technique on networks of this kind.1

What would be the result? Think about what topic modeling does with language: it finds clusters of words that appear together in ways that seem coherent to human readers. But it does so in a way that is not predictable from the outset; it produces different clusters for different sets of texts, and those differences are what make it so valuable. They allow us to pick out the most salient concepts and discourses within a particular corpus, which might be very different from place to place. This technique appears to do the very same thing, but with relationships between groups of people over time. We might be able to capture local variations in models of identity within different communities.

What new perspective on identity might emerge from work like this? If you’re at Wallach’s talk today, I hope you’ll bring this up for me!

  1. With a few small adjustments. Since actors and objects are of the same kind in the model, the verbs would need to have a slightly different structure — possibly linking individuals through identity perceptions or acts of self-identification.