The Radical Potential of RDF Dimension Reduction?

Note (2016-06-03): This revises an earlier post by removing some dated information and expanding the conclusion.

This brief post was inspired by the abstract for a talk by Hanna Wallach at the fourth IPAM Culture Analytics workshop. I didn’t even attend it, so take the first part of this with a grain of salt! In her talk, Wallach discussed a new Bayesian dimension-reduction technique that operates over timestamped triples. In other words — as she and her co-authors put it in the paper summarizing their research — “records of the form ‘country i took action a toward country j at time t’ — known as dyadic events.”

The abstract as well as the paper framed this as a way of analyzing international relations. But what struck me immediately about this model is that it works with data that could be represented very naturally as RDF triples (if you add in a timestamp, that is). That means that this method might be able to do for RDF triples what topic modeling does for texts.
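To make that data format concrete, here is a minimal sketch in Python; the country codes, action labels, and years are invented for illustration:

```python
from collections import Counter

# Hypothetical dyadic-event records of the form
# "country i took action a toward country j at time t".
events = [
    ("USA", "sanction", "IRN", 2012),
    ("USA", "sanction", "IRN", 2012),
    ("USA", "sanction", "IRN", 2013),
    ("FRA", "state_visit", "DEU", 2013),
]

# Aggregate into per-cell counts: one count for each
# (source, action, target, time) combination -- the kind of
# count data a Poisson model is suited to.
counts = Counter(events)
print(counts[("USA", "sanction", "IRN", 2012)])  # → 2
```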

This probably seems like an odd thing to care about to people who haven’t read Miriam Posner’s keynote on the radical potential of DH together with Matthew Lincoln’s response. In her keynote, Posner poses a question to practitioners of DH: why do we so willingly accept data models that rely on simplistic categories? She observes, for example, that the Getty’s Union List of Artist Names relies on a purely binary model of gender. But this is strangely regressive in the wider context of the humanities:

no self-respecting humanities scholar would ever get away with such a crude representation of gender… So why do we allow widely shared, important databases like ULAN to deal so naively with identity?

She elaborates on this point using the example of context-dependent racial categories:

a useful data model for race would have to be time- and place-dependent, so that as a person moved from Brazil to the United States, she might move from white to black. Or perhaps the categories themselves would be time- and place-dependent, so that certain categories would edge into whiteness over time. Or! Perhaps you could contrast the racial makeup of a place as the Census understands it with the way it’s articulated by the people who live there.

Matt Lincoln’s brilliant response takes this idea and gives it a concrete computational structure: RDF. Rather than having fixed categories of race, we can represent multiple different conceptualizations of race within the same data structure. The records of these conceptualizations take the form of {Subject, Verb, Object} triples, which can then form a network:

[Figure: a diagram of a network of perceived racial categories.]

Given that Posner’s initial model included time as well, adding timestamps to these verbs seems natural, even if it’s not, strictly speaking, included in the RDF standard. (Or is it? I don’t know RDF that well!) But once we have actors, timestamped verbs, and objects, then I think we can probably use this new dimension reduction technique on networks of this kind.1
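As a rough sketch of what such timestamped records might look like in code — all names and dates below are invented, and this papers over everything an actual RDF serialization involves:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimedTriple:
    subject: str
    predicate: str
    obj: str       # "obj" avoids shadowing Python's builtin "object"
    year: int

# Invented records of perceived and self-ascribed identity,
# following Lincoln's {Subject, Verb, Object} model with a
# timestamp bolted on.
triples = [
    TimedTriple("census_taker", "categorizes_as", "white", 1940),
    TimedTriple("neighbors", "categorize_as", "black", 1940),
    TimedTriple("person_a", "self_identifies_as", "mixed", 1950),
]

def asserted_in(records, year):
    """All statements made in a given year."""
    return [r for r in records if r.year == year]
```

Filtering by year (`asserted_in(triples, 1940)`) is exactly the kind of time- and perspective-dependence Posner asks for: the same person carries different category assertions from different observers at different moments.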

What would be the result? Think about what topic modeling does with language: it finds clusters of words that appear together in ways that seem coherent to human readers. But it does so in a way that is not predictable from the outset; it produces different clusters for different sets of texts, and those differences are what make it so valuable. They allow us to pick out the most salient concepts and discourses within a particular corpus, which might differ case-by-case. This technique appears to do the very same thing, but with relationships between groups of people over time. We might be able to capture local variations in models of identity within different communities.
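For a concrete (if drastically simplified) sense of the mechanics: the Bayesian model in the paper is well beyond a blog post, but a bare-bones non-Bayesian cousin — a CP decomposition of a three-way count tensor, say actor × action × time, fit by multiplicative Poisson/KL updates — can be sketched briefly. This is a toy illustration of the general idea, not the method from the paper:

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product; result has shape (J*K, R)."""
    R = B.shape[1]
    return np.einsum('jr,kr->jkr', B, C).reshape(-1, R)

def poisson_cp(X, rank, n_iter=200, eps=1e-10, seed=0):
    """Rank-`rank` CP decomposition of a 3-way count tensor, fit by
    multiplicative updates that minimize KL divergence (equivalently,
    maximize a Poisson likelihood at a point estimate)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank)) + eps
    B = rng.random((J, rank)) + eps
    C = rng.random((K, rank)) + eps
    for _ in range(n_iter):
        for mode in range(3):
            # Unfold X along the current mode; the other two
            # factors combine via the Khatri-Rao product.
            if mode == 0:
                Xn, KR, F = X.reshape(I, -1), khatri_rao(B, C), A
            elif mode == 1:
                Xn, KR, F = np.moveaxis(X, 1, 0).reshape(J, -1), khatri_rao(A, C), B
            else:
                Xn, KR, F = np.moveaxis(X, 2, 0).reshape(K, -1), khatri_rao(A, B), C
            # Standard KL-divergence multiplicative update.
            F *= (Xn / (F @ KR.T + eps)) @ KR / (KR.sum(axis=0) + eps)
    return A, B, C
```

Each row of `A` gives an actor’s loadings on the latent components — the analogue of a document’s topic proportions — while `B` and `C` play the same role for actions and time steps. The Bayesian version adds priors over the factors and infers a posterior rather than a point estimate.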

I am not entirely certain that this would work, and I’d love to hear any feedback about difficulties that this approach might face! I also doubt I’ll get around to exploring the possibilities more thoroughly right now. But I would really like to see more humanists actively seeking out collaborators in statistics and computer science to work on projects like this. We have an opportunity in the next decade to actively influence research in those fields, which will have widespread influence in turn over political structures in the future. By abstaining from that kind of collaboration, we are reinforcing existing power structures. Let’s not pretend otherwise.

  1. With a few small adjustments. Since actors and objects are of the same kind in the model, the verbs would need to have a slightly different structure — possibly linking individuals through identity perceptions or acts of self-identification. 

11 Responses

  1. I’ve read both Posner’s keynote and Lincoln’s response and, if that’s where you want to go, then maybe you need to look at the general literature on knowledge representation if you haven’t done so already. There’s been a s**t-ton of work in the area in the last half-century.

  2. No doubt! Perhaps there’s already some precedent in that literature for what they call Bayesian Poisson Tensor Factorization? Part of what excites me about this prospect is that there’s already a fair amount of RDF data — as a random example, the metadata for Project Gutenberg is stored in RDF (though in that case, perhaps not in a form immediately usable by this algorithm). My hope is that this method might make it easy (well, easier) to automatically generate knowledge schemas based on lower-level representations.

    I’m not sure whether that’s realistic! This is certainly very speculative. But I thought it was worth mentioning.

  3. I wasn’t thinking so much about Bayesian Poisson Tensor Factorization (which is above my pay grade) as the various examples Posner used and about Lincoln’s graphs. And about time.

    1. By the way — let me know if you have any suggestions for starting points. The point about time is suggestive, and I definitely want to read more about this.

      1. I suppose you’re referencing my remark that there’s been a “s**t-ton of work” in knowledge representation. Well, I don’t know much about your background and I assume you don’t know much about mine. So finding one another may be tricky.

        What I was reacting to in Posner is: 1) her contention that “technically speaking, we frankly haven’t really figured out how to deal with categories like gender that aren’t binary or one-dimensional or stable”, and 2) her assertion at the end about the need “to develop models of the world that have any relevance to people’s lived experience”. And then there are those moderately complex graphs that Lincoln drew.

        Posner’s first assertion is wrong. The limited nature of census data about gender and ethnicity has nothing to do with the technical capabilities of databases. Those limitations are the result of decisions made about what kind of information should go into the database and those decisions, of course, are political. As Ted Underwood pointed out in a recent post we can jam all sorts of information into databases. And then we have Lincoln’s data model, which outlines a technical approach.

        And then we have Posner’s second assertion, with the phrases “models of the world” and “people’s lived experience,” both of which are rather open ended. Of course database people have long known that databases embody some model of the world, and there is a rather large literature about that. But then there is AI and cognitive science, which set out to model how people think about the world, and that gave rise to the discipline of “knowledge representation” (aka KR). That is, the subject quickly became too rich to treat simply as something you did on the way to creating a program of some sort. It became a study unto itself. But there isn’t a firm dividing line between data modeling and KR. By the late 1990s we started seeing object-oriented database technology, which in turn derives from object-oriented programming and from KR.

        As for time, if you are going to represent people’s common sense knowledge of the world, and if you are going to represent narrative, you’ve got to deal with time. On the one hand you’ve got to situate events in relation to one another (succession, simultaneity, overlap), but you’ve also got to deal with the fact that the attributes of an object change over time, while the object is still the self-same object.

        As for graphs, it’s convenient and sometimes even revealing to represent these knowledge structures as a graph. In this context graphs are just notation devices, but they’re useful (and, of course, are all over the place in computing).

        Back in the mid-1970s I was getting my degree in English at SUNY Buffalo. But I ended up spending much of my time in the linguistics department working with David Hays and his students on computational semantics. That had me knee deep in conceptual graphs, and I ended up with a small pile of graphs representing some of the knowledge underlying a Shakespeare sonnet. I published that in MLN in 1976: Cognitive Networks and Literary Semantics. You might want to glance through it just to get a feel for the sort of thing that can be done, though I’m not sure it will be of much immediate use. I did a Google query on “knowledge representation for time” and it turned up “Time Series Knowledge Mining” (PDF), which might be useful.

        1. This is really helpful, thanks! I have a very broad view of the trends you’re describing, mostly from reading things like Stanford Encyclopedia of Philosophy articles. I’ll follow the leads you mention here.

          I do want to defend a particular interpretation of Posner’s claim that “technically speaking, we frankly haven’t really figured out how to deal with categories like gender that aren’t binary or one-dimensional or stable.” You’re totally right that the technical problem itself is almost trivial given a specific model. But which model do we choose? My interpretation of Posner’s claim hinges on “figured out how,” which I take to mean not “figured out a way,” but “figured out which way” — that is, which way to choose from among a combinatorial explosion of possible ways, each of which is perfectly computationally tractable on its own.

          In other words, it’s a garden of forking paths problem. It’s not too hard to explore one path, but it’s extremely difficult to pick out one particular path or constrained subset of paths as “right” because there are so many possible alternatives. The work you’re describing successfully explored lots of those paths, but I would expect that if they had found one that was obviously vastly better than the binary status quo, we’d be using it in census databases by now.

          On the other hand, I sympathize with your point; I am wary of declaring ex-[humanities]-cathedra that “we haven’t figured stuff out.” Surely there’s something to learn from those twenty years of research, even if that work fell out of favor or (as I think you suggest above) was absorbed and normalized within other branches of CS.

  4. 1) Well, you might be right about what Posner meant, but I think that if that’s what she meant, then she’d have made some explicit remarks about the multiplicity of technical options. But there’s nothing like that in the piece. As for the possibility of finding THE one right representation, beware of succumbing to neoliberal managerial fantasies of transcendent access and ultimate control. 🙂

    More generally, I have trouble “calibrating” a lot of statements Posner makes. Explicit assertions imply lots of unvoiced assertions. I don’t know how to make inferences to the unvoiced assertions that Posner would assent to and I fear that the implications that seem natural to me are at variance from the implications that would seem natural to Posner (and her intended audience).

    2) On the census, I’m sure there’s a history that tells us a lot about how the census has evolved over the years. I have some vague recollection that at some time during my lifetime they decided to enlarge the range of ethnic/race categories available, and that decision made the news, for obvious reasons. But I don’t recall any details, though the details may well be available on the web somewhere.

    There’s a general point that systems have a history and that history places constraints on how the system can evolve. For example, if you aren’t familiar with it, look up the history of the QWERTY keyboard. For bonus points, your laptop probably allows you to use a different keyboard mapping, such as some version of the Dvorak keyboard. Find it.

    It turns out that the Census Bureau has some history on their site. I haven’t spent much time exploring it but I did discover that they are the second government agency, after the War Department (now the DOD), to commission and acquire a digital computer. In the middle of 1951 they took delivery of the UNIVAC I, which remained in service until the early 1960s.

    What features of the data model for that machine are with us today?

    Here’s a big problem for me. Early in the piece she says: “For all of its vaunted innovation, the digital humanities actually borrows a lot of its infrastructure, data models, and visual rhetoric from other areas, and particularly from models developed for business applications. In some ways, that’s inevitable, because the business market is just so much bigger, and so much better funded, than the market for weird, boutique humanities tools.” And then she has three paragraphs about maps, paragraphs in which she suggests that there might be some problem resulting from the fact that the Mercator projection emerged under conditions of empire, but she doesn’t give any examples of DH projects that have encountered such problems. So why mention the Mercator projection, not to mention Cartesian space?

    Then she starts going through other examples of problems. After a while she asserts that humanists should start “ripping apart and rebuilding the machinery of the archive and database so that it doesn’t reproduce the logic that got us here in the first place.” From this one might conclude that she thinks that the business world is perfectly happy with computing but that the problems faced by DH result from the fact that computing is “shrink wrapped” to fit commercial needs. That is, there is an implicit binary opposition between commercial computing and DH, such that the former creates problems for the latter.

    From my POV that’s secondary. I don’t deny the problems facing DH. But it’s clear to me that people in the commercial IT world also wish they could rip apart and rebuild the “machinery of the archive.” There’s nothing special about the problems faced by DH. This is another calibration issue.

    A couple of years ago I was involved in a project addressed to the following problem: A large corporation has a dozen or so large databases (millions of records each) containing more or less the same kind of information. But the DBs each have their own data model for that information and, moreover, the DBs are implemented on different platforms. What’s worse, we have no documentation on the data models for half the DBs. That is, we know what kinds of queries we can run against a DB, and what kind of information we get out, but we don’t know what’s going on inside the DB to produce that result. What we want to do is to somehow produce a unified model for all of these DBs so that they can exist in peace and harmony somewhere in The Cloud.

    That pretty much sounds like ripping things up and starting anew. And yet it isn’t about the weird and boutique world of DH, it’s about the neoliberal managerial commercial world, which foots the bill for digital tech (while the government foots much of the bill for fundamental R&D). The notion that existing IT tech doesn’t meet our needs is not special to DH. It’s ubiquitous. Why? Because representing the world in computational terms is very difficult. Very difficult. Third time’s a charm: very difficult. DH projects have no special privilege in the face of a various reality that retreats from and mocks us at every turn.

    Consider the following statement by Alan Kay, who is generally cited as the person who synthesized the GUI interface that was promulgated to the world in Apple’s Macintosh:

    There is the desire of a consumer society to have no learning curves. This tends to result in very dumbed-down products that are easy to get started on, but are generally worthless and/or debilitating. We can contrast this with technologies that do have learning curves, but pay off well and allow users to become experts (for example, musical instruments, writing, bicycles, etc. and to a lesser extent automobiles). [Douglas] Engelbart’s interface required some learning but it paid off with speed of giving commands and efficiency in navigation and editing. People objected, and laughed when Doug told them that users of the future would spend many hours a day at their screens and they should have extremely efficient UIs they could learn to be skilled in.

    If one takes that as a directive for DH, what does it imply? OTOH there are hints of that in one of the examples Posner gives, The Knotted Line. But one could also read the entire piece as a response to the imperative implied in that statement: “Don’t take what you are given. Learn to build your own.” What does that imply in the context of the hack/yack debate?

    1. I’ll be direct — I might be misunderstanding you, but it looks to me like the interpretation you’re offering of Posner posits that she must not know that it’s possible to construct a time series of two floating point numbers. (Non-binary, two-dimensional, and unstable! Problem solved …?) But that would be an extremely uncharitable reading!

      Looking back over your previous post, this jumped out at me, and I think it’s our fundamental point of disagreement:

      Those limitations are the result of decisions made about what kind of information should go into the database and those decisions, of course, are political.

      I think that these are neither purely technical nor purely political decisions. They are technical because they are political. We don’t have enough technical knowledge about more complex models to be confident that one of them will be a good solution to the political problem at hand.

      1. I think we’ve gone just about as far as we can go, Scott. I note that much of this conversation has been occasioned by Posner’s remark, “technically speaking, we frankly haven’t really figured out how to deal with categories like gender that aren’t binary or one-dimensional or stable”. That surprised me, as it’s the sort of remark I generally find coming from people who don’t know much about computing and who are opposed to it. But Posner is neither. That leaves me with a problem: where’s that remark coming from?

        I don’t really know. I can only guess. My best guess is that there is unresolved tension in the DH community between a sense that computing is fascinating and powerful and can be put to good use, and a fear that, because computing is deeply embedded in the worlds of industry, commerce, and government, it must be “evil” and therefore we engage with it at our peril. So, at that point in her piece Posner simply forgot what she may otherwise have known about computing in an effort to assert that humanists have work to do and contributions to make. That of course is true. But also, it’s not so simple.

        As for the line you quote from me, I wasn’t making a general statement. I was specifically referring to census information about gender and race. There is a history there. A rich and fascinating history.

        The first census was conducted in 1790. The information gathered was quite limited. For each household:

        The number of free White males aged: a) under 16 years and b) 16 years and upward
        Number of free White females
        Number of other free persons
        Number of slaves

        That’s what they needed to know to comply with constitutional requirements concerning proportional representation and taxation. The technology involved was simple, pen and paper.

        Now, let’s look at the situation in 1940, when statistical sampling was introduced. A much wider variety of information was collected and there was now considerable technological help. Not from digital computers, of course, for they had not yet been invented. Rather, they used Hollerith’s electromechanical punch-card technology, which was first put into service in 1890. If we take a look at the population schedule (PDF) for the 1940 census we see that gender information (called “sex” at the head of the column, #9) has the familiar binary, M/F. Column 10 asks for color or race and a notation at the bottom of the form (lower left) gives us the following categories:

        Other races, spell out in full

        That’s an interesting list. But it certainly doesn’t break down neatly into the traditional three races: Caucasoid, Mongoloid, and Negroid. Would the people who devised that 1940 list have recognized the categories as social constructions? In some sense, likely yes, maybe no, it depends.

        How did we get from the categories employed in 1790 to that particular set? The sets of categories themselves can be found at the Census Bureau’s site, but that’s the barest beginning. What considerations led to the inclusion of more categories? I wonder what turned up under “other races” and what use was made of that information. There’s an interesting bit of history here.

        The most recent census was in 2010. Now we’re deep into the digital era and the web. I have no idea of the range of technologies involved.

        The information gathered on race and Hispanic origin (PDF) is more diverse than that gathered in 1940. The most interesting thing is that Hispanic origin is considered separately from race, and properly so. If a person is of Hispanic origin, we’ve got these alternatives:

        Mexican, Mexican American, Chicano
        Puerto Rican
        Other (print out in full)

        Under the race item there are four opportunities for write-ins. If a person is an American Indian, write out the “enrolled or principal tribe.” If “Other Asian,” write it out. If “Other Pacific Islander,” write it out. And finally, if none of the other categories work, write it out.

        I have no idea what kind of historical work has been done. The Census Bureau has its own history (PDF), which appears to be a bare-bones exposition of the information asked for decade by decade. Wikipedia lists Margo J. Anderson, The American Census: A Social History, 2nd ed. (New Haven: Yale University Press, 2015); Margo J. Anderson, Encyclopedia of the U.S. Census (Washington, DC: CQ Press, 2000); and a few other sources. I assume there’s more.

        What kind of DH project could be done? There’s the political and social history. But also technology. The business of census-taking, in the US and elsewhere, was central to the emergence of the data-processing industry and IBM is one of the companies that came out of that. And of course, contextualize it all with our most sophisticated critical insights. The mind boggles at the scope.

        Consider an intro to DH class where the census is put on the table as a class project. What could be done in a semester? How could we then build on that?

        1. Yeah — I think we may just have slightly different mental maps of the field. I associate the attitude towards computation that you describe more with the authors of the article that must not be named than with Posner. They cited her with a kind of backhanded approval that I found irksome, and I’m inclined to draw a bright line between them. I really respect Posner’s work outside DH as well. She has done some work on the history of lobotomy — you can imagine how that might dispose one to be suspicious of procedures that are thought of as good solutions mostly because they’re easy to implement.

          But — I can see where you’re coming from in general. And otherwise, I agree with the large majority of what you’ve said here. I’m a pragmatist when it comes to work done by for-profit companies, and I think universities could do a lot more good in the world if they would adopt a best-is-the-enemy-of-the-good approach more often, at least in their IT practices.

          Also, the census class you describe here sounds fantastic and, yes, captures exactly the interplay of political and technological forces that I was thinking about in my last post. (I might have to pass that idea on to a colleague here at Penn, Ben Wiggins — with your blessing first of course! I think it would be right up his alley.)

          1. LOL!

            Interesting that you should mention the LARB hit job, since that’s been much on my mind and it’s in that context I was reading Posner’s article. And I recognize that she has a very different stance from them. The resulting cognitive dissonance had me cycling through “does not compute does not compute does not compute….”

            In reading around on her site I found the work on lobotomy. What most impressed me about it, again in this general context, is that she didn’t rush to demonize Freeman, at least not in what I saw. Obviously, demonization would have been easy and, of course, it has been done. She made a real effort to understand what he was doing, and why, all in historical context. (Beyond this, I note that Freeman’s son, who died this past April, was a distinguished neuroscientist at Berkeley. I had quite a bit of correspondence with him a decade or so ago when I was thinking about the brain and music. He was sensitive about his father’s legacy, as you could easily imagine.)

            By all means, pass on the census class idea to Wiggins; it really does seem right up his alley. In the next day or three I’m going to take some of the stuff I’ve posted here and write up a post for New Savanna. Maybe we could get Posner interested as well and get an East Coast/West Coast thing going.
