Deriving Browsing Similarity

The following is an extremely deliberate, step-by-step derivation of what I think is a novel vector similarity measure.1 I discuss motivations for it here; this focuses on the math. Describing this as a “vector” similarity measure might be a little deceptive because part of the point is to get away from the vector model that we’ve inherited via cosine similarity. But in its final form, the measure is similar enough to cosine similarity that it’s helpful to hold on to that way of thinking for now.

The initial aim of this similarity measure was to create better topic model graphs, and I think it really does. But as I’ve thought it through, I’ve realized that it has applications to any bipartite graph that can be interpreted in probabilistic terms. More on that later!

The derivation is pure conditional probability manipulation; it doesn’t require anything but algebra and a few identities. But if you’re not familiar with the concepts of conditional probability and marginalization (in the mathematical sense!) you may want to read up on them a bit first. Also, I’m not certain that I’m using conventional notation here — please let me know if I’ve done something odd or confusing. But I’m confident the reasoning itself is correct.

Informally, the aim of this measure is to determine the probability of happening upon one topic while browsing through a corpus for another one. Imagine for a moment that the corpus is a collection of physical books on a bookshelf in front of you. You’re interested in crop circles, and you are looking through the books on the bookshelf to find information about them. What is the probability that in the process, you’ll happen upon something about clock gear ratios?2

Formally, suppose we have random variables X and Y, each representing identical categorical distributions over topics in a model, and suppose we have B, a random variable representing a uniform distribution over all books in the model’s corpus. We’re interested in the probability of happening upon topic x given that we selected a book b based on the proportion of the book discussing topic y. To be as precise as possible, I’ll assume that we use the following process to browse through the corpus:

  1. Pick a topic y of interest from Y.
  2. Pick a book at random, with uniform probability.
  3. Pick a word from the book at random, with uniform probability.
  4. If the word is not labeled y, put the book back on the shelf and repeat steps 2 and 3 until the word you choose is labeled y.
  5. Pick another word from the book at random, again with uniform probability.
  6. Use that word’s topic label x as your new topic of interest.
  7. Repeat steps 2 through 6 indefinitely.

This is roughly equivalent to the less structured process I described above in the informal statement of the problem, and there’s at least some reason to believe that the probabilities will turn out similarly. Suppose, for example, that you know in advance the proportion of each book devoted to each topic, and choose a book based on that information; and that you then choose your new topic using that book’s topic distribution. The probabilities in that case should work out to be the same as if you use the above process. But specifying the above process ensures that we know exactly where to begin our derivation and how to proceed.

In short, what we have here is a generative model — but this time, instead of being a generative model of writing, such as LDA, it’s a generative model of reading.

Now for the derivation. First, some basic identities. The first two give different versions of the definition of conditional probability. The third shows the relationship between the conditional probabilities of three variables and their joint probability; it’s a version of the first identity generalized to three variables. And the fourth gives the definition of marginalization over the joint probability of three variables — which simply eliminates one of them by summing over the probabilities for all its possible values. (You can do the same thing with any number of variables, given a joint distribution.) I’m using the convention that an uppercase variable represents a probability distribution over a support (set of possible values) and a lowercase variable represents one possible value from that distribution. To avoid clutter, I’ve silently elided the =x in P(X=x) unless clarity requires otherwise.

1) \quad p(X,Y) = p(X|Y) \times p(Y)

2) \quad p(X|Y) = p(X,Y) / p(Y)

3) \quad p(X,Y,B) = p(Y) \times p(B|Y) \times p(X|B,Y)

4) \quad p(X,Y) = \sum \limits_{b \in B} p(X,Y,B=b)

Now for the derivation. First, substitute (3) into (4):

5) \quad p(X,Y) = \sum \limits_{b \in B} (p(Y) \times p(B=b|Y) \times p(X|B=b,Y))

The prior probability of Y, p(Y), is constant with respect to b, so we can move it outside the sum:

6) \quad p(X,Y) = p(Y) \times \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

Divide both sides:

7) \quad p(X,Y) / p(Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

And by (2):

8) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b,Y))

Finally, for any given book b, we can simplify p(X|B=b,Y) to p(X|B=b) because the probability of picking a word labeled with topic x depends only on the given book. The topic that led us to choose that book becomes irrelevant once it has been chosen.

9) \quad p(X|Y) = \sum \limits_{b \in B} (p(B=b|Y) \times p(X|B=b))

So now we have a formula for p(X|Y) in terms of p(B|Y) and p(X|B). And we know p(X|B) — it’s just the probability of finding topic x in book b, which is part of the output from our topic model. But we don’t yet know p(B|Y).

Here’s how we can determine that value. We’ll introduce a combined version of equations 1 and 2 with the variables swapped as 10 — also known as Bayes’ Theorem:

10) \quad p(B|Y) = p(Y|B) \times p(B) / p(Y)

As well as a combination of the two-variable versions of equations 3 and 4 as 11:

11) \quad p(Y) = \sum \limits_{b \in B} (p(B=b) \times p(Y|B=b))

Starting with 11, we note that for any given b, p(B=b) = 1 / N where N is the number of books. (This is because B is uniformly distributed across all books.) That means p(B=b) is a constant, and so we can move it outside the sum:

12) \quad p(Y) = p(B) \times \sum \limits_{b \in B} p(Y|B=b)

Substituting that into equation 10 we get:3

13) \quad p(B|Y) = p(Y|B) \times p(B) / (p(B) \times \sum \limits_{b \in B} p(Y|B=b))

Conveniently, the p(B) terms now cancel out:

14) \quad p(B|Y) = p(Y|B) / \sum \limits_{b \in B} p(Y|B=b)

And we substitute that into the our previous result (9) above:

15) \quad p(X|Y) = \sum \limits_{b \in B}(p(X|B=b) \times p(Y|B=b) / \sum \limits_{b \in B} p(Y|B=b))

Now we can simplify again by noticing that \sum \limits_{b \in B} p(Y|B=b) is constant with respect to the outer sum, because all the changes in the values of p(Y|B=b) in the outer sum are subsumed by the inner sum. So we can move that out of the sum.4

16) \quad p(X|Y) = \sum \limits_{b \in B} (p(X|B=b) \times p(Y|B=b)) / \sum \limits_{b \in B} p(Y|B=b)

This is a very interesting result, because it looks suspiciously like the formula for cosine similarity. To see that more clearly, suppose that instead of looking at p(X|B=b) and p(Y|B=b) as conditional probabilities, we looked at them as vectors with a dimension for every value of b — that is, as X = [x_1\ x_2\ x_3\ ...\ x_n] and Y = [y_1\ y_2\ y_3\ ...\ y_n]. We’d get this formula:

17) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sum \limits_{b = 1}^{n} (y_b)}

And here’s the formula for cosine similarity for comparison:

18) \quad \frac{\displaystyle \sum \limits_{b = 1}^{n} (x_b \times y_b)}{\displaystyle \sqrt{\sum \limits_{b = 1}^{n} x_b^2 \times \sum \limits_{b = 1}^{n} y_b^2}}

As you can see, the sum on top is identical in both formulas. It’s a dot product of the vectors X and Y. The difference is on the bottom. In the cosine similarity formula, the bottom is the multiple of the lengths of the two vectors — that is, the Euclidean norm. But in the new formula we derived, it’s a simple sum! In fact, we could even describe it as the Manhattan norm, which makes the relationship between the two formulas even clearer. To convert between them, we can simply replace the Euclidean norm of both vectors with the Manhattan norm of the second vector Y — that is, the conditioning probability distribution in p(X|Y).

So at this point, you might be thinking “That was a lot of trouble for a result that hardly differs from cosine similarity. What was the point again?”

There are two answers.

A diagram illustrating cosine similarity
The cosine similarity between two topics. Topic One accounts for three words in Book One and four words in Book Two. Topic Two accounts for five words in Book One and four words in Book Two. The cosine similarity is equal to the cosine of the angle between them, Theta. If you’re not confused by this, then I think you should be. It’s weird that we’re still modeling texts as vectors.

The first emphasizes the familiar. Although we just derived something that looks almost identical to cosine similarity, we derived it using a specific and well-defined statistical model that invites a new and more concrete set of interpretations. Now we don’t have to think about vectors in an abstract space; we can think about physical people holding physical books. So we’ve come to a better way to theorize what cosine similarity measures in this case, even if we don’t seem to have discovered anything new.

The second emphasizes the unfamiliar. Although what we have derived looks similar to cosine similarity, it is dramatically different in one respect. It’s an asymetric operator.

If you’re not sure what that means, look at the diagrams to the right. They illustrate the vector space model that cosine similarity uses, this time with two simple 2-d vectors. Think of the vectors as representing topics and the axes as representing documents. In that case, the cosine similarity corresponds roughly to the similarity between topics — assuming, that is, that topics that appear together frequently must be similar.

A graph of the cosine function.
Cosine similarity is symmetric because the cosine function is symmetric around 0. Swapping the order of application negates theta, but the cosine of negative theta is the same as the cosine of theta.

The cosine similarity is simply the cosine of the angle between them, theta. As the angle gets larger, the cosine goes down; as it gets smaller, the cosine goes up. If the vectors both point in exactly the same direction, then the cosine is 1. And if they are at right angles, it’s 0. But notice that if you swap them, the value doesn’t change. The cosine similarity of A with respect to B is the same as the cosine similarity of B with respect to A. That’s what it means to say that this is a symmetric operator.

But browsing similarity is not a symmetric operator. Why? Because the denominator only includes the norm of the conditioning variable (the Y in p(X|Y)). The conditioned variable (X in p(X|Y)) isn’t included in the denominator. This means that if the sum of X is different from the sum of Y, the result will be different depending on the order in which the operator is applied. This means that the graph this measure produces will be a directed graph. Look — see the arrows?

The arrows indicate the order in which the similarity operator has been applied to the given topic vectors. If the arrow is pointing towards topic X, then it represents the value p(X|Y) — otherwise, it represents the value p(Y|X).

What’s exciting is that this has a physical interpretation. Concretely, the probability that people who are interested in topic Y will be exposed to topic X is different from the probability that people who are interested in topic X will be exposed to topic Y. This shouldn’t be too surprising; obviously a very popular topic will be more likely to “attract” readers from other topics than an unpopular topic. So this is really how it ought to be: if Y is an unpopular topic, and X is a very popular topic, then p(Y|X) should be much lower than P(X|Y).

This makes me think that we’ve been using the wrong similarity measure to create our topic graphs. We’ve been assuming that any two topics are equally related to one another, but that’s not really a sound assumption, except under a model that’s so abstract that we can’t pin a physical interpretation to it at all.

Furthermore, this measurement isn’t limited to use on topic models. It can be used on any pair of types related to each other by categorical distributions; or, to put it another way, it can be used to collapse any bipartite graph into a non-bipartite graph in a probabilistically sound way. This opens up lots of interesting possibilities that I’ll discuss in later posts. But to give you just one example, consider this: instead of thinking of topics and books, think of books and recommenders. Suppose you have a community of recommenders, each of whom is disposed to recommend a subset of books with different probabilities. The recommenders form one set of nodes; the books form the other. The browsing process would look like this: you talk to someone likely to recommend a book you already like and ask for another recommendation. Then you talk to someone else likely to recommend that book, and so on. This could be a new way to implement recommender systems like those used by Amazon.

  1. I hope someone will tell me it’s not! In that case, my task is easier, because it means I won’t have to do all the theory. I’ve found a few near misses. In a paper on topic model diagnostics, Chuang, et. al. propose a refinement of cosine similarity for finding matches between topics across multiple independently-run models. Unlike the measure I’m proposing, theirs uses topic-word vectors; is not based on a generative model; and is a symmetric operator. However, theirs was better at predicting human judgments than cosine similarity was, and is worth investigating further for that reason. See also this wonderful post by Brendan O’Connor, which lays out a fascinating set of relationships between various measures that all boil down to normed dot products using different norms. This measure could be added to that list. 
  2. Here’s the graph-theoretic interpretation: given that the topics and books together form a complete bipartite graph, in which the edges are weighted by the proportion of the book that is about the linked topic, this is equivalent to building a non-bipartite directed graph describing the probability of moving from one topic to another while browsing. This takes up ideas that I first encountered through Scott Weingart
  3. This is almost the same as what Wikipedia calls the “extended version” of Bayes Theorem. The only difference is a small modification that resulted from the step we took to generate equation 12, which was possible because p(B) is uniform. 
  4. If you’re anything like me, you probably start to feel a little anxious every time a variable moves inside or outside the summation operator. To give myself a bit more confidence that I wasn’t doing something wrong, I wrote this script, which repeatedly simulates the 7-step process specified above a million times on an imaginary corpus, and compares the resulting topic transition probability to the probability calculated by the derived formula. The match is close, with a tight standard deviation, and very similar results for several different averaging methods. The script is a little bit sloppy — my apologies — but I encourage anyone interested in verifying this result to look at the script and check my work. Let me know if you see a potential problem!