The Volume and Surface Area of Computer Programs

A cube.

I’ve been refactoring some software. It’s not a fun task, exactly, but there’s something strangely satisfying about it — it’s a bit like folding socks. In the past, I have found the concepts of “coupling” (bad) and “cohesion” (good) useful for this kind of work. But the definitions of those concepts have always seemed a bit vague to me. The Wikipedia page linked above offers the following definition of cohesion: “the degree to which the elements of a module belong together.” The definition of coupling isn’t much better: “a measure of how closely connected two routines or modules are.”

What does that mean? Belong together in what sense? Closely connected how? Aren’t those essentially synonyms?

If I were in a less lazy mood, I’d look up the sources cited for those definitions, and find other, more reliable (?) sources than Wikipedia. But I’ve done that before, and couldn’t find anything better. And as I’ve gained experience, I’ve developed a sense that, yes, some code is tightly coupled — and just awful to maintain — and some code is cohesive without being tightly coupled — and a joy to work with. I still can’t give good definitions of coupling or cohesion. I just know them when I see them. Actually, I’m not even sure I always know them when I see them. So I’ve spent a lot of time trying to figure out a more precise way to describe these two concepts. And recently, I’ve been thinking about an analogy that might help explain the difference — it links the relationship between cohesion and coupling to the relationship between volume and surface area.

The analogy begins with the idea of interfaces. Programmers often spend long hours thinking about how to define precise channels of communication between pieces of software. And many programming styles — object-oriented development, for example — emphasize the distinction between private and public components of a computer program. Programs that don’t respect that distinction are often troublesome because there are many different ways to modify the behavior of those programs. If there are too many ways, and if all those ways are used, then it becomes much more difficult to predict the final behavior that will result.

The net of a cube.
The net (unfolded surface) of a cube.

Suppose we think of interfaces and public variables as the surface of a program, and think of private variables and methods as being part of the program’s interior — as contributing to its volume. In a very complex program, enforcing this distinction between public and private becomes like minimizing the surface area of the program. As the program gets more and more complex, it becomes more and more important to hide that complexity behind a much simpler interface, and if the interface is to remain simple, its complexity must increase more slowly than the complexity of the overall program.

What if the relationship between these two rates — the rate of increase in complexity of the interface and the rate of increase in complexity of the overall program — is governed by a law similar to the law governing the relationship between surface area and volume? Consider a cube. As the cube grows in size, its volume grows faster than its surface area. To be precise (and forgive me if this seems too obvious to be worth stating) its volume is x^3, while its surface area is 6 \times x^2 for a given edge-length x.

Biologists have used this pattern to explain why the cells of organisms rarely grow beyond a certain size. As they grow larger, the nutrient requirements of the interior of the cell increase more quickly than the surface’s capacity to transmit nutrients; if the cell keeps growing, its interior will eventually starve. The cell can avoid that fate for a while by changing its shape to expand its surface area. But the cell can only do that for so long, because the expanded surface area costs more and more to maintain. Eventually, it must divide into two smaller cells or die.

Might this not also give us a model that explains why it’s difficult to develop large, complex programs without splitting them into smaller parts? On one hand, we have pressure to minimize interface complexity; on the other, we have pressure to transmit information into the program more efficiently. As the program grows, the amount of information it needs increases, but if the number of information inlets increases proportionally, then soon, it becomes too complex to understand or maintain. For a while, we can increase the complexity of the program while keeping the interface simple enough just by being clever. But eventually, the program’s need for external information overwhelms even the most clever design. The program has to be divided into smaller parts.

So what do “coupling” and “cohesion” correspond to in this model? I’m not sure exactly; the terms might not be defined clearly enough to have precise analogs. But I think coupling is closely related to — returning to the cell analogy now — the nutrient demand of the interior of the cell. If that demand goes unchecked, the cell will keep expanding its surface area. It will wrinkle its outer membrane into ever more complex and convoluted shapes, attempting to expose more of its interior to external nutrient sources. At this point in the analogy, the underlying concept becomes visible; here, the cell is tightly coupled to its environment. After this coupling exceeds some threshold, the cell’s outer membrane becomes too large and complex to maintain.A drop of water.

In turn, cohesion is closely related to the contrary impulse — the impulse to reduce surface area.

Imagine a drop of water on a smooth surface. It rests lightly on the surface seeming almost to lift itself up. Why? It turns out that surfaces take more energy to maintain than volumes; they cost more. Molecules in the interior of the drop can take configurations that molecules on the surface can’t. Some of those configurations have lower energy than any possible configuration on the surface, so molecules on the surface will tend to “fall” into the interior. The drop will try to minimize its surface area, in much the way a marble in a bowl will roll to the bottom. And the shape with the lowest surface-area-to-volume ratio is a sphere.

We have different words for this depending on context. This phenomenon is the same phenomenon we sometimes name “surface tension.” Water striders can glide across the surface of a pond because the water wants to minimize its surface area. The water does not adhere to their limbs; it coheres; it remains decoupled. These ways of thinking about coherence and coupling make the concepts seem a bit less mysterious to me.

Meaning, Context, and Algebraic Data Types

A few years ago, I read a paper making a startling argument. Its title was “The Derivative of a Regular Type is its Type of One-Hole Contexts.” I’m not entirely sure how I found it or why I started reading it, but by the time I was halfway though it, I was slapping myself on the forehead: it was so brilliant, and yet so obvious as to be almost trivial — how could nobody have thought of it before?

The idea was that if you take an algebraic data type — say something simple like a plain old list — and poke a hole in it, you get a new data type that looks like a derivative of the first data type. That probably sounds very abstract and uninteresting at first, especially if you aren’t very familiar with algebraic data types. But once you understand them, the idea is simple. And I think it has ramifications for people interested in natural language: it tells us something about the relationship between meaning and context. In particular, it could give us a new way to think about how semantic analysis programs like word2vec function: they perform a kind of calculus on language.

What’s an Algebraic Data Type?

An algebraic data type is really just a regular data type — seen through a particular lens. So it’s worth taking a moment to think through what a regular data type is, even if you’re already familiar with the concept.

A regular data type is just an abstract representation of the kind of data being stored or manipulated by a computer. Let’s consider, as a first example, the simplest possible item of data to be found in a modern computer: a bit. It can take just two values, zero and one. Most programming languages provide a type describing this kind of value — a boolean type. And values of this type are like computational atoms: every piece of data that modern computers store or manipulate is made of combinations of boolean values.1

So how do we combine boolean values? There are many ways, but here are two simple ones that cover a lot of ground. In the first way, we take two boolean values and declare that they are joined, but can vary independently. So if the first boolean value is equal to zero, the second boolean value may be equal to either zero or one, and if the first boolean value is equal to one, the second boolean value may still be equal to either zero or one. And vice versa. They’re joined, but they don’t pay attention to each other at all. Like this:

1) 0 0
2) 0 1
3) 1 0
4) 1 1

In the second way, we take two boolean values and declare that they are joined, but cannot vary independently — only one of them can be active at a given time. The others are disabled (represented by “*” here) — like this:

1) * 0
2) * 1
3) 0 *
4) 1 *

Now suppose we wanted to combine three boolean values. In the first case, we get this:

1) 0 0 0
2) 0 0 1
3) 0 1 0
4) 0 1 1
5) 1 0 0 
6) 1 0 1
7) 1 1 0
8) 1 1 1

And in the second case, we get this:

1) * * 0
2) * * 1
3) * 0 *
4) * 1 *
5) 0 * *
6) 1 * *

At this point, you might start to notice a pattern. Using the first way of combining boolean values, every additional bit doubles the number of possible combined values. Using the second way, every additional bit only adds two to the number of possible combined values.

This observation is the basis of algebraic type theory. In the language of algebraic data types, the first way of combining bits produces a product type, and the second way of combining bits produces a sum type. To get the number of possible combined values of a product type, you simply multiply together the number of possible values that each component type can take. For two bits, that’s 2 \times 2; for three bits, that’s 2 \times 2 \times 2; and for n bits, that’s 2 ^ n. And to get the number of possible combined values of a sum type, you simply add together the number of possible values that each component type can take: 2 + 2, or 2 + 2 + 2, or 2 \times n.

Pretty much all of the numerical types that a computer stores are product types of bits. For example, a 32-bit integer is just the product type of 32 boolean values. It can represent numbers between 0 and 2 ^{32} - 1 (inclusive), for a total of 2 ^{32} values. Larger numbers require more binary digits to store.

Sum types are a little more esoteric, but they become useful when you want to represent things that can come in multiple categories. For example, suppose you want to represent a garden plot, and you want to have a variable that stores the number of plants in the plot. But you also want to indicate what kind of plants are in the plot — basil or thyme, say. A sum type provides a compact way to represent this state of affairs. A plot can be either a basil plot, or a thyme plot, but not both. If there can be up to twenty thyme plants in a thyme plot, and up to sixteen basil plants in a basil plot, the final sum type has a total of thirty-six possible values:

Basil   Thyme
*       1
*       2
...     ...
*       19    
*       20
1       *
2       *
...     ...
15      *
16      *

If you’ve done any programming, product types probably look pretty familiar, but sum types might be new to you. If you want to try a language that explicitly includes sum types, take a look at Haskell.

Poking Holes in Types

Given this understanding of algebraic data types, here’s what it means to “poke a hole” in a type. Suppose you have a list of five two-bit integers:

| a | b | c | d | e |

This new variable’s type is a product type. It can take this value:

| 0 | 0 | 0 | 0 | 0 |

Or this value:

| 0 | 0 | 0 | 0 | 1 |

Or this value…

| 0 | 0 | 0 | 0 | 2 |

And so on, continuing through values like these:

| 0 | 0 | 0 | 0 | 3 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 2 |
| 0 | 0 | 0 | 1 | 3 |
| 0 | 0 | 0 | 2 | 0 |

All the way to these final values:

| 3 | 3 | 3 | 2 | 3 |
| 3 | 3 | 3 | 3 | 0 |
| 3 | 3 | 3 | 3 | 1 |
| 3 | 3 | 3 | 3 | 2 |
| 3 | 3 | 3 | 3 | 3 |

The total number of possible values for a variable of this type is (2 ^ 2) ^ 5 = 4 ^ 5 = 4 * 4 * 4 * 4 * 4 = 1024.

Now think about what happens if you disable a single slot. The result is a list type with a hole in it. What does a variable with this new type look like? Well, say you disable the first slot. You get something that looks like this — a list with a hole in the first slot:

| * | 0 | 0 | 0 | 0 |

Now that there’s a hole in the first slot, there are only 4 * 4 * 4 * 4 = 256 possible values this variable can take.

| * | 0 | 0 | 0 | 1 |
| * | 0 | 0 | 0 | 2 |

We could also poke a hole in any of the other slots:

| 0 | * | 0 | 0 | 0 |
| 0 | 0 | * | 0 | 0 |
| 0 | 0 | 0 | * | 0 |
| 0 | 0 | 0 | 0 | * |

But the hole can only be in one of the slots at any given time. Sound familiar? You could say that this is the “sum type” of each of those one-hole types. Each possible placement of the hole adds to (rather than multiplying) the number of possible values. There are five different places for the hole to go, and for each of those, there are 256 possible values. That’s a total of 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 + 4 ^ 4 = 5 \times 4 ^ 4.

Now suppose we make this a little more general, by replacing that 4 with an x. This way we can do this same thing with slots of any size — slots that can hold ten values, or 256 values, or 65536 values, or whatever. For the plain list type, that’s a total of x ^ 5 possible values. And for the list-with-a-hole, that’s 5 x^4 possible values. And now let’s go even further and replace the number of slots with an n. That way we can have any number of slots — five, ten, fifty, whatever you like. Now, the plain list has a total of x ^ n possible values, and the list-with-a-hole has n x ^ {(n - 1)} possible values.

If you ever took calculus, that probably looks very familiar. It’s just the power rule! This means you can take data types and do calculus with them.

One-Hole Contexts

I think that’s pretty wild, and it kept my head spinning for a couple of years when I first read about it. But I moved on. Fast forward to a few weeks ago when I started really reading about word2vec and learning how it works: it finds vectors that are good at predicting the word in the middle of an n-gram, or are good at predicting an n-gram given a word in the middle. Let’s start with the first case — say you have this incomplete 5-gram:

I went _ the store

What word is likely to be in there? Well, “to” is a good candidate — I’d venture a guess that it was the first word that popped into your mind. But there are other possibilities. “Past” might work. If the n-gram is part of a longer sentence, “from” is a possibility. And there are many others. So word2vec will group those words together in its vector space, because they all fit nicely in this context, and in many others.

But look again at that sentence — it’s a sequence of words with a hole in it. So if in this model, a word’s meaning is defined by the n-grams in which it may be embedded, then the type of a word’s meaning is a list-with-a-hole — just like the derivative of the list type that we were looking at above.

The basic idea of word2vec isn’t extremely surprising; from a certain perspective, this is an elaboration of the concept of a “minimal pair” in linguistics. If you know that “bat” means a thing you hit a ball with, and “mat” means a thing you wipe your feet on, then you can tell that /b/ and /m/ are different phonemes, even though they both involve putting your lips together: the difference between their sounds makes a difference in meaning. Conversely, if the difference between two sounds doesn’t make a difference in meaning, then the sounds must represent the same phoneme. “Photograph” pronounced with a distinct “oh” sound in the second syllable is the same word as “photograph” pronounced with an “uh” sound in the second syllable. They’re different sounds, but they both make the same word here, so they represent the same phoneme.

The thinking behind word2vec resembles the second case. If you can put one word in place of another in many different contexts and get similar meanings, then the two words must be relatively close together in meaning themselves.

What is surprising is that by pairing these two concepts, we can link the idea of derivatives in calculus to the idea of meaning. We might even be able to develop a model of meaning in which the meaning of a sentence has the same type as the derivative of that sentence’s type. Or — echoing the language of the paper I began with — the meaning type of a sentence type is its type of one-hole contexts.

  1. But don’t take this for granted. There have been — and perhaps will be in the future — ternary computers