How can we measure the size of a set?

One of the more enjoyable aspects of my first year studying maths at undergraduate level involved learning about countable and uncountable sets, which I briefly mentioned in this post. I had previously seen Cantor’s diagonal argument and, while it was intuitive, I didn’t fully appreciate how it could be formalised into a rigorous mathematical argument.

Essentially what these proofs boil down to is the existence, or non-existence, of various maps from the set of interest into the natural numbers \(\{1,2,3,\dots\}\). If an injective map exists from the set of interest into the naturals, our set of interest is countable. If no such map exists, it is uncountable. It doesn’t stop there though, as uncountable sets can be broken down themselves into a hierarchy of different “sizes” known as cardinalities, although I never actually attended the follow up course so that’s about all I know. However, in this setting, the “size” of an interval \([a,b]\) is the same as that of the real numbers \(\mathbb{R}\), which is the same as that of 3D space \(\mathbb{R}^3\).

Around the same time, we learnt about the notion of dimension in vector spaces. A vector space is essentially a nice kind of set on which notions of addition and (scalar) multiplication are defined, and which is “closed” under these operations. Because of these conditions, the interval \([a,b]\) would not qualify as one of these special sets, because it is not closed under addition (except in the trivial case where the interval is empty). However, the sets \(\mathbb{R}\) and \(\mathbb{R}^3\) have dimension (as vector spaces over \(\mathbb{R}\)) equal to one and three, respectively. This feels a bit more pleasing, as we now have some measure by which we can say that one is bigger than the other, which feels about right.

A little later on in my degree, I encountered the notion of measure, which is a kind of generic term for assigning size to subsets of some larger set in a way which obeys certain sensible rules, like the size of the union of two disjoint (non-overlapping) sets being the sum of their sizes. One of the most important examples of a measure is the Lebesgue measure. Under this notion of “size,” \([0,2]\) is twice as big as \([0,1]\) (as subsets of \(\mathbb{R}\)), which feels right.

After this fairly lengthy introduction, I would like to spend a bit more time unpacking these notions of “size” in this post, and comment (briefly) on how they pop up in probability and statistics. This might have a bit more of a “pure maths” flavour than other posts focusing on probability/statistics, but I think that these things are nevertheless interesting to discuss and, especially in the case of measure, are very much related to the usual content of this blog!

Injections, Surjections and Bijections

A function is essentially a “rule” which assigns, to each element (usually just a number) in a given set called the domain, another element in a given codomain. So specifying a function properly amounts to specifying all three of the “mapping rule,” the domain and the codomain. If we write \(A\) and \(B\) for the domain and codomain respectively, we will often compactly write a function as

\[f:A\longrightarrow B;\quad x\mapsto f(x)\]

with \(x\) being an input (a member of the set \(A\)) and \(f(x)\) being the output of applying the rule \(f\) to \(x\), which is an element of \(B\).

One subtlety here is that the mapping rule \(f(x)\) has to be specified at every \(x\in A\) (the \(\in\) symbol just means “in [the set \(A\)]” or “is an element of [the set \(A\)]”), but it doesn’t need to “reach” every element in \(B\). For example, if \(A=\{1,2\} , B=\{1,2,3\}\) then the rule “send 1 to 1, and send 2 to 2” is a perfectly valid function, which never “reaches” the element \(3\) in \(B\). However, we do have to send every element in \(A\) somewhere (even if we end up sending lots of elements to the same place) and we can only send each element to one place each (so we can’t say that \(1\) goes to \(1\) and \(2\), but we could make it so that both \(1\) and \(2\) go to \(2\)).

With this in mind, we say that a function is injective (sometimes called one-to-one) if we don’t assign the same element of \(B\) to more than one element of \(A\). Formally, a function \(f:A\longrightarrow B , x\mapsto f(x)\) is injective if, for any two elements \(x,y\in A\) (which can be the same), we have that \(f(x)=f(y)\) implies \(x=y\). This is saying that you can always identify an element in \(A\) from the value of the function \(f\), precisely because there was only one element which got sent to that function value.

A function is called surjective if it “reaches” every element of \(B\) (mathematically: for every \(z\in B\) there exists \(x\in A\) for which \(f(x)=z\)). It is called bijective if it is both injective and surjective.

Returning to my earlier example, the function I cooked up with \(A=\{1,2\} , B=\{1,2,3\}\) that took \(1\) to \(1\) and \(2\) to \(2\) would be injective but not surjective. In fact, there are no surjective functions from \(A\) to \(B\) in this example, because \(B\) has more elements than \(A\) (try to convince yourself!) Analogously, if the set \(A\) had more elements than \(B\), there would be no injective function with domain \(A\) and codomain \(B\).

To summarise this finding, if we write \(\lvert S \rvert \) for the size of some arbitrary set \(S\) (which is just the number of elements in the set), then an injective function with domain \(A\) and codomain \(B\) exists if and only if \(\lvert A\rvert \leq \lvert B \rvert \), and a surjection exists if and only if \( \lvert A\rvert \geq \lvert B \rvert \). Consequently, the only way we can have a bijection is if \(A\) and \(B\) have the same size (although even then it doesn’t mean that every function on these sets is a bijection - just that we could have one).

This actually gives us an alternative way of thinking about the sizes of sets: if there exists an injection from the first set to the second, the first is no larger; if there exists a surjection, it is no smaller; if there exists a bijection, the sets have the same size.
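For finite sets, these definitions can be checked mechanically. Here is a minimal Python sketch (the helper names `is_injective`, `is_surjective` and `is_bijective` are my own, not anything standard):

```python
def is_injective(f, A):
    # Injective: no two distinct inputs share the same output.
    outputs = [f(x) for x in A]
    return len(outputs) == len(set(outputs))

def is_surjective(f, A, B):
    # Surjective: every element of the codomain B is "reached".
    return {f(x) for x in A} == set(B)

def is_bijective(f, A, B):
    return is_injective(f, A) and is_surjective(f, A, B)

# The example from the text: A = {1, 2}, B = {1, 2, 3}, "send 1 to 1, send 2 to 2".
A, B = {1, 2}, {1, 2, 3}
f = lambda x: x

print(is_injective(f, A))      # True: distinct inputs give distinct outputs
print(is_surjective(f, A, B))  # False: the element 3 is never "reached"
```

Note that the same rule \(x\mapsto x\) with codomain \(\{1,2\}\) instead of \(\{1,2,3\}\) would be a bijection, which is why specifying the codomain is part of specifying the function.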

Countable and Uncountable sets

It turns out that this notion of size of sets via types of maps gives us a way of thinking more generally about sizes of sets, when those sets contain infinitely many elements. This is where the diagonal argument comes in, which I will detail very briefly in a moment.

Our starting point is the definition of a countable set, which is one with “size” at most that of the natural numbers \(\mathbb{N}=\{1,2,3,\dots\}\). In particular, \(\mathbb{N}\) itself is countable, as are any finite sets. Any subset of a countable set must itself be countable, since a subset is no larger than the set it is contained in. The way we measure size is in the way described above: a set \(S\) is countable if and only if there exists an injective function with domain \(S\) and codomain \(\mathbb{N}\). Equivalently, we can say that our set is countable if and only if there exists a surjection with domain \(\mathbb{N}\) and codomain \(S\) (Exercise: show that these are indeed equivalent!)

The word “countable” is associated to the idea that, by “counting” we can eventually reach our favourite element of the set (if we have enough time to waste). We’ll never get to the end, but if we fix one end goal then, no matter how far away it is (according to some “order”) we will get there in finite time. So if I really like the number a billion, I could count to it. Or a billion billion - it would take a while but you would get there eventually.

One of the slightly weird things about countability is that sets which seem “bigger” than \(\mathbb{N}\) are still actually countable. For instance, the set \(\{0,1,2,3,\dots\}\) seems larger than \(\mathbb{N}=\{1,2,3,\dots\}\), but yet the rule “add one to everything” gives you an injection (and in fact a bijection) from the first of these sets to the second. So even “slightly bigger” sets are sort of still the same size - basically because we’ve had to slightly adjust our notion of size (because what does something like \(\infty +1\) even mean?)
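The “add one” rule is easy to play with in code; we can only ever inspect a finite window of an infinite set, but the pattern is clear. A small sketch (the names are mine):

```python
# The "add one" rule: a bijection from {0, 1, 2, ...} to N = {1, 2, 3, ...}.
shift = lambda n: n + 1

window = list(range(10))             # a finite window of {0, 1, 2, ...}
images = [shift(n) for n in window]

print(images)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assert len(set(images)) == len(images)  # distinct outputs: injective on this window
```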

Cantor’s diagonal argument tells us that the real numbers (which are basically decimals with whatever arbitrarily long decimal expansion you like, so includes \(\pi\) for instance) are not countable. Sets which aren’t countable are (unsurprisingly) called uncountable sets. So we could show that there is no injection from the real numbers \(\mathbb{R}\) to the natural numbers \(\mathbb{N}\). Alternatively, we could show there is no surjection from the naturals to the reals - this is what the diagonal argument does.

Suppose we had a function \(f\) which we thought might be our mythical surjection. It would send \(1\) somewhere (say \(f(1)=x_1\in\mathbb{R}\)), it would send \(2\) to some other \(x_2\) (which may or may not be the same as \(x_1\)), and so on.

Imagine we “write” an infinitely long list of all of the values of our function \(f\). Then we define \(x^*\) as follows. If \(x_i\) has digit 5 in position \(i\) after the decimal point, we change it to a 4. Otherwise, we change it to a 5. I’ve shown what this would look like in the figure below.

Cantor's diagonal argument: if the values of our function are listed (as shown in black), we can construct \(x^*\) to differ from each of these function values in (at least) one decimal digit, and so it is not in the "reach" of our function.

Now we need to decide whether the function we started with is actually a surjection after all. Well, why don’t we see if there is an element in the domain \(\mathbb{N}\) which “reaches” the element \(x^*\) we just cooked up? Well it can’t be \(f(1)\), because that’s \(x_1\) and we made \(x^*\) different to that one. And it can’t be \(f(2)\) for the same reason. Or \(f(3)\). In fact, \(x^*\) can’t be equal to “\(f\) of” anything at all because of the way we designed it! So our \(f\) isn’t a surjection after all. But there was nothing special about \(f\) - it could have been anything - which tells us that no surjection can exist and so the real numbers are uncountable.
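The digit-flipping construction itself is easy to sketch in code. Below is a toy version, keeping only the first few decimal digits of each listed number (the function name and the example list are my own illustration):

```python
def diagonalise(digit_rows):
    """Build the decimal digits of x* from the digits of the listed reals.

    digit_rows[i][j] is the (j+1)-th decimal digit of x_{i+1}.
    Rule from the text: if the i-th diagonal digit is 5, use 4; otherwise use 5.
    """
    return [4 if digit_rows[i][i] == 5 else 5 for i in range(len(digit_rows))]

# A toy "list" of function values (only their first four decimal digits):
listed = [
    [5, 1, 2, 3],  # x_1 = 0.5123...
    [0, 5, 9, 9],  # x_2 = 0.0599...
    [3, 3, 0, 1],  # x_3 = 0.3301...
    [7, 7, 7, 5],  # x_4 = 0.7775...
]

star = diagonalise(listed)
print(star)  # [4, 4, 5, 4]

# x* differs from each x_i in the i-th decimal digit, so it is never "reached":
for i, x in enumerate(listed):
    assert star[i] != x[i]
```

(The choice of 4 and 5 is just one convenient rule; any rule that changes each diagonal digit, while avoiding trailing-9 expansions, works.)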

As I mentioned in this post, countable and uncountable sets play a role in probability because they determine what kind of object is natural for representing a distribution - either a probability mass function or a probability density function. This is really because it determines what the “natural” choice of measure on our sample space is, but I will not elaborate on what I mean by that at this point.

Vector spaces and dimension

Although I haven’t gone into all of the details about countable and uncountable sets, this notion of size (via injective/surjective functions) isn’t always exactly what we want. For example, in this setting the real numbers, or the “real line” \(\mathbb{R}\) and “3D space” \(\mathbb{R}^3\) have the same “size” (or cardinality). It turns out that both of these sets are what mathematicians call vector spaces and they have their own notion of size, which is roughly the number of “different” possible directions. In this setting, \(\mathbb{R}\) has dimension one (with the direction being “along the line”) and \(\mathbb{R}^3\) has dimension three (with the directions basically being “up/down,” “left/right,” and “forward/backward.”)

There are two things to point out in this definition. The first is that walking in a “direction” in this setting also includes walking backwards along the same line. Mathematically, this is basically because we would just think of it as “walking a negative distance” along that direction.

The set \(\mathbb{R}^2\) has two dimensions, since we can specify two "directions" along which we can get to any point, like the green point P labelled. The red directions are also directions, but these are redundant if we already have both black directions. We could use both red directions, or one red and one black, however. In any case, we need to specify two directions to "reach" the full vector space, and so our space has two dimensions.

The second is that these directions have to be what mathematicians call linearly independent. You might, for example, say that even under my rule that we have to be able to walk backwards along the same line, there are still a lot (infinitely many, in fact [Exercise: countably or uncountably many?]) of possible directions to choose, in \(\mathbb{R}^3\) at least. However, for our directions to be “linearly independent,” none of them can be redundant - no direction can be reached by combining the others - so we keep as few of them as possible while still being able to “reach” anywhere in the space. So if I’m in 2D space for instance, and I only walk along one line, I can’t reach every single point. But if I allow myself to walk horizontally or vertically, I don’t need to allow any more directions than that in order to get anywhere.
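A concrete way to see the redundancy in \(\mathbb{R}^2\) is that any point is a combination of the two “black” directions \((1,0)\) and \((0,1)\), so a third direction like \((1,1)\) adds nothing new. A small sketch (the function and point names are my own):

```python
def combine(coeffs, directions):
    # Linear combination in R^2: sum_i coeffs[i] * directions[i].
    return tuple(sum(c * d[k] for c, d in zip(coeffs, directions))
                 for k in range(2))

e1, e2 = (1, 0), (0, 1)  # the two "black" directions
P = (3, 2)               # some point we want to reach

# Walk 3 units along e1 and 2 units along e2 to reach P:
assert combine((3, 2), [e1, e2]) == P

# A "red" direction like (1, 1) is redundant: it is itself a combination.
assert combine((1, 1), [e1, e2]) == (1, 1)
```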

You may have noticed that this notion is closely linked to the idea of co-ordinates. In particular, if we have a vector space of dimension \(n\), we can express any element of the vector space by a set of co-ordinates with at most \(n\) entries. Then each one of these co-ordinates will (typically) be just a single real number.

Statistically, we often think of single real numbers as being “parameters.” For example, the mean or the variance of a normal distribution would be a parameter. However, we can think of these two together as a single parameter, living in a two-dimensional space, by sort of gluing them together to a co-ordinate pair of the form (mean,variance). If we are in a model where we can “glue together” all of the parameters of the model into finitely many co-ordinates like this, which we can equivalently think of as a single parameter in a finite dimensional vector space, then we call our model “parametric.”

If our model is so complex that this isn’t possible, we call it nonparametric (although you might notice that this somewhat confusingly includes models with infinitely many parameters). The study of such nonparametric models is currently a very hot (and very broad) topic of research, because this sort of complexity gives a very flexible statistical model which can capture a very broad range of distributions, meaning we don’t need to impose many assumptions at all.

A quick word on measure theory

I think that, considering the length of this post already, I will postpone a proper discussion of measure theory to another post. However, let me give you a very quick overview.

Roughly speaking, a measure on a set \(S\) is a function, with domain given by a collection of subsets of the set \(S\) called a sigma-algebra (or sometimes sigma-field) and with codomain some set \(B\) which is a subset of the (extended) non-negative real numbers \(\mathbb{R}_{\geq 0}\cup\{\infty\}\). Then the measure of a subset of \(S\) is just the value this function takes on when you plug in that subset. This function should satisfy certain nice properties - one of these, already implied by the choice of codomain, is that the measure of any set is non-negative, and another is that the measure of the empty set (which contains no elements) is zero. It should also tell us that the measure of a union of countably many (non-overlapping) sets is the sum of the measures of those sets, from which we can deduce that subsets of sets have measure no greater than the set in which they are contained.
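One of the simplest examples, which I'll sketch here just to make these properties concrete, is the counting measure on a finite set: the measure of a subset is simply its number of elements (the function name is mine):

```python
# Counting measure: the "size" of a subset is its number of elements.
def counting_measure(subset):
    return len(subset)

A = {1, 2}
B = {3, 4, 5}  # disjoint from A

print(counting_measure(set()))  # 0: the empty set has measure zero
print(counting_measure(A | B))  # 5: measure of a disjoint union is 2 + 3

assert counting_measure(A | B) == counting_measure(A) + counting_measure(B)
assert counting_measure(A) <= counting_measure(A | B)  # subsets measure no more
```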

Measure theory is of exceptional importance in modern probability theory. In this theory, distributions \(P\) are modelled as measures on the set \(S\) of “all possible outcomes” and subsets of that set \(S\) are particular outcomes. Then the measure of a particular subset is just the probability of a random variable with distribution \(P\) taking on an outcome described by that subset. Consequently, the measure of the whole set \(S\) must be exactly one, and measures for which this holds true are known as probability measures. The study of these distributions can then be conducted through an analysis of the “associated” measure (I suppose the measure just is the distribution, really).

As a quick example, the standard normal distribution \(N(0,1)\) can be thought of as a measure on the real numbers \(\mathbb{R}\). Then, under this measure, the “size” of a subset of \(\mathbb{R}\) is the probability that a variable distributed according to \(N(0,1)\) takes a value which is a member of that subset. So the size of the set \([-1.96,1.96]\) is (approximately) \(0.95\).
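This measure can be computed with nothing more than the standard normal CDF, which the Python standard library supports via the error function (the function name `normal_measure` is my own):

```python
import math

def normal_measure(a, b):
    """Measure of the interval [a, b] under N(0,1), i.e. P(a <= Z <= b)."""
    # Standard normal CDF, written in terms of the error function erf.
    Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return Phi(b) - Phi(a)

print(round(normal_measure(-1.96, 1.96), 3))  # 0.95
```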

One thing to note is that this notion of size doesn’t allow us (to my knowledge) to prescribe different “sizes” of infinity to sets. We can attach a measure \(\infty\) to a set, but we can’t put these infinities in a hierarchy like we did earlier on. In the case of the Lebesgue measure on \(\mathbb{R}\) for example, the measure of the uncountable set \(\mathbb{R}\) is infinity and the measure of the countable set \(\mathbb{N}\) is zero (as is the measure of every other countable set). However, there are even uncountable subsets of \(\mathbb{R}\) which have zero measure - despite being the same “size” (see here) - as well as less exotic sets with finite, positive measure (like \([0,1]\)).

Are there other notions of size?

There are still more notions of “size” (if you can call them that) which spring to mind, particularly ones which relate to analysis and topology. One such example is that of a separable space which contains a “countable dense subset” - I won’t elaborate on what that means here, but such spaces can be “nice” to work with when studying theoretical properties of the highly complex and flexible nonparametric models I mentioned before. Another notion which relates to modern statistical theory is that of metric entropy.

I hope from this post you have discovered that the notion of the size of a set is not so straightforward, and that these differing notions don’t always agree. However, it is important for probabilists and statisticians to understand these differing notions of size, and appreciate which are relevant for different aspects of the subject. In fact, one of the wonderful things about studying statistical theory is the relations it has to so many different areas of pure mathematics, which require one to have some knowledge of the kind of things I mentioned today.

Dan Moss

DPhil Student at Oxford/StatML CDT. Interested in maths, stats, veganism and current affairs. Pronouns: He/him
Oxford, United Kingdom