Much of what I have already discussed on this blog concerns random variables which have only a handful of possible outcomes, such as the toss of a coin or the roll of a die. The distribution, or "profile of randomness," of these random variables can be characterised by the probability of each of the outcomes (with the constraint that the probabilities sum to one). So, in the classic example of the biased coin, it suffices to specify the probability of heads and tails (or indeed just of heads, with the probability of tails being such that the two sum to one).

This method of specifying distributions is fine provided that our random variable takes values in what mathematicians call a *countable set*. A countable set is either finite, or it is infinite but "not too large." This is a bit of a bizarre notion if you are unfamiliar with the hierarchy of different types of infinity, but what it loosely means is that we could associate each element of the set with a distinct positive integer, and by counting through the positive integers 1, 2, 3, ... we would eventually reach the number corresponding to any particular element of the countably infinite set (which isn't the case with every infinite set - see here).

In both of these cases, we can simply list the probabilities of each state. This is straightforward to do when the set is finite; when the set is infinite, it amounts to specifying an infinitely long list which adds up to one (technicality of infinite summation glossed over) of the kind

\[p_1,p_2,p_3,\dots\]

where each \(p_n\) corresponds to the probability of seeing whichever outcome you associated to the positive integer \(n\). If you're cynical about the process of specifying an infinitely long list, it can be done more simply by giving a rule that fixes every entry at once, like \(p_n=2^{-n}\), rather than someone making each entry up in turn.
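As a quick sanity check (my own sketch, not from the post), we can verify that a rule like \(p_n=2^{-n}\) really does define a valid mass function, by confirming that its partial sums approach one:

```python
# The rule p_n = 2^(-n) fixes an "infinitely long list" all at once.
# Its partial sums approach one, so it is a valid probability mass
# function on the positive integers.

def p(n):
    """Probability assigned to the outcome labelled n = 1, 2, 3, ..."""
    return 2.0 ** (-n)

partial_sum = sum(p(n) for n in range(1, 51))  # first 50 terms
print(partial_sum)  # extremely close to 1
```

Of course, we never literally add up infinitely many terms; the point is that the total gets as close to one as we like.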

Both of these objects, which are either finite or "countably infinite" lists of numbers \(p_1,p_2,p_3,\dots\) which sum to one, are known as *probability mass functions*. We could imagine plotting a bar chart with \(n\) along the \(x\)-axis and \(p_n\) along the \(y\)-axis, with the property that adding up all of the heights of the bars gives one.

**Bar charts and histograms**

For example, imagine that cars near me are only produced in the colours red, yellow, green, cyan, blue and magenta, and that I try to find out the probability of a car going by my house being each of those colours. I could specify some numbers \(p_{red},p_{yellow},p_{green},p_{cyan},p_{blue},p_{magenta}\) and plot them as below.

Imagine now that we categorise heights into a handful of different ranges, and we are interested in the probability of some population of people falling into those categories. For the sake of simplicity, imagine this population of interest is always between 150cm and 180cm, and that my data is placed into different categories according to the height \(h\) measured, say \(150\leq h < 155, 155\leq h < 160, 160\leq h < 165, 165\leq h < 170, 170\leq h < 175, 175\leq h < 180\). If I call these categories 1, 2, 3, 4, 5, 6 and denote the probabilities of falling into them as \(p_1,p_2,p_3,p_4,p_5,p_6\), then I might estimate from my statistical inference that, for example, \(p_1=0.1\), \(p_2=0.15\), \(p_3=0.18\), \(p_4=0.17\), \(p_5=0.25\) and \(p_6=0.15\). I could plot my results in a bar chart like this:
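These six numbers can be stored as a mass function over the categories and checked against the sum-to-one constraint (a small sketch of my own, using the figures from the example above):

```python
# The height-category probabilities from the example, as a probability
# mass function over the six classes. A valid p.m.f. must sum to one.

height_pmf = {
    "150-155": 0.10,
    "155-160": 0.15,
    "160-165": 0.18,
    "165-170": 0.17,
    "170-175": 0.25,
    "175-180": 0.15,
}

total = sum(height_pmf.values())
print(total)  # sums to one (up to floating-point rounding)
```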

Visually, I think that when we look at this graph we get the impression that the heights are reasonably varied, but with a bit of a skew towards the taller end.

Imagine that my friend has conducted a very similar analysis, but they accidentally counted 150-160 all as one category, with a combined estimated probability of \(0.25\) for that category as a result. They produce a bar chart showing their estimated probability mass function as follows:

Inevitably, we have lost some information about exactly how people are split within the 150-160 category, which can be seen from my chart. But there is something more striking about this graph: it gives the impression that there is actually much more of a skew towards shorter people in the population, especially when compared to my graph.

This is because the bar is bigger: it's representing a larger, combined class, but the way we have displayed it doesn't take into account the fact that this probability is somehow "spread out" across that larger class.

There is a solution to this, which is known as a *histogram*. In a histogram, we plot bars so that *the area of each bar is proportional to the probability of the category* (which means the label on the \(y\)-axis is the slightly unfamiliar term "frequency density", because it's the area that we should care about here!).

Here are what the histograms would look like for the two approaches.

Inevitably, my friend's histogram looks a bit different because of what I mentioned before, that they have lost information about how people are split within the broader 150-160 category. But it notably *doesn't* give the impression that the population as a whole is shorter than it really is.
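The fix can be made concrete with a small sketch of my own, using the numbers from the example: dividing each probability by its class width gives the frequency density, so the merged 150-160 bar ends up *lower*, not taller, than the narrower bars around it.

```python
# Frequency density = probability / class width, so that
# area (density x width) recovers the probability of each class.

mine = {(150, 155): 0.10, (155, 160): 0.15, (160, 165): 0.18,
        (165, 170): 0.17, (170, 175): 0.25, (175, 180): 0.15}

friend = {(150, 160): 0.25, (160, 165): 0.18,
          (165, 170): 0.17, (170, 175): 0.25, (175, 180): 0.15}

def frequency_density(pmf):
    """Convert class probabilities into histogram frequency densities."""
    return {(a, b): prob / (b - a) for (a, b), prob in pmf.items()}

# The merged class has a large *probability* (0.25, tied with 170-175),
# but its *density* 0.25/10 = 0.025 is half of 170-175's 0.25/5 = 0.05,
# which is why the histogram no longer exaggerates the shorter end.
print(frequency_density(friend)[(150, 160)])  # 0.025
```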

**Discrete and continuous random variables**

When we had random variables taking values in a countable set, we could write down probability mass functions which fully described their distributions. We can represent these with a (potentially infinite) bar chart, and for things which we feel are genuinely discrete, like colours of cars or days of the week, these bar charts work just fine.

We run into problems with using these bar charts when these discrete categories are actually ranges of values of a *continuous variable* which, loosely speaking, is one that can take on any arbitrary decimal value like 1, or 2.3, or 3.14159..., with as many numbers as you like after the decimal point. And it turns out there are so many of these numbers that they form an *uncountably infinite* set. We saw that histograms are perhaps a more natural way to represent the kind of data we had above, and I think the reason why is that the data isn't *really* discrete, it's just a "discretisation" of a continuous spectrum that allows us to view it as a finite set of different categories.

So perhaps histograms are the way to go with continuous data, and perhaps we should try to make our "class width" (how wide each category is, so 150-155 has width 5 etc.) as small as possible to really somehow capture the continuous nature of these variables. But this seems a bit artificial, and it's only necessary because we're constraining ourselves to represent distributions via probability mass functions, which can't be defined on an uncountably infinite set.
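The shrinking-class-width idea can be seen in a small simulation (my own sketch, not from the post): sample a continuous variable whose true density we know, here the uniform distribution on \([0,1]\) with density 1 everywhere, and watch the estimated frequency density of a class settle near that value as the class narrows.

```python
import random

# Sample a continuous variable (uniform on [0, 1], true density 1
# everywhere) and estimate the frequency density of one class
# as its width shrinks.

random.seed(0)
samples = [random.random() for _ in range(100_000)]

for width in [0.5, 0.1, 0.02]:
    a, b = 0.3, 0.3 + width  # one class [a, b)
    prob = sum(a <= x < b for x in samples) / len(samples)
    print(width, prob / width)  # frequency density, hovering near 1
```

The estimates hover near the true density of 1; with narrower classes (and enough data) the histogram starts to trace out the density itself.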

**Probability density functions**

It turns out that the natural way of characterising continuous random variables, where we can't specify a probability mass function, is with what's called a *probability density function*. In the same way that a probability mass function can be thought of in terms of a typical bar chart of data associated to those probabilities, the probability density function can be thought of as telling you what a typical histogram of the continuous data might look like, if you use sufficiently small class width.

Formally, a probability density function is a function \(p\) which takes in a value of \(x\), which is just any (real) number, and gives out the "probability density" at \(x\), \(p(x)\). This defines a distribution \(P\) as follows: We say that \(X\) follows a distribution \(P\) with density \(p\) if, for any class \(a<x<b\), the probability of \(X\) lying in this class is equal to the area of the shape formed by shading under the graph of the density function between the points where \(x=a\) and where \(x=b\).

The actual values \(p(x)\) (on the \(y\)-axis) are very similar to the values of the frequency density in the histogram, in the sense that they are really there to make sure a particular area, which is really the thing of interest, is as desired. The astute reader will note that, analogously to the way in which all of the probabilities \(p_n\) defining a mass function needed to add up to one, the graph of the function \(y=p(x)\) needs to have the total area of the shape underneath it being equal to one (or, more precisely, the area between \(x=a\) and \(x=b\) approaching one as we push \(a\) and \(b\) out towards infinity in both directions).

The area under a curve between two points \(a\) and \(b\) actually has a special name: it is called the *integral* of the function between \(a\) and \(b\), and there is a fancy way of writing it, which is

\[\int_a^b p(x)\text{d}x\]

The symbol on the left, known as the integral sign, is a sort of stretched "S" which derives from the fact that an integral can be defined as a sum of lots of very small strips, like if we looked at the area under a histogram with very narrow class width.
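This "sum of lots of very small strips" picture can be sketched directly in code (my own illustration, using the toy density \(p(x)=2x\) on \([0,1]\), whose exact integral from \(a\) to \(b\) is \(b^2-a^2\)):

```python
# Approximate the integral of a density by adding up the areas of
# many narrow strips, like a histogram with very small class width.

def p(x):
    """Toy density: p(x) = 2x on [0, 1], zero elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

def integrate(a, b, strips=100_000):
    width = (b - a) / strips
    # area of each strip: height at its midpoint times its width
    return sum(p(a + (i + 0.5) * width) * width for i in range(strips))

print(integrate(0.0, 1.0))  # close to 1: total area under a density
print(integrate(0.2, 0.5))  # close to 0.5**2 - 0.2**2 = 0.21
```

The first number checks the "total area equal to one" property; the second is the probability of the class \(0.2 < x < 0.5\) under this density.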

**Discrete and continuous - is that all?**

I think it was always presented to me that distributions were either discrete or continuous, but it is not hard to construct an example showing that this is a false dichotomy, even in a pretty reasonable situation. This can happen when our distribution contains both discrete and continuous aspects, because probability mass functions and probability density functions don't mix well together. The reason is that a probability density function assigns no probability to any single point, only to intervals (since shapes with a side of zero length have zero area), so we can't fold a mass function into a p.d.f.; but we also can't use a mass function for the continuous bit, because it usually won't take values in a countable set where mass functions can actually be defined.

An example of where this might be the case is a machine which measures (arbitrarily accurately, for argument's sake) the length of some pieces of wood. Unfortunately, the machine is a little bit unreliable, and so 5% of the time it simply comes back with "Error". This means that, if we feed the machine pieces of wood whose lengths are random variables distributed according to some probability density function, the reading on the machine would 95% of the time follow the distribution of this density, and 5% of the time simply say "Error". Because of this incompatibility between densities and mass functions, it's difficult to describe this profile of randomness in a nice way. In fact, it might be easier to describe it in a kind of hierarchical fashion of the type described in this post, where we understand the distribution "indirectly" through an unpacking of the underlying distributions which somehow feed into the overall distribution.
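That hierarchical description translates very naturally into a simulation (a sketch of my own; the exponential distribution for the lengths is an arbitrary stand-in, since the post doesn't specify one):

```python
import random

# The unreliable measuring machine: with probability 0.05 the reading
# is the discrete outcome "Error", otherwise it is drawn from some
# continuous density (an exponential length, chosen for illustration).

random.seed(1)

def reading():
    if random.random() < 0.05:
        return "Error"
    return random.expovariate(1.0)  # a stand-in continuous length

results = [reading() for _ in range(100_000)]
error_fraction = sum(r == "Error" for r in results) / len(results)
print(error_fraction)  # close to 0.05: the discrete part of the mixture
```

Note how the code mirrors the hierarchy: first a discrete coin-flip, then, conditionally, a draw from a density. Neither a mass function nor a density alone describes the overall reading, but the two-stage description is perfectly clear.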

Perhaps this may still not be enough, and we can get even more general if we adopt the language of *measure theory*, but that certainly deserves a fuller treatment than I could give at the end of this post. Fortunately, an understanding of both probability mass functions and probability densities is enough to guide you through the vast majority of things you will encounter, and I hope this post has provided you with something of an introduction to them both.