Continuous random variables: variance, expectations and the normal distribution.

In many of my posts up to now, when using examples to demonstrate a concept in probability or statistics, I have resorted to things like coin tosses and dice rolls. In the language of my previous post, which you might wish to read before this one, these examples have always concerned discrete distributions rather than continuous distributions. I didn’t want to mention the latter before I had introduced probability density functions, but I still feel like there is more to say on this matter before I start incorporating such examples in other posts. In this post, I wish to develop some more of the machinery for talking about continuous random variables, before introducing what is surely the most important class of such distributions - normal distributions.


In an earlier post, we introduced the notion of the expectation, or mean, of a random variable. Sometimes, we’re interested in quantifying the “spread” of a random variable around its mean. The mean alone, for example, wouldn’t allow us to distinguish between a random variable \(X\) which is +100 half of the time, and -100 half of the time, and a random variable \(Y\) which is +1 or -1 with equal probability - both of these have mean zero. What’s happening here is that, even though the two variables have the same mean, the first one typically lies further away from its mean (in fact, always at distance 100) than the second one (which always lies just one away from the mean). I think that we would really like to ask the question, “On average, how far away is the random variable from its average (or mean)?” This is what the variance does.

The first thing we might think to do to quantify the “average distance away” might be something like \[E[X-E[X]],\]but if we do that we will quickly find out that this quantity is always zero. This makes intuitive sense, because the random variable \(X-E[X]\) could be positive or negative depending on whether \(X\) is bigger than or less than its mean, and we would expect that, on average, \(X\) is the same as its mean rather than above or below, meaning that on average it would be zero, which is the case.
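To spell this out, note that \(E[X]\) is just a fixed number, so averaging it leaves it unchanged. Assuming the linearity property of expectation (taken here without proof), the calculation is:

\[E[X-E[X]] = E[X] - E[E[X]] = E[X] - E[X] = 0.\]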

This cancelling out of positive and negative parts isn't really what we want, because the thing we're interested in is just how far away the random variable typically gets from the mean, regardless of the direction of this departure. One way (perhaps not the only, but I do not want to get too sidetracked) of making sure that we don't have positive and negative parts cancelling out is to square the quantity \(X-E[X]\) before taking the expectation. Since squaring a number and squaring the negative of that number give the same answer, we're no longer sensitive to whether we're above the mean (in which case the original number is positive) or below the mean (in which case it's negative), with variability only coming from how far away we are. This quantity is known as the variance of the random variable \(X\), which I will call \(V(X)\), so

\[V(X) = E[(X-E[X])^2].\]

In a sense, this is a slightly weird thing to have done, because although the variance now represents some notion of average departure from the mean, it still doesn't quite feel right. In the case where \(X=\pm 100\) with equal probability, we would actually get a variance of \(100^2\) when really we might want it to be \(100\).

There are (at least) two possible resolutions to this. If you dislike the fact that this isn't quite giving us the right notion in this example, you could actually just take the square root of the variance and you would be back at \(100\) and all is well - this quantity is known as the standard deviation. Another way of looking at it is that, even though this number doesn't quantify exactly what you might have had in mind, it still does the job of distinguishing between distributions which are more spread out and less spread out (in the sense that the variance of "\(X=\pm 100\) with equal chance" is more than the variance of "\(Y=\pm 1 \) with equal chance" and less than the variance of "\(Z=\pm 101\) with equal chance").
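As a quick check of these numbers, here is a minimal Python sketch that computes the mean, variance (the average squared distance from the mean) and standard deviation of the three examples. The dictionary representation and the helper names `mean` and `variance` are just illustrative choices of mine.

```python
from math import sqrt

def mean(dist):
    """Expectation of a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in dist.items())

def variance(dist):
    """Average squared distance from the mean, E[(X - E[X])^2]."""
    mu = mean(dist)
    return sum((value - mu) ** 2 * prob for value, prob in dist.items())

# The three examples from the text: +/-100, +/-1 and +/-101 with equal chance.
X = {100: 0.5, -100: 0.5}
Y = {1: 0.5, -1: 0.5}
Z = {101: 0.5, -101: 0.5}

for name, dist in [("X", X), ("Y", Y), ("Z", Z)]:
    v = variance(dist)
    print(name, "mean:", mean(dist), "variance:", v, "standard deviation:", sqrt(v))
```

All three means come out as zero, while the variances are ordered as described: the one for \(Y\) is smallest, then \(X\), then \(Z\).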

In a sense, both of these resolutions are adopted in practice. On the one hand, variance is usually the thing we end up speaking about rather than standard deviation, I suppose because you could anyway just square root it to get the standard deviation, and in reality it is the variance itself which has nice properties. On the other hand, when we don't know what the value of the variance of a distribution is, or if we are just giving it a generic symbol to represent it, the usual choice is \(\sigma^2\), which sort of nods to the fact that in some sense it's the square of what you want. As an aside, the analogous choice of symbol for the mean is \(\mu\).

Expectations for continuous distributions

Just as we defined expectation and variance in the discrete setting, we can define expectations of continuous random variables. In both of these cases, we could have written the case of interest as \(E[g(X)]\), where \(g(X)\) is a function which takes in the random variable \(X\), and gives out something whose value depends on \(X\). So if we took the function \(g\) which does nothing (known as the identity function), then we could write \(g(X)=X\) (whatever we put in is unchanged). And if we took \(g\) to be “subtract the mean and square it” we would get \(g(X)=(X-E[X])^2\).

In these cases, \(E[g(X)]\) would give us the mean and variance corresponding to the first and second choices of \(g\) respectively. I could have just defined these quantities, but I thought I would write this more generally as it summarises everything nicely into one definition.

Once we have chosen our favourite function \(g\), then if we have a continuous random variable \(X\) with density function \(f\), we can define the expected value of \(g(X)\) as

\[E[g(X)] = \int_{-\infty}^{+\infty}g(x)f(x)\text{d}x\]

In particular, plugging in the choice \(g(x)=x\) gives us the expected value of the random variable \(X\).
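To make the definition a little more concrete, here is a rough Python sketch which approximates the integral by adding up thin rectangles (the midpoint rule). The density used, uniform on the interval from 0 to 1, is an assumption of mine chosen purely for illustration, and the function names are hypothetical. Since that density is zero outside the interval, only the range from 0 to 1 needs to be integrated.

```python
def expectation(g, f, lower, upper, steps=100_000):
    """Approximate the integral of g(x) * f(x) over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width  # midpoint of the i-th thin rectangle
        total += g(x) * f(x)
    return total * width

# Assumed example density (not from the post): uniform on [0, 1].
def uniform_density(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

# g(x) = x gives the mean; g(x) = (x - mean)^2 gives the variance.
mean = expectation(lambda x: x, uniform_density, 0.0, 1.0)
variance = expectation(lambda x: (x - mean) ** 2, uniform_density, 0.0, 1.0)
print(mean, variance)
```

For this assumed density the exact answers are 1/2 for the mean and 1/12 for the variance, and the approximation should land very close to both.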

Where did this definition come from, and why is it sensible? There are a couple of ways to answer this question. The first is the same as in the case of discrete distributions - this is just the number we converge to if we average out an increasing number of independent samples from the distribution with density \(f\). If you are satisfied with that, feel free to skip to the next section!

The other way we could see this is by using what we already had for discrete distributions. For ease of notation I will take \(g\) to be the identity, so we are just thinking about \(E[X]\). Suppose I have a continuous random variable \(X\), and I take a "discretised approximation" \(\tilde{X}\) of my random variable, which just tells me the endpoint of the class into which \(X\) falls.

Density function, in blue, for our random variable \(X\) which takes values between 0 and 1. The red region is the area under the curve between \(x=0.6\) and \(x=0.7\), representing the probability of \(X\) lying between these two points. When this happens, \(\tilde{X}=0.7\), and so the red area is the probability of \(\tilde{X}=0.7\)

For instance, imagine that \(X\) is a continuous random variable which could take any decimal value between 0 and 1, with density function \(f(x)=2x\), plotted above, and that \(\tilde{X}\) is the random variable taking values in \(0.1,0.2,0.3,\cdots,0.9,1.0\) which gives the right endpoint of the interval where the precise value of \(X\) lies. So if \(X=0.05\), \(\tilde{X}=0.1\), if \(X=0.6728\dots\) then \(\tilde{X}=0.7\), and so on. We could compute the probabilities of lying in each interval by calculating the area under the curve.

Using what we know about expectations of discrete random variables, we can calculate the expectation of \(\tilde{X}\), which will amount to multiplying all of the areas of each region of width 0.1 by the maximum value in that region (so each term in the sum would involve something similar to multiplying the red region displayed by 0.7, but for all the different intervals of length 0.1).

If we evaluate this sum, we would find that the answer would be (if my calculations are correct) 0.715. If we were to use this as an estimate of the expected value of \(X\), we might realise that this will be an overestimate, since \(\tilde{X}\geq X\). We could actually quantify this a bit more if we wanted to. We know that \(\tilde{X}\) gives us an overestimate for \(X\), but we also know that it doesn't overestimate by more than 0.1 (else we would move up to the next value of \(\tilde{X}\)). So \(\tilde{X}-0.1\) is actually always less than our value of \(X\), and so its expectation is also. We can summarise this as \(E[\tilde{X}-0.1] \leq E[X] \leq E [\tilde{X}] ,\) and it is intuitively clear that we can write the left hand side as \(E[\tilde{X}]-0.1\) instead, since averaging out a random quantity and then subtracting something ought to give me the same as subtracting that something first and then averaging (this is part of a property of expectation called linearity). For our case, we end up with \[0.615 < E[X] < 0.715.\]

We could have just as well split our interval into 100 subdivisions and played the same game, rounding up to the nearest multiple of 0.01 rather than 0.1. If we did that, and bounded \(E[X]\) above by the new \(E[\tilde{X}]\), and below by this quantity minus 0.01 this time (by the same logic as before), we would get \[0.66165<E[X]<0.67165\]
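The two pairs of bounds can be reproduced with a short Python sketch. It uses exact fractions so that no rounding error creeps in, and relies on the fact that the area under \(f(x)=2x\) between two points \(a<b\) is \(b^2-a^2\). The function name `upper_bound` is my own.

```python
from fractions import Fraction

def upper_bound(n):
    """E[X~] when X has density f(x) = 2x on [0, 1] and X~ rounds X up to a multiple of 1/n."""
    total = Fraction(0)
    for k in range(1, n + 1):
        right = Fraction(k, n)
        left = Fraction(k - 1, n)
        # Area under 2x between left and right, i.e. the probability that X~ equals right.
        probability = right ** 2 - left ** 2
        total += right * probability
    return total

for n in (10, 100):
    upper = upper_bound(n)
    lower = upper - Fraction(1, n)  # the same bound shifted down by one interval width
    print(n, float(lower), float(upper))
```

Calling `upper_bound` with larger values of `n` gives the finer subdivisions.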

If we kept refining \(\tilde{X}\) as giving us \(X\) to the nearest 0.001, 0.0001 and so on, the sequence of upper bounds on \(E[X]\) would decrease, ever more slowly, and they would in fact get arbitrarily close to 2/3 (although never reach it). Likewise, the sequence of lower bounds would approach 2/3 from below, also never reaching it. From these two sequences, we could deduce that in fact \(E[X]\) must be equal to 2/3, since it will always be less than something which is an arbitrarily small amount larger than 2/3, and always be more than something which is an arbitrarily small amount smaller than 2/3, and that entails that \(E[X]\) could not possibly be anything else (see here). Indeed, that is the exact value you get for \(E[X]\) when evaluating the integral described at the start of this section, with \(g(x)=x\) and \(f(x)=2x\). And although we were working in a specific case here of \(g\) and \(f\), a similar argument could be made with (almost) any choices of function \(g\) and density \(f\).
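For reference, here is how the integral works out, taking the density to be zero outside the interval from 0 to 1 so that only that range contributes:

\[E[X] = \int_{0}^{1} x \cdot 2x \,\text{d}x = \left[\frac{2}{3}x^3\right]_{0}^{1} = \frac{2}{3}.\]

If we split the interval into \(n\) equal pieces (so \(n=10\) and \(n=100\) above), the upper bound can also be written in closed form using the standard formulas for \(\sum k\) and \(\sum k^2\):

\[E[\tilde{X}] = \sum_{k=1}^{n}\frac{k}{n}\cdot\frac{2k-1}{n^2} = \frac{(n+1)(4n-1)}{6n^2} = \frac{2}{3}+\frac{1}{2n}-\frac{1}{6n^2}.\]

This is always a little more than 2/3, and subtracting \(1/n\) always gives a little less than 2/3, with both gaps shrinking to zero as \(n\) grows.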

The normal distribution, or "bell curve"

With this knowledge of expectation and variance under our belt, we can now get to the main point of this post, concerning the normal distribution. I think the most common name I hear for the normal distribution is the “bell curve,” which represents the idea that the majority of the “mass” of the distribution lies near the average, with ever fewer “outliers” in the population who depart considerably from this average. By “mass” I just mean area under the curve, so mathematically we are saying that a high proportion of the total area under the curve is found near the mean. By “outliers” I mean things which are rare events under this probability distribution (for instance, someone taller than 190cm might be considered an outlier across the distribution of heights, which might be a bell curve).

Bell curve representing distribution of heights in a population (mean 160, variance 10). Most of the area under the blue curve is concentrated around the middle, with outliers (like those 190cm or taller) being rare (only accounting for the red proportion of the total area under the blue curve)
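To put some rough numbers on "most of the mass lies near the average", here is a small sketch using `NormalDist` from Python's standard-library `statistics` module. It measures distance from the mean in units of the standard deviation. The choice of mean 0 and standard deviation 1 is an assumption made for convenience, since the proportions are the same for every bell curve.

```python
from statistics import NormalDist

# Standard normal used purely for illustration; the proportions below are the
# same for any mean and standard deviation.
standard = NormalDist(mu=0.0, sigma=1.0)

def mass_within(k):
    """Proportion of the total area lying within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(k, round(mass_within(k), 4))
```

Roughly 68% of the area lies within one standard deviation of the mean, about 95% within two and about 99.7% within three, which is the sense in which outliers are rare.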

We can formally define this family of normal distributions as the ones characterised by the (bell-shaped) densities

\[f(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

If a (continuous) random variable \(X\) follows a distribution \(P\) for which the above is the probability density, we say that \(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\). As a sanity check, we should probably make sure that, if \(X\) has a distribution with this density, it really does have mean and variance given by \(\mu\) and \(\sigma^2\), which amounts to confirming that

\[\int_{-\infty}^{+\infty}x\,f(x;\mu,\sigma)\,\text{d}x = \mu\]

and also that

\[\int_{-\infty}^{+\infty}(x-\mu)^2 f(x;\mu,\sigma)\,\text{d}x = \sigma^2,\]

which (I think) can be done with not so much knowledge of integration, if the reader is so inclined. If you want more of a challenge, it's interestingly quite a bit more difficult to confirm that the function I wrote down is indeed a density, meaning that it is positive (the easy bit) and that

\[\int_{-\infty}^{+\infty}f(x;\mu,\sigma)\,\text{d}x = 1.\]

Implicit in my definition is that the mean and variance of the normal distribution completely characterize it. That means that, if I have a normal distribution and so does my friend, and we both agree on the mean and variance, then we have the same distribution. This is emphatically not always the case for distributions in general. We saw earlier that the means of the discrete random variables \(X=\pm100\) and \(Y=\pm1\) agree, but the distributions certainly do not, which motivated our study of the variance to capture more aspects of these distributions. And even the variance (along with the mean) isn't always enough to completely describe the distribution. But if you know the mean, variance, and that it is in the family of normal distributions, you know exactly your distribution.
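For readers who would rather see the sanity checks on the normal density numerically than by hand (total area equal to one, mean \(\mu\) and variance \(\sigma^2\)), here is a rough Python sketch. It is only an approximation, not a proof: the infinite range is cut off at twelve standard deviations either side of the mean, and the particular values of `mu` and `sigma` are arbitrary choices of mine.

```python
from math import exp, pi, sqrt

def normal_density(x, mu, sigma):
    """The density f(x; mu, sigma), written out from the formula in the text."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def integrate(h, lower, upper, steps=200_000):
    """Midpoint-rule approximation to the integral of h over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(h(lower + (i + 0.5) * width) for i in range(steps)) * width

# Illustrative parameter values (assumptions, not taken from the post).
mu, sigma = 1.5, 2.0
# Truncate the infinite range to mu +/- 12 sigma, where the density is negligible.
a, b = mu - 12 * sigma, mu + 12 * sigma

total_area = integrate(lambda x: normal_density(x, mu, sigma), a, b)
mean = integrate(lambda x: x * normal_density(x, mu, sigma), a, b)
variance = integrate(lambda x: (x - mu) ** 2 * normal_density(x, mu, sigma), a, b)
print(total_area, mean, variance)
```

With these values the three printed numbers should be very close to 1, 1.5 and 4 respectively, and changing `mu` and `sigma` should move the last two accordingly.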

This family is thus an example of a parametric family of statistical models, which I mentioned in an earlier post. That means that the collection \(\mathcal{P}\) of normal distributions can be defined in terms of a reasonably simple-to-understand set of parameters \(\Theta\), which in this case contains all possible pairs \((\mu,\sigma^2)\), with each element of \(\Theta\) corresponding to a unique element of the collection. Since statistical inference when using parametric models is much more straightforward mathematically (although it requires more stringent assumptions on the process which gives rise to your data), this makes the family of normal distributions a popular choice for \(\mathcal{P}\), and in a slight abuse of terminology the study of "the" normal distribution sometimes refers to this whole family.

Are things normally distributed, normally?

It turns out that quite a lot of things are normally distributed, or at least assumed to be so. One of the mathematical results justifying this assumption is called the central limit theorem. It tells us that, when averaging out enough independent random variables from (almost) any distribution (roughly speaking, any whose variance is finite), the average eventually starts to look like it follows a normal distribution, whose mean is that of the random variables being averaged, and whose variance is linked to the sample size and the variance of each random variable in the average. This even includes discrete distributions!

I will hold off on a formal statement for now, and conclude with a demonstration of the central limit theorem in the context where the true data is discrete, and represents the average outcome from a random variable \(Y\) which is \(\pm 1\) with equal probability. We saw earlier that such a random variable has expectation zero, so the central limit theorem tells us that the sample average (not the mean, but the thing we get by adding up the random variables and dividing by how many there are), which is itself a random variable, has a distribution which is approximately normal if it is based on sufficiently many samples. We illustrate this phenomenon with the following histogram, which gives an idea of the distribution of the sample mean based on averaging 1000 independent copies of \(Y\).

1000 data points from the distribution of \(Y\) were drawn and the sample mean was computed, which gave "one observation" of the (random) average of 1000 copies of \(Y\). This process was repeated 10,000 times to give 10,000 observations of this random sample mean. The histogram then shows that the sample mean based on 1000 observations from the distribution of \(Y\) is approximately normally distributed. It is the 1000 figure, being the number of copies averaged, which drives the shape of the distribution of the sample mean - this was just repeated a large number (10,000) of times to get an accurate picture for this distribution.
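For anyone wanting to repeat this experiment, here is one possible Python sketch of the simulation. It is not the code used to produce the histogram, the seed and the function name are my own choices, and it prints summary numbers rather than drawing a plot. It also uses the standard fact (assumed here rather than shown) that the variance of an average of independent copies is the variance of one copy divided by the number of copies, so here 1/1000.

```python
import random
from statistics import fmean, pvariance

def sample_mean_of_signs(rng, copies):
    """Average of `copies` independent draws of Y, where Y is +1 or -1 with equal chance."""
    # Each of the `copies` random bits is 1 with probability 1/2; treat 1 as +1 and 0 as -1.
    ones = bin(rng.getrandbits(copies)).count("1")
    return (2 * ones - copies) / copies

rng = random.Random(0)  # fixed seed so the run is repeatable
copies, repetitions = 1000, 10_000
sample_means = [sample_mean_of_signs(rng, copies) for _ in range(repetitions)]

print("average of the sample means:", fmean(sample_means))
print("variance of the sample means:", pvariance(sample_means))
print("variance of Y divided by copies:", 1 / copies)
```

Drawing the 1000 coin flips as random bits in one go is just a speed trick; looping over 1000 individual draws of \(\pm 1\) would give the same distribution. Plotting a histogram of `sample_means` should give a bell shape like the one described.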

The normal distribution appears in many more mathematical results, one of my favourites being the Bernstein-von Mises theorem which features the normal distribution in its interconnection of two paradigms of statistical inference, known as frequentist and Bayesian inference, which I am sure will feature in a post in the not too distant future once we have laid the required foundations. Although a discussion of all aspects of the normal distribution would be impossible to contain in a single post, I hope that this serves as a helpful introduction to these distributions which are ubiquitous in statistics.

Dan Moss

DPhil Student at Oxford/StatML CDT. Interested in maths, stats, veganism and current affairs. Pronouns: He/him
Oxford, United Kingdom