For most people, an “average” or “mean” usually refers to a process where you take some number of, say, positive integers, add them up, and then divide by how many there are. We can write this formally: if we denote the number of integers by \(n\) and write the integers themselves as \(x_1,\dots,x_n\), then the sample mean is \(\dfrac{x_1+\dots+x_n}{n}\).

**The average, or expected value, of random variables**

Once we start learning some probability, we quickly find that these terms actually have a meaning in this context as well. Suppose, for example, that we have a *random variable* \(X\) whose *distribution* is “-1 with probability half, +1 with probability half”. For an introduction to notions of random variables and distributions, check out my previous post here. We can define the mean of this random variable with the formula "multiply each probability of a value by the respective value, and add it up". In this context, that would correspond to doing \[ \frac{1}{2}\times(-1)+\frac{1}{2}\times 1. \]

As a quick aside, for this process to make any sense at all, our random variable has to take values that can be manipulated with multiplication and addition. For example, if our random variable is "Heads" with probability half and "Tails" with probability half, it is a bit difficult to evaluate \(\frac{1}{2}\times\text{"Tails"}+\frac{1}{2}\times \text{"Heads"}. \)

Going back to the earlier notion of the mean of a random variable, we can make this a bit more general. Suppose we have a random variable which can take on a total of \(k\) possible different values, and to make life easy we'll say those values are just the integers \(1,2,\dots, k\). Suppose that we specify that our random variable is such that the probability of getting a particular number between \(1\) and \(k\), say \(i\), is \(p_i\). So if we add up all the \(p_i\) where \(i\) can vary between \(1\) and \(k\), we ought to get one, otherwise we will end up with some kind of "leftover probability" (if you are not convinced, imagine tossing a coin where you have a 25% chance of heads, a 25% chance of tails, and a 50% chance of... nothing?). Mathematicians would write this compactly as \[\sum_{i=1}^kp_i=1.\] Then we can do the same thing as we did before when defining the mean, as "multiply each probability of a value by the respective value, and add it up," and we can also write this concisely as \[\sum_{i=1}^k ip_i.\] This quantity can be thought of as the mean of the distribution "Take value \(i\) with probability \(p_i\)," but more typically we tend to think of first cooking up a random variable \(X\) following this distribution, and then calling the above quantity the mean, or *expected value*, of \(X\). Mathematicians will typically write something like \[E[X] = \sum_{i=1}^k ip_i,\] where \(E[\cdot]\) just means "expectation of the thing where the dot is."
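The recipe "multiply each value by its probability and add it up" translates directly into a few lines of code. Here is a minimal sketch in Python (the function name is my own, not anything standard):

```python
# Expected value of a discrete random variable given as (value, probability)
# pairs, following the "multiply each value by its probability and sum" rule.
def expected_value(values_and_probs):
    total_prob = sum(p for _, p in values_and_probs)
    assert abs(total_prob - 1.0) < 1e-9, "probabilities must sum to one"
    return sum(v * p for v, p in values_and_probs)

# The +1/-1 coin-flip example from earlier: the mean is 0.
print(expected_value([(-1, 0.5), (1, 0.5)]))  # 0.0

# A variable taking values 1, 2, 3 with probabilities 1/4, 1/4, 1/2:
print(expected_value([(1, 0.25), (2, 0.25), (3, 0.5)]))  # 2.25
```

Passing in explicit (value, probability) pairs rather than a list of probabilities for \(1,\dots,k\) also covers cases like the ±1 example, where the values aren't consecutive integers.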

**Why do we call it that?**

Now that we have some understanding of what the mean (or expected value) of a random variable is, I would like to explore a little bit more what we intuitively might think of as "average/mean" or "expected value/expectation", and to what extent I think these terms are either reasonable or misleading.

Let me start with the terms "average" and "mean", which I will consider jointly. I think that this is probably the least controversial labeling, as most of the time it can reasonably be motivated as follows. Suppose that \(X_1,\dots,X_n\stackrel{iid}{\sim}P\) where \(P\) is some distribution. Once again, I will defer to my previous post here for an explanation of this notation. Now an often-cited (in the non-academic sense of the word) but less-often understood (see *Gambler's fallacy*) phenomenon, known as the *Law of Large Numbers*, tells us that the following is true. If we take the mean of these variables \(X_1,\dots,X_n\), in the sense of the first paragraph of this post which we all know and love, then as \(n\) gets larger and larger (meaning we are drawing more and more independent variables from the same distribution \(P\)), our sample mean \(\dfrac{X_1+\dots+X_n}{n}\) *tends towards* the mean. By "the mean" I refer to \(E[X_1]\), which looks a bit like it's strangely singling out the first draw from \(P\), but actually, because all of the variables are identically distributed, they all have the same mean (with this common mean being the mean of any particular one of them). More interesting is what I mean by *tends towards*. Feel free to skip the next part if you're happy to settle with any existing notion you might have of this phrase.
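To see the Law of Large Numbers in action, here is a quick simulation sketch (the sample sizes and seed are arbitrary choices of mine): we draw more and more iid copies of the ±1 variable from the start of the post, whose mean is 0, and watch the sample mean settle down.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Draw n iid variables which are -1 or +1 with probability half each
# (so E[X_1] = 0), and report the sample mean for increasing n.
for n in [10, 1_000, 100_000]:
    draws = [random.choice([-1, 1]) for _ in range(n)]
    print(n, sum(draws) / n)
```

For small \(n\) the sample mean can sit quite far from 0, but by \(n = 100{,}000\) it is typically within a few thousandths of it.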

**Aside: limits (of random variables)**

What do we mean by tend towards? It is easiest (but not easy!) to start with what we mean when we just have some usual, non-random numbers. Then perhaps we can worry about what it means for random numbers.

If we have a sequence of ordinary (real) numbers, \(x_1,\dots,x_n,\dots\), where the second dots just mean "I've got a list that goes on forever", then I can ask whether the sequence possesses a *limit*. This is a bit of a weird notion to wrap one's head around. Imagine that you have a fixed number \(y\) and that you pick some tolerance level for being a bit away from \(y\). The numbers in the sequence might start off as far away from \(y\) as you like, but suppose there exists a certain point after which *all* the remaining numbers in the list are within this tolerance of \(y\). And if we can pick any tolerance and this is still true (though we might need to go further down the list for the terms to be within a smaller tolerance from there on), we say the sequence has a limit \(y\) and that the sequence *tends to* \(y\).
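This "pick a tolerance, then go far enough down the list" game is concrete enough to play numerically. A small sketch using the sequence \(x_n = 1/n\), which tends to \(y = 0\):

```python
# For the sequence x_n = 1/n and candidate limit y = 0: for each tolerance,
# find the first point N after which every remaining term is within
# tolerance of y. (Since 1/n decreases, once one term is inside the
# tolerance, all later terms are too.)
y = 0.0

for tolerance in [0.1, 0.01, 0.001]:
    N = 1
    while abs(1 / N - y) >= tolerance:
        N += 1
    print(f"tolerance {tolerance}: within it from n = {N} onwards")
```

Notice how shrinking the tolerance forces us further down the list (\(n = 11\), then \(101\), then \(1001\)), exactly as the definition anticipates.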

We can do something a bit similar with random variables, but because of the extra complexity involved it turns out that there are several different notions of limits and tending to limits, and that the limits can exist in certain senses but not in others, and that the limits can themselves be random. And actually, there are (at least two) laws of large numbers for different types of convergence. Yes, weird. I will probably write a big blog post at some point about how we can understand limits of random sequences but I'll leave it there for this one.

**Ok, I know what a limit is now. So "average" is a sensible word?**

So we have our random variables, we take the mean of them, and as we have more and more (independent, identically distributed) random variables, we end up approaching the average (in the probabilistic sense) by taking the average in the familiar sense of all of these variables. There is an issue here when the expected value is infinite, or if it's driven by tiny chances of getting extreme values, as we will see later. But for the most part, using the word "average" for the quantity defined earlier relating to random variables seems to fit pretty well with our pre-existing notions of the term.

**So "average" is sort-of fine. What about expected value?**

I'm glad you asked! The other term, which is really the two terms "expected value" and "expectation" which I'm treating together, is itself a bit strange. In the first example of this post, I imagined a situation where a random variable was +1 or -1 with probability half each, and the expected value here was 0. But I think you'd be hard pressed to suggest that you *expect* to see the value 0, considering that it can't even take that value! Even if that seems a bit contrived, I think that if anything, we would "expect" to see the *mode* of the distribution, that is, the point which is most likely. For instance, imagine a random variable \(X\) which 90% of the time is one, 5% of the time is two, and 5% of the time is twenty. If we evaluate the expectation, we get

\[ E[X] = 0.9\times 1 + 0.05 \times 2 + 0.05 \times 20 = 2. \]

Now the value we would actually *expect* to see here would, surely, be one, since it is nine times more common to get than either of the two alternatives. But the expected value is 2, which we only see 5% of the time. A bit weird, but not as weird as what's to come.
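We can spell this example out in a couple of lines; the numbers below are exactly the ones from the text.

```python
# The distribution from the text: 1 with probability 0.9,
# 2 with probability 0.05, and 20 with probability 0.05.
dist = {1: 0.9, 2: 0.05, 20: 0.05}

mean = sum(value * prob for value, prob in dist.items())  # expected value
mode = max(dist, key=dist.get)                            # most likely value

print(mean)  # 2.0
print(mode)  # 1
```

The mode (1) is what we'd intuitively "expect"; the expected value (2) is a value we only see one time in twenty.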

**Infinite expectations and the St Petersburg Paradox**

The following "paradox" is attributed to cousins Nicolas and Daniel Bernoulli. Suppose we go into a casino (in St Petersburg, supposedly) and we are offered the opportunity to play a game, for a certain entry price. We are told that the house will toss a fair coin until it comes up heads. Once we see a head, we will win £\(2\times 2\times \cdots \times 2\), with the number of \(2\)'s multiplied together being equal to the number of coin tosses it took for us to see the head. So if we get heads first time, bad luck, we only get £2. And if we don't get heads until the fifth, or sixth time, great - we win £32, or £64. So it seems like it's worth paying something to play, and we might even think that it would be worth paying any amount which is less than what we "expect" to win. To me, it seems like it might be worth paying around £20.

Now let's call our winnings \(X\), which is a random variable, since our winnings are random. The probability of seeing a head first time is \(\frac{1}{2}\). And of seeing a head second time (but not first) \(\frac{1}{4}\). And for third (but not first, or second) \(\frac{1}{8}\). Hopefully you get the idea! To see our first head after \(n\) tosses has probability \(\frac{1}{2^n}\). This allows us to compute the expected value of our winnings, \(X\), as \[ E[X] = 2\times\dfrac{1}{2}+4\times\dfrac{1}{4} +8\times\dfrac{1}{8}+16\times\dfrac{1}{16} + \cdots = 1 + 1 + 1 + 1 + \cdots \]

This "sum" with infinitely many ones added up is "equal" to infinity, by which we mean that if we add up enough of them, we eventually get to a number which is as large as we like (and then keep increasing further from there). So for our game, the *expected value* is infinity!
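A simulation sketch of the game (the seed and round counts are arbitrary choices of mine) shows the symptom of an infinite expectation: the average winnings per round don't settle down as we play more, the way the sample means did for the Law of Large Numbers, but keep drifting upwards, driven by rare enormous payouts.

```python
import random

random.seed(1)  # fixed seed for reproducibility

def play_once():
    """Toss a fair coin until heads; win 2^(number of tosses) pounds."""
    tosses = 1
    while random.random() < 0.5:  # tails: toss again
        tosses += 1
    return 2 ** tosses

# Average winnings per round over increasingly long sessions.
for rounds in [100, 10_000, 1_000_000]:
    winnings = [play_once() for _ in range(rounds)]
    print(rounds, sum(winnings) / rounds)
```

Run this a few times with different seeds and you'll see the averages jump around wildly between runs, unlike the well-behaved ±1 example earlier.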

This illustrates a couple of things. First of all, it reinforces once again that the *expected value* should not be thought of as the value we expect, since it is clearly not even a possibility to "win infinity" from this game. Secondly, it shows us that *the expected value alone is not always a great guide to making decisions*. If we really believed that expected value was all that mattered, we ought to sell all our worldly possessions for a chance to play this game - even if I have to pay £100,000, I am still expecting to get back much, much more. Yet even £100 seems a bit excessive here.

It's not really a *paradox* per se, since there's nothing especially weird going on in the maths. The only sense in which it's paradoxical is that it might contradict our pre-existing notions of using expectations to value random outcomes of games like the one described. One explanation as to why this approach fails here is that we very quickly get contributions to the expectation coming from minuscule probabilities of enormous sums of money. If, for instance, the casino tells us that the maximum payout they will give us is £1,000,000,000,000, we probably don't feel that our decision is affected so much by that, since it's such an enormous cap anyway. But this constraint alone changes the expected value of the game to being somewhere in the region of £40.
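That claim about the cap is easy to check. In the capped game, each payout level that fits under the cap still contributes £1 to the expectation, and all the longer games together contribute the cap times their total probability. A sketch (the cap is the figure from the text):

```python
# Expected value of the St Petersburg game when the payout is capped.
cap = 1_000_000_000_000  # £1,000,000,000,000

expected = 0.0
n = 1
while 2 ** n <= cap:
    expected += (2 ** n) * 0.5 ** n  # each uncapped level contributes £1
    n += 1
# Every game lasting n or more tosses pays the cap; total probability 2^-(n-1).
expected += cap * 0.5 ** (n - 1)

print(round(expected, 2))  # 40.82
```

The first 39 payout levels fit under the cap (since \(2^{39} < 10^{12} < 2^{40}\)), giving £39, and the capped tail adds a couple of pounds more.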

As a closing remark, I always think it is rather odd when the expected value of a finite-valued random variable is infinite. In most settings, where the expected value is finite, it ends up being something which lies between the lowest and highest possible values. For instance, if we were to ask for the expected sum of the values on two normal dice, it would be very strange if we found out that the expected value was somehow bigger than 12. But in the example we just discussed above, the expectation, being infinite, is larger than all of the possible values the random variable might take on! The way I once explained this to a friend of mine is as follows: infinity *seems* too high, but because any finite number is too low (necessarily having lots more numbers bigger than it), the expectation somehow gets forced all the way up to infinity. This is the same thing we'd see with the law of large numbers in this setting. If we played a (very, very) large number of rounds, and calculated our average winnings per round as we did so, that average would get bigger and bigger over time, to the point where it was eventually always bigger than any finite number. But it would take a *really* long time.