Independence, Dependence, and Markov Chains

When I was studying for my undergraduate degree, I attended a second-year course titled "Markov chains," which was lectured by Geoffrey Grimmett. For those who have not enjoyed the lecturing style of Prof. Grimmett, I could probably devote a blog post entirely to that, but rest assured that he lectured in a unique way, one which could certainly not be faulted for being uninteresting. Although I still have a reasonable working understanding of the material in the course, I think one of the more interesting take-homes for me was something he said at the very start, in order to motivate the study of these structures. He argued that much of the probability theory we had met up to that point dealt with independent observations, and that this assumption is often hard to satisfy in reality. Unfortunately, simply removing any requirement of a "degree of independence" (for want of a better phrase) makes things rather more complicated to study. So his argument was that Markov chains provide a middle ground between these two extremes, nicely interpolating between them to give us a way of viewing the world where things influence one another, but not in too complicated a way. I wish to elaborate a little more on these ideas in this post, and tell you what we mean by a "Markov chain."

Random variables and distributions

When we assume data is independent, mathematicians and statisticians often write down things like

\[ X_1, X_2, \dots, X_n \stackrel{iid}{\sim} P\]

When we write \( X \sim P \), we should read this as "\(X\) is distributed according to the distribution \(P\)." This leaves us with a few more questions, namely, "What is \(X\)?", "What is \(P\)?", and "What does it mean for \(X\) to be distributed according to a distribution \(P\)?" \(X\) is what mathematicians call a random variable - loosely speaking, it is the outcome of some random procedure: something which is "random" and which can vary according to this randomness. There is a distinction to be made here which feels a bit nit-picky, but I shall make it for the sake of completeness. If we define a random variable before the "randomness has taken place," it is something which will end up having a particular value, but one which we do not yet know. Once the "randomness has taken place," we tend to say that the random variable has been realised, meaning that it has now taken a particular value. But for the most part, these distinctions can be swept aside.

A distribution \(P\) can be thought of as a descriptor for a random variable, in the sense that it describes the outcomes we could get from a random variable, telling us which values are possible and with what probability. For instance, we could imagine that \(P\) is something which tells us "Heads with probability half, Tails with probability half". Then, by writing \(X \sim P \), I am saying that \( X \) is a random variable whose profile of possible values is described by \(P\) - in the example given, \(X\) could be understood as the outcome when I toss an (unbiased) coin. Statisticians will typically assume that data gathered in the wild has arisen as a (realisation of a) random variable with a particular distribution. The distribution and the random variable itself are emphatically not the same thing - \(P\) tells us the profile of what a random variable could look like, whereas the random variable \(X\) itself, when distributed according to \(P\), is simply a single (random) observation (or data point) of the random phenomenon (fully) described by \(P\); once observed, it has a particular realisation, which can be thought of as a "draw from \(P\)".
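To make this distinction concrete, here is a minimal sketch in Python (the function name draw_from_P and the use of random.random are my own choices for illustration, not anything from the theory itself): the distribution \(P\) corresponds to a fixed recipe for generating values, while each call to the function produces one realisation of a random variable distributed according to \(P\).

```python
import random

def draw_from_P():
    """One draw from P = "Heads with probability half, Tails with probability half"."""
    return "Heads" if random.random() < 0.5 else "Tails"

# The function itself plays the role of the distribution P (a recipe for outcomes);
# each call produces a realisation of a random variable X ~ P.
x = draw_from_P()
print(x)  # e.g. "Heads"
```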

If we were to write \( X_1, X_2, \dots, X_n \sim P\), what we mean is that the whole list of \(n\) random variables (where \(n\) is just what mathematicians use as a placeholder for some number, like 10 or 20) are all distributed according to the distribution \(P\). The little numbers in the subscript aren't anything scary; it's just a way of giving a different "letter" to each of the variables. We could just as well have used A, B, C, D, ..., but when you have a few more variables you start running out of letters in a way that you don't run out of numbers. It's just an arbitrary name for the random variable.

Dependence, independence, and something in between.

Are we done? I said before that \(P\) "fully" describes \(X\), in the sense that everything you could hope to know about the nature of \(X\) is determined by \(P\). But unfortunately, that's not the case if I have a list of random variables like I described above, all of them distributed according to \(P\). If I were to just look at how each one behaves individually, I'd be back in the situation above, but how they behave jointly is not yet described. Let me make this more tangible by continuing the example of the coin toss. I could cook up a random variable, call it \(X_1\), by saying that whatever value comes up the next time I toss a coin, that will be \(X_1\). I could then define \(X_2\) similarly, but with another coin tossed by my friend. So far, so good. What if I then define \(X_3\) (before I start doing any coin tosses) by saying it will be equal to whatever \(X_1\) ends up being?

Now both \(X_1\) and \(X_3\) are distributed according to "Heads with probability half, Tails with probability half" individually, but somehow it wasn't quite what we had in mind when we imagined a whole string of random variables. Intuitively, I think we want \(X_1\) and \(X_3\) to be related to each other in the same way as \(X_1\) and \(X_2\), with both bringing forward some kind of "new randomness".

So let's instead keep the definition of \(X_1\) and \(X_2\) the same, and then replace \(X_3\) with a "different version" \( \tilde{X}_3 \), which will be whichever side comes up when I toss the coin again, after my friend and I have tossed our coins to get the first two values, so that \(\tilde{X}_3\) feels like a genuinely "new" random variable. In a sense, this random variable is identical to the old \(X_3\) because both are distributed according to \(P\). But what we mean is that the actual, observed outcome from this "new" random variable has the potential to be different to \(X_1\). In fact, not only did it have the potential to be different to \(X_1\), it was completely independent of \(X_1\). The property of a whole string of random variables \(X_1,X_2,\dots,X_n\) being independent and being identically distributed (meaning just that they all share the same distribution \(P\)) is what mathematicians conveniently abbreviate to "iid" and tend to pop on top of the \(\sim\) sign (the sign which means "distributed according to").
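As a rough illustration of the difference, here is a toy sketch in Python (the variable names are mine, and each assignment should be read as one realisation of the corresponding random variable):

```python
import random

def toss():
    """A fresh, fair coin toss: "Heads" or "Tails" with probability half each."""
    return "Heads" if random.random() < 0.5 else "Tails"

x1 = toss()        # my coin
x2 = toss()        # my friend's coin
x3 = x1            # the contrived copy: always equal to whatever x1 turned out to be
x3_tilde = toss()  # a genuinely new toss, independent of x1 and x2

# Individually, x3 and x3_tilde both look like draws from P,
# but only x3_tilde is independent of x1.
print(x1, x2, x3, x3_tilde)
```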

Great! Why don't we build a theory of probability and statistics around the assumption that the random variables we observe (which in the statistical setting are assumed to be the data we observe) are independent of one another? After all, if we had this contrived example where a couple of the data points were assumed to be exact matches, we would just throw out the extras, so a theory concerning such trivially correlated data would be useless. Well, the answer to the first question is that yes, mathematicians do indeed develop a lot of theory in the so-called "iid setting", where things are more or less as described above. But, as for the second remark, the stark contrast to this setting, outlined above with the "\(X_3\) is the same as \(X_1\)" example, isn't the only way that data points could depend on one another.

Imagine the family game of Monopoly, where you've just scored double sixes on the dice roll (something which we could readily describe with a framework similar to the above). Your relative throws the dice, but the throw is a little bit limp and, after seeing one of the dice come back up still as a six, you suspect that there wasn't really enough force behind their throw to really tip it off balance. Sure, it wasn't a dead cert that they'd get a six again (after all, the other die came up as a one), but somehow their throw made you feel it was at least biased towards your previous lucky roll.

The example above, while unserious, demonstrates the point that the sort of data statisticians have to deal with, which are often presumed to be random variables arising from some distribution \(P\), need not fall into either of the two camps described earlier, and it might be worth developing a mathematical theory for some of the in-between cases.

Enter the Markov chain.

In a Markov chain, the assumption we make on the dependency between our random variables \(X_1,\dots,X_n\) is that, while one observation may influence another, all of the influence of the observations "before" the current one (i.e. those with smaller labels) can be boiled down to the influence of the most recent one. This does not mean that \(X_1\) and \(X_3\), say, are independent, but rather that any dependency of \(X_3\) on \(X_1\) can be wholly "captured" by \(X_2\). If this feels like a difficult concept to grasp, it's because it is, and indeed it's something that took me at least the length of the six-week course to properly appreciate. But to make things concrete once again, I will return to the Monopoly game example.
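In symbols, for a chain taking values in some discrete set of outcomes (and writing \(\mathbb{P}\) for probability, to avoid a clash with the distribution \(P\)), the Markov property says that for any step \(n\) and any possible values \(x_1, \dots, x_{n+1}\),

\[ \mathbb{P}\left(X_{n+1} = x_{n+1} \mid X_1 = x_1, \dots, X_n = x_n\right) = \mathbb{P}\left(X_{n+1} = x_{n+1} \mid X_n = x_n\right). \]

In words: once we have conditioned on the most recent value, conditioning on anything earlier changes nothing.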

Imagine our data \(X_1,\dots,X_n\) were distributed according to \(P\), but this time \(P\) is "Die 1 comes up 1-6 with equal chance, Die 2 comes up 1-6 with equal chance independently of Die 1" and so describes the random variables corresponding to the values we see when we roll two dice. Imagine that we designate \(X_1\) to be the value we see on the first roll of the two dice, \(X_2\) on the second, and so on. As we discussed above, it's perhaps not entirely reasonable to suggest that these are independent. And perhaps it's not even reasonable to suggest that \(X_1\) and \(X_3\) are independent. After all, if I first roll my two sixes, and then my relative with a limp hand rolls a six again, then if my next relative also shares the limp-handed gene it's plausible that they too are a beneficiary of my luck. However, their roll only depends on mine through the relative who rolled after me, and so even though it depends on what I did, the dependency can be boiled down to what happened the step before they rolled.
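To put a toy version of this in code, here is a sketch only: the single die (simplified from the two-dice story above), the "stickiness" parameter, and its value of 0.3 are all made up for illustration. Each roll here depends on the whole history only through the roll immediately before it.

```python
import random

def next_roll(previous, stickiness=0.3):
    """One step of the toy chain: with probability `stickiness` the limp
    throw simply repeats the previous face; otherwise the roll is fair."""
    if random.random() < stickiness:
        return previous
    return random.randint(1, 6)

def simulate_chain(n_rolls=10, stickiness=0.3):
    """Simulate the chain: each roll depends on the past only through
    the immediately preceding roll (the Markov property)."""
    rolls = [random.randint(1, 6)]  # the first roll is fair
    for _ in range(n_rolls - 1):
        rolls.append(next_roll(rolls[-1], stickiness))
    return rolls

print(simulate_chain())  # e.g. [6, 6, 2, 2, 5, 1, 1, 1, 4, 3]
```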

It turns out that this assumption on how different random variables, or different points of data, relate to one another leads to a nice mathematical theory, in which a lot of the things that mathematicians know and love about independent random variables carry over to Markov chains. And it's not too difficult to see that there are plenty of situations where this would be a much more flexible and reasonable assumption on the structure of some real-world data we observe, such as the weather on a given day or (much more controversially) the price of a stock. Perhaps in these situations (especially the latter) this assumption is still too strong and unlikely to be true of the data we gather in the real world, but it certainly seems like an improvement over the assumption that these events are all independent of the past or future. To end this section, I will leave a quote from the statistician George Box, who famously said

"All models are wrong, but some are useful" - George Box

So, at the very least, maybe a model of data which supposes the random variables we use to model them behave like a Markov chain is more useful than one in which they are supposed to be completely independent.

I'd love to go more into depth about the theory of Markov chains and some further generalisations we can make to this dependency assumption (such as Hidden Markov models with which my own research is concerned) but I will have to leave it there for this post, and leave those musings for another time.

Dan Moss
