Some of the very first notions of statistics we are introduced to at school concern gathering data in such a way that the sample is “representative” of the underlying population. I recall being presented with the image of someone conducting a survey outside a swimming pool, having apparently chosen the location by accident, and then drawing erroneous inferences about the extent to which the average person on the street would enjoy swimming. Perhaps even using “the average person on the street” as a baseline wholly representative of the population is itself questionable, given that people who find themselves outside more rarely may also take part in other physical activity less. Either way, it is clear that biases like this in the data-gathering process should be minimised, or somehow accounted for when the data is processed, if inference about the broader group is our ultimate goal.

Yet when we meet a formalisation of statistics such as that discussed in my previous post, it is routine to assume both that our observations all come from the same distribution, and that the object of interest in our inference is this common distribution. Perhaps the first assumption holds in our swimming pool example above, with most people we survey giving responses from a similar distribution; but since we imagined a situation where we were really interested in the preferences of the wider population, we are certainly not in the scenario where the distribution we wish to learn about is the one from which the data was actually observed.

In this post, I would like to discuss a class of statistical models known as “mixture models” and talk about how they relate to these issues of finding representative samples, and how they might be thought of as a way to bridge the gap between this seemingly rigid statistical formulation, and the reality of data gathering in the real world.

**Hierarchical models**

Recall that a model, as defined here, is a collection \(\mathcal{P}\) consisting of candidate distributions \(P\) for our observed data, which we will as usual call \(X_1,\dots,X_n\). Recall that these distributions \(P\) are used to describe how the randomness of a random variable looks - when we write \(X\sim P\) we mean that “\(X\) is generated according to the randomness rule described by \(P\)” which could be something as simple as “Heads with probability half, tails with probability half”.

These rules of randomness could be defined in what statisticians might call a “hierarchical” way, where we build a random variable \(X\) with a certain distribution \(P\) by using some other random variables to help us. The resulting distribution \(P\) could always be defined without the use of these other random variables, but they can sometimes help us get more of an intuition for how the random variable (or data) is generated.

For example, imagine that we model whether I decide to go for a run on a given day by a random variable \(X\), which takes on the values “run” or “no run”. If the temperature is less than 10 degrees, I never go. If it’s between 10 and 15 degrees, I’ll only go if my friend agrees, and they decide whether or not to go by tossing a (possibly biased) coin (agreeing if it comes heads up). And if it’s over 15 degrees, I always go.

We could construct this randomness scheme with other random variables: \(T\) distributed according to \(P_T\) which tells me the temperature, and \(C\) distributed according to \(P_C\) which tells me the outcome of the coin toss. Then my random variable \(X\) can be defined by

\[ X = \begin{cases} \text{“run” } & \text{ if } T\geq 15 \text{ or } (10\leq T< 15\text{ and }C=\text{“Heads”}) \\ \text{“no run” } & \text{ if } T< 10 \text{ or } (10\leq T < 15 \text{ and }C=\text{“Tails”}) \end{cases}\]

We could write down the distribution \(P\) of \(X\) without appealing to \(T\) or \(C\) if we wanted to. But this structure gives us both the distribution of \(X\), and gives us a way of thinking about it in terms of some other random variables.
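This hierarchical construction is easy to simulate. Here is a minimal sketch of the running example, where the distribution \(P_T\) of the temperature is an assumption invented purely for illustration (a normal distribution centred at 12 degrees):

```python
import random

def simulate_run(p_heads=0.5):
    """One draw of X, built from the latent variables T (temperature) and C (coin)."""
    T = random.gauss(12, 5)  # hypothetical P_T: normal, mean 12, sd 5 (an assumption)
    C = "Heads" if random.random() < p_heads else "Tails"  # P_C: a (possibly biased) coin
    if T >= 15 or (10 <= T < 15 and C == "Heads"):
        return "run"
    return "no run"

random.seed(0)
draws = [simulate_run() for _ in range(10_000)]
# Monte Carlo estimate of P(X = "run") under the assumed P_T and a fair coin
print(draws.count("run") / len(draws))
```

Notice that we never need to write down the distribution \(P\) of \(X\) explicitly: simulating the latent variables and applying the rule is enough.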

In the statistical context, the variables involved in this hierarchical construction will often be unobserved, so we would only actually find out in this case whether or not I run. In fact, if we did have access to the values of \(T\) and \(C\), it might be more natural to try and learn about the distributions \(P_T\) and \(P_C\), because understanding these would give us a full understanding of the distribution \(P\) of \(X\). So, in practice, specification of a hierarchical model usually involves dependence on some unobserved random variables, often referred to as *latent variables*.

**Mixture Models**

Mixture models are a class of models which can naturally be formulated in a hierarchical way as described above. I will try and give an intuitive explanation of what mixture models are, and then I will make this precise with some mathematical notation for the interested reader.

When we say we are using a “mixture model” we are assuming that the data, drawn from a broad population \(P\), can actually be viewed as coming from a “mixture” of sub-populations, each with a corresponding distribution describing the data drawn from that sub-population. In this scheme, we usually imagine that we can’t directly observe which sub-population our observation is drawn from, and so the sub-population to which our random data sample belongs is a *latent variable* of the kind discussed above. Often when we use mixture models, we are interested in learning both about the makeup of each of the sub-populations, and the within-sub-population distribution of the data.

Let me demonstrate this with an example. Imagine I have survey data which indicates what level of minimum wage a sample of people support. I could try and perform inference about the overall distribution of people’s ideas about the correct minimum wage, and make statements such as “The mean supported wage is approximately £9 per hour” or something to that effect. However, I could model the population as a mixture of people from three groups, being “Conservative voters”, “Labour voters” and “Voters of neither party” with the latter group also including those who don’t vote. I might reasonably think that the distribution of people’s views within these sub-populations differed between these three sub-populations, and be curious about what these three distributions are, and I might also be interested in estimating what proportion of people fall into each of the three categories. Mixture models provide us with a framework to answer questions like this one.

**Aside on identifiability**

Some of the more cynical readers might note that this is mathematically sort of a non-definition, because on a basic level any distribution can actually be rewritten as a mixture. For instance, you could imagine splitting up the population from which we took our minimum-wage data along sub-populations which (presumably) have no bearing on one’s support for different wages, such as favourite colour, and then simply view the within-group distributions as being the same as the overall distribution, with no change for each group. While this is true, and means that mixture models in their fullest generality are rather *too* flexible, we can place a few more assumptions on the sub-populations and distributions, which are generally not too controversial, to avoid pathologies like the one described. Such assumptions guarantee what statisticians call *identifiability* of a model, which is a necessary prerequisite to inference in any statistical setting involving distributions defined in terms of other quantities (like distributions described by parameters), but which is a more delicate matter in mixture models. But I don’t think readers of this post will lose too much by ignoring such details, unless they are already pursuing a career in academic statistics!

**Mathematical Formulation (Quite technical!)**

Mathematically, in (discrete) mixture models, we have data \(X_1,\dots,X_n\) and we will define our candidate set \(\mathcal{P}\) of distributions \(P\) as follows: We assume there exists some positive integer \(k\) and some latent variables \(Y_1,\dots,Y_n\) taking values in \(\{1,\dots,k\}\), independently and identically distributed according to some distribution \(P_Y\). Then, we suppose that, if \(Y_i=j\), then \(X_i\) is distributed according to \(P_j\). Since \(j\) can range between \(1\) and \(k\), that means that specification of a distribution \(P\) in \(\mathcal{P}\) amounts to specifying the distribution \(P_Y\), which is called the *mixing distribution*, and the distributions \(P_1,P_2,\dots,P_k\), which are called the *emission distributions*.

Because \(P_Y\) is a distribution over a finite set, it is easily characterised by elements of a simple set (of probabilities of each of the \(k\) states) which we will call \(\Theta_Y\). For ease of notation, we will also assume that there is a set \(\Theta_j\) parametrising each of the \(P_j\). What this means is that, if you give me a full list of elements \(\theta_Y\in\Theta_Y\), \(\theta_1\in\Theta_1,\dots,\theta_k\in\Theta_k\), then you have actually pointed me to all of the distributions described above, which themselves point me to an overall (candidate) distribution \(P\) governing the randomness of my data. The set containing those lists is given by a big set, which we call \(\Theta\), which we can mathematically write as

\[\Theta=\Theta_Y\times\Theta_1\times\Theta_2\times\dots\times\Theta_k\]

with each element \(\theta\in\Theta\) being a specific list, and so giving rise to a specific distribution corresponding to this list, which we will call \(P_\theta\). Then we can finally define the *mixture model* as the set of distributions \(\mathcal{P}\) taking the form

\[\mathcal{P}=\{P_\theta:\theta\in\Theta\}\]

Phew!
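The hierarchical recipe above translates directly into a simulator. Here is a minimal sketch for \(k=2\) with normal emission distributions; the particular weights and (mean, sd) parameters are assumptions chosen just for illustration:

```python
import random

# Hypothetical parameters: theta_Y is the mixing distribution P_Y over {1, 2}
# (indexed 0 and 1 here), and each (mean, sd) pair parametrises an emission P_j.
weights = [0.3, 0.7]
emissions = [(0.0, 1.0), (5.0, 2.0)]

def sample_mixture():
    # Step 1: draw the latent label Y according to the mixing distribution P_Y
    Y = random.choices([0, 1], weights=weights)[0]
    # Step 2: draw X from the corresponding emission distribution P_Y
    mu, sigma = emissions[Y]
    X = random.gauss(mu, sigma)
    return Y, X

random.seed(1)
data = [sample_mixture() for _ in range(5)]
# In practice only the X's would be observed; the Y's are latent.
xs = [x for _, x in data]
```

Inference for a mixture model runs this picture in reverse: given only the \(X_i\), we try to learn the weights, the emission parameters, and perhaps the labels themselves.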

**So what’s the point?**

One of the reasons I think mixtures are worth discussing is that they were the first models I can recall meeting for which the object of interest was not necessarily (an aspect of) the overall distribution governing the behaviour of the data we had gathered, but some other related distribution. So even if we have data from a broad population, we can make inferences about the in-group behaviour of members of the sub-populations which make it up.

I think that this is an interesting phenomenon in the context of the sort of sampling bias I discussed at the start of this post, which essentially deals with the opposite problem. In those settings, you are very much directly making inference about the habits of a particular sub-population (those who go swimming) and now want to bridge that into the wider population. Unfortunately, to go back the other way, you would need to know something about the profile of members of the other sub-populations (the other \(P_j\) in the language of the mathematical formulation) and about the proportions of each of the sub-populations within the broader population (which, in the same notation, constitutes knowing \(P_Y\)). But it does mean that our inference about this sub-population isn’t useless in making statements about the population as a whole - just that it isn’t quite enough.

**Weighted/adjusted samples**

In fact, you can make use of the kinds of ideas discussed above, where you know which sub-population a respondent falls into and understand that this correlates strongly with their response, in an approach known as *weighted sampling* or *post-stratification*.

The basic idea is that, if you have a really good understanding of the makeup of the sub-populations already but you know this makeup won’t be well reflected in your sample, and you are able to determine which sub-population a group falls into, then you can combine this pre-existing knowledge with your specific sub-population-based inference to make inferences about a broader population.

**Example: National voting intentions from a student town**

For example, imagine I live in a town which is largely dominated by young people and I’m interested in voting intentions across the country, with a goal of estimating the proportion of the national vote each of a few parties will receive. I might conjecture that much of the difference in voting patterns can be attributed to age. For the sake of simplicity, we will ignore any effects that might come from focusing on just the one geographical area (that is, we assume that if the town’s age makeup matched the country’s, its voting intentions would too), and instead try to account for the town’s age makeup not being reflective of the larger population in which we are interested.

Suppose I go out and gather 100 data points, each of which is a list of the form (Age category, Voting intention). Suppose I have drawn some age boundaries and my categories are "18-24", "25-39", "40-59" and "60+" which I will label categories 1,2,3,4. Imagine respondents tell me they will vote for one of parties A, B or C, so each of my data points looks something like (1,A).

Now imagine my 100 data points contain 70 pieces of data from age category 1, and 10 from each of the other categories. For the 70 pieces of data with age category 1, I have 50 that are (1,A), 15 that are (1,B) and 5 that are (1,C). For the 10 in age category 2, it is split as 6 voting A, 2 voting B and 2 voting C. For the 10 in age category 3, it is split as 4 for A, 3 for B and 3 for C. And for the 10 in age category 4, it is split as 2 for A, 2 for B and 6 for C.

Imagine I were to just look at the voting intentions of my sample without any adjustment. I would note that 62% were voting A, 22% for B and 16% for C, and perhaps I would then conclude that this is likely to be the overall nationwide vote shares for these parties.

But if I know the rough makeup of the country (which is probably something sufficiently well-known and near static that I don't need my own up-to-date survey for it), then I can use this existing data to help with my inference about these nationwide voting intentions. I might then find that, across the country at large, 10% are 18-24 year olds, 20% are 25-39 year olds, 30% are 40-59 year olds and 40% are 60+ (the actual makeup likely skews a bit older, but these proportions are just for the sake of example).

I won't go through all the calculations, but if we instead use our data points to find out about voting intentions *within age groups*, and then use these known population weights for the country at large to extrapolate, we end up with estimated national vote shares of roughly 39.1% for A, 23.1% for B and 37.7% for C. Here we see a very different national picture, where it appears to be a toss up between A and C, who are more popular with younger and older voters respectively, and this was only revealed through the sub-population based analysis we conducted.
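For anyone who does want the calculations, here is the adjustment carried out on the counts and national weights given above: estimate each party's share *within* each age group, then recombine using the known national age proportions rather than the sample's.

```python
# Survey counts from the example: age category -> {party: count}
counts = {
    1: {"A": 50, "B": 15, "C": 5},
    2: {"A": 6,  "B": 2,  "C": 2},
    3: {"A": 4,  "B": 3,  "C": 3},
    4: {"A": 2,  "B": 2,  "C": 6},
}
# Known national age proportions (the mixing weights)
national_weights = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

adjusted = {}
for party in ("A", "B", "C"):
    share = 0.0
    for cat, votes in counts.items():
        group_total = sum(votes.values())
        # within-group share, weighted by the group's national proportion
        share += national_weights[cat] * votes[party] / group_total
    adjusted[party] = share

print({p: round(100 * s, 1) for p, s in adjusted.items()})
# prints {'A': 39.1, 'B': 23.1, 'C': 37.7}
```

The unadjusted sample shares (62%, 22%, 16%) and the adjusted ones differ so dramatically because category 1 makes up 70% of the sample but only 10% of the country.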

As I mentioned, the above is obviously a simplified example, because it is reasonable to think that people of a given age in my hypothetical town might not even think similarly to people of the same age elsewhere, but I think it nicely demonstrates how we might adjust for bias in our sampling, of the kind we are warned about from a young age.

**A final word**

As alluded to previously, mixture models aren't quite what we used in the previous example, since we were working in the opposite direction: rather than looking at samples from an unknown mix of sub-populations and trying to infer both their makeup and the characteristics of each one, we targeted our inference first on the sub-populations themselves, and then used our knowledge of the population's makeup to get a better idea of the overall distribution.

I suppose I just thought this seemed like a nice context in which to bring up mixtures, partly because they're very close to the topic of my own research. But I do think that the basic principle underlying both of these techniques, being that populations may be better understood as the composition of sub-populations each with distinguishing characteristics, is certainly a powerful idea to incorporate into one's statistical inference.