Statistical inference: Assumptions and methods

I have on a few occasions mentioned statistical inference on this blog, by which I mean the process through which we make inferential statements about the distribution which gave rise to our data, based on the observed data. I formulated this in some detail in this post and here I also introduced the notion of an estimator. In today's post, I wish to elaborate a little more on the process by which we might conduct inference, and to distinguish between two notions: assumptions and methods. To do so, I will first introduce the terms "frequentist" and "Bayesian."

Likelihood

In this post, I introduced and contrasted the notions of probability mass functions and probability density functions. Despite their differences, both of them give some measure of the probability of seeing a data point, either at that point (in the case of a pmf) or around that point (in the case of a [continuous] density).

Suppose we have a statistical model \(\mathcal{P}\) of “candidate distributions” for our data points. For simplicity, we will assume that each distribution is associated to a real number \(\theta\in\mathbb{R}\). A concrete example would be the collection of normal distributions with known variance (say equal to one) and unknown mean \(\theta\).

The idea of likelihood is that we then ask “Under each of my distributions associated to \(\theta\), what is the “likelihood” that this data would have been seen?” For simplicity, we will take just one data point \(X\). Then the probability (density) of seeing \(X\) under the distribution associated to \(\theta\) (which we will call \(P_\theta\)) is just

\[p_\theta(X)\]

where \(p_\theta\) is the pmf or pdf associated to the distribution \(P_\theta\). This object is called the likelihood. In fact, it is just the probability mass function or probability density function associated to \(\theta\), evaluated at \(X\), but we are thinking of it as a function of \(\theta\) for a fixed \(X\), rather than the other way round.

Concretely, if we imagine a function \(f:\Theta\times\mathcal{X}\longrightarrow\mathbb{R}^+\) for which \(f(\theta,x)=p_\theta(x)\), then the pmf/pdf for the distribution \(P_{\theta^*}\) is just the function \(f(\theta^*,\cdot):\mathcal{X}\longrightarrow\mathbb{R}^+\) and the likelihood at a point \(x^*\) is just the function \(f(\cdot,x^*):\Theta\longrightarrow\mathbb{R}^+\).
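To make this concrete, here is a minimal sketch in Python (my own illustration, using the running normal-with-known-variance example) of such a function \(f\) and its two "slices": the pdf \(f(\theta^*,\cdot)\) for a fixed parameter, and the likelihood \(f(\cdot,x^*)\) for a fixed data point.

```python
import numpy as np
from scipy.stats import norm

# f(theta, x): density of N(theta, 1) evaluated at x
def f(theta, x):
    return norm.pdf(x, loc=theta, scale=1.0)

# Slice 1: fix the parameter theta* = 0 -> the pdf as a function of x
theta_star = 0.0
xs = np.linspace(-3, 3, 7)
pdf_slice = f(theta_star, xs)          # f(theta*, .)

# Slice 2: fix the data point x* = 1.5 -> the likelihood as a function of theta
x_star = 1.5
thetas = np.linspace(-3, 3, 7)
likelihood_slice = f(thetas, x_star)   # f(., x*)

print("pdf  f(0, x)      :", np.round(pdf_slice, 3))
print("lik  f(theta, 1.5):", np.round(likelihood_slice, 3))
```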

Naturally, after gathering our data \(X\), we might favour parameters \(\theta\) for which the likelihood is higher. The most obvious way to do this is to pick the value of \(\theta\) for which the likelihood is greatest, so our guess is “The distribution from which this data is most likely to have been drawn.” We will take a look at this approach in the next section.

Frequentist framework and maximum likelihood estimation

The frequentist "approach" to statistical inference assumes that a true distribution (or true parameter) \(P_*\) (or \(\theta^*\)) is responsible for the generation of our data, and that we should accordingly try to infer what the value of this true distribution or parameter is. Such inference might lead us to make point estimates (“I think this is \(\theta^*\)”) or produce confidence sets (“I am 95% confident that the true parameter \(\theta^*\) is contained in this set).

A popular method of inference is maximum likelihood estimation, whereby we do the fairly obvious thing with the likelihood we introduced in the previous section, and maximise it. Under certain assumptions known as regularity conditions, which guarantee our model “behaves nicely” from a mathematical point of view, we can give a frequentist justification for the associated techniques. By this, I mean that under the frequentist assumption that a “true” parameter exists and generates our data, we can make statements about how the maximum likelihood estimator, which is a random quantity depending on our data, behaves. One such mathematical statement is the asymptotic normality of the MLE.
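To make this a little more concrete, here is a minimal sketch (continuing the normal-with-known-variance example, with simulated data) which maximises the log-likelihood numerically and checks that the answer agrees with the sample mean, the closed-form maximum likelihood estimator in this model.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_true = 2.0                      # plays the role of the "true" parameter theta*
X = rng.normal(loc=theta_true, scale=1.0, size=100)

# Log-likelihood of theta given the data (sum of log densities)
def log_likelihood(theta):
    return np.sum(norm.logpdf(X, loc=theta, scale=1.0))

# Maximise the log-likelihood (equivalently, minimise its negative)
result = minimize_scalar(lambda t: -log_likelihood(t), bounds=(-10, 10), method="bounded")

print("numerical MLE :", result.x)
print("sample mean   :", X.mean())   # closed-form MLE for the normal mean
```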

However, I think it might be a stretch to say that maximum likelihood estimation is a frequentist method. Perhaps the difference is irrelevant, but I prefer to think of the frequentist framework as just being one where we make an assumption about the data generating process, rather than one attached to particular methods. So we can take a method, such as maximum likelihood, and say that it is justifiable (in a suitable sense) under the frequentist assumption. I suppose it is up to you whether that makes it a “frequentist method,” although perhaps by the end of the post you will agree with me.

Bayesian framework: prior and posterior

In my mind, the Bayesian framework is as much about methods as it is about assumptions, although it certainly has a foot in both camps. Let me first talk about what Bayesians “do” before I make the distinction.

In the Bayesian approach to statistics, we place a prior distribution \(\Pi\) on our unknown parameter. Mathematically, this means that our parameter \(\theta\) (or distribution \(P_\theta\)) is modelled as a random variable. The usual idea (at least in simple models) is that the prior distribution encodes “prior knowledge” that you have about the data generating process, for example expert knowledge or information from past analyses of the procedure. We then use the data to “update” our prior beliefs: this gives us a new distribution called the posterior distribution, which is actually quite a complicated object, being a data-dependent (and hence random) distribution over the parameters. So I suppose, given that the parameters are proxies for distributions, this makes it some kind of distribution over distributions over distributions - a bit of a mouthful. But intuitively, it tells us how we “should” update our prior beliefs (encoded by \(\Pi\)) about the parameter \(\theta\). This update is due to Bayes' formula, which tells us that for any “events” (which we think of as “random outcomes”), which I will call \(A\) and \(B\), we have

\[ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \]

What this ends up looking like is

\[ \Pi(\theta|X) = \frac{p_{\theta}(X)\Pi(\theta)}{m(X)} \]

where, roughly speaking, \(A\) is the event “The random parameter takes on the value \(\theta\)” and \(B\) is the event “The data takes the value \(X\).” The factor \(p_{\theta}(X)\) is just the likelihood that we had before, which once again plays a role in shaping our view of the parameter, this time pushing the posterior distribution to weight more highly those values with a high likelihood. The quantity \(m(X)\) is often ignored because it is just a “normalising constant,” but it is known as the model evidence, and tells us how likely our data is to have arisen from the process in which the parameter \(\theta\) is random with distribution \(\Pi\). What I mean by a normalising constant is that its value has no impact on our inference about \(\theta\) (once we have decided on our prior), but it has to be there in order to make the posterior a valid distribution, being either a pmf with weights summing to one, or a pdf with total enclosed area equal to one.
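To see the formula in action, here is a minimal sketch assuming a discrete (grid) prior over a handful of candidate \(\theta\) values and a single data point from the normal-with-known-variance model; the model evidence \(m(X)\) appears explicitly as the sum which normalises the posterior.

```python
import numpy as np
from scipy.stats import norm

# A discrete prior Pi over a grid of candidate parameter values
thetas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
prior = np.full(len(thetas), 1 / len(thetas))   # uniform prior weights

X = 0.7                                         # a single observed data point

# Likelihood p_theta(X) for each candidate theta (normal with variance 1)
likelihood = norm.pdf(X, loc=thetas, scale=1.0)

# Model evidence m(X): the prior-weighted average of the likelihoods
evidence = np.sum(likelihood * prior)

# Posterior via Bayes' formula: Pi(theta | X) = p_theta(X) Pi(theta) / m(X)
posterior = likelihood * prior / evidence

print("posterior weights:", np.round(posterior, 3))
print("sums to one      :", posterior.sum())
```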

Bayesian methods then revolve around this posterior distribution, with statements of uncertainty and “most likely” parameters baked in. So a common “point estimate,” where we pick the single parameter which best explains our data, might be the posterior mode (the point of highest pmf/pdf) or the posterior mean (the expectation of \(\theta\) under the posterior distribution). Since we have access to a whole distribution, we are also able to quote “sets of high posterior probability,” being regions of the parameter space \(\Theta\) in which the posterior distribution indicates \(\theta\) is likely to land. For instance, if our posterior distribution ended up being a normal distribution with mean zero and variance one, describing our posterior beliefs about a parameter \(\theta\in\Theta=\mathbb{R}\), then such a set of high posterior probability might be the interval \([-2,2]\) (which would have posterior probability or “mass” a little over 0.95).
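Here is a small sketch of how such summaries might be read off in practice, using the standard normal posterior from the example above for the interval \([-2,2]\), and some made-up grid posterior weights for the mode and mean.

```python
import numpy as np
from scipy.stats import norm

# If the posterior for theta were N(0, 1), the interval [-2, 2] would carry
# a little over 95% of the posterior mass:
mass = norm.cdf(2.0) - norm.cdf(-2.0)
print("posterior mass of [-2, 2]:", round(mass, 4))    # ~0.9545

# For a discrete (grid) posterior, point estimates can be read off directly.
thetas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
posterior = np.array([0.02, 0.18, 0.45, 0.30, 0.05])   # hypothetical posterior weights

posterior_mode = thetas[np.argmax(posterior)]          # value with highest posterior weight
posterior_mean = np.sum(thetas * posterior)            # expectation under the posterior
print("posterior mode:", posterior_mode)
print("posterior mean:", posterior_mean)
```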

In my view, the Bayesian approach necessitates methods based on this posterior distribution, and I would describe methods which don’t do this as “non-Bayesian.” However, a “Bayesian assumption,” which might be described as an “honest belief that the parameter is randomly generated by \(\Pi\),” (which stands in contrast to the frequentist assumption of a fixed, “true” parameter) is a bit different.

I should note here that such an honest belief might not really be based on the idea that our parameter is "randomly generated" per se, but may rather reflect the use of a probability distribution as a means to model epistemic uncertainty, which is a type of uncertainty about something one could in principle know, but doesn't. An example of this type of uncertainty might arise when discussing the "probability that the millionth digit of (the irrational number) \(\pi\) is a 3." My mathematics teacher at secondary school was very much a frequentist in this regard: he said that such a probability was meaningless, as this digit has a specific value and can't be anything else, hence it is not a random variable. Others argue that it would be totally reasonable to reflect the uncertainty in what you or I know about this digit by modelling it as a random variable with a distribution \(\Pi\), say one which has equal chance of being any digit from 0 to 9, even though they understand that it does actually take on a particular one of those values.

In any case, you could certainly use a Bayesian method without “believing” in the modelling of a parameter as a random variable just as you might use maximum likelihood even if you’re not a frequentist (which is why I was hesitant to describe it as a “frequentist method”). Although you would at some stage have to tell a sort of white lie that the parameter is a random variable, you could simply view this as a particular algorithm to generate estimates, with these estimates then being examined based on their performance under a frequentist assumption.
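As a sketch of this viewpoint, assuming the standard conjugate setup of a \(N(0,1)\) prior on the mean of a normal with known variance one (in which case the posterior mean has a simple closed form), one can treat the posterior mean purely as an estimator and examine how it behaves across repeated datasets generated from a fixed "true" parameter.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 1.5        # frequentist "true" parameter generating the data
n = 50                  # sample size per dataset
n_repeats = 1000        # number of simulated datasets

estimates = []
for _ in range(n_repeats):
    X = rng.normal(loc=theta_star, scale=1.0, size=n)
    # Posterior mean under a N(0, 1) prior and N(theta, 1) likelihood:
    # it shrinks the sample mean slightly towards the prior mean of zero.
    posterior_mean = n * X.mean() / (n + 1)
    estimates.append(posterior_mean)

estimates = np.array(estimates)
print("average estimate over repeats:", estimates.mean())   # close to theta_star
print("spread of the estimates      :", estimates.std())
```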

Bayesian and Frequentist: A false dichotomy?

I think at some stage of my mathematical study I had a notion that we needed to somehow choose between "Bayesian" and "frequentist," but is this really the case? I don’t think so, precisely because I think that you can use Bayesian methods without a Bayesian assumption, as described in the preceding paragraph. In fact, frequentist analysis of Bayesian methods is an active area of statistical research and its existence rather flies in the face of the idea that the two are mutually exclusive.

It is only really the “honest Bayesian”, who makes the assumption regarding parameter generation from their favourite prior (or the suitability of the prior for modelling epistemic uncertainty), rather than the “fixed, unknown truth” assumption, who necessarily contrasts with the frequentist. Frequentist analysis of Bayesian methods may nevertheless be of interest to such statisticians: if two Bayesians' methods each behave well under a frequentist assumption, then in a certain sense their conclusions will be compatible with one another, and even those who eschew the frequentist assumption might think it reasonable that other "honest Bayesians" quantify their prior uncertainty slightly differently.

Assumptions and Methods

I think, really, the dichotomies are firstly between varying assumptions (where we might have frequentists and “honest Bayesians”), and then separately between our chosen methods (which could be split into “Bayes” and “non-Bayes”). I think this way of thinking about certain notions in statistics can be extended a bit further, and I will remark briefly on this distinction, in a different context, in what follows.

A parametric or nonparametric assumption is one about the statistical model being finite- or infinite-dimensional (where I mean vector space dimension in the sense of this post) respectively. You then have methods which are suitable for each respective assumption, which might be categorised as parametric or nonparametric methods. But using a parametric method under a nonparametric assumption might actually be an interesting thing to do, and similarly we might want to analyse the performance of a nonparametric procedure when the frequentist true parameter is actually simpler (and may be contained in a particular parametric family).
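As a toy illustration, assuming data drawn from an exponential distribution (so that the truth lies outside the normal family), one can still fit a normal by maximum likelihood - a parametric method used when only a nonparametric assumption really holds - and compare it with a simple nonparametric alternative such as a kernel density estimate.

```python
import numpy as np
from scipy.stats import norm, expon, gaussian_kde

rng = np.random.default_rng(2)
X = rng.exponential(scale=1.0, size=500)    # data from outside the normal family

# Parametric method: fit a normal distribution by maximum likelihood
mu_hat, sigma_hat = X.mean(), X.std()

# Nonparametric method: a kernel density estimate makes no finite-dimensional assumption
kde = gaussian_kde(X)

# Compare the two fits (and the truth) at a few points
grid = np.array([0.1, 0.5, 1.0, 2.0, 4.0])
print("true density     :", np.round(expon.pdf(grid), 3))
print("normal (param.)  :", np.round(norm.pdf(grid, loc=mu_hat, scale=sigma_hat), 3))
print("KDE (nonparam.)  :", np.round(kde(grid), 3))
```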

This distinction can even be made in fully nonparametric settings. For example, an adaptive method is one which works well under a whole family of different assumptions, performing (roughly) as well as a method tailored to any particular assumption within this family. This is a fairly advanced concept which pops up in the frequentist study of Bayesian nonparametric procedures, but it is often seen as a desirable theoretical property of a method - I am sure I will talk about it at length at some stage.

I hope I have convinced you of the importance of distinguishing between assumptions and methods whenever we perform statistical inference. It may well be that certain assumptions or methods are incompatible with one another, but this distinction may also encourage us to think more carefully about the intersection between two areas which we might otherwise perceive as disjoint.

Dan Moss
