In the last couple of posts, I have written a bit about what a *random variable* is, and some things we can say about them. These things are interesting to talk about when we have some sort of readily defined scheme of randomness (or *distribution*), like a roulette wheel or a die, and we are interested in querying properties of the outcome, like the expectation, or the dependence structure of sequences of such outcomes.

The point of these posts was to demonstrate some concepts in *probability*, using what mathematicians tend to call "toy examples" (in which things are set up in a way which is easy to play with and understand, but perhaps not realistic). And although I mentioned towards the end of the first post some things about "modelling" of the "real world," I have been left with the feeling that I am yet to make a precise link between the study of *probability* and the study of *statistics*. So that is what I will (try to) do here.

I might add that what I will write here is perhaps a narrow definition of statistics, one with which some people much more intelligent than myself may take issue, or deem incomplete. But I nevertheless think that, having understood some of the basic language of probability theory, this is a nice way to understand a formulation of statistics. In a nutshell, it might be summarized as "reverse probability".

**A biased coin**

Let me illustrate some statistical notions with a "toy example". Suppose I toss a coin, but that it is weighted unusually so that the probability of seeing a head is some unknown number (between 0 and 1) rather than the usual \(\frac{1}{2}\). Let's suppose I'm interested in what this chance of seeing a head is, and that I am able to toss the coin 10 times before I need to make some sort of decision, maybe whether to place a bet on a particular toss being heads. Now, if I see 7 heads and 3 tails, I might note that 70% of the sample came up heads. But I don't *really* care about what the results of those ten coin tosses were. Rather, I am using their outcomes to understand something about the *data generating process*, by which I mean the random process from which my data (the coin tosses) was "generated". That way, I can either understand something about a broader population (what is the long run proportion of heads if I toss the coin lots of times?) or make predictions (how sure can I be that a future coin toss, perhaps one I bet on, will come up heads?).
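This toy experiment is easy to play with in code. Here is a minimal sketch in Python; the variable `true_p` stands in for the unknown bias (a value I have picked purely for illustration), which in a real statistical problem we would of course never get to see.

```python
import random

random.seed(42)  # fix the seed so the "experiment" is repeatable

true_p = 0.7  # hypothetical true bias: in practice this is the unknown we're after
n = 10

# Simulate n tosses: 1 for heads, 0 for tails
tosses = [1 if random.random() < true_p else 0 for _ in range(n)]
sample_proportion = sum(tosses) / n
print(sample_proportion)
```

The printed sample proportion is a summary of what we actually saw, and, as discussed above, the interesting move is treating it as evidence about `true_p` rather than caring about the ten tosses themselves.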

**"Backwards" probability**

The way we can formalise the above example into something more generally applicable is as follows. Suppose I am interested in understanding the "randomness profile" (distribution) \(P\) of some kind of random process. Suppose now that I have data, all generated from this process \(P\), and let's say that the data points are independent. So, in the language introduced earlier, we will call our data points \(X_1,\dots,X_n\) (so \(n=10\) in the coin toss example). Then we could write down the statement

\[X_1,\dots,X_n \stackrel{iid}{\sim} P\]

Now in probability theory, we might ask "For a certain choice of \(P\), how might our data points \(X_1,X_2,\dots\) look?" In this instance, we might think of \(P\) as somehow being the known rules of some game of chance, like tossing a fair die or the profile of winnings from a standard roulette wheel. This discussion of how our data points might look could involve, for example, computing the mean, or the mode (being the outcome of highest probability).

But in statistics, we *don't know \(P\)*, but we already have some of these "data points" or "samples" or "draws from \(P\)" \(X_1,X_2,\dots\) and we want to sort of *reverse engineer* what \(P\) could be, or some aspects of \(P\). For example, because we know that the sample average will eventually recover the mean of \(P\) (by the law of large numbers, discussed here), it is reasonable for us to think that taking the sample mean gives a good *estimate* for the mean of the distribution.
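The law-of-large-numbers intuition is easy to see by simulation. The sketch below (again with a made-up hidden bias `true_p`) shows the sample mean settling down near the true mean of \(P\) as the number of draws grows.

```python
import random

random.seed(0)
true_p = 0.3  # hypothetical unknown bias of the process

def sample_mean(n):
    # Average of n draws from the process; each draw is 1 with probability true_p
    return sum(random.random() < true_p for _ in range(n)) / n

# As n grows, the sample mean settles down near true_p (law of large numbers)
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```

With only 10 draws the sample mean can be quite far from `true_p`, but by 100,000 draws it is typically very close, which is exactly why the sample mean is a sensible estimate of the distribution's mean.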

Typically what we will do is propose a set of "candidate" distributions which might provide an explanation for the data we have seen, and then build these *estimates* for (aspects of) the distribution out of the data. In particular, these *estimates* should only depend on what we're actually able to see. In the next sections, I will formalise these notions somewhat.

**Statistical Models**

In statistics, a *model* is a set of candidate distributions (or *data generating processes*) from which we believe the data to be generated. Formally, we can write a model as a set \(\mathcal{P}\) containing distributions as elements. Often, the models are described by something simpler called a *parameter*, which is usually just a number (or list of numbers), in which case the model can be thought of as the set of candidate parameters, each parameter describing a distribution. Formally, we could write \[\mathcal{P}=\{P_\theta:\theta\in\Theta\}\] where \(\Theta\) is the set of parameters denoted \(\theta\), and \(P_\theta\) is the distribution which has parameter \(\theta\). In this way, we can think about \(\mathcal{P}\) as being the same as \(\Theta\), because we can exactly match up each distribution \(P\) (or \(P_\theta\)) in \(\mathcal{P}\) with a parameter \(\theta\in\Theta\).

Let me illustrate this with the coin toss example. We don't know what the probability, call it \(\theta\), of seeing a head, really is. But we do know that it has to be between \(0\) and \(1\). So our possible distributions, which we could call \(P_\theta\), are "Heads with probability \(\theta\), Tails otherwise" and our parameter set \(\Theta\) containing all the possible \(\theta\) values is just "All numbers between zero and one." So although our model is really all of the distributions that say "Heads with probability \(\theta\), Tails otherwise", we can just as well think of it as just being characterised by this \(\theta\) and think of the model as being all of the possible \(\theta\), i.e. \(\Theta\).

It is a common (and often questionable!) assumption to make that the set \(\mathcal{P}\) of candidate distributions (or \(\Theta\), if we are using the formulation above involving parameters) contains the *true data generating process*, or equivalently the *true parameter*. In the coin toss example with the parameter family described above, this assumption clearly holds, but that is really a consequence of the fact that the data, being results of coin tosses, is so simple. If the data was instead "time between positive COVID-19 test and admission to ICU", then specifying a model for that would be a lot more difficult, and we might not have come up with a distribution in \(\mathcal{P}\) that captures this complex process. Typically, we might choose a very large \(\mathcal{P}\), or develop estimation methods that are somehow "robust" to this "misspecification" of \(\mathcal{P}\), but a mathematical discussion of these things is beyond the scope of this post!

Once we have specified a model, which for certain choices of \(\mathcal{P}\) (including that for our coin toss example) can be thought of equivalently as having specified the parameter set \(\Theta\), we can then try to *estimate* (aspects of) the distribution, or the parameter of the distribution, which gave rise to the data we saw.

**Parameter Estimation**

So we have written down a statistical model \(\mathcal{P}\), and we have now seen some data which (hopefully) came from one of the distributions \(P_0\) living in our big collection \(\mathcal{P}\). As a word on notation, typically \(P_0\) or \(P^*\) will be used to denote the "true" distribution, being the distribution which actually generated the data, so our assumption when writing down \(\mathcal{P}\) is that \(P_0\in\mathcal{P}\) (meaning \(P_0\) is in the set \(\mathcal{P}\)). But of course, we don't actually know what \(P_0\) is, just like in the coin toss we don't know the true bias the coin has. And, for fear of any ambiguity, \(P_0\) is (a bit confusingly) a compact notation for \(P_{\theta_0}\) when there is a parameter \(\theta_0\) involved, rather than the specific instance of \(P_\theta\) when \(\theta=0\).

An *estimate*, loosely speaking, is a number which I construct from my data points \(X_1,\dots,X_n\). This means that it can vary depending on what values are taken on by \(X_1,\dots,X_n\), but it should *not* vary based on things we don't know. For example, suppose my coin has a bias such that the probability of seeing a head is 0.75. Suppose now I toss it three times and get three tails in a row. Because I am imagining that I don't actually know the bias, my estimate can only be based on the outcomes of the coin tosses I've seen! Imagine my friend also had a biased coin, but hers (unknown to her) comes up heads with probability 0.3, and that she also happened to observe three tails in a row. Then my estimate based on my data would have to be the same as my estimate based on her data, because I can't incorporate things which I can't see, like the true bias.

We tend to think of estimates intuitively as being "reasonable guesses" for something, like a reasonable guess for the probability of seeing a head if I toss the coin again. But according to the definition above, as long as I construct a number only based on things I can actually see (i.e. the data), I could very well come up with a really stupid estimator! For example, after one hundred coin tosses, I could insist on estimating that there's a 100% chance each toss gives whichever face I saw more often (even if I saw the other face quite a lot too!). So, if I see 60 heads and 40 tails, I could "estimate" that the coin is so biased towards heads that seeing a tail is impossible! This is clearly a very silly estimate, as it is incompatible with what we've seen, but it does fit the definition of "a number I came up with based on the data." And in the basic mathematical definition, this is all an estimator is.
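To drive home that "an estimator is just a rule applied to the data", here is the silly rule above written out as a function (my own naming; the rule itself is exactly the one described in the text). It is a perfectly valid estimator in the formal sense, despite being a terrible guess.

```python
def silly_estimate(tosses):
    # "Estimate" 100% heads if heads appeared at least as often as tails, else 0%
    heads = sum(tosses)
    tails = len(tosses) - heads
    return 1.0 if heads >= tails else 0.0

# 60 heads and 40 tails: the rule declares tails impossible
print(silly_estimate([1] * 60 + [0] * 40))  # 1.0
```

Nothing in the definition of an estimator rules this out; it depends only on the observed tosses. The problem is with how *well* it guesses, which is a separate question.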

So we are happy that an estimate is just some number I've come up with based on the data. Unfortunately this isn't an especially helpful definition by itself, so it is also worth talking about what might constitute a "reasonable" estimate.

One final note - because the estimate is based on the data, estimates will be themselves random variables. If my rule is based on which side of the coin came up most, for example, then it is subject to the "same randomness" as the coin, and so what my estimate ends up being is itself random. If it wasn't random, it would never change based on what data I saw, which would be a bit odd.
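This randomness of the estimate itself is also easy to see by simulation. In the sketch below (with an illustrative hidden bias of 0.75, and function names of my own choosing), running the whole ten-toss experiment several times gives a different value of the estimate each time.

```python
import random

random.seed(1)
true_p = 0.75  # hypothetical bias, hidden from the estimator

def proportion_of_heads(n=10):
    # One run of the whole experiment: toss n times, return the sample proportion
    tosses = [random.random() < true_p for _ in range(n)]
    return sum(tosses) / n

# Repeating the experiment gives (generally) different estimates:
# the estimate is itself a random variable
estimates = [proportion_of_heads() for _ in range(5)]
print(estimates)
```

Each entry of `estimates` is one realisation of the random variable \(T(X_1,\dots,X_{10})\); the spread across entries is exactly the randomness the paragraph above describes.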

**Unbiased Estimates**

When I was studying for my A Levels, I was certainly given the impression that the holy grail of estimation was the *unbiased* estimate. Loosely speaking, these are estimates which are "correct on average". A bit more formally, and adopting some of the language discussed in this post, it means that my estimator, which as outlined above is random in that it is based on randomly drawn data, has expectation equal to the thing it is trying to estimate.

I will write this more formally for those who appreciate the notation. If we have data \(X_1,\dots,X_n\stackrel{iid}{\sim}P_0\), we can write an estimator, formally, as a function \(T=T(X_1,\dots,X_n)\). This just means that its values vary depending on the data, and \(T\) is just an arbitrary labelling of this function. If the estimate is for the true parameter \(\theta_0\) governing the behaviour of the distribution \(P_0\) (e.g. an estimate for the probability \(\theta_0\) of seeing a head which characterises the randomness profile "Heads with probability \(\theta_0\), Tails otherwise") then we say that \(T\) is *unbiased* if

\[E_0[T(X_1,\dots,X_n)]=\theta_0\]

where \(E_0\) means "take expectations according to \(X_1,\dots,X_n\) having randomness dictated by \(P_0\)".
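We can check this unbiasedness property numerically for the sample-proportion estimator: averaging the estimator over many repeated experiments approximates \(E_0[T]\), which should land close to \(\theta_0\). (The value of `theta0` below is, as before, just an illustrative choice.)

```python
import random

random.seed(2)
theta0 = 0.6      # hypothetical true parameter
n, reps = 10, 100_000

def sample_mean_estimate():
    # T(X_1, ..., X_n): the proportion of heads in n tosses
    return sum(random.random() < theta0 for _ in range(n)) / n

# Averaging the estimator over many repeated experiments approximates E_0[T]
avg = sum(sample_mean_estimate() for _ in range(reps)) / reps
print(avg)  # should land close to theta0
```

Any single run of the estimator may be well off \(\theta_0\) (with \(n=10\) tosses it can only take values in tenths), but the long-run average over repetitions sits essentially on top of it: that is what "correct on average" means.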

This seems like a pretty good criterion, because on average these estimates are correct. However, it's not necessarily the sole criterion we want to use, as we will see below.

**A "better" biased estimate?**

Suppose I have a machine which spits out either a zero or a one. If we like, we can think about the zero representing a tail and the one representing a head, but I wanted to make things concrete with numbers, so that it makes sense to talk about expectations. This machine assigns some unknown weight to spitting out a one, and so its randomness is governed by the distribution we will call \(P_0\), being "Machine spits out 1 with probability \(\theta_0\), 0 otherwise" where \(\theta_0\) is the "true parameter" lying somewhere between 0 and 1 (or 0% and 100%).

Let me write my usual \(X_1,\dots,X_n\stackrel{iid}{\sim}P_0\), this time with the \(X\) random variables being either 0 or 1 according to what the machine spits out. I come along with my rule \(T\) for estimating \(\theta_0\), and my friend comes along with her rule \(\tilde{T}\) for estimating the same thing. So we are both interested in finding out the chance that the next thing it spits out is a 1.

Imagine that \(n=100\) and I decide that my rule, based on \(X_1,\dots,X_{100}\), will be "\(T=X_1\)". It certainly only depends on the data I have seen (not all of it, but certainly not anything beyond it). I can also work out that

\[E_0[T] = 1\times\theta_0+0\times(1-\theta_0)=\theta_0\]

Great, unbiased! Now my friend has a rule which is "\(\tilde{T}=\frac{0.5+X_1+\dots+X_{100}}{101}.\)" We can compute the expectation of this rule (try it!), and show that it is

\[E_0[\tilde{T}] = \frac{100}{101}\times\theta_0+\frac{1}{101}\times\frac{1}{2} \]

This is only unbiased if \(\theta_0=\frac{1}{2}\); otherwise \(\tilde{T}\) is *biased*. My estimator \(T\) was unbiased, so it's correct on average, but hers (in general) is not! Which do you prefer?
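One way to make the comparison concrete is to simulate both rules and compare their average squared error (a standard measure, though not the only one). The sketch below uses an illustrative `theta0` of 0.75; the point it makes does not depend on that particular choice.

```python
import random

random.seed(3)
theta0 = 0.75     # hypothetical true parameter
n, reps = 100, 20_000

def mse(estimator):
    # Average squared error of the estimator over many simulated datasets
    total = 0.0
    for _ in range(reps):
        xs = [1 if random.random() < theta0 else 0 for _ in range(n)]
        total += (estimator(xs) - theta0) ** 2
    return total / reps

def T(xs):
    return xs[0]                      # unbiased, but ignores almost all the data

def T_tilde(xs):
    return (0.5 + sum(xs)) / (n + 1)  # slightly biased, but far less variable

print(mse(T), mse(T_tilde))  # T_tilde's error is much smaller despite its bias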

**So what is statistics, anyway?**

I hope what I've written here has given you a basic introduction to the fascinating subject of statistics, and I hope the most recent estimation example has given you something to ponder. There are big questions which come out of what I wrote above, such as "What makes a good \(\mathcal{P}\) or \(\Theta\)" or "What really makes a 'good' estimate \(T\)?" but those things will have to be left for another time.

There is much more to talk about, such as the "Frequentist" and "Bayesian" paradigms for estimation (and prediction), "Parametric" and "Nonparametric" models, and the wide range of sophisticated methods for dealing with the sort of data we get in the 21st century. I don't think this post has made much progress at all in answering the question I posed, and it was probably very classical in its framing, but hopefully it has exposed you somewhat to the mathematical underpinnings of the kinds of problems statisticians might face.