A first look at hypothesis testing

Much of the statistics we end up meeting at university involves inference about some parameter of a model, wherein we try to figure out what process generated our data. This might pop up in a so-called "regression" problem where we seek to establish a relationship between some dependent variable (or response/output/label) \(Y\) and some independent variables (or covariates/inputs/features) \(X=(X_1,\dots,X_p)\). Then we might be able to make statements like "We are 95% confident that the average relationship between \(X\) and \(Y\) looks like this."

Two closely related tasks in statistics are prediction and (hypothesis) testing. In the former case, we might ask "If these are my covariates, what is the probability that my response will look like this?" In the latter case, we might ask questions such as, "Do I think that this covariate matters or not?" and it is this setting which I would like to discuss today.

Null, alternative and errors

When we formulate a testing problem, we typically compare a null hypothesis \(H_0\) against an alternative hypothesis \(H_1\). In this formulation, we intuitively think of \(H_0\) as the thing we would initially presume, and \(H_1\) as an alternative which we perhaps think less likely from the outset, or which we don't want as our starting assumption for some reason. In a sense, \(H_0\) and \(H_1\) are just different labels for the two hypotheses, but the usual framework for conducting tests introduces the kind of asymmetry I'm describing, which I will come back to later. For example in a court case where we have presumption of innocence until proven guilty, we would certainly set \(H_0\) to be "innocent" and \(H_1\) to be "guilty."

Each hypothesis will provide some (set of) possible process(es) by which the data we see was generated. This data will have a different distribution depending on whether \(H_0\) is true. For example, we might think that under \(H_0\) it is pretty unlikely that a defendant's DNA was found on someone's coat, while that might be a pretty likely outcome under \(H_1\). Thus the evidence in this case, or the data more generally, will have associated distributions \(P_0\) and \(P_1\) modelling the likelihood of different scenarios under the hypotheses \(H_0\) and \(H_1\) respectively.

Since we will use the data to guide our decision between \(H_0\) and \(H_1\), this decision is subject to (at least) the same randomness as is found in the data, and so it makes sense to talk about the probability of deciding on one hypothesis or the other. Since the distribution of the data depends on which hypothesis is true, we can talk about each of these probabilities under each of the distributions associated to \(H_0\) or \(H_1\). Mathematically speaking, we write \(P_j(H_i)\) for "the probability of accepting \(H_i\) when \(H_j\) is true," where \(i,j\in\{0,1\}\) (meaning the subscripts \(i\) and \(j\) take values in the set \(\{0,1\}\), which indexes our hypotheses).

Then Type I and Type II errors correspond to particular instances of these probabilities where \(i\) and \(j\) do not match, in which case we have accepted the wrong hypothesis. The Type I error is \(P_0(H_1)\), the probability of falsely rejecting \(H_0\) (meaning that our evidence was gathered in a situation where the defendant was innocent, but it led us to a guilty verdict). The Type II error is \(P_1(H_0)\), the probability of falsely accepting \(H_0\). Note that under both \(P_0\) and \(P_1\) we end up picking either \(H_0\) or \(H_1\) (this framework doesn't provide a mechanism to reject both, and often rejecting both wouldn't make sense), and so \(P_0(H_0)+P_0(H_1)=1\) and \(P_1(H_0)+P_1(H_1)=1\). This does not, however, mean that \(P_0(H_0)+P_1(H_0)=1\) (or the analogous statement for \(H_1\)) - consider, for instance, a rigged court that always finds the defendant guilty regardless of what actually happened.
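
To see the rigged-court example in symbols: such a court accepts \(H_1\) no matter what, so

\[P_0(H_1)=P_1(H_1)=1 \qquad\text{and hence}\qquad P_0(H_0)+P_1(H_0)=0\neq 1,\]

even though \(P_0(H_0)+P_0(H_1)\) and \(P_1(H_0)+P_1(H_1)\) are each still equal to one.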

Size, power and asymmetry in testing

The Type I error \(P_0(H_1)\) is often called the size of the test and denoted with the letter \(\alpha\). The probability \(P_1(H_1)=1-P_1(H_0)\) is called the power of the test and denoted \(\beta\), with the fact that this is equal to one minus the Type II error being a consequence of what we discussed above. These definitions already nod towards the asymmetry we will see later.

We should note that we do have a certain amount of freedom in controlling these errors. As a silly example, we could imagine a situation where we always accept the null hypothesis. This would make the Type I error zero at the expense of the Type II error being 100%. The question of having a good test is not just about controlling one error or the other - we just saw that this would be trivial - but rather about having a "good" control of the errors together.

To make our notation precise, we introduce a "test" \(\psi\), which is a function of the data (i.e. it depends only on things we can observe, not on which hypothesis is actually true, which we can't see) and which spits out either a 0 or a 1. We would then choose \(H_0\) if it's 0 and \(H_1\) if it's 1. The problem of hypothesis testing is then the problem of choosing a suitable \(\psi\). So my silly example above corresponds to having \(\psi=0\) all the time, which we didn't think was a very good idea. In this framework, \(P_i(H_j)=P_i(\psi=j)\). This is just a way of making explicit the mechanism \(\psi\) by which we decide which hypothesis to choose; then, if we want to compare mechanisms, we have a notation which distinguishes between one test \(\psi\) and a different test \(\psi'\).
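
To make the notation less abstract, here is a minimal sketch in Python. Everything numerical in it is invented purely for illustration (a single observation which is \(N(0,1)\) under \(H_0\) and \(N(2,1)\) under \(H_1\), and an arbitrary threshold of 1.5); the point is just that a test is a function of the data, and its two error probabilities can be estimated by simulating under \(P_0\) and \(P_1\) separately:

```python
import numpy as np

rng = np.random.default_rng(0)

# Entirely made-up setting for illustration: the data is a single number,
# distributed as N(0, 1) under H_0 and N(2, 1) under H_1.
def psi_threshold(x):
    return 1 if x > 1.5 else 0   # an arbitrary threshold test

def psi_silly(x):
    return 0                     # the "always accept H_0" test

def estimate_errors(psi, n_sims=100_000):
    x0 = rng.normal(0.0, 1.0, n_sims)             # data generated under H_0
    x1 = rng.normal(2.0, 1.0, n_sims)             # data generated under H_1
    type_1 = np.mean([psi(x) for x in x0])        # estimate of P_0(psi = 1)
    type_2 = np.mean([psi(x) == 0 for x in x1])   # estimate of P_1(psi = 0)
    return type_1, type_2

print(estimate_errors(psi_threshold))   # roughly (0.07, 0.31)
print(estimate_errors(psi_silly))       # exactly (0.0, 1.0)
```

The silly test, as promised, has zero Type I error and a hopeless Type II error.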

The question then remains as to a suitable criterion for doing our hypothesis testing, which amounts to a suitable criterion for choosing \(\psi\). This is where the asymmetry tends to come in, although there is no reason it really has to. We might imagine that a sensible criterion would be to choose the \(\psi\) which minimises the sum of the Type I and Type II errors. Mathematically, that would mean

\[P_1(\psi=0)+P_0(\psi=1) = \min_{\psi'}\{P_1(\psi'=0)+P_0(\psi'=1)\}.\]

However, it is more standard (at least when the topic is first introduced in a statistics course) to first control the Type I error (or size) to a desired level, and then, among the tests which achieve this, choose the one with the lowest Type II error (or equivalently, the highest power). This would mean first specifying some threshold level \(\alpha\) (which tends to be put at 5%, but this is by no means always the right value) and then choosing, among all tests which have size at most \(\alpha\) (which would include my silly test), the test which has the highest power. Mathematically, this amounts to choosing a test \(\psi\) for which

\[P_1(\psi=0) = \min_{\psi':P_0(\psi'=1)\leq\alpha}\{P_1(\psi'=0)\}.\]

There is a bit of a subtlety here in that I allow tests to have size strictly less than \(\alpha\) as well as exactly equal to \(\alpha\). Intuitively, a lower size is better, so if it turns out that my most powerful test happens to have size lower than \(\alpha\) then that's the best one. However, once we are below the threshold \(\alpha\), we don't allow any further trade-off between \(\alpha\) and \(\beta\); if we did want to allow that trade-off, the optimisation problem would instead be

\[P_1(\psi=0)+P_0(\psi=1) = \min_{\psi':P_0(\psi'=1)\leq\alpha}\{P_1(\psi'=0)+P_0(\psi'=1)\}.\]

This would allow us to pick a \(\psi\) where we get a size a fair bit less than \(\alpha\) in return for a minor reduction in power, and maybe this is a reasonable criterion to use as well.

It is these last two formulations (the former being the typical one) which introduce the asymmetry between the hypotheses. This is because my silly test where I always accept \(H_0\) would actually be preferable to a test in which both the Type I and Type II errors sit at 6% (for \(\alpha=5\%\)). In fact, it would even be better than one where the Type I error was 5.000001% and the Type II error was zero - neither of those would be a viable solution. Sometimes we want this asymmetry, but it is certainly not clear that we always do.

Remark on "optimal" tests (+Exercise for reader)

I may be wrong, but I think that the solution to the optimisation problem without trade-off (the first asymmetric one) is always attained with size exactly equal to \(\alpha\), assuming we have access to an external random number generator. If we have some "size budget" remaining, we can always add further randomisation to our test which uses it up in return for increased power. For instance, imagine \(\alpha=0.05\) but my proposed optimiser had size 0.04. I could generate a random number between 1 and 1000 independently of all the other data, and then if I get 1000 I just reject \(H_0\) in favour of \(H_1\), regardless of my data.

This makes my test a bit silly in that this extra step doesn't use my data, but it has the effect of trading off the Type I and Type II errors. If we pick a certain number \(n\) instead of 1000 (maybe \(n\) would need to be non-integer, in which case a slightly more involved scheme might be necessary) then we could gain power until we reached the threshold \(\alpha\). I am pretty sure \(n\) wouldn't actually be 100 for this; I think it would be slightly less than that, because 4% of the time we're rejecting anyway, so the extra randomness doesn't increase the rejection probability in that case (so perhaps it needs to be 96?). I invite the reader to check what \(n\) should be as an exercise and see if I'm right (which I may well not be).
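
If you would rather check numerically than by hand, here is a quick sketch of one way to do it (the 4% base size and 5% target are just the numbers from my example above; nothing else is assumed):

```python
# Reject if the original size-0.04 test rejects, or (independently) if a
# uniform draw from {1, ..., n} comes up equal to n.
def combined_size(n, base_size=0.04):
    return base_size + (1 - base_size) / n

# Solving base_size + (1 - base_size)/n = alpha for n:
alpha = 0.05
n = (1 - 0.04) / (alpha - 0.04)

print(n, combined_size(n))   # compare with the guess in the paragraph above
```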

Case study: Testing the mean of a normal distribution

I'd like to demonstrate these concepts with a discussion of a typical hypothesis testing problem faced by A Level maths students, where implicitly we use the asymmetric criterion to choose our test. We suppose that we have data gathered from a normal distribution with unit variance and unknown mean \(\theta\) - that is, that \(X_1,\dots,X_n\sim N(\theta,1)\). The null hypothesis will be \(H_0:\theta=0\) (meaning that \(P_0\) is "Normal distribution with mean zero, variance 1") and the alternative hypothesis will be \(H_1:\theta > 0\).

As an aside, the reader may notice at this point that defining an "alternative distribution" \(P_1\) is a bit hard if the mean could be "anything positive." This doesn't affect the definition of size, but when defining Type II error, or power, we have to adjust for the fact that we now have a whole collection \(\mathcal{P}_1\) of candidate distributions which are valid under \(H_1\), namely "Normal distributions with mean greater than zero." We can define the Type II error to be the worst-case error over this family, and we can define a power function which tells us the power as we vary \(P_1\) in the family \(\mathcal{P}_1\). This formulation means that the corresponding notions of Type II error and power are not related quite as straightforwardly as in the scenario we discussed, and that our corresponding optimality criterion is then based on both the worst-case Type II error and this power function. However, the intuition is broadly the same as in the previous section, and so the reader will not lose anything by not worrying about these details, which I will not make precise.

I am going to make my life a bit easier and just consider the case \(n=1\) (and write \(X=X_1\)) to demonstrate the concepts, although in practice you might be rather hesitant to make claims based on one data point. I will note this, however - regardless of sample size, the approach where we first control size and then optimise for power is no more "risky" for the null hypothesis with one data point (as you might have thought) - it's just less powerful. In other words, all the gain when you have more samples is realised in the power, and so you are always making valid statements concerning Type I errors.

Ok, so we have our data point \(X\sim N(\theta,1)\) and we want to devise a test. The way I introduced these tests might have been a bit mysterious, but what it often boils down to is cooking up some kind of region, which we will call \(C\), and then setting \(\psi=1\) if \(X\) lies in \(C\) and \(\psi=0\) if it doesn't (and going for \(H_1\) or \(H_0\) accordingly). Sometimes \(C\) is called the critical region for the test.

Remember that we need to make sure we're not falsely rejecting the null more than a proportion \(\alpha\) of the time. Thus, under the null (in this case \(N(0,1)\)), we can't put more than \(\alpha\) of the total probability inside our critical region. So something like this wouldn't work if \(\alpha=5\%\):

Proposed critical region \(C\) is \(-1<X<1\). The probability of this region under the null is shaded in red and is greater than 0.05 so the associated test isn't suitable.

So we're looking for regions where the associated red area accounts for less than 5% of the total area under the blue curve. So something like this might be better:

Proposed critical region \(C\) is \(-0.0627<X<0.0627\). The corresponding red region, whose area gives the probability of \(C\) under the null, has area \(\alpha\).

Unfortunately, this test, while having an appropriate size, lacks power. It's not as bad as the "always accept \(H_0\)" test, as the values lying in the region \(C\) specified here can arise from (a distribution associated to) the alternative hypothesis \(H_1\), but it's still not very good.
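To put a number on "not very good", here is a quick check with scipy; the true mean \(\theta=2\) is an arbitrary choice of alternative, made only for illustration:

```python
from scipy.stats import norm

# Probability of the central region C = (-0.0627, 0.0627) under the null N(0, 1):
size = norm.cdf(0.0627) - norm.cdf(-0.0627)

# The same probability if the true mean were theta = 2 (one arbitrary member
# of the alternative family), i.e. the power of this test against that mean:
theta = 2.0
power = norm.cdf(0.0627 - theta) - norm.cdf(-0.0627 - theta)

print(size, power)   # size is about 0.05, but the power is well under 1%
```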

It turns out that the most powerful choice of \(\psi\) (in a suitable sense) is the one where we stuff all 5% of the area into the far end of the positive axis. I've pictured that one below. Intuitively, these values are the ones which most strongly point towards positive values of the mean of \(X\), and since these are the values our alternative concerns, this procedure has good power. So our problem is solved by taking \(\psi\) to be one when \(X\) is in the red region, and zero otherwise. What region do you think would provide the worst power for this testing problem?

"Most powerful" choice of \(\psi\) (for \(\alpha=0.05\)). We choose \(H_1\) (or reject \(H_0\)) roughly when \(X>1.645\)

However, this is by no means the only way to think of extreme or critical values, rather it is simply the "optimal" way in a certain sense, described (roughly) by the minimisation problem formulated in the previous section.

Typically, questions give students a value for \(X\) from the get-go (or have them work out a value for \(\bar{X}_n\), the sample mean, when they have multiple observations) and ask whether the result is significant at a certain threshold. I remember being asked to compute something like the "probability of seeing something this extreme," and this always bothered me, because even under a normal distribution with mean zero, seeing a data point \(X\in(-0.000001,0.000001)\) could certainly be considered "extreme." Really, what these questions are asking is whether the result would land in the critical region of a test of optimal power with a given size. So if our data satisfies \(X>1.645\) and we are "optimally" testing (at the 5% level), we would reject the null. Students also often have to decide whether a test should be "one-tailed" (like the one above) or "two-tailed", and I think this is really a question about what the optimal test is in a suitable sense - even if both of them would define tests with the correct size.
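
In this one-sided setting, "this extreme" really means "this far in the direction the alternative cares about", and checking whether the observed value lands in the optimal critical region is the same as comparing the upper-tail probability (the p-value) with \(\alpha\). A small sketch, with the observed value 1.7 chosen arbitrarily:

```python
from scipy.stats import norm

alpha = 0.05
x_obs = 1.7   # a hypothetical observed data point

reject_by_region = x_obs > norm.ppf(1 - alpha)   # is X in the critical region X > 1.645?
p_value = 1 - norm.cdf(x_obs)                    # P_0(X >= x_obs), the upper-tail probability
reject_by_pvalue = p_value < alpha

print(reject_by_region, p_value, reject_by_pvalue)   # the two checks give the same decision
```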

Is this how we should be doing hypothesis testing?

I never much liked hypothesis testing in this formulation. I always found the asymmetry a bit contrived, and I felt that I didn't really have much flexibility in how this asymmetry was chosen. Perhaps I want to allow a bit of trade-off between size and power but weight one hypothesis more strongly - must I construct an ad-hoc optimality criterion in every instance?

It turns out that there is a rather different framework for testing, called Bayesian hypothesis testing, which is really just a consequence of the Bayesian framework for doing statistics as a whole - a framework which is fascinating and somehow hasn't made it into a blog post yet. I will certainly devote a whole post to the Bayesian framework in the not too distant future, but the way of testing in that framework is, from my perspective, much more natural in a lot of instances. It allows you to weight your hypotheses precisely before seeing your data, which gives you a neat way of incorporating asymmetry if it is desired, and it also provides a probabilistic framework in which to do these tests. So if this whole post felt a bit off, maybe you'll prefer my post on that subject.

Dan Moss

DPhil Student at Oxford/StatML CDT. Interested in maths, stats, veganism and current affairs. Pronouns: He/him
Oxford, United Kingdom