# Formal Hypothesis Testing

## Introduction
Here we consider, in greater detail, the potential perils of defining hypotheses to test after examining and plotting the data. Abstractly, consider the case where an analyst, for an arbitrary multivariate dataset, does the following:

1) plots the data
2) examines the data to identify interesting features
3) manually chooses a statistical procedure to test whether said features are significant

Below we argue that such an approach rejects null hypotheses far more often than the nominal significance level allows.
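
To make this claim concrete, here is a minimal simulation sketch. It is not taken from the text: it assumes a hypothetical dataset of ten independent standard normal features (so every null hypothesis of zero mean is true), an analyst who keeps the most extreme-looking feature, and a two-sided z-test at level $0.05$. The function name and its parameters are illustrative.

```python
import random
from statistics import NormalDist


def cherry_picked_rejection_rate(n_features=10, alpha=0.05, trials=20_000, seed=0):
    """Rejection rate when the tested feature is picked after looking at the data."""
    rng = random.Random(seed)
    # Two-sided z-test: reject when |value| exceeds the (1 - alpha/2) normal quantile.
    threshold = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Assumed data: one pure-noise observation per feature, so every null is true.
        features = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        # Steps 1 and 2: "look" at the data and keep the most striking feature.
        most_striking = max(features, key=abs)
        # Step 3: test only that feature, as if it had been chosen in advance.
        rejections += abs(most_striking) > threshold
    return rejections / trials


rate = cherry_picked_rejection_rate()
print(f"rejection rate with a hand-picked feature: {rate:.3f}")
```

Had the feature been chosen before seeing the data, the rejection rate would be about $0.05$. Under the assumptions above, keeping the most extreme of ten independent features raises it to about $1 - 0.95^{10} \approx 0.40$.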

## Testing hand-picked hypotheses
One problem is that the truth value of a hypothesis must be a fixed real number, not a random variable. To see this, consider a procedure for testing a hypothesis, and define:
$$\mathcal{H} = \left \{ \begin{array} {rl} 1 & \mbox{if the null hypothesis is true}, \\ 0 & \mbox{if it is false}. \\ \end{array} \right.$$
Let $\mathbf{X}$ be the random vector we observe and let $G(\mathbf{X})$ be the test statistic (see Upton and Cook (2003, page 165) for the vocabulary of hypothesis testing). For a given $\alpha \in [0,1]$, let $\Theta_{\alpha}$ be the rejection region of this test at level $\alpha$. We then have:
$$\left( \mathcal{H} = 1 \right) \Rightarrow \left( \mathbb{P}\left[ G(\mathbf{X}) \in \Theta_{\alpha} \right] = \alpha\right).$$

A random variable $A$ is a (measurable) function from a set $\Omega$ to $\mathbb{R}$; see, for example, Ash and Doléans-Dade (2000, page 173). When we observe $A$, we actually observe $A(\omega)$, where $\omega$ is an unknown element of $\Omega$. That is why we call $\mathbf{X}(\omega)$ the observed value of $\mathbf{X}$ and $G(\mathbf{X}(\omega))$ the observed test statistic.

If we observe $G(\mathbf{X}(\omega)) \in \Theta_{\alpha}$ and $\alpha$ is small, then either the null hypothesis is true and we have been very unlucky (we reject it incorrectly), or
$$\mathbb{P}\left[ G(\mathbf{X}) \in \Theta_{\alpha} \right] \neq \alpha,$$
in which case, by the contrapositive of the implication above, $\mathcal{H} \neq 1$ and we reject the null hypothesis correctly.
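
The guarantee behind this reasoning can be checked numerically. The sketch below uses a hypothetical example that does not appear in the text: under the null hypothesis $\mathbf{X}$ is a single standard normal draw, the test statistic is $G(\mathbf{X}) = \mathbf{X}$, and $\Theta_{\alpha}$ is the two-sided region beyond the $1 - \alpha/2$ normal quantile. The test is fixed before any data are drawn.

```python
import random
from statistics import NormalDist


def rejection_rate(alpha=0.05, trials=100_000, seed=0):
    """Share of null-distributed draws whose statistic lands in the rejection region."""
    rng = random.Random(seed)
    # Rejection region, fixed in advance: |G(X)| > (1 - alpha/2) normal quantile.
    threshold = NormalDist().inv_cdf(1 - alpha / 2)
    # Every draw comes from the null distribution, so each rejection is a false one.
    rejections = sum(abs(rng.gauss(0.0, 1.0)) > threshold for _ in range(trials))
    return rejections / trials


rate = rejection_rate()
print(f"empirical rejection rate under the null: {rate:.3f}")
```

With a pre-specified test, the empirical rate should come out close to the chosen $\alpha$, which is what makes a rejection informative.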

## Testing hypotheses after looking at the data
Assume now that we have several statistical tests available, and hence several null hypotheses $H^{(1)}_0, \ldots, H^{(n)}_0$. We redefine $\mathcal{H}$ using the $i$th hypothesis, where the index $i$ is fixed before the data are observed:
$$\mathcal{H} = \left \{ \begin{array} {rl} 1 & \mbox{if } H^{(i)}_0 \mbox{ is true}, \\ 0 & \mbox{if } H^{(i)}_0 \mbox{ is false}. \\ \end{array} \right.$$
The reasoning presented above still holds. However, choosing a statistical test after looking at the data amounts to using $\mathcal{H}^{\star}(\omega)$ in place of $\mathcal{H}$, with
$$\mathcal{H}^{\star}(\omega) = \left \{ \begin{array} {rl} 1 & \mbox{if } H^{(j(\mathbf{X}(\omega)))}_0 \mbox{ is true}, \\ 0 & \mbox{if } H^{(j(\mathbf{X}(\omega)))}_0 \mbox{ is false}. \\ \end{array} \right.$$
Since the index $j(\mathbf{X}(\omega))$ is a function of $\omega$, and provided that $j$ is measurable, $j(\mathbf{X})$ is a random variable and so is $\mathcal{H}^{\star}$. However, the framework of hypothesis testing requires $\mathcal{H}$ to be a fixed real number. One may argue that a real number can be seen as a (constant) random variable, but we usually cannot prove that $\mathcal{H}^{\star}$ falls into that special case. An example in which choosing the test from the data breaks the guarantee above is presented in the next section.

## Example
Let $\mathbf{X}$ be a single random variable drawn from a continuous uniform distribution. We write $\mathbf{X} \sim \mathcal{U}(a,b)$ and we want to know whether $a = 0$ and $b = 10$. We have $n = 10$ statistical tests, indexed by $i = 1, \ldots, 10$, and all of them share the same null hypothesis:
$$H^{(i)}_0\ :\ a = 0\ \mbox{and}\ b = 10.$$
For every test $i$, the test statistic is
$$G^{(i)}(\mathbf{X}) = \mathbf{X},$$
and the rejection region is
$$\Theta^{(i)}_{\alpha} = [ i - 0.5 - 5 \alpha, i - 0.5 + 5 \alpha ).$$
For example, for $\alpha = 0.05$ and $i = 1$, we have a rejection region of $[0.25, 0.75)$, and our test statistic will (naturally) fall in this range $5$ percent of the time under the null hypothesis.
We can verify that for all $i$, provided $\alpha \leq 0.1$,
$$\left( H^{(i)}_0 \mbox{ is true} \right) \Rightarrow \left( \mathbb{P}\left[G^{(i)}(\mathbf{X}) \in \Theta^{(i)}_{\alpha} \right] = \alpha \right).$$
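
As a quick check of this implication, here is a short derivation. It assumes $1 \leq i \leq 10$ and $0 \leq \alpha \leq 0.1$, so that $\Theta^{(i)}_{\alpha}$ lies inside $[0, 10]$. Under $H^{(i)}_0$, $\mathbf{X}$ has density $1/10$ on $[0, 10]$, hence
$$\mathbb{P}\left[ G^{(i)}(\mathbf{X}) \in \Theta^{(i)}_{\alpha} \right] = \int_{i - 0.5 - 5\alpha}^{i - 0.5 + 5\alpha} \frac{1}{10}\, \mathrm{d}x = \frac{10 \alpha}{10} = \alpha.$$
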
Let us choose $\alpha = 0.1$. If we choose a test after looking at the data, and if we really want to reject the null hypothesis, we will choose the test such that our observation is in the rejection region. We then define
$$j(\mathbf{X}) = \mathbb{I}_{[0,10)}(\mathbf{X}) \lfloor \mathbf{X} \rfloor + 1,$$
so that for example:
$$\Theta^{(j(3))}_{\alpha} = [3, 4),\ \Theta^{(j(7.6))}_{\alpha} = [7, 8), \ \Theta^{(j(54.9))}_{\alpha} = [0, 1),$$
since $\alpha = 0.1$. We then observe that for all $\omega \in \Omega$ such that $\mathbf{X}(\omega) \in [0, 10)$, we have
$$G^{(j(\mathbf{X}(\omega)))}(\mathbf{X}(\omega)) = \mathbf{X}(\omega) \in \Theta^{(j(\mathbf{X}(\omega)))}_{\alpha},$$
and we reject $H^{(j(\mathbf{X}(\omega)))}_0$. This means that if $\mathbf{X} \sim \mathcal{U}(0, 10)$, the probability of rejecting $H^{(j(\mathbf{X}))}_0$ is equal to $1$ (the event $\mathbf{X} = 10$ has probability zero). As a consequence, if we treat $\mathcal{H}^{\star}$ as though the reasoning made for $\mathcal{H}$ still applied, then when $a = 0$ and $b = 10$ we reject the hypothesis that $a = 0$ and $b = 10$ with probability $1$ instead of $\alpha = 0.1$, and the testing procedure is useless.
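
The example can be reproduced with a short simulation. The sketch below is only an illustration: the function names, the choice of test $4$ as the pre-specified test, the seed, and the number of trials are assumptions made here, while the rejection regions and the index function $j$ follow the definitions above.

```python
import math
import random


def rejection_region(i, alpha):
    """Bounds (lo, hi) of the half-open region [i - 0.5 - 5a, i - 0.5 + 5a)."""
    half_width = 5 * alpha
    return (i - 0.5 - half_width, i - 0.5 + half_width)


def rejects(x, i, alpha):
    """True when the statistic G(x) = x falls in the rejection region of test i."""
    lo, hi = rejection_region(i, alpha)
    return lo <= x < hi


def select_index(x):
    """j(x) = 1_{[0,10)}(x) * floor(x) + 1: the test chosen after seeing x."""
    inside = 1 if 0 <= x < 10 else 0
    return inside * math.floor(x) + 1


def rejection_rates(alpha=0.1, fixed_index=4, trials=100_000, seed=0):
    """Rejection rates under the null for a fixed test and a data-chosen test."""
    rng = random.Random(seed)
    fixed = selected = 0
    for _ in range(trials):
        x = rng.uniform(0.0, 10.0)  # the null hypothesis a = 0, b = 10 is true
        fixed += rejects(x, fixed_index, alpha)          # test chosen in advance
        selected += rejects(x, select_index(x), alpha)   # test chosen from the data
    return fixed / trials, selected / trials


# The three worked values of j from the text, with alpha = 0.1.
for x in (3, 7.6, 54.9):
    lo, hi = rejection_region(select_index(x), 0.1)
    print(f"x = {x}: j(x) = {select_index(x)}, region = [{lo:g}, {hi:g})")

fixed_rate, selected_rate = rejection_rates()
print(f"fixed test: {fixed_rate:.3f}, data-chosen test: {selected_rate:.3f}")
```

The pre-specified test should reject at a rate close to $\alpha = 0.1$, whereas the data-chosen test rejects on essentially every draw even though the null hypothesis is true.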

## Conclusion
I recommend that statisticians build their procedure for determining whether a model is appropriate for a dataset before receiving that dataset. If they already have the dataset, I recommend asking another person to create the testing procedure, telling that person only what the model is.

## References

R. B. Ash and C. Doléans-Dade. Probability and Measure Theory. Harcourt/Academic Press, Burlington, United States of America, 2nd edition, 2000.

G. Upton and I. Cook. A Dictionary of Statistics. Oxford University Press, Oxford, United Kingdom, 2nd edition, 2004.