Suppose you have a sample of \(n\) i.i.d. random variables \(X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)\) and you know the value of \(\sigma^2\). We want to test whether the mean \(\mu\) of the distribution is equal to some pre-specified value \(\mu_0\). We can therefore use \(H_0: \mu = \mu_0\) vs. \(H_a: \mu \ne \mu_0\).
To perform this test, you design a test statistic \(T(X_1, \dots, X_n)\), which must satisfy at least the first three of the following four properties. A good candidate test statistic for this test is: \[ T(X_1, \dots, X_n) := \sqrt{n} \frac{\overline X - \mu_0}{\sigma}, \quad \mbox{with} \quad \overline X := \frac{1}{n} \sum_{i=1}^n X_i. \]
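The statistic above is straightforward to compute. A minimal sketch in R, where the sample size, true mean, \(\mu_0\) and \(\sigma\) are illustrative choices:

```r
# Compute T = sqrt(n) * (xbar - mu0) / sigma for a simulated sample.
# All constants (n, mu0, sigma, true mean) are illustrative.
set.seed(42)
n     <- 50
mu0   <- 0      # hypothesized mean under H0
sigma <- 1      # known standard deviation
x     <- rnorm(n, mean = 0.3, sd = sigma)  # data drawn with true mean 0.3

t_stat <- sqrt(n) * (mean(x) - mu0) / sigma
t_stat
```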
Parametric testing. If you designed your test statistic carefully, you might have access to its theoretical distribution when \(H_0\) is true under distributional assumptions about the data. This is called parametric hypothesis testing.
Asymptotic testing. If this is not the case, you can often derive the theoretical distribution of the statistic under the null hypothesis asymptotically, i.e. assuming that you have a large sample (\(n \gg 1\)); this is called asymptotic hypothesis testing.
Bootstrap testing. If you are in a large sample size regime but still have no access to the theoretical distribution of your test statistic, you can approximate this distribution using bootstrapping; this is called bootstrap hypothesis testing.
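The bootstrap idea can be sketched in R for the one-sample mean test. Note that this sketch uses the sample standard deviation in place of a known \(\sigma\) (the unknown-variance variant), and all constants are illustrative:

```r
# Sketch of a bootstrap test for H0: mu = mu0 (two-sided).
# The sample, mu0 and the number of resamples are illustrative.
set.seed(1)
x   <- rnorm(40, mean = 0.5)
mu0 <- 0
t_obs <- sqrt(length(x)) * (mean(x) - mu0) / sd(x)

# Recenter the data so that H0 holds, then resample with replacement
# to approximate the null distribution of the statistic.
x0 <- x - mean(x) + mu0
t_boot <- replicate(2000, {
  xb <- sample(x0, replace = TRUE)
  sqrt(length(xb)) * (mean(xb) - mu0) / sd(xb)
})

# Two-sided bootstrap p-value: proportion of resampled statistics
# at least as extreme as the observed one.
p_boot <- mean(abs(t_boot) >= abs(t_obs))
```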
Permutation testing. If you are in a low sample size regime, then you can approach the distribution of the test statistic using permutations; this is called permutation hypothesis testing.
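Permutation testing is most natural in the two-sample setting, where the null hypothesis of equal distributions makes group labels exchangeable. A minimal sketch with illustrative simulated data:

```r
# Sketch of a permutation test for H0: both samples come from the
# same distribution; sample sizes and effect size are illustrative.
set.seed(2)
x <- rnorm(15)
y <- rnorm(15, mean = 1)
t_obs <- mean(x) - mean(y)

# Under H0, labels are exchangeable: pool the data and relabel at random.
pooled <- c(x, y)
t_perm <- replicate(2000, {
  idx <- sample(length(pooled), length(x))   # random relabeling
  mean(pooled[idx]) - mean(pooled[-idx])
})
p_perm <- mean(abs(t_perm) >= abs(t_obs))    # two-sided p-value
```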
Depending on what you put into the alternative hypothesis \(H_a\), the values of the test statistic that raise suspicion regarding the validity of \(H_0\) might be:

- large positive values only;
- large negative values only;
- large values in absolute terms, i.e. in either direction.
In the first two cases, we say that the test is one-sided. In the last case, we say that the test is two-sided because we interpret large values in both tails as suspicious.
Suppose you have a sample of \(n\) i.i.d. random variables \(X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)\) and you know the value of \(\sigma^2\). We want to test whether the mean \(\mu\) of the distribution is equal to some pre-specified value \(\mu_0\). As we have seen, a good candidate test statistic to look at for performing this test is:
\[ T(X_1, \dots, X_n) := \sqrt{n} \frac{\overline X - \mu_0}{\sigma}, \quad \mbox{with} \quad \overline X := \frac{1}{n} \sum_{i=1}^n X_i. \]
Now, using this test statistic, we might be interested in performing three different tests:
Test 1: \(H_0: \mu = \mu_0\) vs. \(H_a: \mu > \mu_0\). \[ T(X_1, \dots, X_n) := \sqrt{n} \frac{\overline X - \mu_0}{\sigma} \] Remember that we must look for large values of \(T\) that raise suspicion in favor of \(H_a\). Since \(\overline X\) is an unbiased estimator of the true mean:

- Large negative values of \(T\), which occur when the true mean is far below \(\mu_0\), do not raise suspicion in favor of \(H_a\).
- Large positive values of \(T\), which occur when the true mean is far above \(\mu_0\), do raise suspicion in favor of \(H_a\).

Conclusion: we are interested only in the right tail of the null distribution of \(T\) to find evidence to reject \(H_0\).
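Since \(\sigma\) is known and the data are normal, \(T\) follows a standard normal distribution under \(H_0\), so the right-tail cutoff is a standard normal quantile. A minimal sketch with an illustrative observed statistic:

```r
# Under H0, T ~ N(0, 1): the right-tail critical value at level
# alpha is the standard normal quantile of order 1 - alpha.
alpha   <- 0.05
q_right <- qnorm(1 - alpha)   # roughly 1.645

# Reject H0: mu = mu0 in favor of Ha: mu > mu0 when t0 > q_right.
t0 <- 2.1                     # illustrative observed statistic
t0 > q_right                  # TRUE here: reject H0
```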
Test 2: \(H_0: \mu = \mu_0\) vs. \(H_a: \mu < \mu_0\). \[ T(X_1, \dots, X_n) := \sqrt{n} \frac{\overline X - \mu_0}{\sigma} \] Remember that we must look for large values of \(T\) that raise suspicion in favor of \(H_a\). Since \(\overline X\) is an unbiased estimator of the true mean:

- Large negative values of \(T\), which occur when the true mean is far below \(\mu_0\), do raise suspicion in favor of \(H_a\).
- Large positive values of \(T\), which occur when the true mean is far above \(\mu_0\), do not raise suspicion in favor of \(H_a\).

Conclusion: we are interested only in the left tail of the null distribution of \(T\) to find evidence to reject \(H_0\).
Test 3: \(H_0: \mu = \mu_0\) vs. \(H_a: \mu \ne \mu_0\). \[ T(X_1, \dots, X_n) := \sqrt{n} \frac{\overline X - \mu_0}{\sigma} \] Remember that we must look for large values of \(T\) that raise suspicion in favor of \(H_a\). Since \(\overline X\) is an unbiased estimator of the true mean:

- Large negative values of \(T\), which occur when the true mean is far below \(\mu_0\), do raise suspicion in favor of \(H_a\).
- Large positive values of \(T\), which occur when the true mean is far above \(\mu_0\), also raise suspicion in favor of \(H_a\).

Conclusion: we are interested in both the left and the right tails of the null distribution of \(T\) to find evidence to reject \(H_0\).
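For the two-sided case, the level \(\alpha\) is split between the two tails of the standard normal null distribution. A minimal sketch with an illustrative observed statistic:

```r
# Two-sided rule under H0 with T ~ N(0, 1): split alpha between tails.
alpha  <- 0.05
q_low  <- qnorm(alpha / 2)       # about -1.96
q_high <- qnorm(1 - alpha / 2)   # about  1.96

t0 <- -2.3                       # illustrative observed statistic
(t0 < q_low) || (t0 > q_high)    # TRUE here: reject H0
```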
So we now know which tail(s) we should look at, and we need to decide what "large" means. In other words, above which threshold on the possible values of my test statistic should I consider that I can reject \(H_0\)?
Decision | \(H_0\) is true | \(H_a\) is true |
---|---|---|
Do not reject \(H_0\) | Well done | Type II error |
Reject \(H_0\) | Type I error | Well done |
At this point, you can decide that you do not want the probability of making a type I error to exceed a certain level. So you require that \(\mathbb{P}_{H_0}(\mbox{reject } H_0) \le \alpha\), for some upper bound \(\alpha\) on the probability of a type I error. This threshold is called the significance level of the test and is often denoted by the Greek letter \(\alpha\).
Let us now translate what this rule implies in the right-tail alternative case. The event “\(\mbox{reject } H_0\)” translates in this case into \(T > x\) for some threshold \(x\) on the values of the test statistic \(T\). Hence, the rule becomes: \[ \mathbb{P}_{H_0}(T > x) \le \alpha \\ \Leftrightarrow 1 - \mathbb{P}_{H_0}(T \le x) \le \alpha \\ \Leftrightarrow 1 - F_T^{\left(H_0\right)}(x) \le \alpha \\ \Leftrightarrow F_T^{\left(H_0\right)}(x) \ge 1 - \alpha, \] which is satisfied for all \(x \ge q_{1-\alpha}\), where \(q_{1-\alpha}\) is the quantile of order \(1-\alpha\) of the null distribution of \(T\).
Decision-making rule: for the right-tail alternative, reject \(H_0\) at level \(\alpha\) whenever the observed value \(t_0\) of the test statistic satisfies \(t_0 > q_{1-\alpha}\). This guarantees a probability of type I error at most \(\alpha\).
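Since \(T \sim \mathcal{N}(0, 1)\) under \(H_0\) in our running example, the right-tailed rule can be sketched and sanity-checked by simulation; the sample size and number of replications are illustrative choices:

```r
# Sketch: right-tailed z-test decision rule, with a small Monte Carlo
# check that the type I error rate stays near alpha when H0 is true.
z_test_right <- function(x, mu0, sigma, alpha = 0.05) {
  t0 <- sqrt(length(x)) * (mean(x) - mu0) / sigma
  t0 > qnorm(1 - alpha)   # TRUE means "reject H0"
}

set.seed(3)
# Simulate many samples under H0 (true mean equals mu0 = 0).
rejections <- replicate(5000, z_test_right(rnorm(30), mu0 = 0, sigma = 1))
mean(rejections)   # empirical type I error rate, close to 0.05
```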
Definition. The p-value is a scalar value between \(0\) and \(1\) that measures the probability, assuming that the null hypothesis \(H_0\) is true, of observing the data we did observe, or data even more in favor of the alternative hypothesis.
Mathematical expression. If \(t_0\) is the value of the test statistic computed from the observed sample, then: \[ p := \mathbb{P}_{H_0}(T \ge t_0) \quad \mbox{(right-tail)}, \qquad p := \mathbb{P}_{H_0}(T \le t_0) \quad \mbox{(left-tail)}, \qquad p := \mathbb{P}_{H_0}(|T| \ge |t_0|) \quad \mbox{(two-tail)}. \]
Interpretation. If the \(p\)-value is very small, it means that data like ours would be very unlikely if \(H_0\) were true, which provides evidence against the null hypothesis.
Decision-making rule. We can show that rejecting the null hypothesis when \(p \le \alpha\) also produces a decision-making rule that guarantees a probability of type I error at most \(\alpha\).
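In the running example, this equivalence between the p-value rule and the quantile rule can be verified directly; the observed statistic below is illustrative:

```r
# For the right-tailed test under T ~ N(0, 1), the p-value is
# P_{H0}(T >= t0), and rejecting when p <= alpha gives the same
# decision as the quantile-based rule t0 >= qnorm(1 - alpha).
alpha <- 0.05
t0    <- 1.8                   # illustrative observed statistic
p     <- 1 - pnorm(t0)         # right-tail p-value

(p <= alpha) == (t0 >= qnorm(1 - alpha))   # TRUE: same decision
```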
Definition. It is the probability of correctly rejecting the null hypothesis, i.e. of rejecting it when the alternative is in fact correct. It is often denoted by \(1 - \beta\), where \(\beta\) is the probability of a type II error. In terms of events, it is defined by: \[ 1 - \beta := \mathbb{P}_{H_a}(\mbox{Reject } H_0). \]
Usage. The statistical power of a test is an important aspect of the test because:

- it quantifies the ability of the test to detect a departure from \(H_0\) when one truly exists;
- it is the quantity used to determine the sample size required to detect an effect of a given magnitude.
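For the right-tailed z-test of our running example, the power has a closed form, \(1 - \beta = \Phi\!\left(\sqrt{n}\,(\mu - \mu_0)/\sigma - q_{1-\alpha}\right)\), which can be sketched as a function of the true mean; all default values below are illustrative:

```r
# Sketch: power of the right-tailed z-test as a function of the true
# mean mu; the defaults (mu0, sigma, n, alpha) are illustrative.
power_right <- function(mu, mu0 = 0, sigma = 1, n = 30, alpha = 0.05) {
  pnorm(sqrt(n) * (mu - mu0) / sigma - qnorm(1 - alpha))
}

power_right(0)     # equals alpha when H0 is true
power_right(0.5)   # much larger when the true mean is away from mu0
```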
Compliance testing aims at determining whether a process, product, or service complies with the requirements of a specification, technical standard, contract, or regulation.
Here we want to test whether the proportion \(p\) of individuals in a given population who have a feature of interest is equal to a pre-specified rate.
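Base R provides this test directly. A minimal sketch with illustrative counts, testing \(H_0: p = 0.5\):

```r
# Sketch: exact binomial test of H0: p = 0.5 against a two-sided
# alternative; the counts (62 successes out of 100) are illustrative.
res <- binom.test(x = 62, n = 100, p = 0.5, alternative = "two.sided")
res$p.value   # small values cast doubt on H0: p = 0.5
```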
Hypothesis tests for comparing the distributions that generated two independent samples.
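Two common base R choices for this task are Welch's t-test (for means) and the rank-based Wilcoxon test. A minimal sketch with illustrative simulated samples:

```r
# Sketch: comparing two independent samples; sizes and effect size
# are illustrative.
set.seed(4)
x <- rnorm(30, mean = 0)
y <- rnorm(30, mean = 1)

p_t <- t.test(x, y)$p.value        # Welch's t-test (base R default)
p_w <- wilcox.test(x, y)$p.value   # rank-based alternative
```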
Adequacy (goodness-of-fit) tests aim at determining whether the distribution of the observations is consistent with a given probability distribution.
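Two base R examples of such tests, applied to an illustrative simulated sample:

```r
# Sketch: goodness-of-fit checks against a reference distribution;
# the sample is simulated for illustration.
set.seed(5)
x <- rnorm(100)

p_ks <- ks.test(x, "pnorm")$p.value   # Kolmogorov-Smirnov vs N(0, 1)
p_sw <- shapiro.test(x)$p.value       # Shapiro-Wilk test of normality
```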
Data Science with R - aymeric.stamm@cnrs.fr - https://astamm.github.io/data-science-with-r/