Human sex ratio The earliest use of statistical hypothesis testing is generally credited to the question of whether male and female births are equally likely (null hypothesis), which was addressed in the 1700s by
John Arbuthnot (1710), and later by
Pierre-Simon Laplace (1770s). Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710, and applied the
sign test, a simple
non-parametric test. In every year, the number of males born in London exceeded the number of females. Considering more male or more female births as equally likely, the probability of the observed outcome is 0.5^{82}, or about 1 in 4,836,000,000,000,000,000,000,000; in modern terms, this is the
p-value. Arbuthnot concluded that this is too small to be due to chance and must instead be due to divine providence: "From whence it follows, that it is Art, not Chance, that governs." In modern terms, he rejected the null hypothesis of equally likely male and female births at the
p = 1/2^{82} significance level. Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls.
Lady tasting tea Dr.
Muriel Bristol, a colleague of Fisher, claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was of her getting that number correct purely by chance. The null hypothesis was that the Lady had no such ability. The test statistic was a simple count of the number of successes in selecting the four cups. The critical region was the single case of 4 successes out of 4 possible, based on a conventional probability criterion (< 5%).
Courtroom trial In a criminal trial there are two hypotheses, H_0: "the defendant is not guilty", and H_1: "the defendant is guilty". The first one, H_0, is called the
null hypothesis. The second one, H_1, is called the
alternative hypothesis. It is the alternative hypothesis that one hopes to support. The hypothesis of innocence is rejected only when an error is very unlikely, because one does not want to convict an innocent defendant. Such an error is called an
error of the first kind (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, an
error of the second kind (acquitting a person who committed the crime), is more common. A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold ("beyond a reasonable doubt"). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.
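The two historical tests described above reduce to short exact calculations. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that each year in Arbuthnot's data is an independent fair coin flip under the null hypothesis, and that the Lady simply picks four of the eight cups at random.

```python
from fractions import Fraction
from math import comb


def sign_test_all_one_direction(n_years: int) -> Fraction:
    """Probability that every one of n_years shows a male excess,
    assuming male and female excesses are equally likely each year."""
    return Fraction(1, 2) ** n_years


def tea_all_correct(cups: int = 8, milk_first: int = 4) -> Fraction:
    """Probability of picking exactly the right milk_first cups out of
    cups by pure guessing (one favourable selection among all of them)."""
    return Fraction(1, comb(cups, milk_first))


arbuthnot_p = sign_test_all_one_direction(82)  # Arbuthnot: 82 years of records
tea_p = tea_all_correct()                      # Fisher: 8 cups, 4 of each kind

print(1 / arbuthnot_p)  # the "1 in N" odds, N = 2**82
print(tea_p)            # prints 1/70
```

Under these assumptions, a perfect score in the tea experiment has probability 1/70, or roughly 1.4%, which falls below a 5% criterion.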
Clairvoyant card game A person (the subject) is tested for
clairvoyance. They are shown the back face of a randomly chosen playing card 25 times and asked which of the four
suits it belongs to. The number of hits, or correct answers, is called
X. As we try to find evidence of their clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is: the person is (more or less) clairvoyant. If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly
p. The hypotheses, then, are:
• null hypothesis: H_0: p = \tfrac 14 (just guessing)
• alternative hypothesis: H_1: p > \tfrac 14 (true clairvoyant)
When the test subject correctly predicts all 25 cards, we will consider them clairvoyant, and reject the null hypothesis. The same holds for 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider them so. But what about 12 hits, or 17 hits? What is the critical number,
c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value
c? With the choice
c=25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with
c=10. In the first case almost no test subjects will be recognized to be clairvoyant, in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind – a
false positive, or Type I error. With
c = 25 the probability of such an error is:
P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X = 25\mid p=\frac 14\right)=\left(\frac 14\right)^{25}\approx10^{-15},
and hence very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times. Being less critical, with
c = 10, gives:
P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X \ge 10 \mid p=\frac 14\right) = \sum_{k=10}^{25}P\left(X=k\mid p=\frac 14\right) = \sum_{k=10}^{25} \binom{25}{k}\left( 1- \frac 14\right)^{25-k} \left(\frac 14\right)^k \approx 0.0713.
Thus,
c = 10 yields a much greater probability of false positive. Before the test is actually performed, the maximum acceptable probability of a Type I error (
α) is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type I error rate, the critical value
c is calculated. For example, if we select an error rate of 1%,
c is calculated thus:
P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X \ge c\mid p=\frac 14\right) \le 0.01.
Of all the numbers c with this property, we choose the smallest, in order to minimize the probability of a Type II error, a
false negative. For the above example, we select: c=13.
Consider instead a subject who gets none of the 25 cards right. Under the null hypothesis, the probability of this is
P(X=0 \mid H_0 \text{ is valid}) = P\left(X = 0\mid p=\frac 14\right) = \left(1-\frac 14\right)^{25} \approx 0.00075.
This is highly unlikely (less than a 1 in 1000 chance). Because the subject did not guess the cards correctly, rejecting H_0 in favour of H_1 would be an error. Rather, the result would suggest that the subject tends to avoid calling the correct card. A test of this could be formulated: at a 1% error rate, the subject would have to answer correctly at least twice for us to retain the hypothesis that the card calling is based purely on guessing.
== Variations and sub-classes ==