Chapter 10
Hypothesis testing
Puzzle 1
According to Paul Zak, the hormone oxytocin increases trust in humans. You do an experiment in which groups of people are exposed to oxytocin or not while meeting a stranger. They rate how much they trusted the stranger from 0 (no trust) to 10 (complete trust). What are the null and alternative hypotheses?

Null hypothesis: People’s ratings of trust towards strangers are identical regardless of whether they were exposed to oxytocin or not.

Alternative hypothesis: Compared to people who are not exposed to oxytocin, people who are exposed to oxytocin will give a higher rating of trust towards strangers.
Puzzle 2
In a study based on ## Puzzle 1, trust ratings were 8, 95% CI [6, 10], in the oxytocin group and 6, 95%CI [4.5, 7.5], in the nonoxytocin group. What is the value of the MOE? Are trust levels likely to be significantly different in the two groups using an αlevel of 0.05?
The confidence interval for the oxytocin group ranges from 6 to 10 so has a length of 4 and an MOE of half this value (i.e., 2). For the nonoxytocin group, it ranges from 4.5 to 7.5, which is a length of 3 and an MOE of 1.5. The average MOE is, therefore, (2 + 1.5)/2 = 1.75. Moderate overlap, which would indicate a significant difference between groups at p = 0.05, would be half of this value (i.e., 0.88). We can calculate the size of the overlap of the confidence intervals by subtracting the lower limit of the confidence interval for the oxytocin group (6) from the upper limit of the confidence interval for the nonoxytocin group (7.5), which gives us 1.5. The moderate overlap (0.88) tells us the maximum amount of overlap that would yield a significant difference (assuming a 0.05 Type I error rate); therefore, an overlap less than this indicates that p would be less than 0.05 (i.e., a difference deemed more significant than p < 0.05), whereas an overlap greater than this value indicates that p is greater than 0.05 (which is typically interpreted as nonsignificant). In this example, therefore, because the overlap of 1.5 is greater than 0.88 (the overlap for p = 0.05), it suggests that there is more overlap than you would get for p < 0.05; therefore, the oxytocin and nonoxytocin groups are unlikely to be significantly different from each other using an αlevel of 0.05.
Puzzle 3
What is the difference between a Type I and Type II error?
A Type I error is when you believe that there is a genuine effect in the population, when in fact there is not. A Type II error is the opposite: when you believe that there is no effect in the population when, in reality, there is.
Puzzle 4
What are the major problems with null hypothesis significance testing?
 It doesn’t allow you to draw any firm conclusions about the null hypotheses.
 Significance (the pvalue) is affected by the sample size.
 It encourages allornothing thinking.
 It can be influenced by the intentions of the scientist.
Puzzle 5
What effect does sample size have on statistical significance?
The significance of a test statistic is directly linked to the sample size: the same effect will have different pvalues in differentsized samples: small differences can be deemed ‘significant’ in large samples, and large effects might be deemed ‘nonsignificant’ in small samples.
Puzzle 6
What are the arguments for not using onetailed tests?
If you do a onetailed test and the results turn out to be in the opposite direction to what you predicted you must ignore them and accept the null hypothesis. If you don’t do this, then you have done a twotailed test using a different level of significance from the one you set out to use, which is cheating; therefore, onetailed tests temp people to cheat. Also, if you do a twotailed test and find that your p is 0.06, then you would conclude that your results were not significant (because 0.06 is bigger than the critical value of 0.05). Had you done this test onetailed, however, the p you would have got would be half of the twotailed value (0.03). This onetailed value would be significant at the conventional level (because 0.03 is less than 0.05). Therefore, a scientist who finds a twotailed p that is just nonsignificant might be tempted to pretend that they had always intended to do a onetailed test because the ‘onetailed’ pvalue is significant.
Puzzle 7
The manager of The Reality Enigma tells Zach that they will sell more albums if they include a free gift with the album than if they don’t. What are the null hypothesis and alternative hypothesis? Would you use a one or twotailed test to evaluate these hypotheses?
 Null hypothesis: Album sales are identical regardless of whether or not you give away a free gift.
 Alternative hypothesis: More albums are sold when accompanied by a free gift.
This hypothesis is directional, since it states that more albums will be sold if a free gift is given. Importantly, a result in the opposite direction to the expected direction would result in the same action as a nonsignificant result (a free gift would not be given away with the album), therefore you could use a onetailed test to evaluate this hypothesis.
Puzzle 8
In Puzzle 7, Zach sold 62, 95% CI [52, 72], albums when he gave away a free gift and 53, 95% CI [50, 56], when he didn’t. Are these likely to be statistically significantly different using a criterion of p = 0.05?
The confidence interval for the freegift group ranges from 52 to 72 so has a length of 20 and an MOE of half this value (i.e., 10). For the nofreegift group, it ranges from 50 to 56, which is a distance of 6 and an MOE of 3. The average MOE is, therefore (10 + 3)/2 = 6.5. Moderate overlap, which would indicate a significant difference between groups at p = 0.05, would be half of this value (i.e., 3.25). We can calculate the size of the overlap of the confidence intervals by subtracting the lower limit of the confidence interval for the freegift group (52) from the upper limit of the confidence interval for the nofreegift group (56), which gives us 4. This value is slightly greater than 3.25, suggesting that the freegift and nofreegift groups are unlikely to be significantly different using a criterion of p = 0.05.
Puzzle 9
A researcher conducted 15 significance tests on the same data. What is the experimentwise error rate? What pvalue value would you use as a threshold for significance to correct for this error rate?
We can calculate the experimentwise or familywise error rate using the following equation from the book:
$$
\begin{aligned} \text{familywise error} &= 10.95^n \\ &= 10.95^{15} \\ &= 10.46 \\ &= 0.54 \end{aligned}
$$
In other words, when conducting 15 tests there is a 54% chance of at least one Type I error, compared to 5% when only one test is conducted.
We can correct for this error rate using the Bonferroni correction. It is not the only method , but it illustrates the general principle well. All it means is that because 15 tests were conducted, rather than accept a result as significant if its probability is below 0.05, we accept it as significant if its probability is below 0.05/15 = 0.003. In doing so, we ensure that the familywise Type I error remains below 0.05. In other words, the pvalue that you would use as a threshold for significance to correct for the current error rate is 0.003.
Puzzle 10
What is the statistical power of a test?
The power of a statistical test is the probability that it will detect an effect when one actually exists.