Chapter 15
Comparing two means
Puzzle 1
Earlier in his journey, Milton tried to convince Zach that trying to learn statistics dressed as Florence Nightingale would help him (Chapter 8). This intervention was based on research by Zhang et al. (2013) showing that women completing a maths test under a fake name performed better than those using their real name. Table 15.5 (in the book and reproduced below) has a random selection of the scores from that study. The table shows scores from 20 women and 20 men; in each case half performed the test using their real name whereas the other half used a fake name. Conduct an analysis on the women’s data to see whether scores were significantly higher for those using a fake name compared to those using their own name.
Table 15.5 (reproduced): A random selection of data from Zhang et al. (2013). Scores show the percentage correct on a maths test when males and females take the test either using their own name, or a fake name  

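| | Females: Fake | Females: Own | Males: Fake | Males: Own |
|---|---|---|---|---|
| | 60 | 47 | 77 | 72 |
| | 78 | 63 | 85 | 70 |
| | 57 | 27 | 57 | 57 |
| | 64 | 42 | 31 | 57 |
| | 89 | 23 | 84 | 27 |
| | 79 | 75 | 67 | 83 |
| | 63 | 24 | 40 | 64 |
| | 81 | 44 | 17 | 57 |
| | 46 | 40 | 85 | 47 |
| | 50 | 44 | 57 | 80 |
| Mean | 66.70 | 42.90 | 60.00 | 61.40 |
| SD | 14.33 | 16.56 | 24.20 | 16.51 |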
Whether someone took the maths test using their own name or a fake name is represented by the variable Name, and we want to use this variable to predict Accuracy, which is how well the person scored on the maths test. The model can be expressed as follows:
$$ \text{Accuracy}_i = b_0 + b_1 \text{Name}_i + \epsilon_i $$
Because the experiment in this puzzle uses different entities in each group (i.e., different women took part in the fake name and own name conditions), the design is known as an independent design and therefore requires us to conduct an independent t-test. We could assign participants who took the maths test under their own name a 0 for the variable Name; this is the baseline group. The ‘experimental’ group was made up of participants who took the test under a fake name, and we could assign these participants a value of 1 for the variable Name.
We need to calculate estimates for $ b_0 $ and $ b_1 $ in the model. To compute $ b_1 $, we calculate the total relationship between the predictor and outcome, the sum of cross products (SCP), and divide this value by the total squared deviation of the predictor from its mean, $ SS_x $. You can find these values in this table:
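Calculating the sum of squared deviations for $x$ ($SS_x$) and the sum of cross-product deviations ($SCP$) between $x$ and $y$

| Accuracy $(y_i)$ | Name $(x_i)$ | $(x_i - \bar{x})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|---|---|
| 60 | 1 | 0.5 | 0.25 | 5.2 | 2.6 |
| 78 | 1 | 0.5 | 0.25 | 23.2 | 11.6 |
| 57 | 1 | 0.5 | 0.25 | 2.2 | 1.1 |
| 64 | 1 | 0.5 | 0.25 | 9.2 | 4.6 |
| 89 | 1 | 0.5 | 0.25 | 34.2 | 17.1 |
| 79 | 1 | 0.5 | 0.25 | 24.2 | 12.1 |
| 63 | 1 | 0.5 | 0.25 | 8.2 | 4.1 |
| 81 | 1 | 0.5 | 0.25 | 26.2 | 13.1 |
| 46 | 1 | 0.5 | 0.25 | −8.8 | −4.4 |
| 50 | 1 | 0.5 | 0.25 | −4.8 | −2.4 |
| 47 | 0 | −0.5 | 0.25 | −7.8 | 3.9 |
| 63 | 0 | −0.5 | 0.25 | 8.2 | −4.1 |
| 27 | 0 | −0.5 | 0.25 | −27.8 | 13.9 |
| 42 | 0 | −0.5 | 0.25 | −12.8 | 6.4 |
| 23 | 0 | −0.5 | 0.25 | −31.8 | 15.9 |
| 75 | 0 | −0.5 | 0.25 | 20.2 | −10.1 |
| 24 | 0 | −0.5 | 0.25 | −30.8 | 15.4 |
| 44 | 0 | −0.5 | 0.25 | −10.8 | 5.4 |
| 40 | 0 | −0.5 | 0.25 | −14.8 | 7.4 |
| 44 | 0 | −0.5 | 0.25 | −10.8 | 5.4 |
| Mean: 54.80 | Mean: 0.50 | | | | |
| | | | Sum: 5.00 | | Sum: 119.00 |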
To calculate an estimate of $ b_1 $ we divide the sum of cross-product deviations by the sum of squared deviations for the predictor:
$$ \hat{b}_1 = \frac{SCP}{SS_x} = \frac{119}{5} = 23.80 $$
We could also calculate $ b_1 $ by taking the difference between the two group means (you can find the means in Table 15.5 (reproduced)):
$$ \hat{b}_1 = \bar{X}_\text{Fake Name} - \bar{X}_\text{Own Name} = 66.70 - 42.90 = 23.80 $$
Calculating $ b_0 $ is easy because it is equal to the mean of the baseline group (the group that we coded as zero), which in this case was the own name group:
$$ \hat{b}_0 = 42.90 $$
or
$$ \begin{aligned} \hat{Y}_i &= \hat{b}_0 + \hat{b}_1X_i \\ \hat{b}_0 &= \hat{Y}_i - \hat{b}_1X_i \\ &= 54.80 - 23.80 \times 0.50 \\ &= 42.90 \end{aligned} $$
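If you would like to check these hand calculations, the estimates can be reproduced in a few lines of code. This is only a minimal sketch using the women’s scores from Table 15.5; the function and variable names (`estimate_parameters`, `accuracy`, `name`) are my own and do not come from the book.

```python
# Women's scores from Table 15.5.
fake = [60, 78, 57, 64, 89, 79, 63, 81, 46, 50]
own = [47, 63, 27, 42, 23, 75, 24, 44, 40, 44]

# Outcome and dummy-coded predictor (1 = fake name, 0 = own name).
accuracy = fake + own
name = [1] * len(fake) + [0] * len(own)


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def estimate_parameters(x, y):
    """Return (b0, b1) for the model y = b0 + b1 * x using SCP / SS_x."""
    x_bar, y_bar = mean(x), mean(y)
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    scp = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = scp / ss_x
    b0 = y_bar - b1 * x_bar  # the fitted line passes through the two means
    return b0, b1


b0, b1 = estimate_parameters(name, accuracy)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # b0 = 42.90, b1 = 23.80
```

Notice that `b1` matches the difference between the two group means and `b0` matches the mean of the group coded 0.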
Next, we need to calculate the sum of squared residuals, which I have done in the table below.
Calculating the sum of squared residuals $\sum(Y_i - \hat{Y}_i)^2$

| Accuracy $(y_i)$ | Name $(x_i)$ | Model | $\hat{Y}$ | Residual $Y_i - \hat{Y}$ | Residual squared $(Y_i - \hat{Y})^2$ |
|---|---|---|---|---|---|
| 60 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | −6.7 | 44.89 |
| 78 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | 11.3 | 127.69 |
| 57 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | −9.7 | 94.09 |
| 64 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | −2.7 | 7.29 |
| 89 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | 22.3 | 497.29 |
| 79 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | 12.3 | 151.29 |
| 63 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | −3.7 | 13.69 |
| 81 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | 14.3 | 204.49 |
| 46 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | −20.7 | 428.49 |
| 50 | 1 | 42.9 + (23.8 $\times$ 1) | 66.7 | −16.7 | 278.89 |
| 47 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | 4.1 | 16.81 |
| 63 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | 20.1 | 404.01 |
| 27 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | −15.9 | 252.81 |
| 42 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | −0.9 | 0.81 |
| 23 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | −19.9 | 396.01 |
| 75 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | 32.1 | 1030.41 |
| 24 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | −18.9 | 357.21 |
| 44 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | 1.1 | 1.21 |
| 40 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | −2.9 | 8.41 |
| 44 | 0 | 42.9 + (23.8 $\times$ 0) | 42.9 | 1.1 | 1.21 |
| | | | | | Sum: 4,317.00 |
I calculated the sum of squared residuals to be 4317, and we can turn this total error into an average by dividing by the degrees of freedom. The degrees of freedom will be $ N - p $, where *p* is the number of parameters. There were 20 participants, *N*, and two parameters ($ b_0 $ and $ b_1 $), so we divide the residual sum of squares, 4317, by $ 20 - 2 = 18 $. The resulting mean squared error in the model is 239.83.
$$ \text{MSE} = \frac{SS_\text{R}}{\text{df}} = \frac{\sum_{i = 1}^n(Y_i - \hat{Y}_i)^2}{N - p} = \frac{4317}{18} = 239.83. $$
The standard error of the model will then be the square root of this value:
$$ SE_\text{Model} = \sqrt{\text{MSE}} = \sqrt{239.83} = 15.49. $$
We can then use the standard error in the model to calculate the standard error for $ b_1 $ by dividing by the square root of the sum of squares for the predictor ($ SS_x $):
$$ SE_{b} = \frac{SE_\text{model}}{\sqrt{SS_x}} = \frac{15.49}{\sqrt{5}} = 6.92. $$
Now we can calculate the t-value:
$$ t_{(N - p)} = \frac{\hat{b}}{SE_{b}} = \frac{23.80}{6.92} = 3.44. $$
Now we need to look up the critical value for t (see the table ‘Critical values of the t-distribution’ at the back of the main textbook) at the 0.05 significance level with 18 degrees of freedom, which is 2.10. Our observed value of 3.44 is larger than the critical value, indicating that we have a significant result. In other words, we can conclude that scores on the maths test were significantly different in females who used their own name compared to females who used a fake name, and the means tell us that women who used a fake name scored significantly higher on the maths test than women who used their own name.
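The steps from the residuals through to the t-value can be sketched in code as follows. Treat this as an illustrative check rather than a definitive implementation: the variable names are my own, the values of `b0` and `b1` are the estimates from the text, and the critical value of 2.10 is simply the tabled value quoted above (the Python standard library has no t-distribution to look it up).

```python
import math

# Women's scores from Table 15.5 (1 = fake name, 0 = own name).
fake = [60, 78, 57, 64, 89, 79, 63, 81, 46, 50]
own = [47, 63, 27, 42, 23, 75, 24, 44, 40, 44]
accuracy = fake + own
name = [1] * len(fake) + [0] * len(own)

b0, b1 = 42.9, 23.8  # parameter estimates from the text

# Residual = observed score minus the score predicted by the model.
residuals = [y - (b0 + b1 * x) for x, y in zip(name, accuracy)]
ss_r = sum(r ** 2 for r in residuals)

n, p = len(accuracy), 2      # 20 participants, 2 parameters
df = n - p
mse = ss_r / df              # mean squared error
se_model = math.sqrt(mse)    # standard error of the model

x_bar = sum(name) / n
ss_x = sum((x - x_bar) ** 2 for x in name)
se_b = se_model / math.sqrt(ss_x)  # standard error of b1

t = b1 / se_b
critical_t = 2.10  # tabled value for df = 18 at the 0.05 level, as quoted above

print(f"SS_R = {ss_r:.2f}, MSE = {mse:.2f}, SE_b = {se_b:.2f}, t({df}) = {t:.2f}")
print("significant" if abs(t) > critical_t else "not significant")
```

Because the code does not round intermediate values, the last decimal place of the standard error can differ slightly from the hand calculation.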
Puzzle 2
Conduct the same analysis as above but on the male participants.
For this puzzle we will follow the same procedure as in Puzzle 1, only using the data for the male participants in Table 15.5 rather than the data for the female participants. The model will therefore be the same
$$ \text{Accuracy}_i = b_0 + b_1 \text{Name}_i + \epsilon_i $$
The experiment in this puzzle again uses different entities in each group (i.e., different men took part in the fake name and own name conditions), so again we have an independent design, which requires us to conduct an independent t-test. When we conducted this test on the female participants (Puzzle 1) we assigned participants who took the maths test under their own name a 0 for the variable Name (the baseline group), and participants who took the test under a fake name (the experimental group) a value of 1 for the variable Name, and we can do the same for the male participants.
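Calculating the sum of squared deviations for $x$ ($SS_x$) and the sum of cross-product deviations ($SCP$) between $x$ and $y$

| Accuracy $(y_i)$ | Name $(x_i)$ | $(x_i - \bar{x})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|---|---|
| 77 | 1 | 0.5 | 0.25 | 16.3 | 8.15 |
| 85 | 1 | 0.5 | 0.25 | 24.3 | 12.15 |
| 57 | 1 | 0.5 | 0.25 | −3.7 | −1.85 |
| 31 | 1 | 0.5 | 0.25 | −29.7 | −14.85 |
| 84 | 1 | 0.5 | 0.25 | 23.3 | 11.65 |
| 67 | 1 | 0.5 | 0.25 | 6.3 | 3.15 |
| 40 | 1 | 0.5 | 0.25 | −20.7 | −10.35 |
| 17 | 1 | 0.5 | 0.25 | −43.7 | −21.85 |
| 85 | 1 | 0.5 | 0.25 | 24.3 | 12.15 |
| 57 | 1 | 0.5 | 0.25 | −3.7 | −1.85 |
| 72 | 0 | −0.5 | 0.25 | 11.3 | −5.65 |
| 70 | 0 | −0.5 | 0.25 | 9.3 | −4.65 |
| 57 | 0 | −0.5 | 0.25 | −3.7 | 1.85 |
| 57 | 0 | −0.5 | 0.25 | −3.7 | 1.85 |
| 27 | 0 | −0.5 | 0.25 | −33.7 | 16.85 |
| 83 | 0 | −0.5 | 0.25 | 22.3 | −11.15 |
| 64 | 0 | −0.5 | 0.25 | 3.3 | −1.65 |
| 57 | 0 | −0.5 | 0.25 | −3.7 | 1.85 |
| 47 | 0 | −0.5 | 0.25 | −13.7 | 6.85 |
| 80 | 0 | −0.5 | 0.25 | 19.3 | −9.65 |
| Mean: 60.70 | Mean: 0.50 | | | | |
| | | | Sum: 5.00 | | Sum: −7.00 |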
To calculate an estimate of $ b_1 $ we divide the sum of cross-product deviations by the sum of squared deviations for the predictor:
$$ \hat{b}_1 = \frac{SCP}{SS_x} = \frac{-7.00}{5} = -1.40 $$
We could also calculate $ b_1 $ by taking the difference between the two group means (you can find the means in Table 15.5 (reproduced)):
$$ \hat{b}_1 = \bar{X}_\text{Fake} - \bar{X}_\text{Own} = 60.00 - 61.40 = -1.40 $$
Calculating $ b_0 $ is easy because it is equal to the mean of the baseline group (the group that we coded as zero), which in this case was the own name group:
$$ \hat{b}_0 = 61.4 $$
or
$$ \begin{aligned} \hat{Y}_i &= \hat{b}_0 + \hat{b}_1X_i \\ \hat{b}_0 &= \hat{Y}_i - \hat{b}_1X_i \\ &= 60.70 - (-1.40 \times 0.50) \\ &= 61.40 \end{aligned} $$
Next, we need to calculate the sum of squared residuals, which I have done in the table below.
Calculating the sum of squared residuals $\sum(Y_i - \hat{Y}_i)^2$

| Accuracy $(y_i)$ | Name $(x_i)$ | Model | $\hat{Y}$ | Residual $Y_i - \hat{Y}$ | Residual squared $(Y_i - \hat{Y})^2$ |
|---|---|---|---|---|---|
| 77 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | 17.0 | 289.00 |
| 85 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | 25.0 | 625.00 |
| 57 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | −3.0 | 9.00 |
| 31 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | −29.0 | 841.00 |
| 84 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | 24.0 | 576.00 |
| 67 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | 7.0 | 49.00 |
| 40 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | −20.0 | 400.00 |
| 17 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | −43.0 | 1849.00 |
| 85 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | 25.0 | 625.00 |
| 57 | 1 | 61.4 + (−1.4 $\times$ 1) | 60.0 | −3.0 | 9.00 |
| 72 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | 10.6 | 112.36 |
| 70 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | 8.6 | 73.96 |
| 57 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | −4.4 | 19.36 |
| 57 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | −4.4 | 19.36 |
| 27 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | −34.4 | 1183.36 |
| 83 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | 21.6 | 466.56 |
| 64 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | 2.6 | 6.76 |
| 57 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | −4.4 | 19.36 |
| 47 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | −14.4 | 207.36 |
| 80 | 0 | 61.4 + (−1.4 $\times$ 0) | 61.4 | 18.6 | 345.96 |
| | | | | | Sum: 7,726.40 |
I calculated the sum of squared residuals to be 7726.40, and we can turn this total error into an average by dividing by the degrees of freedom. The degrees of freedom will be $ N - p $, where *p* is the number of parameters. There were 20 participants, *N*, and two parameters ($ b_0 $ and $ b_1 $), so we divide the residual sum of squares, 7726.40, by $ 20 - 2 = 18 $. The resulting mean squared error in the model is 429.24.
$$ \text{MSE} = \frac{SS_\text{R}}{\text{df}} = \frac{\sum_{i = 1}^n(Y_i - \hat{Y}_i)^2}{N - p} = \frac{7726.40}{18} = 429.24. $$
The standard error of the model will then be the square root of this value:
$$ SE_\text{Model} = \sqrt{\text{MSE}} = \sqrt{429.24} = 20.72. $$
We can then use the standard error in the model to calculate the standard error for $ b_1 $ by dividing by the square root of the sum of squares for the predictor ($ SS_x $):
$$ SE_{b} = \frac{SE_\text{model}}{\sqrt{SS_x}} = \frac{20.72}{\sqrt{5}} = 9.27. $$
Now we can calculate the t-value:
$$ t_{(N - p)} = \frac{\hat{b}}{SE_{b}} = \frac{-1.40}{9.27} = -0.15. $$
Now we need to look up the critical value for t (see the table ‘Critical values of the t-distribution’ at the back of the main textbook) at the 0.05 significance level with 18 degrees of freedom, which is 2.10. Ignoring the minus sign (which tells us only the direction of the effect), our observed value of $ -0.15 $ is smaller than the critical value, indicating that we do not have a significant result. In other words, scores on the maths test were not significantly different in males who used their own name compared to those who used a fake name.
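As a cross-check on the men’s result, the same t-value can be obtained directly from the group means and standard deviations in Table 15.5. Note that this sketch uses the standard pooled-variance formula for an independent t-test, which is a different (but equivalent) route from the linear model used above; the function name `pooled_t` is my own.

```python
import math


def pooled_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Independent t-test from summary statistics, assuming equal variances.

    Returns the t-value and its degrees of freedom.
    """
    df = n_1 + n_2 - 2
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))
    return (mean_1 - mean_2) / se_diff, df


# Men in Table 15.5: fake name group first, own name group second.
t_male, df_male = pooled_t(60.00, 24.20, 10, 61.40, 16.51, 10)
print(f"t({df_male}) = {t_male:.2f}")
```

Small differences from the hand calculation are possible because the standard deviations in the table are rounded to two decimal places.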
Puzzle 3
Using the analyses in Puzzles 1 and 2, calculate the Cohen’s ds for the effect of using a fake vs. own name for both males and females.
Let’s start with calculating Cohen’s d for the female data. I am going to calculate Cohen’s d using the pooled standard deviation to give you some practice of how to calculate it. The two groups both contained 10 participants so the Ns will both be 10. The standard deviations for the fake name and own name groups are given in Puzzle 1. Using these values we can obtain the pooled standard deviation
$$ \begin{aligned} s_p &= \sqrt{\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}} \\ &= \sqrt{\frac{(10 - 1)14.33^2 + (10 - 1)16.56^2}{10 + 10 - 2}} \\ &= \sqrt{\frac{4316.24}{18}} \\ &= \sqrt{239.79} \\ &= 15.49. \end{aligned} $$
We then use this pooled standard deviation to calculate d using the following equation:
$$ \begin{aligned} \hat{d} &= \frac{\bar{X}_F - \bar{X}_O}{s_p} \\ &= \frac{66.70 - 42.90}{15.49} \\ &= 1.54 \end{aligned} $$
So we end up with an effect size of $ \hat{d} = 1.54 $. In other words, in the female data, scores in the fake name condition were 1.54 standard deviations higher than the own name condition, which is a large effect.
We can calculate Cohen’s d for the male data in exactly the same way. I am going to use the pooled standard deviation again and so we need to calculate this first; again we use the means and standard deviations given in Puzzle 1. The pooled standard deviation is
$$ \begin{aligned} s_p &= \sqrt{\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}} \\ &= \sqrt{\frac{(10 - 1)24.20^2 + (10 - 1)16.51^2}{10 + 10 - 2}} \\ &= \sqrt{\frac{7723.98}{18}} \\ &= \sqrt{429.11} \\ &= 20.71. \end{aligned} $$
We then use this pooled standard deviation to calculate d using the following equation:
$$ \begin{aligned} \hat{d} &= \frac{\bar{X}_F - \bar{X}_O}{s_p} \\ &= \frac{60.00 - 61.40}{20.71} \\ &= -0.07 \end{aligned} $$
So we end up with an effect size of $ \hat{d} = -0.07 $. In other words, in the male data, scores in the fake name condition were 0.07 of a standard deviation lower than the own name condition, which is a tiny effect.
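The pooled-standard-deviation version of Cohen’s d can also be computed straight from the raw scores in Table 15.5, which avoids rounding the standard deviations first. The following is a minimal sketch; the function name `cohens_d` is my own, and `stdev` from the Python standard library is the sample standard deviation (dividing by $ N - 1 $).

```python
import math
from statistics import mean, stdev


def cohens_d(group_1, group_2):
    """Cohen's d (group_1 minus group_2) using the pooled standard deviation."""
    n_1, n_2 = len(group_1), len(group_2)
    pooled_var = (
        (n_1 - 1) * stdev(group_1) ** 2 + (n_2 - 1) * stdev(group_2) ** 2
    ) / (n_1 + n_2 - 2)
    return (mean(group_1) - mean(group_2)) / math.sqrt(pooled_var)


# Scores from Table 15.5.
female_fake = [60, 78, 57, 64, 89, 79, 63, 81, 46, 50]
female_own = [47, 63, 27, 42, 23, 75, 24, 44, 40, 44]
male_fake = [77, 85, 57, 31, 84, 67, 40, 17, 85, 57]
male_own = [72, 70, 57, 57, 27, 83, 64, 57, 47, 80]

d_female = cohens_d(female_fake, female_own)
d_male = cohens_d(male_fake, male_own)
print(f"d (females) = {d_female:.2f}, d (males) = {d_male:.2f}")
```

With equal group sizes the pooled standard deviation is the same quantity as the standard error of the model from Puzzles 1 and 2, which is a handy way to check your arithmetic.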
Puzzle 4
Output 15.7 (in book and reproduced below) and Output 15.8 (in book and reproduced below) show Bayesian analyses of the female and male data from Table 15.5 (in book and reproduced in Puzzle 1). Interpret these outputs.
For the female data the Bayes factor is 12.92, which means that the data are 12.92 times more likely given the alternative hypothesis than they are given the null hypothesis, which would be regarded by Jeffreys as ‘strong evidence’ for the alternative hypothesis, i.e., that women score higher on a maths test when using a fake name compared to their own name. The output also shows that the Bayesian estimate, assuming that the alternative hypothesis is true, of the difference between means (beta) is 19.836 with a standard error of 0.248. You can also use the 2.5% and 97.5% quantiles as the limits of the 95% credible interval for that difference. Again, assuming the alternative hypothesis is true, there is a 95% probability that the difference between means is somewhere between 2.44 and 34.21.
Bayes factor analysis

[1] Alt., r=0.707 : 12.9241 ±0%
Against denominator:
Null, mu1-mu2 = 0

Bayes factor type: BFindepSample, JZS
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 54.838 3.7842 0.11967 0.11967
beta (Fake - Own) 19.836 7.8554 0.24841 0.29969
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 46.7436 52.5273 54.902 57.282 61.96
beta (Fake - Own) 2.4448 15.1047 20.140 24.759 34.21
For the male data the Bayes Factor is 0.40, which is less than 1 and therefore supports the null hypothesis (that there is no difference in test scores in men who used their own name compared to those who used a fake name) by suggesting that the probability of the data given the null is higher than the probability of the data given the alternative hypothesis. The output also shows that the Bayesian estimate, assuming that the alternative hypothesis is true, of the difference between means (beta) is $ −0.561 $ with a standard error of 0.243. You can also use the 2.5% and 97.5% quantiles as the limits of the 95% credible interval for that difference. Again, assuming the alternative hypothesis is true, there is a 95% probability that the difference between means is somewhere between $ −16.79 $ and 15.44.

[1] Alt., r=0.707 : 0.4006056 ±0%
Against denominator:
Null, mu1-mu2 = 0

Bayes factor type: BFindepSample, JZS
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 60.59589 4.7344 0.14971 0.14971
beta (Fake - Own) -0.56118 7.6947 0.24333 0.24333
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 51.20770 57.8040 60.55002 63.5746 70.3617
beta (Fake - Own) -16.78788 -5.2981 0.28806 4.0389 15.4411
Output 15.8: Abridged Bayes factor output for the male data in Table 15.5
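A Bayes factor below 1, like the one for the male data, is often easier to interpret when it is turned upside down so that it expresses the evidence for the null relative to the alternative. The subscripts below are my own shorthand (they do not appear in the output): $ BF_{10} $ is the value reported in Output 15.8 and $ BF_{01} $ is its reciprocal.

$$ BF_{01} = \frac{1}{BF_{10}} = \frac{1}{0.4006} \approx 2.50 $$

In other words, the male data are about 2.5 times more likely given the null hypothesis than given the alternative.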
Puzzle 5
Based on Puzzles 1 to 4 what can you conclude about the difference between males and females in the effect of taking a test using a fake name or your own name?
The analyses conducted in Puzzles 1 to 4 (t-tests, Cohen’s ds and Bayesian analyses) all indicate that women using a fake name tend to score higher on a maths test than those using their own name but men achieve similar scores whether they use a fake name or their own name.
Puzzle 6
Use the analyses in Puzzles 1 and 2 to write out the separate linear models for males and females that describe how accuracy scores are predicted from the type of name used. In these models, what do the $ b_1 $ and $ b_0 $ represent?
Let’s start with the females. As the question states, we are asking how well we can predict accuracy scores on a maths test based on whether participants used their own name or a fake name. This is a linear model with one dichotomous predictor:
$$ Y_i = \hat{b}_0 + \hat{b}_1X_i + \epsilon_i $$
We can replace the outcome, Y, with what we measured, Accuracy scores, and replace the predictor, X, with the group to which a person belonged (I called this variable Name). The variable Name is a dichotomous variable, in other words a nominal variable with two categories. We can’t put words into a statistical model so we must convert these two categories into numbers. In Puzzle 1 I used dummy coding and coded women who used their own name a 0 for the variable Name and women who used a fake name a 1 for the variable Name:
$$ \text{Accuracy}_i = \hat{b}_0 + \hat{b}_1\text{Name}_i + \epsilon_i. $$
In our model, $ b_0 $ tells us the accuracy score when the predictor is zero, i.e., it tells us the accuracy score when a woman used her own name (because I coded ‘own name’ as zero), and $ b_1 $ shows the relationship between the predictor (type of name used) and outcome (accuracy scores). We can replace the values of $ b_0 $ and $ b_1 $ with those that we calculated in Puzzle 1
$$ \text{Accuracy}_i = 42.90 + 23.8\text{Name}_i + \epsilon_i. $$
We can then use the model to predict accuracy scores in women who used their own name by replacing Name in the equation with 0. The answer is 42.90, and if you look back at Puzzle 1 you will see that this was also the mean score of the ‘own name’ group:
$$ \hat{\text{Accuracy}}_i = 42.90 + 23.8 \times 0 = 42.90 $$
To predict accuracy scores in women who used a fake name we replace Name in the equation with 1.
$$ \hat{\text{Accuracy}}_i = 42.90 + 23.8 \times 1 = 66.70. $$
The answer is 66.70, and if you look back at Puzzle 1 you will see that this was also the mean score of the ‘fake name’ group.
We can then do the same for the male participant data. The general model would be the same as for females:
$$ \text{Accuracy}_i = \hat{b}_0 + \hat{b}_1\text{Name}_i + \epsilon_i. $$
In the model above, $ b_0 $ tells us the accuracy score when the predictor is zero, i.e., it tells us the accuracy score when a man used his own name (because I coded ‘own name’ as zero), and $ b_1 $ shows the relationship between the predictor (type of name used) and outcome (accuracy scores). We can replace the values of $ b_0 $ and $ b_1 $ with those that we calculated in Puzzle 2
$$ \text{Accuracy}_i = 61.40 + (-1.40)\text{Name}_i + \epsilon_i. $$
We can then use the model to predict accuracy scores in men who used their own name by replacing Name in the equation with 0 (because I coded ‘own name’ as 0).
$$ \hat{\text{Accuracy}}_i = 61.40 - 1.40 \times 0 = 61.40. $$
The answer is 61.40, and if you look back at Puzzle 2 you will see that this was also the mean score of the ‘own name’ group. To predict accuracy scores in men who used a fake name we can replace Name in the equation with 1 (because I coded ‘fake name’ as 1):
$$ \hat{\text{Accuracy}}_i = 61.40 - 1.40 \times 1 = 60.00. $$
The answer is 60.00, and if you look back at Puzzle 2 you will see that this was also the mean score of the ‘fake name’ group.
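To see both models side by side, here is a small sketch that plugs the dummy codes into each equation. The parameter values are the estimates from Puzzles 1 and 2; the names `models` and `predict_accuracy` are my own and purely illustrative.

```python
# (b0, b1) for each model, taken from Puzzles 1 and 2.
models = {
    "females": (42.90, 23.80),
    "males": (61.40, -1.40),
}


def predict_accuracy(b0, b1, name):
    """Predicted accuracy for a dummy-coded Name (0 = own name, 1 = fake name)."""
    return b0 + b1 * name


predictions = {}
for sex, (b0, b1) in models.items():
    predictions[sex] = {
        "own": predict_accuracy(b0, b1, 0),   # equals b0, the baseline group mean
        "fake": predict_accuracy(b0, b1, 1),  # equals b0 + b1, the fake name group mean
    }
    print(f"{sex}: own = {predictions[sex]['own']:.2f}, "
          f"fake = {predictions[sex]['fake']:.2f}")
```

So in each model $ b_0 $ is the mean of the own name group and $ b_1 $ is the difference between the fake name and own name group means.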
Puzzle 7
Alice’s research for JIG:SAW (Alice’s Lab Notes 15.1 in the book) built upon a study by Hogarth et al. (2104) that showed that a calcite cloak could obscure what was behind it. Table 15.6 (in the book and reproduced below) shows the recognition scores for 10 participants in their study who had to recognize 20 objects hidden behind either a transparent cloak (the control) or a similar transparent cloak containing calcite. Carry out an analysis to see whether recognition scores were lower when objects were concealed by the control cloak or the calcite one.
To answer this puzzle we need to conduct a paired-samples t-test because the same participants took part in both conditions of the experiment (control cloak and calcite cloak). First, we need to calculate the difference scores, the mean difference score and the standard deviation, which I have done in the table.
Table 15.6 (reproduced): Data for 10 participants in Hogarth et al.'s (2104) experiment.  

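| Participant | Control | Calcite | Difference, $D$ (Calcite − Control) |
|---|---|---|---|
| 1 | 20 | 12 | −8 |
| 2 | 17 | 8 | −9 |
| 3 | 19 | 8 | −11 |
| 4 | 12 | 9 | −3 |
| 5 | 17 | 8 | −9 |
| 6 | 9 | 16 | 7 |
| 7 | 15 | 7 | −8 |
| 8 | 18 | 0 | −18 |
| 9 | 18 | 17 | −1 |
| 10 | 12 | 4 | −8 |
| Mean | 15.70 | 8.90 | −6.80 |
| Variance | 12.90 | 26.10 | 43.96 |
| SD | 3.59 | 5.11 | 6.63 |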
From the table we can see that the mean difference is $ \bar{D} = -6.80 $ and the standard deviation is $ s_{\bar{D}} = 6.63 $. This suggests that on average, people recognise fewer objects when they are obscured by a transparent cloak containing calcite than when they are obscured by a transparent cloak that does not contain calcite.
Now we need to calculate the standard error of differences, which tells us how much we can expect the mean difference score to vary across samples. We can calculate the standard error of differences using the following equation:
$$ SE_{\bar{D}} = \frac{s_{\bar{D}}}{\sqrt{N}} = \frac{6.63}{\sqrt{10}} = 2.10. $$
We can now calculate the t-value by dividing the mean difference by the standard error of differences as I have done in the equation below:
$$ t = \frac{\bar{D}}{SE_{\bar{D}}} = \frac{-6.80}{2.10} = -3.24. $$
If we look up the critical value for t at the 0.05 significance level with 9 ($ N - 1 $) degrees of freedom, we can see that it is 2.262. Our observed $ t = -3.24 $ is bigger in magnitude than the critical value (we can ignore the minus sign as that just tells us the direction of the effect), indicating a significant result. In other words, people recognised significantly fewer objects when they were concealed by a transparent cloak containing calcite than when they were concealed by a similar transparent cloak that did not contain calcite.
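The paired-samples calculation can be reproduced from the raw scores in Table 15.6. This is a minimal sketch with variable names of my own choosing; the critical value of 2.262 is the tabled value quoted above rather than something the code looks up.

```python
import math
from statistics import mean, stdev

# Recognition scores from Table 15.6 (same 10 participants in both conditions).
control = [20, 17, 19, 12, 17, 9, 15, 18, 18, 12]
calcite = [12, 8, 8, 9, 8, 16, 7, 0, 17, 4]

# Difference scores, calcite minus control, one per participant.
differences = [cal - con for cal, con in zip(calcite, control)]

n = len(differences)
d_bar = mean(differences)       # mean difference
s_d = stdev(differences)        # sample SD of the differences
se_d = s_d / math.sqrt(n)       # standard error of differences
t = d_bar / se_d
df = n - 1

critical_t = 2.262  # tabled value for df = 9 at the 0.05 level, as quoted above
print(f"mean D = {d_bar:.2f}, SD = {s_d:.2f}, SE = {se_d:.2f}, t({df}) = {t:.2f}")
print("significant" if abs(t) > critical_t else "not significant")
```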
Puzzle 8
What is the effect size for the effect of calcite on recognition compared to the control?
Remember that we are dealing with difference scores when we compute the test statistic. One very simple way to standardize the difference between means is to use the difference scores instead of the raw scores.
$$ \hat{d} = \frac{\bar{D}}{s_{\bar{D}}}\times \sqrt{2} = \frac{-6.80\sqrt{2}}{6.63} = -1.45. $$
So we end up with an effect size of $ \hat{d} = -1.45 $. In other words, recognition in the calcite condition was 1.45 standard deviations lower than in the control condition.
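For completeness, this effect size can be checked from the difference scores in Table 15.6. The sketch below applies the equation above as it is written (mean difference divided by the standard deviation of the differences, multiplied by $ \sqrt{2} $); the variable names are mine.

```python
import math
from statistics import mean, stdev

# Difference scores (calcite minus control) from Table 15.6.
differences = [-8, -9, -11, -3, -9, 7, -8, -18, -1, -8]

# Standardize the mean difference by the SD of the differences, then
# apply the square-root-of-2 factor used in the equation above.
d_hat = mean(differences) / stdev(differences) * math.sqrt(2)
print(f"d = {d_hat:.2f}")
```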
Puzzle 9
Output 15.9 (in the book and reproduced below) shows a Bayesian analysis of the recognition scores from Table 15.6. Interpret the output.
The Bayes factor is 6.15, which means that the data are 6.15 times more likely under the alternative hypothesis compared to under the null, which would be regarded by Jeffreys as ‘evidence with substance’ for the alternative hypothesis, i.e., that a cloak containing calcite decreases recognition scores when compared to a control cloak. The difference between means is estimated as 5.87, and there is a 95% probability that the effect lies between 1.41 and 10.63, assuming the alternative hypothesis is true.
Bayes factor analysis

[1] Alt., r=0.707 : 6.148254 ±0%
Against denominator:
Null, mu = 0

Bayes factor type: BFoneSample, JZS
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 5.8705 2.3628 0.07472 0.08214
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 1.4121 4.3430 5.8204 7.300 10.631
Puzzle 10
What are the 95% credible intervals in Outputs 15.7, 15.8, and 15.9? What is the difference between a confidence interval and a credible interval?
Looking at Outputs 15.7, 15.8 and 15.9, you can use the 2.5% and 97.5% quantiles as the limits of the 95% credible interval for that difference. So assuming the alternative hypothesis is true, there is a 95% probability that the difference between means is somewhere between 2.44 and 34.21 in Output 15.7, a 95% probability that the difference between means is somewhere between −16.79 and 15.44 in Output 15.8, and a 95% probability that the difference between means is somewhere between 1.41 and 10.63 in Output 15.9.
A confidence interval is set so that before the data are collected there is a 95% chance that the interval will contain the true value of the parameter. Once the data are collected, your sample is either one of the 95% that produces an interval containing the true value, or it is one of the 5% that does not. In other words, having collected the data, the probability of the interval containing the true value of the parameter is either 0 (it does not contain it) or 1 (it does contain it) but you do not know which. A credible interval is different in that it reflects the plausible probability that the interval contains the true value. For example, a 95% credible interval has a 0.95 probability of containing the true value; this is not true of a 95% confidence interval (the probability of it containing the true value is either 0 or 1, but you don’t know which).
You cannot use a credible interval to test hypotheses because it is constructed assuming that the alternative hypothesis is true. It tells you the interval within which the effect will fall with a 95% probability, assuming that the effect exists. It tells you nothing about the null hypothesis.