Author: Kayla Keyue Chen
Last updated: 2024.4.28
Sampling Distribution of the Mean
The sampling distribution of a statistic is the distribution of that
statistic when derived from a random sample of size n.
It may be considered as the distribution of the statistic for all
possible samples from the same population of a given sample
size.
In many contexts, only one sample is observed, but the sampling
distribution can be found theoretically.
Let’s see some simulations to understand the concept.
We assume that IQ scores are normally distributed. Suppose it is true
that the population have mean of 100 and SD of 15 (note: we wouldn’t
know this, if it’s real-world research … we just assume this for our
simulation here).
- Now, we recruit 100 participants and assess their IQ.
# assess 100 participants and record their IQ scores
IQ1 <- rnorm(100, mean = 100, sd = 15)
# see the first 10 values
IQ1[1:10]
[1] 119.99916 85.70619 88.90712 59.95192 76.12173 106.15860 104.91585
[8] 82.89396 117.54255 93.64971
# get the mean of the sample
(mean1 <- mean(IQ1))
[1] 98.55371
- This is only one sample. Let’s have another four samples
# assess another 4 groups of 100 participants and record their IQ scores
IQ2 <- rnorm(100, mean = 100, sd = 15)
IQ3 <- rnorm(100, mean = 100, sd = 15)
IQ4 <- rnorm(100, mean = 100, sd = 15)
IQ5 <- rnorm(100, mean = 100, sd = 15)
# get the mean of the sample
mean2 <- mean(IQ2)
mean3 <- mean(IQ3)
mean4 <- mean(IQ4)
mean5 <- mean(IQ5)
# save all means into a vector
means <- c(mean1, mean2, mean3, mean4, mean5)
# calculate the mean of the sample means
mean(means) # which is quite close to 100!
[1] 99.77822
- Let’s create a function so that we don’t need to do it by hand each
time!
- n_exp = number of experiments
- sample_size = the sample size of each experiment
- mean and sd are the mean (100) and sd (15) of the population we
assumed
# define a function called do_experiment
# it's like we are recruiting participants from the population, and for each experiment, we record the mean of IQ of the group of participants.
do_experiment <- function(n_exp, sample_size, mean, sd){
mean_list <- c()
for (n in 1:n_exp){
one_mean <- mean(rnorm(sample_size, mean, sd))
mean_list <- c(mean_list, one_mean)
}
return(mean_list)
}
# Try 10 experiment, with each of sample size = 100.
# Note again the population is assumed to have the mean of 100, and the sd of 15.
do_experiment(n_exp=10, sample_size=100, mean=100, sd=15) # mean IQ of the 10 experiments we did are shown below
[1] 102.64072 100.89270 99.71262 98.86433 101.30048 100.22463 98.00525
[8] 99.35039 101.01348 99.52282
- Do three types of simulations listed below, calculate the mean and
SD of the sample means, and plot the distribution of the means using
histogram.
- Do 1000 experiments. Each time we recruit 5 participants.
- Do 1000 experiments. Each time we recruit 20 participants.
- Do 1000 experiments. Each time we recruit 100 participants.
set.seed(99)
# Do 1000 experiments. Each time we recruit 5, 20, or 100 participants.
results <- data.frame(size5 = do_experiment(1000, 5, 100, 15),
size20 = do_experiment(1000, 20, 100, 15),
size100 = do_experiment(1000, 100, 100, 15))
# show means from the first 6 experiments for each sample size
head(results)
size5 size20 size100
1 102.58741 95.90019 100.4875
2 94.27028 100.34994 101.9814
3 86.12904 97.99459 99.9001
4 95.89241 104.24758 102.4212
5 105.01035 93.62778 102.4371
6 102.16650 100.83881 100.3689
# draw a histogram of means for each sample size
results_long <- results %>% gather(size, means, 1:3) %>%
mutate(size = factor(size, levels = c("size5","size20","size100")))
ggplot(results_long, aes(x = means)) +
geom_histogram(binwidth = 2, fill = "white", color = "black") +
xlim(70, 130) +
facet_wrap(~size)

What do you find about the spreading of the distributions? - The
larger the sample size, the more spreading the distribution of sample
means is.
Standard Error of the Mean (SEM)
Recall that \(Standard\ Error\ of\ the\
Mean\ (SEM) = \frac{SD}{\sqrt{N}}\).
Since the denominator is the square root of the sample size (N), the
larger the sample size, the smaller the SEM. This is consistent with our
observation above. The larger the sample size, the more spreading the
distribution of sample means is!
- Let’s calculate the SEM using the equation above, and compare to the
SD of the distribution of sample means (we’ve got them from part
4).
compare <- data.frame(size5 = c(sd(rnorm(5, 100, 15))/sqrt(5), sd(results$size5)),
size20 = c(sd(rnorm(20, 100, 15))/sqrt(20), sd(results$size20)),
size100 = c(sd(rnorm(100, 100, 15))/sqrt(100), sd(results$size100)))
row.names(compare) <- c("Calculated", "Simulated")
compare
size5 size20 size100
Calculated 5.341117 2.86946 1.462043
Simulated 7.004362 3.42136 1.502792
Every time you run this chunk of code, the results would be slightly
different for the “Calculated” SEM. This is because every time you run
the code, you are recruiting new groups of participants. Nevertheless,
you can see that the results are close to those in our previous
simulations (in part 4) at least most of the time. They are closer for
larger sample size, 20 and 100.
NB: For the purpose of replication, you can set seed by
set.seed()
to get the same results every time (as long as
the seed is the same). This is what we did in some other code
chunks.
95% Confidence Interval (CI) of the Mean
Confidence interval (CI) is a range of estimates for an unknown
parameter. A confidence interval is computed at a designated confidence
level, most commonly 95%.
The equation to calculate 95% CI is \(CI =
Mean\ \pm\ 1.96 \times SEM\)
- Let’s define a function to do numerous experiments for us.
- n_exp = number of experiments
- sample_size = the sample size of each experiment
- p_mean and p_sd are the mean and sd of the population we
assumed
The function will record the following information:
- Whether CI includes the population mean (i.e., 100) in each
experiment
- The proportion of experiments in which CIs include the population
mean
- All the CIs (all pairs of upper and lower bounds from all
experiments)
# define a function called CI_include
# it's like we are recruiting participants from the population, and for each experiment, we calculate CI of the mean of IQ.
CI_include <- function(n_exp, sample_size, p_mean, p_sd){
all_results <- c()
include_list <- c()
CI <- c()
for (i in 1:n_exp){
samples <- rnorm(sample_size, p_mean, p_sd)
mean <- mean(samples)
sd <- sd(samples)
se <- sd/sqrt(sample_size)
CI[[i]] <- c(mean - 1.96 * se, mean + 1.96 * se)
if (CI[[i]][1] <= 100 & CI[[i]][2] >= 100){
include_list <- c(include_list, TRUE)}
else {
include_list <- c(include_list, FALSE)}
}
all_results[[1]] <- include_list # whether CI includes the population mean (i.e., 100) in each experiment
all_results[[2]] <- mean(include_list) # the proportion of experiments in which CIs include the population mean
all_results[[3]] <- CI # all the CIs (all pairs of upper and lower bounds from all experiments)
return(all_results)
}
- Do 1000 experiments, and see in what proportion of the experiment,
the calculated CIs (based on the equation above) contain the true
population mean (which we assume is 100).
set.seed(99)
# use the function CI_include() we defined above
# let's set the sample size to be 30 (ie. recruit 30 participants for each experiment)
CI_30 <- CI_include(1000, 30, 100, 15)
CI_30[[2]]
[1] 0.946
# In 94.6% of the 1000 experiments, we have calculated a CI that contains the true population mean (100).
Therefore, 95% confidence is a confidence that in the long-run 95% of
the CIs will include the population mean. It is a confidence in the
algorithm and not a statement about a single, specific CI.
---
title: "PLINSTAT lecture 2: SEM and CI Explained by Simulation"
output:
  html_document: 
    toc: true
    toc_float: true
    code_download: true
params:
  flex: TRUE
---
  
```{r setup, include=FALSE}
#students: this is the set up chunk, it can be ignored
knitr::opts_chunk$set(warning=FALSE, message=FALSE, comment = "")
library(knitr)
library(tidyverse)
```

Author: Kayla Keyue Chen

Last updated: 2024.4.28

## Sampling Distribution of the Mean

The sampling distribution of a statistic is the distribution of that statistic when derived from a random sample of size n. 

* It may be considered as the distribution of the statistic for all possible samples from the same population of a given sample size.

* In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.

Let's see some simulations to understand the concept. 

We assume that IQ scores are normally distributed. Suppose it is true that the population have mean of 100 and SD of 15 (note: we wouldn't know this, if it's real-world research ... we just assume this for our simulation here). 

1. Now, we recruit 100 participants and assess their IQ. 

```{r}
# assess 100 participants and record their IQ scores
IQ1 <- rnorm(100, mean = 100, sd = 15)
# see the first 10 values 
IQ1[1:10]
# get the mean of the sample
(mean1 <- mean(IQ1))
```
2. This is only one sample. Let's have another four samples 
```{r}
# assess another 4 groups of 100 participants and record their IQ scores
IQ2 <- rnorm(100, mean = 100, sd = 15)
IQ3 <- rnorm(100, mean = 100, sd = 15)
IQ4 <- rnorm(100, mean = 100, sd = 15)
IQ5 <- rnorm(100, mean = 100, sd = 15)
# get the mean of the sample
mean2 <- mean(IQ2)
mean3 <- mean(IQ3)
mean4 <- mean(IQ4)
mean5 <- mean(IQ5)

# save all means into a vector
means <- c(mean1, mean2, mean3, mean4, mean5)

# calculate the mean of the sample means 
mean(means) # which is quite close to 100! 
```

3. Let's create a function so that we don't need to do it by hand each time! 

* n_exp = number of experiments
* sample_size = the sample size of each experiment
* mean and sd are the mean (100) and sd (15) of the population we assumed 

```{r}
# define a function called do_experiment
# it's like we are recruiting participants from the population, and for each experiment, we record the mean of IQ of the group of participants. 
do_experiment <- function(n_exp, sample_size, mean, sd){
  mean_list <- c()
    for (n in 1:n_exp){
      one_mean <- mean(rnorm(sample_size, mean, sd))
      mean_list <- c(mean_list, one_mean)
    }
  return(mean_list)
}

# Try 10 experiment, with each of sample size = 100. 
# Note again the population is assumed to have the mean of 100, and the sd of 15.
do_experiment(n_exp=10, sample_size=100, mean=100, sd=15) # mean IQ of the 10 experiments we did are shown below
```

4. Do three types of simulations listed below, calculate the mean and SD of the sample means, and plot the distribution of the means using histogram. 

* Do 1000 experiments. Each time we recruit 5 participants. 
* Do 1000 experiments. Each time we recruit 20 participants. 
* Do 1000 experiments. Each time we recruit 100 participants. 

```{r, fig.width=8, fig.height=4}
set.seed(99)

# Do 1000 experiments. Each time we recruit 5, 20, or 100 participants. 
results <- data.frame(size5 = do_experiment(1000, 5, 100, 15),
                      size20 = do_experiment(1000, 20, 100, 15),
                      size100 = do_experiment(1000, 100, 100, 15))

# show means from the first 6 experiments for each sample size
head(results)

# draw a histogram of means for each sample size
results_long <- results %>% gather(size, means, 1:3) %>% 
  mutate(size = factor(size, levels = c("size5","size20","size100")))
ggplot(results_long, aes(x = means)) + 
  geom_histogram(binwidth = 2, fill = "white", color = "black") + 
  xlim(70, 130) + 
  facet_wrap(~size)
```

What do you find about the spreading of the distributions? - The larger the sample size, the more spreading the distribution of sample means is. 

## Standard Error of the Mean (SEM)

Recall that $Standard\ Error\ of\ the\ Mean\ (SEM) = \frac{SD}{\sqrt{N}}$. 

Since the denominator is the square root of the sample size (N), the larger the sample size, the smaller the SEM. This is consistent with our observation above. The larger the sample size, the more spreading the distribution of sample means is! 

5. Let's calculate the SEM using the equation above, and compare to the SD of the distribution of sample means (we've got them from part 4).

```{r}
compare <- data.frame(size5 = c(sd(rnorm(5, 100, 15))/sqrt(5), sd(results$size5)),
                      size20 = c(sd(rnorm(20, 100, 15))/sqrt(20), sd(results$size20)),
                      size100 = c(sd(rnorm(100, 100, 15))/sqrt(100), sd(results$size100)))
row.names(compare) <- c("Calculated", "Simulated")
compare
```

Every time you run this chunk of code, the results would be slightly different for the "Calculated" SEM. This is because every time you run the code, you are recruiting new groups of participants. Nevertheless, you can see that the results are close to those in our previous simulations (in part 4) at least most of the time. They are closer for larger sample size, 20 and 100.

NB: For the purpose of replication, you can set seed by `set.seed()` to get the same results every time (as long as the seed is the same). This is what we did in some other code chunks. 

## 95% Confidence Interval (CI) of the Mean 

Confidence interval (CI) is a range of estimates for an unknown parameter. A confidence interval is computed at a designated confidence level, most commonly 95%.

The equation to calculate 95% CI is $CI = Mean\ \pm\ 1.96 \times SEM$

6. Let's define a function to do numerous experiments for us. 

* n_exp = number of experiments
* sample_size = the sample size of each experiment
* p_mean and p_sd are the mean and sd of the population we assumed 

The function will record the following information: 

* Whether CI includes the population mean (i.e., 100) in each experiment
* The proportion of experiments in which CIs include the population mean 
* All the CIs (all pairs of upper and lower bounds from all experiments)

```{r}
# define a function called CI_include
# it's like we are recruiting participants from the population, and for each experiment, we calculate CI of the mean of IQ. 
CI_include <- function(n_exp, sample_size, p_mean, p_sd){
  all_results <- c()
  include_list <- c()
  CI <- c()
    for (i in 1:n_exp){
      samples <- rnorm(sample_size, p_mean, p_sd)
      mean <- mean(samples)
      sd <- sd(samples)
      se <- sd/sqrt(sample_size)
      CI[[i]] <- c(mean - 1.96 * se, mean + 1.96 * se)
      if (CI[[i]][1] <= 100 & CI[[i]][2] >= 100){
        include_list <- c(include_list, TRUE)}
      else {
        include_list <- c(include_list, FALSE)}
    }
  all_results[[1]] <- include_list # whether CI includes the population mean (i.e., 100) in each experiment
  all_results[[2]] <- mean(include_list) # the proportion of experiments in which CIs include the population mean 
  all_results[[3]] <- CI # all the CIs (all pairs of upper and lower bounds from all experiments)
  return(all_results)
}
```

7. Do 1000 experiments, and see in what proportion of the experiment, the calculated CIs (based on the equation above) contain the true population mean (which we assume is 100). 

```{r}
set.seed(99)

# use the function CI_include() we defined above
# let's set the sample size to be 30 (ie. recruit 30 participants for each experiment)
CI_30 <- CI_include(1000, 30, 100, 15)
CI_30[[2]]
# In 94.6% of the 1000 experiments, we have calculated a CI that contains the true population mean (100). 
```
Therefore, 95% confidence is a confidence that in the long-run 95% of the CIs will include the population mean. It is a confidence in the algorithm and not a statement about a single, specific CI.
