Author: Kayla Keyue Chen

Last updated: 2024.4.28

Sampling Distribution of the Mean

The sampling distribution of a statistic is the distribution of that statistic when derived from a random sample of size n. 

  • It may be considered as the distribution of the statistic for all possible samples from the same population of a given sample size.

  • In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.

Let’s see some simulations to understand the concept.

We assume that IQ scores are normally distributed. Suppose it is true that the population have mean of 100 and SD of 15 (note: we wouldn’t know this, if it’s real-world research … we just assume this for our simulation here).

  1. Now, we recruit 100 participants and assess their IQ.
# assess 100 participants and record their IQ scores
IQ1 <- rnorm(100, mean = 100, sd = 15)
# see the first 10 values 
IQ1[1:10]
 [1] 119.99916  85.70619  88.90712  59.95192  76.12173 106.15860 104.91585
 [8]  82.89396 117.54255  93.64971
# get the mean of the sample
(mean1 <- mean(IQ1))
[1] 98.55371
  1. This is only one sample. Let’s have another four samples
# assess another 4 groups of 100 participants and record their IQ scores
IQ2 <- rnorm(100, mean = 100, sd = 15)
IQ3 <- rnorm(100, mean = 100, sd = 15)
IQ4 <- rnorm(100, mean = 100, sd = 15)
IQ5 <- rnorm(100, mean = 100, sd = 15)
# get the mean of the sample
mean2 <- mean(IQ2)
mean3 <- mean(IQ3)
mean4 <- mean(IQ4)
mean5 <- mean(IQ5)

# save all means into a vector
means <- c(mean1, mean2, mean3, mean4, mean5)

# calculate the mean of the sample means 
mean(means) # which is quite close to 100! 
[1] 99.77822
  1. Let’s create a function so that we don’t need to do it by hand each time!
  • n_exp = number of experiments
  • sample_size = the sample size of each experiment
  • mean and sd are the mean (100) and sd (15) of the population we assumed
# define a function called do_experiment
# it's like we are recruiting participants from the population, and for each experiment, we record the mean of IQ of the group of participants. 
do_experiment <- function(n_exp, sample_size, mean, sd){
  mean_list <- c()
    for (n in 1:n_exp){
      one_mean <- mean(rnorm(sample_size, mean, sd))
      mean_list <- c(mean_list, one_mean)
    }
  return(mean_list)
}

# Try 10 experiment, with each of sample size = 100. 
# Note again the population is assumed to have the mean of 100, and the sd of 15.
do_experiment(n_exp=10, sample_size=100, mean=100, sd=15) # mean IQ of the 10 experiments we did are shown below
 [1] 102.64072 100.89270  99.71262  98.86433 101.30048 100.22463  98.00525
 [8]  99.35039 101.01348  99.52282
  1. Do three types of simulations listed below, calculate the mean and SD of the sample means, and plot the distribution of the means using histogram.
  • Do 1000 experiments. Each time we recruit 5 participants.
  • Do 1000 experiments. Each time we recruit 20 participants.
  • Do 1000 experiments. Each time we recruit 100 participants.
set.seed(99)

# Do 1000 experiments. Each time we recruit 5, 20, or 100 participants. 
results <- data.frame(size5 = do_experiment(1000, 5, 100, 15),
                      size20 = do_experiment(1000, 20, 100, 15),
                      size100 = do_experiment(1000, 100, 100, 15))

# show means from the first 6 experiments for each sample size
head(results)
      size5    size20  size100
1 102.58741  95.90019 100.4875
2  94.27028 100.34994 101.9814
3  86.12904  97.99459  99.9001
4  95.89241 104.24758 102.4212
5 105.01035  93.62778 102.4371
6 102.16650 100.83881 100.3689
# draw a histogram of means for each sample size
results_long <- results %>% gather(size, means, 1:3) %>% 
  mutate(size = factor(size, levels = c("size5","size20","size100")))
ggplot(results_long, aes(x = means)) + 
  geom_histogram(binwidth = 2, fill = "white", color = "black") + 
  xlim(70, 130) + 
  facet_wrap(~size)

What do you find about the spreading of the distributions? - The larger the sample size, the more spreading the distribution of sample means is.

Standard Error of the Mean (SEM)

Recall that \(Standard\ Error\ of\ the\ Mean\ (SEM) = \frac{SD}{\sqrt{N}}\).

Since the denominator is the square root of the sample size (N), the larger the sample size, the smaller the SEM. This is consistent with our observation above. The larger the sample size, the more spreading the distribution of sample means is!

  1. Let’s calculate the SEM using the equation above, and compare to the SD of the distribution of sample means (we’ve got them from part 4).
compare <- data.frame(size5 = c(sd(rnorm(5, 100, 15))/sqrt(5), sd(results$size5)),
                      size20 = c(sd(rnorm(20, 100, 15))/sqrt(20), sd(results$size20)),
                      size100 = c(sd(rnorm(100, 100, 15))/sqrt(100), sd(results$size100)))
row.names(compare) <- c("Calculated", "Simulated")
compare
              size5  size20  size100
Calculated 5.341117 2.86946 1.462043
Simulated  7.004362 3.42136 1.502792

Every time you run this chunk of code, the results would be slightly different for the “Calculated” SEM. This is because every time you run the code, you are recruiting new groups of participants. Nevertheless, you can see that the results are close to those in our previous simulations (in part 4) at least most of the time. They are closer for larger sample size, 20 and 100.

NB: For the purpose of replication, you can set seed by set.seed() to get the same results every time (as long as the seed is the same). This is what we did in some other code chunks.

95% Confidence Interval (CI) of the Mean

Confidence interval (CI) is a range of estimates for an unknown parameter. A confidence interval is computed at a designated confidence level, most commonly 95%.

The equation to calculate 95% CI is \(CI = Mean\ \pm\ 1.96 \times SEM\)

  1. Let’s define a function to do numerous experiments for us.
  • n_exp = number of experiments
  • sample_size = the sample size of each experiment
  • p_mean and p_sd are the mean and sd of the population we assumed

The function will record the following information:

  • Whether CI includes the population mean (i.e., 100) in each experiment
  • The proportion of experiments in which CIs include the population mean
  • All the CIs (all pairs of upper and lower bounds from all experiments)
# define a function called CI_include
# it's like we are recruiting participants from the population, and for each experiment, we calculate CI of the mean of IQ. 
CI_include <- function(n_exp, sample_size, p_mean, p_sd){
  all_results <- c()
  include_list <- c()
  CI <- c()
    for (i in 1:n_exp){
      samples <- rnorm(sample_size, p_mean, p_sd)
      mean <- mean(samples)
      sd <- sd(samples)
      se <- sd/sqrt(sample_size)
      CI[[i]] <- c(mean - 1.96 * se, mean + 1.96 * se)
      if (CI[[i]][1] <= 100 & CI[[i]][2] >= 100){
        include_list <- c(include_list, TRUE)}
      else {
        include_list <- c(include_list, FALSE)}
    }
  all_results[[1]] <- include_list # whether CI includes the population mean (i.e., 100) in each experiment
  all_results[[2]] <- mean(include_list) # the proportion of experiments in which CIs include the population mean 
  all_results[[3]] <- CI # all the CIs (all pairs of upper and lower bounds from all experiments)
  return(all_results)
}
  1. Do 1000 experiments, and see in what proportion of the experiment, the calculated CIs (based on the equation above) contain the true population mean (which we assume is 100).
set.seed(99)

# use the function CI_include() we defined above
# let's set the sample size to be 30 (ie. recruit 30 participants for each experiment)
CI_30 <- CI_include(1000, 30, 100, 15)
CI_30[[2]]
[1] 0.946
# In 94.6% of the 1000 experiments, we have calculated a CI that contains the true population mean (100). 

Therefore, 95% confidence is a confidence that in the long-run 95% of the CIs will include the population mean. It is a confidence in the algorithm and not a statement about a single, specific CI.

---
title: "PLINSTAT lecture 2: SEM and CI Explained by Simulation"
output:
  html_document: 
    toc: true
    toc_float: true
    code_download: true
params:
  flex: TRUE
---
  
```{r setup, include=FALSE}
#students: this is the set up chunk, it can be ignored
knitr::opts_chunk$set(warning=FALSE, message=FALSE, comment = "")
library(knitr)
library(tidyverse)
```

Author: Kayla Keyue Chen

Last updated: 2024.4.28

## Sampling Distribution of the Mean

The sampling distribution of a statistic is the distribution of that statistic when derived from a random sample of size n. 

* It may be considered as the distribution of the statistic for all possible samples from the same population of a given sample size.

* In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.

Let's see some simulations to understand the concept. 

We assume that IQ scores are normally distributed. Suppose it is true that the population have mean of 100 and SD of 15 (note: we wouldn't know this, if it's real-world research ... we just assume this for our simulation here). 

1. Now, we recruit 100 participants and assess their IQ. 

```{r}
# assess 100 participants and record their IQ scores
IQ1 <- rnorm(100, mean = 100, sd = 15)
# see the first 10 values 
IQ1[1:10]
# get the mean of the sample
(mean1 <- mean(IQ1))
```
2. This is only one sample. Let's have another four samples 
```{r}
# assess another 4 groups of 100 participants and record their IQ scores
IQ2 <- rnorm(100, mean = 100, sd = 15)
IQ3 <- rnorm(100, mean = 100, sd = 15)
IQ4 <- rnorm(100, mean = 100, sd = 15)
IQ5 <- rnorm(100, mean = 100, sd = 15)
# get the mean of the sample
mean2 <- mean(IQ2)
mean3 <- mean(IQ3)
mean4 <- mean(IQ4)
mean5 <- mean(IQ5)

# save all means into a vector
means <- c(mean1, mean2, mean3, mean4, mean5)

# calculate the mean of the sample means 
mean(means) # which is quite close to 100! 
```

3. Let's create a function so that we don't need to do it by hand each time! 

* n_exp = number of experiments
* sample_size = the sample size of each experiment
* mean and sd are the mean (100) and sd (15) of the population we assumed 

```{r}
# define a function called do_experiment
# it's like we are recruiting participants from the population, and for each experiment, we record the mean of IQ of the group of participants. 
do_experiment <- function(n_exp, sample_size, mean, sd){
  mean_list <- c()
    for (n in 1:n_exp){
      one_mean <- mean(rnorm(sample_size, mean, sd))
      mean_list <- c(mean_list, one_mean)
    }
  return(mean_list)
}

# Try 10 experiment, with each of sample size = 100. 
# Note again the population is assumed to have the mean of 100, and the sd of 15.
do_experiment(n_exp=10, sample_size=100, mean=100, sd=15) # mean IQ of the 10 experiments we did are shown below
```

4. Do three types of simulations listed below, calculate the mean and SD of the sample means, and plot the distribution of the means using histogram. 

* Do 1000 experiments. Each time we recruit 5 participants. 
* Do 1000 experiments. Each time we recruit 20 participants. 
* Do 1000 experiments. Each time we recruit 100 participants. 

```{r, fig.width=8, fig.height=4}
set.seed(99)

# Do 1000 experiments. Each time we recruit 5, 20, or 100 participants. 
results <- data.frame(size5 = do_experiment(1000, 5, 100, 15),
                      size20 = do_experiment(1000, 20, 100, 15),
                      size100 = do_experiment(1000, 100, 100, 15))

# show means from the first 6 experiments for each sample size
head(results)

# draw a histogram of means for each sample size
results_long <- results %>% gather(size, means, 1:3) %>% 
  mutate(size = factor(size, levels = c("size5","size20","size100")))
ggplot(results_long, aes(x = means)) + 
  geom_histogram(binwidth = 2, fill = "white", color = "black") + 
  xlim(70, 130) + 
  facet_wrap(~size)
```

What do you find about the spreading of the distributions? - The larger the sample size, the more spreading the distribution of sample means is. 

## Standard Error of the Mean (SEM)

Recall that $Standard\ Error\ of\ the\ Mean\ (SEM) = \frac{SD}{\sqrt{N}}$. 

Since the denominator is the square root of the sample size (N), the larger the sample size, the smaller the SEM. This is consistent with our observation above. The larger the sample size, the more spreading the distribution of sample means is! 

5. Let's calculate the SEM using the equation above, and compare to the SD of the distribution of sample means (we've got them from part 4).

```{r}
compare <- data.frame(size5 = c(sd(rnorm(5, 100, 15))/sqrt(5), sd(results$size5)),
                      size20 = c(sd(rnorm(20, 100, 15))/sqrt(20), sd(results$size20)),
                      size100 = c(sd(rnorm(100, 100, 15))/sqrt(100), sd(results$size100)))
row.names(compare) <- c("Calculated", "Simulated")
compare
```

Every time you run this chunk of code, the results would be slightly different for the "Calculated" SEM. This is because every time you run the code, you are recruiting new groups of participants. Nevertheless, you can see that the results are close to those in our previous simulations (in part 4) at least most of the time. They are closer for larger sample size, 20 and 100.

NB: For the purpose of replication, you can set seed by `set.seed()` to get the same results every time (as long as the seed is the same). This is what we did in some other code chunks. 

## 95% Confidence Interval (CI) of the Mean 

Confidence interval (CI) is a range of estimates for an unknown parameter. A confidence interval is computed at a designated confidence level, most commonly 95%.

The equation to calculate 95% CI is $CI = Mean\ \pm\ 1.96 \times SEM$

6. Let's define a function to do numerous experiments for us. 

* n_exp = number of experiments
* sample_size = the sample size of each experiment
* p_mean and p_sd are the mean and sd of the population we assumed 

The function will record the following information: 

* Whether CI includes the population mean (i.e., 100) in each experiment
* The proportion of experiments in which CIs include the population mean 
* All the CIs (all pairs of upper and lower bounds from all experiments)

```{r}
# define a function called CI_include
# it's like we are recruiting participants from the population, and for each experiment, we calculate CI of the mean of IQ. 
CI_include <- function(n_exp, sample_size, p_mean, p_sd){
  all_results <- c()
  include_list <- c()
  CI <- c()
    for (i in 1:n_exp){
      samples <- rnorm(sample_size, p_mean, p_sd)
      mean <- mean(samples)
      sd <- sd(samples)
      se <- sd/sqrt(sample_size)
      CI[[i]] <- c(mean - 1.96 * se, mean + 1.96 * se)
      if (CI[[i]][1] <= 100 & CI[[i]][2] >= 100){
        include_list <- c(include_list, TRUE)}
      else {
        include_list <- c(include_list, FALSE)}
    }
  all_results[[1]] <- include_list # whether CI includes the population mean (i.e., 100) in each experiment
  all_results[[2]] <- mean(include_list) # the proportion of experiments in which CIs include the population mean 
  all_results[[3]] <- CI # all the CIs (all pairs of upper and lower bounds from all experiments)
  return(all_results)
}
```

7. Do 1000 experiments, and see in what proportion of the experiment, the calculated CIs (based on the equation above) contain the true population mean (which we assume is 100). 

```{r}
set.seed(99)

# use the function CI_include() we defined above
# let's set the sample size to be 30 (ie. recruit 30 participants for each experiment)
CI_30 <- CI_include(1000, 30, 100, 15)
CI_30[[2]]
# In 94.6% of the 1000 experiments, we have calculated a CI that contains the true population mean (100). 
```
Therefore, 95% confidence is a confidence that in the long-run 95% of the CIs will include the population mean. It is a confidence in the algorithm and not a statement about a single, specific CI.
