Part 1: Import your dataset
The first thing when writing a script is to set your working
directory to make sure that R knows where to import and export (read and
save) your files.
We can read csv files by using function read.csv()
, the
default parameters of this functions include header = TRUE
(the first row will be set as column names), sep = ","
,
(the separator of different columns is a comma) and
fill = TRUE
(missing values will be set to NA). If your
file is not formatted this way, you should set your parameters
accordingly. For example, if your file doesn’t have a header, you should
set header = "FALSE"
. Our csv files have header and use “,”
as the separator.
If you have an xlsx file, you can use read.xlsx()
function from the openxlsx package. This package have many functions
that enable you to create and edit complex xlsx files (with all the
tables, equations, format and styles) without using Excel! But at least,
this package will be very handy if you want to save multiple dataframes
to different sheets in one xlsx file (We will come back to saving files
later).
If you need help for a function (what does it do, what arguments does
it need, etc.), you can search it in “Help” which should be located at
the 4th tab of the bottom-right panel.
# set working directory
#setwd(......)
# import your dataset from csv file
# "acoustic" contains measures of pitch and dispersion of each voice
acoustic <- read.csv("voice_acoustics.csv")
# "ratings" contains rating of trustworthiness of each voice by 28 participants
ratings <- read.csv("voice_ratings.csv")
# recap: how to install and load a package
# install.packages("openxlsx")
library(openxlsx)
# import your dataset from xlsx file
acoustic <- read.xlsx("voice_acoustics.xlsx")
Part 2: Overview your dataset
Have a look at your data
First, we can have a look at our data. Basically, we want to make
sure that R has successfully imported all the columns and rows and the
numbers and characters look right. See last week’s lab materials if you
don’t remember how to do this.
# show first 6 rows
head(acoustic)
VoiceID sex measures value
1 1 M Pitch 118.6140
2 1 M Dispersion 1061.1148
3 2 M Pitch 215.2936
4 2 M Dispersion 1023.9048
5 3 M Pitch 147.9080
6 3 M Dispersion 1043.0630
# show last 6 rows
tail(acoustic)
VoiceID sex measures value
123 62 F Pitch 151.3625
124 62 F Dispersion 1010.2170
125 63 F Pitch 181.8106
126 63 F Dispersion 1113.5282
127 64 F Pitch 159.6188
128 64 F Dispersion 1044.4419
# View the dataset in a new tab. NB: the 'V' in View is capitalised
#View(acoustic)
# get the number of rows
nrow(acoustic)
[1] 128
# get the number of columns
ncol(acoustic)
[1] 4
# get the number of rows and columns (the first number is row, the second is column)
dim(acoustic)
[1] 128 4
# show a summary of each column
# if it is numeric data, R will return mininum, Q1, median, mean, Q3, and maximum, and the number of NAs (if any)
summary(acoustic)
VoiceID sex measures value
Min. : 1.00 Length:128 Length:128 Min. : 104.4
1st Qu.:16.75 Class :character Class :character 1st Qu.: 140.2
Median :32.50 Mode :character Mode :character Median : 573.3
Mean :32.50 Mean : 604.8
3rd Qu.:48.25 3rd Qu.:1052.0
Max. :64.00 Max. :1213.7
Calculate descriptive statistics
We can calculate descriptive statistics (both central tendencies and
dispersion) by using R functions, but remember that these functions only
accept vectors but not dataframes as their argument so
we’ll first need to extract a column (vector) from the dataframe using
$
. The function names are shown below. They all have very
straightforward function names!
# mean - mean()
mean(acoustic$value)
[1] 604.8287
# median - median()
median(acoustic$value)
[1] 573.3268
# standard deviation - sd()
sd(acoustic$value)
[1] 468.0679
# interquartile range. - IQR(), NB: upper case!
IQR(acoustic$value)
[1] 911.855
# minimum - min()
min(acoustic$value)
[1] 104.4441
# maximum - max()
max(acoustic$value)
[1] 1213.74
We can summarise a character/factor vector (i.e., categorical
variable) with table()
function. It will return the number
of observations in each category. For example, there are two categories
in the sex column, M and F (male and female). table()
will
count how many observations are M and how many observations are F. HINT:
We still need to use $
.
# summarise the sex column
table(acoustic$sex)
F M
64 64
# summarise the measures column
table(acoustic$measures)
Dispersion Pitch
64 64
R does not have a built-in function for the mode so we will need to
create our own function to calculate the mode. Let’s create a “getmode”
function. You can create your own function by using the syntax
function_name <- function(arguments) {...some code here ...}
.
After you run the code, you can use the function just like other
functions we have used, i.e.,
function_name(arguments=...)
- Create a bimodal vector (anything you like). HINT: use the
c()
function
- Create a function called getmode (this is already provided)
- Use the function to get the mode of your vector
# create a bimodal vector
some_numbers <- c(1,1,2,3,4,5,6,7,7,8,8,9)
# create a function called getmode
getmode <- function(x) {
count <- table(x)
count[which(count == max(count))]
}
# use the function
getmode(some_numbers)
x
1 7 8
2 2 2
# The first row is the observations
# The second row is the number of corresponding observations
Part 3: Data manipulation
Wide and long data format
In wide data, each row represents an individual
case, with observations for that case in separate columns. For example,
a row contains all data from one participant. The first column is the
participant’s response to the first trial, the second column is response
to the second trial, so on and so forth.
In long data, each row represents a single
observation, and the observations are grouped together into cases based
on the value of a variable. For example, one participant occupies
multiple rows. The first row is the participant’s response to the first
trial, the second row is response to the second trial, so on and so
forth.
Now have a look at the ratings dataset. Is it in the wide or long
format? HINT: use head()
# Have a look at the ratings dataset
head(ratings)
VoiceID P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16 P17
1 1 7.0 7.5 5.5 4.5 5.0 4.0 5.5 5.0 3.0 4.0 3.5 2.5 3.0 4.0 7.0 6.5 3.5
2 2 8.0 7.5 7.5 6.5 3.5 4.0 5.5 6.0 5.0 6.5 7.5 7.0 5.0 8.0 6.0 7.5 4.0
3 3 7.5 7.0 6.5 8.0 7.0 5.5 7.0 6.0 5.5 5.0 6.0 5.5 6.0 3.0 6.5 7.5 4.5
4 4 6.0 6.5 4.0 5.5 5.0 4.0 3.0 3.5 4.0 3.5 1.5 4.0 4.5 3.5 4.5 5.0 4.5
5 5 4.0 5.0 4.0 3.5 3.0 4.0 4.0 4.5 3.5 4.0 4.0 3.5 2.5 2.5 6.0 5.5 3.5
6 6 5.0 6.0 2.5 4.0 4.5 3.5 3.5 3.5 2.5 2.5 2.5 4.5 4.5 2.5 5.0 6.0 4.5
P18 P19 P20 P21 P22 P23 P24 P25 P26 P27 P28
1 3.5 5.5 6.5 4.5 4.0 4.5 5.0 7.5 6.0 2.5 4.0
2 5.5 7.0 5.5 8.5 7.0 5.5 7.0 7.5 6.5 9.0 8.5
3 6.0 5.5 6.0 6.5 6.0 4.5 4.5 6.0 6.0 6.0 4.5
4 5.0 5.0 4.5 3.5 3.5 5.0 5.0 4.0 4.0 5.0 4.5
5 3.5 5.0 4.0 4.5 3.0 3.5 4.0 5.5 3.0 2.0 3.5
6 4.0 4.5 4.5 4.0 3.5 2.5 4.5 3.5 3.5 2.5 2.5
# It has a wide data format. Each row is a voice (identified by voice ID), and each column is an observation, i.e., the rating of the voice by one participant.
Then have a look at the acoustic dataset. Is it in the wide or long
format?
# Have a look at the acoustic dataset
head(acoustic)
VoiceID sex measures value
1 1 M Pitch 118.6140
2 1 M Dispersion 1061.1148
3 2 M Pitch 215.2936
4 2 M Dispersion 1023.9048
5 3 M Pitch 147.9080
6 3 M Dispersion 1043.0630
# It has a long data format. Each row is an observation. You can see that voice 1 has two rows: the first row is the measure of pitch, the second is dispersion.
We can use spread()
to transform a long format into a
wide format and use gather()
to transform a wide format
into a long format.
Use spread(dataframe, key, value)
to transform a long
format into a wide format. key=the column containing labels; value = the
column containing values
If we transform the acoustic dataset into a wide format, the key is
the column “measures” which contains the name of values (Pitch,
Dispersion). The value is the column “value” which contains the actual
measures in numbers (e.g., 119, 1061).
Use gather(dataframe, key, value, ...)
to transform a
wide format into a long format. key = a new column name for all old
column names, value = a new column name containing values, “…” should
include a vector of all old column names
If we transform the rating dataset into a long format, the key is a
new column name such as “participant” that we want to use to include
participants (P1, P2, etc.). The value is a new column name such as
“rating” that we want to use to include the actual ratings in numbers
(e.g., 7, 7.8, 8). “a vector of old columns to include” need to include
all columns from P1 to P28, you can write P1:P28
(given
that these columns are located consecutively without break in
between).
Now transform the acoustic dataset into a wide format, and then the
ratings dataset into a long format. Check if you have made the correct
transformation. What do the tranformed data look like?
# transform acoustic into a wide format
acoustic_wide <- spread(acoustic, key = measures, value = value)
# Have a look at the transformed dataset
head(acoustic_wide)
VoiceID sex Dispersion Pitch
1 1 M 1061.1148 118.6140
2 2 M 1023.9048 215.2936
3 3 M 1043.0630 147.9080
4 4 M 931.3600 104.4441
5 5 M 942.3216 106.2731
6 6 M 1049.7560 144.5150
# transform ratings into a long format
ratings_long <- gather(ratings, key = participant, value = rating, P1:P28)
# Have a look at the transformed dataset
head(ratings_long)
VoiceID participant rating
1 1 P1 7.0
2 2 P1 8.0
3 3 P1 7.5
4 4 P1 6.0
5 5 P1 4.0
6 6 P1 5.0
Group by and Summarise
Now we have each participants’ rating for each voice. How can we
calculate the average rating of each voice?
Group by: Do your operations within each category
specified in the group_by()
function (NB: this step only
specifies the groups, and nothing has done yet to these groups).
Summarise: Specify the column names and values to
give to the columns. The parameters take the format:
column_name = function(arguments, ...)
.
Operations/calculations will be done within each group specified by the
previous group_by()
function.
Pipe %>%
: A pipe allows you to pass
the result of a previous function as an argument to the next function
without saving it as an object. It’s like a production line!
We need to do this step by step and connect the steps by pipes, like
dataframe name <- dataframe name %>% group_by(some column name) %>% summarise(column name = function(arguments, ...))
.
Be careful! Which rating dataset do we need, wide or long format?
# calculate the average rating of each voice and save it to "rating_ave"
rating_ave <- ratings_long %>%
group_by(VoiceID) %>%
summarize(rating = mean(rating))
head(rating_ave)
# A tibble: 6 × 2
VoiceID rating
<int> <dbl>
1 1 4.80
2 2 6.52
3 3 5.91
4 4 4.34
5 5 3.88
6 6 3.80
Join two dataframe by inner_join
Now we have the average rating for each voice in the “rating_ave”
dataframe, and we have the measure of pitch and dispersion of each voice
in the “acoustic_wide” dataframe. To find out the relationship between
perceived trustworthiness (the rating scores) and acoustic features of
the voice (i.e., pitch and dispersion), we need to join the two
dataframe together.
We can join two dataframes by using the function
inner_join()
. It takes a parameter
by = the common column in the two dataframes
(the common
column can also be detected by R automatically). After this, we can
remove dataframes we don’t need by using rm()
.
HINT: inner_join(dataframe1, dataframe2, by="...")
.
Remember to save the combined dataframe as “data”.
data <- inner_join(acoustic_wide, rating_ave, by = "VoiceID")
head(data)
VoiceID sex Dispersion Pitch rating
1 1 M 1061.1148 118.6140 4.803571
2 2 M 1023.9048 215.2936 6.517857
3 3 M 1043.0630 147.9080 5.910714
4 4 M 931.3600 104.4441 4.339286
5 5 M 942.3216 106.2731 3.875000
6 6 M 1049.7560 144.5150 3.803571
# we can remove dataframes we don't need any more
rm(acoustic_wide, ratings, ratings_long)
Calculate Mean, SD, SEM/SE, and CI.
We can get basic descriptive statics by using summary()
,
but it doesn’t calculate Standard Error of the Mean (SEM) (or simply SE)
or Confidence Interval of the mean (CI). We can calculate SE and CI
ourselves based on the equations (review the lecture slides for the
equations!)
Let’s calculate these statistics for each sec (male and female) and
each measures (pitch and dispersion).
- We have all data combined in one dataframe called “data”, but it is
in a wide format. To make things easier, we first transform the data to
a long format.
- Then we use
group_by()
and summarise()
as
we’ve practiced before. This time, we calculate more statistics,
including
Mean = ..., SD = ..., SE = ..., CI_lower = ..., CI_upper = ...
HINT: n()
can be used to count the number of
observations in a given group.
# to make things easier, we first change the data to a long format
data_long <- data %>%
gather(key = measures, value = value, Dispersion:rating)
# summarise by sex and measures, and save the results to a dataframe called "summary"
summary <- data_long %>%
group_by(sex, measures) %>%
summarise(Mean = mean(value),
SD = sd(value),
SE = SD/sqrt(n()),
CI_upper = Mean + 1.96 * SE,
CI_lower = Mean - 1.96 * SE)
Sometimes we may want to use a subset of the data. For example, if
you’re interested in female voices but not males, you will need to
filter out all the observations of males. We can achieve this by
filter()
or subset()
.
filter(dataframe, logical operations)
subset(dataframe, logical operations)
You may already note that the parameters for the two functions are
the same, and they do the same thing.
Logical operations return TRUE or FALSE. For example, if we write
sex == "M"
(sex equals M), all observations of “M” will
return TRUE, and “F” will return FALSE. Other operations include: not
equal to !=
, more than >
, less than
<
. You can also use &
(and) and
|
(or) to make more complex logical operations.
# using filter to filter out males in the summary dataframe
sum_f <- filter(summary, sex == "F")
sum_f
# A tibble: 3 × 7
# Groups: sex [1]
sex measures Mean SD SE CI_upper CI_lower
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 F Dispersion 1059. 62.7 11.1 1081. 1037.
2 F Pitch 141. 27.8 4.92 151. 132.
3 F rating 5.37 0.685 0.121 5.60 5.13
# using subset to subset female voices in the summary dataframe
sum_f2 <- subset(summary, sex == "F")
sum_f2
# A tibble: 3 × 7
# Groups: sex [1]
sex measures Mean SD SE CI_upper CI_lower
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 F Dispersion 1059. 62.7 11.1 1081. 1037.
2 F Pitch 141. 27.8 4.92 151. 132.
3 F rating 5.37 0.685 0.121 5.60 5.13
Part 4: Data visualisation
We can use ggplot()
function to create all the plots we
need. This function is included in the ggplot package which is
specialized in data visualisation, but because the ggplot package is
included in tidyverse, we don’t need to install and load ggplot once we
have installed and load tidyverse.
The general syntax for ggplot is
ggplot(dataframe_name, aes(x = , y = , ...)) + geom_xxx() + ....
.
Inside the aes()
is the variable on x and y axis and your
grouping variable (if any). The geom_xxx()
specifies which
plot to create, e.g., geom_boxplot()
(boxplot),
geom_bar()
(barplot), etc.
For the following practice, think carefully which dataframe to use,
and what should be put on the x and y axis.
Boxplot
A boxplot is specified by geom_boxplot()
.
# Create a boxplot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) +
geom_boxplot() +
# specify x and y axis title and the title of the plot
labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")
Violin plot
A violin plot is specified by geom_violin()
.
# (with points) Create a violin plot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) +
geom_violin() +
# adding points on a violin, adding random horizontal shift, changing the shape of the points
geom_point(position = position_jitter(width=0.1), shape = 1) +
# specify x and y axis title and the title of the plot
labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")
# (with boxplot) Create a violin plot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) +
geom_violin() +
# adding box and whiskers, changing transparency
geom_boxplot(alpha = 0.5) +
# specify x and y axis title and the title of the plot
labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")
Histogram
A histogram is specified by geom_histogram()
.
# Create a histogram to show the distribution of Dispersion
ggplot(data, aes(x = Dispersion)) +
# changing binwith, changing fill and color
geom_histogram(binwidth = 10, fill = "white", color = "black") +
# specify x and y axis title and the title of the plot
labs(x = "Dispersion", y = "Frequency", title = "Histogram of Dispersion")
Density plot
A density plot is specified by geom_density()
.
# Create a density plot to show the distribution of Dispersion
ggplot(data, aes(x = Dispersion)) +
# changing fill and color
geom_density(fill = "white", color = "black") +
# specify x and y axis title and the title of the plot
labs(x = "Dispersion", y = "Density", title = "Density of Dispersion")
How to plot histogram and density together when their y axis is in
different scale? In order to overlay a kernel density estimate over a
histogram in ggplot2 you will need to pass
aes(y = ..density..)
to geom_histogram
and add
geom_density
.
# Create a histogram to show the distribution of Dispersion, and overlay the density
ggplot(data, aes(x = Dispersion)) +
# adding a histogram(use aes(y = ..density..)!), changing color and fill
geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
# adding density
geom_density() +
# specify x and y axis title and the title of the plot
labs(x = "Dispersion", y = "Frequency", title = "Histogram of Dispersion")
Scatter plot
A scatter plot is specified by geom_point()
.
If we want to show the relationship between dispersion and
trustworthiness for male and female voices separately, we can specify
color = sex
in the aes()
to show male and
female voices in the same plot but with different color. Alternatively,
we can add + facet_wrap(~sex)
to draw two separate plots
for male and female voices.
We can use + geom_smooth(method = "lm")
to add a linear
regression line.
# Create a scatter plot to reflect the relationship between dispersion and trustworthiness
# (separate plot) Group by sex
ggplot(data, aes(x = Dispersion, y = rating)) +
geom_point() +
# adding a regression line
geom_smooth(method = "lm") +
# specify x and y axis title and the title of the plot
labs(x = "Dispersion", y = "Perceived trustworthiness") +
# group by sex (draw separate plot for sex)
facet_wrap(~sex)
# (different color in one plot) Group by sex
ggplot(data, aes(x = Dispersion, y = rating, color = sex)) +
geom_point() +
# adding a regression line
geom_smooth(method = "lm") +
# specify x and y axis title and the title of the plot
labs(x = "Dispersion", y = "Perceived trustworthiness") +
# change the color
scale_color_manual(values = c("black", "blue")) +
# change theme
theme_bw()
Bar plot
A bar plot is specified by geom_bar()
. A bar plot will
count the number of observations in different categories, but people
usually use bar plot to show mean and CI (which is called error bar, we
can add with + geom_errorbar(aes(ymin = ..., ymax = ...))
.
For this purpose, we will need to specify a parameter
stat = "identity"
. Alternatively, we can use
geom_col()
, which is equivalent to
geom_bar(stat = "identity")
.
NB: We have created a summary dataframe with mean and CI of each
measure for male and female voices. We’ll need to filter each measure a
time.
# create a bar plot with mean and CI of Dispersion for male and female
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean)) +
# adding the bar, changing fill and color
geom_bar(stat = "identity", fill = "white", color = "black") +
# adding error bar, changing the width
geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.5) +
# change y axis range
ylim(0, 1300) +
# specify x and y axis title and the title of the plot
labs(x = "Sex", y = "Mean Dispersion", title = "")
# it's the same to use geom_col, no need to specify stat = "identity" then
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean)) +
# adding the bar, changing fill and color
geom_col(fill = "white", color = "black") +
# adding error bar, changing the width
geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.5) +
# change y axis range
ylim(0, 1500) +
# specify x and y axis title and the title of the plot
labs(x = "Sex", y = "Mean Dispersion", title = "")
line plot
A line plot is specified by geom_line()
. It’s also used
to show the mean and CI with error bars. HINT:
+ geom_errorbar(aes(ymin = ..., ymax = ...))
NB: We have created a summary dataframe with mean and CI of each
measure for male and female voices. We’ll need to filter each measure a
time.
# create a line plot with mean and CI of Dispersion and group by sex
# If we don't have a second factor, we should specify group = 1. Otherwise, group = factor name
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean, group = 1)) +
# adding point to show the mean
geom_point() +
# adding a line to connect the mean
geom_line() +
# adding error bars to show CI
geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.4) +
# change the range of y axis
ylim(1000, 1130) +
# specify x and y axis title and the title of the plot
labs(x = "Sex", y = "Mean Dispersion", title = "")
# What if you want to plot Dispersion, Pitch and rating all in one?
# We can specify group = measures (and not filtering measures == "Dispersion")
# If you want to give different colors to different measures, specify color = measures
ggplot(summary, aes(x = sex, y = Mean, group = measures, color = measures)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.4) +
# specify x and y axis title and the title of the plot
labs(x = "Sex", y = "Mean Dispersion", title = "")
Because the scales of the measures are so different (about 1000, 150,
or 5!), this plot is quite confusing, thus not a good choice. This is
just used to illustrate how to use the group=...
argument
in aes()
to plot separate lines for different groups.
Save dataframes and plots to your folder
We can use write.csv(dataframe, filename)
to save a
dataframe to csv file, or write.xlsx(dataframe, filename)
(from the openxlsx library) to save as xlsx file. If output directory is
not specified, It will save the file to the working directory.
NB: By default, in write.csv()
the argument
row.names = TRUE
will assign a number to each row starting
from 1 (excluding the header row). If you don’t want this, remember to
set row.names = FALSE
.
write.csv(summary, "summary_3_measures.csv", row.names = FALSE)
write.xlsx(summary, "summary_3_measures.csv")
You can also use functions in the openxlsx package to save dataframes
to multiple sheets in one xlsx file.
- Create a work book
createWorkbook()
.
- Add a sheet to the work book and name the sheet
addWorksheet()
.
- Add a dataframe to a specified sheet
writeData()
.
- Save the work book to your computer by a given name
saveWorkbook()
.
# first we create a work book
wb <- createWorkbook()
# add a sheet to the work book, and name the sheet
addWorksheet(wb, "data_joined")
addWorksheet(wb, "summary")
# add a dataframe to the sheet. You need to specify the sheet name and the dataframe name
writeData(wb, "data_joined", data)
writeData(wb, "summary", summary)
# save the work book
saveWorkbook(wb, "joined_data_summary.xlsx", overwrite = TRUE)
You can save the most recent plot by ggsave(filename)
.
You can choose from many common file types, such as pdf, png, and
jpg.
# Let's create a plot
ggplot(data, aes(x = sex, y = Dispersion)) +
geom_boxplot() +
labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")
# save it by ggsave (pdf is recommended)
ggsave("dispersion_line.pdf")
ggsave("dispersion_line.png")
ggsave("dispersion_line.jpg")
Part 5: Normality check
Normality check is an important step before the a statistical test
(along with other assumption checking).
We’ve learnt four ways to check the normality of your data, and they
are often used together.
- Check normality visually with histogram (please see above).
ggplot(data, aes(x = Dispersion)) +
geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
geom_density()
- Check normality visually with Normal Q-Q plot. Normal Q-Q Plots
compare your data set’s probability distribution to the normal
distribution. If they are similar enough, you will see the data points
form a nice straight line (or approximate one). We use
qqnorm(dataframe$colum)
to create dots and
qqnorm(dataframe$colum)
to add the line.
# check the normality of Dispersion
qqnorm(data$Dispersion)
qqline(data$Dispersion)
- Check normality numerically (Skewness & Kurtosis values). There
are different functions for this depending on the packages you use.
Using the ‘psych’ package, we can check skewness and kurtosis with
function
describe(dataframe$colum)
and look for the
skew
and kurtosis
columns. This function also
returns many other descriptive statistics.
# install and load the psych package
# install.packages("psych")
library(psych)
# check descriptive statistics using describe()
describe(data$Dispersion)
vars n mean sd median trimmed mad min max range skew
X1 1 64 1068.22 67.96 1054.32 1067.25 68.31 931.36 1213.74 282.38 0.2
kurtosis se
X1 -0.82 8.5
# what's the Skewness and Kurtosis values?
# skew: 0.2; kurtosis -0.82
- Check normality statistically (Shapiro-Wilk Test). Shapiro-Wilk Test
compares your distribution to the normal distribution (like a Q-Q plot,
but with a p-value output). Remember the null hypothesis is there is no
significant difference between the distribution of this data set and the
normal distribution (i.e. this data set is normally distributed). If p
> .05, the data is normally distributed because we can’t reject the
null hypothesis. Use
shapiro.test(dataframe$colum)
to run
the test.
# run the Shapiro-Wilk Test
shapiro.test(data$Dispersion)
Shapiro-Wilk normality test
data: data$Dispersion
W = 0.97459, p-value = 0.2086
# what's the p value?
# p = 0.2086, not significant, so the distribution is normal!
---
title: "PLINSTAT lab 2: Data manipulation and visualisation"
output:
  html_document: 
    toc: true
    toc_float: true
    code_download: true
params:
  flex: TRUE
---

```{r setup, include=FALSE}
#students: this is the set up chunk, it can be ignored
knitr::opts_chunk$set(warning=FALSE, message=FALSE, comment = "")
library(knitr)
library(tidyverse)
library(openxlsx)
library(psych)
```

Author: Kayla Keyue Chen

Last updated: 2024.4.28

## Learning objectives 

* Import and export your dataset: Learn how to import your dataset from csv and xlsx files and export dataframe object to csv and xlsx files. 

* Overview your dataset: A recap of functions learnt last week (in your practice!): `head()`, `tail()`, `summary()`, `View()`, `ncol()`, `nrow()`, `dim()`. Learn some functions to calculate descriptive statistics: `mean()`, `median()`, `sd()`, `IQR()`, `max()`, `min()`, `table()` (and we will need to create our own function to calculate the mode). 

* Data manipulation: Learn how to use `gather()`, `spread()`, `inner_join()`, `filter()` from the tidyverse package and `subset()` which is a base R function to manipulate dataframes, and compute descriptive statistics such as mean, standard deviation, and standard error and confidence interval by ourselves using `group_by()` and `summarise()`. We will learn how to write multiple lines within a pipe using `%>%` in tidyverse. 

* Data visualisation: Learn how to use `ggplot()` in ggplot2 package (included in the tidyverse) to create a variety of plots: boxplot, violin plot, histogram, density plot, scatter plot, bar plot, and line plot.

* Normality check: Get familiar with four ways to check whether your data is normally distributed. (1) Histogram. (2) Normal Q-Q Plot. (3) Skewness and Kurtosis values. (4) Shapiro-Wilk Test.

## About the dataset

Today we will be working with real data to explore the question of whether there is a relationship between voice acoustics and ratings of perceived trustworthiness.

* **Pitch** (or Fundamental Frequency or F0) is a measure of the vibration of the vocal chords in Hertz (Hz); males have on average a lower pitch than females for example. 

* **Dispersion** (measured again in Hz) is effectively a measure of the length of someone’s vocal tract (or neck). Height and neck length are suggested to be negatively correlated with formant dispersion, so tall people tend to have smaller formant dispersion.

More recently, work has focussed on what the sound of your voice suggests about your personality. [McAleer, Todorov and Belin (2014)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090779) suggested that vocal acoustics give a perception of your trustworthiness and dominance to others, regardless of whether or not it is accurate. 

Today we will look at data describing male and female voices, their **pitch** and **dispersion**, and associated measures of **perceived trustworthiness**.

## Part 1: Import your dataset

The first thing when writing a script is to set your working directory to make sure that R knows where to import and export (read and save) your files. 

We can read csv files by using function `read.csv()`, the default parameters of this functions include `header = TRUE` (the first row will be set as column names), `sep = ","`, (the separator of different columns is a comma) and `fill = TRUE` (missing values will be set to NA). If your file is not formatted this way, you should set your parameters accordingly. For example, if your file doesn't have a header, you should set `header = "FALSE"`. Our csv files have header and use "," as the separator.

If you have an xlsx file, you can use `read.xlsx()` function from the openxlsx package. This package have many functions that enable you to create and edit complex xlsx files (with all the tables, equations, format and styles) without using Excel! But at least, this package will be very handy if you want to save multiple dataframes to different sheets in one xlsx file (We will come back to saving files later).

If you need help for a function (what does it do, what arguments does it need, etc.), you can search it in "Help" which should be located at the 4th tab of the bottom-right panel.  

```{r}
# set working directory
#setwd(......)

# import your dataset from csv file
# "acoustic" contains measures of pitch and dispersion of each voice
acoustic <- read.csv("voice_acoustics.csv")
# "ratings" contains rating of trustworthiness of each voice by 28 participants
ratings <- read.csv("voice_ratings.csv")

# recap: how to install and load a package 
# install.packages("openxlsx")
library(openxlsx) 

# import your dataset from xlsx file
acoustic <- read.xlsx("voice_acoustics.xlsx")
```

## Part 2: Overview your dataset

### Have a look at your data

First, we can have a look at our data. Basically, we want to make sure that R has successfully imported all the columns and rows and the numbers and characters look right. See last week's lab materials if you don't remember how to do this. 

```{r}
# show first 6 rows
head(acoustic)
# show last 6 rows
tail(acoustic)
# View the dataset in a new tab. NB: the 'V' in View is capitalised
#View(acoustic) 
# get the number of rows
nrow(acoustic)
# get the number of columns
ncol(acoustic)
# get the number of rows and columns (the first number is row, the second is column)
dim(acoustic)
# show a summary of each column
# if it is numeric data, R will return mininum, Q1, median, mean, Q3, and maximum, and the number of NAs (if any)
summary(acoustic)
```

### Calculate descriptive statistics

We can calculate descriptive statistics (both central tendencies and dispersion) by using R functions, but remember that these functions only accept **vectors** but not dataframes as their argument so we'll first need to extract a column (vector) from the dataframe using `$`. The function names are shown below. They all have very straightforward function names!

```{r}
# mean - mean()
mean(acoustic$value)
# median - median()
median(acoustic$value)
# standard deviation - sd()
sd(acoustic$value)
# interquartile range. - IQR(), NB: upper case!
IQR(acoustic$value)
# minimum - min()
min(acoustic$value)
# maximum - max()
max(acoustic$value)
```

We can summarise a character/factor vector (i.e., categorical variable) with `table()` function. It will return the number of observations in each category. For example, there are two categories in the sex column, M and F (male and female). `table()` will count how many observations are M and how many observations are F. HINT: We still need to use `$`.

```{r}
# summarise the sex column
table(acoustic$sex)

# summarise the measures column
table(acoustic$measures)
```

R does not have a built-in function for the mode so we will need to create our own function to calculate the mode. Let's create a "getmode" function. You can create your own function by using the syntax `function_name <- function(arguments) {...some code here ...}`. After you run the code, you can use the function just like other functions we have used, i.e., `function_name(arguments=...)`

* Create a bimodal vector (anything you like). HINT: use the `c()` function
* Create a function called getmode (this is already provided)
* Use the function to get the mode of your vector

```{r}
# create a bimodal vector
some_numbers <- c(1,1,2,3,4,5,6,7,7,8,8,9)

# create a function called getmode
getmode <- function(x) {
  count <- table(x)
  count[which(count == max(count))]
}

# use the function
getmode(some_numbers)
# The first row is the observations
# The second row is the number of corresponding observations 
```

## Part 3: Data manipulation

### Wide and long data format 

In **wide** data, each row represents an individual case, with observations for that case in separate columns. For example, a row contains all data from one participant. The first column is the participant's response to the first trial, the second column is response to the second trial, so on and so forth. 

In **long** data, each row represents a single observation, and the observations are grouped together into cases based on the value of a variable. For example, one participant occupies multiple rows. The first row is the participant's response to the first trial, the second row is response to the second trial, so on and so forth.  

Now have a look at the ratings dataset. Is it in the wide or long format? HINT: use `head()`

```{r}
# Have a look at the ratings dataset
head(ratings)
# It has a wide data format. Each row is a voice (identified by voice ID), and each column is an observation, i.e., the rating of the voice by one participant. 
```

Then have a look at the acoustic dataset. Is it in the wide or long format? 

```{r}
# Have a look at the acoustic dataset
head(acoustic)
# It has a long data format. Each row is an observation. You can see that voice 1 has two rows: the first row is the measure of pitch, the second is dispersion. 
```

We can use `spread()` to transform a long format into a wide format and use `gather()` to transform a wide format into a long format. 

Use `spread(dataframe, key, value)` to transform a long format into a wide format. key=the column containing labels; value = the column containing values

If we transform the acoustic dataset into a wide format, the key is the column "measures" which contains the name of values (Pitch, Dispersion). The value is the column "value" which contains the actual measures in numbers (e.g., 119, 1061). 

Use `gather(dataframe, key, value, ...)` to transform a wide format into a long format. key = a new column name for all old column names, value = a new column name containing values, "..." should include a vector of all old column names

If we transform the rating dataset into a long format, the key is a new column name such as "participant" that we want to use to include participants (P1, P2, etc.). The value is a new column name such as "rating" that we want to use to include the actual ratings in numbers (e.g., 7, 7.8, 8). "a vector of old columns to include" need to include all columns from P1 to P28, you can write `P1:P28` (given that these columns are located consecutively without break in between). 

Now transform the acoustic dataset into a wide format, and then the ratings dataset into a long format. Check if you have made the correct transformation. What do the tranformed data look like? 

```{r}
# transform acoustic into a wide format
acoustic_wide <- spread(acoustic, key = measures, value = value)
# Have a look at the transformed dataset
head(acoustic_wide)

# transform ratings into a long format
ratings_long <- gather(ratings, key = participant, value = rating, P1:P28)
# Have a look at the transformed dataset
head(ratings_long)
```

### Group by and Summarise

Now we have each participants' rating for each voice. How can we calculate the average rating of each voice? 

**Group by**: Do your operations within each category specified in the `group_by()` function (NB: this step only specifies the groups, and nothing has done yet to these groups).

**Summarise**: Specify the column names and values to give to the columns. The parameters take the format: `column_name = function(arguments, ...)`. Operations/calculations will be done within each group specified by the previous `group_by()` function. 

**Pipe `%>%`**: A pipe allows you to pass the result of a previous function as an argument to the next function without saving it as an object. It's like a production line! 

We need to do this step by step and connect the steps by pipes, like `dataframe name <- dataframe name %>% group_by(some column name) %>% summarise(column name = function(arguments, ...))`. 

Be careful! Which rating dataset do we need, wide or long format? 

```{r}
# calculate the average rating of each voice and save it to "rating_ave"
rating_ave <- ratings_long %>% 
  group_by(VoiceID) %>% 
  summarize(rating = mean(rating))
head(rating_ave)
```

### Join two dataframe by inner_join

Now we have the average rating for each voice in the "rating_ave" dataframe, and we have the measure of pitch and dispersion of each voice in the "acoustic_wide" dataframe. To find out the relationship between perceived trustworthiness (the rating scores) and acoustic features of the voice (i.e., pitch and dispersion), we need to join the two dataframe together. 

We can join two dataframes by using the function `inner_join()`. It takes a parameter `by = the common column in the two dataframes` (the common column can also be detected by R automatically). After this, we can remove dataframes we don't need by using `rm()`. 

HINT: `inner_join(dataframe1, dataframe2, by="...")`. Remember to save the combined dataframe as "data". 

```{r}
data <- inner_join(acoustic_wide, rating_ave, by = "VoiceID")
head(data)

# we can remove dataframes we don't need any more
rm(acoustic_wide, ratings, ratings_long)
```

### Calculate Mean, SD, SEM/SE, and CI. 

We can get basic descriptive statics by using `summary()`, but it doesn't calculate Standard Error of the Mean (SEM) (or simply SE) or Confidence Interval of the mean (CI). We can calculate SE and CI ourselves based on the equations (review the lecture slides for the equations!)

Let's calculate these statistics for each sec (male and female) and each measures (pitch and dispersion).

* We have all data combined in one dataframe called "data", but it is in a wide format. To make things easier, we first transform the data to a long format. 
* Then we use `group_by()` and `summarise()` as we've practiced before. This time, we calculate more statistics, including `Mean = ..., SD = ..., SE = ..., CI_lower = ..., CI_upper = ...`

HINT: `n()` can be used to count the number of observations in a given group. 

```{r}
# to make things easier, we first change the data to a long format
data_long <- data %>%
  gather(key = measures, value = value, Dispersion:rating)

# summarise by sex and measures, and save the results to a dataframe called "summary"
summary <- data_long %>%
  group_by(sex, measures) %>%
  summarise(Mean = mean(value),
            SD = sd(value),
            SE = SD/sqrt(n()),
            CI_upper = Mean + 1.96 * SE,
            CI_lower = Mean - 1.96 * SE)
```

Sometimes we may want to use a subset of the data. For example, if you're interested in female voices but not males, you will need to filter out all the observations of males. We can achieve this by `filter()` or `subset()`. 

``filter(dataframe, logical operations)``

``subset(dataframe, logical operations)``

You may already note that the parameters for the two functions are the same, and they do the same thing. 

Logical operations return TRUE or FALSE. For example, if we write `sex == "M"` (sex equals M), all observations of "M" will return TRUE, and "F" will return FALSE. Other operations include: not equal to `!=`, more than `>`, less than `<`. You can also use `&` (and) and `|` (or) to make more complex logical operations. 

```{r}
# using filter to filter out males in the summary dataframe
sum_f <- filter(summary, sex == "F")
sum_f

# using subset to subset female voices in the summary dataframe
sum_f2 <- subset(summary, sex == "F")
sum_f2
```




## Part 4: Data visualisation

We can use `ggplot()` function to create all the plots we need. This function is included in the ggplot package which is specialized in data visualisation, but because the ggplot package is included in tidyverse, we don't need to install and load ggplot once we have installed and load tidyverse. 

The general syntax for ggplot is `ggplot(dataframe_name, aes(x = , y = , ...)) + geom_xxx() + ....`. Inside the `aes()` is the variable on x and y axis and your grouping variable (if any). The `geom_xxx()` specifies which plot to create, e.g., `geom_boxplot()` (boxplot), `geom_bar()` (barplot), etc. 

For the following practice, think carefully which dataframe to use, and what should be put on the x and y axis. 

### Boxplot 

A boxplot is specified by `geom_boxplot()`. 

```{r}
# Create a boxplot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_boxplot() + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex") 
```

### Violin plot

A violin plot is specified by `geom_violin()`. 

```{r}
# (with points) Create a violin plot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_violin() + 
  # adding points on a violin, adding random horizontal shift, changing the shape of the points
  geom_point(position = position_jitter(width=0.1), shape = 1) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")

# (with boxplot) Create a violin plot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_violin() + 
  # adding box and whiskers, changing transparency 
  geom_boxplot(alpha = 0.5) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")
```


### Histogram

A histogram is specified by `geom_histogram()`. 

```{r}
# Create a histogram to show the distribution of Dispersion
ggplot(data, aes(x = Dispersion)) +
  # changing binwith, changing fill and color 
  geom_histogram(binwidth = 10, fill = "white", color = "black") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Frequency", title = "Histogram of Dispersion")
```

### Density plot

A density plot is specified by `geom_density()`. 

```{r}
# Create a density plot to show the distribution of Dispersion
ggplot(data, aes(x = Dispersion)) +
  # changing fill and color 
  geom_density(fill = "white", color = "black") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Density", title = "Density of Dispersion")
```

How to plot histogram and density together when their y axis is in different scale? In order to overlay a kernel density estimate over a histogram in ggplot2 you will need to pass `aes(y = ..density..)` to `geom_histogram` and add `geom_density`.

```{r}
# Create a histogram to show the distribution of Dispersion, and overlay the density
ggplot(data, aes(x = Dispersion)) + 
  # adding a histogram(use aes(y = ..density..)!), changing color and fill
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  # adding density
  geom_density() + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Frequency", title = "Histogram of Dispersion")
```

### Scatter plot

A scatter plot is specified by `geom_point()`. 

If we want to show the relationship between dispersion and trustworthiness for male and female voices separately, we can specify `color = sex` in the `aes()` to show male and female voices in the same plot but with different color. Alternatively, we can add `+ facet_wrap(~sex)` to draw two separate plots for male and female voices.

We can use `+ geom_smooth(method = "lm")` to add a linear regression line. 

```{r}
# Create a scatter plot to reflect the relationship between dispersion and trustworthiness
# (separate plot) Group by sex
ggplot(data, aes(x = Dispersion, y = rating)) + 
  geom_point() +
  # adding a regression line
  geom_smooth(method = "lm") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Perceived trustworthiness") + 
  # group by sex (draw separate plot for sex)
  facet_wrap(~sex)

# (different color in one plot) Group by sex
ggplot(data, aes(x = Dispersion, y = rating, color = sex)) + 
  geom_point() +
  # adding a regression line
  geom_smooth(method = "lm") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Perceived trustworthiness") + 
  # change the color
  scale_color_manual(values = c("black", "blue")) +
  # change theme
  theme_bw()
```

### Bar plot

A bar plot is specified by `geom_bar()`. A bar plot will count the number of observations in different categories, but people usually use bar plot to show mean and CI (which is called error bar, we can add with `+ geom_errorbar(aes(ymin = ..., ymax = ...))`. For this purpose, we will need to specify a parameter `stat = "identity"`. Alternatively, we can use `geom_col()`, which is equivalent to `geom_bar(stat = "identity")`.

NB: We have created a summary dataframe with mean and CI of each measure for male and female voices. We'll need to filter each measure a time. 

```{r, fig.width=4, fig.height=6}
# create a bar plot with mean and CI of Dispersion for male and female
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean)) + 
  # adding the bar, changing fill and color
  geom_bar(stat = "identity", fill = "white", color = "black") + 
  # adding error bar, changing the width
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.5) + 
  # change y axis range
  ylim(0, 1300) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")
```

```{r, eval=FALSE}
# it's the same to use geom_col, no need to specify stat = "identity" then
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean)) + 
  # adding the bar, changing fill and color
  geom_col(fill = "white", color = "black") + 
  # adding error bar, changing the width
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.5) + 
  # change y axis range
  ylim(0, 1500) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")
```


### line plot

A line plot is specified by `geom_line()`. It's also used to show the mean and CI with error bars. HINT: `+ geom_errorbar(aes(ymin = ..., ymax = ...))`

NB: We have created a summary dataframe with mean and CI of each measure for male and female voices. We'll need to filter each measure a time. 

```{r, fig.width=4, fig.height=6}
# create a line plot with mean and CI of Dispersion and group by sex
# If we don't have a second factor, we should specify group = 1. Otherwise, group = factor name
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean, group = 1)) + 
  # adding point to show the mean
  geom_point() + 
  # adding a line to connect the mean
  geom_line() + 
  # adding error bars to show CI
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.4) + 
  # change the range of y axis
  ylim(1000, 1130) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")
```

```{r, fig.width=4, fig.height=10}
# What if you want to plot Dispersion, Pitch and rating all in one?
# We can specify group = measures (and not filtering measures == "Dispersion")
# If you want to give different colors to different measures, specify color = measures
ggplot(summary, aes(x = sex, y = Mean, group = measures, color = measures)) + 
  geom_point() + 
  geom_line() + 
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.4) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")
```


Because the scales of the measures are so different (about 1000, 150, or 5!), this plot is quite confusing, thus not a good choice. This is just used to illustrate how to use the `group=...` argument in `aes()` to plot separate lines for different groups. 

### Save dataframes and plots to your folder

We can use `write.csv(dataframe, filename)` to save a dataframe to csv file, or `write.xlsx(dataframe, filename)` (from the openxlsx library) to save as xlsx file. If output directory is not specified, It will save the file to the working directory. 

NB: By default, in `write.csv()` the argument `row.names = TRUE` will assign a number to each row starting from 1 (excluding the header row). If you don't want this, remember to set `row.names = FALSE`.

```{r, eval=FALSE}
write.csv(summary, "summary_3_measures.csv", row.names = FALSE)
write.xlsx(summary, "summary_3_measures.csv")
```

You can also use functions in the openxlsx package to save dataframes to multiple sheets in one xlsx file. 

1. Create a work book `createWorkbook()`.
2. Add a sheet to the work book and name the sheet `addWorksheet()`.
3. Add a dataframe to a specified sheet `writeData()`. 
4. Save the work book to your computer by a given name `saveWorkbook()`. 

```{r}
# first we create a work book
wb <- createWorkbook()
# add a sheet to the work book, and name the sheet
addWorksheet(wb, "data_joined")
addWorksheet(wb, "summary")
# add a dataframe to the sheet. You need to specify the sheet name and the dataframe name
writeData(wb, "data_joined", data)
writeData(wb, "summary", summary)
# save the work book
saveWorkbook(wb, "joined_data_summary.xlsx", overwrite = TRUE)
```

You can save the most recent plot by `ggsave(filename)`. You can choose from many common file types, such as pdf, png, and jpg. 

```{r, eval=FALSE}
# Let's create a plot
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_boxplot() + 
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")

# save it by ggsave (pdf is recommended)
ggsave("dispersion_line.pdf")
ggsave("dispersion_line.png")
ggsave("dispersion_line.jpg")
```


## Part 5: Normality check

Normality check is an important step before the a statistical test (along with other assumption checking). 

We've learnt four ways to check the normality of your data, and they are often used together. 

1. Check normality visually with histogram (please see above). 

```{r}
ggplot(data, aes(x = Dispersion)) + 
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density()
```

2. Check normality visually with Normal Q-Q plot. Normal Q-Q Plots compare your data set’s probability distribution to the normal distribution. If they are similar enough, you will see the data points form a nice straight line (or approximate one). We use `qqnorm(dataframe$colum)` to create dots and `qqnorm(dataframe$colum)` to add the line. 

```{r}
# check the normality of Dispersion
qqnorm(data$Dispersion)
qqline(data$Dispersion)
```

3. Check normality numerically (Skewness & Kurtosis values). There are different functions for this depending on the packages you use. Using the 'psych' package, we can check skewness and kurtosis with function `describe(dataframe$colum)` and look for the `skew` and `kurtosis` columns. This function also returns many other descriptive statistics. 

```{r}
# install and load the psych package
# install.packages("psych")
library(psych)

# check descriptive statistics using describe()
describe(data$Dispersion)
# what's the Skewness and Kurtosis values?
# skew: 0.2; kurtosis -0.82
```

4. Check normality statistically (Shapiro-Wilk Test). Shapiro-Wilk Test compares your distribution to the normal distribution (like a Q-Q plot, but with a p-value output). Remember the null hypothesis is there is no significant difference between the distribution of this data set and the normal distribution (i.e. this data set is normally distributed). If p > .05, the data is normally distributed because we can't reject the null hypothesis. Use `shapiro.test(dataframe$colum)` to run the test. 

```{r}
# run the Shapiro-Wilk Test
shapiro.test(data$Dispersion)
# what's the p value? 
# p = 0.2086, not significant, so the distribution is normal! 
```

