Author: Kayla Keyue Chen

Last updated: 2024.4.28

Learning objectives

  • Import and export your dataset: Learn how to import your dataset from csv and xlsx files and export dataframe object to csv and xlsx files.

  • Overview your dataset: A recap of functions learnt last week (in your practice!): head(), tail(), summary(), View(), ncol(), nrow(), dim(). Learn some functions to calculate descriptive statistics: mean(), median(), sd(), IQR(), max(), min(), table() (and we will need to create our own function to calculate the mode).

  • Data manipulation: Learn how to use gather(), spread(), inner_join(), filter() from the tidyverse package and subset() which is a base R function to manipulate dataframes, and compute descriptive statistics such as mean, standard deviation, and standard error and confidence interval by ourselves using group_by() and summarise(). We will learn how to write multiple lines within a pipe using %>% in tidyverse.

  • Data visualisation: Learn how to use ggplot() in ggplot2 package (included in the tidyverse) to create a variety of plots: boxplot, violin plot, histogram, density plot, scatter plot, bar plot, and line plot.

  • Normality check: Get familiar with four ways to check whether your data is normally distributed. (1) Histogram. (2) Normal Q-Q Plot. (3) Skewness and Kurtosis values. (4) Shapiro-Wilk Test.

About the dataset

Today we will be working with real data to explore the question of whether there is a relationship between voice acoustics and ratings of perceived trustworthiness.

  • Pitch (or Fundamental Frequency or F0) is a measure of the vibration of the vocal chords in Hertz (Hz); males have on average a lower pitch than females for example.

  • Dispersion (measured again in Hz) is effectively a measure of the length of someone’s vocal tract (or neck). Height and neck length are suggested to be negatively correlated with formant dispersion, so tall people tend to have smaller formant dispersion.

More recently, work has focussed on what the sound of your voice suggests about your personality. McAleer, Todorov and Belin (2014) suggested that vocal acoustics give a perception of your trustworthiness and dominance to others, regardless of whether or not it is accurate.

Today we will look at data describing male and female voices, their pitch and dispersion, and associated measures of perceived trustworthiness.

Part 1: Import your dataset

The first thing when writing a script is to set your working directory to make sure that R knows where to import and export (read and save) your files.

We can read csv files by using function read.csv(), the default parameters of this functions include header = TRUE (the first row will be set as column names), sep = ",", (the separator of different columns is a comma) and fill = TRUE (missing values will be set to NA). If your file is not formatted this way, you should set your parameters accordingly. For example, if your file doesn’t have a header, you should set header = "FALSE". Our csv files have header and use “,” as the separator.

If you have an xlsx file, you can use read.xlsx() function from the openxlsx package. This package have many functions that enable you to create and edit complex xlsx files (with all the tables, equations, format and styles) without using Excel! But at least, this package will be very handy if you want to save multiple dataframes to different sheets in one xlsx file (We will come back to saving files later).

If you need help for a function (what does it do, what arguments does it need, etc.), you can search it in “Help” which should be located at the 4th tab of the bottom-right panel.

# set working directory
#setwd(......)

# import your dataset from csv file
# "acoustic" contains measures of pitch and dispersion of each voice
acoustic <- read.csv("voice_acoustics.csv")
# "ratings" contains rating of trustworthiness of each voice by 28 participants
ratings <- read.csv("voice_ratings.csv")

# recap: how to install and load a package 
# install.packages("openxlsx")
library(openxlsx) 

# import your dataset from xlsx file
acoustic <- read.xlsx("voice_acoustics.xlsx")

Part 2: Overview your dataset

Have a look at your data

First, we can have a look at our data. Basically, we want to make sure that R has successfully imported all the columns and rows and the numbers and characters look right. See last week’s lab materials if you don’t remember how to do this.

# show first 6 rows
head(acoustic)
  VoiceID sex   measures     value
1       1   M      Pitch  118.6140
2       1   M Dispersion 1061.1148
3       2   M      Pitch  215.2936
4       2   M Dispersion 1023.9048
5       3   M      Pitch  147.9080
6       3   M Dispersion 1043.0630
# show last 6 rows
tail(acoustic)
    VoiceID sex   measures     value
123      62   F      Pitch  151.3625
124      62   F Dispersion 1010.2170
125      63   F      Pitch  181.8106
126      63   F Dispersion 1113.5282
127      64   F      Pitch  159.6188
128      64   F Dispersion 1044.4419
# View the dataset in a new tab. NB: the 'V' in View is capitalised
#View(acoustic) 
# get the number of rows
nrow(acoustic)
[1] 128
# get the number of columns
ncol(acoustic)
[1] 4
# get the number of rows and columns (the first number is row, the second is column)
dim(acoustic)
[1] 128   4
# show a summary of each column
# if it is numeric data, R will return mininum, Q1, median, mean, Q3, and maximum, and the number of NAs (if any)
summary(acoustic)
    VoiceID          sex              measures             value       
 Min.   : 1.00   Length:128         Length:128         Min.   : 104.4  
 1st Qu.:16.75   Class :character   Class :character   1st Qu.: 140.2  
 Median :32.50   Mode  :character   Mode  :character   Median : 573.3  
 Mean   :32.50                                         Mean   : 604.8  
 3rd Qu.:48.25                                         3rd Qu.:1052.0  
 Max.   :64.00                                         Max.   :1213.7  

Calculate descriptive statistics

We can calculate descriptive statistics (both central tendencies and dispersion) by using R functions, but remember that these functions only accept vectors but not dataframes as their argument so we’ll first need to extract a column (vector) from the dataframe using $. The function names are shown below. They all have very straightforward function names!

# mean - mean()
mean(acoustic$value)
[1] 604.8287
# median - median()
median(acoustic$value)
[1] 573.3268
# standard deviation - sd()
sd(acoustic$value)
[1] 468.0679
# interquartile range. - IQR(), NB: upper case!
IQR(acoustic$value)
[1] 911.855
# minimum - min()
min(acoustic$value)
[1] 104.4441
# maximum - max()
max(acoustic$value)
[1] 1213.74

We can summarise a character/factor vector (i.e., categorical variable) with table() function. It will return the number of observations in each category. For example, there are two categories in the sex column, M and F (male and female). table() will count how many observations are M and how many observations are F. HINT: We still need to use $.

# summarise the sex column
table(acoustic$sex)

 F  M 
64 64 
# summarise the measures column
table(acoustic$measures)

Dispersion      Pitch 
        64         64 

R does not have a built-in function for the mode so we will need to create our own function to calculate the mode. Let’s create a “getmode” function. You can create your own function by using the syntax function_name <- function(arguments) {...some code here ...}. After you run the code, you can use the function just like other functions we have used, i.e., function_name(arguments=...)

  • Create a bimodal vector (anything you like). HINT: use the c() function
  • Create a function called getmode (this is already provided)
  • Use the function to get the mode of your vector
# create a bimodal vector
some_numbers <- c(1,1,2,3,4,5,6,7,7,8,8,9)

# create a function called getmode
getmode <- function(x) {
  count <- table(x)
  count[which(count == max(count))]
}

# use the function
getmode(some_numbers)
x
1 7 8 
2 2 2 
# The first row is the observations
# The second row is the number of corresponding observations 

Part 3: Data manipulation

Wide and long data format

In wide data, each row represents an individual case, with observations for that case in separate columns. For example, a row contains all data from one participant. The first column is the participant’s response to the first trial, the second column is response to the second trial, so on and so forth.

In long data, each row represents a single observation, and the observations are grouped together into cases based on the value of a variable. For example, one participant occupies multiple rows. The first row is the participant’s response to the first trial, the second row is response to the second trial, so on and so forth.

Now have a look at the ratings dataset. Is it in the wide or long format? HINT: use head()

# Have a look at the ratings dataset
head(ratings)
  VoiceID  P1  P2  P3  P4  P5  P6  P7  P8  P9 P10 P11 P12 P13 P14 P15 P16 P17
1       1 7.0 7.5 5.5 4.5 5.0 4.0 5.5 5.0 3.0 4.0 3.5 2.5 3.0 4.0 7.0 6.5 3.5
2       2 8.0 7.5 7.5 6.5 3.5 4.0 5.5 6.0 5.0 6.5 7.5 7.0 5.0 8.0 6.0 7.5 4.0
3       3 7.5 7.0 6.5 8.0 7.0 5.5 7.0 6.0 5.5 5.0 6.0 5.5 6.0 3.0 6.5 7.5 4.5
4       4 6.0 6.5 4.0 5.5 5.0 4.0 3.0 3.5 4.0 3.5 1.5 4.0 4.5 3.5 4.5 5.0 4.5
5       5 4.0 5.0 4.0 3.5 3.0 4.0 4.0 4.5 3.5 4.0 4.0 3.5 2.5 2.5 6.0 5.5 3.5
6       6 5.0 6.0 2.5 4.0 4.5 3.5 3.5 3.5 2.5 2.5 2.5 4.5 4.5 2.5 5.0 6.0 4.5
  P18 P19 P20 P21 P22 P23 P24 P25 P26 P27 P28
1 3.5 5.5 6.5 4.5 4.0 4.5 5.0 7.5 6.0 2.5 4.0
2 5.5 7.0 5.5 8.5 7.0 5.5 7.0 7.5 6.5 9.0 8.5
3 6.0 5.5 6.0 6.5 6.0 4.5 4.5 6.0 6.0 6.0 4.5
4 5.0 5.0 4.5 3.5 3.5 5.0 5.0 4.0 4.0 5.0 4.5
5 3.5 5.0 4.0 4.5 3.0 3.5 4.0 5.5 3.0 2.0 3.5
6 4.0 4.5 4.5 4.0 3.5 2.5 4.5 3.5 3.5 2.5 2.5
# It has a wide data format. Each row is a voice (identified by voice ID), and each column is an observation, i.e., the rating of the voice by one participant. 

Then have a look at the acoustic dataset. Is it in the wide or long format?

# Have a look at the acoustic dataset
head(acoustic)
  VoiceID sex   measures     value
1       1   M      Pitch  118.6140
2       1   M Dispersion 1061.1148
3       2   M      Pitch  215.2936
4       2   M Dispersion 1023.9048
5       3   M      Pitch  147.9080
6       3   M Dispersion 1043.0630
# It has a long data format. Each row is an observation. You can see that voice 1 has two rows: the first row is the measure of pitch, the second is dispersion. 

We can use spread() to transform a long format into a wide format and use gather() to transform a wide format into a long format.

Use spread(dataframe, key, value) to transform a long format into a wide format. key=the column containing labels; value = the column containing values

If we transform the acoustic dataset into a wide format, the key is the column “measures” which contains the name of values (Pitch, Dispersion). The value is the column “value” which contains the actual measures in numbers (e.g., 119, 1061).

Use gather(dataframe, key, value, ...) to transform a wide format into a long format. key = a new column name for all old column names, value = a new column name containing values, “…” should include a vector of all old column names

If we transform the rating dataset into a long format, the key is a new column name such as “participant” that we want to use to include participants (P1, P2, etc.). The value is a new column name such as “rating” that we want to use to include the actual ratings in numbers (e.g., 7, 7.8, 8). “a vector of old columns to include” need to include all columns from P1 to P28, you can write P1:P28 (given that these columns are located consecutively without break in between).

Now transform the acoustic dataset into a wide format, and then the ratings dataset into a long format. Check if you have made the correct transformation. What do the tranformed data look like?

# transform acoustic into a wide format
acoustic_wide <- spread(acoustic, key = measures, value = value)
# Have a look at the transformed dataset
head(acoustic_wide)
  VoiceID sex Dispersion    Pitch
1       1   M  1061.1148 118.6140
2       2   M  1023.9048 215.2936
3       3   M  1043.0630 147.9080
4       4   M   931.3600 104.4441
5       5   M   942.3216 106.2731
6       6   M  1049.7560 144.5150
# transform ratings into a long format
ratings_long <- gather(ratings, key = participant, value = rating, P1:P28)
# Have a look at the transformed dataset
head(ratings_long)
  VoiceID participant rating
1       1          P1    7.0
2       2          P1    8.0
3       3          P1    7.5
4       4          P1    6.0
5       5          P1    4.0
6       6          P1    5.0

Group by and Summarise

Now we have each participants’ rating for each voice. How can we calculate the average rating of each voice?

Group by: Do your operations within each category specified in the group_by() function (NB: this step only specifies the groups, and nothing has done yet to these groups).

Summarise: Specify the column names and values to give to the columns. The parameters take the format: column_name = function(arguments, ...). Operations/calculations will be done within each group specified by the previous group_by() function.

Pipe %>%: A pipe allows you to pass the result of a previous function as an argument to the next function without saving it as an object. It’s like a production line!

We need to do this step by step and connect the steps by pipes, like dataframe name <- dataframe name %>% group_by(some column name) %>% summarise(column name = function(arguments, ...)).

Be careful! Which rating dataset do we need, wide or long format?

# calculate the average rating of each voice and save it to "rating_ave"
rating_ave <- ratings_long %>% 
  group_by(VoiceID) %>% 
  summarize(rating = mean(rating))
head(rating_ave)
# A tibble: 6 × 2
  VoiceID rating
    <int>  <dbl>
1       1   4.80
2       2   6.52
3       3   5.91
4       4   4.34
5       5   3.88
6       6   3.80

Join two dataframe by inner_join

Now we have the average rating for each voice in the “rating_ave” dataframe, and we have the measure of pitch and dispersion of each voice in the “acoustic_wide” dataframe. To find out the relationship between perceived trustworthiness (the rating scores) and acoustic features of the voice (i.e., pitch and dispersion), we need to join the two dataframe together.

We can join two dataframes by using the function inner_join(). It takes a parameter by = the common column in the two dataframes (the common column can also be detected by R automatically). After this, we can remove dataframes we don’t need by using rm().

HINT: inner_join(dataframe1, dataframe2, by="..."). Remember to save the combined dataframe as “data”.

data <- inner_join(acoustic_wide, rating_ave, by = "VoiceID")
head(data)
  VoiceID sex Dispersion    Pitch   rating
1       1   M  1061.1148 118.6140 4.803571
2       2   M  1023.9048 215.2936 6.517857
3       3   M  1043.0630 147.9080 5.910714
4       4   M   931.3600 104.4441 4.339286
5       5   M   942.3216 106.2731 3.875000
6       6   M  1049.7560 144.5150 3.803571
# we can remove dataframes we don't need any more
rm(acoustic_wide, ratings, ratings_long)

Calculate Mean, SD, SEM/SE, and CI.

We can get basic descriptive statics by using summary(), but it doesn’t calculate Standard Error of the Mean (SEM) (or simply SE) or Confidence Interval of the mean (CI). We can calculate SE and CI ourselves based on the equations (review the lecture slides for the equations!)

Let’s calculate these statistics for each sec (male and female) and each measures (pitch and dispersion).

  • We have all data combined in one dataframe called “data”, but it is in a wide format. To make things easier, we first transform the data to a long format.
  • Then we use group_by() and summarise() as we’ve practiced before. This time, we calculate more statistics, including Mean = ..., SD = ..., SE = ..., CI_lower = ..., CI_upper = ...

HINT: n() can be used to count the number of observations in a given group.

# to make things easier, we first change the data to a long format
data_long <- data %>%
  gather(key = measures, value = value, Dispersion:rating)

# summarise by sex and measures, and save the results to a dataframe called "summary"
summary <- data_long %>%
  group_by(sex, measures) %>%
  summarise(Mean = mean(value),
            SD = sd(value),
            SE = SD/sqrt(n()),
            CI_upper = Mean + 1.96 * SE,
            CI_lower = Mean - 1.96 * SE)

Sometimes we may want to use a subset of the data. For example, if you’re interested in female voices but not males, you will need to filter out all the observations of males. We can achieve this by filter() or subset().

filter(dataframe, logical operations)

subset(dataframe, logical operations)

You may already note that the parameters for the two functions are the same, and they do the same thing.

Logical operations return TRUE or FALSE. For example, if we write sex == "M" (sex equals M), all observations of “M” will return TRUE, and “F” will return FALSE. Other operations include: not equal to !=, more than >, less than <. You can also use & (and) and | (or) to make more complex logical operations.

# using filter to filter out males in the summary dataframe
sum_f <- filter(summary, sex == "F")
sum_f
# A tibble: 3 × 7
# Groups:   sex [1]
  sex   measures      Mean     SD     SE CI_upper CI_lower
  <chr> <chr>        <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1 F     Dispersion 1059.   62.7   11.1    1081.    1037.  
2 F     Pitch       141.   27.8    4.92    151.     132.  
3 F     rating        5.37  0.685  0.121     5.60     5.13
# using subset to subset female voices in the summary dataframe
sum_f2 <- subset(summary, sex == "F")
sum_f2
# A tibble: 3 × 7
# Groups:   sex [1]
  sex   measures      Mean     SD     SE CI_upper CI_lower
  <chr> <chr>        <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1 F     Dispersion 1059.   62.7   11.1    1081.    1037.  
2 F     Pitch       141.   27.8    4.92    151.     132.  
3 F     rating        5.37  0.685  0.121     5.60     5.13

Part 4: Data visualisation

We can use ggplot() function to create all the plots we need. This function is included in the ggplot package which is specialized in data visualisation, but because the ggplot package is included in tidyverse, we don’t need to install and load ggplot once we have installed and load tidyverse.

The general syntax for ggplot is ggplot(dataframe_name, aes(x = , y = , ...)) + geom_xxx() + ..... Inside the aes() is the variable on x and y axis and your grouping variable (if any). The geom_xxx() specifies which plot to create, e.g., geom_boxplot() (boxplot), geom_bar() (barplot), etc.

For the following practice, think carefully which dataframe to use, and what should be put on the x and y axis.

Boxplot

A boxplot is specified by geom_boxplot().

# Create a boxplot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_boxplot() + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex") 

Violin plot

A violin plot is specified by geom_violin().

# (with points) Create a violin plot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_violin() + 
  # adding points on a violin, adding random horizontal shift, changing the shape of the points
  geom_point(position = position_jitter(width=0.1), shape = 1) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")

# (with boxplot) Create a violin plot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_violin() + 
  # adding box and whiskers, changing transparency 
  geom_boxplot(alpha = 0.5) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")

Histogram

A histogram is specified by geom_histogram().

# Create a histogram to show the distribution of Dispersion
ggplot(data, aes(x = Dispersion)) +
  # changing binwith, changing fill and color 
  geom_histogram(binwidth = 10, fill = "white", color = "black") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Frequency", title = "Histogram of Dispersion")

Density plot

A density plot is specified by geom_density().

# Create a density plot to show the distribution of Dispersion
ggplot(data, aes(x = Dispersion)) +
  # changing fill and color 
  geom_density(fill = "white", color = "black") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Density", title = "Density of Dispersion")

How to plot histogram and density together when their y axis is in different scale? In order to overlay a kernel density estimate over a histogram in ggplot2 you will need to pass aes(y = ..density..) to geom_histogram and add geom_density.

# Create a histogram to show the distribution of Dispersion, and overlay the density
ggplot(data, aes(x = Dispersion)) + 
  # adding a histogram(use aes(y = ..density..)!), changing color and fill
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  # adding density
  geom_density() + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Frequency", title = "Histogram of Dispersion")

Scatter plot

A scatter plot is specified by geom_point().

If we want to show the relationship between dispersion and trustworthiness for male and female voices separately, we can specify color = sex in the aes() to show male and female voices in the same plot but with different color. Alternatively, we can add + facet_wrap(~sex) to draw two separate plots for male and female voices.

We can use + geom_smooth(method = "lm") to add a linear regression line.

# Create a scatter plot to reflect the relationship between dispersion and trustworthiness
# (separate plot) Group by sex
ggplot(data, aes(x = Dispersion, y = rating)) + 
  geom_point() +
  # adding a regression line
  geom_smooth(method = "lm") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Perceived trustworthiness") + 
  # group by sex (draw separate plot for sex)
  facet_wrap(~sex)

# (different color in one plot) Group by sex
ggplot(data, aes(x = Dispersion, y = rating, color = sex)) + 
  geom_point() +
  # adding a regression line
  geom_smooth(method = "lm") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Perceived trustworthiness") + 
  # change the color
  scale_color_manual(values = c("black", "blue")) +
  # change theme
  theme_bw()

Bar plot

A bar plot is specified by geom_bar(). A bar plot will count the number of observations in different categories, but people usually use bar plot to show mean and CI (which is called error bar, we can add with + geom_errorbar(aes(ymin = ..., ymax = ...)). For this purpose, we will need to specify a parameter stat = "identity". Alternatively, we can use geom_col(), which is equivalent to geom_bar(stat = "identity").

NB: We have created a summary dataframe with mean and CI of each measure for male and female voices. We’ll need to filter each measure a time.

# create a bar plot with mean and CI of Dispersion for male and female
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean)) + 
  # adding the bar, changing fill and color
  geom_bar(stat = "identity", fill = "white", color = "black") + 
  # adding error bar, changing the width
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.5) + 
  # change y axis range
  ylim(0, 1300) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")

# it's the same to use geom_col, no need to specify stat = "identity" then
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean)) + 
  # adding the bar, changing fill and color
  geom_col(fill = "white", color = "black") + 
  # adding error bar, changing the width
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.5) + 
  # change y axis range
  ylim(0, 1500) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")

line plot

A line plot is specified by geom_line(). It’s also used to show the mean and CI with error bars. HINT: + geom_errorbar(aes(ymin = ..., ymax = ...))

NB: We have created a summary dataframe with mean and CI of each measure for male and female voices. We’ll need to filter each measure a time.

# create a line plot with mean and CI of Dispersion and group by sex
# If we don't have a second factor, we should specify group = 1. Otherwise, group = factor name
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean, group = 1)) + 
  # adding point to show the mean
  geom_point() + 
  # adding a line to connect the mean
  geom_line() + 
  # adding error bars to show CI
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.4) + 
  # change the range of y axis
  ylim(1000, 1130) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")

# What if you want to plot Dispersion, Pitch and rating all in one?
# We can specify group = measures (and not filtering measures == "Dispersion")
# If you want to give different colors to different measures, specify color = measures
ggplot(summary, aes(x = sex, y = Mean, group = measures, color = measures)) + 
  geom_point() + 
  geom_line() + 
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.4) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")

Because the scales of the measures are so different (about 1000, 150, or 5!), this plot is quite confusing, thus not a good choice. This is just used to illustrate how to use the group=... argument in aes() to plot separate lines for different groups.

Save dataframes and plots to your folder

We can use write.csv(dataframe, filename) to save a dataframe to csv file, or write.xlsx(dataframe, filename) (from the openxlsx library) to save as xlsx file. If output directory is not specified, It will save the file to the working directory.

NB: By default, in write.csv() the argument row.names = TRUE will assign a number to each row starting from 1 (excluding the header row). If you don’t want this, remember to set row.names = FALSE.

write.csv(summary, "summary_3_measures.csv", row.names = FALSE)
write.xlsx(summary, "summary_3_measures.csv")

You can also use functions in the openxlsx package to save dataframes to multiple sheets in one xlsx file.

  1. Create a work book createWorkbook().
  2. Add a sheet to the work book and name the sheet addWorksheet().
  3. Add a dataframe to a specified sheet writeData().
  4. Save the work book to your computer by a given name saveWorkbook().
# first we create a work book
wb <- createWorkbook()
# add a sheet to the work book, and name the sheet
addWorksheet(wb, "data_joined")
addWorksheet(wb, "summary")
# add a dataframe to the sheet. You need to specify the sheet name and the dataframe name
writeData(wb, "data_joined", data)
writeData(wb, "summary", summary)
# save the work book
saveWorkbook(wb, "joined_data_summary.xlsx", overwrite = TRUE)

You can save the most recent plot by ggsave(filename). You can choose from many common file types, such as pdf, png, and jpg.

# Let's create a plot
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_boxplot() + 
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")

# save it by ggsave (pdf is recommended)
ggsave("dispersion_line.pdf")
ggsave("dispersion_line.png")
ggsave("dispersion_line.jpg")

Part 5: Normality check

Normality check is an important step before the a statistical test (along with other assumption checking).

We’ve learnt four ways to check the normality of your data, and they are often used together.

  1. Check normality visually with histogram (please see above).
ggplot(data, aes(x = Dispersion)) + 
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density()

  1. Check normality visually with Normal Q-Q plot. Normal Q-Q Plots compare your data set’s probability distribution to the normal distribution. If they are similar enough, you will see the data points form a nice straight line (or approximate one). We use qqnorm(dataframe$colum) to create dots and qqnorm(dataframe$colum) to add the line.
# check the normality of Dispersion
qqnorm(data$Dispersion)
qqline(data$Dispersion)

  1. Check normality numerically (Skewness & Kurtosis values). There are different functions for this depending on the packages you use. Using the ‘psych’ package, we can check skewness and kurtosis with function describe(dataframe$colum) and look for the skew and kurtosis columns. This function also returns many other descriptive statistics.
# install and load the psych package
# install.packages("psych")
library(psych)

# check descriptive statistics using describe()
describe(data$Dispersion)
   vars  n    mean    sd  median trimmed   mad    min     max  range skew
X1    1 64 1068.22 67.96 1054.32 1067.25 68.31 931.36 1213.74 282.38  0.2
   kurtosis  se
X1    -0.82 8.5
# what's the Skewness and Kurtosis values?
# skew: 0.2; kurtosis -0.82
  1. Check normality statistically (Shapiro-Wilk Test). Shapiro-Wilk Test compares your distribution to the normal distribution (like a Q-Q plot, but with a p-value output). Remember the null hypothesis is there is no significant difference between the distribution of this data set and the normal distribution (i.e. this data set is normally distributed). If p > .05, the data is normally distributed because we can’t reject the null hypothesis. Use shapiro.test(dataframe$colum) to run the test.
# run the Shapiro-Wilk Test
shapiro.test(data$Dispersion)

    Shapiro-Wilk normality test

data:  data$Dispersion
W = 0.97459, p-value = 0.2086
# what's the p value? 
# p = 0.2086, not significant, so the distribution is normal! 
---
title: "PLINSTAT lab 2: Data manipulation and visualisation"
output:
  html_document: 
    toc: true
    toc_float: true
    code_download: true
params:
  flex: TRUE
---

```{r setup, include=FALSE}
#students: this is the set up chunk, it can be ignored
knitr::opts_chunk$set(warning=FALSE, message=FALSE, comment = "")
library(knitr)
library(tidyverse)
library(openxlsx)
library(psych)
```

Author: Kayla Keyue Chen

Last updated: 2024.4.28

## Learning objectives 

* Import and export your dataset: Learn how to import your dataset from csv and xlsx files and export dataframe object to csv and xlsx files. 

* Overview your dataset: A recap of functions learnt last week (in your practice!): `head()`, `tail()`, `summary()`, `View()`, `ncol()`, `nrow()`, `dim()`. Learn some functions to calculate descriptive statistics: `mean()`, `median()`, `sd()`, `IQR()`, `max()`, `min()`, `table()` (and we will need to create our own function to calculate the mode). 

* Data manipulation: Learn how to use `gather()`, `spread()`, `inner_join()`, `filter()` from the tidyverse package and `subset()` which is a base R function to manipulate dataframes, and compute descriptive statistics such as mean, standard deviation, and standard error and confidence interval by ourselves using `group_by()` and `summarise()`. We will learn how to write multiple lines within a pipe using `%>%` in tidyverse. 

* Data visualisation: Learn how to use `ggplot()` in ggplot2 package (included in the tidyverse) to create a variety of plots: boxplot, violin plot, histogram, density plot, scatter plot, bar plot, and line plot.

* Normality check: Get familiar with four ways to check whether your data is normally distributed. (1) Histogram. (2) Normal Q-Q Plot. (3) Skewness and Kurtosis values. (4) Shapiro-Wilk Test.

## About the dataset

Today we will be working with real data to explore the question of whether there is a relationship between voice acoustics and ratings of perceived trustworthiness.

* **Pitch** (or Fundamental Frequency or F0) is a measure of the vibration of the vocal chords in Hertz (Hz); males have on average a lower pitch than females for example. 

* **Dispersion** (measured again in Hz) is effectively a measure of the length of someone’s vocal tract (or neck). Height and neck length are suggested to be negatively correlated with formant dispersion, so tall people tend to have smaller formant dispersion.

More recently, work has focussed on what the sound of your voice suggests about your personality. [McAleer, Todorov and Belin (2014)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090779) suggested that vocal acoustics give a perception of your trustworthiness and dominance to others, regardless of whether or not it is accurate. 

Today we will look at data describing male and female voices, their **pitch** and **dispersion**, and associated measures of **perceived trustworthiness**.

## Part 1: Import your dataset

The first thing when writing a script is to set your working directory to make sure that R knows where to import and export (read and save) your files. 

We can read csv files by using function `read.csv()`, the default parameters of this functions include `header = TRUE` (the first row will be set as column names), `sep = ","`, (the separator of different columns is a comma) and `fill = TRUE` (missing values will be set to NA). If your file is not formatted this way, you should set your parameters accordingly. For example, if your file doesn't have a header, you should set `header = "FALSE"`. Our csv files have header and use "," as the separator.

If you have an xlsx file, you can use `read.xlsx()` function from the openxlsx package. This package have many functions that enable you to create and edit complex xlsx files (with all the tables, equations, format and styles) without using Excel! But at least, this package will be very handy if you want to save multiple dataframes to different sheets in one xlsx file (We will come back to saving files later).

If you need help for a function (what does it do, what arguments does it need, etc.), you can search it in "Help" which should be located at the 4th tab of the bottom-right panel.  

```{r}
# set working directory
#setwd(......)

# import your dataset from csv file
# "acoustic" contains measures of pitch and dispersion of each voice
acoustic <- read.csv("voice_acoustics.csv")
# "ratings" contains rating of trustworthiness of each voice by 28 participants
ratings <- read.csv("voice_ratings.csv")

# recap: how to install and load a package 
# install.packages("openxlsx")
library(openxlsx) 

# import your dataset from xlsx file
acoustic <- read.xlsx("voice_acoustics.xlsx")
```

## Part 2: Overview your dataset

### Have a look at your data

First, we can have a look at our data. Basically, we want to make sure that R has successfully imported all the columns and rows and the numbers and characters look right. See last week's lab materials if you don't remember how to do this. 

```{r}
# show first 6 rows
head(acoustic)
# show last 6 rows
tail(acoustic)
# View the dataset in a new tab. NB: the 'V' in View is capitalised
#View(acoustic) 
# get the number of rows
nrow(acoustic)
# get the number of columns
ncol(acoustic)
# get the number of rows and columns (the first number is row, the second is column)
dim(acoustic)
# show a summary of each column
# if it is numeric data, R will return mininum, Q1, median, mean, Q3, and maximum, and the number of NAs (if any)
summary(acoustic)
```

### Calculate descriptive statistics

We can calculate descriptive statistics (both central tendencies and dispersion) by using R functions, but remember that these functions only accept **vectors** but not dataframes as their argument so we'll first need to extract a column (vector) from the dataframe using `$`. The function names are shown below. They all have very straightforward function names!

```{r}
# mean - mean()
mean(acoustic$value)
# median - median()
median(acoustic$value)
# standard deviation - sd()
sd(acoustic$value)
# interquartile range. - IQR(), NB: upper case!
IQR(acoustic$value)
# minimum - min()
min(acoustic$value)
# maximum - max()
max(acoustic$value)
```

We can summarise a character/factor vector (i.e., categorical variable) with `table()` function. It will return the number of observations in each category. For example, there are two categories in the sex column, M and F (male and female). `table()` will count how many observations are M and how many observations are F. HINT: We still need to use `$`.

```{r}
# summarise the sex column
table(acoustic$sex)

# summarise the measures column
table(acoustic$measures)
```

R does not have a built-in function for the mode so we will need to create our own function to calculate the mode. Let's create a "getmode" function. You can create your own function by using the syntax `function_name <- function(arguments) {...some code here ...}`. After you run the code, you can use the function just like other functions we have used, i.e., `function_name(arguments=...)`

* Create a bimodal vector (anything you like). HINT: use the `c()` function
* Create a function called getmode (this is already provided)
* Use the function to get the mode of your vector

```{r}
# create a bimodal vector
some_numbers <- c(1,1,2,3,4,5,6,7,7,8,8,9)

# create a function called getmode
getmode <- function(x) {
  count <- table(x)
  count[which(count == max(count))]
}

# use the function
getmode(some_numbers)
# The first row is the observations
# The second row is the number of corresponding observations 
```

## Part 3: Data manipulation

### Wide and long data format 

In **wide** data, each row represents an individual case, with observations for that case in separate columns. For example, a row contains all data from one participant. The first column is the participant's response to the first trial, the second column is response to the second trial, so on and so forth. 

In **long** data, each row represents a single observation, and the observations are grouped together into cases based on the value of a variable. For example, one participant occupies multiple rows. The first row is the participant's response to the first trial, the second row is response to the second trial, so on and so forth.  

Now have a look at the ratings dataset. Is it in the wide or long format? HINT: use `head()`

```{r}
# Have a look at the ratings dataset
head(ratings)
# It has a wide data format. Each row is a voice (identified by voice ID), and each column is an observation, i.e., the rating of the voice by one participant. 
```

Then have a look at the acoustic dataset. Is it in the wide or long format? 

```{r}
# Have a look at the acoustic dataset
head(acoustic)
# It has a long data format. Each row is an observation. You can see that voice 1 has two rows: the first row is the measure of pitch, the second is dispersion. 
```

We can use `spread()` to transform a long format into a wide format and use `gather()` to transform a wide format into a long format. 

Use `spread(dataframe, key, value)` to transform a long format into a wide format. key=the column containing labels; value = the column containing values

If we transform the acoustic dataset into a wide format, the key is the column "measures" which contains the name of values (Pitch, Dispersion). The value is the column "value" which contains the actual measures in numbers (e.g., 119, 1061). 

Use `gather(dataframe, key, value, ...)` to transform a wide format into a long format. key = a new column name for all old column names, value = a new column name containing values, "..." should include a vector of all old column names

If we transform the rating dataset into a long format, the key is a new column name such as "participant" that we want to use to include participants (P1, P2, etc.). The value is a new column name such as "rating" that we want to use to include the actual ratings in numbers (e.g., 7, 7.8, 8). "a vector of old columns to include" need to include all columns from P1 to P28, you can write `P1:P28` (given that these columns are located consecutively without break in between). 

Now transform the acoustic dataset into a wide format, and then the ratings dataset into a long format. Check if you have made the correct transformation. What do the tranformed data look like? 

```{r}
# transform acoustic into a wide format
acoustic_wide <- spread(acoustic, key = measures, value = value)
# Have a look at the transformed dataset
head(acoustic_wide)

# transform ratings into a long format
ratings_long <- gather(ratings, key = participant, value = rating, P1:P28)
# Have a look at the transformed dataset
head(ratings_long)
```

### Group by and Summarise

Now we have each participants' rating for each voice. How can we calculate the average rating of each voice? 

**Group by**: Do your operations within each category specified in the `group_by()` function (NB: this step only specifies the groups, and nothing has done yet to these groups).

**Summarise**: Specify the column names and values to give to the columns. The parameters take the format: `column_name = function(arguments, ...)`. Operations/calculations will be done within each group specified by the previous `group_by()` function. 

**Pipe `%>%`**: A pipe allows you to pass the result of a previous function as an argument to the next function without saving it as an object. It's like a production line! 

We need to do this step by step and connect the steps by pipes, like `dataframe name <- dataframe name %>% group_by(some column name) %>% summarise(column name = function(arguments, ...))`. 

Be careful! Which rating dataset do we need, wide or long format? 

```{r}
# calculate the average rating of each voice and save it to "rating_ave"
rating_ave <- ratings_long %>% 
  group_by(VoiceID) %>% 
  summarize(rating = mean(rating))
head(rating_ave)
```

### Join two dataframe by inner_join

Now we have the average rating for each voice in the "rating_ave" dataframe, and we have the measure of pitch and dispersion of each voice in the "acoustic_wide" dataframe. To find out the relationship between perceived trustworthiness (the rating scores) and acoustic features of the voice (i.e., pitch and dispersion), we need to join the two dataframe together. 

We can join two dataframes by using the function `inner_join()`. It takes a parameter `by = the common column in the two dataframes` (the common column can also be detected by R automatically). After this, we can remove dataframes we don't need by using `rm()`. 

HINT: `inner_join(dataframe1, dataframe2, by="...")`. Remember to save the combined dataframe as "data". 

```{r}
data <- inner_join(acoustic_wide, rating_ave, by = "VoiceID")
head(data)

# we can remove dataframes we don't need any more
rm(acoustic_wide, ratings, ratings_long)
```

### Calculate Mean, SD, SEM/SE, and CI. 

We can get basic descriptive statics by using `summary()`, but it doesn't calculate Standard Error of the Mean (SEM) (or simply SE) or Confidence Interval of the mean (CI). We can calculate SE and CI ourselves based on the equations (review the lecture slides for the equations!)

Let's calculate these statistics for each sec (male and female) and each measures (pitch and dispersion).

* We have all data combined in one dataframe called "data", but it is in a wide format. To make things easier, we first transform the data to a long format. 
* Then we use `group_by()` and `summarise()` as we've practiced before. This time, we calculate more statistics, including `Mean = ..., SD = ..., SE = ..., CI_lower = ..., CI_upper = ...`

HINT: `n()` can be used to count the number of observations in a given group. 

```{r}
# to make things easier, we first change the data to a long format
data_long <- data %>%
  gather(key = measures, value = value, Dispersion:rating)

# summarise by sex and measures, and save the results to a dataframe called "summary"
summary <- data_long %>%
  group_by(sex, measures) %>%
  summarise(Mean = mean(value),
            SD = sd(value),
            SE = SD/sqrt(n()),
            CI_upper = Mean + 1.96 * SE,
            CI_lower = Mean - 1.96 * SE)
```

Sometimes we may want to use a subset of the data. For example, if you're interested in female voices but not males, you will need to filter out all the observations of males. We can achieve this by `filter()` or `subset()`. 

``filter(dataframe, logical operations)``

``subset(dataframe, logical operations)``

You may already note that the parameters for the two functions are the same, and they do the same thing. 

Logical operations return TRUE or FALSE. For example, if we write `sex == "M"` (sex equals M), all observations of "M" will return TRUE, and "F" will return FALSE. Other operations include: not equal to `!=`, more than `>`, less than `<`. You can also use `&` (and) and `|` (or) to make more complex logical operations. 

```{r}
# using filter to filter out males in the summary dataframe
sum_f <- filter(summary, sex == "F")
sum_f

# using subset to subset female voices in the summary dataframe
sum_f2 <- subset(summary, sex == "F")
sum_f2
```




## Part 4: Data visualisation

We can use `ggplot()` function to create all the plots we need. This function is included in the ggplot package which is specialized in data visualisation, but because the ggplot package is included in tidyverse, we don't need to install and load ggplot once we have installed and load tidyverse. 

The general syntax for ggplot is `ggplot(dataframe_name, aes(x = , y = , ...)) + geom_xxx() + ....`. Inside the `aes()` is the variable on x and y axis and your grouping variable (if any). The `geom_xxx()` specifies which plot to create, e.g., `geom_boxplot()` (boxplot), `geom_bar()` (barplot), etc. 

For the following practice, think carefully which dataframe to use, and what should be put on the x and y axis. 

### Boxplot 

A boxplot is specified by `geom_boxplot()`. 

```{r}
# Create a boxplot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_boxplot() + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex") 
```

### Violin plot

A violin plot is specified by `geom_violin()`. 

```{r}
# (with points) Create a violin plot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_violin() + 
  # adding points on a violin, adding random horizontal shift, changing the shape of the points
  geom_point(position = position_jitter(width=0.1), shape = 1) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")

# (with boxplot) Create a violin plot to compare dispersion of voice betweem male and female
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_violin() + 
  # adding box and whiskers, changing transparency 
  geom_boxplot(alpha = 0.5) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")
```


### Histogram

A histogram is specified by `geom_histogram()`. 

```{r}
# Create a histogram to show the distribution of Dispersion
ggplot(data, aes(x = Dispersion)) +
  # changing binwith, changing fill and color 
  geom_histogram(binwidth = 10, fill = "white", color = "black") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Frequency", title = "Histogram of Dispersion")
```

### Density plot

A density plot is specified by `geom_density()`. 

```{r}
# Create a density plot to show the distribution of Dispersion
ggplot(data, aes(x = Dispersion)) +
  # changing fill and color 
  geom_density(fill = "white", color = "black") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Density", title = "Density of Dispersion")
```

How to plot histogram and density together when their y axis is in different scale? In order to overlay a kernel density estimate over a histogram in ggplot2 you will need to pass `aes(y = ..density..)` to `geom_histogram` and add `geom_density`.

```{r}
# Create a histogram to show the distribution of Dispersion, and overlay the density
ggplot(data, aes(x = Dispersion)) + 
  # adding a histogram(use aes(y = ..density..)!), changing color and fill
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  # adding density
  geom_density() + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Frequency", title = "Histogram of Dispersion")
```

### Scatter plot

A scatter plot is specified by `geom_point()`. 

If we want to show the relationship between dispersion and trustworthiness for male and female voices separately, we can specify `color = sex` in the `aes()` to show male and female voices in the same plot but with different color. Alternatively, we can add `+ facet_wrap(~sex)` to draw two separate plots for male and female voices.

We can use `+ geom_smooth(method = "lm")` to add a linear regression line. 

```{r}
# Create a scatter plot to reflect the relationship between dispersion and trustworthiness
# (separate plot) Group by sex
ggplot(data, aes(x = Dispersion, y = rating)) + 
  geom_point() +
  # adding a regression line
  geom_smooth(method = "lm") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Perceived trustworthiness") + 
  # group by sex (draw separate plot for sex)
  facet_wrap(~sex)

# (different color in one plot) Group by sex
ggplot(data, aes(x = Dispersion, y = rating, color = sex)) + 
  geom_point() +
  # adding a regression line
  geom_smooth(method = "lm") + 
  # specify x and y axis title and the title of the plot
  labs(x = "Dispersion", y = "Perceived trustworthiness") + 
  # change the color
  scale_color_manual(values = c("black", "blue")) +
  # change theme
  theme_bw()
```

### Bar plot

A bar plot is specified by `geom_bar()`. A bar plot will count the number of observations in different categories, but people usually use bar plot to show mean and CI (which is called error bar, we can add with `+ geom_errorbar(aes(ymin = ..., ymax = ...))`. For this purpose, we will need to specify a parameter `stat = "identity"`. Alternatively, we can use `geom_col()`, which is equivalent to `geom_bar(stat = "identity")`.

NB: We have created a summary dataframe with mean and CI of each measure for male and female voices. We'll need to filter each measure a time. 

```{r, fig.width=4, fig.height=6}
# create a bar plot with mean and CI of Dispersion for male and female
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean)) + 
  # adding the bar, changing fill and color
  geom_bar(stat = "identity", fill = "white", color = "black") + 
  # adding error bar, changing the width
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.5) + 
  # change y axis range
  ylim(0, 1300) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")
```

```{r, eval=FALSE}
# it's the same to use geom_col, no need to specify stat = "identity" then
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean)) + 
  # adding the bar, changing fill and color
  geom_col(fill = "white", color = "black") + 
  # adding error bar, changing the width
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.5) + 
  # change y axis range
  ylim(0, 1500) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")
```


### line plot

A line plot is specified by `geom_line()`. It's also used to show the mean and CI with error bars. HINT: `+ geom_errorbar(aes(ymin = ..., ymax = ...))`

NB: We have created a summary dataframe with mean and CI of each measure for male and female voices. We'll need to filter each measure a time. 

```{r, fig.width=4, fig.height=6}
# create a line plot with mean and CI of Dispersion and group by sex
# If we don't have a second factor, we should specify group = 1. Otherwise, group = factor name
ggplot(filter(summary, measures == "Dispersion"), aes(x = sex, y = Mean, group = 1)) + 
  # adding point to show the mean
  geom_point() + 
  # adding a line to connect the mean
  geom_line() + 
  # adding error bars to show CI
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.4) + 
  # change the range of y axis
  ylim(1000, 1130) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")
```

```{r, fig.width=4, fig.height=10}
# What if you want to plot Dispersion, Pitch and rating all in one?
# We can specify group = measures (and not filtering measures == "Dispersion")
# If you want to give different colors to different measures, specify color = measures
ggplot(summary, aes(x = sex, y = Mean, group = measures, color = measures)) + 
  geom_point() + 
  geom_line() + 
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.4) + 
  # specify x and y axis title and the title of the plot
  labs(x = "Sex", y = "Mean Dispersion", title = "")
```


Because the scales of the measures are so different (about 1000, 150, or 5!), this plot is quite confusing, thus not a good choice. This is just used to illustrate how to use the `group=...` argument in `aes()` to plot separate lines for different groups. 

### Save dataframes and plots to your folder

We can use `write.csv(dataframe, filename)` to save a dataframe to csv file, or `write.xlsx(dataframe, filename)` (from the openxlsx library) to save as xlsx file. If output directory is not specified, It will save the file to the working directory. 

NB: By default, in `write.csv()` the argument `row.names = TRUE` will assign a number to each row starting from 1 (excluding the header row). If you don't want this, remember to set `row.names = FALSE`.

```{r, eval=FALSE}
write.csv(summary, "summary_3_measures.csv", row.names = FALSE)
write.xlsx(summary, "summary_3_measures.csv")
```

You can also use functions in the openxlsx package to save dataframes to multiple sheets in one xlsx file. 

1. Create a work book `createWorkbook()`.
2. Add a sheet to the work book and name the sheet `addWorksheet()`.
3. Add a dataframe to a specified sheet `writeData()`. 
4. Save the work book to your computer by a given name `saveWorkbook()`. 

```{r}
# first we create a work book
wb <- createWorkbook()
# add a sheet to the work book, and name the sheet
addWorksheet(wb, "data_joined")
addWorksheet(wb, "summary")
# add a dataframe to the sheet. You need to specify the sheet name and the dataframe name
writeData(wb, "data_joined", data)
writeData(wb, "summary", summary)
# save the work book
saveWorkbook(wb, "joined_data_summary.xlsx", overwrite = TRUE)
```

You can save the most recent plot by `ggsave(filename)`. You can choose from many common file types, such as pdf, png, and jpg. 

```{r, eval=FALSE}
# Let's create a plot
ggplot(data, aes(x = sex, y = Dispersion)) + 
  geom_boxplot() + 
  labs(x = "Sex", y = "Dispersion", title = "Dispersion by sex")

# save it by ggsave (pdf is recommended)
ggsave("dispersion_line.pdf")
ggsave("dispersion_line.png")
ggsave("dispersion_line.jpg")
```


## Part 5: Normality check

Normality check is an important step before the a statistical test (along with other assumption checking). 

We've learnt four ways to check the normality of your data, and they are often used together. 

1. Check normality visually with histogram (please see above). 

```{r}
ggplot(data, aes(x = Dispersion)) + 
  geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
  geom_density()
```

2. Check normality visually with Normal Q-Q plot. Normal Q-Q Plots compare your data set’s probability distribution to the normal distribution. If they are similar enough, you will see the data points form a nice straight line (or approximate one). We use `qqnorm(dataframe$colum)` to create dots and `qqnorm(dataframe$colum)` to add the line. 

```{r}
# check the normality of Dispersion
qqnorm(data$Dispersion)
qqline(data$Dispersion)
```

3. Check normality numerically (Skewness & Kurtosis values). There are different functions for this depending on the packages you use. Using the 'psych' package, we can check skewness and kurtosis with function `describe(dataframe$colum)` and look for the `skew` and `kurtosis` columns. This function also returns many other descriptive statistics. 

```{r}
# install and load the psych package
# install.packages("psych")
library(psych)

# check descriptive statistics using describe()
describe(data$Dispersion)
# what's the Skewness and Kurtosis values?
# skew: 0.2; kurtosis -0.82
```

4. Check normality statistically (Shapiro-Wilk Test). Shapiro-Wilk Test compares your distribution to the normal distribution (like a Q-Q plot, but with a p-value output). Remember the null hypothesis is there is no significant difference between the distribution of this data set and the normal distribution (i.e. this data set is normally distributed). If p > .05, the data is normally distributed because we can't reject the null hypothesis. Use `shapiro.test(dataframe$colum)` to run the test. 

```{r}
# run the Shapiro-Wilk Test
shapiro.test(data$Dispersion)
# what's the p value? 
# p = 0.2086, not significant, so the distribution is normal! 
```

