Author: Kayla Keyue Chen
Last updated: 2024.4.28
Objectives
In this exercise, we will recap data manipulation and visualisation
skills and check normality to prepare for inferential tests. We’ll also
learn two more functions for manipulating your dataset that we didn’t
have time to cover in the lab: select() and mutate() from the tidyverse
package.
Dataset
This exercise uses an R built-in dataset called USJudgeRatings.
The data contain lawyers’ average ratings of 43 different judges in the
US. Each row is a judge (i.e., the person being rated), and each column
is a measure. You can type “USJudgeRatings” into the Help pane
and read more about it. For example,
- The first column CONT gives the mean number of
contacts each judge has had with the lawyers doing the ratings.
- The following columns give mean ratings out of 10 by the lawyers of
the judge’s abilities in different areas of their job.
- The final column RTEN gives the lawyers’ judgement
about whether the judge is worthy of retention.
1. Set-up
Q1. Set the working directory to the folder containing the files. Load
the tidyverse and the USJudgeRatings dataset, saving the dataset as data.
HINT: setwd(), library(), read.csv()
#setwd("...")
library(tidyverse)
data <- read.csv("USJudgeRatings.csv")
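Because USJudgeRatings ships with base R (in the datasets package), the same data can also be obtained without a CSV file. The following is a minimal sketch, assuming you are happy to drop the judge names that the built-in copy stores as row names. The object name data_builtin is hypothetical, chosen so that it does not overwrite data.

```r
# USJudgeRatings is part of the datasets package, which is attached by default
data_builtin <- USJudgeRatings

# The built-in copy keeps the judge names as row names; drop them so the
# data frame resembles one read from a CSV without a name column
rownames(data_builtin) <- NULL

dim(data_builtin)    # number of rows and columns
names(data_builtin)  # column names, CONT through RTEN
```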
Q2. Is the dataframe in WIDE or LONG format?
# type your answer here: wide
2. Data manipulation
Q3. First, remove the column reflecting ‘Worthy of retention’ (i.e.,
RTEN), as we are not interested in that variable (save the new object as
data).
HINT: Use select(dataframe_name, <column names>) to select the columns
to keep. For example, select(data, c("CONT", "INTG", "DMNR")).
Alternatively, you can drop columns that you no longer need by using a
minus symbol (-). For example, to drop the INTG column, you can use
select(data, -INTG).
data <- select(data, -RTEN)
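To see that keeping and dropping are two routes to the same result, here is a small sketch on a toy data frame. The object names toy, kept and dropped are hypothetical and only serve the illustration.

```r
library(dplyr)

# Toy data frame with three columns (illustrative values only)
toy <- data.frame(CONT = c(5.7, 6.8),
                  INTG = c(7.9, 8.9),
                  RTEN = c(7.8, 8.7))

# Route 1: keep columns by naming them
kept <- select(toy, CONT, INTG)

# Route 2: drop the unwanted column with a minus sign
dropped <- select(toy, -RTEN)

# Compare the two results
all.equal(kept, dropped)
```

Dropping is usually less typing when you want to keep most of the columns, as in Q3.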
Q4. Next, create a new column ID so that we can identify the judges
anonymously. The first row has an ID of 1, the second has an ID of 2,
etc. (recall that each row is a judge, so ID identifies the judge being
rated).
HINT: Use mutate(column_name = ...) to create a new column or edit an
existing column.
data <- data %>% mutate(ID=1:43)
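Writing 1:43 works because we know the dataset has 43 rows. A sketch of an alternative that does not hard-code the row count uses row_number() from dplyr; the toy data frame here is made up for illustration.

```r
library(dplyr)

# Toy data frame with three rows (illustrative values only)
toy <- data.frame(INTG = c(7.9, 8.9, 8.1))

# row_number() numbers the rows 1, 2, 3, ... whatever the row count is
toy <- toy %>% mutate(ID = row_number())
```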
Q5. Next, we want to convert the table to a long format with three
columns: ID, Variable and Score. Save this as object data2.
HINT: Use gather()
data2 <- data %>% gather(key=Variable, value=Score, CONT:PHYS)
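The tidyr documentation marks gather() as superseded by pivot_longer(), so you may meet the newer function in other people's code. Below is a minimal sketch of the same reshaping with pivot_longer() on a toy data frame (the values and the object names toy and long are illustrative). Note that the rows come out in a different order than with gather(), although the content is the same.

```r
library(dplyr)
library(tidyr)

# Toy wide data frame: one row per judge, one column per measure
toy <- data.frame(ID   = 1:2,
                  CONT = c(5.7, 6.8),
                  INTG = c(7.9, 8.9))

# names_to plays the role of gather()'s key, values_to the role of value
long <- toy %>%
  pivot_longer(cols = CONT:INTG, names_to = "Variable", values_to = "Score")
```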
3. Descriptive Statistics
Q6. Now we can summarise the data. Make a table of mean, SD, SE and
95% confidence intervals for each variable and save it as summary.
HINT: Use group_by(), summarise()
summary <- data2 %>%
  group_by(Variable) %>%
  summarise(mean = mean(Score),
            sd = sd(Score),
            se = sd / sqrt(n()),
            ci_lower = mean - 1.96 * se,
            ci_upper = mean + 1.96 * se)
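The 1.96 multiplier comes from the normal distribution. A minimal sketch of a common alternative, assuming you would rather use the t distribution with n - 1 degrees of freedom (which gives slightly wider intervals in small samples), is shown here. The scores vector is made up for illustration.

```r
scores <- c(7.9, 8.9, 8.1, 8.8, 6.4, 8.8)  # hypothetical scores

n  <- length(scores)
m  <- mean(scores)
se <- sd(scores) / sqrt(n)

# Critical value for a 95% interval from the t distribution
t_crit <- qt(0.975, df = n - 1)

# Interval based on the t distribution
ci_t <- c(lower = m - t_crit * se, upper = m + t_crit * se)

# Interval based on the normal approximation, as in the answer to Q6
ci_z <- c(lower = m - 1.96 * se, upper = m + 1.96 * se)
```

With 43 judges the two intervals are close, so the normal approximation used in the exercise is a reasonable shortcut.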
4. Further manipulation
Q7. Create a new dataframe (data3) that comprises ID and Total. Total
is the sum score of all variables except the number of contacts with the
judge (i.e., CONT).
HINT: Use filter(), group_by() and summarise()
data3 <- data2 %>%
  filter(Variable != 'CONT') %>%
  group_by(ID) %>%
  summarise(Total = sum(Score))
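The same total can also be computed directly in the wide format, without going through the long table. This is a sketch of one alternative using rowSums() with across(), on a toy data frame with only two rating columns (values and object names are illustrative).

```r
library(dplyr)

# Toy wide data frame with two rating columns
toy <- data.frame(ID   = 1:2,
                  CONT = c(5.7, 6.8),
                  INTG = c(7.9, 8.9),
                  DMNR = c(7.7, 8.8))

# Sum across the rating columns only, leaving out ID and CONT
toy_total <- toy %>%
  mutate(Total = rowSums(across(INTG:DMNR))) %>%
  select(ID, Total)
```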
Q8. Now we will integrate the total (in data3) with all other measures
(in data) and save it as data4.
HINT: Use inner_join()
data4 <- data3 %>% inner_join(data, by='ID')
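inner_join() keeps only the rows whose key appears in both tables. In this exercise every ID is present in both data3 and data, so nothing is lost, but the toy sketch below (made-up tables totals and contacts) shows what happens when the keys only partly overlap.

```r
library(dplyr)

# Two toy tables that share the key column ID but only partly overlap
totals   <- data.frame(ID = c(1, 2, 3), Total = c(74.0, 82.3, 76.4))
contacts <- data.frame(ID = c(2, 3, 4), CONT  = c(6.8, 7.2, 6.8))

# Only IDs present in both tables survive the join
joined <- inner_join(totals, contacts, by = "ID")
```

Here IDs 1 and 4 are dropped because each appears in only one of the two tables.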
Now we have ID, the number of contacts of lawyers with the judge (CONT),
10 measures of the judge (from INTG to PHYS), and a total score across
those measures. Take a look at the first 6 rows of the dataframe.
HINT: Use head()
head(data4)
# A tibble: 6 × 13
     ID Total  CONT  INTG  DMNR  DILG  CFMG  DECI  PREP  FAMI  ORAL  WRIT  PHYS
  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1  74     5.7   7.9   7.7   7.3   7.1   7.4   7.1   7.1   7.1   7     8.3
2     2  82.3   6.8   8.9   8.8   8.5   7.8   8.1   8     8     7.8   7.9   8.5
3     3  76.4   7.2   8.1   7.8   7.8   7.5   7.6   7.5   7.5   7.3   7.4   7.9
4     4  86     6.8   8.8   8.5   8.8   8.3   8.5   8.7   8.7   8.4   8.5   8.8
5     5  56.7   7.3   6.4   4.3   6.5   6     6.2   5.7   5.7   5.1   5.3   5.5
6     6  82.6   6.2   8.8   8.7   8.5   7.9   8     8.1   8     8     8     8.6
PS: You can always name the dataframe data instead of data2, data3,
data4 to avoid too many (useless) objects in your environment. However,
this would overwrite your original data dataframe, so you won’t be able
to come back to it. There is a balance between creating new objects and
keeping the environment tidy…
5. Data visualisation
Q9. Make a graph to help you decide whether the total ratings are
normally distributed, check the Q-Q plot and the skewness and kurtosis
values, and test normality using the Shapiro-Wilk test.
HINT: Histogram, qqnorm(), qqline(), describe() (from the psych
package), shapiro.test()
# histogram
ggplot(data4, aes(x = Total)) +
  geom_histogram(binwidth = 5, fill = "white", color = "black")
# Q-Q plot
qqnorm(data4$Total)
qqline(data4$Total)
# skewness and kurtosis
library(psych)
describe(data4$Total)
   vars  n  mean   sd median trimmed  mad  min  max range  skew kurtosis   se
X1    1 43 75.84 8.89   77.4   76.73 7.71 53.5 89.2  35.7 -0.76    -0.07 1.36
# skewness = -0.76, kurtosis = -0.07
# shapiro-wilk test
shapiro.test(data4$Total)
Shapiro-Wilk normality test
data: data4$Total
W = 0.93925, p-value = 0.02444
# p value = 0.024
Is the Shapiro-Wilk test significant? What does it mean?
# type your answer here: Yes, it is significant (p = 0.024 < 0.05), so we reject the null hypothesis that the Total score is normally distributed.
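The null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution, so a small p-value is evidence against normality. To get a feel for how the test behaves, here is a sketch on simulated data. Both samples and their parameters are made up for illustration and have nothing to do with the judge ratings.

```r
set.seed(123)  # make the simulated samples reproducible

# One sample drawn from a normal distribution, one strongly right-skewed
normal_sample <- rnorm(200, mean = 75, sd = 9)
skewed_sample <- rexp(200, rate = 1)

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# Compare the p-values; the W statistic is closer to 1 for normal-looking data
sw_normal$p.value
sw_skewed$p.value
```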
Q10. Next plot the relationship between number of contacts and total
rating (add an estimated regression line).
HINT: scatter plot
ggplot(data4, aes(x = CONT, y = Total)) +
  geom_point() +
  geom_smooth(method = "lm")
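geom_smooth(method = "lm") draws the least-squares line but does not report its coefficients. A minimal sketch of fitting the same kind of line with lm() follows, using only the six rows printed by head(data4) earlier, so the numbers will differ from a fit on all 43 judges. The object names first_six and fit are hypothetical.

```r
# CONT and Total for the six rows shown in the head(data4) output
first_six <- data.frame(
  CONT  = c(5.7, 6.8, 7.2, 6.8, 7.3, 6.2),
  Total = c(74.0, 82.3, 76.4, 86.0, 56.7, 82.6)
)

# Simple linear regression of Total on CONT
fit <- lm(Total ~ CONT, data = first_six)

coef(fit)  # intercept and slope of the fitted line
```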
Q11. Create a bar plot of Judicial integrity (INTG), Demeanor (DMNR),
Diligence (DILG), including mean, SD, and CI.
HINT: Use the summary dataframe, filter relevant variables, use
stat = "identity" in the geom_bar() function, add CI with
geom_errorbar()
ggplot(filter(summary, Variable %in% c("INTG", "DMNR", "DILG")),
       aes(x = Variable, y = mean)) +
  geom_bar(stat = "identity", fill = "white", color = "black") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) +
  ylim(0, 10)
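geom_col() is a shorthand for geom_bar(stat = "identity"), so the same kind of plot can be written a little more compactly. The sketch below uses a made-up summary table (toy_summary, with illustrative values rather than the real means) so that it runs on its own. The narrower error bars via width and the use of coord_cartesian() are optional styling choices, not part of the original answer.

```r
library(ggplot2)

# Made-up summary values for illustration only
toy_summary <- data.frame(
  Variable = c("DILG", "DMNR", "INTG"),
  mean     = c(7.7, 7.5, 8.0),
  ci_lower = c(7.4, 7.2, 7.8),
  ci_upper = c(8.0, 7.8, 8.2)
)

p <- ggplot(toy_summary, aes(x = Variable, y = mean)) +
  geom_col(fill = "white", color = "black") +             # bars from precomputed means
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
                width = 0.2) +                            # narrower error bars
  coord_cartesian(ylim = c(0, 10))                        # zoom the axis without dropping data

p  # print the plot
```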
