Author: Kayla Keyue Chen

Last updated: 2024.4.28

Objectives

In this exercise, we will recap data manipulation and visualisation skills and check normality to prepare for inferential tests. We’ll also learn two more functions for manipulating your dataset that we didn’t have time to cover in the lab: select() and mutate() from the tidyverse package.

Dataset

This exercise will use an R built-in dataset called USJudgeRatings.

The data contains lawyers’ average ratings of 43 different judges in the US. Each row is a judge (i.e., the person being rated), and each column is a measure. You can type “USJudgeRatings” into the Help pane and read more about it. For example,

  • The first column CONT gives the mean number of contacts each judge has had with the lawyers doing the ratings.
  • The following columns give mean ratings out of 10 by the lawyers of the judge’s abilities in different areas of their job.
  • The final column RTEN gives the lawyers’ judgement about whether the judge is worthy of retention.


1. Set-up

Q1. Set the working directory to the folder containing the files. Load the tidyverse and the USJudgeRatings dataset, saving the dataset as data.

HINT: setwd(), library(), read.csv()

#setwd("...")
library(tidyverse)

data <- read.csv("USJudgeRatings.csv")

Q2. Is the dataframe in WIDE or LONG format?

# type your answer here: wide

2. Data manipulation

Q3. First remove the column reflecting ‘Worthy of retention’ (i.e., RTEN) as we are not interested in that variable (save the new object as data)

HINT: Use select(dataframe_name, <column names>) to select the columns to keep. For example, select(data, c("CONT", "INTG", "DMNR")). Alternatively, you can drop columns that you no longer need by using a minus symbol -. For example, to drop the INTG column, you can use select(data, -INTG).

data <- select(data, -RTEN)

Q4. Next create a new column ID so that we can identify the judges anonymously. The first row has an ID of 1, the second has an ID of 2, etc. (recall that each row is a judge, so ID is the judge’s ID).

HINT: Use mutate(column_name = ...) to create a new column or edit an existing column.

data <- data %>% mutate(ID=1:43)

Q5. Next we want to convert the table to long format with three columns: ID, Variable and Score. Save this as the object data2.

HINT: Use gather()

data2 <- data %>% gather(key=Variable, value=Score, CONT:PHYS)

2. Descriptive Statistics

Q6. Now we can summarise the data. Make a table of mean, SD, SE and 95% confidence intervals for each variable and save it as summary.

HINT: Use group_by(), summarise()

summary <- data2 %>% group_by(Variable) %>% 
  summarise(mean = mean(Score), 
            sd = sd(Score), 
            se = sd/sqrt(n()), 
            ci_lower = mean - 1.96*se, 
            ci_upper = mean + 1.96*se)

3. Further manipulation

Q7. Create a new dataframe (data3) that comprises ID and Total. Total is the sum score of all variables except number of contacts with judge (i.e., CONT).

HINT: Use filter(), group_by() and summarise()

data3 <- data2 %>% 
  filter(Variable != 'CONT') %>%
  group_by(ID) %>% 
  summarise(Total = sum(Score))

Q8. Now we will integrate the total (in data3) with all other measures (in data) and save it as data4.

HINT: Use inner_join()

data4 <- data3 %>% inner_join(data, by='ID')

Now we have ID, the number of contacts of lawyer with judge (CONT), 10 measures of the judge (from INTG to PHYS), and a total score across those 10 measures. Take a look at the first 6 rows of the dataframe.

HINT: Use head()

head(data4)
# A tibble: 6 × 13
     ID Total  CONT  INTG  DMNR  DILG  CFMG  DECI  PREP  FAMI  ORAL  WRIT  PHYS
  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1  74     5.7   7.9   7.7   7.3   7.1   7.4   7.1   7.1   7.1   7     8.3
2     2  82.3   6.8   8.9   8.8   8.5   7.8   8.1   8     8     7.8   7.9   8.5
3     3  76.4   7.2   8.1   7.8   7.8   7.5   7.6   7.5   7.5   7.3   7.4   7.9
4     4  86     6.8   8.8   8.5   8.8   8.3   8.5   8.7   8.7   8.4   8.5   8.8
5     5  56.7   7.3   6.4   4.3   6.5   6     6.2   5.7   5.7   5.1   5.3   5.5
6     6  82.6   6.2   8.8   8.7   8.5   7.9   8     8.1   8     8     8     8.6

PS: You can always name the dataframe data instead of data2, data3, data4 to avoid too many (useless) objects in your environment. However, this would overwrite your original data dataframe, so you won’t be able to come back to it. There is a balance between creating new objects and keeping the environment tidy…

4. Data visualisation

Q9. Make a graph to help you decide whether the total ratings are normally distributed, check the Q-Q plot and the skewness and kurtosis values, and test normality using the Shapiro-Wilk test.

HINT: Histogram, qqnorm(), qqline(), describe() (from the psych package), shapiro.test()

# histogram
ggplot(data4, aes(x = Total)) + 
  geom_histogram(binwidth = 5, fill = "white", color = "black")

# Q-Q plot
qqnorm(data4$Total)
qqline(data4$Total)

# skewness and kurtosis
library(psych)
describe(data4$Total)
   vars  n  mean   sd median trimmed  mad  min  max range  skew kurtosis   se
X1    1 43 75.84 8.89   77.4   76.73 7.71 53.5 89.2  35.7 -0.76    -0.07 1.36
# skewness = -0.76, kurtosis = -0.07

# shapiro-wilk test
shapiro.test(data4$Total)

    Shapiro-Wilk normality test

data:  data4$Total
W = 0.93925, p-value = 0.02444
# p value = 0.024

Is the Shapiro-Wilk test significant? What does it mean?

# type your answer here: Yes. p = 0.024 is below 0.05, so we reject the null hypothesis of normality: the Total score deviates significantly from a normal distribution.

Q10. Next plot the relationship between number of contacts and total rating (add an estimated regression line).

HINT: scatter plot

ggplot(data4, aes(x=CONT, y=Total)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Q11. Create a bar plot of Judicial integrity (INTG), Demeanor (DMNR), Diligence (DILG), including mean, SD, and CI.

HINT: Use the summary dataframe, filter relevant variables, use stat = "identity" in the geom_bar() function, add CI with geom_errorbar()

ggplot(filter(summary, Variable %in% c("INTG", "DMNR", "DILG")), aes(x = Variable, y = mean)) + 
  geom_bar(stat = "identity", fill = "white", color = "black") + 
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) + 
  ylim(0, 10)

---
title: "PLINSTAT lab 2 exercise"
output:
  html_document: 
    toc: true
    toc_float: true
    code_download: true
params:
  flex: TRUE
---

```{r setup, include=FALSE}
#students: this is the set up chunk, it can be ignored
knitr::opts_chunk$set(warning=FALSE, message=FALSE, comment = "")
library(knitr)
library(tidyverse)
library(psych)
```

Author: Kayla Keyue Chen

Last updated: 2024.4.28

## Objectives

In this exercise, we will recap data manipulation and visualisation skills and check normality to prepare for inferential tests. **We'll also learn two more functions for manipulating your dataset that we didn't have time to cover in the lab:** `select()` and `mutate()` from the tidyverse package. 

## Dataset

This exercise will use an R built-in dataset called USJudgeRatings.

The data contains lawyers' average ratings of 43 different judges in the US. Each row is a judge (i.e., the person being rated), and each column is a measure. You can type "USJudgeRatings" into the Help pane and read more about it. For example, 

* The first column **CONT** gives the mean number of contacts each judge has had with the lawyers doing the ratings. 
* The following columns give mean ratings out of 10 by the lawyers of the judge’s abilities in different areas of their job. 
* The final column **RTEN** gives the lawyers' judgement about whether the judge is worthy of retention.


## 1. Set-up

Q1. Set the working directory to the folder containing the files. Load the tidyverse and the `USJudgeRatings` dataset, saving the dataset as `data`.

HINT: `setwd()`, `library()`, `read.csv()`

```{r, eval=FALSE}
#setwd("...")
```

```{r}
library(tidyverse)

data <- read.csv("USJudgeRatings.csv")
```
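If the CSV file is not to hand, the same ratings also ship with base R in the `datasets` package. The sketch below assumes the built-in copy matches the CSV used in this exercise; the object name `judges_builtin` is made up for illustration and is not used elsewhere in the exercise.

```r
# Load the copy of the dataset that ships with base R (datasets package).
data("USJudgeRatings")

# Copy it to a new name so the `data` object from Q1 is left untouched.
# `judges_builtin` is a hypothetical name used only in this sketch.
judges_builtin <- USJudgeRatings

dim(judges_builtin)    # number of rows and columns
names(judges_builtin)  # column names, from CONT to RTEN
```

In the built-in copy the judges' names are stored as row names rather than as a column, so the layout may differ slightly from what `read.csv()` returns.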

Q2. Is the dataframe in WIDE or LONG format?

```{r}
# type your answer here: wide
```

## 2. Data manipulation

Q3. First remove the column reflecting 'Worthy of retention' (i.e., RTEN) as we are not interested in that variable (save the new object as `data`)

HINT: Use `select(dataframe_name, <column names>)` to select the columns to keep. For example, `select(data, c("CONT", "INTG", "DMNR"))`. Alternatively, you can drop columns that you no longer need by using a minus symbol `-`. For example, to drop the INTG column, you can use `select(data, -INTG)`. 

```{r}
data <- select(data, -RTEN)
```
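As a minimal sketch of the two ways of using `select()` described in the hint, the example below keeps columns by name and drops a column with a minus sign. The data frame `toy` and its values are made up for illustration.

```r
library(dplyr)

# Toy data frame (made-up values) with three columns.
toy <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5), z = c("a", "b", "c"))

# Keep only the named columns.
kept <- select(toy, x, y)

# Drop one column with a minus sign; every other column is kept.
dropped <- select(toy, -z)

names(kept)
names(dropped)
```

Both calls leave the same two columns here, so either style can be used depending on which list of columns is shorter to type.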

Q4. Next create a new column `ID` so that we can identify the judges anonymously. The first row has an ID of 1, the second has an ID of 2, etc. (recall that each row is a judge, so ID is the judge's ID). 

HINT: Use `mutate(column_name = ...)` to create a new column or edit an existing column. 

```{r}
data <- data %>% mutate(ID=1:43)
```
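Typing `1:43` works because we know the dataset has 43 rows. One possible alternative, sketched below on a made-up data frame called `toy`, is dplyr's `row_number()`, which numbers the rows without the row count being typed in by hand.

```r
library(dplyr)

# Toy data frame (made-up values) with four rows.
toy <- data.frame(score = c(7.1, 8.4, 6.9, 7.7))

# row_number() counts from 1 up to the number of rows in the data frame.
toy <- toy %>% mutate(ID = row_number())

toy$ID
```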

Q5. Next we want to convert the table to long format with three columns: `ID`, `Variable` and `Score`. Save this as the object `data2`.

HINT: Use `gather()`

```{r}
data2 <- data %>% gather(key=Variable, value=Score, CONT:PHYS)
```
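The tidyr package also provides `pivot_longer()`, a newer function that does the same reshaping as `gather()`. The sketch below applies it to a small made-up data frame called `wide` rather than to the exercise data.

```r
library(dplyr)
library(tidyr)

# Toy wide data frame (made-up values): one row per ID, one column per measure.
wide <- data.frame(ID = 1:2, CONT = c(5.7, 6.8), INTG = c(7.9, 8.9))

# Stack the CONT and INTG columns into Variable/Score pairs.
long <- wide %>%
  pivot_longer(cols = CONT:INTG, names_to = "Variable", values_to = "Score")

long
```

The rows come out in a different order from `gather()` (grouped by ID rather than by variable), but the content is the same.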

## 2. Descriptive Statistics

Q6. Now we can summarise the data. Make a table of mean, SD, SE and 95% confidence intervals for each variable and save it as `summary`.

HINT: Use `group_by()`, `summarise()`

```{r}
summary <- data2 %>% group_by(Variable) %>% 
  summarise(mean = mean(Score), 
            sd = sd(Score), 
            se = sd/sqrt(n()), 
            ci_lower = mean - 1.96*se, 
            ci_upper = mean + 1.96*se)
```
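The multiplier 1.96 comes from the normal distribution. With a sample of 43 judges, the matching value from the t distribution is slightly larger (roughly 2.02), so a t-based interval would be a little wider. The sketch below compares the two on a handful of made-up scores, using base R only.

```r
# Made-up scores for illustration only.
scores <- c(7.9, 8.9, 8.1, 8.8, 6.4, 8.8)

n  <- length(scores)
se <- sd(scores) / sqrt(n)

# Critical values for a two-sided 95% interval.
z_crit <- qnorm(0.975)           # normal approximation, about 1.96
t_crit <- qt(0.975, df = n - 1)  # t distribution with n - 1 degrees of freedom

ci_z <- mean(scores) + c(-1, 1) * z_crit * se
ci_t <- mean(scores) + c(-1, 1) * t_crit * se

ci_z
ci_t
```

The gap between the two intervals shrinks as the sample size grows, which is why 1.96 is a common shortcut for larger samples.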

## 3. Further manipulation

Q7. Create a new dataframe (`data3`) that comprises `ID` and `Total`. Total is the **sum score** of all variables except number of contacts with judge (i.e., CONT).

HINT: Use `filter()`, `group_by()` and `summarise()`

```{r}
data3 <- data2 %>% 
  filter(Variable != 'CONT') %>%
  group_by(ID) %>% 
  summarise(Total = sum(Score))
```
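The same totals can also be computed straight from wide data with base R's `rowSums()`. The sketch below uses the copy of `USJudgeRatings` that ships with R, assuming it matches the CSV used in this exercise; `measures` and `totals` are names made up for illustration.

```r
library(dplyr)

# Built-in copy of the dataset, without CONT and RTEN.
measures <- USJudgeRatings %>% select(-CONT, -RTEN)

# Sum across the remaining 10 rating columns, giving one total per row.
totals <- rowSums(measures)

head(unname(totals))
```

These values can be compared with the `Total` column in `data3` as a quick check on the long-format approach.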

Q8. Now we will integrate the total (in `data3`) with all other measures (in `data`) and save it as `data4`.

HINT: Use `inner_join()`

```{r}
data4 <- data3 %>% inner_join(data, by='ID')
```
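`inner_join()` keeps only the rows whose key appears in both tables. In this exercise every ID is present in both `data3` and `data`, so nothing is lost. The sketch below uses two small made-up tables to show what happens when the IDs do not fully overlap.

```r
library(dplyr)

# Toy tables (made-up values). ID 3 appears only in `totals`, ID 4 only in `ratings`.
totals  <- data.frame(ID = c(1L, 2L, 3L), Total = c(74.0, 82.3, 76.4))
ratings <- data.frame(ID = c(1L, 2L, 4L), CONT  = c(5.7, 6.8, 6.8))

# Only the IDs found in both tables (1 and 2) survive the join.
joined <- inner_join(totals, ratings, by = "ID")

joined
```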

Now we have ID, the number of contacts of lawyer with judge (CONT), 10 measures of the judge (from INTG to PHYS), and a total score across those 10 measures. Take a look at the first 6 rows of the dataframe. 

HINT: Use `head()`

```{r}
head(data4)
```

PS: You can always name the dataframe `data` instead of `data2`, `data3`, `data4` to avoid too many (useless) objects in your environment. However, this would overwrite your original `data` dataframe, so you won't be able to come back to it. There is a balance between creating new objects and keeping the environment tidy... 

## 4. Data visualisation

Q9. Make a graph to help you decide whether the total ratings are normally distributed, check the Q-Q plot and the skewness and kurtosis values, and test normality using the Shapiro-Wilk test. 

HINT: Histogram, `qqnorm()`, `qqline()`, `describe()` (from the psych package), `shapiro.test()`

```{r}
# histogram
ggplot(data4, aes(x = Total)) + 
  geom_histogram(binwidth = 5, fill = "white", color = "black")

# Q-Q plot
qqnorm(data4$Total)
qqline(data4$Total)

# skewness and kurtosis
library(psych)
describe(data4$Total)
# skewness = -0.76, kurtosis = -0.07

# shapiro-wilk test
shapiro.test(data4$Total)
# p value = 0.024
```

Is the Shapiro-Wilk test significant? What does it mean? 

```{r}
# type your answer here: Yes. p = 0.024 is below 0.05, so we reject the null hypothesis of normality: the Total score deviates significantly from a normal distribution.
```
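To get a feel for how the Shapiro-Wilk test behaves, the sketch below runs it on two simulated samples of 43 values: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The seed and the distribution parameters are arbitrary choices made for illustration. A skewed sample of this size will typically give a small p value, while a normal sample usually will not.

```r
set.seed(1)  # arbitrary seed so the simulated draws are reproducible

normal_scores <- rnorm(43, mean = 75, sd = 9)  # drawn from a normal distribution
skewed_scores <- rexp(43, rate = 1)            # strongly right-skewed

# Extract just the p value from each test result.
p_normal <- shapiro.test(normal_scores)$p.value
p_skewed <- shapiro.test(skewed_scores)$p.value

c(normal = p_normal, skewed = p_skewed)
```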


Q10. Next plot the relationship between number of contacts and total rating (add an estimated regression line).

HINT: scatter plot

```{r}
ggplot(data4, aes(x=CONT, y=Total)) + 
  geom_point() + 
  geom_smooth(method = "lm")
```
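The line drawn by `geom_smooth(method = "lm")` is a straight-line fit of y on x, the same kind of model that `lm()` fits. As a rough sketch, the intercept and slope can be read off by fitting the model directly. The example uses the copy of `USJudgeRatings` that ships with R, assuming it matches the CSV; `judges`, `rating_cols` and `fit` are names made up for illustration.

```r
# Built-in copy of the dataset, with a Total column added.
judges <- USJudgeRatings
rating_cols <- setdiff(names(judges), c("CONT", "RTEN"))  # the 10 rating columns
judges$Total <- rowSums(judges[, rating_cols])

# Fit Total as a linear function of CONT.
fit <- lm(Total ~ CONT, data = judges)

coef(fit)  # intercept and slope of the fitted line
```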

Q11. Create a bar plot of Judicial integrity (INTG), Demeanor (DMNR), Diligence (DILG), including mean, SD, and CI. 

HINT: Use the `summary` dataframe, filter relevant variables, use `stat = "identity"` in the `geom_bar()` function, add CI with `geom_errorbar()`

```{r, fig.width=5, fig.height=4}
ggplot(filter(summary, Variable %in% c("INTG", "DMNR", "DILG")), aes(x = Variable, y = mean)) + 
  geom_bar(stat = "identity", fill = "white", color = "black") + 
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) + 
  ylim(0, 10)
```
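ggplot2 also offers `geom_col()`, which plots the values already in the data and so behaves like `geom_bar(stat = "identity")`. The sketch below uses a made-up summary table called `toy_summary`; the `width` setting on the error bars is an optional cosmetic choice.

```r
library(ggplot2)

# Made-up summary values for illustration only.
toy_summary <- data.frame(
  Variable = c("A", "B"),
  mean     = c(7.5, 8.0),
  ci_lower = c(7.2, 7.7),
  ci_upper = c(7.8, 8.3)
)

# Bars for the means, with error bars for the confidence intervals.
p <- ggplot(toy_summary, aes(x = Variable, y = mean)) +
  geom_col(fill = "white", color = "black") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2) +
  ylim(0, 10)

p
```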

