Author: Kayla Keyue Chen

Last updated: 2024.4.18

Learning objectives

  • Get familiar with R interface: (1) Create, save, and open a script. (2) Run your code. (3) Set working directory.

  • Learn how to install and load packages. We will be using “tidyverse” a lot because it offers many handy functions, and it also contains many other useful packages so we don’t need to install and load them one by one.

  • Differentiate between objects and functions. Learn how to create and manipulate scalars, vectors, and dataframes in R. Use the Environment pane to get an overview of all the objects you have. Read the Console to see the effects of your code.

Get familiar with R interface

Create, save, and open a script

  1. Create a script: File -> New File -> R Script. In the Untitled1 file, type 1+2 in your script so we will have something to run later.

  2. Save: There are different ways to save a script: (1) File -> Save. (2) Click on the floppy disk icon. (3) Shortcut: Ctrl/Command + S. The first time you save a script, you need to give it a name.

  3. Open an existing script: You can open a script from the RStudio interface or directly from your folder. (1) To open a script in RStudio, File -> Open File -> choose the one you want to open. (2) To open a script from your folder, you just need to double-click on it.

Run your code

  1. Run a line: Move the cursor to the line you want to run and press Ctrl/Command + Enter. R will run the current line and automatically move the cursor to the next line.

  2. Run multiple lines: Highlight the lines you want to run, and press Ctrl/Command + Enter.

  3. Run the entire script: First select all lines by pressing Ctrl/Command + A, then run the code by pressing Ctrl/Command + Enter.

You can find more keyboard shortcuts in RStudio's online support documentation.

You can also click ‘Run’ in the top right corner of the Source window to run selected lines.

Set working directory

It’s very useful to set the working directory at the beginning of your script so that R knows where to load and save your files. There are several ways to set the working directory.

  1. You can use setwd() to set the directory and getwd() to check the current directory. Note: write the path with forward slashes (/), even on Windows. A single backslash (\) is an escape character inside an R string, so it would have to be doubled (\\).
# set working directory
setwd("D:/PLINSTAT R for linguists/2024 materials/week1")
# get working directory
getwd()
  2. You can also set the working directory manually, from Session -> Set Working Directory -> Choose Directory … -> choose the folder you want to use in the pop-up window.

  3. If you have saved the script in the folder that you want to set as the working directory, you can also use Session -> Set Working Directory -> To Source File Location.

  4. If you have opened the folder in the Files pane (the bottom right pane, Files tab), you can use Session -> Set Working Directory -> To Files Pane Location.

Install and load packages

An R package is simply a bunch of data (functions, help menus) stored in one neat package. Different packages have different functions that can be very useful and save a lot of trouble! You need to install (only for the first time) and load the package if you want to use the functions offered by the package.

Install packages using install.packages("name") where “name” is the name of the package. Note that you need to use quotation marks around the package name.

Load a package using library(name) where “name” is the name of the package. Here quotation marks are not needed.

# install tidyverse (you only need to install a package once)
install.packages("tidyverse")
# load tidyverse (you need to do this every time you open R)
library(tidyverse)
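
Because installation is only needed once, it can be handy to check whether a package is already installed before installing it again. The following is a minimal sketch, not part of the original lab: the helper name is_installed is hypothetical, and it relies on base R's requireNamespace(). The check is demonstrated on "stats", which ships with R.

```r
# Hypothetical helper (not from the lab): TRUE if a package is already installed
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with every R installation, so this check should succeed
is_installed("stats")

# The same pattern applied to tidyverse: install only when it is missing
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```

The install line is left as a comment so that running the sketch does not trigger a download; remove the leading # if you want to use it.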

Create and manipulate objects in R

Objects and functions

Imagine we have a bunch of values, e.g. 2, 4, 6, 9, 13, 25. These are our objects. Now imagine we want to combine these objects into one sequence (this is called a vector). To combine objects into a vector, we use the ‘combine’ function c(). It looks like this: c(x, y, z), where c() is our function and x, y, z are placeholders for our objects. You can put as many objects into this function as you want.

# Now try the function c(x, y, z, ...) to combine 2, 4, 6, 9, 13, 25
c(2, 4, 6, 9, 13, 25)
[1]  2  4  6  9 13 25
# Check the Console: What happened? -- the code returns 2  4  6  9 13 25
# Check the Environment: Did anything happen? -- nothing 

Creating and saving objects

Now let’s look at how to create and save an object. Saving an object stores it in your workspace under a name, so you can use it again later. To create and save an object you:

  1. Type the name you want to give your object (e.g. pizza)

  2. Type this symbol: <-

  3. Type the function for the object you wish to create, e.g., c(2, 4, 6, 9, 13, 25)

  4. Press Ctrl/Command + Enter to run the code

After you run the code, you will see an object called pizza in the Environment pane under the Values section. It has “num [1:6] 2 4 6 9 13 25” in the second column. “num” means that this vector contains numeric data. “[1:6]” shows that there are 6 numbers in this vector, and then it shows the first several numbers for a preview.

# Now type the code to create the object pizza
pizza <- c(2, 4, 6, 9, 13, 25)
# Check the Console: What happened? -- the code didn't return the values 
# Check the Environment: Did anything happen? -- new object pizza in the Values section

# Now, call your object by typing the name, then Ctrl/Command + Enter
pizza
[1]  2  4  6  9 13 25
# Check the Console: What do you see? -- the code returns the values 2  4  6  9 13 25
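
The information shown in the Environment pane can also be retrieved with code. As a small sketch using base R functions (class(), length(), and str() are not introduced at this point in the lab, so treat this as an aside):

```r
# Recreate the object from the example above
pizza <- c(2, 4, 6, 9, 13, 25)

class(pizza)   # type of data stored in the vector: "numeric"
length(pizza)  # number of items in the vector: 6
str(pizza)     # compact summary, like the Environment pane: num [1:6] 2 4 6 9 13 25
```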

Scalars and operations

Numeric scalar

A scalar is an atomic quantity that can hold only one value at a time. We can create numeric scalars. For example, let’s create a scalar called x with the value 4 and another scalar called y with the value 7.

# Create x
x <- 4

# Create y
y <- 7

# Check your Environment!

We can check what kind of scalar a given object is by using the class() function. For example: class(x) will return the type of the object x, which is “numeric”.

# check the class of x
class(x)
[1] "numeric"
# what kind of scalar is x?

We can perform arithmetic operations on our scalars, e.g., adding (+), subtracting (-), multiplying (*), dividing (/), raising to a power (use ^ followed by the exponent, e.g., ^2 for squaring), and taking the square root (use the function sqrt()).

# try the following arithmetic operations, use any number you like

# adding
2+4 
[1] 6
# subtracting
1-100 
[1] -99
# multiplying
4*23 
[1] 92
# dividing
88/8 
[1] 11
# raising to a power (here: 5 cubed)
5^3 
[1] 125
# taking the square root
sqrt(10) 
[1] 3.162278
# combination of different operations
sqrt((1+3)/2*9)^2 
[1] 18

Logical scalar

We can also create logical scalars (i.e. TRUE and FALSE). For example, let’s create a scalar m defined as x > y, a scalar n defined as x < y, and a scalar p defined as x = y. In R, the test for equality is written == (a single = assigns a value). Note: x and y keep the values we assigned to them earlier (x is 4 and y is 7).

# Define m
m <- x>y
# Define n
n <- x<y
# Define p
p <- x==y
# Check the Environment: what are the values of m, n, and p?
# m is FALSE, n is TRUE, and p is FALSE
# What kind of scalar is m? (hint: use class() function)
class(m) 
[1] "logical"

Logical operators include AND and OR. For example: x > y and x < y can be written as x > y & x < y or m & n; x > y or x < y can be written as x > y | x < y or m | n.

# Evaluate the code given as examples in the instruction
x > y & x < y 
[1] FALSE
m & n 
[1] FALSE
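
The OR examples from the instruction can be evaluated in the same way. A short sketch that redefines x, y, m, and n with the values used above so it runs on its own:

```r
# Same values as earlier in this section
x <- 4
y <- 7
m <- x > y   # FALSE
n <- x < y   # TRUE

# OR is TRUE when at least one side is TRUE
x > y | x < y   # TRUE
m | n           # TRUE
```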

Character scalar

We can make character/string scalars. We must use quotation marks (either single or double) to indicate that they are characters. Create a character scalar with the text hello world and save it to an object named mystring.

# Create mystring
mystring <- "hello world"
# what kind of scalar is 'mystring'? (hint: use class() function)
class(mystring) 
[1] "character"

We can’t use arithmetic operations on character scalars. Note that when a number is surrounded by quotation marks, it becomes a character. For example, 1 is a number, “1” is a character. You can use the class() function to test if this is true.

# Try this
string1 <- "1" # ATTENTION: This is a character scalar because we put quotation marks!
string2 <- "2"
# Try adding them and see what happens :)
string1 + string2 # Error in string1 + string2 : non-numeric argument to binary operator
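
The class() check suggested above can be sketched as follows. The last line uses as.numeric(), which is not covered in this lab; it is shown only as one possible way to turn such characters back into numbers.

```r
class(1)     # "numeric"
class("1")   # "character"

string1 <- "1"
string2 <- "2"

# Aside (assumption, not part of the lab): convert to numbers before adding
as.numeric(string1) + as.numeric(string2)   # 3
```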

Vectors and operations

Create a vector

We can combine scalars of the same data type into larger objects: These are called vectors. Use the c() function introduced earlier.

# creates vector a which contains values 1,2,3,4 (hint: a <- c(x,y,z,...))
a <- c(1,2,3,4)
# creates vector b containing numbers from 1 to 10
b <- c(1,2,3,4,5,6,7,8,9,10)

Three useful ways to create vectors that follow a pattern.

  1. To create consecutive integers you can use : within the c() function. For example, c(1:10) returns numbers from 1 to 10.

  2. seq(from, to, by) creates a vector with a sequence of values. “from” defines the starting point, “to” defines the ending point, and “by” defines the step. For example, seq(from=1,to=5,by=1) creates a vector containing 1,2,3,4,5.

  3. rep(x, times, each) creates a vector with repeated values. “x” defines the vector to be repeated, “times” defines the number of repetitions of the entire vector, and “each” defines the number of repetitions of each element in the vector. For example, rep(c(1,2,3),times=2,each=2) creates a new vector containing 1,1,2,2,3,3,1,1,2,2,3,3.

# generate numbers from 1 to 100 using c()
c(1:100)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
# create vector c with sequence of values from 1, to 20, by 2
c <- seq(from=1,to=20,by=2)
# create vector d with vector (2,4,6) repeated 2 times, with each element repeated 3 times
d <- rep(c(2,4,6),times=2,each=3)
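
To see what these calls produce, here is a short sketch that prints the results. The object names odd_numbers and repeated are hypothetical, chosen so that no object shares a name with the c() function.

```r
# The colon operator already returns a vector, so wrapping it in c() is optional
identical(1:10, c(1:10))   # TRUE

# Sequence from 1 to 20 in steps of 2 (20 itself is not reached)
odd_numbers <- seq(from = 1, to = 20, by = 2)
odd_numbers   # 1 3 5 7 9 11 13 15 17 19

# Each element is repeated 3 times first, then the whole result is repeated 2 times
repeated <- rep(c(2, 4, 6), times = 2, each = 3)
repeated      # 2 2 2 4 4 4 6 6 6 2 2 2 4 4 4 6 6 6
```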

Arithmetic operations with vectors

# Make a simple numeric vector (e)
e <- c(1,2,3)
# Make another one (f)
f <- c(5,7,9)
# Make another one (g)
g <- c(5,7,9,11)
# Check the Environment to see the length of the vectors
# e and f have 3 numbers, g has 4 numbers

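We can do arithmetic operations on vectors too! If we want to do operations between two vectors, they should have the same length, or the length of the longer one should be a multiple of the length of the shorter one. Otherwise R still reuses the values of the shorter vector to fill the gap, but it gives a warning.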

# adds 6 to each value in e
e+6
[1] 7 8 9
# squaring each value of f
f^2
[1] 25 49 81
# adding e and f (each corresponding value in the sequence is summed)
e+f
[1]  6  9 12
# multiplying e and f (each corresponding value in the sequence is multiplied)
e*f
[1]  5 14 27
# Note: If you want to save these results to your Environment as vectors, you need to give them a name
# Now, try adding e and g, what happened? 
e+g # returns a warning: longer object length is not a multiple of shorter object length
[1]  6  9 12 12
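
When the longer length is a multiple of the shorter one, R reuses the shorter vector without any warning. A minimal sketch with two hypothetical vectors, short_vec and long_vec:

```r
short_vec <- c(10, 20)        # length 2
long_vec  <- c(1, 2, 3, 4)    # length 4, a multiple of 2

# short_vec is reused as 10, 20, 10, 20
long_vec + short_vec   # 11 22 13 24
```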

Logical operations with vectors

We can also do logical operations with vectors. For example, we can evaluate whether each item in the vector is larger than 10 by vector_name > 10.

# remember that we have vector b which contains integer numbers from 1 to 10
# find out whether each item in vector b is larger than 6
b > 6
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# find out whether each item in vector b is larger than 5 AND smaller than 8
b > 5 & b < 8
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE

The %in% operation allows you to combine multiple OR comparisons to check whether given values are present in a set. For example: You have a vector that describes favourite letters fave.letters <- c('a', 't', 'a', 'b', 'z'). To check whether each letter in fave.letters is one of ‘a’, ‘b’ or ‘c’, you can type: fave.letters %in% c('a', 'b', 'c'). The results will be TRUE FALSE TRUE TRUE FALSE, i.e., only the 2nd letter ‘t’ and the 5th letter ‘z’ are not one of ‘a’, ‘b’ or ‘c’.

This is helpful when you want to filter your participants based on their responses. For example, only keep participants whose favourite letter is one of ‘a’, ‘b’ or ‘c’.

# find out whether each item in vector b is in the set 1,7,13
b %in% c(1,7,13)
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
# find out whether each item in vector b is in the vector a
b %in% a
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Note that because FALSE = 0 and TRUE = 1, we can sum() the logical vectors to find out the number of items that satisfy the operation, and use mean() to find out the proportion of items that satisfy the operation.

# find out the number of items in vector b that are larger than 6
sum(b > 6)
[1] 4
# find out the number of items in vector b that are in the set 3,5,0
sum(b %in% c(3,5,0))
[1] 2
# find out the proportion of items in vector b that are larger than 5 AND smaller than 8
mean(b > 5 & b < 8)
[1] 0.2

Indexing a vector

You can ask R to return the values of specific items within your vectors by typing the vector name followed by [], and put the order number of the item into []. For example, to get the 2nd item in vector a, you write a[2]. To return multiple items, you can use c() to combine all indices you are interested in.

# remember that we have vector b which contains integer numbers from 1 to 10
# Get the 3rd item in vector b
b[3]
[1] 3
# Get the 2nd and 4th item in vector b
b[c(2,4)]
[1] 2 4
# Get the 2nd to 8th item in vector b
b[c(2:8)]
[1] 2 3 4 5 6 7 8

We can also use logical operations inside the indexing brackets []. For example, if we want to know which items in a vector are smaller than 5, we can write vector_name[vector_name < 5]. This works because vector_name < 5 returns a vector of logical values (i.e., FALSEs & TRUEs), and indexing with [] then returns the items that have TRUEs.

The which() function returns the indices where the value is TRUE. We can write which(vector_name < 5) to return the indices of the numbers that are smaller than 5.

# find out what items in vector b are smaller than 4
b[b<4]
[1] 1 2 3
# find out what items in vector b are larger than 4 but smaller than 6
b[b>4 & b<6]
[1] 5
# find out the indices of the items in vector b that are smaller than 4
which(b<4) # note that because b contains integer numbers from 1 to 10, the indices are the same as number values. 
[1] 1 2 3
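
Because the indices and values of b happen to coincide, the difference between logical indexing and which() is easier to see on another vector. A small sketch reusing the pizza values from earlier:

```r
pizza <- c(2, 4, 6, 9, 13, 25)

pizza > 6          # FALSE FALSE FALSE TRUE TRUE TRUE
pizza[pizza > 6]   # the values larger than 6: 9 13 25
which(pizza > 6)   # the positions of those values: 4 5 6
```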

Dataframes

To create a dataframe, you need to specify column names and values in each column. For example, let’s create a dataframe called df1 with two columns, “sex” and “age”. We have 3 males and they are 99, 46, 23 years old, and 2 females and they are 54, 23 years old. Use data.frame() function, each argument will be in the format “column name = c(value1, value2, value3, …)”, e.g., “age = c(99,46,23,54,23)”.

# Create df1 using the data.frame() function
df1 <- data.frame(sex = c("male","male","male","female","female"), 
                  age = c(99,46,23,54,23))
# Now call on your dataframe: what does R return?
df1
     sex age
1   male  99
2   male  46
3   male  23
4 female  54
5 female  23

Your column/vector names are your variables. You call on specific variables within a dataframe with the dollar symbol $, e.g., df1$sex (this means you extract a vector from the dataframe!).

# Extract column age and sex 
df1$sex
[1] "male"   "male"   "male"   "female" "female"
df1$age
[1] 99 46 23 54 23

To get column names, you can use the colnames(dataframe) function. You can change a specific column name using rename(dataframe, new_name = old_name) (Note: rename() comes from the dplyr package, which is loaded as part of tidyverse. Remember to load the package using library() first!). rename() returns a new dataframe and leaves the original unchanged unless you assign the result back to it. Another option is to use colnames(dataframe) <- c("xxx", "xxx", ...) to change ALL the column names at once.

#library(tidyverse) if you haven't
# Get the column names of df1
colnames(df1)
[1] "sex" "age"
# change column name "sex" to "gender" using rename()
rename(df1, "gender" = "sex")
  gender age
1   male  99
2   male  46
3   male  23
4 female  54
5 female  23
# change column names to "GENDER" and "AGE" using colnames()
colnames(df1) <- c("GENDER", "AGE")
# call on df1 to see the change
df1
  GENDER AGE
1   male  99
2   male  46
3   male  23
4 female  54
5 female  23
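
If tidyverse is not loaded, a single column can also be renamed with base R by indexing the column names. This is a sketch of an alternative, not the method used in the lab; df_demo is a hypothetical dataframe created only for illustration.

```r
# Hypothetical small dataframe for illustration
df_demo <- data.frame(sex = c("male", "female"), age = c(46, 54))

# Replace only the column name that equals "sex"
colnames(df_demo)[colnames(df_demo) == "sex"] <- "gender"

colnames(df_demo)   # "gender" "age"
```

The logical test colnames(df_demo) == "sex" picks out the position of the old name, so the other column names stay untouched.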

Factor variable

We can set the sex column to a factor variable using as.factor() function. If sex is set to a factor variable, it is categorical and new observations can only be one of the specified levels. The levels are by default arranged by alphabetical order. You can manually specify the order using function factor() (no “as”!) and argument levels = c("xxx", "xxx", ...).

Remember that you must assign the result back to the object to save the change! For example, a+1 returns each value in a plus 1, but the value stored in a won’t change, whereas a <- a+1 will update the value of a.
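
The difference can be sketched with the vector a from earlier, recreated here so the example runs on its own:

```r
a <- c(1, 2, 3, 4)

a + 1        # prints 2 3 4 5, but nothing is saved
a            # still 1 2 3 4

a <- a + 1   # assign the result back to a
a            # now 2 3 4 5
```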

# Turn GENDER column into a factor variable using as.factor()
df1$GENDER <- as.factor(df1$GENDER)
# Set male to be level1 and female to be level2 using factor()
df1$GENDER <- factor(df1$GENDER, levels = c("male", "female"))
# Call on the GENDER column and see what levels are there in the factor (hint: use the dollar symbol $)
df1$GENDER
[1] male   male   male   female female
Levels: male female
# Add a 6th observation to the column. Can you add "female"? Can you add "other"?
# first we extract the vector GENDER <- df1$GENDER
# then try GENDER[6] <- "female" and then GENDER[6] <- "other", What happened? 
GENDER <- df1$GENDER
GENDER[6] <- "female" # works fine
GENDER[6] <- "other" # returns Warning: invalid factor level, NA generated, "other" is replaced with <NA> 
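
The default alphabetical ordering, and one possible way to allow "other" as a value, can be sketched as follows. The names gender_chr and gender_fct are hypothetical, and listing "other" in levels is an assumption about how you might want to handle it, not a step from the lab.

```r
gender_chr <- c("male", "male", "male", "female", "female")

# By default the levels are sorted alphabetically
levels(as.factor(gender_chr))   # "female" "male"

# Declare every allowed level up front, including one that is not observed yet
gender_fct <- factor(gender_chr, levels = c("male", "female", "other"))

# "other" is now a valid level, so no NA is generated
gender_fct[6] <- "other"
levels(gender_fct)        # "male" "female" "other"
sum(is.na(gender_fct))    # 0
```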

Let’s practice!

  1. Create a vector for each of the columns in the table (see the last slide for Lab1)

    • Tip 1: use the combine function c() to create the vectors

    • Tip 2: assign vectors to column names, e.g., column_name = c(value1, value2, …)

lang_name <- c('Mandarin', 'Spanish', 'English', 'Hindi', 'Portuguese', 'Bengali', 'Russian', 'Japanese')
lang_order <- c('SVO', 'SVO', 'SVO', 'SOV', 'SVO', 'SOV', 'SVO', 'SOV')
lang_popL1 <- c(921.2, 471.4, 369.9, 342.2, 232.4, 228.7, 153.7, 126.3)
lang_popL2 <- c(198.7, 71.5, 978.2, 258.3, 25.2, 39.0, 104.3, 0.12)
lang_indoeuro <- c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
  2. Combine the vectors into a dataframe

    • Tip 1: use the dataframe function data.frame() to create the dataframe

    • Tip 2: save the dataframe with the name lang_df

lang_df <- data.frame(lang_name, lang_order, lang_popL1, lang_popL2,lang_indoeuro)
  3. Have a look at the dataframe

    • Call on the entire dataframe

    • Call on specific columns (vectors) in your dataframe

    • Call on all column names

# Call on the entire dataframe
lang_df
   lang_name lang_order lang_popL1 lang_popL2 lang_indoeuro
1   Mandarin        SVO      921.2     198.70         FALSE
2    Spanish        SVO      471.4      71.50          TRUE
3    English        SVO      369.9     978.20          TRUE
4      Hindi        SOV      342.2     258.30          TRUE
5 Portuguese        SVO      232.4      25.20          TRUE
6    Bengali        SOV      228.7      39.00          TRUE
7    Russian        SVO      153.7     104.30          TRUE
8   Japanese        SOV      126.3       0.12         FALSE
# Call on specific columns (vectors) in your dataframe
lang_df$lang_name
[1] "Mandarin"   "Spanish"    "English"    "Hindi"      "Portuguese"
[6] "Bengali"    "Russian"    "Japanese"  
lang_df$lang_popL1
[1] 921.2 471.4 369.9 342.2 232.4 228.7 153.7 126.3
# return all column names
colnames(lang_df)
[1] "lang_name"     "lang_order"    "lang_popL1"    "lang_popL2"   
[5] "lang_indoeuro"
  4. Try executing the following functions on your dataframe and see what the functions return.
# head(df_name)
head(lang_df) # returns first several rows
   lang_name lang_order lang_popL1 lang_popL2 lang_indoeuro
1   Mandarin        SVO      921.2      198.7         FALSE
2    Spanish        SVO      471.4       71.5          TRUE
3    English        SVO      369.9      978.2          TRUE
4      Hindi        SOV      342.2      258.3          TRUE
5 Portuguese        SVO      232.4       25.2          TRUE
6    Bengali        SOV      228.7       39.0          TRUE
# tail(df_name)
tail(lang_df) # returns last several rows
   lang_name lang_order lang_popL1 lang_popL2 lang_indoeuro
3    English        SVO      369.9     978.20          TRUE
4      Hindi        SOV      342.2     258.30          TRUE
5 Portuguese        SVO      232.4      25.20          TRUE
6    Bengali        SOV      228.7      39.00          TRUE
7    Russian        SVO      153.7     104.30          TRUE
8   Japanese        SOV      126.3       0.12         FALSE
# View(df_name) N.B. the 'V' in View is capitalised
View(lang_df) # this will open a new tab in the Source pane and show the table of lang_df
# nrow(df_name)
nrow(lang_df) # returns total number of rows
[1] 8
# ncol(df_name)
ncol(lang_df) # returns total number of columns
[1] 5
# dim(df_name)
dim(lang_df) # returns total number of rows and columns 
[1] 8 5
# summary(df_name)
summary(lang_df) # returns a summary of values in each column
  lang_name          lang_order          lang_popL1      lang_popL2    
 Length:8           Length:8           Min.   :126.3   Min.   :  0.12  
 Class :character   Class :character   1st Qu.:209.9   1st Qu.: 35.55  
 Mode  :character   Mode  :character   Median :287.3   Median : 87.90  
                                       Mean   :355.7   Mean   :209.41  
                                       3rd Qu.:395.3   3rd Qu.:213.60  
                                       Max.   :921.2   Max.   :978.20  
 lang_indoeuro  
 Mode :logical  
 FALSE:2        
 TRUE :6        
                
                
                
# What does each of these functions do?
  5. Indexing a vector (column) in a dataframe
# What is the L1 language population of the first language (ie, Mandarin)?
lang_df$lang_popL1[1]
[1] 921.2
# What are the word orders of the first 5 languages?
lang_df$lang_order[1:5]
[1] "SVO" "SVO" "SVO" "SOV" "SVO"
# Is each L1 language population greater than 300 million?
lang_df$lang_popL1 > 300
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
# Is the word order of each language SVO?
lang_df$lang_order == "SVO"
[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
# What are the names of languages with L2 population lower than 100 million?
lang_df$lang_name[lang_df$lang_popL2 < 100]
[1] "Spanish"    "Portuguese" "Bengali"    "Japanese"  
# What are the names of languages with SVO order and L2 language population more than 100 million?
lang_df$lang_name[lang_df$lang_order == "SVO" & lang_df$lang_popL2 > 100]
[1] "Mandarin" "English"  "Russian" 
  6. Counting the number of observations satisfying a specific criterion
# How many languages have SVO order? What is the proportion?
sum(lang_df$lang_order == "SVO")
[1] 5
mean(lang_df$lang_order == "SVO")
[1] 0.625
# How many languages have L2 population greater than 200 million? What is the proportion?
sum(lang_df$lang_popL2 > 200)
[1] 2
mean(lang_df$lang_popL2 > 200)
[1] 0.25
