Author: Kayla Keyue Chen
Last updated: 2024.4.18
Learning objectives
Get familiar with the R interface: (1) Create, save, and open a
script. (2) Run your code. (3) Set the working directory.
Learn how to install and load packages. We will be using
“tidyverse” a lot because it offers many handy functions, and it also
contains many other useful packages so we don’t need to install and load
them one by one.
Differentiate between objects and functions. Learn how to create and
manipulate scalars, vectors, and dataframes in R. Use the Environment to
get an overview of all the objects you have. Read the Console to see the
effects of your code.
Get familiar with R interface
Create, save, and open a script
Create a script: File -> New File -> R Script. In
the Untitled1 file, type 1+2
in your script so we will have
something to run later.
Save: There are different ways to save a script: (1) File
-> Save. (2) Click on the floppy disk icon. (3) Shortcut:
Ctrl/Command + S. If it is the first time you save the script, you need
to give it a name.
Open an existing script: You can open a script from the
RStudio interface or directly from your folder. (1) To open a script in
RStudio, go to File -> Open File and choose the one you want to open. (2)
To open a script from your folder, just double-click on it.
Run your code
Run a line: Move the cursor to the line you want to run,
and press Ctrl/Command + Enter and R will run the current line and
automatically move the cursor to the next line.
Run multiple lines: Highlight the lines you want to run,
and press Ctrl/Command + Enter.
Run the entire script: First select all lines by pressing
Ctrl/Command + A, then run the code by pressing Ctrl/Command +
Enter.
You can find more keyboard shortcuts in RStudio’s online support
pages.
You can also click ‘Run’ in the top right corner of the Source window
to run selected lines.
Set working directory
It’s very useful to set the working directory at the beginning of your
script so that R knows where to load files from and where to save them.
There are several ways to set the working directory.
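- You can use setwd() to set the working directory and getwd() to check
the current one. Note: separate the folders in the path with a forward
slash / (on Windows, a single backslash \ will not work inside the
quotation marks because R treats it as an escape character).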
# set working directory
setwd("D:/PLINSTAT R for linguists/2024 materials/week1")
# get working directory
getwd()
You can also set working directory manually, from Session ->
Set Working Directory -> Choose Directory … -> choose the folder
you want to use in the pop-up window.
If you have saved the script in the folder that you want to set
as working directory, you can also use Session -> Set Working
Directory -> To Source File Location.
If you have opened the folder in the Files pane (the bottom-right
pane, Files tab), you can use Session -> Set Working Directory ->
To Files Pane Location.
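Whichever method you choose, you can confirm the result from the Console. The sketch below is only an illustration: it uses tempdir() (the temporary folder R creates for the session) as a stand-in for your own project folder, and old_wd is a name chosen here to remember the original location so it can be restored afterwards.

```r
# remember the current working directory (old_wd is a name chosen for this example)
old_wd <- getwd()

# switch to another folder; tempdir() stands in for your own project folder
setwd(tempdir())

# confirm where R is now looking for files
getwd()

# check whether a folder exists before pointing R at it
dir.exists(tempdir())

# go back to where we started
setwd(old_wd)
```

Checking a folder with dir.exists() first can save you from the error that setwd() gives when the path contains a typo.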
Install and load packages
An R package is simply a bunch of data (functions, help menus) stored
in one neat package. Different packages have different functions that
can be very useful and save a lot of trouble! You need to install (only
for the first time) and load the package if you want to use the
functions offered by the package.
Install packages using install.packages("name")
where
“name” is the name of the package. Note that you need to use quotation
marks around the package name.
Load a package using library(name)
where “name” is the
name of the package. Here quotation marks are not needed.
# install tidyverse (you only need to install a package once)
install.packages("tidyverse")
# load tidyverse (you need to do this every time you open R)
library(tidyverse)
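Because a package is installed only once but loaded in every session, it can help to check whether it is already installed before loading it. The following is a minimal sketch, assuming the base R function requireNamespace() as the check; pkg and is_installed are names chosen for this example.

```r
# name of the package we want to use
pkg <- "tidyverse"

# TRUE if the package is already installed on this computer, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  # load it for this session; character.only = TRUE lets library() accept a string
  library(pkg, character.only = TRUE)
} else {
  # not installed yet: print a reminder instead of failing
  message("Run install.packages(\"", pkg, "\") once, then load it with library().")
}
```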
Create and manipulate objects in R
Objects and functions
Imagine we have a bunch of values, e.g. 2, 4, 6, 9, 13, 25. These are
our objects. Now imagine we want to combine these objects into one
sequence (this is called a vector). To combine objects into a
vector, we use the ‘combine’ function c(). It looks like this:
c(x, y, z), where c() is our function and x, y, z are placeholders for
our objects. You can put as many objects into this function as you want.
# Now try the function c(x, y, z, ...) to combine 2, 4, 6, 9, 13, 25
c(2, 4, 6, 9, 13, 25)
[1] 2 4 6 9 13 25
# Check the Console: What happened? -- the code returns 2 4 6 9 13 25
# Check the Environment: Did anything happen? -- nothing
Creating and saving objects
Now let’s look at how to create and save an object. Saving an object
stores it in your workspace under a name, so you can reuse it later. To
create and save an object:
Type the name you want to give your object (e.g. pizza)
Type this symbol: <-
Type the function for the object you wish to create, e.g.,
c(2, 4, 6, 9, 13, 25)
Press Ctrl/Command + Enter to run the code
After you run the code, you will see an object called pizza in the
Environment pane under the Values section. It has “num [1:6] 2 4 6 9 13
25” in the second column. “num” means that this vector contains numeric
data. “[1:6]” shows that there are 6 numbers in this vector, and then it
shows the first several numbers for a preview.
# Now type the code to create the object pizza
pizza <- c(2, 4, 6, 9, 13, 25)
# Check the Console: What happened? -- the code didn't return the values
# Check the Environment: Did anything happen? -- new object pizza in the Values section
# Now, call your object by typing the name, then Ctrl/Command + Enter
pizza
[1] 2 4 6 9 13 25
# Check the Console: What do you see? -- the code returns the values 2 4 6 9 13 25
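The Environment pane has Console equivalents, which can be handy when you want to check your objects from code. A short sketch using base R functions (pizza is re-created so the example runs on its own):

```r
# create the object again so this example is self-contained
pizza <- c(2, 4, 6, 9, 13, 25)

# list the names of the objects you have created
ls()

# is there an object called pizza?
"pizza" %in% ls()

# how many values does pizza hold, and of what type?
length(pizza)
class(pizza)

# str() prints a compact summary similar to the line in the Environment pane
str(pizza)
```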
Scalars and operations
Numeric scalar
A scalar is an atomic quantity that holds only one value at a time. We
can create numeric scalars. For example, let’s create a scalar called
x with the value 4 and another scalar called y with the value 7.
# Create x
x <- 4
# Create y
y <- 7
# Check your Environment!
We can check what kind of scalar a given object is by using the
class() function. For example, class(x) will return the type of the
object x, which is “numeric”.
# check the class of x
class(x)
[1] "numeric"
# what kind of scalar is x?
We can perform arithmetic operations on our scalars, e.g., adding
(+), subtracting (-), multiplying (*), dividing (/), raising to a power
(use ^ followed by a number, e.g., ^2 to square), and taking the square
root (use the function sqrt()).
# try the following arithmetic operations, use any number you like
# adding
2+4
[1] 6
# subtracting
1-100
[1] -99
# multiplying
4*23
[1] 92
# dividing
88/8
[1] 11
# raising to a power (here 5 cubed; use ^2 to square)
5^3
[1] 125
# taking the square root
sqrt(10)
[1] 3.162278
# combination of different operations
sqrt((1+3)/2*9)^2
[1] 18
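To see how R evaluates the last expression, it can help to break it into steps. In this sketch, step1 to step4 are names made up for the illustration. Division and multiplication have the same priority and are evaluated from left to right, and ^2 is applied to the result of sqrt().

```r
step1 <- 1 + 3          # parentheses first: 4
step2 <- step1 / 2 * 9  # left to right: 4 / 2 is 2, then 2 * 9 is 18
step3 <- sqrt(step2)    # square root of 18
step4 <- step3^2        # squaring undoes the square root

step4

# the result is a decimal number, so compare it with all.equal() rather than ==,
# which can fail because of tiny rounding differences
all.equal(step4, 18)
```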
Logical scalar
We can also create logical scalars (i.e. TRUE and FALSE). For example,
let’s create a scalar m that is defined as x > y, a scalar n that
is defined as x < y, and a scalar p that is defined as x = y.
In R, the test “is equal to” is written as == (a single = assigns a
value instead of comparing). Note: x and y keep the values we assigned
to them earlier (x is 4 and y is 7).
# Define m
m <- x>y
# Define n
n <- x<y
# Define p
p <- x==y
# Check the Environment: what are the values of m, n, and p?
# m is FALSE, n is TRUE, and p is FALSE
# What kind of scalar is m? (hint: use class() function)
class(m)
[1] "logical"
Logical operators include AND (&) and OR (|). For example, “x > y and
x < y” can be written as x > y & x < y or as m & n; “x > y or x < y”
can be written as x > y | x < y or as m | n.
# Evaluate the code given as examples in the instruction
x > y & x < y
[1] FALSE
m & n
[1] FALSE
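The code above only evaluates the AND examples. The OR versions can be checked in the same way. In this sketch x, y, m and n are re-created so the block runs on its own.

```r
x <- 4
y <- 7
m <- x > y   # FALSE
n <- x < y   # TRUE

# OR is TRUE when at least one side is TRUE
x > y | x < y
m | n

# AND is TRUE only when both sides are TRUE
m & n
```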
Character scalar
We can make character/string scalars. We must use quotation
marks (either single or double) to indicate that they are
characters. Create a character hello world
and save it to
an object named mystring
.
# Create mystring
mystring <- "hello world"
# what kind of scalar is 'mystring'? (hint: use class() function)
class(mystring)
[1] "character"
We can’t use arithmetic operations on character scalars. Note that when
a number is surrounded by quotation marks, it becomes a character. For
example, 1 is a number, but “1” is a character. You can use the class()
function to test if this is true.
# Try this
string1 <- "1" # ATTENTION: This is a character scalar because we put quotation marks!
string2 <- "2"
# Try adding them and see what happens :)
string1 + string2 # Error in string1 + string2 : non-numeric argument to binary operator
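The difference between a number and a character that merely looks like a number can be checked with class(). The sketch below also uses as.numeric(), a base R function that is not otherwise covered in this lab, as one possible way to turn such characters back into numbers before adding them.

```r
# the same digit, with and without quotation marks
class(1)
class("1")

string1 <- "1"
string2 <- "2"

# convert the characters to numbers first, then add them
as.numeric(string1) + as.numeric(string2)
```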
Vectors and operations
Create a vector
We can combine scalars of the same data type into larger objects:
these are called vectors. Use the c() function, which was
introduced earlier.
# creates vector a which contains values 1,2,3,4 (hint: a <- c(x,y,z,...))
a <- c(1,2,3,4)
# creates vector b containing numbers from 1 to 10
b <- c(1,2,3,4,5,6,7,8,9,10)
Three useful ways to create vectors that follow a pattern:
To create consecutive integers, you can use : within the c()
function. For example, c(1:10) returns the numbers from 1 to 10.
seq(from, to, by) creates a vector with a sequence of
values. “from” defines the starting point, “to” defines the ending
point, and “by” defines the step. For example,
seq(from=1,to=5,by=1) creates a vector containing 1,2,3,4,5.
rep(x, times, each) creates a vector with repeated values.
“x” defines the vector to be repeated, “times” defines how many times
the entire vector is repeated, and “each” defines how many times each
element in the vector is repeated. For example,
rep(c(1,2,3),times=2,each=2) creates a new vector containing
1,1,2,2,3,3,1,1,2,2,3,3.
# generate numbers from 1 to 100 using c()
c(1:100)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
# create vector c with sequence of values from 1, to 20, by 2 (an object named c works, but it is easy to confuse with the function c())
c <- seq(from=1,to=20,by=2)
# create vector d with vector (2,4,6) repeated 2 times, with each element repeated 3 times
d <- rep(c(2,4,6),times=2,each=3)
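Because the vectors c and d above are saved without being printed, here is a sketch of what the three approaches return. The names one_to_ten, odd_numbers and repeated are chosen for this example only.

```r
# : creates consecutive integers; wrapping it in c() is optional
one_to_ten <- 1:10
identical(one_to_ten, c(1:10))

# from 1 to 20 in steps of 2; the sequence stops at 19 because 21 would go past "to"
odd_numbers <- seq(from = 1, to = 20, by = 2)
odd_numbers

# "each" is applied first (2 2 2 4 4 4 6 6 6), then the whole result is repeated "times" times
repeated <- rep(c(2, 4, 6), times = 2, each = 3)
repeated
length(repeated)
```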
Arithmetic operations with vectors
# Make a simple numeric vector (e)
e <- c(1,2,3)
# Make another one (f)
f <- c(5,7,9)
# Make another one (g)
g <- c(5,7,9,11)
# Check the Environment to see the length of the vectors
# e and f have 3 numbers, g has 4 numbers
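We can do arithmetic operations on vectors too! For operations between
two vectors, they should have the same length, or the length of the
longer one should be a multiple of the length of the shorter one.
Otherwise R still reuses the shorter vector to fill the gap, but it
gives a warning.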
# adds 6 to each value in e
e+6
[1] 7 8 9
# squaring each value of f
f^2
[1] 25 49 81
# adding e and f (each corresponding value in the sequence is summed)
e+f
[1] 6 9 12
# multiplying e and f (each corresponding value in the sequence is multiplied)
e*f
[1] 5 14 27
# Note: If you want to save these results to your Environment as vectors, you need to give them a name
# Now, try adding e and g, what happened?
e+g # returns the result together with a warning: longer object length is not a multiple of shorter object length
[1] 6 9 12 12
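The warning appears because R reuses ("recycles") the shorter vector to match the longer one, and 4 is not a multiple of 3. A short sketch of recycling when the lengths do fit; the vector h is introduced here only for illustration.

```r
g <- c(5, 7, 9, 11)

# h has 2 values and g has 4, so h is used exactly twice: 1 2 1 2
h <- c(1, 2)
h + g

# a single number is recycled for every element
g + 6
```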
Logical operations with vectors
We can also do logical operations with vectors. For example, we can
evaluate whether each item in the vector is larger than 10 by
vector_name > 10
.
# remember that we have vector b which contains integer numbers from 1 to 10
# find out whether each item in vector b is larger than 6
b > 6
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
# find out whether each item in vector b is larger than 5 AND smaller than 8
b > 5 & b < 8
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
The %in% operator lets you combine multiple OR comparisons to check
whether given values are present in a set. For example, say you have a
vector of favourite letters,
fave.letters <- c('a', 't', 'a', 'b', 'z'). To check whether each
letter in fave.letters is one of ‘a’, ‘b’ or ‘c’, you can type
fave.letters %in% c('a', 'b', 'c'). The result will be
TRUE FALSE TRUE TRUE FALSE, i.e., only the 2nd letter ‘t’ and
the 5th letter ‘z’ are not one of ‘a’, ‘b’ or ‘c’.
This is helpful when you want to filter your participants based on
their responses. For example, only keep participants whose favourite
letter is one of ‘a’, ‘b’ or ‘c’.
# find out whether each item in vector b is in the set 1,7,13
b %in% c(1,7,13)
[1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# find out whether each item in vector b is in the vector a
b %in% a
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
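The favourite-letters example from the text can be run as follows. The sketch also writes out the chain of OR comparisons that %in% replaces and uses the logical result to keep only the matching letters; in_set and or_chain are names chosen for this example.

```r
fave.letters <- c('a', 't', 'a', 'b', 'z')

# one %in% test ...
in_set <- fave.letters %in% c('a', 'b', 'c')
in_set

# ... gives the same answer as three == comparisons joined by OR
or_chain <- fave.letters == 'a' | fave.letters == 'b' | fave.letters == 'c'
identical(in_set, or_chain)

# keep only the letters that are in the set
fave.letters[in_set]
```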
Note that because FALSE = 0 and TRUE = 1, we can sum()
the logical vectors to find out the number of items that satisfy the
operation, and use mean()
to find out the proportion of
items that satisfy the operation.
# find out the number of items in vector b that are larger than 6
sum(b > 6)
[1] 4
# find out the number of items in vector b that are in the set 3,5,0
sum(b %in% c(3,5,0))
[1] 2
# find out the proportion of items in vector b that are larger than 5 AND smaller than 8
mean(b > 5 & b < 8)
[1] 0.2
Indexing a vector
You can ask R to return specific items from a vector by typing the
vector name followed by [] and putting the position (index) of the item
inside the brackets. Counting starts at 1. For example, to get
the 2nd item in vector a, you write a[2]. To return
multiple items, you can use c() to combine all the indices you
are interested in.
# remember that we have vector b which contains integer numbers from 1 to 10
# Get the 3rd item in vector b
b[3]
[1] 3
# Get the 2nd and 4th item in vector b
b[c(2,4)]
[1] 2 4
# Get the 2nd to 8th item in vector b
b[c(2:8)]
[1] 2 3 4 5 6 7 8
We can also use logical operations inside the indexing brackets [].
For example, if we want to know which items in a vector are smaller than
5, we can write vector_name[vector_name < 5]. This works
because vector_name < 5 returns a vector of logical
values (i.e., FALSEs and TRUEs), and indexing with [] then
returns the items at the positions that are TRUE.
The which() function returns the indices where the value
is TRUE. We can write which(vector_name < 5) to return
the indices of the numbers that are smaller than 5.
# find out what items in vector b are smaller than 4
b[b<4]
[1] 1 2 3
# find out what items in vector b are larger than 4 but smaller than 6
b[b>4 & b<6]
[1] 5
# find out the indices of the items in vector b that are smaller than 4
which(b<4) # note that because b contains integer numbers from 1 to 10, the indices are the same as number values.
[1] 1 2 3
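Since the values in b happen to equal their positions, the difference between logical indexing and which() is easier to see with other numbers. In this sketch, scores is a made-up vector used only for illustration.

```r
scores <- c(12, 3, 8, 1, 20)

# the values that are smaller than 5
scores[scores < 5]

# the positions of those values
which(scores < 5)

# indexing with the positions gives the same values back
scores[which(scores < 5)]
```

Here the values returned are 3 and 1, while their positions are 2 and 4.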
Dataframes
To create a dataframe, you need to specify the column names and the
values in each column. For example, let’s create a dataframe called
df1 with two columns, “sex” and “age”. We have 3 males, who are
99, 46 and 23 years old, and 2 females, who are 54 and 23 years
old. Use the data.frame() function. Each argument takes
the format “column name = c(value1, value2, value3, …)”, e.g., “age =
c(99,46,23,54,23)”.
# Create df1 using the data.frame() function
df1 <- data.frame(sex = c("male","male","male","female","female"),
age = c(99,46,23,54,23))
# Now call on your dataframe: what does R return?
df1
sex age
1 male 99
2 male 46
3 male 23
4 female 54
5 female 23
Your column/vector names are your variables. You call on specific
variables within a dataframe with the dollar symbol $,
e.g., df1$sex (this means you extract a vector from
the dataframe!).
# Extract columns sex and age
df1$sex
[1] "male" "male" "male" "female" "female"
df1$age
[1] 99 46 23 54 23
To get column names, you can use the colnames(dataframe)
function. You can change a specific column name using the function
rename(dataframe, new name = old name) (Note:
rename() comes from the dplyr package, which is loaded as part of
tidyverse. Remember to load the package using library() first!).
Another option is to use
colnames(dataframe) <- c("xxx", "xxx", ...) to change
ALL the column names at once.
#library(tidyverse) if you haven't
# Get the column names of df1
colnames(df1)
[1] "sex" "age"
# change column name "sex" to "gender" using rename() (this prints the renamed dataframe; df1 itself is not changed because the result is not assigned)
rename(df1, "gender" = "sex")
gender age
1 male 99
2 male 46
3 male 23
4 female 54
5 female 23
# change column names to "GENDER" and "AGE" using colnames()
colnames(df1) <- c("GENDER", "AGE")
# call on df1 to see the change
df1
GENDER AGE
1 male 99
2 male 46
3 male 23
4 female 54
5 female 23
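If you only want to change one column name without loading tidyverse, one possible base R approach is to combine colnames() with the logical indexing introduced earlier. This is a sketch on a small dataframe, df2, built only for this example so that df1 is not affected.

```r
# a small dataframe built only for this example
df2 <- data.frame(sex = c("male", "female"),
                  age = c(46, 54))

# find the position of the column called "sex" and replace only that name
colnames(df2)[colnames(df2) == "sex"] <- "gender"

# check the result
colnames(df2)
```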
Factor variable
We can set the sex column (now named GENDER) to a factor variable using
the as.factor() function. Once it is a factor variable,
it is categorical and new observations can only take one of the specified
levels. By default the levels are arranged in alphabetical order. You
can manually specify the order using the function factor() (no
“as”!) with the argument levels = c("xxx", "xxx", ...).
Remember that you must assign the result back to the object to save the
change! For example, a+1 returns the values of a with 1 added,
but a itself won’t change,
whereas a <- a+1 will update a.
# Turn GENDER column into a factor variable using as.factor()
df1$GENDER <- as.factor(df1$GENDER)
# Set male to be level1 and female to be level2 using factor()
df1$GENDER <- factor(df1$GENDER, levels = c("male", "female"))
# Call on the GENDER column and see what levels are there in the factor (hint: use the dollar symbol $)
df1$GENDER
[1] male male male female female
Levels: male female
# Add a 6th observation to the column. Can you add "female"? Can you add "other"?
# first we extract the vector GENDER <- df1$GENDER
# then try GENDER[6] <- "female" and then GENDER[6] <- "other", What happened?
GENDER <- df1$GENDER
GENDER[6] <- "female" # works fine
GENDER[6] <- "other" # returns Warning: invalid factor level, NA generated, "other" is replaced with <NA>
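To inspect a factor more closely, the base R functions levels(), as.integer() and table() can help. They are not introduced elsewhere in this lab, so treat the following as an optional sketch; gender is a stand-alone copy of the column created for this example.

```r
# a factor with a manually specified level order
gender <- factor(c("male", "male", "male", "female", "female"),
                 levels = c("male", "female"))

# the allowed categories, in the order we specified
levels(gender)

# behind the scenes each value is stored as the number of its level
as.integer(gender)

# count how many observations fall into each level
table(gender)
```

Because "male" is the first level, it is stored as 1 and "female" as 2, which is why the order of the levels matters.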
Let’s practice!
Create a vector for each of the columns in the table (see the
last slide for Lab1)
Tip 1: use the combine function c()
to create the
vectors
Tip 2: assign vectors to column names, e.g., column_name =
c(value1, value2, …)
lang_name <- c('Mandarin', 'Spanish', 'English', 'Hindi', 'Portuguese', 'Bengali', 'Russian', 'Japanese')
lang_order <- c('SVO', 'SVO', 'SVO', 'SOV', 'SVO', 'SOV', 'SVO', 'SOV')
lang_popL1 <- c(921.2, 471.4, 369.9, 342.2, 232.4, 228.7, 153.7, 126.3)
lang_popL2 <- c(198.7, 71.5, 978.2, 258.3, 25.2, 39.0, 104.3, 0.12)
lang_indoeuro <- c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
Combine the vectors into a dataframe
lang_df <- data.frame(lang_name, lang_order, lang_popL1, lang_popL2,lang_indoeuro)
Have a look at the dataframe
# Call on the entire dataframe
lang_df
lang_name lang_order lang_popL1 lang_popL2 lang_indoeuro
1 Mandarin SVO 921.2 198.70 FALSE
2 Spanish SVO 471.4 71.50 TRUE
3 English SVO 369.9 978.20 TRUE
4 Hindi SOV 342.2 258.30 TRUE
5 Portuguese SVO 232.4 25.20 TRUE
6 Bengali SOV 228.7 39.00 TRUE
7 Russian SVO 153.7 104.30 TRUE
8 Japanese SOV 126.3 0.12 FALSE
# Call on specific columns (vectors) in your dataframe
lang_df$lang_name
[1] "Mandarin" "Spanish" "English" "Hindi" "Portuguese"
[6] "Bengali" "Russian" "Japanese"
lang_df$lang_popL1
[1] 921.2 471.4 369.9 342.2 232.4 228.7 153.7 126.3
# return all column names
colnames(lang_df)
[1] "lang_name" "lang_order" "lang_popL1" "lang_popL2"
[5] "lang_indoeuro"
- Try executing the following functions on your dataframe and see what
the functions return.
# head(df_name)
head(lang_df) # returns first several rows
lang_name lang_order lang_popL1 lang_popL2 lang_indoeuro
1 Mandarin SVO 921.2 198.7 FALSE
2 Spanish SVO 471.4 71.5 TRUE
3 English SVO 369.9 978.2 TRUE
4 Hindi SOV 342.2 258.3 TRUE
5 Portuguese SVO 232.4 25.2 TRUE
6 Bengali SOV 228.7 39.0 TRUE
# tail(df_name)
tail(lang_df) # returns last several rows
lang_name lang_order lang_popL1 lang_popL2 lang_indoeuro
3 English SVO 369.9 978.20 TRUE
4 Hindi SOV 342.2 258.30 TRUE
5 Portuguese SVO 232.4 25.20 TRUE
6 Bengali SOV 228.7 39.00 TRUE
7 Russian SVO 153.7 104.30 TRUE
8 Japanese SOV 126.3 0.12 FALSE
# View(df_name) N.B. the 'V' in View is capitalised
View(lang_df) # this will open a new tab in the Source pane and show the table of lang_df
# nrow(df_name)
nrow(lang_df) # returns total number of rows
[1] 8
# ncol(df_name)
ncol(lang_df) # returns total number of columns
[1] 5
# dim(df_name)
dim(lang_df) # returns total number of rows and columns
[1] 8 5
# summary(df_name)
summary(lang_df) # returns a summary of values in each column
lang_name lang_order lang_popL1 lang_popL2
Length:8 Length:8 Min. :126.3 Min. : 0.12
Class :character Class :character 1st Qu.:209.9 1st Qu.: 35.55
Mode :character Mode :character Median :287.3 Median : 87.90
Mean :355.7 Mean :209.41
3rd Qu.:395.3 3rd Qu.:213.60
Max. :921.2 Max. :978.20
lang_indoeuro
Mode :logical
FALSE:2
TRUE :6
# What does each of these functions do?
- Indexing a vector (column) in dataframe
# What is the L1 language population of the first language (ie, Mandarin)?
lang_df$lang_popL1[1]
[1] 921.2
# What are the word orders of the first 5 languages?
lang_df$lang_order[1:5]
[1] "SVO" "SVO" "SVO" "SOV" "SVO"
# Is each L1 language population greater than 300 million?
lang_df$lang_popL1 > 300
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
# Is the word order of each language SVO?
lang_df$lang_order == "SVO"
[1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
# What are the names of languages with L2 population lower than 100 million?
lang_df$lang_name[lang_df$lang_popL2 < 100]
[1] "Spanish" "Portuguese" "Bengali" "Japanese"
# What are the names of languages with SVO order and L2 language population more than 100 million?
lang_df$lang_name[lang_df$lang_order == "SVO" & lang_df$lang_popL2 > 100]
[1] "Mandarin" "English" "Russian"
- Counting the number of observations that satisfy a specific
criterion
# How many languages have SVO order? What is the proportion?
sum(lang_df$lang_order == "SVO")
[1] 5
mean(lang_df$lang_order == "SVO")
[1] 0.625
# How many languages have L2 population greater than 200 million? What is the proportion?
sum(lang_df$lang_popL2 > 200)
[1] 2
mean(lang_df$lang_popL2 > 200)
[1] 0.25
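The same tools can be combined for further questions about the table. The sketch below rebuilds three of the columns so it runs on its own, and the questions it asks are additional examples rather than part of the original exercise.

```r
# rebuild the columns needed for this example
lang_df <- data.frame(
  lang_name = c('Mandarin', 'Spanish', 'English', 'Hindi',
                'Portuguese', 'Bengali', 'Russian', 'Japanese'),
  lang_popL1 = c(921.2, 471.4, 369.9, 342.2, 232.4, 228.7, 153.7, 126.3),
  lang_indoeuro = c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

# a logical column can be used directly as an index:
# names of the Indo-European languages
lang_df$lang_name[lang_df$lang_indoeuro]

# total L1 population (in millions) of the Indo-European languages
sum(lang_df$lang_popL1[lang_df$lang_indoeuro])

# how many of the languages are in this set of three names?
sum(lang_df$lang_name %in% c('Hindi', 'Bengali', 'Japanese'))
```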
