R basics

Data Science for Studying Language and the Mind

Katie Schuler

2023-08-31

You are here

Data science with R
  • Hello, world!
  • R basics
  • Data importing
  • Data visualization
  • Data wrangling
Stats & Model building
  • Probability distributions
  • Sampling variability
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Feature engineering (preprocessing)
  • Inference for regression
  • Mixed-effect models

Learning resources

Basic concepts (review)

  • Expressions: fundamental building blocks of programming
  • Objects: allow us to store stuff, created with assignment operator
  • Names: names we give objects may contain only letters, numbers, ., or _
  • Attributes: allow us to attach arbitrary metadata to objects
  • Functions: take some input, perform some computation, and return some output
  • Environment: collection of all objects we defined in current R session
  • Packages: collections of functions, data, and documentation bundled together in R
  • Comments: notes you leave for yourself, not evaluated
  • Messages: notes R leaves for you (FYI, warning, error)
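
A minimal sketch tying several of these concepts together. The object name word_count and its "source" attribute are made up for illustration.

```r
# An expression: R evaluates it and prints the result
1 + 2

# An object, created with the assignment operator <-
word_count <- c(12, 7, 30)

# Attach arbitrary metadata to the object as an attribute
attr(word_count, "source") <- "hypothetical corpus"

# A function takes some input and returns some output
total <- sum(word_count)   # 49

# The environment now includes the objects defined above
ls()
```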

Important functions

Objects

  • str(x) - returns summary of object’s structure
  • typeof(x) - returns object’s data type
  • length(x) - returns object’s length
  • attributes(x) - returns list of object’s attributes
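
As a quick illustration, here are the four functions applied to one small named vector (the values are arbitrary):

```r
x <- c(a = 1.5, b = 2, c = 3.25)  # a named double vector

str(x)         # compact summary of the structure
typeof(x)      # "double"
length(x)      # 3
attributes(x)  # a list holding the names attribute
```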

Important functions

Environment

  • ls() - list all variables in environment
  • rm(x) - remove x variable from environment
  • rm(list = ls()) - remove all variables from environment
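
A small sketch of the same calls. To avoid wiping your own workspace, it works in a separate scratch environment (an assumption of this example; on the slides the calls act on the current session).

```r
scratch <- new.env()                      # a separate, empty environment
assign("a", 1, envir = scratch)
assign("b", 2, envir = scratch)

ls(scratch)                               # "a" "b"
rm("a", envir = scratch)                  # remove one variable
after_one <- ls(scratch)                  # "b"
rm(list = ls(scratch), envir = scratch)   # remove everything in scratch
after_all <- ls(scratch)                  # character(0)
```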

Important functions

Packages

  • install.packages() - install packages
  • library() - load package into current R session
  • data() - load data from package into environment
  • sessionInfo() - version info, packages for current R session
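
A sketch of a typical sequence, using the datasets package that ships with R so nothing has to be downloaded. The install line is commented out because it needs an internet connection.

```r
# install.packages("tidyverse")  # run once per machine
library(datasets)                # load an installed package
data("mtcars")                   # copy a packaged dataset into the environment
str(mtcars[1:3])                 # peek at the first three columns

info <- sessionInfo()            # R version and attached packages
info$R.version$version.string
```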

Important functions

Help

  • ?mean - get help with a function (same as help('mean'))
  • help.search('mean') - search help files for word or phrase (shorthand: ??mean)
  • help(package='tidyverse') - find help for a package

Vectors

Vectors

are fundamental data structures in R. There are two types:

  • atomic vectors - elements of the same data type
  • lists - elements refer to any object
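
To make the difference concrete, a short sketch with arbitrary values:

```r
atomic <- c(1, 2, 3)                       # every element is a double
mixed  <- list(1, "two", TRUE, c(4, 5))    # elements can be any object

typeof(atomic)          # "double"
typeof(mixed)           # "list"
sapply(mixed, typeof)   # "double" "character" "logical" "double"
```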

Atomic vectors

Atomic vectors can be one of six data types; the four most common are:

typeof(x)    examples
double       3, 3.32
integer      1L, 144L
character    "hello", 'hello, world!'
logical      TRUE, F

atomic because they must contain only one type
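
The two remaining types, complex and raw, are rarely needed in day-to-day data analysis. For completeness, a quick check:

```r
typeof(1 + 2i)        # "complex"
typeof(as.raw(255))   # "raw"
```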

Atomic vectors

double

typeof(3.34)
[1] "double"

integer

typeof(3L)
[1] "integer"

character

typeof('hello, world!')
[1] "character"

logical

typeof(TRUE)
[1] "logical"

Create a vector

with c() for concatenate

c(2,4,6)
[1] 2 4 6
c("hello", "world", "!")
[1] "hello" "world" "!"    
c(T, F, T)
[1]  TRUE FALSE  TRUE
c("hello", c(2, 3))
[1] "hello" "2"     "3"    

Create a vector

with sequences seq() or repetitions rep()

# sequences of integers have a special shorthand
6:10
[1]  6  7  8  9 10
# sequence from, to, by 
seq(from=3, to=5, by=0.5)
[1] 3.0 3.5 4.0 4.5 5.0
# rep(x, times = 1, each = 1): times repeats the whole vector
rep(c(1,0), times = 4)
[1] 1 0 1 0 1 0 1 0
# rep(x, times = 1, each = 1): each repeats every element in turn
rep(c(1,0), each = 4)
[1] 1 1 1 1 0 0 0 0

Check data type

with typeof(x) - returns the type of vector x

typeof(3)
[1] "double"
typeof(3L)
[1] "integer"
typeof("three")
[1] "character"
typeof(TRUE)
[1] "logical"

Check data type

with is.*(x) - returns TRUE if x has type *

is.double(3)
[1] TRUE
is.integer(3L)
[1] TRUE
is.character("three")
[1] TRUE
is.logical(TRUE)
[1] TRUE

Coercion, implicit

If you try to include elements of different types, R will coerce them into the same type without warning (implicit coercion)

x <- c(1, 2, "three", 4, 5 )
x
[1] "1"     "2"     "three" "4"     "5"    
typeof(x)
[1] "character"
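
Implicit coercion moves to the most flexible type present: logical, then integer, then double, then character. A short sketch:

```r
typeof(c(TRUE, 1L))    # "integer"
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"
c(TRUE, 2.5, "a")      # "TRUE" "2.5" "a"
```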

Coercion, explicit

You can also use explicit coercion to change a vector to another data type with as.*()

x <- c(1, 0, 1, 0)
as.logical(x)
[1]  TRUE FALSE  TRUE FALSE
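
A few more as.*() conversions. Unlike implicit coercion, a failed explicit conversion produces NA together with a warning, which is suppressed here only to keep the output tidy.

```r
as.integer("3")              # 3
as.numeric(c("1.5", "2"))    # 1.5 2.0
as.character(c(1, 2))        # "1" "2"

# Values that cannot be converted become NA (R warns about this)
y <- suppressWarnings(as.numeric(c("1", "two")))
y                            # 1 NA
```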

More complex structures

More complex structures

Some more complex data structures are built from atomic vectors by adding attributes:

Structure     Description
matrix        vector with dim attribute representing 2 dimensions
array         vector with dim attribute representing n dimensions
data.frame    a named list of vectors (of equal length) with attributes for names (column names), row.names, and class="data.frame"
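
One way to see that these structures really are vectors plus attributes is to add or inspect the attributes directly. A sketch with arbitrary values:

```r
v <- 1:6
dim(v) <- c(2, 3)        # adding a dim attribute turns the vector into a matrix
is.matrix(v)             # TRUE
attributes(v)            # the dim attribute: 2 3

df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
typeof(df)               # "list": a data.frame is a list underneath
names(attributes(df))    # includes names, class, and row.names
```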

Create more complex structures

matrix

matrix(0, nrow=2, ncol=3)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

data.frame

data.frame(x=c(1,2,3), y=c('a','b','c'))
  x y
1 1 a
2 2 b
3 3 c

array

array(0, dim=c(2,3,2))
, , 1

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

, , 2

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

Operations

Basic math operators

Operator    Operation
()          Parentheses
^           Exponent
*           Multiply
/           Divide
+           Add
-           Subtract

Basic math operations

follow the order of operations you expect (PEMDAS)

# multiplication takes precedence
2 + 3 * 10
[1] 32
# we can use parentheses to be explicit
(2 + 3) * 10 
[1] 50
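
A few cases where precedence can surprise, sketched with small numbers. When in doubt, parentheses make the intent explicit.

```r
-2^2         # -4: exponentiation happens before the unary minus
(-2)^2       # 4
2^3^2        # 512: ^ is evaluated right to left, i.e. 2^(3^2)
10 / 2 * 5   # 25: / and * share precedence and run left to right
```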

Comparison operators

Operator    Comparison
x < y       less than
x > y       greater than
x <= y      less than or equal to
x >= y      greater than or equal to
x != y      not equal to
x == y      equal to

Comparison operators

x <- 2
y <- 3


x < y
[1] TRUE
x > y 
[1] FALSE
x != y
[1] TRUE
x == y
[1] FALSE
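
One caution with == on doubles: decimal fractions are stored approximately, so an exact comparison can fail. A common workaround, sketched here, is all.equal(), which compares within a small tolerance.

```r
0.1 + 0.2 == 0.3                     # FALSE, because of rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE
```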

Logical operators

Operator    Operation
x | y       or
x & y       and
!x          not
any()       true if any element meets condition
all()       true if all elements meet condition
x %in% y    for each element of x, true if it appears in y

Logical operators

x <- TRUE
y <- FALSE


x | y
[1] TRUE
x & y 
[1] FALSE
!x 
[1] FALSE
any(c(x,y))
[1] TRUE
all(c(x,y))
[1] FALSE
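
%in% is not shown above, so here is a small sketch (the vectors are made up). It returns one logical value per element on its left-hand side, which combines naturally with any() and all().

```r
vowels <- c("a", "e", "i", "o", "u")
letters_seen <- c("b", "a", "t")

letters_seen %in% vowels         # FALSE TRUE FALSE
any(letters_seen %in% vowels)    # TRUE
all(letters_seen %in% vowels)    # FALSE
```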

Operations are vectorized

Almost all operations (and many functions) are vectorized

math

c(1, 2, 3) + c(4, 5, 6)
[1] 5 7 9
c(1, 2, 3) / c(4, 5, 6)
[1] 0.25 0.40 0.50
c(1, 2, 3) * 10 
[1] 10 20 30
c(1, 2, 30) > 10
[1] FALSE FALSE  TRUE

logical

x <- c(TRUE, FALSE, FALSE)
y <- c(TRUE, TRUE, FALSE)
z <- TRUE
x | y
[1]  TRUE  TRUE FALSE
x & y 
[1]  TRUE FALSE FALSE
x | z 
[1] TRUE TRUE TRUE
x & z 
[1]  TRUE FALSE FALSE
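
When the two vectors differ in length, R recycles the shorter one. That is why a single value such as 10 or z works above. A sketch:

```r
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24: the shorter vector is reused
c(1, 2, 3, 4) > 2           # FALSE FALSE TRUE TRUE
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.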

Operator coercion

Operators and functions will also coerce values when needed (and without warning)

5.6 + 2L
[1] 7.6
10 + FALSE 
[1] 10
log(1)
[1] 0
log(TRUE)
[1] 0
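
Because TRUE is coerced to 1 and FALSE to 0, summary functions can count and average logical vectors. A sketch with hypothetical trial outcomes:

```r
correct <- c(TRUE, FALSE, TRUE, TRUE)   # hypothetical trial outcomes

sum(correct)    # 3: number of TRUE values
mean(correct)   # 0.75: proportion of TRUE values
```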

Subsetting

Subsetting

Subsetting is a natural complement to str(). While str() shows you all the pieces of any object (its structure), subsetting allows you to pull out the pieces that you’re interested in. ~ Hadley Wickham, Advanced R

str()

x <- c("hello", "world", "!")
str(x)
 chr [1:3] "hello" "world" "!"
y <- c(1, 2, 3, 4, 5)
str(y)
 num [1:5] 1 2 3 4 5

Subsetting

There are three operators for subsetting objects:

  • [ - subsets (one or more) elements
  • [[ and $ - extracts a single element

Subset multiple elements with [

Code                Returns
x[c(1,2)]           positive integers select elements at specified indexes
x[-c(1,2)]          negative integers select all but elements at specified indexes
x[c("x", "y")]      select elements by name, if elements are named
x[]                 nothing returns the original object
x[0]                zero returns a zero-length vector
x[c(TRUE, TRUE)]    select elements where corresponding logical value is TRUE

Subset multiple elements with [

atomic vector

x <- c("hello", "world", "1")
x[c(1,2)]
[1] "hello" "world"
x[-c(1,2)]
[1] "1"
x[]
[1] "hello" "world" "1"    

data.frame

y <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )
y[c(1,2)]
  this that
1    1    a
2    2    b
3    3    c
y[-c(1,2)]
  theother
1        4
2        5
3        6
y[c("this")]
  this
1    1
2    2
3    3
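
The examples above cover positive and negative integers. Subsetting by name, by logical vector, and by zero can be sketched with a hypothetical named vector:

```r
scores <- c(ann = 90, bob = 72, cat = 85)   # hypothetical named vector

scores[c("ann", "cat")]         # select by name
scores[c(TRUE, FALSE, TRUE)]    # keep positions where the value is TRUE
scores[scores > 80]             # a comparison produces the logical vector
scores[0]                       # zero-length vector
```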

3 ways to extract a single element

Code           Returns
x[[2]]         the element at a single positive integer (index)
x[['name']]    the element with the given name (a single string)
x$name         the $ operator is a useful shorthand for x[['name']]

3 ways to extract a single element

atomic vector

x <- c("hello", "world", "1")
x[[1]]
[1] "hello"
x[[2]]
[1] "world"
x[[3]]
[1] "1"

data.frame

y <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )
y[[1]]
[1] 1 2 3
y[["that"]]
[1] "a" "b" "c"
y$that
[1] "a" "b" "c"
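
The difference between [ and [[ is easiest to see on a list: [ keeps the container, while [[ and $ return what is inside it. A sketch with a made-up list:

```r
participant <- list(id = 7, words = c("hello", "world"))  # hypothetical list

participant["words"]      # a list of length 1 that contains the vector
participant[["words"]]    # the character vector itself
participant$words         # same as [["words"]]
```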

R has many built-in functions

x <- c(1, -2, 3)

Some are vectorized

log(x)
[1] 0.000000      NaN 1.098612
abs(x)
[1] 1 2 3
round(x, 2)
[1]  1 -2  3

Some are not

mean(x)
[1] 0.6666667
max(x)
[1] 3
min(x)
[1] -2

Missing values

NA

  • used to represent missing or unknown elements in vectors
  • Note that NA is contagious: expressions including NA usually return NA
  • Check for NA values with is.na()

x <- c(1, NA, 3)
is.na(x)
[1] FALSE  TRUE FALSE
length(x)
[1] 3
mean(x)
[1] NA
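
Many summary functions accept na.rm = TRUE to drop missing values first. A short sketch of that and a few related idioms:

```r
x <- c(1, NA, 3)

mean(x, na.rm = TRUE)   # 2
sum(is.na(x))           # 1: number of missing values
NA > 1                  # NA: the comparison is unknown too
x[!is.na(x)]            # 1 3
```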

NULL

  • used to represent an empty or absent vector of arbitrary type
  • NULL is its own special type and always has length zero and NULL attributes
  • Check for NULL values with is.null()

x <- c()
is.null(x)
[1] TRUE
length(x)
[1] 0
mean(x)
[1] NA
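
A quick contrast between the two, as a sketch: NA holds a position inside a vector, whereas NULL is the absence of a vector and disappears when combined.

```r
length(c(1, NA, 3))     # 3: NA holds a place in the vector
length(c(1, NULL, 3))   # 2: NULL contributes nothing
typeof(NULL)            # "NULL"
typeof(NA)              # "logical"
```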

Programming

functions

are reusable pieces of code that take some input, perform some task or computation, and return an output

function(inputs){
    # do something
    return(output)
}
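
A concrete instance of the template above. The function name prop_true and its behaviour are invented for illustration.

```r
# Hypothetical helper: proportion of TRUE values, ignoring NA
prop_true <- function(x) {
    kept <- x[!is.na(x)]                  # drop missing values
    output <- sum(kept) / length(kept)    # TRUE counts as 1, FALSE as 0
    return(output)
}

prop_true(c(TRUE, FALSE, NA, TRUE))   # 0.6666667
```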

control flow

refers to managing the order in which expressions are executed in a program

  • if / else - if something is true, do this; otherwise do that (ifelse() is the vectorized version)
  • for loops - repeat code a specific number of times
  • while loops - repeat code as long as certain conditions are true
  • break - exit a loop early
  • next - skip to next iteration in a loop
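
A small sketch that exercises each construct once. The numbers are arbitrary and chosen only so the result is easy to trace by hand.

```r
total <- 0
for (i in 1:10) {
    if (i %% 2 == 0) {
        next              # skip even numbers
    } else if (i > 7) {
        break             # stop at the first odd number greater than 7
    }
    total <- total + i    # adds 1, 3, 5, 7
}
total                     # 16

n <- 1
while (n < 100) {
    n <- n * 2            # keep doubling until n is at least 100
}
n                         # 128

# ifelse() applies the same idea to every element of a vector
labels <- ifelse(c(2, 9) > 5, "high", "low")   # "low" "high"
```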

Subsetting quirks

If we have time

Notes on [ with higher dim objects

m <- matrix(1:6, nrow=2, ncol=3)
m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# separate dimensions by comma 
m[1, 2]
[1] 3
# an omitted dimension returns everything along that dimension
m[2, ]
[1] 2 4 6
m[ , 2]
[1] 3 4
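
Note that selecting a single row or column simplifies the result to a plain vector. If the matrix shape should be kept, drop = FALSE prevents that simplification, as sketched here:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

dim(m[2, ])                  # NULL: the row was simplified to a plain vector
dim(m[2, , drop = FALSE])    # 1 3: still a matrix
m[, c(1, 3)]                 # columns 1 and 3, as a 2 x 2 matrix
```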

Notes on [[ and $:

both [[ and [ work for extracting a single element from a vector; prefer [[ to make clear that exactly one element is expected

x <- c(1, -2, 3)
x[[1]]
[1] 1
x[1]
[1] 1

$ does partial matching without warning

df <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )
df[['theo']]
NULL
df$theo
[1] 4 5 6
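
For comparison, [[ matches names exactly by default but can be asked to match partially with exact = FALSE. A sketch using the same data frame:

```r
df <- data.frame(
        this = c(1, 2, 3),
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )

df[["theo", exact = FALSE]]   # 4 5 6: partial matching on request
df[["theother"]]              # 4 5 6: the full name always works
```

Setting options(warnPartialMatchDollar = TRUE) makes R warn whenever $ falls back to a partial match.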

Questions?

Have a great weekend!