R basics

Data Science for Studying Language and the Mind

Katie Schuler

2023-08-31

You are `here`

Data science with R

Hello, world!
R basics
Data importing
Data visualization
Data wrangling

Stats & Model building

Probability distributions
Sampling variability
Hypothesis testing
Model specification
Model fitting
Model accuracy
Model reliability

More advanced

Classification
Feature engineering (preprocessing)
Inference for regression
Mixed-effect models

Learning resources

Basic concepts (review)

Expressions: fundamental building blocks of programming
Objects: allow us to store stuff, created with assignment operator
Names: names we give objects must consist of letters, numbers, ., or _ and must start with a letter (or a . not followed by a number)
Attributes: allow us to attach arbitrary metadata to objects
Functions: take some input, perform some computation, and return some output
Environment: collection of all objects we defined in current R session
Packages: collections of functions, data, and documentation bundled together in R
Comments: notes you leave for yourself, not evaluated
Messages: notes R leaves for you (FYI, warning, error)
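A minimal sketch tying several of these concepts together. The object name `n_words` and the attribute name `unit` are made up for illustration.

```r
# a comment: R does not evaluate anything after the # sign
n_words <- 2 + 3                    # the expression 2 + 3 is evaluated and stored in an object
attr(n_words, "unit") <- "words"    # attach arbitrary metadata as an attribute
attr(n_words, "unit")               # "words"
n_chars <- nchar("hello")           # a function: takes input, returns output (5)
```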

Important functions

Objects

str(x) - returns summary of object’s structure
typeof(x) - returns object’s data type
length(x) - returns object’s length
attributes(x) - returns list of object’s attributes
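As a quick illustration, here are the four functions applied to one object. The vector `scores` and its element names are hypothetical.

```r
# hypothetical example object: a named numeric vector
scores <- c(a = 1.5, b = 2.5, c = 3.5)

str(scores)          # prints a compact summary of the structure
typeof(scores)       # "double"
length(scores)       # 3
attributes(scores)   # a list with one element, $names
```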

Important functions

Environment

ls() - list all variables in environment
rm(x) - remove x variable from environment
rm(list = ls()) - remove all variables from environment
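A small sketch of adding and removing variables, assuming it is run at the top level of an R session. The names `tmp_a` and `tmp_b` are invented for this example.

```r
# hypothetical variables created just for this illustration
tmp_a <- 1
tmp_b <- 2

"tmp_a" %in% ls()   # TRUE: tmp_a is listed in the environment
rm(tmp_a)           # remove a single variable
exists("tmp_a")     # FALSE: tmp_a is gone
exists("tmp_b")     # TRUE: tmp_b is untouched
```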

Important functions

Packages

install.packages() - install packages
library() - load package into current R session
data() - load data from package into environment
sessionInfo() - version info, packages for current R session

Important functions

Help

?mean - get help with a function
help.search('mean') - search help files for word or phrase (shorthand: ??mean)
help(package='tidyverse') - find help for a package

Vectors

are fundamental data structures in R. There are two types:

atomic vectors - elements of the same data type
lists - elements refer to any object

Atomic vectors

Atomic vectors can be one of six data types (the other two, complex and raw, are rarely needed); the four most common are:

`typeof(x)`	examples
double	3, 3.32
integer	1L, 144L
character	"hello", 'hello, world!'
logical	TRUE, F

atomic because they must contain only one type

Atomic vectors

double

typeof(3.34)

[1] "double"

integer

typeof(3L)

[1] "integer"

character

typeof('hello, world!')

[1] "character"

logical

typeof(TRUE)

[1] "logical"
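For completeness, a short sketch of the two atomic types that rarely come up in data analysis, complex and raw. The variable names are illustrative only.

```r
z <- 1 + 2i          # a complex number
b <- as.raw(255)     # a single raw byte
typeof(z)            # "complex"
typeof(b)            # "raw"
```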

Create a vector

with c() for concatenate

c(2,4,6)

[1] 2 4 6

c("hello", "world", "!")

[1] "hello" "world" "!"

c(T, F, T)

[1]  TRUE FALSE  TRUE

c("hello", c(2, 3))

[1] "hello" "2"     "3"

Create a vector

with sequences seq() or repetitions rep()

# sequences of integers have a special shorthand
6:10

[1]  6  7  8  9 10

# sequence from, to, by 
seq(from=3, to=5, by=0.5)

[1] 3.0 3.5 4.0 4.5 5.0

# rep(x, times = 1, each = 1)
rep(c(1,0), times = 4)

[1] 1 0 1 0 1 0 1 0

# rep(x, times = 1, each = 1)
rep(c(1,0), each = 4)

[1] 1 1 1 1 0 0 0 0

Check data type

with typeof(x) - returns the type of vector x

typeof(3)

[1] "double"

typeof(3L)

[1] "integer"

typeof("three")

[1] "character"

typeof(TRUE)

[1] "logical"

Check data type

with is.*(x) - returns TRUE if x has type *

is.double(3)

[1] TRUE

is.integer(3L)

[1] TRUE

is.character("three")

[1] TRUE

is.logical(TRUE)

[1] TRUE

Coercion, implicit

If you try to include elements of different types, R will coerce them into the same type without warning (implicit coercion)

x <- c(1, 2, "three", 4, 5 )
x

[1] "1"     "2"     "three" "4"     "5"

typeof(x)

[1] "character"
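Implicit coercion always moves toward the more flexible type, in the order logical, integer, double, character. A brief sketch, with variable names chosen only for illustration:

```r
lgl_int <- c(TRUE, 1L)        # logical + integer
int_dbl <- c(1L, 2.5)         # integer + double
dbl_chr <- c(2.5, "a")        # double + character
mixed   <- c(TRUE, 2.5, "a")  # all three at once

typeof(lgl_int)   # "integer"
typeof(int_dbl)   # "double"
typeof(dbl_chr)   # "character"
mixed             # "TRUE" "2.5" "a"
```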

Coercion, explicit

You can also use explicit coercion to change a vector to another data type with as.*()

x <- c(1, 0 , 1, 0)
as.logical(x)

[1]  TRUE FALSE  TRUE FALSE
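Explicit coercion can fail for individual elements. The sketch below shows that values R cannot convert become NA (with a warning, suppressed here so the example runs quietly). The vector contents are made up.

```r
chr <- c("1", "2", "three")
# "three" cannot be parsed as a number, so it becomes NA
num <- suppressWarnings(as.numeric(chr))
num                                  # 1 2 NA

int <- as.integer(c(1.9, -1.9))      # truncates toward zero: 1 -1
txt <- as.character(c(TRUE, FALSE))  # "TRUE" "FALSE"
```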

More complex structures

Some more complex data structures are built from atomic vectors by adding attributes:

Structure	Description
`matrix`	vector with `dim` attribute representing 2 dimensions
`array`	vector with `dim` attribute representing n dimensions
`data.frame`	a named list of vectors (of equal length) with attributes for `names` (column names), `row.names`, and `class="data.frame"`

Create more complex structures

matrix

matrix(0, nrow=2, ncol=3)

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

data.frame

data.frame(x=c(1,2,3), y=c('a','b','c'))

  x y
1 1 a
2 2 b
3 3 c

array

array(0, dim=c(2,3,2))

, , 1

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

, , 2

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
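To see that these structures really are vectors plus attributes, one can inspect or set the attributes directly. A minimal sketch, with illustrative variable names:

```r
v <- 1:6
dim(v) <- c(2, 3)        # adding a dim attribute turns the vector into a matrix
is.matrix(v)             # TRUE
attributes(v)            # $dim: 2 3

d <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
typeof(d)                # "list": a data.frame is a list of columns
names(attributes(d))     # includes "names", "class", and "row.names"
```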

Operations

Basic math operators

Operator	Operation
`()`	Parentheses
`^`	Exponent
`*`	Multiply
`/`	Divide
`+`	Add
`-`	Subtract

Basic math operations

follow the order of operations you expect (PEMDAS)

# multiplication takes precedence
2 + 3 * 10

[1] 32

# we can use parentheses to be explicit
(2 + 3) * 10

[1] 50

Comparison operators

Operator	Comparison
`x < y`	less than
`x > y`	greater than
`x <= y`	less than or equal to
`x >= y`	greater than or equal to
`x != y`	not equal to
`x == y`	equal to

Comparison operators

x <- 2
y <- 3

x < y

[1] TRUE

x > y

[1] FALSE

x != y

[1] TRUE

x == y

[1] FALSE

Logical operators

Operator	Operation
`x | y`	or
`x & y`	and
`!x`	not
`any()`	true if any element meets condition
`all()`	true if all elements meet condition
`x %in% y`	for each element of x, true if it appears in y

Logical operators

x <- TRUE
y <- FALSE

x | y

[1] TRUE

x & y

[1] FALSE

!x

[1] FALSE

any(c(x,y))

[1] TRUE

all(c(x,y))

[1] FALSE
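The `%in%` operator is not shown above, so here is a small sketch. The vectors `targets` and `pool` are hypothetical.

```r
targets <- c("a", "z", "c")
pool    <- c("a", "b", "c")

found <- targets %in% pool   # one result per element of targets
found                        # TRUE FALSE TRUE
any(found)                   # TRUE: at least one target is in pool
all(found)                   # FALSE: "z" is not in pool
```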

Operations are vectorized

Almost all operations (and many functions) are vectorized

math

c(1, 2, 3) + c(4, 5, 6)

[1] 5 7 9

c(1, 2, 3) / c(4, 5, 6)

[1] 0.25 0.40 0.50

c(1, 2, 3) * 10

[1] 10 20 30

c(1, 2, 30) > 10

[1] FALSE FALSE  TRUE

logical

x <- c(TRUE, FALSE, FALSE)
y <- c(TRUE, TRUE, FALSE)
z <- TRUE

x | y

[1]  TRUE  TRUE FALSE

x & y

[1]  TRUE FALSE FALSE

x | z

[1] TRUE TRUE TRUE

x & z

[1]  TRUE FALSE FALSE
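The cases above that pair a vector with a single value (`* 10`, `x | z`) work because R recycles the shorter vector to match the longer one. A short sketch of recycling with two longer vectors, using made-up values:

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is reused as 10, 20, 10, 20
recycled <- long + short
recycled   # 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.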

Operator coercion

Operators and functions will also coerce values when needed (and without warning)

5.6 + 2L

[1] 7.6

10 + FALSE

[1] 10

log(1)

[1] 0

log(TRUE)

[1] 0
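One practical use of this coercion is counting: because TRUE becomes 1 and FALSE becomes 0, `sum()` and `mean()` of a logical vector give a count and a proportion. The reaction times below are invented for illustration.

```r
rt <- c(350, 420, 610, 295)   # made-up reaction times in ms
slow <- rt > 400              # FALSE TRUE TRUE FALSE

n_slow <- sum(slow)           # 2: number of TRUE values
p_slow <- mean(slow)          # 0.5: proportion of TRUE values
```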

Subsetting

Subsetting is a natural complement to str(). While str() shows you all the pieces of any object (its structure), subsetting allows you to pull out the pieces that you’re interested in. ~ Hadley Wickham, Advanced R

str()

x <- c("hello", "world", "!")
str(x)

 chr [1:3] "hello" "world" "!"

y <- c(1, 2, 3, 4, 5)
str(y)

 num [1:5] 1 2 3 4 5

Subsetting

There are three operators for subsetting objects:

[ - subsets (one or more) elements
[[ and $ - extracts a single element

Subset multiple elements with `[`

Code	Returns
`x[c(1,2)]`	positive integers select elements at specified indexes
`x[-c(1,2)]`	negative integers select all but elements at specified indexes
`x[c("x", "y")]`	select elements by name, if elements are named
`x[]`	nothing returns the original object
`x[0]`	zero returns a zero-length vector
`x[c(TRUE, TRUE)]`	select elements where corresponding logical value is TRUE

Subset multiple elements with `[`

atomic vector

x <- c("hello", "world", "1")

x[c(1,2)]

[1] "hello" "world"

x[-c(1,2)]

[1] "1"

x[]

[1] "hello" "world" "1"
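The table also lists subsetting by name, by logical vector, and by zero, which the examples above do not cover. A minimal sketch using a hypothetical named vector `v`:

```r
v <- c(first = 10, second = 20, third = 30)   # hypothetical named vector

by_name <- v[c("first", "third")]     # select by name
by_lgl  <- v[c(TRUE, FALSE, TRUE)]    # select where the logical value is TRUE
by_test <- v[v > 15]                  # logical vector produced by a comparison
empty   <- v[0]                       # zero-length vector
```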

data.frame

y <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )

y[c(1,2)]

  this that
1    1    a
2    2    b
3    3    c

y[-c(1,2)]

  theother
1        4
2        5
3        6

y[c("this")]

  this
1    1
2    2
3    3

3 ways to extract a single element

Code	Returns
`[[2]]`	a single positive integer (index)
`[['name']]`	a single string
`x$name`	the `$` operator is a useful shorthand for `[['name']]`

3 ways to extract a single element

atomic vector

x <- c("hello", "world", "1")

x[[1]]

[1] "hello"

x[[2]]

[1] "world"

x[[3]]

[1] "1"

data.frame

y <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )

y[[1]]

[1] 1 2 3

y[["that"]]

[1] "a" "b" "c"

y$that

[1] "a" "b" "c"
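The difference between `[` and `[[` is easiest to see by checking what comes back. In this sketch (the data frame is a cut-down version of the one above), `[` keeps the container while `[[` returns the element inside it.

```r
y <- data.frame(this = c(1, 2, 3), that = c("a", "b", "c"))

kept      <- y["that"]      # still a data.frame with one column
extracted <- y[["that"]]    # the column itself, a character vector

class(kept)        # "data.frame"
class(extracted)   # "character"
```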

R has many built-in functions

x <- c(1, -2, 3)

Some are vectorized

log(x)

[1] 0.000000      NaN 1.098612

abs(x)

[1] 1 2 3

round(x, 2)

[1]  1 -2  3

Some are not

mean(x)

[1] 0.6666667

max(x)

[1] 3

min(x)

[1] -2

Missing values

NA

used to represent missing or unknown elements in vectors
Note that NA is contagious: expressions including NA usually return NA
Check for NA values with is.na()

x <- c(1, NA, 3)
is.na(x)

[1] FALSE  TRUE FALSE

length(x)

[1] 3

mean(x)

[1] NA
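Many summary functions accept an `na.rm` argument that drops missing values before computing. A brief sketch with the same kind of vector:

```r
x <- c(1, NA, 3)

m_all   <- mean(x)                 # NA: the missing value is contagious
m_clean <- mean(x, na.rm = TRUE)   # 2: NA removed first
n_miss  <- sum(is.na(x))           # 1: count of missing values
cmp     <- NA > 1                  # NA: comparisons with NA are also NA
```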

NULL

used to represent an empty or absent vector of arbitrary type
NULL is its own special type and always has length zero and NULL attributes
Check for NULL values with is.null()

x <- c()
is.null(x)

[1] TRUE

length(x)

[1] 0

mean(x)

[1] NA

Programming

functions

are reusable pieces of code that take some input, perform some task or computation, and return an output

function(inputs){
    # do something
    return(output)
}
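As a concrete instance of this template, here is a sketch of a small function. The name `rescale01` and its behavior are chosen for illustration only.

```r
# hypothetical function: rescale a numeric vector to the range 0 to 1
rescale01 <- function(x) {
    rng <- range(x, na.rm = TRUE)               # smallest and largest value
    return((x - rng[1]) / (rng[2] - rng[1]))    # shift and scale
}

rescaled <- rescale01(c(2, 4, 6))
rescaled   # 0.0 0.5 1.0
```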

control flow

refers to managing the order in which expressions are executed in a program

if…else - if something is true, do this; otherwise do that
for loops - repeat code a specific number of times
while loops - repeat code as long as certain conditions are true
break - exit a loop early
next - skip to next iteration in a loop
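The constructs in this list can be sketched together in a few lines. The task (summing odd numbers up to 7) is arbitrary and only meant to show each keyword in action.

```r
total <- 0
for (i in 1:10) {
    if (i %% 2 == 0) {
        next              # skip even numbers
    } else if (i > 7) {
        break             # exit the loop once an odd i exceeds 7
    }
    total <- total + i
}
total   # 1 + 3 + 5 + 7 = 16

n <- 0
while (n < 3) {           # repeat as long as the condition is TRUE
    n <- n + 1
}
n       # 3
```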

Subsetting quirks

If we have time

Notes on `[` with higher dim objects

m <- matrix(1:6, nrow=2, ncol=3)
m

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

# separate dimensions by comma 
m[1, 2]

[1] 3

# an omitted dimension returns everything along that dimension
m[2, ]

[1] 2 4 6

m[ , 2]

[1] 3 4
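Note that selecting a single row or column drops the result to a plain vector. A short sketch of how `drop = FALSE` keeps the matrix structure:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

as_vector <- m[2, ]                 # dimensions dropped
as_matrix <- m[2, , drop = FALSE]   # still a 1 x 3 matrix

dim(as_vector)   # NULL
dim(as_matrix)   # 1 3
```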

Notes on `[[` and `$`:

both [[ and [ can extract a single element from a vector; prefer [[

x <- c(1, -2, 3)
x[[1]]

[1] 1

x[1]

[1] 1

$ does partial matching without warning

df <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )

df[['theo']]

NULL

df$theo

[1] 4 5 6
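If partial matching is actually wanted, `[[` can do it when asked explicitly with `exact = FALSE`, which makes the intent visible in the code. A sketch using the same data frame:

```r
df <- data.frame(
        this = c(1, 2, 3),
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )

has_col <- "theo" %in% names(df)          # FALSE: no column has exactly this name
strict  <- df[["theo"]]                   # NULL: exact matching by default
partial <- df[["theo", exact = FALSE]]    # 4 5 6: partial matching requested
```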

Questions?

Have a great weekend!
