Data Science for Studying Language and the Mind

Katie Schuler

2023-08-31

`R basics`

- Data importing
- Data visualization
- Data wrangling

- Probability distributions
- Sampling variability
- Hypothesis testing
- Model specification
- Model fitting
- Model accuracy
- Model reliability

- Classification
- Feature engineering (preprocessing)
- Inference for regression
- Mixed-effect models

`Expressions`

: fundamental building blocks of programming

`Objects`

: allow us to store stuff, created with the assignment operator `<-`

`Names`

: names we give objects must consist of letters, numbers, `.`, or `_`

`Attributes`

: allow us to attach arbitrary metadata to objects

`Functions`

: take some input, perform some computation, and return some output

`Environment`

: collection of all objects we defined in current R session

`Packages`

: collections of functions, data, and documentation bundled together in R

`Comments`

: notes you leave for yourself, not evaluated

`Messages`

: notes R leaves for you (FYI, warning, error)

- `str(x)` - returns summary of object’s structure
- `typeof(x)` - returns object’s data type
- `length(x)` - returns object’s length
- `attributes(x)` - returns list of object’s attributes
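
The four inspection functions above can be tried on a small object. This is a minimal sketch; the vector `x` and its `units` attribute are made up for illustration.

```r
# A small double vector with one (illustrative) attribute attached
x <- c(1.5, 2.5, 3.5)
attr(x, "units") <- "seconds"

str(x)          # prints a compact summary of the structure
typeof(x)       # "double"
length(x)       # 3
attributes(x)   # a list with one element, $units
```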

- `ls()` - list all variables in environment
- `rm(x)` - remove x variable from environment
- `rm(list = ls())` - remove all variables from environment
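
A sketch of the same calls, run inside a throwaway environment so that trying it out does not clear your own session. The environment `e` and the object names are assumptions for illustration; in everyday use you would call `ls()` and `rm()` without the `envir` argument.

```r
# Scratch environment, so the example leaves your session untouched
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)

ls(envir = e)                         # "a" "b"
rm("a", envir = e)                    # remove one object
ls(envir = e)                         # "b"
rm(list = ls(envir = e), envir = e)   # remove everything that is left
n_left <- length(ls(envir = e))       # 0
```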

- `install.packages()` to install packages
- `library()` to load package into current R session
- `data()` to load data from package into environment
- `sessionInfo()` - version info, packages for current R session

- `?mean` - get help with a function
- `help('mean')` - same as `?mean`
- `help.search('mean')` or `??mean` - search help files for word or phrase
- `help(package='tidyverse')` - find help for a package

Vectors are fundamental data structures in R. There are two types:

- **atomic vectors** - elements of the same data type
- **lists** - elements refer to any object

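Atomic vectors can be one of six **data types**. The four most common are listed here; the other two, complex and raw, are rarely needed:

`typeof(x)` | examples |
---|---|
double | `3`, `3.32` |
integer | `1L`, `144L` |
character | `"hello"`, `'hello, world!'` |
logical | `TRUE`, `F` |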

*atomic* because they must contain only one type

Create vectors:

- with `c()` for `concatenate`
- with sequences `seq()` or repetitions `rep()`
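
A short sketch of the three ways of building vectors named above. The variable names are illustrative only.

```r
nums  <- c(2, 4, 6)                  # concatenate values into a double vector
words <- c("a", "b")                 # concatenate strings into a character vector

s1 <- seq(1, 10, by = 3)             # 1 4 7 10
s2 <- seq(0, 1, length.out = 5)      # 0.00 0.25 0.50 0.75 1.00

r1 <- rep(c("x", "y"), times = 2)    # "x" "y" "x" "y"
r2 <- rep(c("x", "y"), each = 2)     # "x" "x" "y" "y"
```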

Check the type of a vector:

- with `typeof(x)` - returns the type of vector x
- with `is.*(x)` - returns `TRUE` if x has type `*`
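
For example, with an integer vector (a minimal sketch; `x` is an illustrative name):

```r
x <- c(1L, 2L, 3L)

typeof(x)           # "integer"
is.integer(x)       # TRUE
is.double(x)        # FALSE
is.numeric(x)       # TRUE: is.numeric() covers both integer and double
is.character(x)     # FALSE
```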

If you try to include elements of different types, R will coerce them into the same type without warning (**implicit coercion**)
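
A sketch of implicit coercion in action. When types are mixed, `c()` converts everything to the most flexible type present (logical, then integer, then double, then character); the values are illustrative.

```r
typeof(c(TRUE, 1L))     # "integer"
typeof(c(1L, 2.5))      # "double"
typeof(c(1.5, "a"))     # "character"

mixed <- c(TRUE, 2L, 3.5, "four")
mixed                   # "TRUE" "2" "3.5" "four"
```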

You can also use **explicit coercion** to change a vector to another data type with `as.*()`
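
A few explicit conversions, as a minimal sketch. Unlike implicit coercion, a conversion that fails produces `NA` together with a warning; the warning is suppressed in the last line only to keep the example quiet.

```r
n1 <- as.numeric(c("1", "2.5"))     # 1.0 2.5
n2 <- as.integer(3.9)               # 3 (the fractional part is dropped)
n3 <- as.character(c(1L, 2L))       # "1" "2"
n4 <- as.logical(c(0, 1, 2))        # FALSE TRUE TRUE

# "abc" cannot be read as a number, so the result is NA (with a warning)
n5 <- suppressWarnings(as.numeric("abc"))
```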

Some more complex data structures are **built from atomic vectors** by adding **attributes**:

Structure | Description |
---|---|
`matrix` | vector with `dim` attribute representing 2 dimensions |
`array` | vector with `dim` attribute representing n dimensions |
`data.frame` | a named list of vectors (of equal length) with attributes for `names` (column names), `row.names`, and `class="data.frame"` |
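
The role of attributes can be seen by building these structures by hand. This is a sketch with made-up values; in practice you would normally call `matrix()`, `array()`, or `data.frame()` directly.

```r
# A plain vector becomes a matrix once it has a 2-element dim attribute
m <- 1:6
dim(m) <- c(2, 3)
is.matrix(m)             # TRUE
attributes(m)            # $dim is 2 3

# An array is the same idea with n dimensions
a <- array(1:24, dim = c(2, 3, 4))
dim(a)                   # 2 3 4

# A data frame is a list underneath, with names, row.names and class attributes
df <- data.frame(id = 1:3, word = c("a", "b", "c"))
typeof(df)               # "list"
names(attributes(df))    # includes "names", "class" and "row.names"
```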

Operator | Operation |
---|---|
`()` | Parentheses |
`^` | Exponent |
`*` | Multiply |
`/` | Divide |
`+` | Add |
`-` | Subtract |

Arithmetic operators follow the order of operations you expect (PEMDAS)
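
A few expressions that show the precedence rules at work (a minimal sketch with arbitrary numbers):

```r
2 + 3 * 4       # 14: multiplication before addition
(2 + 3) * 4     # 20: parentheses first
2 ^ 3 ^ 2       # 512: exponents group right to left, so this is 2 ^ 9
-2 ^ 2          # -4: the exponent is applied before the minus sign
10 / 4          # 2.5
```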

Operator | Comparison |
---|---|
`x < y` | less than |
`x > y` | greater than |
`x <= y` | less than or equal to |
`x >= y` | greater than or equal to |
`x != y` | not equal to |
`x == y` | equal to |
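
The comparison operators return logical values. The sketch below uses arbitrary numbers, and adds one caution about `==` with doubles, where `all.equal()` is a common base R alternative.

```r
x <- 3
y <- 5

x < y     # TRUE
x >= y    # FALSE
x != y    # TRUE
x == y    # FALSE

# Doubles are stored approximately, so exact equality can surprise you
0.1 + 0.2 == 0.3                     # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE
```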

Operator | Operation |
---|---|
`x | y` | or |
`x & y` | and |
`!x` | not |
`any()` | true if any element meets condition |
`all()` | true if all elements meet condition |
`%in%` | true for each element that is in following vector |
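
A sketch of the logical operators applied to a small vector (values chosen for illustration). Note that `%in%` returns one logical value per element on its left-hand side.

```r
x <- c(1, 5, 10)

x > 2 | x < 0          # FALSE  TRUE  TRUE
x > 2 & x < 8          # FALSE  TRUE FALSE
!(x > 2)               #  TRUE FALSE FALSE
any(x > 8)             # TRUE
all(x > 8)             # FALSE
x %in% c(5, 10, 20)    # FALSE  TRUE  TRUE
```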

Almost all operations (and many functions) are vectorized
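
Vectorized means the operation is applied to every element without writing a loop. A minimal sketch with illustrative values; the last line shows that a shorter vector is recycled to match a longer one.

```r
x <- c(1, 2, 3)

x * 2                        # 2 4 6
x + c(10, 20, 30)            # 11 22 33
sqrt(c(4, 9, 16))            # 2 3 4
c(1, 2, 3, 4) + c(10, 20)    # 11 22 13 24 (shorter vector is recycled)
```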


Operators and functions will also coerce values when needed (and without warning)
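
Some common cases of this silent coercion, as a sketch: logical values are treated as 0 and 1 in arithmetic, and a number compared with a string is first converted to a string.

```r
TRUE + TRUE                        # 2
sum(c(TRUE, FALSE, TRUE))          # 2: counts the TRUE values
mean(c(TRUE, FALSE, TRUE, TRUE))   # 0.75: proportion of TRUE values
typeof(1L + 0.5)                   # "double"
"1" == 1                           # TRUE: 1 is coerced to "1" before comparing
```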

Subsetting is a natural complement to str(). While str() shows you all the pieces of any object (its structure), subsetting allows you to pull out the pieces that you’re interested in. ~ Hadley Wickham, Advanced R


There are three operators for subsetting objects:

- `[` - *subsets* (one or more) elements
- `[[` and `$` - *extracts* a single element

`[`

Code | Returns |
---|---|
`x[c(1,2)]` | positive integers select elements at specified indexes |
`x[-c(1,2)]` | negative integers select all but elements at specified indexes |
`x[c("x", "y")]` | select elements by name, if elements are named |
`x[]` | nothing returns the original object |
`x[0]` | zero returns a zero-length vector |
`x[c(TRUE, TRUE)]` | select elements where corresponding logical value is TRUE |
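
Each row of the table can be tried on a small named vector. This is a sketch; the names and values are made up.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

x[c(1, 2)]           # elements a and b: 10 20
x[-c(1, 2)]          # everything except a and b: 30 40
x[c("a", "d")]       # by name: 10 40
x[]                  # the original vector
x[0]                 # a zero-length vector
x[c(TRUE, FALSE)]    # logical index is recycled, so a and c: 10 30
x[x > 15]            # a logical test as the index: 20 30 40
```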


Code | Returns |
---|---|
`[[2]]` | a single positive integer (index) |
`[['name']]` | a single string |
`x$name` | the `$` operator is a useful shorthand for `[['name']]` |
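
The difference between extracting with `[[` or `$` and subsetting with `[` is easiest to see on a list. A minimal sketch; the list `person` is hypothetical.

```r
person <- list(name = "Ada", scores = c(90, 85))

person[[2]]                 # 90 85 (the element itself)
person[["name"]]            # "Ada"
person$name                 # "Ada"

# `[` keeps the container: the result is still a list of length 1
typeof(person["name"])      # "list"
typeof(person[["name"]])    # "character"
```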

`NA`

- used to represent missing or unknown elements in vectors
- Note that `NA` is contagious: expressions including `NA` usually return `NA`
- Check for `NA` values with `is.na()`
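
A sketch of how `NA` propagates, and of why `is.na()` is needed rather than `==`. The values are illustrative; `na.rm = TRUE` is the standard argument of `mean()` for dropping missing values.

```r
x <- c(1, NA, 3)

x + 1                    # 2 NA 4
mean(x)                  # NA
NA == NA                 # NA, so == cannot be used to find missing values
is.na(x)                 # FALSE TRUE FALSE
mean(x, na.rm = TRUE)    # 2

# "usually": when the answer does not depend on the missing value, R returns it
TRUE | NA                # TRUE
```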

`functions` are reusable pieces of code that take some input, perform some task or computation, and return an output
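
As a sketch of what defining and calling a function looks like, here is a hypothetical helper, `rescale01()`, that rescales a numeric vector to the range 0 to 1. The name and the `na.rm` default are choices made for this example.

```r
# Hypothetical helper: rescale a numeric vector to lie between 0 and 1
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)       # input: smallest and largest value
  (x - rng[1]) / (rng[2] - rng[1])     # output: value of the last expression
}

scaled <- rescale01(c(0, 5, 10))       # 0.0 0.5 1.0
```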

`control flow` refers to managing the order in which expressions are executed in a program

- `if` … `else` - if something is true, do this; otherwise do that
- `for` loops - repeat code a specific number of times
- `while` loops - repeat code as long as certain conditions are true
- `break` - exit a loop early
- `next` - skip to next iteration in a loop
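
The five constructs above can be combined in a short sketch. The numbers and variable names are arbitrary.

```r
# if ... else
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# for loop with next and break
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next    # skip even numbers
  if (i > 7) break         # leave the loop once i passes 7
  total <- total + i       # adds 1, 3, 5, 7
}
total                      # 16

# while loop: keep doubling until n reaches at least 100
n <- 1
while (n < 100) {
  n <- n * 2
}
n                          # 128
```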

- `[` with higher dim objects
- `[[` and `$`: both `[[` and `[` work for vectors; use `[[` when extracting a single element
- `$` does partial matching without warning
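
A sketch of the first and last points: `[` takes one index per dimension on a matrix, and `$` silently completes a partial name while `[[` does not. The matrix `m` and the list `opts` are made up for illustration.

```r
# matrix() fills column by column: row 1 is 1 3 5, row 2 is 2 4 6
m <- matrix(1:6, nrow = 2)
m[1, 2]                 # 3 (row 1, column 2)
m[, 3]                  # 5 6 (all rows of column 3)
m[2, ]                  # 2 4 6 (all columns of row 2)
m[1, , drop = FALSE]    # stays a 1 x 3 matrix instead of dropping to a vector

# Partial matching
opts <- list(verbose = TRUE)
opts$ver                        # TRUE: `$` matched "ver" to "verbose"
opts[["ver"]]                   # NULL: `[[` requires the exact name
opts[["ver", exact = FALSE]]    # TRUE: partial matching only on request
```

Because `[[` matches names exactly by default, a misspelled name gives `NULL` rather than a value from a different element.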

