Data Importing

Data Science for Studying Language and the Mind

Katie Schuler

2023-09-05

Problem Set 1

  • due Sunday 11:59pm
  • Get support by:
    • Asking specific Qs on Ed
    • Coming to office hours
    • Getting a pset buddy
  • But we will not:
    • Answer “is this correct?”
    • Give feedback on your entire pset before deadline
    • Go over the pset in lab

Last week

  • Basic concepts
  • Important functions
  • Vectors
  • Operations
  • Subsetting - stopped here
  • Built-in functions
  • Missing values
  • Programming

You are here

Data science with R
  • Hello, world!
  • R basics
  • Data importing
  • Data visualization
  • Data wrangling
Stats & Model building
  • Probability distributions
  • Sampling variability
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Feature engineering (preprocessing)
  • Inference for regression
  • Mixed-effect models

Overview for today

  • Tidyverse
  • Tidy data
  • purrr - functional programming
  • tibble - modern data.frame
  • readr - reading data

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Tidyverse package docs

Tidyverse

  • ggplot2 - for data visualization
  • dplyr - for data wrangling
  • readr - for reading data
  • tibble - for modern data frames
  • stringr - for string manipulation
  • forcats - for dealing with factors
  • tidyr - for data tidying
  • purrr - for functional programming

Figure 1: Tidyverse hex logos from www.tidyverse.org

Loading the tidyverse

Already installed on Google Colab’s R kernel:

library(tidyverse)

Message:

  • a list of packages loaded
  • a warning of potential name conflicts
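
The name conflicts matter in practice: the startup message reports that dplyr::filter() masks stats::filter(). A minimal sketch of how the package::function() syntax picks one or the other; the scores tibble is a toy example made up for illustration.

```r
library(tidyverse)

# A made-up tibble for illustration
scores <- tibble(id = 1:4, score = c(10, 20, 30, 40))

# After library(tidyverse), filter() refers to dplyr's version (keeps rows)
high <- filter(scores, score > 15)

# The masked version is still reachable with the package prefix;
# stats::filter() computes a moving average here, not a row subset
smoothed <- stats::filter(scores$score, rep(1 / 2, 2))
```

Writing dplyr::filter() explicitly also works and makes the choice visible to readers of your code.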

Tidy data

Tidyverse makes use of tidy data, a standard way of structuring datasets:

  1. each variable forms a column; each column forms a variable
  2. each observation forms a row; each row forms an observation
  3. each value is a cell; each cell is a single value
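
A small sketch of what these rules look like in practice. The data are invented, and pivot_longer() (from tidyr) is used here only as one possible way to reshape; it is not covered in this section.

```r
library(tidyverse)

# Untidy: the year variable is spread across column names
# (toy values, invented for illustration)
untidy <- tribble(
    ~participant, ~`2022`, ~`2023`,
    "p1",         10,      12,
    "p2",         8,       9
)

# Tidy: one column per variable (participant, year, score),
# one row per observation
tidy <- pivot_longer(
    untidy,
    cols = c(`2022`, `2023`),
    names_to = "year",
    values_to = "score"
)
```

In the tidy version, each participant-year combination is its own row, so a function like mean(tidy$score) operates on a single column.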

Tidy data

Figure 2: Visual of tidy data rules, from R for Data Science

Why tidy data?

  • Because consistency and uniformity are very helpful when programming
  • Variables as columns works well for vectorized languages (R!)

purrr

Functional programming

to illustrate the joy of tidyverse and tidy data

purrr

purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read.

purrr docs

The map_*() functions

  1. Take a vector as input
  2. Apply a function to each element
  3. Return a new vector

The map_*() functions

We say “functions” because there are 5, one for each type of vector:

  • map() - list
  • map_lgl() - logical
  • map_int() - integer
  • map_dbl() - double
  • map_chr() - character
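
A quick sketch of how the suffix controls the type of vector you get back. The words vector and the helper functions are toy examples, not part of the lecture data.

```r
library(tidyverse)

words <- c("language", "mind", "data")

# map() always returns a list
as_list <- map(words, nchar)

# map_int() returns an integer vector (nchar() returns integers)
lengths <- map_int(words, nchar)

# map_lgl() returns a logical vector
is_long <- map_lgl(words, function(w) nchar(w) > 4)

# map_chr() returns a character vector
shouted <- map_chr(words, toupper)

# map_dbl() returns a double vector
halved <- map_dbl(lengths, function(n) n / 2)
```

If the function returns something that does not fit the requested type, the typed variants raise an error instead of silently converting.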

map use case

df <- data.frame(
    x = 1:10,
    y = 11:20,
    z = 21:30
)

with copy+paste

mean(df$x)
[1] 5.5
mean(df$y)
[1] 15.5
mean(df$z)
[1] 25.5

with map

map(df, mean)
$x
[1] 5.5

$y
[1] 15.5

$z
[1] 25.5
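
For comparison, a sketch of the same computation written as a for loop, next to map_dbl(), which returns a double vector rather than a list. The variable names means_loop and means_map are just for illustration.

```r
library(tidyverse)

df <- data.frame(x = 1:10, y = 11:20, z = 21:30)

# With a for loop: pre-allocate the output, then fill it in
means_loop <- vector("double", ncol(df))
for (i in seq_along(df)) {
    means_loop[[i]] <- mean(df[[i]])
}

# With map_dbl(): the same values, as a named double vector
means_map <- map_dbl(df, mean)
```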

tibble

modern data frames

tibble

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less and complain more.

tibble docs

tibble

Tibbles do less than data frames, in a good way:

  • never change the type of the input (never convert strings to factors!)
  • never change the names of variables
  • only recycle vectors of length 1
  • never create row names
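
Two of these differences can be seen side by side. This is a sketch with made-up column names; the error from tibble() is caught with tryCatch() only so the example runs to the end.

```r
library(tidyverse)

# data.frame() rewrites non-syntactic names; tibble() keeps them
df  <- data.frame(`brain size` = 1:2)
tib <- tibble(`brain size` = 1:2)
names(df)   # "brain.size"
names(tib)  # "brain size"

# data.frame() recycles shorter vectors when lengths divide evenly;
# tibble() only recycles length-1 vectors and errors otherwise
df_recycled <- data.frame(x = 1:4, y = 1:2)
tib_error <- tryCatch(
    tibble(x = 1:4, y = 1:2),
    error = function(e) "error"
)
```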

Create a tibble

Coerce an existing object:

df <- data.frame(
    x = 1:4,
    y = c("a", "b", "c", "d")
)
as_tibble(df)
# A tibble: 4 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    

Pass column vectors:

tibble(
    x = 1:4,
    y = c("a", "b", "c", "d")
)
# A tibble: 4 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    

Define row-by-row:

tribble(
    ~x, ~y,
    "a", 1,
    "b", 2,
    "c", 3,
    "d", 4 
)
# A tibble: 4 × 2
  x         y
  <chr> <dbl>
1 a         1
2 b         2
3 c         3
4 d         4

Test if tibble

With is_tibble(x) and is.data.frame(x)

Data frame:

df <- data.frame(
    x = 1:4,
    y = c("a", "b", "c", "d")
)
is_tibble(df)
[1] FALSE
is.data.frame(df)
[1] TRUE

Tibble:

tib <- tribble(
    ~x, ~y,
    "a", 1,
    "b", 2,
    "c", 3,
    "d", 4 
)
is_tibble(tib)
[1] TRUE
is.data.frame(tib)
[1] TRUE

data.frame v tibble

You will encounter 2 main diffs:

  1. printing
    • by default, tibbles print the first 10 rows and all columns that fit on screen, making it easier to work with large datasets.
    • also report the type of each column (e.g. <dbl>, <chr>)
  2. subsetting - tibbles are more strict than data frames, which fixes two quirks we encountered last lecture when subsetting with [[ and $:
    • tibbles never do partial matching
    • they always generate a warning if the column you are trying to extract does not exist.
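
A minimal sketch of the partial-matching difference with $, using a made-up column name xyz. The warning from the tibble is suppressed here only so the result can be stored.

```r
library(tidyverse)

df  <- data.frame(xyz = 1:3)
tib <- as_tibble(df)

# data.frame: $ partially matches "xy" to the column "xyz"
df_result <- df$xy

# tibble: no partial matching; returns NULL (with a warning)
tib_result <- suppressWarnings(tib$xy)
```

The data frame quietly returns the xyz column, which can hide typos; the tibble returns NULL and tells you the column does not exist.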

readr

reading data

readr

The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results.

readr docs

Rectangular data

Figure 3: Sample csv file from R for Data Science

read_*()

The read_*() functions have two important arguments:

  1. file - the path to the file
  2. col_types - a list of how each column should be converted to a specific data type

7 supported file types, read_*()

  • read_csv(): comma-separated values (CSV)
  • read_tsv(): tab-separated values (TSV)
  • read_csv2(): semicolon-separated values
  • read_delim(): delimited files (CSV and TSV are important special cases)
  • read_fwf(): fixed-width files
  • read_table(): whitespace-separated files
  • read_log(): web log files
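
To try these without a file on disk, recent versions of readr (2.0 and later) accept literal text wrapped in I(). A sketch with invented data, showing that the functions differ mainly in the delimiter they expect:

```r
library(tidyverse)

# I() marks a string as literal data rather than a file path
csv_text  <- "id,word\n1,cat\n2,dog"
tsv_text  <- "id\tword\n1\tcat\n2\tdog"
semi_text <- "id;word\n1;cat\n2;dog"

from_csv   <- read_csv(I(csv_text), show_col_types = FALSE)
from_tsv   <- read_tsv(I(tsv_text), show_col_types = FALSE)
from_delim <- read_delim(I(semi_text), delim = ";", show_col_types = FALSE)
```

All three calls return a tibble with the same two columns. The show_col_types = FALSE argument just silences the column specification message.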

Read csv files

Path only, readr guesses types:

read_csv(file = "https://pos.it/r4ds-students-csv")

Path and specify col_types:

read_csv(
    file = "https://pos.it/r4ds-students-csv",
    # x and y are placeholders; use the column names in your file
    col_types = list( x = col_character(), y = col_skip() )
)

col_types column specification

There are 11 column types that can be specified:

  • col_logical() - reads as boolean TRUE FALSE values
  • col_integer() - reads as integer
  • col_double() - reads as double
  • col_number() - numeric parser that can ignore non-numbers
  • col_character() - reads as strings
  • col_factor(levels, ordered = FALSE) - creates factors
  • col_datetime(format = "") - creates date-times
  • col_date(format = "") - creates dates
  • col_time(format = "") - creates times
  • col_skip() - skips a column
  • col_guess() - tries to guess the column
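
A sketch of several column types used together. The data and column names are invented, and cols() is used here as a common alternative to a plain list() for building the specification.

```r
library(tidyverse)

# Toy data, invented for illustration
csv_text <- "id,group,score,notes\n1,a,3.5,ok\n2,b,4.0,late\n3,a,2.5,ok"

dat <- read_csv(
    I(csv_text),
    col_types = cols(
        id    = col_integer(),                      # read as integer, not double
        group = col_factor(levels = c("a", "b")),   # factor with fixed levels
        score = col_double(),
        notes = col_skip()                          # dropped from the result
    )
)
```

Columns not named in cols() fall back to col_guess().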

Reading more complex files

Reading more complex file types requires packages outside the core tidyverse (they are not loaded by library(tidyverse)):

  • excel with readxl - see Spreadsheets in R for Data Science
  • google sheets with googlesheets4 - see Spreadsheets in R for Data Science
  • databases with DBI - see Databases in R for Data Science
  • json data with jsonlite - see Hierarchical data in R for Data Science

Writing to a file

Write to a .csv file with

write_csv(students, "students.csv")
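
A sketch of a full round trip, assuming a small stand-in students tibble and a temporary file path rather than the real dataset:

```r
library(tidyverse)

# Stand-in data; the real students dataset is not reproduced here
students <- tibble(id = 1:3, name = c("Ana", "Ben", "Cy"))

path <- tempfile(fileext = ".csv")
write_csv(students, path)

# Column types are not stored in a CSV, so readr guesses again on read:
# the integer id column comes back as double
students_again <- read_csv(path, show_col_types = FALSE)
```

Because a CSV is plain text, any type information you set (integers, factors) has to be specified again with col_types when you read the file back in.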

Common problems readr

Column contains unexpected values

Your dataset has a column that you expected to be logical or double, but there is a typo somewhere, so R has coerced the column into character.

Solve by specifying the column type col_double() and then using the problems() function to see where R failed.
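
A sketch of that workflow on invented data, where one age value contains the letter O instead of a zero. The warning is suppressed only to keep the example quiet.

```r
library(tidyverse)

# Toy data with a typo ("3O" uses the letter O) in a numeric column
csv_text <- "id,age\n1,25\n2,3O\n3,41"

# Left to guess, readr reads age as character
guessed <- read_csv(I(csv_text), show_col_types = FALSE)

# Forcing col_double() turns the unparseable value into NA, with a warning
forced <- suppressWarnings(
    read_csv(I(csv_text), col_types = cols(age = col_double()))
)

# problems() reports where parsing failed and what was found there
problems(forced)
```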

Missing values are not NA

Your dataset has missing values, but they were not coded as NA as R expects.

Solve by adding an na argument (e.g. na=c("N/A"))
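
A sketch with invented data in which missing ages were entered as N/A:

```r
library(tidyverse)

csv_text <- "id,age\n1,25\n2,N/A\n3,41"

# Without na, "N/A" is an ordinary string and age becomes character
raw <- read_csv(I(csv_text), show_col_types = FALSE)

# With na, "N/A" is treated as missing and age is read as double;
# "" and "NA" are the defaults, kept here so they still count as missing
clean <- read_csv(
    I(csv_text),
    na = c("", "NA", "N/A"),
    show_col_types = FALSE
)
```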

Column names have spaces

Your dataset has column names that include spaces, breaking R’s naming rules. In these cases, R adds backticks (e.g. `brain size`).

We can use the rename() function to fix them.
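
A sketch of the fix on invented data. rename() comes from dplyr and takes new_name = old_name pairs; the new names chosen here are just one option.

```r
library(tidyverse)

csv_text <- "brain size,body weight\n1350,70\n1200,55"
brains <- read_csv(I(csv_text), show_col_types = FALSE)

# Names with spaces must be wrapped in backticks when referenced
first_size <- brains$`brain size`[1]

# rename(new_name = old_name) replaces them with syntactic names
brains <- rename(
    brains,
    brain_size  = `brain size`,
    body_weight = `body weight`
)
```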

Questions?