Data Science for Studying Language and the Mind
2024-09-10
ggplot2
here
Data wrangling
purrr - functional programming
tibble - modern data.frame
readr - reading data
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
ggplot2 - for data visualization
dplyr - for data wrangling
readr - for reading data
tibble - for modern data frames
stringr: for string manipulation
forcats: for dealing with factors
tidyr: for data tidying
purrr: for functional programming
Already installed on Google Colab’s R kernel:
Returns a message in Google Colab:
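Loading the packages is a single call. This is a minimal sketch; the startup message it prints (attached packages and name conflicts) depends on the installed versions, so it is not reproduced here.

```r
# Attach the core tidyverse packages in one call
library(tidyverse)
```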
Tidyverse makes use of tidy data, a standard way of structuring datasets:
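As a sketch of what this means in practice, assume the usual three conventions of tidy data: each variable is a column, each observation is a row, and each value is a cell. A wide table can then be reshaped into tidy form with tidyr's pivot_longer(). The scores data and its column names are hypothetical.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one column per test session
scores_wide <- tibble(
  participant = c("p1", "p2"),
  session_1   = c(10, 12),
  session_2   = c(14, 15)
)

# Tidy form: one row per participant-session observation
scores_tidy <- pivot_longer(
  scores_wide,
  cols      = starts_with("session_"),
  names_to  = "session",
  values_to = "score"
)
scores_tidy
```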
Why tidy data?
purrr
Functional programming
to illustrate the joy of tidyverse and tidy data
purrr
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read.
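To make the for-loop comparison concrete, here is a minimal sketch. The list `samples` is made up for illustration. The typed variants used here (map_dbl(), map_chr()) belong to the same family and return an atomic vector instead of a list.

```r
library(purrr)

# Hypothetical data: three numeric vectors in a list
samples <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# for-loop version: preallocate, then fill one element at a time
means_loop <- numeric(length(samples))
for (i in seq_along(samples)) {
  means_loop[i] <- mean(samples[[i]])
}

# map() returns a list; map_dbl() returns a double vector
means_list <- map(samples, mean)
means_dbl  <- map_dbl(samples, mean)

# map_chr() expects each call to return a single string
labels <- map_chr(samples, function(x) paste(x, collapse = "-"))
```

The map versions state the intent (apply mean to every element) without the bookkeeping of an index and a preallocated result.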
map_*() functions
We say “functions” because there are 5, one for each type of vector:
map() - list
map_lgl() - logical
map_int() - integer
map_dbl() - double
map_chr() - character
map use case
tibble
modern data frames
tibble
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less and complain more.
tibble
Tibbles do less than data frames, in a good way:
tibble
Coerce an existing object:
# A tibble: 4 × 2
      x y
  <int> <chr>
1     1 a
2     2 b
3     3 c
4     4 d
Pass a column of vectors:
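Both ways of creating a tibble can be sketched as follows. The coercion example is chosen to match the printed 4 × 2 output; the exact code used on the slides is not shown, so treat the inputs as an assumption.

```r
library(tibble)

# Coerce an existing data.frame to a tibble
df <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))
tbl_from_df <- as_tibble(df)

# Build a tibble directly from column vectors
tbl_direct <- tibble(x = 1:4, y = c("a", "b", "c", "d"))
```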
tibble
With is_tibble(x) and is.data.frame(x)
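A short sketch of the two checks, using a made-up data frame. A tibble passes both tests because it is still a data.frame underneath.

```r
library(tibble)

df  <- data.frame(x = 1:3)
tbl <- as_tibble(df)

is_tibble(df)        # FALSE: a plain data.frame is not a tibble
is_tibble(tbl)       # TRUE
is.data.frame(tbl)   # TRUE: a tibble is still a data.frame
```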
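data.frame v tibble
You will encounter 2 main differences:
printing - tibbles print a compact preview that includes each column's type (<dbl>, <chr>)
subsetting - tibbles are stricter about [[ and $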
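The printing and subsetting differences can be seen in a short sketch. The data frame below is hypothetical; the behaviour shown for `$` (partial matching on a data.frame, a warning and NULL on a tibble) is standard for base R and tibble.

```r
library(tibble)

df  <- data.frame(participant = 1:3, score = c(0.5, 0.7, 0.9))
tbl <- as_tibble(df)

# Printing: the tibble shows its dimensions and column types (<int>, <dbl>)
print(tbl)

# Subsetting with $: data.frames partially match names, tibbles do not
df$sco    # partial match returns the score column
tbl$sco   # NULL, with a warning about the unknown column

# [[ with the exact name returns the column as a vector
tbl[["score"]]
```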
readr
reading data
readr
The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results.
read_*()
The read_*() functions have two important arguments:
file - the path to the file
col_types - a list of how each column should be converted to a specific data type
read_*()
read_csv(): comma-separated values (CSV)
read_tsv(): tab-separated values (TSV)
read_csv2(): semicolon-separated values
read_delim(): delimited files (CSV and TSV are important special cases)
read_fwf(): fixed-width files
read_table(): whitespace-separated files
read_log(): web log files
csv files
Path only, readr guesses types:
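A minimal sketch of the path-only case. The file contents are made up and written to a temporary file so the example is self-contained; in practice `path` would point to your own .csv file.

```r
library(readr)

# Write a small, made-up CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ana,3.5", "2,Ben,4.0"), path)

# Path only: readr guesses a type for each column
students <- read_csv(path)

# Inspect the guessed column specification
spec(students)
```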
col_types column specification
There are 11 column types that can be specified:
col_logical() - reads as boolean TRUE FALSE values
col_integer() - reads as integer
col_double() - reads as double
col_number() - numeric parser that can ignore non-numbers
col_character() - reads as strings
col_factor(levels, ordered = FALSE) - creates factors
col_datetime(format = "") - creates date-times
col_date(format = "") - creates dates
col_time(format = "") - creates times
col_skip() - skips a column
col_guess() - tries to guess the column
Reading more complex file types requires functions outside the tidyverse:
readxl - see Spreadsheets in R for Data Science
googlesheets4 - see Spreadsheets in R for Data Science
DBI - see Databases in R for Data Science
jsonlite - see Hierarchical data in R for Data Science
Write to a .csv file with
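Writing is the mirror image of reading. This is a minimal sketch, assuming readr's write_csv() is the function meant here; the tibble and the temporary path are made up for illustration.

```r
library(readr)
library(tibble)

# Hypothetical data to save
ratings <- tibble(word = c("dog", "cat"), rating = c(4.5, 3.0))

# Write to a temporary .csv file, then read it back to check the round trip
out_path <- tempfile(fileext = ".csv")
write_csv(ratings, out_path)
read_csv(out_path, show_col_types = FALSE)
```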
readr
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          <NA>
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6
A column contains unexpected values (AGE)
Missing values are not coded as NA (AGE and favourite.food)
Column names include spaces (Student ID and Full Name)
Your dataset has a column that you expected to be logical or double, but there is a typo somewhere, so R has coerced the column into character.
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          <NA>
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6
Solve by specifying the column type col_double() and then using the problems() function to see where R failed.
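A minimal sketch of that fix on a made-up file with the same kind of typo. Only the AGE column is specified; the remaining columns are still guessed. The names and values are illustrative, not the course dataset.

```r
library(readr)

# Made-up CSV with a typo ("five") in a numeric column
path <- tempfile(fileext = ".csv")
writeLines(c("name,AGE", "Sunil,4", "Barclay,5", "Chidiegwu,five"), path)

# Force AGE to be parsed as double; the unparseable value becomes NA
students <- read_csv(path, col_types = cols(AGE = col_double()))

# problems() lists the rows and columns where parsing failed
problems(students)
```

read_csv() warns about the parsing issue, and problems() reports the row, the column, the expected type, and the actual value, which is usually enough to locate the typo in the raw file.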
NA
Your dataset has missing values, but they were not coded as NA as R expects.
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          <NA>
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6
Solve by adding an na argument (e.g. na = c("N/A")).
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          "4"
2            2 Barclay Lynn     French fries       Lunch only          "5"
3            3 Jayendra Lyne    <NA>               Breakfast and lunch "7"
4            4 Leon Rossini     Anchovies          Lunch only          ""
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch "five"
6            6 Güvenç Attila    Ice cream          Lunch only          "6"
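A sketch of the na argument on a made-up file. One detail worth knowing: na replaces readr's default of c("", "NA") rather than adding to it, which is why the empty AGE value prints as "" in the output above. Listing the empty string explicitly, as done here, is an addition to the slide's example.

```r
library(readr)

# Made-up CSV where a missing value is written as "N/A"
path <- tempfile(fileext = ".csv")
writeLines(c("name,food", "Sunil,Strawberry yoghurt", "Jayendra,N/A"), path)

# Default: "N/A" is read as an ordinary string
read_csv(path, show_col_types = FALSE)

# Treat both "N/A" and empty strings as missing
students <- read_csv(path, na = c("N/A", ""), show_col_types = FALSE)
students
```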
Your dataset has column names that include spaces, breaking R’s naming rules. In these cases, R adds backticks (e.g. `brain size`); we can use the rename() function to fix them.
# A tibble: 6 × 5
  student_id full_name        favourite.food     mealPlan            AGE
       <dbl> <chr>            <chr>              <chr>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2          2 Barclay Lynn     French fries       Lunch only          5
3          3 Jayendra Lyne    N/A                Breakfast and lunch 7
4          4 Leon Rossini     Anchovies          Lunch only          <NA>
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6          6 Güvenç Attila    Ice cream          Lunch only          6
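The renaming step can be sketched with dplyr's rename(), which takes pairs of the form new_name = old_name. The two-row tibble is made up; the new names match those in the output above.

```r
library(dplyr)
library(tibble)

# Made-up tibble with non-syntactic (space-containing) column names
students <- tibble(
  `Student ID` = 1:2,
  `Full Name`  = c("Sunil Huffmann", "Barclay Lynn")
)

# rename() takes new_name = old_name; backticks quote the old names
students_clean <- rename(
  students,
  student_id = `Student ID`,
  full_name  = `Full Name`
)
names(students_clean)
```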
If we have a lot to rename and that gets annoying, see janitor::clean_names().