Data Science for Studying Language and the Mind
2023-09-05
Data importing
purrr - functional programming
tibble - modern data.frame
readr - reading data
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
ggplot2 - for data visualization
dplyr - for data wrangling
readr - for reading data
tibble - for modern data frames
stringr: for string manipulation
forcats: for dealing with factors
tidyr: for data tidying
purrr: for functional programming
Already installed on Google Colab’s R kernel:
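A minimal sketch of loading the tidyverse, assuming it is already installed (outside Colab you may need `install.packages("tidyverse")` first). `tidyverse_packages()` simply lists the member packages.

```r
# Attach the core tidyverse packages in one call
library(tidyverse)

# List every package that belongs to the tidyverse
pkgs <- tidyverse_packages()
pkgs
```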
Tidyverse makes use of tidy data, a standard way of structuring datasets:
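In the usual formulation (from R for Data Science), each variable is a column, each observation is a row, and each value is a cell. The sketch below reshapes a wide table into that form with `tidyr::pivot_longer()`. The data and column names are hypothetical, chosen only for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per participant, one column per condition
wide <- tibble(
  participant = c("p1", "p2"),
  congruent   = c(512, 480),
  incongruent = c(598, 555)
)

# Tidy form: each row is one observation (participant x condition)
tidy <- pivot_longer(
  wide,
  cols      = c(congruent, incongruent),
  names_to  = "condition",
  values_to = "rt_ms"
)
tidy
```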
Why tidy data?
purrr
Functional programming
to illustrate the joy of tidyverse and tidy data
purrr
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read.
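As a rough illustration of replacing a for loop, the sketch below computes a mean per list element three ways. The `rts` data are hypothetical. `map()` returns a list, while `map_dbl()` (one of the typed variants) returns a double vector.

```r
library(purrr)

# Hypothetical data: reaction times (ms) for three participants
rts <- list(p1 = c(512, 480, 530), p2 = c(598, 555), p3 = c(610, 640, 602))

# for-loop version: pre-allocate a vector, then fill it element by element
means_loop <- numeric(length(rts))
for (i in seq_along(rts)) {
  means_loop[i] <- mean(rts[[i]])
}

# map() applies mean() to each element and returns a list
means_list <- map(rts, mean)

# map_dbl() does the same but returns a named double vector
means_dbl <- map_dbl(rts, mean)
means_dbl
```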
map_*() functions
We say “functions” because there are 5, one for each type of vector:
map() - list
map_lgl() - logical
map_int() - integer
map_dbl() - double
map_chr() - character
map use case
tibble
modern data frames
tibble
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less and complain more.
tibble
Tibbles do less than data frames, in a good way:
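A short sketch of what "doing less" can look like in practice, based on documented tibble behavior: `tibble()` keeps column names as given and only recycles length-1 inputs. The column names and values are made up for illustration.

```r
library(tibble)

# data.frame() rewrites non-syntactic names; tibble() keeps them as given
df <- data.frame(`brain size` = 1:3)
tb <- tibble(`brain size` = 1:3)
names(df)  # "brain.size"
names(tb)  # "brain size"

# data.frame() silently recycles a length-2 vector to fill 4 rows
df_recycled <- data.frame(x = 1:4, y = 1:2)

# tibble() only recycles length-1 inputs, so the same call is an error
tb_error <- tryCatch(tibble(x = 1:4, y = 1:2), error = function(e) "error")
tb_ok    <- tibble(x = 1:4, y = 0)
```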
tibble
Coerce an existing object:
# A tibble: 4 × 2
      x y
  <int> <chr>
1     1 a
2     2 b
3     3 c
4     4 d
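Output like the above could come from a call along these lines. The input data frame here is an assumption, reconstructed from the printed columns.

```r
library(tibble)

# Assumed input: a base data.frame with an integer and a character column
df <- data.frame(x = 1:4, y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)

# as_tibble() coerces it; printing shows the dimensions and column types
tb <- as_tibble(df)
tb
```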
Pass vectors, one per column:
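A minimal sketch of building a tibble directly with `tibble()`. The values are illustrative.

```r
library(tibble)

# Each named argument is a vector that becomes one column
tb <- tibble(
  x = 1:4,
  y = c("a", "b", "c", "d")
)
tb
```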
tibble
Test whether an object x is a tibble or a data frame with is_tibble(x) and is.data.frame(x)
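A quick sketch of both checks on small example objects. Note that a tibble also counts as a data frame.

```r
library(tibble)

tb <- tibble(x = 1:4)
df <- data.frame(x = 1:4)

is_tibble(tb)       # TRUE
is.data.frame(tb)   # TRUE: a tibble is also a data.frame
is_tibble(df)       # FALSE
is.data.frame(df)   # TRUE
```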
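data.frame vs tibble
You will encounter 2 main differences:
- printing: tibbles report the type of each column (e.g. <dbl>, <chr>)
- subsetting: tibbles are stricter with [[ and $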
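A sketch of the subsetting difference, assuming the point is that tibbles are stricter than data frames. The `word` and `freq` columns are hypothetical.

```r
library(tibble)

df <- data.frame(word = c("cat", "dog"), freq = c(10.5, 3.2))
tb <- as_tibble(df)

# $ on a data.frame partially matches column names: "wo" finds "word"
df_partial <- df$wo

# $ on a tibble needs the full name; a partial name gives NULL with a warning
tb_partial <- suppressWarnings(tb$wo)

# [[ extracts a single column as a vector in both cases
tb[["freq"]]

# [ with one column: the data.frame drops to a vector, the tibble stays a tibble
df_col <- df[, "freq"]
tb_col <- tb[, "freq"]
```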
readr
reading data
readr
The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results.
read_*()
The read_*() functions have two important arguments:
file - the path to the file
col_types - a list of how each column should be converted to a specific data type
read_*()
read_csv(): comma-separated values (CSV)
read_tsv(): tab-separated values (TSV)
read_csv2(): semicolon-separated values
read_delim(): delimited files (CSV and TSV are important special cases)
read_fwf(): fixed-width files
read_table(): whitespace-separated files
read_log(): web log files
csv files
Path only, readr guesses types:
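A minimal sketch of both styles of call. The file and its columns are hypothetical, written to a temporary path so the example runs on its own. The explicit version uses `cols()` to build the column specification.

```r
library(readr)

# Hypothetical file: write a small CSV to a temporary path
path <- tempfile(fileext = ".csv")
writeLines(c("word,freq,is_noun", "cat,10.5,TRUE", "dog,3.2,TRUE", "run,7,FALSE"), path)

# Path only: readr guesses each column's type from the data
guessed <- read_csv(path)
spec(guessed)

# Path plus col_types: state the type of each column explicitly
typed <- read_csv(
  path,
  col_types = cols(
    word    = col_character(),
    freq    = col_double(),
    is_noun = col_logical()
  )
)
typed
```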
col_types column specification
There are 11 column types that can be specified:
col_logical() - reads as boolean TRUE FALSE values
col_integer() - reads as integer
col_double() - reads as double
col_number() - numeric parser that can ignore non-numbers
col_character() - reads as strings
col_factor(levels, ordered = FALSE) - creates factors
col_datetime(format = "") - creates date-times
col_date(format = "") - creates dates
col_time(format = "") - creates times
col_skip() - skips a column
col_guess() - tries to guess the column
Reading more complex file types requires functions outside the tidyverse:
readxl - see Spreadsheets in R for Data Science
googlesheets4 - see Spreadsheets in R for Data Science
DBI - see Databases in R for Data Science
jsonlite - see Hierarchical data in R for Data Science
Write to a .csv file with write_csv()
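A sketch of writing a tibble out with readr's `write_csv()` and reading it back. The data and the output path are hypothetical; `tempfile()` keeps the example from touching your working directory.

```r
library(readr)
library(tibble)

tb <- tibble(word = c("cat", "dog"), freq = c(10.5, 3.2))

# Hypothetical output path in a temporary directory
out_path <- tempfile(fileext = ".csv")
write_csv(tb, out_path)

# Read it back to check the round trip
round_trip <- read_csv(out_path, show_col_types = FALSE)
round_trip
```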
readr
Your dataset has a column that you expected to be logical or double, but there is a typo somewhere, so R has coerced the column into character.
Solve by specifying the column type col_double() and then using the problems() function to see where R failed.
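A sketch of that workflow on a hypothetical file with one bad value. Forcing the column to double turns the typo into `NA`, and `problems()` reports where parsing failed.

```r
library(readr)

# Hypothetical file with a typo ("3.2x") in a numeric column
path <- tempfile(fileext = ".csv")
writeLines(c("word,freq", "cat,10.5", "dog,3.2x", "run,7"), path)

# Force freq to double; the unparseable value becomes NA (with a warning)
dat <- suppressWarnings(
  read_csv(path, col_types = cols(word = col_character(), freq = col_double()))
)

# problems() lists where parsing failed and what was expected
probs <- problems(dat)
probs
```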
Your dataset has missing values, but they were not coded as NA as R expects.
Solve by adding an na argument (e.g. na = c("N/A")).
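A sketch using a hypothetical file where missing values are coded as "N/A". Passing the defaults ("" and "NA") alongside the extra code keeps the usual behavior as well.

```r
library(readr)

# Hypothetical file where a missing value is written as N/A
path <- tempfile(fileext = ".csv")
writeLines(c("word,freq", "cat,10.5", "dog,N/A", "run,7"), path)

# Treat "N/A" (plus the defaults "" and "NA") as missing
dat <- read_csv(path, na = c("", "NA", "N/A"), show_col_types = FALSE)
dat$freq
```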
Your dataset has column names that include spaces, breaking R’s naming rules. In these cases, R adds backticks (e.g. `brain size`).
We can use the rename() function to fix them.
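A sketch with a hypothetical file whose header contains a space. `dplyr::rename()` takes `new_name = old_name`, and the old name is wrapped in backticks.

```r
library(readr)
library(dplyr)

# Hypothetical file with a space in a column name
path <- tempfile(fileext = ".csv")
writeLines(c("species,brain size", "human,1350", "chimp,400"), path)

dat <- read_csv(path, show_col_types = FALSE)
names(dat)   # "species" "brain size"

# rename(new_name = old_name); backticks are needed because of the space
dat <- rename(dat, brain_size = `brain size`)
names(dat)
```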