Data wrangling
These notes are adapted from Ch 5 Data tidying, Ch 7 Data import and Ch 4 Data transformation in R for Data Science
Materials from lecture
- Tuesday: slides, demo, my_chicks.csv
- Thursday: slides and demo
1 Tidy
1.1 Welcome to the tidyverse
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ~ Tidyverse package docs
The tidyverse
collection of packages includes:
ggplot2
- for data visualizationdplyr
- for data wranglingreadr
- for reading datatibble
- for modern data framesstringr
: for string manipulationforcats
: for dealing with factorstidyr
: for data tidyingpurrr
: for functional programminglubridate
: for working with dates and times
We load the tidyverse
like any other package, with library(tidyverse)
. When we do, we will receive a message with (1) a list packages that were loaded and (2) a warning that there are potential conflicts with base R’s stats
functions
- We can resolve conflicts with the
::
operator, which allows us to specify which package our intended function belongs to as a prefix:stats::filter()
ordplyr::filter()
1.2 What is tidy data?
The same underlying data can be represented in a table in many different ways; some easier to work with than others. The tidyverse makes use of tidy data principles to make datasets easier to work with in R. Tidy data provides a standard way of structuring datasets:
- each variable forms a column; each column forms a variable
- each observation forms a row; each row forms an observation
- value is a cell; each cell is a single value
Why is tidy data easier to work with?
- Because consistency and uniformity are very helpful when programming
- Variables as columns works well for vectorized languages (R!)
1.3 Functional programming with purrr
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. ~ purrr docs
Let’s illustrate the joy of the tidyverse with one of its packages: purrr
. The docs say that the best place to start is the family of map()
functions, so we’ll do that.
The map()
functions:
- take a vector as input
- apply a function to each element
- return a new vector
We say “functions” because there are 5, the generic map()
function and map_*()
variants for each type of vector:
map()
map_lgl()
map_int()
map_dbl()
map_chr()
To illustrate, suppose we have a data frame df
with 3 columns and we want to compute the mean of each column. We could solve this with copy-and-paste (run mean()
3 different times) or try to use a for
loop, but map()
can do this with just one line:
# We use `map_dbl()` because `mean()` returns a *double* value
map_dbl(df, mean)
1.4 Modern data frames with tibble
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less and complain more ~ tibble docs
Tibbles do less than data frames, in a good way:
- never changes type of input (never converts strings to factors!)
- never changes the name of variables
- only recycles vectors of length 1
- never creates row names
You can read more in the tibble vignette if you are interested, but understanding these differences is not necessary to be successful in the course. The take-away is that data.frame
and tibble
sometimes behave differently. The behavior of tibble
makes more sense for modern data science, so we should us it instead!
Create a tibble
with one of the following:
# (1) coerce an existing object (e.g., a data frame) to tibble
as_tibble(x)
# (2) construct a tibble from a column of vectors
tibble(x=1:5, y=1)
# (3) define row-by-row, short for transposed tibble
tribble(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)
We will encounter two main ways tibbles and data frames differ:
- printing - by default, tibbles print the first 10 rows and all columns that fit on screen, making it easier to work with large datasets. Tibbles also report the type of each column (e.g.
<dbl>
,<chr>
) - subsetting - tibbles are more strict than data frames, which fixes two quirks we encountered last lecture when subsetting with
[[
and$
: (1) tibbles never do partial matching, and (2) they always generate a warning if the column you are trying to extract does not exist.
To test if something is a tibble
or a data.frame
:
is_tibble(x)
is.data.frame(x)
2 Import
Often we want to read in some data we’ve generated or collected outside of R. The most basic and common format is plain-text rectangular files. We will “read” these into R with the readr
package.
2.1 Reading data with readr
The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results.
We will use readr
’s read_*()
functions to read in our rectangular data files.
The read_*()
functions have two important arguments:
file
- the path to the file (that reader will try to parse)col_types
- column specification, a description of how each column should be converted from a character vector to a specific data type
There are 7 supported file types, each with their own read_*()
function:
read_csv()
: comma-separated values (CSV)read_tsv()
: tab-separated values (TSV)read_csv2()
: semicolon-separated valuesread_delim()
: delimited files (CSV and TSV are important special cases)read_fwf()
: fixed-width filesread_table()
: whitespace-separated filesread_log()
: web log files
To read .csv
files, include a path and (optionally) a column specification in col_types
:
# (1) pass only the path; readr guesses col_types
read_csv(file='path/to/file.csv')
# (2) include a column specification with col_types
read_csv(
file='path/to/file.csv',
col_types = list( x = col_string(), y = col_skip() )
)
With no column specification, readr
uses the the first 1000 rows to guess with a simple heuristic:
- if column contains only T/F,
logical
- if only numbers,
double
- if ISO8601 standard,
date
ordate-time
- otherwise
string
There are 11 column types that can be specified:
col_logical()
- reads as boolean TRUE FALSE valuescol_integer()
- reads as integercol_double()
- reads as doublecol_number()
- numeric parser that can ignore non-numberscol_character()
- reads as stringscol_factor(levels, ordered = FALSE)
- creates factorscol_datetime(format = "")
- creates date-timescol_date(format = "")
- creates datescol_time(format = "")
- creates timescol_skip()
- skips a columncol_guess()
- tries to guess the column
Some useful additional arguments:
- if there is no header (the top row containing column names), include
col_names = FALSE
- to provide a header, include
col_names = c("x","y","z")
- to skip some lines, include
skip = n
, where n is number of lines to skip - to select which columns to import, include
col_select(x, y)
- to guess column types with all rows, include
guess_max = Inf
Sometimes weird things happen. The most common problems are:
- column contains unexpected values - your dataset has a column that you expected to be
logical
ordouble
, but there is a typo somewhere, so R has coerced the column intocharacter
. Solve by specifying the column typecol_double()
and then using theproblems()
function to see where R failed. - missing values are not
NA
- your dataset has missing values, but they were not coded asNA
as R expects. Solve by adding anna
argument (e.g.na=c("N/A")
) - column names have spaces - your dataset has column names that include spaces, breaking R’s naming rules. In these cases, R adds backticks (e.g.
`brain size`
); we can use therename()
function to fix them. If we have a lot to rename and that gets annoying, seejanitor::clean_names()
.
Reading more complex file types requires functions outside the tidyverse:
- excel with
readxl
- see Spreadsheets in R for Data Science - google sheets with
googlesheets4
- see Spreadsheets in R for Data Science - databases with
DBI
- see Databases in R for Data Science - json data with
jsonlite
- see Hierarchical data in R for Data Science
2.2 Writing data
We can also write to a csv file with readr
:
write_csv(our_tibble, "name_of_file.csv")
We pass the name of our tibble and the name of the file as arguments. In Google Colab, files we write appear in the left side bar file menu.
3 Data transformation
Why do we need to transform data?
Visualization is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need to make the graph you want. Often you’ll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.
3.1 Data transformation with dplyr
All dplyr
functions (verbs) share a common structure:
- 1st argument is always a data frame
- Subsequent arguments typically describe which columns to operate on (via their names)
- Output is always a new data frame
We can group dplyr
functions based on what they operate on:
- rows - see section 3 Manipulating rows
- columns - see section 4 Manipulating columns
- groups - see section 5 Grouping and summarizing data frames
- tables - see section 6 Joining data frames
We can easily combine dplyr
functions to solve complex problems:
- The pipe operator,
|>
takes the output from one function and passes it as input (the first argument) to the next function. - There is another version of the pipe,
%>%
. See the reading on data transformation if you are curious about the difference.
In lecture, we will demonstrate with the 3 most common dplyr
functions for manipulating rows, manipulating columns, and grouping. But you should feel comfortable reading the docs/resources to use others to solve unique problems.
3.2 Manipulating rows
filter()
filters rows, allowing you to keep only some rows based on the values of the columns.
- the first argument is a data frame (all dplyr verbs)
- subsequent arguments are the conditions that must be true to keep the row (using R’s logical and comparison operators we learned in R basics!), e.g.
filter(age > 18)
- a common filtering mistake is to use
=
instead of the logical operator==
!
arrange()
arranges the rows in the order you specify based on column values (does not change the number of rows, just changes their order)
- the first argument is a data frame (all dplyr verbs)
- subsequent arguments are a set of column names to order by
- note that the default order is ascending, but you can specify descending by wrapping the column in the
desc()
function
distinct()
finds unique rows in a dataset, but you can also provide column names
- the first argument is a data frame
- optionally subsequent arguments provides column names to find the distinct combination of some variables
- note that if you provide column names, distinct will only return those columns unless you add the argument
.keep_all=TRUE
3.3 Manipulating columns
mutate()
adds new columns that are calculated from existing columns
- first argument is a data frame (all dplyr verbs)
- subsequent arguments are the new column name, an equals sign, followed by an expression you want to use to calculate the new value, e.g.
difference=age_end - age_start
- by default new columns are added to the right, but the
.before
and.after
arguments allows you to add them before/after specific positions (by position number, e.g..before=1
or by column name, e.g.before=age
)
select()
selects columns based on their names
- first argument is a data frame (all dplyr verbs)
- subsequent arguments can be the names of the columns you want to keep
- use the
:
operator to select everything from one column to another, e.g.age:height
- you can also use logical operators like
&
(and) or!
(not) to identify the subset of columns you want to select, e.g.!age:height
- you can also rename columns within select by putting the name of the column and an equals sign before the column you want to select, e.g.
new_name=selected_column
rename()
we’ve already seen this function when importing data. Rename is used when want to keep all of our columns but rename one or more.
- first argument is a data frame (all dplyr verbs)
- subsequent arguments are the columns we would like to rename, e.g.
new_colname=old_colname
3.4 Group and summarise
In addition to manipulating rows and columns in your dataset, dplyr
also allows you to work with groups
group_by()
is used to divide your dataset into groups that are meaningful for your analysis.
group_by()
doesn’t change the data, but adds a groups attribute, which tells R that subsequent operations will be performed by group- you can tell if a data frame is grouped by the first line of the output (or with
attributes()
!)
summarise()
is often used after group_by()
to calculate summary statistics on grouped data, which returns a data frame with a single row for each group
- you can add any number of summary stats; usually you want to name them something that makes sense for your analysis
n()
is a particularly useful summary stat to add to our list that returns a count- use the argument
na.rm=TRUE
to compute the summary statistics with NAs removed (remember they are contageous!) - note that the returned data frame is iteself grouped, but in a quirky way, with one fewer group (you may get a warning about this). You can add the argument
.groups="drop"
to drop all groups or.groups="keep"
to keep them all - to avoid this quirk,
summarise()
also has a cool new.by
argument that can be used instead of callinggroup_by()
, which always returns an ungrouped data frame.
ungroup()
is used to remove the grouping attribute from a data frame
3.5 More advanced
There are a few more advanced techniques for transforming and tidying data that we won’t cover now, but might be useful to you in your own research.
- joins - sometimes you have more than one dataset that you want to join into one.
dplyr
also has functions for handling that. Learn more about joins - pivots - sometimes your data doesn’t arrive in the tidy data form. The
tidyr
package can help withpivot_longer()
andpivot_wider()
. Learn more about pivots
4 Further reading and references
Recommended further reading:
- Data tidying in R for Data Science
- Tibbles in R for Data Science
- Data import in R for Data Science
- readr cheatsheet
- Ch 4 Data transformation in R for Data Science textbook
Other references:
- Ch 20 Joins in R for Data Science textbook
- Ch 6 Data tidying in R for Data Science textbook