Lab 3: Data wrangling

Not graded, just practice

Author

Katie Schuler

Published

September 12, 2024

Materials from lab

1 Tidy

1.1 Tidyverse

What is the relationship between tidyverse and readr?

- tidyverse is a package in the readr family of packages
- readr is a package in the tidyverse family of packages
- tidyverse and readr are two unrelated packages
- tidyverse and readr are two names for the same package

In the tidyverse, what does “tidy data” refer to?

- any data we load into the tidyverse
- a dataset with no missing values
- a standard way to organize a dataset
- the process of cleaning a dataset

What is the purpose of the purrr package?

- Data visualization
- Data wrangling
- Data importing
- Functional programming
- All of the above

What is the primary purpose of the readr package?

- Data visualization
- Data wrangling
- Data importing
- Functional programming
- All of the above

Which of the following returned this message?

```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

- library(tidyverse)
- family(tidyverse)
- library.collection(tidyverse)
- library(tidyverse, report=TRUE)

1.2 purrr

Suppose we have the following tibble, stored with the variable df.

```
# A tibble: 4 × 3
      x     y     z
  <int> <int> <int>
1     1     5     9
2     2     6    10
3     3     7    11
4     4     8    12
```

What will map(df, mean) return?

- the mean of each row
- the mean of each column
- the mean of all values
- Error: cannot compute mean of type integer

Suppose we wanted to coerce each column in the previous tibble to the data type double with one line of code. Fill in the two arguments to map that would accomplish this:

map(, )
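
One way to explore questions like these is to rebuild the tibble and experiment. The sketch below assumes `df` was built from three integer columns as printed above (the lab only shows the printed output, so the construction is a guess). `as.character` is a stand-in chosen only to show the shape of a `map()` call that applies a coercion function to every column.

```r
library(tibble)
library(purrr)

# Assumed construction of the tibble printed above
df <- tibble(x = 1:4, y = 5:8, z = 9:12)

# map() calls the function once per element of its first argument;
# the elements of a tibble are its columns
col_means <- map(df, mean)
str(col_means)

# Same pattern with a coercion function: every column becomes character
as_chr <- map(df, as.character)
str(as_chr)
```

Swapping in a different function as the second argument changes what is done to each column, while the result stays a list with one entry per column.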

1.3 Tibbles

Suppose we run the following code block and create 3 tibbles:

```
# create tibble tib
tib <- tibble(x = 1:2, y = c("a", "b"))

# create tibble x
x <- tribble(
    ~x, ~y,
    2, 3,
    4, 5
)

# create tibble tibby
tibby <- tibble(
    age = c(1, 2, 3, 5),
    name = c("dory", "hazel", "graham", "joan"),
    alt_name = c("dolores", NA, NA, "joanie")
)
```

What will is.data.frame(tib) return?

- True
- False

What will typeof(tib) return?

- double
- character
- 'double' • 'character'
- list
- tibble
- data.frame

What will is_tibble(x) return?

- True
- False

Which of the following would convert a data frame called df to a tibble? (Note that df is not defined above; consider any arbitrary data frame.)

- as_tibble(df)
- as.data.frame(df, tibble)
- tribble(df)
- df %>% as_tibble()
- df |> as_tibble()

What will tibby$a return?

- a warning and the value NULL
- age via partial matching
- age and alt_name via partial matching
- hazel, graham, joan, and joanie via partial matching
- an empty vector
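
The checks in this section can be tried directly in the console. The following is a minimal sketch: `tib` is copied from the code block above, while `base_df` is a made-up data frame introduced here only to contrast an ordinary data frame with a tibble.

```r
library(tibble)

tib <- tibble(x = 1:2, y = c("a", "b"))

# A tibble carries extra classes layered on top of data.frame
class(tib)
is.data.frame(tib)

# typeof() reports the underlying storage type, not the class
typeof(tib)

# base_df is a hypothetical ordinary data frame
base_df <- data.frame(id = 1:3, score = c(90, 85, 77))
converted <- as_tibble(base_df)

# Compare the two objects before and after conversion
is_tibble(base_df)
is_tibble(converted)
```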

2 Import

The questions below refer to a dataset borrowed from R4DS and available at the URL https://pos.it/r4ds-students-csv.

What does the csv in read_csv() stand for? Fill in the blank.

_____ separated values

Suppose we attempt to import the csv file given above with the code below. What will be the result?
```
data <- read_csv("https://pos.it/r4ds-students-csv",
    col_types = list(AGE = col_double())
)
```
- imports with no errors or warnings
- fails to import, throws error
- imports, but with a warning that there are parsing issues
- imports, but changes the column name to age

Suppose we import the dataset given above and name it data. What will is.na(data[3,3]) return?

- True
- False

Suppose we import the dataset given above and name it data. Which of the following would return the first column?

- data[1]
- data[[1]]
- data[[Student ID]]
- data$`Student ID`

True or false: assuming the same dataset, the following code would rename the Student ID column to student_id.
```
data %>% rename(student_id = `Student ID`)
```
- True
- False

True or false: we can use a read_*() function from readr to import a Google Sheet.

- True
- False
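
To experiment with import behavior without downloading anything, readr accepts literal CSV text wrapped in `I()`. The sketch below uses a made-up two-row CSV (`csv_text`, `pets`, and its column names are hypothetical and are not the students dataset) to show how a `col_types` specification, bracket indexing, and backtick-quoted names behave.

```r
library(readr)
library(dplyr)

# Hypothetical CSV text standing in for a file; not the students dataset
csv_text <- "Pet ID,name,AGE\n1,dory,3\n2,hazel,five\n"

# I() tells readr to treat the string as literal data rather than a path
pets <- read_csv(I(csv_text), col_types = list(AGE = col_double()))

# "five" cannot be parsed as a double, so readr records a parsing problem
problems(pets)

# Single brackets keep a tibble; double brackets extract the column vector
class(pets[1])
class(pets[[1]])

# Backticks quote a column name that contains a space
renamed <- pets %>% rename(pet_id = `Pet ID`)
names(renamed)
```

Running the same calls on the real dataset is a direct way to check the answers above.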

3 Transform

Which of the following dplyr functions returns a data frame?

- select()
- mutate()
- filter()
- rename()
- None of the above

Which of the following dplyr functions takes a number as its first argument?

- select()
- mutate()
- filter()
- rename()
- None of the above

True or false, the following code blocks are equivalent.

```
# option 1
ratings %>% select(Word, Frequency) %>% glimpse()

# option 2
glimpse(select(ratings, Word, Frequency))
```

- True
- False
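
The relationship between piped and nested calls can be tested on any small tibble. This sketch uses `toy`, a made-up stand-in for the `ratings` dataset with invented values, and compares the two forms with `identical()` instead of printing them with `glimpse()`.

```r
library(tibble)
library(dplyr)

# Made-up stand-in for the ratings dataset used in lab
toy <- tibble(
    Word = c("ant", "oak", "bee", "fern"),
    Frequency = c(4.2, 3.1, 5.0, 2.7),
    Class = c("animal", "plant", "animal", "plant")
)

# Piped form: the left-hand side becomes the first argument of the next call
piped <- toy %>% select(Word, Frequency)

# Nested form: the same call written inside-out
nested <- select(toy, Word, Frequency)

# Compare the two results
identical(piped, nested)
```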

True or false, the following code options are equivalent.

```
# option 1
ratings %>% 
    select(Word:Class) %>% 
    mutate(Length/Frequency, .after = Class)

# option 2
ratings %>% 
    select(Word:Class) %>% 
    mutate(Length/Frequency)
```

- True
- False

Recall that there are two possible values in the Class variable in the ratings dataset: “animal” or “plant”. How many rows would be in the data frame returned by the following code block?

```
ratings %>% group_by(Class) %>% summarise(n = n())
```

- 0
- 2
- 4
- 81

Given the code block in the previous question, what will n() do?

- summarize all classes including the letter n
- count the number of rows per Class
- adds the string n before each value of Class
- error: missing arguments to n()
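
A small experiment can make the behavior of `group_by()` followed by `summarise()` concrete. The sketch below uses `toy`, a hypothetical five-row tibble with invented values, not the `ratings` data from lab.

```r
library(tibble)
library(dplyr)

# Hypothetical data: three animals and two plants
toy <- tibble(
    Word = c("ant", "oak", "bee", "fern", "cat"),
    Class = c("animal", "plant", "animal", "plant", "animal")
)

# summarise() collapses each group to a single row;
# n() takes no arguments and is evaluated within the current group
counts <- toy %>% group_by(Class) %>% summarise(n = n())
counts
```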

True or false, the following code blocks will return the same data frame.

```
# code block 1
ratings %>% select(complexity = Complex)

# code block 2
ratings %>% rename(complexity = Complex)
```

- True
- False
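
Comparing the column names of the two results is a quick way to see how `select()` and `rename()` treat the columns that are not mentioned. The tibble `toy` below is a made-up stand-in with invented values; only the column name `Complex` is taken from the question.

```r
library(tibble)
library(dplyr)

# Made-up stand-in for the ratings dataset
toy <- tibble(
    Word = c("ant", "oak"),
    Frequency = c(4.2, 3.1),
    Complex = c("simplex", "complex")
)

# Both verbs accept new_name = old_name
selected <- toy %>% select(complexity = Complex)
renamed <- toy %>% rename(complexity = Complex)

# Inspect which columns survive in each result
names(selected)
names(renamed)
```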

Which of the following code blocks will return a dataframe including only the rows in ratings for which the Class value is “animal”?
```
# code block a
ratings %>% filter(Class = "animal")

# code block b
ratings %>% filter(Class == "animal")
```
- a
- b
- both a and b

By default the arrange() function arranges the rows in ascending order. Which of the following code blocks would arrange the Frequency variable in descending order?

```
# code block a
ratings %>% arrange(Frequency, order = "descending")

# code block b
ratings %>% arrange(Frequency, order = "reverse")

# code block c
ratings %>% arrange(desc(Frequency))
```

Which of the following code blocks could be used to return the mean frequency by class?

```
# code block a
ratings %>% group_by(Class) %>% summarise(mean = mean(Frequency))

# code block b
ratings %>% summarise(
    mean = mean(Frequency), .by = c(Class))

# code block c
ratings %>% mean(Frequency) %>% group_by(Class)
```
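
Grouped summaries like these can be compared on a small example. The sketch below uses `toy`, a hypothetical tibble with invented values, and assumes dplyr 1.1.0 or later, since the `.by` argument is not available in earlier versions.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for the ratings dataset
toy <- tibble(
    Word = c("ant", "oak", "bee", "fern"),
    Frequency = c(4.2, 3.1, 5.0, 2.7),
    Class = c("animal", "plant", "animal", "plant")
)

# Persistent grouping with group_by(), then one summary row per group
by_group <- toy %>% group_by(Class) %>% summarise(mean = mean(Frequency))

# Per-operation grouping with the .by argument
by_arg <- toy %>% summarise(mean = mean(Frequency), .by = Class)

by_group
by_arg
```

The two approaches can order their rows differently: `group_by()` sorts the groups, while `.by` keeps them in order of first appearance.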