Lab 4: Sampling distribution

Not graded, just practice

Author: Katie Schuler

Published: September 21, 2023

If you would like to practice with a set of data, you can import the following dataset with read_csv:

# brain volumes simulated from Ritchie et al
library(tidyverse)  # provides read_csv()
read_csv("http://kathrynschuler.com/datasets/brain_volume.csv")

1 Exploring a simple dataset

Which of the following is the best choice to visualize the frequency distribution of a set of data? Choose one.

geom_rug()
geom_histogram()
geom_point()
geom_smooth()

Which of the following would summarize the central tendency of a set of data? Choose all that apply.

mean
median
standard deviation
interquartile range (IQR)

Which of the following would summarize the spread of a set of data? Choose all that apply.

mean
median
standard deviation
interquartile range (IQR)

Given the following figure, which summary statistics would best describe these data?

mean
median
standard deviation
interquartile range (IQR)
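Each of the summary statistics listed above has a base R function. A minimal sketch, using simulated right-skewed values (hypothetical data, not the figure from the lab), shows how the mean and median can disagree when a distribution is skewed:

```r
# Hypothetical right-skewed data (assumption: not the lab's dataset)
set.seed(1)
skewed <- rexp(1000, rate = 1)

# Central tendency
center_mean   <- mean(skewed)
center_median <- median(skewed)

# Spread
spread_sd  <- sd(skewed)
spread_iqr <- IQR(skewed)

# In right-skewed data the long tail pulls the mean above the median
c(mean = center_mean, median = center_median, sd = spread_sd, IQR = spread_iqr)
```

Running the same four functions on roughly symmetric data (for example, values from rnorm()) should give a mean and median that nearly coincide.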

Write code to generate 200 data points, sampled from a gaussian distribution with a mean of 0 and a standard deviation of 1.

Answer

rnorm(200, mean = 0, sd = 1)
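Because rnorm() draws random values, every call returns a different sample. A small sketch, assuming you want a reproducible result, fixes the seed (the seed value and the variable name x are arbitrary choices) and checks the sample against the requested parameters:

```r
set.seed(2023)                      # arbitrary seed, only for reproducibility
x <- rnorm(200, mean = 0, sd = 1)   # 200 draws from a standard gaussian

length(x)  # 200
mean(x)    # close to 0, but not exactly 0
sd(x)      # close to 1, but not exactly 1
```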

Suppose you sampled 500 data points from a uniform distribution and stored the result in data. You then compute summary statistics with the following code, which returns the output shown below it. What is the height of the probability density function at a value of 5?

data %>% summarise(
    n = n(),
    mean = mean(y), 
    sd = sd(y), 
    lower = quantile(y, 0), 
    upper = quantile(y, 1)
)

# A tibble: 1 × 5
      n  mean    sd lower upper
  <int> <dbl> <dbl> <dbl> <dbl>
1   500  7.56  1.49  5.02  9.99
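For a uniform distribution the density is flat: its height is 1 / (max - min) everywhere between the bounds. A sketch of that calculation, assuming the data were drawn with bounds of 5 and 10 (an assumption suggested by the observed lower and upper values in the output, not something the lab states):

```r
# Assumed bounds of the uniform distribution
# (suggested by lower = 5.02 and upper = 9.99 in the summary output)
assumed_min <- 5
assumed_max <- 10

# The height of a uniform density is constant between its bounds
height_by_hand <- 1 / (assumed_max - assumed_min)

# dunif() evaluates the same density at a given value
height_dunif <- dunif(5, min = assumed_min, max = assumed_max)

c(by_hand = height_by_hand, dunif = height_dunif)
```

Using the observed minimum and maximum as the bounds instead would give a slightly different height, and a density of zero at exactly 5, since 5 falls just below the observed minimum of 5.02.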

Suppose your data is normally distributed and has a mean of 25 and a standard deviation of 5. What is the probability a random value drawn from your dataset will be less than 20? Select the closest value.

0.0483
0.1589
1
0
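The probability that a normally distributed value falls below a threshold is the cumulative distribution function evaluated at that threshold, which base R exposes as pnorm(). A short sketch using the numbers from the question (the variable names are illustrative):

```r
# P(X < 20) for X ~ Normal(mean = 25, sd = 5)
p_below_20 <- pnorm(20, mean = 25, sd = 5)

# 20 is one standard deviation below the mean, so this equals pnorm(-1)
z <- (20 - 25) / 5
p_from_z <- pnorm(z)

round(p_below_20, 4)
```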

True or false: the parameter is the mean of the population and the parameter estimate is the mean of your sample.

TRUE
FALSE

What do we call the probability distribution of the values our parameter estimate can take on?

Suppose we want to quantify the spread of the sampling distribution. What method could we choose? Choose all that apply.

mean
median
standard error
confidence interval
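One way to build intuition for the spread of a sampling distribution is to simulate it: draw many samples from a known population, compute the estimate for each, and look at how much the estimates vary. The sketch below assumes a hypothetical normal population (mean 25, standard deviation 5) and a hypothetical sample size of 50; neither comes from the lab.

```r
set.seed(42)
n_per_sample <- 50      # hypothetical sample size
n_samples    <- 5000    # number of simulated experiments

# One parameter estimate (the sample mean) per simulated experiment
sample_means <- replicate(n_samples, mean(rnorm(n_per_sample, mean = 25, sd = 5)))

# Standard error: the standard deviation of the sampling distribution
standard_error <- sd(sample_means)

# For a sample mean this should be near (population sd) / sqrt(n)
theoretical_se <- 5 / sqrt(n_per_sample)

# A 68% interval: the middle 68% of the simulated estimates
interval_68 <- quantile(sample_means, probs = c(0.16, 0.84))

c(simulated = standard_error, theoretical = theoretical_se)
```

In a real experiment we rarely get to repeat the sampling thousands of times, which is the motivation for approximating this distribution from a single sample.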

For a typical experiment, how many samples from the population is it practical for us to take? Enter a number.

True or false: when we generate the bootstrap sampling distribution, we resample our original sample with replacement.

TRUE
FALSE

Suppose we want to generate the bootstrap sampling distribution for the mean of a dataset, data, with one variable, reaction_time. Write code that uses the infer package to accomplish this, generating 1000 samples.

Answer

data %>% 
    specify(response = reaction_time) %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "mean")
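To see what a bootstrap of the mean does conceptually, here is a base R sketch of the same idea: resample the observed values with replacement, keeping the original sample size, and compute the mean of each resample. The reaction_time values are simulated stand-ins (hypothetical), and this illustrates the technique rather than describing infer's internals.

```r
set.seed(123)
# Hypothetical reaction times in ms (stand-in for data$reaction_time)
reaction_time <- rnorm(100, mean = 500, sd = 80)

# 1000 bootstrap resamples: same size as the original, drawn with replacement
boot_means <- replicate(
  1000,
  mean(sample(reaction_time, size = length(reaction_time), replace = TRUE))
)

length(boot_means)  # 1000 bootstrap estimates of the mean
```

The resulting vector plays the same role as the stat column produced by the pipeline above: one estimate of the mean per resample.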

Suppose we store our bootstrap sampling distribution from the previous question in a variable called bootstrap_distribution. Which two arguments should we add to the code below to compute the 68% confidence interval and assign it to the variable ci?

ci <- bootstrap_distribution %>% 
    get_confidence_interval(______, ________)

type="se", level = 68
type="se", level = 0.68
type="percentage", level = 0.68
type="percentage", level = 68
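Two common ways to turn a bootstrap distribution into a 68% interval can be sketched in base R: take the middle 68% of the bootstrap estimates, or go out roughly one standard error on either side of the point estimate. The data below are simulated stand-ins (hypothetical), and the sketch illustrates the arithmetic rather than reproducing infer's output.

```r
set.seed(123)
reaction_time <- rnorm(100, mean = 500, sd = 80)   # hypothetical data
boot_means <- replicate(1000, mean(sample(reaction_time, replace = TRUE)))

level <- 0.68          # expressed as a proportion between 0 and 1
alpha <- 1 - level

# Percentile-style interval: the middle 68% of the bootstrap estimates
ci_percentile <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))

# Standard-error-style interval: point estimate +/- multiplier * SE
point_estimate <- mean(reaction_time)
multiplier     <- qnorm(1 - alpha / 2)   # about 0.99 for a 68% level
ci_se <- point_estimate + c(-1, 1) * multiplier * sd(boot_means)

rbind(percentile = unname(ci_percentile), se = ci_se)
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals should land close to each other.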

Suppose we store our bootstrap sampling distribution in bootstrap_distribution and we want to visualize the confidence interval, ci, that we just computed. Which of the following could we add to the code below? Choose all that apply.

bootstrap_distribution %>%
    visualize() + 
    _____________

get_confidence_interval(endpoints = ci)
shade_ci(endpoints = ci)
shade_confidence_interval(endpoints = ci)
get_ci(endpoints = ci)