# Sampling Distributions

Data Science for Studying Language and the Mind

Katie Schuler

2023-09-21

## You are here

##### Data science with R
• Hello, world!
• R basics
• Data importing
• Data visualization
• Data wrangling
##### Stats & Model building
• Probability distributions
• Sampling variability
• Hypothesis testing
• Model specification
• Model fitting
• Model accuracy
• Model reliability
• Classification
• Feature engineering (preprocessing)
• Inference for regression
• Mixed-effect models

• Inspired by a MATLAB course Katie took from Kendrick Kay
• Data simulated from Ritchie et al. (2018):

Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants

# Explore a simple dataset

## Dataset

Suppose we measure a single quantity: brain volume of human adults

```
Rows: 5,216
Columns: 1
$volume <dbl> 1193.283, 1150.383, 1242.702, 1206.808, 1235.955, 1292.399, 120…
```

## Visualize the distribution

Visualize the distribution of the data with a histogram

## Measure of central tendency

Summarize the data with a single value: the mean, a measure of where a central or typical value might fall

```r
sum_stats <- data %>%
  summarise(
    n = n(),
    mean = mean(volume)
  )
sum_stats
```

```
# A tibble: 1 × 2
      n  mean
  <int> <dbl>
1  5216 1173.
```

## Measure of variability

Summarize the spread of the data with the standard deviation

```r
sum_stats <- data %>%
  summarise(
    n = n(),
    mean = mean(volume),
    sd = sd(volume)
  )
sum_stats
```

```
# A tibble: 1 × 3
      n  mean    sd
  <int> <dbl> <dbl>
1  5216 1173.  112.
```

## Parametric statistics

Mean and sd are parametric summary statistics. They are given by the following equations:

$mean(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}$

$sd(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

• Mean and sd are a good summary of the data when the distribution is normal (Gaussian)

## Nonparametric statistics

But suppose our distribution is not normal.

## Nonparametric statistics

Mean and sd are not a good summary anymore.

## Median

Instead we can use the median as our measure of central tendency.

```r
np_sum_stats <- not_normal %>%
  summarise(
    n = n(),
    median = median(y)
  )
np_sum_stats
```

```
# A tibble: 1 × 2
      n median
  <int>  <dbl>
1   111     15
```

## IQR

And the interquartile range (IQR) as a measure of the spread in our data.
```r
np_sum_stats <- not_normal %>%
  summarise(
    n = n(),
    median = median(y),
    lower = quantile(y, 0.25),
    upper = quantile(y, 0.75)
  )
np_sum_stats
```

```
# A tibble: 1 × 4
      n median lower upper
  <int>  <dbl> <dbl> <dbl>
1   111     15     5    25
```

# Probability distributions

A mathematical function that describes the probability of observing different possible values of a variable

## Uniform probability distribution

All possible values are equally likely

```r
uniform_sample %>%
  summarise(
    min = min(y),
    max = max(y),
    prob = 1 / (max - min)
  )
```

```
# A tibble: 1 × 3
    min   max  prob
  <int> <int> <dbl>
1     1    10 0.111
```

Height of the probability density function:

```r
dunif(4, min = 1, max = 10)
```

```
[1] 0.1111111
```

Probability of a value less than a given value:

```r
punif(4, min = 1, max = 10)
```

```
[1] 0.3333333
```

$p(x) = \frac{1}{max-min}$

## Gaussian (normal) probability distribution

Height of the probability density function:

```r
dnorm(4, mean = 0, sd = 1)
```

```
[1] 0.0001338302
```

Probability of a value less than a given value:

```r
pnorm(4, mean = 0, sd = 1)
```

```
[1] 0.9999683
```

$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)$

# Sampling variability

## The population

We actually want to know something about the population: the mean brain volume of Penn undergrads (the parameter)

## The sample

But we only have a small sample of the population: maybe we can measure the brain volume of 100 students

## Sampling variability

Any statistic we compute from a random sample we've collected (a parameter estimate) will be subject to sampling variability and will differ from that statistic computed on the entire population (the parameter)

## Sampling variability

If we took another sample of 100 students, our parameter estimate would be different.

## Sampling distribution

The sampling distribution is the probability distribution of values our parameter estimate can take on. It is constructed by taking a random sample, computing the statistic of interest, and repeating many times.
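The procedure just described (sample, compute the statistic, repeat) can be sketched in a few lines of base R. The brain-volume data aren't included here, so this sketch simulates a stand-in sample; the simulated values are assumptions for illustration only.

```r
# Sketch of constructing a sampling distribution by resampling.
# The course data aren't available here, so simulate a sample of
# 100 "brain volumes" (made-up mean and sd, for illustration).
set.seed(1)
sample1 <- rnorm(100, mean = 1200, sd = 110)

# Resample the sample with replacement, compute the mean, repeat 1000 times.
boot_means <- replicate(1000, mean(sample(sample1, replace = TRUE)))

# The spread of boot_means approximates how the sample mean would
# vary across repeated samples.
hist(boot_means)
```

This is exactly what the infer pipeline below automates: `generate()` does the resampling and `calculate()` computes the statistic on each replicate.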
## Quantifying sampling variability

The spread of the sampling distribution indicates how the parameter estimate will vary across different random samples. We can quantify the spread (express our uncertainty about our parameter estimate) in two ways.

## Quantifying sampling variability with standard error

One way is to compute the standard deviation of the sampling distribution: the standard error

## Quantifying sampling variability with a confidence interval

Another way is to construct a confidence interval

## Practical considerations

• We don't have access to the entire population
• We can (usually) only do our experiment once

# Bootstrapping

To construct the sampling distribution

## Bootstrapping

Instead of assuming a parametric probability distribution, we use the data themselves to approximate the underlying distribution: we sample our sample!

# Bootstrapping with infer

## infer is part of tidymodels

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

```r
install.packages("tidymodels")
```

## Generate the sampling distribution

Generate the sampling distribution with specify(), generate(), and calculate()

```r
bootstrap_distribution <- sample1 %>%
  specify(response = volume) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
bootstrap_distribution
```

```
Response: volume (numeric)
# A tibble: 1,000 × 2
   replicate  stat
       <int> <dbl>
 1         1 1238.
 2         2 1262.
 3         3 1259.
 4         4 1232.
 5         5 1226.
 6         6 1294.
 7         7 1259.
 8         8 1246.
 9         9 1226.
10        10 1235.
# ℹ 990 more rows
```

## Visualize the bootstrap distribution

Visualize the bootstrap distribution you generated with visualize()

```r
bootstrap_distribution %>%
  visualize()
```

## Quantify the spread with se

Quantify the spread of the sampling distribution with get_confidence_interval(), using the standard error

```r
se_bootstrap <- bootstrap_distribution %>%
  get_confidence_interval(
    type = "se",
    point_estimate = mean(sample1$volume)
  )

se_bootstrap
```

```
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1210.    1274.
```

```r
bootstrap_distribution %>%
  visualize() +
  shade_confidence_interval(endpoints = se_bootstrap)
```
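Under the hood, an se-type interval is just the point estimate plus or minus a multiplier times the standard deviation of the bootstrap distribution. A minimal base-R check, using simulated stand-in data (the sample below is an assumption, not the course data):

```r
# Simulated stand-in for the sample of 100 brain volumes.
set.seed(2)
sample1 <- rnorm(100, mean = 1200, sd = 110)
boot_means <- replicate(1000, mean(sample(sample1, replace = TRUE)))

# The standard error is the sd of the bootstrap distribution.
se <- sd(boot_means)
point_estimate <- mean(sample1)

# A 95% se-based interval: estimate +/- 1.96 * se
c(lower = point_estimate - 1.96 * se,
  upper = point_estimate + 1.96 * se)
```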

## Quantify the spread with ci

Quantify the spread of the sampling distribution with get_confidence_interval(), using a confidence interval

```r
ci_bootstrap <- bootstrap_distribution %>%
  get_confidence_interval(
    type = "percentile",
    level = 0.95
  )

ci_bootstrap
```

```
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1211.    1273.
```

```r
bootstrap_distribution %>%
  visualize() +
  shade_confidence_interval(endpoints = ci_bootstrap)
```
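The percentile interval can likewise be computed directly with quantile(): it is simply the middle 95% of the bootstrap distribution. Again using simulated stand-in data (an assumption, not the course data):

```r
# Simulated stand-in for the sample of 100 brain volumes.
set.seed(3)
sample1 <- rnorm(100, mean = 1200, sd = 110)
boot_means <- replicate(1000, mean(sample(sample1, replace = TRUE)))

# 95% percentile interval: the 2.5th and 97.5th percentiles
# of the bootstrap distribution.
quantile(boot_means, probs = c(0.025, 0.975))
```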