Sampling Distributions

Data Science for Studying Language and the Mind

Katie Schuler

2023-09-21

You are `here`

Data science with R

Hello, world!
R basics
Data importing
Data visualization
Data wrangling

Stats & Model building

Probability distributions
Sampling variability
Hypothesis testing
Model specification
Model fitting
Model accuracy
Model reliability

More advanced

Classification
Feature engineering (preprocessing)
Inference for regression
Mixed-effect models

Attribution

Inspired by a MATLAB course Katie took by Kendrick Kay
Data simulated from Ritchie et al 2018:

Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants

Explore a simple dataset

Dataset

Suppose we measure a single quantity: brain volume of human adults

Rows: 5,216
Columns: 1
$ volume <dbl> 1193.283, 1150.383, 1242.702, 1206.808, 1235.955, 1292.399, 120…

Visualize the distribution

Visualize the distribution of the data with a histogram
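The plot from this slide is not reproduced here. As a minimal sketch, assuming a stand-in data frame (`data_sim` is a hypothetical name) simulated from a normal distribution with roughly the mean and sd reported on the following slides, a histogram could be drawn with base R:

```r
# Stand-in for the course data: simulated values, not the real dataset.
set.seed(1)
data_sim <- data.frame(volume = rnorm(5216, mean = 1173, sd = 112))

# Histogram of brain volume; hist() also returns the bin counts.
h <- hist(data_sim$volume, breaks = 30,
          main = "Distribution of brain volume",
          xlab = "volume")
```

With the real data, the same call would take `data$volume` in place of the simulated column.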

Measure of central tendency

Summarize the data with a single value: the mean, a measure of where a central or typical value falls

sum_stats <- data %>% summarise(
    n = n(), 
    mean = mean(volume))
sum_stats

# A tibble: 1 × 2
      n  mean
  <int> <dbl>
1  5216 1173.

Measure of variability

Summarize the spread of the data with standard deviation

sum_stats <- data %>% summarise(
    n = n(), 
    mean = mean(volume),
    sd = sd(volume))
sum_stats

# A tibble: 1 × 3
      n  mean    sd
  <int> <dbl> <dbl>
1  5216 1173.  112.

Parametric statistics

Mean and sd are parametric summary statistics. They are given by the following equations:

\(mean(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\)

\(sd(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\)

Mean and sd are a good summary of the data when the distribution is normal (Gaussian)
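To connect the equations to R's built-in functions, here is a small sketch that computes both statistics by hand on a made-up vector (`x` is illustrative, not course data) and compares them with `mean()` and `sd()`:

```r
# A small made-up vector, purely for illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: sum of the values divided by n.
mean_by_hand <- sum(x) / n

# Standard deviation: square root of the summed squared deviations
# from the mean, divided by n - 1.
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

# Both comparisons should report TRUE.
all.equal(mean_by_hand, mean(x))
all.equal(sd_by_hand, sd(x))
```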

Nonparametric statistics

But suppose our distribution is not normal.

Nonparametric statistics

Mean and sd are no longer a good summary.

Median

Instead we can use the median as our measure of central tendency.

np_sum_stats <- not_normal %>% summarise(
    n = n(), 
    median = median(y))
np_sum_stats

# A tibble: 1 × 2
      n median
  <int>  <dbl>
1   111     15

IQR

And we can use the interquartile range (IQR), the distance between the 25th and 75th percentiles, as a measure of the spread in our data.

np_sum_stats <- not_normal %>% summarise(
    n = n(), 
    median = median(y),
    lower = quantile(y, 0.25),
    upper = quantile(y, 0.75) )
np_sum_stats

# A tibble: 1 × 4
      n median lower upper
  <int>  <dbl> <dbl> <dbl>
1   111     15     5    25
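The `lower` and `upper` columns are the two quartiles; the IQR itself is their difference. A minimal sketch, assuming a skewed stand-in vector (`y_sim` is hypothetical, since `not_normal` is not available here):

```r
# Skewed stand-in data: simulated, not the course's not_normal dataset.
set.seed(1)
y_sim <- rexp(111, rate = 1 / 20)

# Lower and upper quartiles, as on the slide.
q <- quantile(y_sim, probs = c(0.25, 0.75))

# IQR by hand: upper quartile minus lower quartile.
iqr_by_hand <- unname(q[2] - q[1])

# Should report TRUE: base R's IQR() computes the same difference.
all.equal(iqr_by_hand, IQR(y_sim))
```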

Probability distributions

A probability distribution is a mathematical function that describes the probability of observing different possible values of a variable

Uniform probability distribution

Uniform probability distribution

All possible values are equally likely

uniform_sample %>% summarise(
    min = min(y), 
    max = max(y), 
    prob = 1/(max - min))

# A tibble: 1 × 3
    min   max  prob
  <int> <int> <dbl>
1     1    10 0.111

Height of the probability density function

dunif(4, min = 1, max = 10)

[1] 0.1111111

Probability of observing a value less than the given value

punif(4, min = 1, max = 10)

[1] 0.3333333

\(p(x) = \frac{1}{max-min}\)
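As a quick check of the formula, the sketch below recomputes the two values above by hand for the same range (1 to 10); the variable names are illustrative. The cumulative probability is the area of a rectangle: the density height times the distance from the minimum.

```r
# Range of the uniform distribution used on the slide.
lo <- 1
hi <- 10

# Density height: 1 / (max - min).
density_by_hand <- 1 / (hi - lo)

# Probability of a value below 4: rectangle area = height * width.
prob_below_4 <- density_by_hand * (4 - lo)

# Both comparisons should report TRUE.
all.equal(density_by_hand, dunif(4, min = lo, max = hi))
all.equal(prob_below_4, punif(4, min = lo, max = hi))
```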

Gaussian (normal) probability distribution

Height of the probability density function

#dunif(4, min = 1, max = 10)
dnorm(4, mean=0, sd=1)

[1] 0.0001338302

Probability of observing a value less than the given value

#punif(4, min = 1, max = 10)
pnorm(4, mean=0, sd=1)

[1] 0.9999683

\(p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)\)
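The density formula can be checked against `dnorm()` directly. The sketch below also approximates the cumulative probability by numerically integrating the density with base R's `integrate()`; the integration step is an added illustration, not part of the original slides.

```r
# Same values as on the slide: x = 4, standard normal.
x <- 4
mu <- 0
sigma <- 1

# Density from the formula above.
density_by_hand <- 1 / (sigma * sqrt(2 * pi)) *
  exp(-0.5 * ((x - mu) / sigma)^2)

# Should report TRUE.
all.equal(density_by_hand, dnorm(x, mean = mu, sd = sigma))

# pnorm() is the area under the density curve to the left of x;
# numerical integration gives an approximation of that area.
area <- integrate(dnorm, lower = -Inf, upper = x,
                  mean = mu, sd = sigma)$value
area
pnorm(x, mean = mu, sd = sigma)
```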

Sampling variability

The population

We actually want to know something about the population: the mean brain volume of Penn undergrads (the parameter)

The sample

But we only have a small sample of the population: maybe we can measure the brain volume of 100 students

Sampling variability

Any statistic we compute from a random sample we’ve collected (the parameter estimate) is subject to sampling variability and will differ from the same statistic computed on the entire population (the parameter)

Sampling variability

If we took another sample of 100 students, our parameter estimate would be different.

Sampling distribution

The sampling distribution is the probability distribution of the values our parameter estimate can take on. It is constructed by taking a random sample, computing the statistic of interest, and repeating this many times.
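The procedure can be sketched with a simulation. Here the population is hypothetical (10,000 simulated brain volumes with an arbitrary mean and sd), and each repetition draws 100 values, matching the sample size in the example above:

```r
# Hypothetical population: simulated values, not real measurements.
set.seed(2)
population <- rnorm(10000, mean = 1200, sd = 100)

# Take a random sample of 100, compute its mean, and repeat 1000 times.
sample_means <- replicate(1000, mean(sample(population, size = 100)))

# The histogram of these means approximates the sampling distribution.
hist(sample_means, breaks = 30,
     main = "Sampling distribution of the mean",
     xlab = "sample mean")
```

Each sample mean differs a little from the population mean, and the spread of the histogram shows by how much.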

Quantifying sampling variability

The spread of the sampling distribution indicates how the parameter estimate will vary across different random samples. We can quantify the spread (express our uncertainty about our parameter estimate) in two ways

Quantifying sampling variability with `standard error`

One way is to compute the standard deviation of the sampling distribution: the standard error

Quantifying sampling variability with a `confidence interval`

Another way is to construct a confidence interval
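Both measures can be read off a simulated sampling distribution. The sketch below assumes a hypothetical population and takes the confidence interval to be the range covering the middle 95% of the sampling distribution; the comparison with `sd(population) / sqrt(n)` is an added check that applies to the mean specifically.

```r
# Hypothetical population and repeated samples of size 100.
set.seed(4)
population <- rnorm(10000, mean = 1200, sd = 100)
n <- 100
sample_means <- replicate(1000, mean(sample(population, size = n)))

# Standard error: the standard deviation of the sampling distribution.
standard_error <- sd(sample_means)

# For a mean, theory predicts roughly sd(population) / sqrt(n).
theoretical_se <- sd(population) / sqrt(n)

# 95% confidence interval: the middle 95% of the sampling distribution.
ci_95 <- quantile(sample_means, probs = c(0.025, 0.975))

standard_error
theoretical_se
ci_95
```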

Practical considerations

We don’t have access to the entire population
We can (usually) only do our experiment once

Bootstrapping

To construct the sampling distribution

Bootstrapping

Instead of assuming a parametric probability distribution, we use the data themselves to approximate the underlying distribution: we sample our sample!
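"Sampling our sample" means drawing from the observed data with replacement, keeping the sample size the same. A minimal base R sketch, assuming a simulated stand-in sample (`sample1_sim` is hypothetical):

```r
# Stand-in for one observed sample of 100 brain volumes (simulated).
set.seed(3)
sample1_sim <- rnorm(100, mean = 1240, sd = 110)

# One bootstrap replicate: resample the data with replacement
# (same size as the original sample) and compute the mean.
# Repeat 1000 times to approximate the sampling distribution.
boot_means <- replicate(
  1000,
  mean(sample(sample1_sim, replace = TRUE))
)

# Spread of the bootstrap distribution.
sd(boot_means)
```

Because each resample repeats some observations and omits others, the bootstrap means vary even though only one sample was ever collected.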

Bootstrapping with `infer`

`infer` is part of `tidymodels`

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

install.packages("tidymodels")
library(tidymodels)  # attaches infer along with the other core tidymodels packages

Generate the sampling distribution

Generate the sampling distribution with specify(), generate(), and calculate()

bootstrap_distribution <- sample1  %>% 
  specify(response = volume) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

bootstrap_distribution

Response: volume (numeric)
# A tibble: 1,000 × 2
   replicate  stat
       <int> <dbl>
 1         1 1238.
 2         2 1262.
 3         3 1259.
 4         4 1232.
 5         5 1226.
 6         6 1294.
 7         7 1259.
 8         8 1246.
 9         9 1226.
10        10 1235.
# ℹ 990 more rows

Visualize the bootstrap distribution

Visualize the bootstrap distribution you generated with visualize()

bootstrap_distribution %>% 
  visualize()

Quantify the spread with `se`

Quantify the spread of the sampling distribution with get_confidence_interval(), using standard error

se_bootstrap <- bootstrap_distribution %>% 
  get_confidence_interval(
    type = "se",
    point_estimate = mean(sample1$volume)
  )

se_bootstrap

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1210.    1274.

bootstrap_distribution %>% 
  visualize() +
  shade_confidence_interval(
    endpoints = se_bootstrap
  )
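To see what a standard-error interval amounts to, the sketch below builds one by hand in base R: the point estimate plus or minus a normal-quantile multiplier times the standard deviation of the bootstrap statistics. It assumes a simulated stand-in sample, and the multiplier is an assumption about how `infer` computes `type = "se"` intervals at the default 95% level, not something stated in these slides.

```r
# Simulated stand-in sample and its bootstrap distribution of means.
set.seed(5)
sample1_sim <- rnorm(100, mean = 1240, sd = 110)
boot_means <- replicate(1000, mean(sample(sample1_sim, replace = TRUE)))

# Point estimate from the original sample; standard error from the bootstrap.
point_estimate <- mean(sample1_sim)
se <- sd(boot_means)

# Assumed multiplier for a 95% interval (about 1.96).
multiplier <- qnorm(0.975)

se_interval <- c(
  lower_ci = point_estimate - multiplier * se,
  upper_ci = point_estimate + multiplier * se
)
se_interval
```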

Quantify the spread with `ci`

Quantify the spread of the sampling distribution with get_confidence_interval(), using a confidence interval

ci_bootstrap <- bootstrap_distribution %>% 
  get_confidence_interval(
    type = "percentile",
    level = 0.95
  )

ci_bootstrap

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1211.    1273.

bootstrap_distribution %>% 
  visualize() +
  shade_confidence_interval(
    endpoints = ci_bootstrap
  )
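A percentile interval is simply a pair of quantiles of the bootstrap distribution. As a base R sketch on a simulated stand-in sample (`sample1_sim` and `boot_means` are hypothetical names), a 95% interval cuts off 2.5% in each tail:

```r
# Simulated stand-in sample and its bootstrap distribution of means.
set.seed(6)
sample1_sim <- rnorm(100, mean = 1240, sd = 110)
boot_means <- replicate(1000, mean(sample(sample1_sim, replace = TRUE)))

# 95% percentile interval: the 2.5th and 97.5th percentiles.
percentile_interval <- quantile(boot_means, probs = c(0.025, 0.975))
percentile_interval
```

When the bootstrap distribution is roughly symmetric and bell-shaped, this interval lands close to the standard-error interval, as in the two outputs above.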
