Sampling Distributions

Data Science for Studying Language and the Mind

Katie Schuler

2023-09-21

You are here

Data science with R
  • Hello, world!
  • R basics
  • Data importing
  • Data visualization
  • Data wrangling
Stats & Model building
  • Probability distributions
  • Sampling variability
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Feature engineering (preprocessing)
  • Inference for regression
  • Mixed-effect models

Attribution

  • Inspired by a MATLAB course by Kendrick Kay that Katie took
  • Data simulated from Ritchie et al 2018:

Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants

Explore a simple dataset

Dataset

Suppose we measure a single quantity: brain volume of human adults

Rows: 5,216
Columns: 1
$ volume <dbl> 1193.283, 1150.383, 1242.702, 1206.808, 1235.955, 1292.399, 120…

Visualize the distribution

Visualize the distribution of the data with a histogram

Measure of central tendency

Summarize the data with a single value: mean, a measure of where a central or typical value might fall

sum_stats <- data %>% summarise(
    n = n(), 
    mean = mean(volume))
sum_stats
# A tibble: 1 × 2
      n  mean
  <int> <dbl>
1  5216 1173.

Measure of variability

Summarize the spread of the data with standard deviation

sum_stats <- data %>% summarise(
    n = n(), 
    mean = mean(volume),
    sd = sd(volume))
sum_stats
# A tibble: 1 × 3
      n  mean    sd
  <int> <dbl> <dbl>
1  5216 1173.  112.

Parametric statistics

Mean and sd are parametric summary statistics. They are given by the following equations:

\(mean(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\)

sd(\(x\)) = \(\sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\)

  • Mean and sd are a good summary of the data when the distribution is normal (gaussian)
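As a quick sanity check, both formulas can be computed by hand and compared against R's built-in `mean()` and `sd()`. A minimal sketch; the volume values below are made up:

```r
# Verify the parametric formulas against R's built-ins
x <- c(1193.3, 1150.4, 1242.7, 1206.8, 1236.0)   # made-up volumes
n <- length(x)

x_bar <- sum(x) / n                              # mean: sum of values over n
s     <- sqrt(sum((x - x_bar)^2) / (n - 1))      # sd: note the n - 1 denominator

all.equal(x_bar, mean(x))   # TRUE
all.equal(s, sd(x))         # TRUE
```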

Nonparametric statistics

But suppose our distribution is not normal.

Nonparametric statistics

Mean and sd are no longer a good summary.

Median

Instead we can use the median as our measure of central tendency.

np_sum_stats <- not_normal %>% summarise(
    n = n(), 
    median = median(y))
np_sum_stats
# A tibble: 1 × 2
      n median
  <int>  <dbl>
1   111     15

IQR

And the interquartile range (IQR) as a measure of the spread in our data.

np_sum_stats <- not_normal %>% summarise(
    n = n(), 
    median = median(y),
    lower = quantile(y, 0.25),
    upper = quantile(y, 0.75) )
np_sum_stats
# A tibble: 1 × 4
      n median lower upper
  <int>  <dbl> <dbl> <dbl>
1   111     15     5    25

Probability distributions

A mathematical function that describes the probability of observing different possible values of a variable

Uniform probability distribution

Uniform probability distribution

All possible values are equally likely

uniform_sample %>% summarise(
    min = min(y), 
    max = max(y), 
    prob = 1/(max - min))
# A tibble: 1 × 3
    min   max  prob
  <int> <int> <dbl>
1     1    10 0.111

height of prob density func

dunif(4, min = 1, max = 10)
[1] 0.1111111

prob less than given value

punif(4, min = 1, max = 10)
[1] 0.3333333

\(p(x) = \frac{1}{max-min}\)
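A quick check that `dunif()` and `punif()` match this formula, using the same min and max as above:

```r
# Inside [min, max], the uniform density is a constant 1/(max - min)
a <- 1
b <- 10
dunif(4, min = a, max = b)   # 1/9, about 0.111
1 / (b - a)                  # same value, by the formula

# And punif() is just the fraction of the interval that lies below x
punif(4, min = a, max = b)   # (4 - 1) / (10 - 1) = 1/3
(4 - a) / (b - a)
```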

Gaussian (normal) probability distribution

Gaussian (normal) probability distribution

height of prob density func

#dunif(4, min = 1, max = 10)
dnorm(4, mean=0, sd=1)
[1] 0.0001338302

prob less than given value

#punif(4, min = 1, max = 10)
pnorm(4, mean=0, sd=1)
[1] 0.9999683

\(p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)\)
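The density formula can be typed in directly and checked against `dnorm()`; `gauss_pdf` is a made-up helper name, not part of base R:

```r
# The Gaussian density formula, written out by hand
gauss_pdf <- function(x, mu = 0, sigma = 1) {
  (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma)^2)
}

gauss_pdf(4)                 # tiny: 4 is four sds above the mean
dnorm(4, mean = 0, sd = 1)   # same value
all.equal(gauss_pdf(4), dnorm(4))
```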

Sampling variability

The population

We actually want to know something about the population: the mean brain volume of Penn undergrads (the parameter)

The sample

But we only have a small sample of the population: maybe we can measure the brain volume of 100 students

Sampling variability

Any statistic we compute from a random sample we’ve collected (a parameter estimate) will be subject to sampling variability and will differ from the same statistic computed on the entire population (the parameter)

Sampling variability

If we took another sample of 100 students, our parameter estimate would be different.

Sampling distribution

The sampling distribution is the probability distribution of values our parameter estimate can take on. It is constructed by taking a random sample, computing the statistic of interest, and repeating many times.
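This recipe (sample, compute, repeat) can be sketched in base R. A minimal simulation, assuming a hypothetical normal population with the mean and sd from the brain-volume summary earlier:

```r
# Construct a sampling distribution by simulation:
# draw a random sample, compute the statistic of interest, repeat many times.
# The population here is hypothetical: normal with mean 1173 and sd 112.
set.seed(123)
sampling_dist <- replicate(
  1000,                                      # number of repeats
  mean(rnorm(100, mean = 1173, sd = 112))    # one sample of 100, one mean
)

head(sampling_dist)   # 1000 parameter estimates, one per random sample
hist(sampling_dist)   # roughly normal, centered near the population mean
```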

Quantifying sampling variability

The spread of the sampling distribution indicates how the parameter estimate will vary across different random samples. We can quantify the spread (express our uncertainty in our parameter estimate) in two ways

Quantifying sampling variability with standard error

One way is to compute the standard deviation of the sampling distribution: the standard error
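As a sketch (not the infer workflow used below), the standard error can be simulated directly, again assuming a hypothetical normal population:

```r
# Standard error = the sd of the sampling distribution.
# Hypothetical population: normal, mean 1173, sd 112; samples of size 100.
set.seed(123)
sample_means <- replicate(1000, mean(rnorm(100, mean = 1173, sd = 112)))

sd(sample_means)   # empirical standard error of the mean
112 / sqrt(100)    # analytic standard error, sigma / sqrt(n) = 11.2
```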

Quantifying sampling variability with a confidence interval

Another way is to construct a confidence interval

Practical considerations

  • We don’t have access to the entire population
  • We can (usually) only do our experiment once

Bootstrapping

To construct the sampling distribution

Bootstrapping

Instead of assuming a parametric probability distribution, we use the data themselves to approximate the underlying distribution: we sample our sample!
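Before reaching for infer, the idea can be done by hand in a few lines of base R; `volumes` here stands in for a hypothetical observed sample of 100 brain volumes:

```r
# Bootstrap: resample the sample with replacement, many times.
# volumes is hypothetical data standing in for our one observed sample.
set.seed(123)
volumes <- rnorm(100, mean = 1173, sd = 112)

boot_means <- replicate(
  1000,
  mean(sample(volumes, replace = TRUE))   # resample, then compute the mean
)

sd(boot_means)                           # bootstrap standard error
quantile(boot_means, c(0.025, 0.975))    # 95% percentile confidence interval
```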

Bootstrapping with infer

infer is part of tidymodels

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

install.packages("tidymodels")

Generate the sampling distribution

Generate the sampling distribution with specify(), generate(), and calculate()

bootstrap_distribution <- sample1  %>% 
  specify(response = volume) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

bootstrap_distribution
Response: volume (numeric)
# A tibble: 1,000 × 2
   replicate  stat
       <int> <dbl>
 1         1 1238.
 2         2 1262.
 3         3 1259.
 4         4 1232.
 5         5 1226.
 6         6 1294.
 7         7 1259.
 8         8 1246.
 9         9 1226.
10        10 1235.
# ℹ 990 more rows

Visualize the bootstrap distribution

Visualize the bootstrap distribution you generated with visualize()

bootstrap_distribution %>% 
  visualize()

Quantify the spread with se

Quantify the spread of the sampling distribution with get_confidence_interval(), using the standard error

se_bootstrap <- bootstrap_distribution %>% 
  get_confidence_interval(
    type = "se",
    point_estimate = mean(sample1$volume)
  )

se_bootstrap
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1210.    1274.
bootstrap_distribution %>% 
  visualize() +
  shade_confidence_interval(
    endpoints = se_bootstrap
  )

Quantify the spread with ci

Quantify the spread of the sampling distribution with get_confidence_interval(), using a percentile confidence interval

ci_bootstrap <- bootstrap_distribution %>% 
  get_confidence_interval(
    type = "percentile", 
    level = 0.95
  )

ci_bootstrap
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1211.    1273.
bootstrap_distribution %>% 
  visualize() +
  shade_confidence_interval(
    endpoints = ci_bootstrap
  )