Sampling Distribution

Data Science for Studying Language and the Mind

Katie Schuler

2024-09-17

Announcements

  • Grades for Problem Set 1 have been released
  • Problem Set 2 has been posted to the course website
    • Due Monday (Sept 23) at noon
  • You may request an extension of up to 3 days for any reason
    • Please ask in advance of the deadline
    • There is no limit to the number of extensions (you may take one for all 6 problem sets if you need it).

You are here

Data science with R
  • R basics
  • Data visualization
  • Data wrangling
Stats & Model building
  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Inference for regression
  • Mixed-effect models

Data science workflow

Data Science Workflow by R4DS

Attribution

Inspired by a MATLAB course by Kendrick Kay that Katie took

Data

Simulated from Ritchie et al. (2018):

Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants

Descriptive statistics

Dataset

Suppose we measure a single quantity: brain volume of human adults (in cubic centimeters)

# A tibble: 10 × 1
   volume
    <dbl>
 1  1193.
 2  1150.
 3  1243.
 4  1207.
 5  1236.
 6  1292.
 7  1201.
 8  1259.
 9  1157.
10  1169.
Rows: 5,216
Columns: 1
$ volume <dbl> 1193.283, 1150.383, 1242.702, 1206.808, 1235.955, 1292.399, 120…

Exploring a simple dataset

Each tick mark is one data point: one participant’s brain volume

Visualize the distribution

Visualize the distribution of the data with a histogram

Measure of central tendency

Summarize the data with a single value: mean, a measure of where a central or typical value might fall

Measure of variability

Summarize the spread of the data with standard deviation

Parametric statistics

Mean and sd are parametric summary statistics. They are given by the following equations:

\(mean(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\)

\(sd(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\)
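
As a quick sketch (not from the original slides), these equations map directly onto R's built-in functions; the volumes here are made up for illustration:

# compute the mean and sd by hand, then with the built-ins
x <- c(1193, 1150, 1243, 1207, 1236)           # a few made-up brain volumes
sum(x) / length(x)                             # mean: sum over n
sqrt(sum((x - mean(x))^2) / (length(x) - 1))   # sd: note the n - 1 denominator
mean(x)   # built-in equivalent
sd(x)     # built-in equivalent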

Nonparametric statistics

  • Mean and sd are a good summary of the data when the distribution is normal (Gaussian)
  • But suppose our distribution is not normal.

Visualize the distribution

Suppose we have a non-normal distribution

Nonparametric statistics

mean() and sd() are not a good summary of central tendency and variability anymore.

Median

Instead we can use the median as our measure of central tendency: the value below which 50% of the data points fall.
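
A minimal sketch in R (the volumes vector is made up for illustration):

# the median is the 50th percentile
volumes <- c(1100, 1150, 1190, 1200, 1210, 1260, 1400)   # made-up, skewed sample
median(volumes)          # 1200
quantile(volumes, 0.5)   # same value, computed as a percentile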

IQR

And the interquartile range (IQR) as a measure of the spread in our data: the difference between the 25th and 75th percentiles (50% of the data fall between these values)
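
Continuing the sketch with the same made-up volumes vector:

# the IQR is the difference between the 75th and 25th percentiles
IQR(volumes)
quantile(volumes, 0.75) - quantile(volumes, 0.25)   # same computation by hand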

Coverage interval

We can calculate any arbitrary coverage interval. In the sciences we often use the 95% coverage interval: the range between the 2.5th and 97.5th percentiles, which includes all but 5% of the data.
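
The same quantile() call generalizes to any coverage interval; continuing the sketch:

# 95% coverage interval: the 2.5th and 97.5th percentiles
quantile(volumes, c(0.025, 0.975))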

Probability distributions

A mathematical function that describes the probability of observing different possible values of a variable (also called a probability density function)

Uniform probability distribution

All possible values are equally likely

\(p(x) = \frac{1}{max-min}\)

The probability density function for the uniform distribution is given by this equation (with two parameters: min and max).
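
A sketch of this density in R, with min = 0 and max = 10 chosen only for illustration:

# uniform density is 1 / (max - min) inside [min, max], and 0 outside
dunif(5, min = 0, max = 10)    # 1/10 = 0.1
dunif(12, min = 0, max = 10)   # 0, outside the range
runif(3, min = 0, max = 10)    # draw 3 random values from this distribution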

Gaussian (normal) probability distribution

One of the most useful probability distributions for our purposes is the Gaussian (or Normal) distribution

\(p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)\)

The probability density function for the Gaussian distribution is given by the following equation, with the parameters \(\mu\) (mean) and \(\sigma\) (standard deviation).
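
dnorm() evaluates this density directly; a sketch using \(\mu = 1200\) and \(\sigma = 100\) (the values used for the simulated data later in the lecture):

# Gaussian density with mean 1200 and sd 100
dnorm(1200, mean = 1200, sd = 100)   # density is highest at the mean
dnorm(1400, mean = 1200, sd = 100)   # much lower two sds away
rnorm(3, mean = 1200, sd = 100)      # draw 3 random values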

Gaussian (normal) probability distribution

  • When computing the mean and standard deviation of a set of data, we are implicitly fitting a Gaussian distribution to the data.

Sampling variability

The population

When measuring some quantity, we are usually interested in knowing something about the population: the mean brain volume of Penn undergrads (the parameter)

The sample

But we only have a small sample of the population: maybe we can measure the brain volume of 100 students

Sampling variability

Any statistic we compute from a random sample we’ve collected (a parameter estimate) will be subject to sampling variability and will differ from the same statistic computed on the entire population (the parameter)

Sampling variability

If we took another sample of 100 students, our parameter estimate would be different.

Sampling distribution

The sampling distribution is the probability distribution of values our parameter estimate can take on. It is constructed by taking a random sample, computing the statistic of interest, and repeating this process many times.
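
A minimal simulation sketch (not from the slides): pretend we could see the whole population, draw many samples of 100, and keep the mean of each:

# simulate the sampling distribution of the mean
population <- rnorm(5216, mean = 1200, sd = 100)   # stand-in population
sample_means <- replicate(1000, mean(sample(population, size = 100)))
hist(sample_means)   # the sampling distribution of the mean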

Sampling distribution

Our first sample was on the low end of possible mean brain volume.

Sampling distribution

Our second sample was on the high end of possible mean brain volume.

Quantifying sampling variability

The spread of the sampling distribution indicates how the parameter estimate will vary from different random samples.

Quantifying sampling variability

We can quantify the spread (express our uncertainty about our parameter estimate) in two ways:

  • Parametrically, by computing the standard error
  • Nonparametrically, by constructing a confidence interval

Quantifying sampling variability

One way is to compute the standard deviation of the sampling distribution, which has a special name: the standard error

  • The standard error is given by the following equation, where \(\sigma\) is the standard deviation of the population and \(n\) is the sample size.
  • \(\frac{\sigma}{\sqrt{n}}\)
  • In practice, the standard deviation of the population is unknown, so we use the standard deviation of the sample as an estimate (see the sketch below).
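
In R this is a one-liner; a sketch using the sample sd as the estimate:

# estimated standard error of the mean: sample sd over the square root of n
x <- rnorm(100, mean = 1200, sd = 100)   # one sample of 100 subjects
sd(x) / sqrt(length(x))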

Standard error is parametric

  • Standard error is a parametric statistic because we assume a Gaussian probability distribution and compute the standard error based on what happens theoretically when we sample from that distribution.
  • \(\frac{\sigma}{\sqrt{n}}\)

Quantifying sampling variability

Another way is to construct a confidence interval

Practical considerations

  • We don’t have access to the entire population
  • We can (usually) only do our experiment once
  • So, in practice we only have one sample

Bootstrapping

A way to construct the sampling distribution when, in practice, we only have one sample

Bootstrapping

Instead of assuming a parametric probability distribution, we use the data themselves to approximate the underlying distribution: we sample our sample!
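
The idea in base R, before turning to infer (a sketch; x stands in for our single sample):

# one bootstrap replicate resamples n values from the sample, with replacement
x <- rnorm(100, mean = 1200, sd = 100)   # our single sample
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
hist(boot_means)   # approximates the sampling distribution of the mean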

Bootstrapping with infer

The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework

install.packages("infer")

Let’s create some data

Suppose we collect a sample of 100 subjects and find their mean brain volume is 1200 cubic cm and sd is 100:

# load the tidyverse for tibble() and the pipe
library(tidyverse)

# get a sample to work with as our "data"
sample1 <- tibble(
  subject_id = 1:100,
  volume = rnorm(100, mean = 1200, sd = 100)
)

sample1 %>% head(10)
# A tibble: 10 × 2
   subject_id volume
        <int>  <dbl>
 1          1  1225.
 2          2  1186.
 3          3  1176.
 4          4  1207.
 5          5  1173.
 6          6  1137.
 7          7  1118.
 8          8  1177.
 9          9  1169.
10         10  1216.

Generate the sampling distribution

Generate the sampling distribution with specify(), generate(), and calculate()

# load infer for specify(), generate(), and calculate()
library(infer)

bootstrap_distribution <- sample1 %>% 
  specify(response = volume) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

bootstrap_distribution
Response: volume (numeric)
# A tibble: 1,000 × 2
   replicate  stat
       <int> <dbl>
 1         1 1213.
 2         2 1184.
 3         3 1194.
 4         4 1184.
 5         5 1200.
 6         6 1180.
 7         7 1197.
 8         8 1202.
 9         9 1189.
10        10 1193.
# ℹ 990 more rows

Visualize the bootstrap distribution

Visualize the bootstrap distribution you generated with visualize()

bootstrap_distribution %>% 
  visualize()

  • visualize() is a shortcut function for ggplot!
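
Roughly what visualize() does under the hood (a sketch; visualize() also sets its own labels and defaults):

# the bootstrap distribution as an ordinary ggplot histogram
ggplot(bootstrap_distribution, aes(x = stat)) +
  geom_histogram()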

Quantify the spread with se

Quantify the spread of the bootstrapped sampling distribution with get_confidence_interval(), and set the type to "se" for standard error.

se_bootstrap <- bootstrap_distribution %>% 
  get_confidence_interval(
    type = "se",
    point_estimate = mean(sample1$volume)
  )

se_bootstrap
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1179.    1213.
bootstrap_distribution %>% 
  visualize() +
  shade_confidence_interval(
    endpoints = se_bootstrap
  )
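
The se interval can also be computed by hand (a sketch, not infer’s exact code): the point estimate plus or minus about 1.96 standard errors:

# same interval by hand: point estimate ± 1.96 * se
se <- sd(bootstrap_distribution$stat)
mean(sample1$volume) + c(-1.96, 1.96) * se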

Quantify the spread with ci

Quantify the spread of the sampling distribution with get_confidence_interval(), and set the type to "percentile" for a confidence interval.

ci_bootstrap <- bootstrap_distribution %>% 
  get_confidence_interval(
    type = "percentile", 
    level = 0.95
  )

ci_bootstrap
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1178.    1212.
bootstrap_distribution %>% 
  visualize() +
  shade_confidence_interval(
    endpoints = ci_bootstrap
  )
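
The percentile interval is just the middle 95% of the bootstrap distribution; approximately the same interval by hand:

# 2.5th and 97.5th percentiles of the bootstrapped means
quantile(bootstrap_distribution$stat, c(0.025, 0.975))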

More next lecture

  • Let’s stop there, and work through some more demos in our next lecture!