Data Science for Studying Language and the Mind
2023-09-21
Probability distributions
Sampling variability
Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants
Suppose we measure a single quantity: brain volume of human adults
Rows: 5,216
Columns: 1
$ volume <dbl> 1193.283, 1150.383, 1242.702, 1206.808, 1235.955, 1292.399, 120…
Visualize the distribution of the data with a histogram
Summarize the data with a single value: the mean, a measure of where a central or typical value might fall
# A tibble: 1 × 2
n mean
<int> <dbl>
1 5216 1173.
Summarize the spread of the data with standard deviation
# A tibble: 1 × 3
n mean sd
<int> <dbl> <dbl>
1 5216 1173. 112.
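A summary table like the one above can be produced with `summarise()` from dplyr. This is a minimal sketch, not the code behind the slide: `brain_sim` is a hypothetical, simulated stand-in for the brain volume data, with the mean and sd of the simulation chosen to roughly match the reported values.

```r
library(dplyr)

# Simulated stand-in for the brain volume data (an assumption, NOT the UK Biobank values)
set.seed(1)
brain_sim <- tibble(volume = rnorm(5216, mean = 1173, sd = 112))

brain_summary <- brain_sim %>%
  summarise(
    n    = n(),           # number of observations
    mean = mean(volume),  # central tendency
    sd   = sd(volume)     # spread
  )

brain_summary
```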
Mean and sd are parametric summary statistics. They are given by the following equations:
\(mean(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\)
\(sd(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^2}{n-1}}\)
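To connect the equations to R, the sketch below computes both statistics by hand and compares them with the built-in `mean()` and `sd()`. The vector `x` is made up purely for illustration.

```r
# Small made-up vector, used only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: sum of the values divided by n
mean_by_hand <- sum(x) / n

# Standard deviation: square root of the summed squared deviations over n - 1
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

# Both comparisons should report TRUE
all.equal(mean_by_hand, mean(x))
all.equal(sd_by_hand, sd(x))
```

Note the `n - 1` in the denominator: R's `sd()` uses the same convention as the equation above.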
Mean and sd are a good summary when the distribution is normal (Gaussian).
But suppose our distribution is not normal. Then the mean and sd are no longer a good summary.
Instead we can use the median as our measure of central tendency.
# A tibble: 1 × 2
n median
<int> <dbl>
1 111 15
And the interquartile range (IQR) as a measure of the spread in our data.
# A tibble: 1 × 4
n median lower upper
<int> <dbl> <dbl> <dbl>
1 111 15 5 25
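A table of this shape can be sketched with `median()` and `quantile()`. The data here are made up (a right-skewed exponential sample of 111 values), and the sketch assumes that `lower` and `upper` are the 25th and 75th percentiles, which is the usual definition of the IQR bounds.

```r
library(dplyr)

# Made-up right-skewed data (an assumption, not the data behind the slide)
set.seed(2)
skewed <- tibble(x = rexp(111, rate = 1 / 20))

skew_summary <- skewed %>%
  summarise(
    n      = n(),
    median = median(x),
    lower  = unname(quantile(x, 0.25)),  # 25th percentile
    upper  = unname(quantile(x, 0.75))   # 75th percentile
  )

skew_summary
```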
A probability distribution is a mathematical function that describes the probability of observing different possible values of a variable
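Uniform: all possible values between min and max are equally likely
\(p(x) = \frac{1}{max-min}\)
Normal (Gaussian): values near the mean \(\mu\) are most likely, with the spread set by the standard deviation \(\sigma\)
\(p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)\)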
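As a check on the two formulas, the sketch below writes each density as a small R function and compares it with base R's `dunif()` and `dnorm()`. The helper names `unif_pdf` and `norm_pdf` and the parameter values are illustrative choices, not part of the lecture.

```r
# Uniform density: 1 / (max - min) inside [lo, hi], 0 outside
unif_pdf <- function(x, lo, hi) {
  ifelse(x >= lo & x <= hi, 1 / (hi - lo), 0)
}

# Normal density, written term by term from the formula
norm_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

x <- c(0.5, 1, 2.5)

# Both comparisons should report TRUE
all.equal(unif_pdf(x, lo = 0, hi = 4), dunif(x, min = 0, max = 4))
all.equal(norm_pdf(x, mu = 1, sigma = 2), dnorm(x, mean = 1, sd = 2))
```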
We actually want to know something about the population: the mean brain volume of Penn undergrads (the parameter)
But we only have a small sample of the population: maybe we can measure the brain volume of 100 students
Any statistic we compute from a random sample we’ve collected (a parameter estimate) will be subject to sampling variability and will differ from that statistic computed on the entire population (the parameter)
If we took another sample of 100 students, our parameter estimate would be different.
The sampling distribution is the probability distribution of values our parameter estimate can take on. It is constructed by taking a random sample, computing the statistic of interest, and repeating many times.
The spread of the sampling distribution indicates how the parameter estimate will vary across different random samples. We can quantify the spread (express our uncertainty about our parameter estimate) in two ways:
standard error
One way is to compute the standard deviation of the sampling distribution: the standard error
confidence interval
Another way is to construct a confidence interval
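The construction described above can be sketched in base R when the whole population is available, which is only possible in a simulation. Everything here is hypothetical: `population` is a simulated set of 10,000 brain volumes, and the interval shown is simply the middle 95% of the simulated sample means, one common way to report such an interval.

```r
set.seed(3)

# Hypothetical population of 10,000 brain volumes (simulated, not real data)
population <- rnorm(10000, mean = 1200, sd = 100)

# Take a random sample of 100, compute its mean, and repeat 1,000 times
sample_means <- replicate(1000, mean(sample(population, size = 100)))

# Standard error: the standard deviation of the sampling distribution
standard_error <- sd(sample_means)

# Interval: the middle 95% of the sampling distribution
interval_95 <- quantile(sample_means, probs = c(0.025, 0.975))

standard_error
interval_95
```

With a population sd of about 100 and samples of 100, the standard error should come out near 100 / sqrt(100) = 10.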
To construct the sampling distribution
Instead of assuming a parametric probability distribution, we use the data themselves to approximate the underlying distribution: we sample our sample (with replacement), a procedure known as bootstrapping!
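Before handing this to a package, it may help to see the idea in base R. The sketch below resamples a single sample with replacement; `our_sample` is a hypothetical, simulated sample of 100 brain volumes.

```r
set.seed(4)

# A single hypothetical sample of 100 brain volumes (simulated, not real data)
our_sample <- rnorm(100, mean = 1200, sd = 100)

# Resample the sample itself: same size, with replacement,
# compute the mean each time, and repeat 1,000 times
boot_means <- replicate(
  1000,
  mean(sample(our_sample, size = length(our_sample), replace = TRUE))
)

# The spread of the resampled means approximates the standard error
sd(boot_means)
```

The key difference from sampling the population is `replace = TRUE`: without it, every resample of size 100 would just be the original sample in a different order.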
infer
infer is part of tidymodels
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.
Generate the sampling distribution with specify(), generate(), and calculate()
Response: volume (numeric)
# A tibble: 1,000 × 2
replicate stat
<int> <dbl>
1 1 1238.
2 2 1262.
3 3 1259.
4 4 1232.
5 5 1226.
6 6 1294.
7 7 1259.
8 8 1246.
9 9 1226.
10 10 1235.
# ℹ 990 more rows
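A pipeline that yields a tibble of this shape might look like the sketch below. The data frame `sample_sim` is a hypothetical, simulated sample of 100 volumes, so the numbers will not match the output above; the choice of 1,000 bootstrap replicates is taken from the row count shown.

```r
library(infer)

# Hypothetical sample of 100 brain volumes (simulated, not the lecture data)
set.seed(5)
sample_sim <- data.frame(volume = rnorm(100, mean = 1200, sd = 100))

bootstrap_distribution <- sample_sim %>%
  specify(response = volume) %>%               # which variable we care about
  generate(reps = 1000, type = "bootstrap") %>% # resample with replacement 1,000 times
  calculate(stat = "mean")                      # one mean per resample

bootstrap_distribution
```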
Visualize the bootstrap distribution you generated with visualize()
se
Quantify the spread of the sampling distribution with get_confidence_interval(), using standard error
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 1210. 1274.
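A standard-error interval of this kind can be sketched as follows. The data are again simulated (`sample_sim` is hypothetical), and the 95% level is an assumption since the slide does not state one. With `type = "se"`, infer needs the observed statistic passed as `point_estimate`.

```r
library(infer)

# Hypothetical sample of 100 brain volumes (simulated, not the lecture data)
set.seed(6)
sample_sim <- data.frame(volume = rnorm(100, mean = 1200, sd = 100))

# Bootstrap distribution of the mean
boot <- sample_sim %>%
  specify(response = volume) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Observed sample mean, used as the center of the interval
x_bar <- sample_sim %>%
  specify(response = volume) %>%
  calculate(stat = "mean")

# Interval built from the point estimate and the bootstrap standard error
ci_se <- boot %>%
  get_confidence_interval(type = "se", point_estimate = x_bar, level = 0.95)

ci_se
```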
ci
Quantify the spread of the sampling distribution with get_confidence_interval(), using a confidence interval
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 1211. 1273.
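The slide does not say which interval type was used; the sketch below assumes a 95% percentile interval, which takes the bounds directly from the quantiles of the bootstrap distribution. As before, `sample_sim` is hypothetical, simulated data.

```r
library(infer)

# Hypothetical sample of 100 brain volumes (simulated, not the lecture data)
set.seed(7)
sample_sim <- data.frame(volume = rnorm(100, mean = 1200, sd = 100))

# Bootstrap distribution of the mean
boot <- sample_sim %>%
  specify(response = volume) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Percentile interval: bounds come from the bootstrap distribution itself
ci_pct <- boot %>%
  get_confidence_interval(level = 0.95, type = "percentile")

ci_pct
```

For a roughly symmetric bootstrap distribution, this interval and the standard-error interval should be close, as they are in the two outputs above.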