Data Science for Studying Language and the Mind
2024-09-17
Sampling distribution
Inspired by a MATLAB course Katie took by Kendrick Kay
Tuesday’s lecture was conceptual. Today we will demo those concepts to try to understand them better.
Let’s first try to understand descriptive statistics a bit better by using a toy dataset.
Suppose we create a tibble that measures a single quantity: how many minutes your instructor was late to class on each of 9 days.
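A minimal sketch of how this toy tibble might be constructed (the object name `data` is just an assumption; the values match the printed output below):

```r
library(tibble)

# toy dataset: minutes late to class on each of 9 days
data <- tibble(
  late_minutes = c(1, 2, 2, 3, 4, 2, 5, 3, 3)
)
data
```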
# A tibble: 9 × 1
late_minutes
<dbl>
1 1
2 2
3 2
4 3
5 4
6 2
7 5
8 3
9 3
Recall that we can summarize (or describe) a set of data with descriptive statistics
(aka summary statistics). There are three kinds we typically use:
Measure | Stats | Describes |
---|---|---|
Central tendency | mean, median, mode | a central or typical value |
Variability | variance, standard deviation, range, IQR | dispersion or spread of values |
Frequency distribution | count | how frequently different values occur |
We can create a visual summary of our dataset with a histogram, which plots the frequency distribution
of the data.
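For example, assuming the tibble above is named `data`, a histogram might be plotted like this (bin width chosen by hand):

```r
library(ggplot2)

# frequency distribution of late_minutes
ggplot(data, aes(x = late_minutes)) +
  geom_histogram(binwidth = 1)
```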
Measures of central tendency describe where a central or typical value might fall.
Measures of variability describe the dispersion or spread of values.
We can also get these with group_by()
and summarise()
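A sketch of what that might look like; since this toy dataset has no grouping variable, summarise() alone is enough here:

```r
library(dplyr)

# descriptive statistics for the toy dataset
data %>%
  summarise(
    mean_late   = mean(late_minutes),
    median_late = median(late_minutes),
    sd_late     = sd(late_minutes),
    n           = n()
  )
```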
Some statistics are considered parametric
because they make assumptions about the distribution of the data (we can compute them theoretically from parameters)
The mean is one example of a parametric descriptive statistic, where \(x_{i}\) is the \(i\)-th data point and \(n\) is the total number of data points
\(mean(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\)
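As a quick check, the formula can be computed directly and compared with R's built-in mean():

```r
sum(data$late_minutes) / nrow(data)  # sum of the values divided by n
mean(data$late_minutes)              # same value from the built-in function
```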
Standard deviation is another parametric descriptive statistic, where \(x_{i}\) is the \(i\)-th data point, \(n\) is the total number of data points, and \(\bar{x}\) is the mean.
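The corresponding formula (R's sd() computes the sample standard deviation, dividing by \(n-1\)):

\(sd(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n-1}}\)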
We can compute this by hand as well, to see how it happens under the hood of sd()
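A sketch of the by-hand computation; the mutate() step produces the table printed below:

```r
library(dplyr)

# deviation of each value from the mean, and its square
data <- data %>%
  mutate(
    dev    = late_minutes - mean(late_minutes),
    sq_dev = dev^2
  )

# sum the squared deviations, divide by n - 1, and take the square root
data %>%
  summarise(sd_by_hand = sqrt(sum(sq_dev) / (n() - 1)))
```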
# A tibble: 9 × 3
late_minutes dev sq_dev
<dbl> <dbl> <dbl>
1 1 -1.78 3.16
2 2 -0.778 0.605
3 2 -0.778 0.605
4 3 0.222 0.0494
5 4 1.22 1.49
6 2 -0.778 0.605
7 5 2.22 4.94
8 3 0.222 0.0494
9 3 0.222 0.0494
How do we visualize the mean and sd on our histogram?
First get the summary statistics with summarise()
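One possible way to do this (a sketch): store the summary statistics, then add them to the histogram with geom_vline():

```r
# summary statistics used for the annotations
sum_stats <- data %>%
  summarise(
    mean_late = mean(late_minutes),
    sd_late   = sd(late_minutes)
  )

# histogram with the mean (solid) and +/- 1 sd (dashed) marked
ggplot(data, aes(x = late_minutes)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = sum_stats$mean_late) +
  geom_vline(
    xintercept = sum_stats$mean_late + c(-1, 1) * sum_stats$sd_late,
    linetype = "dashed"
  )
```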
Other statistics are considered nonparametric
, because they make minimal assumptions about the distribution of the data (we cannot compute them theoretically from parameters).
The median is the value below which 50% of the data fall.
The interquartile range (IQR) is the difference between the 25th and 75th percentiles. We can compute these values with the quantile()
function.
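A sketch of that computation, which yields the tibble printed below:

```r
data %>%
  summarise(
    iqr_lower = quantile(late_minutes, 0.25),
    iqr_upper = quantile(late_minutes, 0.75)
  )
```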
# A tibble: 1 × 2
iqr_lower iqr_upper
<dbl> <dbl>
1 2 3
The IQR is also called the 50% coverage interval (because 50% of the data fall in this range). We can calculate any arbitrary coverage interval with quantile()
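For example, a 95% coverage interval just changes the probabilities passed to quantile():

```r
data %>%
  summarise(
    lower_95 = quantile(late_minutes, 0.025),
    upper_95 = quantile(late_minutes, 0.975)
  )
```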
We can visualize these statistics on our histograms in the same way we did mean and sd:
First get the summary statistics with summarise()
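A sketch, analogous to the mean and sd plot above, marking the median (solid) and the IQR bounds (dashed):

```r
ggplot(data, aes(x = late_minutes)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = median(data$late_minutes)) +
  geom_vline(
    xintercept = quantile(data$late_minutes, c(0.25, 0.75)),
    linetype = "dashed"
  )
```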
A probability distribution is a mathematical function of one (or more) variables that describes the likelihood of observing any specific set of values for the variables.
In R, each probability distribution comes with a family of four functions:

function | params | returns |
---|---|---|
d*() | depends on * | height of the probability density function at the given values |
p*() | depends on * | cumulative density function (probability that a random number from the distribution will be less than the given values) |
q*() | depends on * | value whose cumulative distribution matches the probability (inverse of p) |
r*() | depends on * | returns n random numbers generated from the distribution |
The uniform distribution is the simplest probability distribution, where all values are equally likely. The probability density function for the uniform distribution is given by this equation (with two parameters: min and max).
\(p(x) = \frac{1}{max-min}\)
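As a quick check of the formula, dunif() returns 1/(max − min) for any value inside the range (the min and max here are arbitrary):

```r
dunif(5, min = 0, max = 10)  # density anywhere inside the range
1 / (10 - 0)                 # the same value from the formula: 0.1
```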
We just use norm (normal) to stand in for the *:
function | params | returns |
---|---|---|
dnorm() | x, mean, sd | height of the probability density function at the given values |
pnorm() | q, mean, sd | cumulative density function (probability that a random number from the distribution will be less than the given values) |
qnorm() | p, mean, sd | value whose cumulative distribution matches the probability (inverse of p) |
rnorm() | n, mean, sd | returns n random numbers generated from the distribution |
We can use rnorm() to sample from the distribution:

- rnorm(n, mean, sd): returns n random numbers generated from the distribution
- dnorm(x, mean, sd): returns the height of the probability density function at the given values
- pnorm(q, mean, sd): returns the cumulative density function (probability that a random number from the distribution will be less than the given values)
- qnorm(p, mean, sd): returns the value whose cumulative distribution matches the probability (inverse of p)
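A sketch of each function in use, with arbitrarily chosen parameter values (mean = 0, sd = 1):

```r
rnorm(5, mean = 0, sd = 1)      # 5 random draws from a standard normal
dnorm(0, mean = 0, sd = 1)      # density at x = 0 (about 0.399)
pnorm(1.96, mean = 0, sd = 1)   # P(X < 1.96), about 0.975
qnorm(0.975, mean = 0, sd = 1)  # value with cumulative probability 0.975, about 1.96
```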
Change the function’s suffix (the * in r*()) to another distribution and pass the parameters that define that distribution.

- runif(n, min, max): returns n random numbers generated from the distribution
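For example, swapping the suffix to unif and passing min and max:

```r
runif(5, min = 0, max = 10)  # 5 random numbers between 0 and 10
```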
But remember, this only works for parametric probability distributions (those defined by particular parameters).
Let’s do a walkthrough from start to finish.
Generate data for the brain volume of the 28201 grad and undergrad students at UPenn and compute the parameter of interest (mean brain volume)
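A sketch of this step. The population mean and sd used here (1200 and 100) are assumptions chosen only to roughly match the numbers printed later; the object names are also made up:

```r
library(tidyverse)

set.seed(1)  # for reproducibility

# simulate the full population of 28201 students (assumed mean and sd)
population <- tibble(
  volume = rnorm(28201, mean = 1200, sd = 100)
)

# the parameter of interest: the population mean
population %>% summarise(parameter = mean(volume))
```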
Now take a realistic sample of 100 students and compute the parameter estimate (the mean brain volume of our sample).
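Continuing the sketch, slice_sample() is one way to draw the sample:

```r
# draw a sample of 100 students from the simulated population
sample_100 <- population %>% slice_sample(n = 100)

# the parameter estimate: the sample mean
sample_100 %>% summarise(estimate = mean(volume))
```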
Use infer to construct the probability distribution of the values our parameter estimate can take on (the sampling distribution).
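A sketch of the infer pipeline, assuming a bootstrap approach: specify() prints the "Response: volume (numeric)" header shown below, and the full pipeline yields the tibble of 1000 replicate means:

```r
library(infer)

sampling_distribution <- sample_100 %>%
  specify(response = volume) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

sampling_distribution
```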
Response: volume (numeric)
# A tibble: 1,000 × 2
replicate stat
<int> <dbl>
1 1 1209.
2 2 1195.
3 3 1213.
4 4 1186.
5 5 1215.
6 6 1186.
7 7 1213.
8 8 1217.
9 9 1220.
10 10 1225.
# ℹ 990 more rows
Recall that standard error is the standard deviation of the sampling distribution. It indicates about how far away the true population parameter might be from our estimate.
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 1184. 1227.
Confidence intervals are the nonparametric approach to the standard error: if the distribution is Gaussian, +/- 1 standard error gives the 68% confidence interval and +/- 2 gives the 95% confidence interval.
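A sketch of both quantities, using the objects assumed above: the standard error is just the sd of the replicate means, and infer's get_confidence_interval() returns a percentile interval at a chosen level:

```r
# standard error: sd of the sampling distribution
sampling_distribution %>% summarise(se = sd(stat))

# percentile confidence interval (68% chosen here)
sampling_distribution %>%
  get_confidence_interval(level = 0.68, type = "percentile")
```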
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 1195. 1216.