Data Science for Studying Language and the Mind

Katie Schuler

2023-09-21

`here`

- Hello, world!
- R basics
- Data importing
- Data visualization
- Data wrangling

`Probability distributions`

`Sampling variability`

- Hypothesis testing
- Model specification
- Model fitting
- Model accuracy
- Model reliability

- Classification
- Feature engineering (preprocessing)
- Inference for regression
- Mixed-effect models

- Inspired by a MATLAB course Katie took by Kendrick Kay
- Data simulated from Ritchie et al. (2018), "Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants"

Suppose we measure a single quantity: `brain volume of human adults`

```
Rows: 5,216
Columns: 1
$ volume <dbl> 1193.283, 1150.383, 1242.702, 1206.808, 1235.955, 1292.399, 120…
```

Visualize the distribution of the data with a `histogram`

Summarize the data with a single value: the `mean`, a measure of where a central or typical value might fall

```
# A tibble: 1 × 2
      n  mean
  <int> <dbl>
1  5216 1173.
```

Summarize the spread of the data with `standard deviation`

```
# A tibble: 1 × 3
      n  mean    sd
  <int> <dbl> <dbl>
1  5216 1173.  112.
```

Mean and sd are `parametric` summary statistics. They are given by the following equations:

\(\text{mean}(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\)

\(\text{sd}(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\)
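The two equations can be checked against R's built-in `mean()` and `sd()`. This is a minimal sketch, assuming a simulated stand-in for the data: the `rnorm()` draw below is hypothetical and only borrows the sample size, mean, and sd reported in the summary tables.

```r
# Hypothetical stand-in for the slides' data: 5,216 simulated volumes
# (n, mean, and sd are borrowed from the summary tables; the draw itself is made up)
set.seed(1)
volume <- rnorm(5216, mean = 1173, sd = 112)

n <- length(volume)

# Mean: sum of the values divided by n
mean_by_hand <- sum(volume) / n

# Standard deviation: square root of the summed squared deviations over n - 1
sd_by_hand <- sqrt(sum((volume - mean_by_hand)^2) / (n - 1))

# Both hand-computed values should agree with the built-in functions
all.equal(mean_by_hand, mean(volume))
all.equal(sd_by_hand, sd(volume))
```

Note the `n - 1` in the denominator: `sd()` computes the sample standard deviation, not the population version that divides by `n`.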

- Mean and sd are a good summary of the data when the distribution is `normal` (**Gaussian**)

But suppose our distribution is not normal.

Then mean and sd are no longer a good summary.

Instead we can use the median as our measure of central tendency.

```
# A tibble: 1 × 2
      n median
  <int>  <dbl>
1   111     15
```

And the interquartile range (`IQR`), the range between the 25th and 75th percentiles, as a measure of the spread in our data.

```
# A tibble: 1 × 4
      n median lower upper
  <int>  <dbl> <dbl> <dbl>
1   111     15     5    25
```
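As a rough illustration of these nonparametric summaries, the sketch below computes the median and the quartiles in base R. The data are hypothetical: a right-skewed `rexp()` draw that is not the slides' data and only reuses the sample size of 111.

```r
# Hypothetical right-skewed sample (NOT the slides' data); only n = 111 is reused
set.seed(1)
x <- rexp(111, rate = 1 / 20)

# In a right-skewed sample the long tail tends to pull the mean above the median
mean(x)
median(x)

# Lower and upper quartiles (25th and 75th percentiles)
quartiles <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
lower <- quartiles[1]
upper <- quartiles[2]

# The IQR is the distance between the two quartiles
upper - lower
IQR(x)
```

Because the median and quartiles depend only on the ordering of the values, a few extreme observations move them far less than they move the mean and sd.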

A `probability distribution` is a mathematical function that describes the probability of observing different possible values of a variable

Uniform distribution: all possible values are equally likely

\(p(x) = \frac{1}{\text{max}-\text{min}}\)

Gaussian (normal) distribution:

\(p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)\)
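The uniform and Gaussian density formulas can be compared with base R's `dunif()` and `dnorm()`. This is a small sketch: the uniform bounds and the evaluation points are made-up values, and `mu` and `sigma` reuse the mean and sd from the summary table.

```r
# Uniform density on [lo, hi]: constant at 1 / (max - min)
lo <- 0                      # hypothetical minimum
hi <- 4                      # hypothetical maximum
x_unif <- c(0.5, 2, 3.5)     # arbitrary points inside the interval
p_unif <- rep(1 / (hi - lo), length(x_unif))
all.equal(p_unif, dunif(x_unif, min = lo, max = hi))

# Gaussian density with mean mu and standard deviation sigma
mu <- 1173                   # borrowed from the summary table
sigma <- 112
x_norm <- c(1000, 1173, 1300)  # arbitrary evaluation points
p_norm <- 1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x_norm - mu) / sigma)^2)
all.equal(p_norm, dnorm(x_norm, mean = mu, sd = sigma))
```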

We actually want to know something about the `population`: the mean brain volume of Penn undergrads (the **parameter**)

But we only have a small `sample` of the population: maybe we can measure the brain volume of 100 students

Any statistic we compute from a random sample we’ve collected (**parameter estimate**) will be subject to `sampling variability` and will differ from that statistic computed on the entire population (**parameter**)

If we took another sample of 100 students, our parameter estimate would be different.

The `sampling distribution` is the probability distribution of values our parameter estimate can take on. It is constructed by taking a random sample, computing the statistic of interest, and repeating many times.
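The recipe for a sampling distribution can be simulated when the population is known. This is a sketch under a strong assumption: the `population` vector is hypothetical (a normal draw using the mean and sd from the earlier summary table), which is something we never have in a real study.

```r
set.seed(1)

# Hypothetical population: in practice we never get to see this
population <- rnorm(100000, mean = 1173, sd = 112)

# One random sample of 100 and its parameter estimate
one_sample <- sample(population, size = 100)
mean(one_sample)

# Repeat many times: each repetition draws a new sample and recomputes the mean
sampling_distribution <- replicate(1000, mean(sample(population, size = 100)))

# The estimates vary from sample to sample around the population mean
summary(sampling_distribution)
sd(sampling_distribution)
```

Each of the 1,000 values is one parameter estimate, so their spread shows how much the estimate would change if the study were repeated with a fresh sample.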

The `spread` of the sampling distribution indicates how the parameter estimate will vary across different random samples. We can quantify the spread (express our uncertainty about our parameter estimate) in two ways:

`standard error`

One way is to compute the standard deviation of the sampling distribution: the `standard error`

`confidence interval`

Another way is to construct a `confidence interval`

To construct the sampling distribution directly, we face two problems:

- We don’t have access to the entire population
- We can (usually) only do our experiment once

Instead of assuming a parametric probability distribution, we use the data themselves to approximate the underlying distribution: we `sample our sample`!
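Resampling the sample (bootstrapping) can be sketched by hand in base R before turning to a package. The `observed` vector below is a hypothetical sample of 100 simulated volumes, not the slides' data, and the 95% level is an assumed choice.

```r
set.seed(1)

# Hypothetical observed sample of 100 volumes (the only data we would have)
observed <- rnorm(100, mean = 1173, sd = 112)

# Resample the sample: draw n values WITH replacement, compute the mean,
# and repeat many times
boot_means <- replicate(
  1000,
  mean(sample(observed, size = length(observed), replace = TRUE))
)

# Standard error: standard deviation of the bootstrap distribution
boot_se <- sd(boot_means)

# 95% percentile interval: the middle 95% of the bootstrap distribution
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975), names = FALSE)

boot_se
boot_ci
```

Sampling with replacement is what makes each resample differ from the original data while keeping the same size.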

`infer` is part of `tidymodels`

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

Generate the sampling distribution with `specify()`, `generate()`, and `calculate()`

```
Response: volume (numeric)
# A tibble: 1,000 × 2
   replicate  stat
       <int> <dbl>
 1         1 1238.
 2         2 1262.
 3         3 1259.
 4         4 1232.
 5         5 1226.
 6         6 1294.
 7         7 1259.
 8         8 1246.
 9         9 1226.
10        10 1235.
# ℹ 990 more rows
```
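Output of this shape could come from a pipeline along the following lines. This is a sketch, not necessarily the code behind the slide: `sample_data` is a hypothetical simulated sample, and the sample size and the choice of `reps = 1000` with `stat = "mean"` are assumptions inferred from the printed output.

```r
library(infer)

set.seed(1)

# Hypothetical sample of 100 volumes standing in for the slides' data
sample_data <- data.frame(volume = rnorm(100, mean = 1173, sd = 112))

bootstrap_distribution <- sample_data |>
  specify(response = volume) |>                  # declare the variable of interest
  generate(reps = 1000, type = "bootstrap") |>   # resample with replacement
  calculate(stat = "mean")                       # one mean per resample

bootstrap_distribution
```

The result has one row per bootstrap replicate, with the resampled mean in the `stat` column.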

Visualize the bootstrap distribution you generated with `visualize()`

`se`

Quantify the spread of the sampling distribution with `get_confidence_interval()`, using **standard error**

```
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1210.    1274.
```

`ci`

Quantify the spread of the sampling distribution with `get_confidence_interval()`, using a **confidence interval**

```
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1211.    1273.
```
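Both intervals can be requested from the same bootstrap distribution through the `type` argument of `get_confidence_interval()`. The sketch below again uses a hypothetical simulated sample, and it assumes that the second interval is the percentile type at the 95% level; the slides do not show which arguments were used.

```r
library(infer)

set.seed(1)

# Hypothetical sample of 100 volumes standing in for the slides' data
sample_data <- data.frame(volume = rnorm(100, mean = 1173, sd = 112))

bootstrap_distribution <- sample_data |>
  specify(response = volume) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

# type = "se" builds the interval around a point estimate, so one must be supplied
observed_mean <- mean(sample_data$volume)

se_ci <- bootstrap_distribution |>
  get_confidence_interval(level = 0.95, type = "se", point_estimate = observed_mean)

# type = "percentile" takes the middle 95% of the bootstrap statistics
percentile_ci <- bootstrap_distribution |>
  get_confidence_interval(level = 0.95, type = "percentile")

# visualize() returns a ggplot; shade_confidence_interval() overlays the interval
ci_plot <- visualize(bootstrap_distribution) +
  shade_confidence_interval(endpoints = percentile_ci)

se_ci
percentile_ci
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals come out close to each other, as they do in the output above.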