
# Data Science for Studying Language and the Mind

Katie Schuler

2024-10-08

- You did great on the exam!
- You can replace your lowest exam score with the optional final
- The final exam is cumulative: another opportunity to show mastery of the material.

**Demos more accessible**

- Posted before class
- Make font bigger
- Not so fast, please 😅

**In-class exercises** (not graded)

- Slightly more interactive

**Challenge questions**

- On labs or homework (optional)

**Projects** instead of exams

**RStudio** instead of Google Colab


- R basics
- Data visualization
- Data wrangling

- Sampling distribution
- Hypothesis testing

- Model specification
- Model fitting
- Model accuracy
- Model reliability

- Classification
- Inference for regression
- Mixed-effect models

Sampling distribution and hypothesis testing with Correlation!

To review what we learned before break, let’s explore the relationship between Frequency and meanFamiliarity in the `ratings` dataset of the `languageR` package.

If there were no relationship, we’d say they are **independent**: knowing the value of one provides no information about the other. But that’s not the case here.

In a linear relationship, when one variable goes up the other goes up (positive); or when one goes up the other goes down (negative).

One way to quantify linear relationships is with **correlation** (\(r\)). Correlation expresses the linear relationship as a range from -1 (perfectly negative) to 1 (perfectly positive).

We can compute a correlation with R’s built-in `cor(x, y)` function, or via the `infer` package.
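A sketch of both approaches (assuming `languageR` and `infer` are installed; the variable names come from the `ratings` dataset):

```r
library(languageR)  # provides the ratings dataset
library(infer)

# base R: built-in correlation
cor(ratings$Frequency, ratings$meanFamiliarity)

# the infer way: specify response and explanatory, then calculate the statistic
ratings |>
  specify(response = meanFamiliarity, explanatory = Frequency) |>
  calculate(stat = "correlation")
```

Both return the same observed \(r\); the `infer` version returns it in a tibble so it can flow into the pipelines below.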

Just like the mean — and all other test statistics! — \(r\) is subject to sampling variability. We can indicate our uncertainty around the correlation the same way we always have:

Construct the sampling distribution for the correlation:

```
Response: meanFamiliarity (numeric)
Explanatory: Frequency (numeric)
# A tibble: 6 × 2
replicate stat
<int> <dbl>
1 1 0.444
2 2 0.595
3 3 0.533
4 4 0.565
5 5 0.573
6 6 0.579
```
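The output above could be produced with an `infer` pipeline along these lines (a sketch; the number of reps and the seed are arbitrary choices):

```r
library(languageR)
library(infer)

set.seed(42)  # arbitrary seed, for reproducibility only

# bootstrap the sampling distribution of the correlation
sampling_distribution <- ratings |>
  specify(response = meanFamiliarity, explanatory = Frequency) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "correlation")

head(sampling_distribution)  # one correlation per bootstrap replicate
```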

Compute a confidence interval

*Take a few minutes to try this yourself!*

Use the `infer` way to visualize the sampling distribution and shade the confidence interval we just computed. Change the x-axis label to **stat (correlation)** as pictured below.
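One way to do this, sketched with `infer`’s helpers (the bootstrap pipeline is repeated here so the block is self-contained):

```r
library(languageR)
library(infer)
library(ggplot2)

set.seed(42)  # arbitrary

# sampling distribution of the correlation
sampling_distribution <- ratings |>
  specify(response = meanFamiliarity, explanatory = Frequency) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "correlation")

# 95% confidence interval from the middle of the distribution
ci <- sampling_distribution |>
  get_confidence_interval(level = 0.95, type = "percentile")

# visualize, shade the CI, and relabel the x-axis
sampling_distribution |>
  visualize() +
  shade_confidence_interval(endpoints = ci) +
  labs(x = "stat (correlation)")
```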

How do we test whether the correlation we observed is significantly different from zero? Hypothesis test!

Step 1: Construct the null distribution, the sampling distribution of the null hypothesis

```
Response: meanFamiliarity (numeric)
Explanatory: Frequency (numeric)
Null Hypothesis: independence
# A tibble: 6 × 2
replicate stat
<int> <dbl>
1 1 -0.189
2 2 0.0233
3 3 -0.0523
4 4 -0.0187
5 5 -0.0674
6 6 -0.0242
```

Step 2: How likely is our observed value under the null? Get a p-value.
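A sketch of both steps with `infer` (the observed correlation is computed first so it can be compared to the null distribution):

```r
library(languageR)
library(infer)

set.seed(42)  # arbitrary

# observed correlation
observed_r <- ratings |>
  specify(response = meanFamiliarity, explanatory = Frequency) |>
  calculate(stat = "correlation")

# Step 1: null distribution via permutation under independence
null_distribution <- ratings |>
  specify(response = meanFamiliarity, explanatory = Frequency) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "correlation")

# Step 2: p-value -- how likely is the observed value under the null?
null_distribution |>
  get_p_value(obs_stat = observed_r, direction = "two-sided")
```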


Step 3: Decide whether to reject the null!

Interpret our p-value. Should we reject the null hypothesis?

Big picture overview of the model building process and the types of models we might encounter in our research.

Correlation is a simple case of model building, in which we use one value (\(x\)) to predict another (\(y\)).

Even more specifically — formally, the **model specification** — we are fitting the linear model \(y = ax+b\), where \(a\) and \(b\) are free parameters.

- Model specification: \(y = ax + b\)
- Estimate free parameters: \(a\) and \(b\)
- Fitted model: \(y = 0.39x + 2.02\)

The link between correlation and linear models becomes clear when we normalize our variables with a z-score.

```
Frequency meanFamiliarity z_Freq z_meanFamil
1 4.204693 3.72 -0.4387602 -0.1573220
2 5.347108 3.60 0.4619516 -0.2742310
3 6.304449 5.84 1.2167459 1.9080703
4 3.828641 4.40 -0.7352500 0.5051623
5 3.663562 3.68 -0.8654029 -0.1962917
6 3.433987 4.12 -1.0464062 0.2323747
```

- A z-score gives the number of standard deviations a data point is from the mean.

Correlation is the slope of the line that best predicts \(y\) from \(x\) (after z-scoring)
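A quick check of this claim (a sketch; `scale()` z-scores a vector):

```r
library(languageR)

# z-score both variables
z_freq  <- as.numeric(scale(ratings$Frequency))
z_famil <- as.numeric(scale(ratings$meanFamiliarity))

# slope of the best-fitting line on the z-scored data ...
fit <- lm(z_famil ~ z_freq)
coef(fit)["z_freq"]

# ... matches the correlation
cor(ratings$Frequency, ratings$meanFamiliarity)
```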

- `Model specification` (this week): specify the functional form of the model.
- `Model fitting`: you have the form; how do you estimate the free parameters?
- `Model accuracy`: you’ve estimated the parameters; how well does that model describe your data?
- `Model reliability`: when you estimate the parameters, you want to quantify your uncertainty on your estimates.

*Take a few minutes to try this yourself!*

Ask ChatGPT what type of model it is made with.

**Linear models** are models in which the output (\(y\)) is a weighted sum of the inputs.

- Easy to understand and fit
- \(y=\sum_{i=1}^{n}w_ix_i\)
- \(y = ax + b\) is this!

\(y = ax + b\) *can be expressed as* \(y=\sum_{i=1}^{n}w_ix_i\):

- implicit constant: \(y=ax+b\mathbf{1}\)

- let \(x_1=x\) and \(x_2=\mathbf{1}\)
- we have \(y=ax_1 + bx_2\)
- express \(a\) and \(b\) as weights: \(a=w_1\) and \(b=w_2\)
- \(y=w_1x_1 + w_2x_2\) where \(w_1\) and \(w_2\) are free parameters

In nonlinear models, the output (\(y\)) cannot be expressed as a weighted sum of the inputs (\(y=\sum_{i=1}^{n}w_ix_i\)); the pattern is better captured by more complex functions. (But often we can linearize them!)

*Take a few minutes to try this yourself!*

Load the following data, which shows brain size and body weight for several different animals:

Explore the data to specify the type of model we should use to predict brain size by body weight.

- Supervised or unsupervised?
- Regression or classification?
- Linear or nonlinear?

Recall that `model specification` is one aspect of the model building process in which we select the form of the model (the type of model).

- **Response variable (\(y\)):** Specify the variable you want to predict/explain (output).
- **Explanatory variables (\(x_i\)):** Specify the variables that may explain the variation in the response (inputs).
- **Functional form:** Specify the relationship between the response and explanatory variables. *For linear models, we use the linear model equation!*
- **Model terms:** Specify *how* to include your explanatory variables in the model (since they can be included in more than one way).

The following issues can also be considered part of the model specification process.

- **Model assumptions:** Check any assumptions underlying the model you selected (e.g. does the model assume the relationship is linear?).
- **Model complexity:** Simple models are easier to interpret but may not capture all complexities in the data. Complex models may suffer from overfitting the data or being difficult to interpret.

*A well-specified model should be based on a clear understanding of the data, the underlying relationships, and the research question.*

- Literally specifying the mathematical formula we’re going to use to represent the relationship between our response and explanatory variables.
- We already know it: **linear models** are models in which the response variable (\(y\)) is a weighted sum of the explanatory variables (\(x_i\)): \(y=\sum_{i=1}^{n}w_ix_i\)

The **linear model equation** can be expressed in many ways, but *they are all the same thing*:

- in **high school algebra**: \(y=ax+b\)
- in **machine learning**: \(y = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n\)
- in **statistics**: \(y = β_0 + β_1x_1 + β_2x_2 + ... + β_nx_n + ε\)
- in **matrix** notation: \(y = Xβ + ε\)

To illustrate how this simple equation scales up to complex models, let’s start with a simple case (“toy”, “tractable”).

```
# A tibble: 2 × 2
x y
<dbl> <dbl>
1 1 3
2 2 5
```

Specify our model!

In our simple dataset, we can appreciate that we have a system of equations. We have **two unknowns** (free parameters) and **two data points** — so we have 2 equations and 2 unknowns.

- `c(1, 3)` gives \(w_1 \cdot 1 + w_2 \cdot 1 = 3\)
- `c(2, 5)` gives \(w_1 \cdot 1 + w_2 \cdot 2 = 5\)
- which have a solution: \(w_1 = 1\) and \(w_2 = 2\)

We’ll learn what is going on under the hood of model fitting next week, but for now, we can appreciate that we are solving a system of equations with `lm()`:
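A sketch of fitting the toy dataset with `lm()` (the `1` in the formula is the intercept term):

```r
# toy dataset: two points, two unknowns
toy <- data.frame(x = c(1, 2), y = c(3, 5))

# fit y = w1*1 + w2*x
fit <- lm(y ~ 1 + x, data = toy)
coef(fit)
#> (Intercept)           x
#>           1           2
```

With exactly as many data points as unknowns, `lm()` recovers the exact solution \(w_1 = 1\), \(w_2 = 2\).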

When we have multiple data points, we are essentially solving for the best line (or hyperplane, in higher dimensions) that fits the data.

- For each data point, we create an equation based on the linear model
- Which leads to a system of equations.
- With 2 unknowns and 2 data points, we have 2 equations.

When we have **more equations than unknowns** we cannot solve the system directly (we have an **overdetermined system**), but we can find a solution with linear algebra.

\[\begin{aligned}
\begin{bmatrix}
3 \\
5
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 \\
1 & 2
\end{bmatrix}
\begin{bmatrix}
w_1 \\
w_2
\end{bmatrix}
\end{aligned}\]

- The matrix form lets us appreciate that we can expand this toy example to any number of data points and any number of unknowns.
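With two equations and two unknowns, we can also solve the matrix equation directly in base R (a sketch using `solve()`):

```r
# X w = y, where X has a column of 1s (intercept) and a column of x values
X <- matrix(c(1, 1,
              1, 2), nrow = 2, byrow = TRUE)
y <- c(3, 5)

solve(X, y)  # returns the weights: w1 = 1, w2 = 2
```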

Ask ChatGPT how many parameters it has.

Applied to a more complex problem: the `SwimRecords` dataset. How have world swim records in the 100m changed over time?

Plot the swim records data, then use your model specification worksheet to specify the model.

What is the thing you are trying to understand?

What could **explain** the variation in your response variable?

- Linear model
- \(y=\sum_{i=1}^{n}w_ix_i\)

Model terms describe *how* to include our explanatory variables in our model formula — there is more than one way!

- Intercept
- Main
- Interaction
- Transformation

- in R: `y ~ 1`, in eq: \(y=w_1x_1\)

- in R: `y ~ 1 + year`, in eq: \(y = w_1x_1 + w_2x_2\)

- in R: `y ~ 1 + sex`, in eq: \(y = w_1x_1 + w_2x_2\)

- in R: `y ~ 1 + year + sex`, in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3\)

- in R: `y ~ 1 + year + sex + year:sex`
- or the short way: `y ~ 1 + year * sex`
- in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4\) where \(x_4\) is \(x_2x_3\)

- in R: `y ~ 1 + year * sex + I(year^2)`
- in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5\)
- where \(x_4\) is \(x_2x_3\) and \(x_5\) is \(x_2^2\)
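These model terms can be inspected with `model.matrix()`, which shows exactly which columns each formula creates. A sketch on the `SwimRecords` data (assuming the `mosaicData` package, where the variables are `time`, `year`, and `sex`):

```r
library(mosaicData)  # assumption: SwimRecords comes from this package

# interaction plus a squared (transformed) term
fit <- lm(time ~ 1 + year * sex + I(year^2), data = SwimRecords)

# one column per model term: intercept, year, sex, year^2, year:sex
head(model.matrix(fit))

# one fitted weight per term
coef(fit)
```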