Week 07: Model Specification

Data Science for Studying Language and the Mind

Katie Schuler

2024-10-08

Tuesday

Announcements

  • You did great on the exam!
  • You can replace your lowest exam score with the optional final
  • The final exam is cumulative: another opportunity to show mastery of the material.

Thanks for your feedback

Adding

  1. Demos more accessible
  • Posted before class
  • Make font bigger
  • Not so fast, please 😅
  2. In-class exercises (not graded)
  • Slightly more interactive
  3. Challenge questions
  • On labs or homework (optional)

Not adding

  1. Projects instead of exams
  2. R Studio instead of Google Colab

You are here

Data science with R
  • R basics
  • Data visualization
  • Data wrangling
Stats & Model building
  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Inference for regression
  • Mixed-effects models

Review

Sampling distribution and hypothesis testing with Correlation!

Exploring relationships

To review what we learned before break, let’s explore the relationship between Frequency and meanFamiliarity in the ratings dataset of the languageR package.

Is there a relationship?

If there were no relationship, we'd say they are independent: knowing the value of one provides no information about the other. But that's not the case here.

Yes, a linear relationship

In a linear relationship, when one variable goes up the other goes up (positive); or when one goes up the other goes down (negative).

Quantify with correlation

One way to quantify linear relationships is with correlation (\(r\)). Correlation expresses the linear relationship as a range from -1 (perfectly negative) to 1 (perfectly positive).

Computing correlation in R

We can compute a correlation with R's built-in cor(x, y) function:

library(languageR)  # provides the ratings dataset

cor(
  x = ratings$Frequency, 
  y = ratings$meanFamiliarity
)
[1] 0.4820286

Or via the infer package:

library(infer)  # tidy, pipeable statistical inference

(obs_corr <- ratings %>%
  specify(
    response = meanFamiliarity, 
    explanatory = Frequency
  ) %>%
  calculate(stat = "correlation"))
Response: meanFamiliarity (numeric)
Explanatory: Frequency (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.482

Correlation uncertainty

Just like the mean — and all other test statistics! — \(r\) is subject to sampling variability. We can indicate our uncertainty around the correlation the same way we always have:

Construct the sampling distribution for the correlation:

sampling_distribution <- ratings %>%
  specify(
    response = meanFamiliarity, 
    explanatory = Frequency
  ) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "correlation")

head(sampling_distribution)
Response: meanFamiliarity (numeric)
Explanatory: Frequency (numeric)
# A tibble: 6 × 2
  replicate  stat
      <int> <dbl>
1         1 0.444
2         2 0.595
3         3 0.533
4         4 0.565
5         5 0.573
6         6 0.579

Compute a confidence interval

ci <- sampling_distribution %>% 
  get_ci(level = 0.95, type = "percentile")

ci
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.305    0.635

💪 In-class Exercise 7.1

Take a few minutes to try this yourself!

Use the infer way to visualize the sampling distribution and shade the confidence interval we just computed. Change the x-axis label to stat (correlation).
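One possible solution, sketched with infer's visualize() and shade_ci(), plus a ggplot2 axis label:

sampling_distribution %>%
  visualize() +
  shade_ci(endpoints = ci) +        # shade the 95% CI computed above
  labs(x = "stat (correlation)")    # relabel the x-axis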

Hypothesis testing our correlation

How do we test whether the correlation we observed is significantly different from zero? Hypothesis test!

Step 1: Construct the null distribution, the sampling distribution under the null hypothesis:

null_distribution_corr <- ratings %>% 
  specify(
    response = meanFamiliarity, 
    explanatory = Frequency
  ) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "correlation") 

null_distribution_corr %>% head
Response: meanFamiliarity (numeric)
Explanatory: Frequency (numeric)
Null Hypothesis: independence
# A tibble: 6 × 2
  replicate    stat
      <int>   <dbl>
1         1 -0.189 
2         2  0.0233
3         3 -0.0523
4         4 -0.0187
5         5 -0.0674
6         6 -0.0242

Step 2: How likely is our observed value under the null? Get a p-value.

p <- null_distribution_corr %>%
  get_p_value(
    obs_stat = obs_corr, 
    direction = "both")
p
# A tibble: 1 × 1
  p_value
    <dbl>
1       0

With 1000 permutations, a p-value of exactly 0 just means that none of the permuted correlations were as extreme as the observed one; report it as p < 0.001 rather than p = 0.

Hypothesis testing our correlation

Step 3: Decide whether to reject the null!

null_distribution_corr %>% 
  visualize() + 
  shade_p_value(
    obs_stat = obs_corr, 
    direction = "two-sided"
  ) 

Interpret our p-value. Should we reject the null hypothesis?

Model building

Big picture overview of the model building process and the types of models we might encounter in our research.

Correlation is model building

Correlation is a simple case of model building, in which we use one value (\(x\)) to predict another (\(y\)).

Correlation is model building

Even more specifically — formally, the model specification — we are fitting the linear model \(y = ax+b\), where \(a\) and \(b\) are free parameters.

  • Model specification: \(y = ax + b\)
  • Estimate free parameters: \(a\) and \(b\)
  • Fitted model: \(y = 0.39x + 2.02\)

How do we get \(r = 0.48\) ?

The link between correlation and linear models becomes clear when we normalize our variables with a z-score.

z_ratings <- ratings %>%
  select(Frequency, meanFamiliarity) %>%
  mutate(
    z_Freq = scale(Frequency), 
    z_meanFamil = scale(meanFamiliarity)
  )

z_ratings %>% head
  Frequency meanFamiliarity     z_Freq z_meanFamil
1  4.204693            3.72 -0.4387602  -0.1573220
2  5.347108            3.60  0.4619516  -0.2742310
3  6.304449            5.84  1.2167459   1.9080703
4  3.828641            4.40 -0.7352500   0.5051623
5  3.663562            3.68 -0.8654029  -0.1962917
6  3.433987            4.12 -1.0464062   0.2323747
  • A z-score gives the number of standard deviations a data point is from the mean.

Correlation is the slope of the model

Correlation is the slope of the line that best predicts \(y\) from \(x\) (after z-scoring).
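A quick way to see this in R, sketched with the z-scored data from the previous slide; the fitted slope matches \(r \approx 0.48\):

lm(z_meanFamil ~ z_Freq, data = z_ratings)  # slope = 0.48 = cor(x, y)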

Model building overview

  • Model specification (this week): specify the functional form of the model.
  • Model fitting: given the form, how do we estimate the free parameters?
  • Model accuracy: once we've estimated the parameters, how well does the model describe the data?
  • Model reliability: how much uncertainty is there around our parameter estimates?

Types of models

models
  ├── supervised learning
  └── unsupervised learning

💪 In-class Exercise 7.2

Take a few minutes to try this yourself!

Ask ChatGPT what type of model it is built with.

Supervised learning

[Diagram: inputs x1 through x5 feed into the Model, which produces a prediction y]

Types of models

models
  ├── supervised learning
  │     ├── regression
  │     └── classification
  └── unsupervised learning

Regression vs. classification

[Diagram: the Model's output y determines the task: regression when y is continuous (numeric values), classification when y is discrete (e.g., yes/no, male/female/nonbinary)]

Types of models

models
  ├── supervised learning
  │     ├── regression
  │     │     ├── linear models
  │     │     └── nonlinear models
  │     └── classification
  └── unsupervised learning

Linear models

  • Linear models are models in which the output (y) is a weighted sum of the inputs
  • Easy to understand and fit
  • \(y=\sum_{i=1}^{n}w_ix_i\)
  • \(y = ax + b\) is this!

Linear model equation

\(y = ax + b\) can be expressed \(y=\sum_{i=1}^{n}w_ix_i\)

  • implicit constant: \(y=ax+b\mathbf{1}\)
  • let \(x_1=x\) and \(x_2=\mathbf{1}\)
  • we have \(y=ax_1 + bx_2\)
  • express \(a\) and \(b\) as weights: \(a=w_1\) and \(b=w_2\)
  • \(y=w_1x_1 + w_2x_2\) where \(w_1\) and \(w_2\) are free parameters
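In R, model.matrix() makes this rewrite visible: the design matrix has a column of 1s standing in for \(x_2 = \mathbf{1}\). A minimal sketch, assuming a small made-up data frame:

d <- data.frame(x = c(1, 2, 3), y = c(3, 5, 7))
model.matrix(y ~ x, data = d)  # first column: all 1s (the intercept); second column: x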

Types of models

models
  ├── supervised learning
  │     ├── regression
  │     │     ├── linear models
  │     │     └── nonlinear models
  │     └── classification
  └── unsupervised learning

Nonlinear models

The output (\(y\)) cannot be expressed as a weighted sum of the inputs (\(y=\sum_{i=1}^{n}w_ix_i\)); the pattern is better captured by more complex functions. (But often we can linearize them!)
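For example, the power law \(y = ax^b\) is nonlinear in \(x\), but taking logs gives \(\log y = \log a + b \log x\), which is a weighted sum again. A sketch, assuming a hypothetical data frame animals with columns brain and body:

lm(log(brain) ~ log(body), data = animals)  # linear on the log-log scale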

💪 In-class Exercise 7.3

Take a few minutes to try this yourself!

Load the following data, which shows brain size and body weight for several different animals:

Explore the data to specify the type of model we should use to predict brain size by body weight.

  • Supervised or unsupervised?
  • Regression or classification?
  • Linear or nonlinear?

Thursday

Model specification

Recall that model specification is one aspect of the model building process in which we select the form of the model (the type of model):

  1. Response variable (\(y\)): Specify the variable you want to predict/explain (output).
  2. Explanatory variables (\(x_i\)): Specify the variables that may explain the variation in the response (inputs).
  3. Functional form: Specify the relationship between the response and explanatory variables. For linear models, we use the linear model equation!
  4. Model terms: Specify how to include your explanatory variables in the model (since they can be included in more than one way).

Model specification

The following issues can also be considered part of the model specification process.

  • Model assumptions: Check any assumptions underlying the model you selected (e.g. does the model assume the relationship is linear?).
  • Model complexity: Simple models are easier to interpret but may not capture all complexities in the data. Complex models may suffer from overfitting the data or being difficult to interpret.

A well-specified model should be based on a clear understanding of the data, the underlying relationships, and the research question.

Specifying the functional form

  • Literally specifying the mathematical formula we’re going to use to represent the relationship between our response and explanatory variables.
  • We already know it: linear models are models in which the response variable (\(y\)) is a weighted sum of the explanatory variables (\(x_i\))
  • \(y=\sum_{i=1}^{n}w_ix_i\)

🥸 Aliases of \(y=\sum_{i=1}^{n}w_ix_i\)

The linear model equation can be expressed in many ways, but they are all the same thing:

  1. in high school algebra: \(y=ax+b\).
  2. in machine learning: \(y = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n\)
  3. in statistics: \(y = β_0 + β_1x_1 + β_2x_2 + ... + β_nx_n + ε\)
  4. in matrix notation: \(y = Xβ + ε\)

Specify our first model

To illustrate how this simple equation scales up to complex models, let's start with a simple, tractable "toy" case.
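A minimal sketch of how such a dataset could be constructed (the name toy matches the lm() call later in this deck):

toy <- tibble(x = c(1, 2), y = c(3, 5))
toy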

# A tibble: 2 × 2
      x     y
  <dbl> <dbl>
1     1     3
2     2     5

💪 In class exercise 7.4

Specify our model!

[Diagram: inputs x1 through x5 feed into the Linear model, which outputs y]

System of equations

In our simple dataset, we can see that we have a system of equations: two unknowns (free parameters) and two data points.

  • so we have 2 equations, 2 unknowns (with \(x_1 = \mathbf{1}\), the implicit constant, and \(x_2 = x\)):
    • c(1, 3) -> \(w_1 \cdot 1 + w_2 \cdot 1 = 3\)
    • c(2, 5) -> \(w_1 \cdot 1 + w_2 \cdot 2 = 5\)
  • which have a solution:
    • \(w_1 = 1\) and \(w_2 = 2\)

🪄 Fit our model to our data

We’ll learn what is going on under the hood of model fitting next week, but for now, we can appreciate that we are solving a system of equations:

with lm():

lm(y ~ x, data = toy)

Call:
lm(formula = y ~ x, data = toy)

Coefficients:
(Intercept)            x  
          1            2  

with infer:

toy %>%
  specify(response = y, explanatory = x) %>%
  fit()
# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept     1.00
2 x             2   

Many equations, many unknowns

When we have multiple data points, we are essentially solving for the best line (or hyperplane, in higher dimensions) that fits the data.

  • For each data point, we create an equation based on the linear model
  • Which leads to a system of equations.
  • With 2 unknowns and 2 data points, we have 2 equations.

Enter the matrix \(y = Xβ + ε\)

When we have more equations than unknowns, we cannot solve the system directly (the system is overdetermined), but we can find a solution with linear algebra.

\[\begin{aligned} \begin{bmatrix} 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \end{aligned}\]
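In R, this exactly determined system can be solved directly (a sketch):

X <- matrix(c(1, 1, 1, 2), nrow = 2)  # columns: the constant (1s) and x
y <- c(3, 5)
solve(X, y)                           # w_1 = 1, w_2 = 2, matching lm()
# With more equations than unknowns, qr.solve(X, y) gives the least-squares solution.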

We can have super complex models

  • The matrix form lets us appreciate that we can expand this toy example to any number of data points and any number of unknowns.

💪 In class exercise 7.5

Ask ChatGPT how many parameters it has.

Swim Records

Applied to a more complex problem

Specify a model for SwimRecords

How have world swim records in the 100m changed over time?

library(mosaic)
glimpse(SwimRecords)
Rows: 62
Columns: 3
$ year <int> 1905, 1908, 1910, 1912, 1918, 1920, 1922, 1924, 1934, 1935, 1936,…
$ time <dbl> 65.80, 65.60, 62.80, 61.60, 61.40, 60.40, 58.60, 57.40, 56.80, 56…
$ sex  <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M,…

💪 In class exercise 7.6

Plot the swim records data, then use your model specification worksheet to specify the model.
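A possible starting point for the plot (a sketch with ggplot2):

ggplot(SwimRecords, aes(x = year, y = time, color = sex)) +
  geom_point()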

Response variable \(y\)

What is the thing you are trying to understand?

Explanatory variable(s) \(x_i\)

What could explain the variation in your response variable?

Functional form

  • Linear model
  • \(y=\sum_{i=1}^{n}w_ix_i\)

Model terms

Model terms describe how to include our explanatory variables in our model formula — there is more than one way!

  1. Intercept
  2. Main
  3. Interaction
  4. Transformation

Intercept

  • in R: y ~ 1, in eq: \(y=w_1x_1\) (where \(x_1\) is the constant 1)
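A quick check (sketch): an intercept-only model estimates a single weight, which turns out to be the mean of the response.

lm(time ~ 1, data = SwimRecords)  # coefficient equals mean(SwimRecords$time)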

Main

  • in R: y ~ 1 + year, in eq: \(y = w_1x_1 + w_2x_2\)

  • in R: y ~ 1 + sex, in eq: \(y = w_1x_1 + w_2x_2\)

  • in R: y ~ 1 + year + sex, in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3\)

Interaction

  • in R: y ~ 1 + year + sex + year:sex
    • or the short way: y ~ 1 + year * sex
  • in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4\) where \(x_4\) is \(x_2x_3\)

Transformation

  • in R: y ~ 1 + year * sex + I(year^2)
  • in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5\)
    • where \(x_4\) is \(x_2x_3\) and \(x_5\) is \(x_2^2\)
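Putting the pieces together, the full specification can be fit in a single call (a sketch):

lm(time ~ 1 + year * sex + I(year^2), data = SwimRecords)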