Week 07: Model Specification

Data Science for Studying Language and the Mind

Katie Schuler

2024-10-08

Tuesday

Announcements

  • You did great on the exam!
  • You can replace your lowest exam score with the optional final
  • The final exam is cumulative: another opportunity to show mastery of the material.

Thanks for your feedback

Adding

  1. Demos more accessible
  • Posted before class
  • Make font bigger
  • Not so fast, please 😅
  2. In-class exercises (not graded)
  • Slightly more interactive
  3. Challenge questions
  • On labs or homework (optional)

Not adding

  1. Projects instead of exams
  2. R Studio instead of Google Colab

You are here

Data science with R
  • R basics
  • Data visualization
  • Data wrangling
Stats & Model building
  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Inference for regression
  • Mixed-effects models

Review

Sampling distribution and hypothesis testing with Correlation!

Exploring relationships

To review what we learned before break, let’s explore the relationship between Frequency and meanFamiliarity in the ratings dataset of the languageR package.

Is there a relationship?

If there were no relationship, we'd say they are independent: knowing the value of one provides no information about the other. But that's not the case here.

Yes, a linear relationship

In a linear relationship, when one variable goes up the other goes up (positive); or when one goes up the other goes down (negative).

Quantify with correlation

One way to quantify linear relationships is with correlation (\(r\)). Correlation expresses the linear relationship as a range from -1 (perfectly negative) to 1 (perfectly positive).

Computing correlation in R

We can compute a correlation with R's built-in cor(x, y) function:

library(languageR)  # provides the ratings dataset

cor(
  x = ratings$Frequency, 
  y = ratings$meanFamiliarity
)
[1] 0.4820286

Or via the infer package:

library(infer)  # tidy, pipeable statistical inference

(obs_corr <- ratings %>%
  specify(
    response = meanFamiliarity, 
    explanatory = Frequency
  ) %>%
  calculate(stat = "correlation"))
Response: meanFamiliarity (numeric)
Explanatory: Frequency (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1 0.482

Correlation uncertainty

Just like the mean — and all other test statistics! — \(r\) is subject to sampling variability. We can indicate our uncertainty around the correlation the same way we always have:

Construct the sampling distribution for the correlation:

sampling_distribution <- ratings %>%
  specify(
    response = meanFamiliarity, 
    explanatory = Frequency
  ) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "correlation")

head(sampling_distribution)
Response: meanFamiliarity (numeric)
Explanatory: Frequency (numeric)
# A tibble: 6 × 2
  replicate  stat
      <int> <dbl>
1         1 0.444
2         2 0.595
3         3 0.533
4         4 0.565
5         5 0.573
6         6 0.579

Compute a confidence interval

ci <- sampling_distribution %>% 
  get_ci(level = 0.95, type = "percentile")

ci
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.305    0.635

💪 In-class Exercise 7.1

Take a few minutes to try this yourself!

Use the infer way to visualize the sampling distribution and shade the confidence interval we just computed. Change the x-axis label to stat (correlation).
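One possible solution, sketched with infer's visualize() and shade_ci(), plus a ggplot2 axis label:

sampling_distribution %>%
  visualize() +
  shade_ci(endpoints = ci) +        # shade the 95% CI computed above
  labs(x = "stat (correlation)")    # relabel the x-axis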

Hypothesis testing our correlation

How do we test whether the correlation we observed is significantly different from zero? Hypothesis test!

Step 1: Construct the null distribution, the sampling distribution under the null hypothesis:

null_distribution_corr <- ratings %>% 
  specify(
    response = meanFamiliarity, 
    explanatory = Frequency
  ) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "correlation") 

null_distribution_corr %>% head
Response: meanFamiliarity (numeric)
Explanatory: Frequency (numeric)
Null Hypothesis: independence
# A tibble: 6 × 2
  replicate    stat
      <int>   <dbl>
1         1 -0.189 
2         2  0.0233
3         3 -0.0523
4         4 -0.0187
5         5 -0.0674
6         6 -0.0242

Step 2: How likely is our observed value under the null? Get a p-value.

p <- null_distribution_corr %>%
  get_p_value(
    obs_stat = obs_corr, 
    direction = "both")
p
# A tibble: 1 × 1
  p_value
    <dbl>
1       0

With 1000 permutations, a p-value of exactly 0 just means that none of the permuted correlations were as extreme as the observed one; report it as p < 0.001 rather than p = 0.

Hypothesis testing our correlation

Step 3: Decide whether to reject the null!

null_distribution_corr %>% 
  visualize() + 
  shade_p_value(
    obs_stat = obs_corr, 
    direction = "two-sided"
  ) 

Interpret our p-value. Should we reject the null hypothesis?

Model building

Big picture overview of the model building process and the types of models we might encounter in our research.

Correlation is model building

Correlation is a simple case of model building, in which we use one value (\(x\)) to predict another (\(y\)).

Correlation is model building

Even more specifically — formally, the model specification — we are fitting the linear model \(y = ax+b\), where \(a\) and \(b\) are free parameters.

  • Model specification: \(y = ax + b\)
  • Estimate free parameters: \(a\) and \(b\)
  • Fitted model: \(y = 0.39x + 2.02\)

How do we get \(r = 0.48\) ?

The link between correlation and linear models becomes clear when we normalize our variables with a z-score.

z_ratings <- ratings %>%
  select(Frequency, meanFamiliarity) %>%
  mutate(
    z_Freq = scale(Frequency), 
    z_meanFamil = scale(meanFamiliarity)
  )

z_ratings %>% head
  Frequency meanFamiliarity     z_Freq z_meanFamil
1  4.204693            3.72 -0.4387602  -0.1573220
2  5.347108            3.60  0.4619516  -0.2742310
3  6.304449            5.84  1.2167459   1.9080703
4  3.828641            4.40 -0.7352500   0.5051623
5  3.663562            3.68 -0.8654029  -0.1962917
6  3.433987            4.12 -1.0464062   0.2323747
  • A z-score gives the number of standard deviations a data point is from the mean.

Correlation is the slope of the model

Correlation is the slope of the line that best predicts \(y\) from \(x\) (after z-scoring).
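A quick way to see this in R, sketched with the z-scored data from the previous slide; the fitted slope matches \(r \approx 0.48\):

lm(z_meanFamil ~ z_Freq, data = z_ratings)  # slope = 0.48 = cor(x, y)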

Model building overview

  • Model specification (this week): specify the functional form of the model.
  • Model fitting: given the form, how do we estimate the free parameters?
  • Model accuracy: once we've estimated the parameters, how well does the model describe the data?
  • Model reliability: how much uncertainty is there around our parameter estimates?

Types of models

models
  ├── supervised learning
  └── unsupervised learning

💪 In-class Exercise 7.2

Take a few minutes to try this yourself!

Ask ChatGPT what type of model it is built with.

Supervised learning

[Diagram: inputs x1 through x5 feed into the Model, which produces a prediction y]

Types of models

models
  ├── supervised learning
  │     ├── regression
  │     └── classification
  └── unsupervised learning

Regression vs. classification

[Diagram: the Model's output y determines the task: regression when y is continuous (numeric values), classification when y is discrete (e.g., yes/no, male/female/nonbinary)]

Types of models

models
  ├── supervised learning
  │     ├── regression
  │     │     ├── linear models
  │     │     └── nonlinear models
  │     └── classification
  └── unsupervised learning

Linear models

  • Linear models are models in which the output (y) is a weighted sum of the inputs
  • Easy to understand and fit
  • \(y=\sum_{i=1}^{n}w_ix_i\)
  • \(y = ax + b\) is this!

Linear model equation

\(y = ax + b\) can be expressed \(y=\sum_{i=1}^{n}w_ix_i\)

  • implicit constant: \(y=ax+b\mathbf{1}\)
  • let \(x_1=x\) and \(x_2=\mathbf{1}\)
  • we have \(y=ax_1 + bx_2\)
  • express \(a\) and \(b\) as weights: \(a=w_1\) and \(b=w_2\)
  • \(y=w_1x_1 + w_2x_2\) where \(w_1\) and \(w_2\) are free parameters
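In R, model.matrix() makes this rewrite visible: the design matrix has a column of 1s standing in for \(x_2 = \mathbf{1}\). A minimal sketch, assuming a small made-up data frame:

d <- data.frame(x = c(1, 2, 3), y = c(3, 5, 7))
model.matrix(y ~ x, data = d)  # first column: all 1s (the intercept); second column: x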

Types of models

models
  ├── supervised learning
  │     ├── regression
  │     │     ├── linear models
  │     │     └── nonlinear models
  │     └── classification
  └── unsupervised learning

Nonlinear models

The output (\(y\)) cannot be expressed as a weighted sum of the inputs (\(y=\sum_{i=1}^{n}w_ix_i\)); the pattern is better captured by more complex functions. (But often we can linearize them!)
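For example, the power law \(y = ax^b\) is nonlinear in \(x\), but taking logs gives \(\log y = \log a + b \log x\), which is a weighted sum again. A sketch, assuming a hypothetical data frame animals with columns brain and body:

lm(log(brain) ~ log(body), data = animals)  # linear on the log-log scale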

💪 In-class Exercise 7.3

Take a few minutes to try this yourself!

Load the following data, which shows brain size and body weight for several different animals:

Explore the data to specify the type of model we should use to predict brain size by body weight.

  • Supervised or unsupervised?
  • Regression or classification?
  • Linear or nonlinear?

Thursday

Model specification

Recall that model specification is one aspect of the model building process in which we select the form of the model (the type of model):

  1. Response variable (\(y\)): Specify the variable you want to predict/explain (output).
  2. Explanatory variables (\(x_i\)): Specify the variables that may explain the variation in the response (inputs).
  3. Functional form: Specify the relationship between the response and explanatory variables. For linear models, we use the linear model equation!
  4. Model terms: Specify how to include your explanatory variables in the model (since they can be included in more than one way).

Model specification

The following issues can also be considered part of the model specification process.

  • Model assumptions: Check any assumptions underlying the model you selected (e.g. does the model assume the relationship is linear?).
  • Model complexity: Simple models are easier to interpret but may not capture all complexities in the data. Complex models may suffer from overfitting the data or being difficult to interpret.

A well-specified model should be based on a clear understanding of the data, the underlying relationships, and the research question.

Specifying the functional form

  • Literally specifying the mathematical formula we’re going to use to represent the relationship between our response and explanatory variables.
  • We already know it: linear models are models in which the response variable (\(y\)) is a weighted sum of the explanatory variables (\(x_i\))
  • \(y=\sum_{i=1}^{n}w_ix_i\)

🥸 Aliases of \(y=\sum_{i=1}^{n}w_ix_i\)

The linear model equation can be expressed in many ways, but they are all the same thing:

  1. in high school algebra: \(y=ax+b\).
  2. in machine learning: \(y = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n\)
  3. in statistics: \(y = β_0 + β_1x_1 + β_2x_2 + ... + β_nx_n + ε\)
  4. in matrix notation: \(y = Xβ + ε\)

Specify our first model

To illustrate how this simple equation scales up to complex models, let's start with a simple, tractable "toy" case.
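A minimal sketch of how such a dataset could be constructed (the name toy matches the lm() call later in this deck):

toy <- tibble(x = c(1, 2), y = c(3, 5))
toy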

# A tibble: 2 × 2
      x     y
  <dbl> <dbl>
1     1     3
2     2     5

💪 In class exercise 7.4

Specify our model!

[Diagram: inputs x1 through x5 feed into the Linear model, which outputs y]

System of equations

In our simple dataset, we can see that we have a system of equations: two unknowns (free parameters) and two data points.

  • so we have 2 equations, 2 unknowns (with \(x_1 = \mathbf{1}\), the implicit constant, and \(x_2 = x\)):
    • c(1, 3) -> \(w_1 \cdot 1 + w_2 \cdot 1 = 3\)
    • c(2, 5) -> \(w_1 \cdot 1 + w_2 \cdot 2 = 5\)
  • which have a solution:
    • \(w_1 = 1\) and \(w_2 = 2\)

🪄 Fit our model to our data

We’ll learn what is going on under the hood of model fitting next week, but for now, we can appreciate that we are solving a system of equations:

with lm():

lm(y ~ x, data = toy)

Call:
lm(formula = y ~ x, data = toy)

Coefficients:
(Intercept)            x  
          1            2  

with infer:

toy %>%
  specify(response = y, explanatory = x) %>%
  fit()
# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept     1.00
2 x             2   

Many equations, many unknowns

When we have multiple data points, we are essentially solving for the best line (or hyperplane, in higher dimensions) that fits the data.

  • For each data point, we create an equation based on the linear model
  • Which leads to a system of equations.
  • With 2 unknowns and 2 data points, we have 2 equations.

Enter the matrix \(y = Xβ + ε\)

When we have more equations than unknowns, we cannot solve the system directly (the system is overdetermined), but we can find a solution with linear algebra.

\[\begin{aligned} \begin{bmatrix} 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \end{aligned}\]
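In R, this exactly determined system can be solved directly (a sketch):

X <- matrix(c(1, 1, 1, 2), nrow = 2)  # columns: the constant (1s) and x
y <- c(3, 5)
solve(X, y)                           # w_1 = 1, w_2 = 2, matching lm()
# With more equations than unknowns, qr.solve(X, y) gives the least-squares solution.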

We can have super complex models

  • The matrix form lets us appreciate that we can expand this toy example to any number of data points and any number of unknowns.

💪 In class exercise 7.5

Ask ChatGPT how many parameters it has.

Swim Records

Applied to a more complex problem

Specify a model for SwimRecords

How have world swim records in the 100m changed over time?

library(mosaic)
glimpse(SwimRecords)
Rows: 62
Columns: 3
$ year <int> 1905, 1908, 1910, 1912, 1918, 1920, 1922, 1924, 1934, 1935, 1936,…
$ time <dbl> 65.80, 65.60, 62.80, 61.60, 61.40, 60.40, 58.60, 57.40, 56.80, 56…
$ sex  <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M,…

💪 In class exercise 7.6

Plot the swim records data, then use your model specification worksheet to specify the model.
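A possible starting point for the plot (a sketch with ggplot2):

ggplot(SwimRecords, aes(x = year, y = time, color = sex)) +
  geom_point()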

Response variable \(y\)

What is the thing you are trying to understand?

Explanatory variable(s) \(x_i\)

What could explain the variation in your response variable?

Functional form

  • Linear model
  • \(y=\sum_{i=1}^{n}w_ix_i\)

Model terms

Model terms describe how to include our explanatory variables in our model formula — there is more than one way!

  1. Intercept
  2. Main
  3. Interaction
  4. Transformation

Intercept

  • in R: y ~ 1, in eq: \(y=w_1x_1\) (where \(x_1\) is the constant 1)
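A quick check (sketch): an intercept-only model estimates a single weight, which turns out to be the mean of the response.

lm(time ~ 1, data = SwimRecords)  # coefficient equals mean(SwimRecords$time)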

Main

  • in R: y ~ 1 + year, in eq: \(y = w_1x_1 + w_2x_2\)

  • in R: y ~ 1 + sex, in eq: \(y = w_1x_1 + w_2x_2\)

  • in R: y ~ 1 + year + sex, in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3\)

Interaction

  • in R: y ~ 1 + year + sex + year:sex
    • or the short way: y ~ 1 + year * sex
  • in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4\) where \(x_4\) is \(x_2x_3\)

Transformation

  • in R: y ~ 1 + year * sex + I(year^2)
  • in eq: \(y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5\)
    • where \(x_4\) is \(x_2x_3\) and \(x_5\) is \(x_2^2\)
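Putting the pieces together, the full specification can be fit in a single call (a sketch):

lm(time ~ 1 + year * sex + I(year^2), data = SwimRecords)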