Model specification

Data Science for Studying Language and the Mind

Katie Schuler

2023-10-03

You are here

Data science with R
  • Hello, world!
  • R basics
  • Data importing
  • Data visualization
  • Data wrangling
Stats & Model building
  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Feature engineering (preprocessing)
  • Inference for regression
  • Mixed-effect models

We’ve been modeling!

  • The black dots are our data
  • y is the response variable
  • x is the explanatory variable

The red line is our model (a linear model)

  • Model specification: \(y=ax+b\)
  • Fit the model: estimate free parameters \(a\) and \(b\)
  • Fitted model: \(y=0.39x+2.02\)
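
The three steps above can be sketched in R. This is a minimal illustration, not the data behind the plot: the `toy` data frame is hypothetical, and fitting with base R's `lm()` (least squares) is an assumption about how the parameters are estimated.

```r
# Hypothetical toy data (NOT the data plotted on the slide)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.5, 2.7, 3.3, 3.5, 4.0)
)

# Model specification: y = a*x + b, written as an R formula
spec <- y ~ 1 + x

# Model fitting: lm() estimates the free parameters by least squares
fit <- lm(spec, data = toy)

# Fitted model: pull out the estimated intercept (b) and slope (a)
b_hat <- coef(fit)[["(Intercept)"]]
a_hat <- coef(fit)[["x"]]
cat(sprintf("y = %.2fx + %.2f\n", a_hat, b_hat))
```

The formula only states the form of the model; the numbers appear once `lm()` has been run on data.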

Model building overview

  • Model specification (this week): what is the form?
  • Model fitting (this week): you have the form, how do you guess the free parameters?
  • Model accuracy (after break): you’ve estimated the parameters, how well does that model describe your data?
  • Model reliability (after break): when you estimate the parameters, there is some uncertainty in those estimates

Types of models

Supervised learning

Regression v classification

Linear v nonlinear

Model building overview refresh

  • Model specification (this week): what is the form?
  • Model fitting (this week): you have the form, how do you guess the free parameters?
  • Model accuracy (after break): you’ve estimated the parameters, how well does that model describe your data?
  • Model reliability (after break): when you estimate the parameters, there is some uncertainty in those estimates

Model specification

Today’s dataset: swim records

How have world swim records in the 100m changed over the years?

library(mosaic)
glimpse(SwimRecords)
Rows: 62
Columns: 3
$ year <int> 1905, 1908, 1910, 1912, 1918, 1920, 1922, 1924, 1934, 1935, 1936,…
$ time <dbl> 65.80, 65.60, 62.80, 61.60, 61.40, 60.40, 58.60, 57.40, 56.80, 56…
$ sex  <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M,…

Response variable \(y\)

What is the thing you are trying to understand?

Explanatory variable(s) \(x_n\)

What could explain the variation in your response variable?

Functional form

  • Model specification: \(y=ax+b\)
  • Fit the model: estimate free parameters \(a\) and \(b\)
  • Fitted model: \(y=0.39x+2.02\)

  • high school algebra: \(y=ax+b\)
  • machine learning: \(y = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n\)
  • statistics: \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \varepsilon\)
  • matrix: \(y = X\beta + \varepsilon\)
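
To see that the sum notation and the matrix notation describe the same model, here is a small sketch in base R. The data and the parameter values in `beta` are hypothetical, and the noise term \(\varepsilon\) is left out so the two forms can be compared exactly.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  x1 = c(1, 2, 3, 4),
  x2 = c(0, 1, 0, 1)
)

# Hypothetical parameter values beta_0, beta_1, beta_2
beta <- c(2, 0.5, -1)

# Design matrix X: a column of 1s for the intercept, then x1 and x2
X <- model.matrix(~ 1 + x1 + x2, data = toy)

# Matrix form: X %*% beta (epsilon omitted)
y_matrix <- as.vector(X %*% beta)

# Sum form: beta_0 + beta_1 * x1 + beta_2 * x2
y_sum <- beta[1] + beta[2] * toy$x1 + beta[3] * toy$x2

# Both forms give the same predicted values
all.equal(y_matrix, y_sum)
```

The column of 1s in `X` is what lets \(\beta_0\) be written inside the matrix product.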

Model terms

Model terms describe how to include our explanatory variables in our model formula. There is more than one way!

  1. Intercept
  2. Main
  3. Interaction
  4. Transformation

Intercept

  • in R: y ~ 1
  • in eq: \(y=\beta_0 + \varepsilon\)
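
A quick sketch of what the intercept-only model does, assuming it is fit by least squares with base R's `lm()`: the single parameter \(\beta_0\) ends up equal to the mean of the response. The values in `toy` are hypothetical.

```r
# Hypothetical response values, for illustration only
toy <- data.frame(y = c(10, 12, 11, 15))

# Intercept-only model: no explanatory variables
fit <- lm(y ~ 1, data = toy)

# The one estimated parameter (beta_0) matches the mean of y
beta0_hat <- coef(fit)[["(Intercept)"]]
all.equal(beta0_hat, mean(toy$y))
```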

Main

  • in R: y ~ 1 + year + gender
  • in eq: \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon\)
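
One way to see what the main terms add is to look at the design matrix R builds from the formula. This is a sketch on hypothetical data: the column names `year`, `sex`, and `time` follow the SwimRecords glimpse, with `sex` assumed to play the role of `gender` in the formula above, and R's default treatment coding for factors is assumed.

```r
# Hypothetical toy data using the SwimRecords column names
toy <- data.frame(
  year = c(1950, 1960, 1970, 1950, 1960, 1970),
  sex  = factor(c("F", "F", "F", "M", "M", "M")),
  time = c(64, 61, 58, 56, 54, 52)
)

# Main terms only: an intercept plus one column per explanatory variable
X <- model.matrix(time ~ 1 + year + sex, data = toy)

# With default coding, the factor sex becomes a 0/1 indicator column
colnames(X)

# One free parameter per column of the design matrix
fit <- lm(time ~ 1 + year + sex, data = toy)
coef(fit)
```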

Interaction

  • in R: y ~ 1 + year + gender + year:gender
    • or the short way: y ~ 1 + year * gender
  • in eq: \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \varepsilon\)
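
The long and short ways of writing the interaction can be checked against each other. This is a sketch on hypothetical data: the column names follow the SwimRecords glimpse, with `sex` assumed to stand in for `gender`, and R's default factor coding is assumed.

```r
# Hypothetical toy data using the SwimRecords column names
toy <- data.frame(
  year = c(1950, 1960, 1970, 1950, 1960, 1970),
  sex  = factor(c("F", "F", "F", "M", "M", "M")),
  time = c(64, 61, 58, 56, 54, 52)
)

# Long way and short way of writing the same specification
X_long  <- model.matrix(time ~ 1 + year + sex + year:sex, data = toy)
X_short <- model.matrix(time ~ 1 + year * sex, data = toy)

# Both formulas produce the same design matrix columns
identical(colnames(X_long), colnames(X_short))
colnames(X_short)

# The interaction column is the product x1 * x2 from the equation
all.equal(X_short[, "year:sexM"], X_short[, "year"] * X_short[, "sexM"])
```

In R's formula syntax, `*` expands to both main terms plus their interaction, while `:` adds the interaction term alone.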

Transformation