# Model specification

Data Science for Studying Language and the Mind

Katie Schuler

2023-10-03

## You are here

##### Data science with R
• Hello, world!
• R basics
• Data importing
• Data visualization
• Data wrangling
##### Stats & Model building
• Sampling distribution
• Hypothesis testing
• Model specification
• Model fitting
• Model accuracy
• Model reliability
• Classification
• Feature engineering (preprocessing)
• Inference for regression
• Mixed-effect models

## We’ve been modeling!

• The black dots are our data
• y is the response variable
• x is the explanatory variable

## We’ve been modeling!

The red line is our model (a linear model)

## We’ve been modeling!

• Model specification: $y=ax+b$
• Fit the model: estimate free parameters $a$ and $b$
• Fitted model: $y=0.39x+2.02$

## Model building overview

• Model specification (this week): what is the form?
• Model fitting (this week): you have the form; how do you estimate the free parameters?
• Model accuracy (after break): you’ve estimated the parameters; how well does that model describe your data?
• Model reliability (after break): when you estimate the parameters, the estimates carry some uncertainty

## Model building overview refresh

• Model specification (this week): what is the form?
• Model fitting (this week): you have the form; how do you estimate the free parameters?
• Model accuracy (after break): you’ve estimated the parameters; how well does that model describe your data?
• Model reliability (after break): when you estimate the parameters, the estimates carry some uncertainty

## Today’s dataset: swim records

How have world swim records in the 100m changed over the years?

library(mosaic)
glimpse(SwimRecords)
Rows: 62
Columns: 3
\$ year <int> 1905, 1908, 1910, 1912, 1918, 1920, 1922, 1924, 1934, 1935, 1936,…
\$ time <dbl> 65.80, 65.60, 62.80, 61.60, 61.40, 60.40, 58.60, 57.40, 56.80, 56…
\$ sex  <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M,…

## Response variable $y$

What is the thing you are trying to understand?

## Explanatory variable(s) $x_n$

What could explain the variation in your response variable?

## Functional form

• Model specification: $y=ax+b$
• Fit the model: estimate free parameters $a$ and $b$
• Fitted model: $y=0.39x+2.02$

## Functional form

• high school algebra: $y=ax+b$
• machine learning: $y = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n$
• statistics: $y = β_0 + β_1x_1 + β_2x_2 + ... + β_nx_n + ε$
• matrix: $y = Xβ + ε$
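
To see that these notations describe the same computation, here is a minimal sketch in base R. It reuses the fitted coefficients from the earlier slide ($y=0.39x+2.02$); the `x` values are hypothetical and chosen only for illustration, and the error term $ε$ is left out because we are computing predictions.

```r
# Hypothetical x values, chosen only for illustration
x <- c(1, 2, 3, 4)

# Coefficients from the fitted model on the earlier slide: y = 0.39x + 2.02
beta <- c(2.02, 0.39)  # beta_0 (intercept), beta_1 (slope)

# Design matrix X: a column of 1s for the intercept, then the x values
X <- cbind(1, x)

# Matrix form: predictions are X times beta
y_matrix <- as.vector(X %*% beta)

# Sum form: beta_0 + beta_1 * x
y_sum <- beta[1] + beta[2] * x

# Both forms give the same predictions
all.equal(y_matrix, y_sum)
```

The column of 1s in `X` is what lets the intercept be treated like any other coefficient.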

## Model terms

Model terms describe how to include our explanatory variables in our model formula — there is more than one way!

1. Intercept
2. Main
3. Interaction
4. Transformation

## Intercept

• in R: y ~ 1
• in eq: $y=\beta_0 + \varepsilon$
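
As a rough illustration, an intercept-only model fit by least squares estimates $\beta_0$ as the mean of the response. The sketch below uses only the first nine record times visible in the truncated `glimpse` output above, not the full dataset, so the number it prints is not the estimate for all 62 rows.

```r
# First nine record times (seconds) visible in the glimpse output;
# the remaining rows are truncated there, so this is only a subset
time <- c(65.80, 65.60, 62.80, 61.60, 61.40, 60.40, 58.60, 57.40, 56.80)

# Intercept-only model: one free parameter, beta_0
fit_intercept <- lm(time ~ 1)

# The estimated intercept matches the sample mean
coef(fit_intercept)
mean(time)
```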

## Main

• in R: y ~ 1 + year + gender
• in eq: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon$
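
A sketch of how this specification might be fit to today's dataset, assuming the `mosaic` package is installed. In `SwimRecords` the response column is `time` and the grouping column is named `sex`, so those names stand in for `y` and `gender` here.

```r
library(mosaic)  # provides the SwimRecords data used above

# Main-effects model: intercept + year + sex
fit_main <- lm(time ~ 1 + year + sex, data = SwimRecords)

# One coefficient per term: beta_0, beta_1 (year), beta_2 (sex)
coef(fit_main)
```

Because `sex` is a factor, R codes it as a 0/1 indicator column, which plays the role of $x_2$ in the equation.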

## Interaction

• in R: y ~ 1 + year + gender + year:gender
• or the short way: y ~ 1 + year * gender
• in eq: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \varepsilon$
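
One way to check that the long and short forms specify the same model is to ask R which terms each formula expands to. This sketch uses only base R and needs no data; the variable names `time` and `sex` are the `SwimRecords` column names, used here in place of `y` and `gender`.

```r
# Long form: main terms plus an explicit interaction term
long_form <- time ~ 1 + year + sex + year:sex

# Short form: `*` expands to both main terms and their interaction
short_form <- time ~ 1 + year * sex

# terms() parses a formula; "term.labels" lists the terms it contains
long_terms  <- attr(terms(long_form), "term.labels")
short_terms <- attr(terms(short_form), "term.labels")

long_terms
short_terms
identical(long_terms, short_terms)
```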