Model Fitting

Data Science for Studying Language and the Mind

Katie Schuler

2023-10-17

You are here

Data science with R
  • Hello, world!
  • R basics
  • Data importing
  • Data visualization
  • Data wrangling
Stats & Model building
  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Feature engineering (preprocessing)
  • Inference for regression
  • Mixed-effect models

Model building overview

  • Model specification: what is the form of the model?
  • Model fitting: given the form, how do we estimate the free parameters?
  • Model accuracy: given the estimated parameters, how well does the model describe the data?
  • Model reliability: how much uncertainty is there in the estimated parameters?

Model specification

a brief review

Types of models

Specification

  1. Response, \(y\)
  2. Explanatory, \(x_n\)
  3. Functional form, \(y=\beta_0 + \beta_1x_1 + \epsilon\)
  4. Model terms
    • Intercept
    • Main
    • Interaction
    • Transformation
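
Each type of model term maps onto part of an R model formula. A quick sketch, using hypothetical predictors x1 and x2 (not part of the example data used later):

lm(y ~ 1, data)                 # intercept only
lm(y ~ 1 + x1 + x2, data)       # intercept + main effects
lm(y ~ 1 + x1 * x2, data)       # ... plus their interaction
lm(y ~ 1 + log(x1) + x2, data)  # ... with a log transformation of x1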

Linear model functional form

field              linear model equation
h.s. algebra       \(y = ax + b\)
machine learning   \(y = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n\)
statistics         \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon\)
matrix             \(y = X\beta + \epsilon\)

Model fitting

flowchart TD
    spec(Model specification) --> fit(Estimate free parameters) 
    fit(Estimate free parameters) --> fitted(Fitted model) 

Fitting a linear model

flowchart TD
    spec(Model specification \n y = ax + b) --> fit(Estimate free parameters) 
    fit(Estimate free parameters) --> fitted(Fitted model \n y = 0.7x + 0.6) 
library(ggplot2)
# example data: the same 5 points used in the error tables below
data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))
ggplot(data, aes(x = x, y = y)) +
    geom_point(size = 4, color = "darkred") +
    geom_smooth(method = "lm", formula = 'y ~ x', se = FALSE)

Fitting by intuition

How would you draw a “best fit” line?

Fitting by intuition

Which line fits best? How can you tell?

Quantifying “goodness” of fit

We can measure how close the model is to the data with the residuals: the differences between the observed data values and the model's predictions. Squaring and summing the residuals gives the sum of squared error (SSE):

\(SSE=\sum_{i=1}^{n} (d_{i} - m_{i})^2\)

where \(d_i\) is the \(i\)-th data value and \(m_i\) is the model's prediction for it.

For the candidate line \(y = 0.7x + 0.6\):

x   y     pred   err     sq_err
1   1.2   1.3    -0.1    0.01
2   2.5   2.0     0.5    0.25
3   2.3   2.7    -0.4    0.16
4   3.1   3.4    -0.3    0.09
5   4.4   4.1     0.3    0.09

SSE = 0.60

For the candidate line \(y = 1.04x + 0.54\):

x   y     pred   err     sq_err
1   1.2   1.58   -0.38   0.1444
2   2.5   2.62   -0.12   0.0144
3   2.3   3.66   -1.36   1.8496
4   3.1   4.70   -1.60   2.5600
5   4.4   5.74   -1.34   1.7956

SSE = 6.364

The first line has a much smaller SSE, so it fits the data better.
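
A minimal sketch of this computation in R, using the example data above:

# the 5 example data points
data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))

# sum of squared error for a candidate line y = b0 + b1*x
sse <- function(b0, b1, data) {
    pred <- b0 + b1 * data$x      # model predictions
    sum((data$y - pred)^2)        # squared residuals, summed
}

sse(0.6, 0.7, data)     # 0.60
sse(0.54, 1.04, data)   # 6.364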

But there are infinite possibilities

We can’t test every possible combination of free parameter values: there are infinitely many.

\(y=b_0+b_1x_1\)

Free parameters to test

Each level in the plot corresponds to the SSE for one combination of the free parameters.

Error surface
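
One way to picture the error surface is to compute the SSE over a grid of candidate parameter values. A sketch using the example data (the grid ranges and step size are arbitrary choices for illustration):

data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))
sse <- function(b0, b1) sum((data$y - (b0 + b1 * data$x))^2)

# every combination of candidate intercepts (b0) and slopes (b1)
grid <- expand.grid(b0 = seq(0, 2, by = 0.1), b1 = seq(0, 1.5, by = 0.1))
grid$SSE <- mapply(sse, grid$b0, grid$b1)

# the grid point with the smallest SSE sits at the bottom of the surface
grid[which.min(grid$SSE), ]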

Gradient descent, intuition

Gradient descent

Gradient descent linear model

For linear models, the error surface is convex: it has a single (global) minimum, so gradient descent is guaranteed to find the best-fitting parameters.
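
A minimal gradient descent sketch for this model (the starting values, step size, and number of iterations are arbitrary choices for illustration):

data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))

b0 <- 0; b1 <- 0      # start somewhere on the error surface
step <- 0.01          # learning rate
for (i in 1:5000) {
    pred <- b0 + b1 * data$x
    # gradient of SSE with respect to b0 and b1
    grad_b0 <- -2 * sum(data$y - pred)
    grad_b1 <- -2 * sum((data$y - pred) * data$x)
    # take a small step downhill
    b0 <- b0 - step * grad_b0
    b1 <- b1 - step * grad_b1
}
c(b0, b1)             # converges toward 0.6 and 0.7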

Ordinary least squares

Linear models have a closed-form solution: we can solve for the parameter values directly with linear algebra, no search required. Each data point gives one equation in the unknowns:

\(y = ax + b\)

\(1.2 = a \cdot 1 + b\)

\(2.5 = a \cdot 2 + b\)
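
A sketch of the matrix solution via the normal equations, \(\hat{\beta} = (X^TX)^{-1}X^Ty\), using the example data:

data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))

X <- cbind(1, data$x)    # design matrix: a column of 1s (intercept) and x
beta <- solve(t(X) %*% X) %*% t(X) %*% data$y
beta                     # 0.6 (intercept) and 0.7 (slope)

In practice, lm() performs this fit for us (using a more numerically stable QR decomposition):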

lm(y ~ 1 + x, data)

Call:
lm(formula = y ~ 1 + x, data = data)

Coefficients:
(Intercept)            x  
        0.6          0.7  
library(infer)
data %>%
    specify(y ~ 1 + x) %>%
    fit()
# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept    0.600
2 x            0.7  
