Model Fitting

Data Science for Studying Language and the Mind

Katie Schuler


You are here

Data science with R
  • Hello, world!
  • R basics
  • Data importing
  • Data visualization
  • Data wrangling
Stats & Model building
  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Feature engineering (preprocessing)
  • Inference for regression
  • Mixed-effect models

Model building overview

  • Model specification: what is the form?
  • Model fitting: you have the form, how do you guess the free parameters?
  • Model accuracy: you’ve estimated the parameters, how well does that model describe your data?
  • Model reliability: you’ve estimated the parameters, how much uncertainty is there in those estimates?

Model specification

a brief review

Types of models


  1. Response, \(y\)
  2. Explanatory, \(x_n\)
  3. Functional form, \(y=\beta_0 + \beta_1x_1 + \epsilon\)
  4. Model terms
    • Intercept
    • Main
    • Interaction
    • Transformation
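As a rough illustration of how these term types can be written in R's formula syntax, the sketch below builds a formula and lists its terms. The predictor names `x1` and `x2` and the choice of `log()` as the transformation are placeholders for illustration, not part of the original slides.

```r
# Hypothetical formula: x1 and x2 are placeholder predictor names
f <- y ~ 1 + x1 + x2 + x1:x2 + log(x1)

# terms() parses a formula without needing any data
tt <- terms(f)

# Main (x1, x2), transformation (log(x1)), and interaction (x1:x2) terms
term_labels <- attr(tt, "term.labels")

# The intercept is stored separately: 1 means the formula includes one
has_intercept <- attr(tt, "intercept") == 1

print(term_labels)
print(has_intercept)
```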

Linear model functional form

field             linear model equation
h.s. algebra      \(y = ax + b\)
machine learning  \(y = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n\)
statistics        \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\)
matrix            \(y = X\beta + \epsilon\)
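To make the matrix form concrete, here is a minimal sketch that builds the design matrix \(X\) for a one-predictor model and computes predictions as \(X\beta\). It assumes the five x/y values and the parameter values 0.6 and 0.7 that appear elsewhere in these slides; the variable names are illustrative.

```r
# Toy data: the five x/y points used in the SSE tables in these slides
data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))

# Design matrix: a column of 1s for the intercept plus a column for x
X <- model.matrix(y ~ 1 + x, data)

# Parameter vector (intercept, slope), values assumed from the fitted model
beta <- c(0.6, 0.7)

# Model predictions are the matrix product X %*% beta
pred <- drop(X %*% beta)
print(pred)
```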

Model fitting

Model fitting

flowchart TD
    spec(Model specification) --> fit(Estimate free parameters) 
    fit(Estimate free parameters) --> fitted(Fitted model) 

Fitting a linear model

flowchart TD
    spec(Model specification \n y = ax + b) --> fit(Estimate free parameters) 
    fit(Estimate free parameters) --> fitted(Fitted model \n y = 0.7x + 0.6) 

library(ggplot2)

# Scatterplot of the data with the least-squares line overlaid
ggplot(data, aes(x = x, y = y)) +
    geom_point(size = 4, color = "darkred") +
    geom_smooth(method = "lm", formula = 'y ~ x', se = FALSE)

Fitting by intuition

How would you draw a “best fit” line?

Fitting by intuition

Which line fits best? How can you tell?

Quantifying “goodness” of fit

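We can measure how close the model is to the data with the sum of squared errors (SSE), where \(d_i\) is the \(i\)-th data point and \(m_i\) is the model's prediction for it:

\(SSE=\sum_{i=1}^{n} (d_{i} - m_{i})^2\)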

Line 1: pred = 0.6 + 0.7x (SSE = 0.60)

x  y    pred  err    sq_err
1  1.2  1.3   -0.1   0.01
2  2.5  2.0    0.5   0.25
3  2.3  2.7   -0.4   0.16
4  3.1  3.4   -0.3   0.09
5  4.4  4.1    0.3   0.09

Line 2: pred = 0.54 + 1.04x (SSE = 6.364)

x  y    pred  err    sq_err
1  1.2  1.58  -0.38  0.1444
2  2.5  2.62  -0.12  0.0144
3  2.3  3.66  -1.36  1.8496
4  3.1  4.70  -1.60  2.5600
5  4.4  5.74  -1.34  1.7956
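The squared-error columns above can be reproduced with a few lines of R. This is a minimal sketch: the helper name `sse()` is made up for illustration, and the intercept and slope of the second line (0.54 and 1.04) are inferred from its `pred` column.

```r
# The five data points from the tables above
data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))

# Hypothetical helper: sum of squared errors for a line with
# intercept b0 and slope b1
sse <- function(b0, b1, data) {
  pred <- b0 + b1 * data$x
  sum((data$y - pred)^2)
}

sse_line1 <- sse(0.6, 0.7, data)    # line from the first table
sse_line2 <- sse(0.54, 1.04, data)  # line from the second table

print(c(line1 = sse_line1, line2 = sse_line2))
```

The line with the smaller SSE is the better fit by this measure.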

But there are infinitely many possibilities

We can’t test every possible combination of free parameter values: there are infinitely many


Free parameters to test

Level = SSE

Error surface
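One way to picture the error surface is to compute SSE over a grid of candidate parameter values. The sketch below does this for the five-point toy data; the grid ranges and the 0.1 step size are arbitrary choices made for illustration.

```r
# The five data points used in the SSE tables
data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))

# Candidate parameter values (ranges and step size are arbitrary)
grid <- expand.grid(
  b0 = seq(-1, 2, by = 0.1),   # intercepts to test
  b1 = seq(0, 1.5, by = 0.1)   # slopes to test
)

# SSE for every (b0, b1) pair: the height of the error surface
grid$sse <- mapply(
  function(b0, b1) sum((data$y - (b0 + b1 * data$x))^2),
  grid$b0, grid$b1
)

# The grid point with the lowest SSE
best <- grid[which.min(grid$sse), ]
print(best)
```

A grid can only find the best value among the points it tests, so a finer or wider grid may be needed for other data.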

Gradient descent, intuition

Gradient descent

Gradient descent linear model

For linear models, the SSE error surface is convex: it has a single minimum
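As a rough sketch of the idea, gradient descent can be hand-coded for the one-predictor linear model: start from a guess, then repeatedly step downhill on the SSE surface. The starting values, learning rate, and number of steps below are assumptions chosen so that this toy example converges; they are not from the slides.

```r
# The five data points used in the SSE tables
data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))

b0 <- 0        # starting guess for the intercept (assumed)
b1 <- 0        # starting guess for the slope (assumed)
rate <- 0.01   # learning rate (assumed; too large a value diverges)

for (step in 1:5000) {
  resid <- data$y - (b0 + b1 * data$x)

  # Partial derivatives of SSE with respect to b0 and b1
  grad_b0 <- -2 * sum(resid)
  grad_b1 <- -2 * sum(resid * data$x)

  # Step in the direction that decreases SSE
  b0 <- b0 - rate * grad_b0
  b1 <- b1 - rate * grad_b1
}

print(c(intercept = b0, slope = b1))
```

Because the surface has a single minimum, the starting guess affects how long the descent takes but not where it ends up.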

Ordinary least squares

Linear models have a closed-form solution: we can solve directly for the parameter values that minimize SSE using linear algebra.

\(y = ax + b\)

\(1.2 = a \cdot 1 + b\)

\(2.5 = a \cdot 2 + b\)
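With one such equation per data point, the system can be written as \(y = X\beta\) and solved in the least-squares sense with the normal equations, \(\hat{\beta} = (X^TX)^{-1}X^Ty\). The sketch below applies this to the five-point toy data; it illustrates the linear algebra and is not necessarily how `lm()` computes its estimates internally.

```r
# The five data points used in the SSE tables
data <- data.frame(x = 1:5, y = c(1.2, 2.5, 2.3, 3.1, 4.4))

# Design matrix: a column of 1s (for b) and the x values (for a)
X <- cbind(1, data$x)
y <- data$y

# Solve the normal equations (X'X) beta = X'y for beta
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# First entry is the intercept b, second is the slope a
print(beta_hat)
```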

lm(y ~ 1 + x, data)

Call:
lm(formula = y ~ 1 + x, data = data)

Coefficients:
(Intercept)            x  
        0.6          0.7  

# The same fit using the infer package
data %>%
    specify(y ~ 1 + x) %>%
    fit()

# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept    0.600
2 x            0.7  

ordinary least squares