# Model Fitting

Data Science for Studying Language and the Mind

Katie Schuler

2023-10-17

## You are here

##### Data science with R
• Hello, world!
• R basics
• Data importing
• Data visualization
• Data wrangling
##### Stats & Model building
• Sampling distribution
• Hypothesis testing
• Model specification
• Model fitting
• Model accuracy
• Model reliability
• Classification
• Feature engineering (preprocessing)
• Inference for regression
• Mixed-effect models

## Model building overview

• Model specification: what is the form of the model?
• Model fitting: given the form, how do we estimate the free parameters?
• Model accuracy: given the estimated parameters, how well does the model describe the data?
• Model reliability: how much uncertainty is there in the estimated parameters?

# Model specification

a brief review

#### Specification

1. Response, $y$
2. Explanatory, $x_n$
3. Functional form, $y=\beta_0 + \beta_1x_1 + \epsilon$
4. Model terms
• Intercept
• Main
• Interaction
• Transformation

## Linear model functional form

| field | linear model equation |
|---|---|
| h.s. algebra | $y = ax + b$ |
| machine learning | $y = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$ |
| statistics | $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon$ |
| matrix | $y = X\beta + \epsilon$ |

# Model fitting

## Model fitting

```mermaid
flowchart TD
spec(Model specification) --> fit(Estimate free parameters)
fit(Estimate free parameters) --> fitted(Fitted model)
```


## Fitting a linear model

```mermaid
flowchart TD
spec(Model specification \n y = ax + b) --> fit(Estimate free parameters)
fit(Estimate free parameters) --> fitted(Fitted model \n y = 0.7x + 0.6)
```

```r
ggplot(data, aes(x = x, y = y)) +
  geom_point(size = 4, color = "darkred") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

## Fitting by intuition

How would you draw a “best fit” line?

## Fitting by intuition

Which line fits best? How can you tell?

## Quantifying “goodness” of fit

We can measure how close the model is to the data: the vertical distances between the observed values and the model's predictions are called **residuals**.

## $SSE=\sum_{i=1}^{n} (d_{i} - m_{i})^2$

Candidate line $y = 0.7x + 0.6$:

| x | y | pred | err | sq_err |
|---|-----|------|------|--------|
| 1 | 1.2 | 1.3 | -0.1 | 0.01 |
| 2 | 2.5 | 2.0 | 0.5 | 0.25 |
| 3 | 2.3 | 2.7 | -0.4 | 0.16 |
| 4 | 3.1 | 3.4 | -0.3 | 0.09 |
| 5 | 4.4 | 4.1 | 0.3 | 0.09 |

Candidate line $y = 1.04x + 0.54$:

| x | y | pred | err | sq_err |
|---|-----|------|-------|--------|
| 1 | 1.2 | 1.58 | -0.38 | 0.1444 |
| 2 | 2.5 | 2.62 | -0.12 | 0.0144 |
| 3 | 2.3 | 3.66 | -1.36 | 1.8496 |
| 4 | 3.1 | 4.70 | -1.60 | 2.5600 |
| 5 | 4.4 | 5.74 | -1.34 | 1.7956 |
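The SSE calculations above can be reproduced directly in R. A minimal sketch, using the `x` and `y` values from the tables:

```r
# Data from the tables above
x <- c(1, 2, 3, 4, 5)
y <- c(1.2, 2.5, 2.3, 3.1, 4.4)

# Sum of squared error for a candidate line y = b0 + b1*x
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

sse(0.6, 0.7)    # first candidate line:  0.6 (sum of its sq_err column)
sse(0.54, 1.04)  # second candidate line: 6.364
```

The first line has a much smaller SSE, which quantifies the intuition that it fits better.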

## But there are infinite possibilities

We can’t try every possible combination of free parameters: there are infinitely many.

$y=b_0+b_1x_1$

## Error surface

Linear models are convex functions: one minimum
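One way to see the convexity is to evaluate SSE over a grid of candidate intercepts and slopes (a grid-search sketch, using the same toy data as the tables above; the grid range is an arbitrary choice):

```r
# Data from the tables above
x <- c(1, 2, 3, 4, 5)
y <- c(1.2, 2.5, 2.3, 3.1, 4.4)

# Grid of candidate intercepts (b0) and slopes (b1)
b0 <- seq(-1, 2, by = 0.1)
b1 <- seq(-1, 2, by = 0.1)

# SSE at every (b0, b1) combination: the error surface
sse_grid <- outer(b0, b1,
                  Vectorize(function(i, s) sum((y - (i + s * x))^2)))

# The surface has a single minimum; find the grid point that attains it
idx <- which(sse_grid == min(sse_grid), arr.ind = TRUE)
c(b0[idx[1]], b1[idx[2]])  # closest grid point to the best fit: 0.6, 0.7
```

A grid search like this only works because the surface is coarse enough to plot; the next slide shows we never actually need to search.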

## Ordinary least squares

Linear models have a closed-form solution: we can solve for the parameter values directly with linear algebra.

#### $y = ax + b$

$1.2 = a \cdot 1 + b$

$2.5 = a \cdot 2 + b$
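Stacking all such equations into matrices, ordinary least squares has the closed-form solution $\hat{\beta} = (X^TX)^{-1}X^Ty$, the normal equations. A sketch with the toy data from the tables:

```r
# Data from the tables above
x <- c(1, 2, 3, 4, 5)
y <- c(1.2, 2.5, 2.3, 3.1, 4.4)

# Design matrix: a column of 1s (intercept) plus the predictor
X <- cbind(1, x)

# Solve the normal equations (X'X) beta = X'y
beta <- solve(t(X) %*% X, t(X) %*% y)
beta  # intercept 0.6, slope 0.7
```

This is the same answer `lm()` returns; `lm()` uses a numerically more stable decomposition under the hood, but the estimates are the least-squares solution either way.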

```r
lm(y ~ 1 + x, data)
```

```
Call:
lm(formula = y ~ 1 + x, data = data)

Coefficients:
(Intercept)            x  
        0.6          0.7  
```

```r
data %>%
  specify(y ~ 1 + x) %>%
  fit()
```

```
# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept    0.600
2 x            0.7
```

ordinary least squares