Data Science for Studying Language and the Mind

Katie Schuler

2024-10-22

Final exam scheduled for December 19th at noon

If you have a conflict, please let us know by December 6. We can proctor an earlier date.

Pset 03 solutions to be posted today!

Psets 04 and 05 pushed back by one week


- R basics
- Data visualization
- Data wrangling

- Sampling distribution
- Hypothesis testing
- Model specification
- Model fitting

- Model accuracy
- Model reliability

- Classification
- Inference for regression
- Mixed-effect models

- Fitting linear models in R
- Goodness of fit: quantifying our intuition
- Search problem: How do we find the best one?
- Gradient descent - iterative optimization algorithm
- Ordinary least squares - analytical solution for linear regression

- If time: another full example

| model specification | |
|---|---|
| R syntax | `rt ~ 1 + experience` |
| R syntax | `rt ~ experience` |
| Equation | \(y=w_0+w_1x_1\) |

\(y = 211.271 - 1.695x\)


How would you draw a “best fit” line?

- Draw a straight line that goes through as many points as possible.

Which line fits best? How can you tell?

- B, because the model is closer to the data (closer to more of the points)

We can measure how close the model is to the data

- We call these the “errors” or “residuals”. We want a single number that represents goodness of fit.

We can quantify our intuition and express goodness of fit with the sum of squared error.

| experience | rt | prediction | error | squared_error |
|---|---|---|---|---|
| 49 | 124 | 128.22414 | 4.2241379 | 17.8433413 |
| 69 | 95 | 94.32759 | -0.6724138 | 0.4521403 |
| 89 | 71 | 60.43103 | -10.5689655 | 111.7030321 |
| 99 | 45 | 43.48276 | -1.5172414 | 2.3020214 |
| 109 | 18 | 26.53448 | 8.5344828 | 72.8373960 |

```
[1] 205.1379
```

\(SSE=\sum_{i=1}^{n} (d_{i} - m_{i})^2 = 205.1379\)

\(SSE=\sum_{i=1}^{n} (d_{i} - m_{i})^2\)

Given some data:

| experience | rt | prediction | error | squared_error |
|---|---|---|---|---|
| 49 | 124 | 128.22414 | 4.2241379 | 17.8433413 |
| 69 | 95 | 94.32759 | -0.6724138 | 0.4521403 |
| 89 | 71 | 60.43103 | -10.5689655 | 111.7030321 |
| 99 | 45 | 43.48276 | -1.5172414 | 2.3020214 |
| 109 | 18 | 26.53448 | 8.5344828 | 72.8373960 |

Compute the sum of squared error:
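One way to check the computation in R (a sketch; the `squared_error` values are taken directly from the table above):

```
# sum the squared_error column from the table above
squared_error <- c(17.8433413, 0.4521403, 111.7030321, 2.3020214, 72.8373960)
sum(squared_error)
```

This returns the 205.1379 reported above.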

- The `predict()` way uses a fitted model, which already has the best-fitting free parameters.
- We need a way to check different *potential* model fits to see which one is best.
- Let’s write a function! (A sketch follows below.)
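One possible sketch of such a function, assuming the data frame is called `data` with columns `experience` and `rt` (the function name `check_fit` and the candidate parameter values below are just for illustration):

```
library(tidyverse)

# goodness of fit for one candidate line: rt = w0 + w1 * experience
check_fit <- function(data, w0, w1) {
  data %>%
    mutate(prediction = w0 + w1 * experience,
           error = prediction - rt,
           squared_error = error^2) %>%
    summarise(SSE = sum(squared_error))
}

# try a couple of potential model fits (parameter values chosen arbitrarily)
check_fit(data, w0 = 220, w1 = -2)
check_fit(data, w0 = 200, w1 = -1)
```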

Now we can check any model parameters.

- That’s a lot of parameters to test! (`Inf`?)
- Also, how do we know when to stop?

Check every possible parameter value?

- We have a parameter space, a cost function, and our job is to search through the space to find the point that minimizes the cost function.
- Since we are using the cost function of squared error, we can think of our job as trying to find the minimum point on an *error surface*.

If there is only one parameter, the error surface is a curvy line; if there are two parameters, it’s a bumpy sheet; and so on.

To search through a parameter space, we can use local, iterative optimization:

- Start at some point in the space (the *initial seed*)
- Look at the error surface in a small neighborhood around that point
- Move in some direction in an attempt to reduce the error (cost)
- Repeat until improvements are sufficiently small

There are many iterative optimization algorithms out there that vary in how they execute these steps. One simple example is **gradient descent**.

Define our cost function (step 1):
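A minimal sketch: the same sum-of-squared-error idea as `check_fit()` above, but written so it takes the data and a single parameter vector `par` (with `par[1]` as the intercept and `par[2]` as the slope), which is the form an optimizer can work with. The name `SSE` is ours:

```
# cost function: sum of squared error for candidate parameters par
SSE <- function(data, par) {
  prediction <- par[1] + par[2] * data$experience
  sum((prediction - data$rt)^2)
}
```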

Implement the gradient descent algorithm with `optimg` (step 2):
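A sketch of the call, assuming the `optimg` package is installed and `SSE()` is defined as above; `par = c(0, 0)` is just an arbitrary starting seed, and `"STGD"` requests steepest gradient descent:

```
library(optimg)

# gradient descent: start at par = c(0, 0) and iteratively reduce SSE
optimg(data = data, par = c(0, 0), fn = SSE, method = "STGD")
```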

We can compare `optimg`’s estimates to those returned by `lm()` to see that they are *nearly* identical:
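For the comparison, we can fit the same model specification with `lm()` (recall the fitted equation \(y = 211.271 - 1.695x\) above):

```
# fit the same model with lm() for comparison
lm(rt ~ 1 + experience, data = data)
```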

A potential problem with iterative optimization algorithms is the risk of finding a *local minimum*.

Only nonlinear models have this potential problem; this is true even in higher dimensions.

For all linear models, we can think of the error surface as being shaped like a bowl, so there is *no risk* of a local minimum.

Another way we can find the best fitting free parameters for linear (or linearizable nonlinear) models is to use the Ordinary Least-Squares (OLS) estimate.

- In OLS, the best-fitting free parameters are found by solving a system of equations (using matrix operations/linear algebra) which leads to a closed-form solution.

- This means that OLS provides *exact values* of the best-fitting parameters in one step (as long as a few necessary conditions are met).

- We can contrast this with iterative optimization algorithms (like gradient descent), which gradually adjust the model parameters over multiple iterations to minimize the error, often requiring many steps to converge on *approximate values* of the best-fitting parameters.

In OLS, the goal is to model the relationship between input variables and the output variable (\(y\)) as a linear combination. We express this very generally in our favorite equation, where the output (\(y\)) is a weighted sum of inputs (\(x_i\)).

- \(y=\sum_{i=1}^{n}w_ix_i\)

Recall that this general expression has many 🥸 aliases. That is, the **linear model equation** can be expressed in many ways, but *they are all this same thing*:

- in **high school algebra**: \(y=ax+b\)
- in **machine learning**: \(y = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n\)
- in **statistics**: \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon\)
- in **matrix notation**: \(y = Xw + \epsilon\)

The matrix notation is what allows us to appreciate that we can solve for the best fitting free parameters with linear algebra.

We can express in matrix notation:

\[ \begin{aligned} \mathbf{y} = \mathbf{X} \mathbf{w} + \mathbf{\epsilon} \end{aligned} \]

Where:

- \(\mathbf{y}\) is the output vector (`rt`).
- \(\mathbf{X}\) is the input matrix (`experience` with an intercept).
- \(\mathbf{w}\) is the weight vector (parameter estimates including the intercept).
- \(\boldsymbol{\epsilon}\) is the vector of errors (residuals).

Because our data set is small, we can expand these to help you picture this visually a little better:

**Input Matrix**, \(\mathbf{X}\) (intercept and `experience`):

\[ \begin{aligned} \mathbf{X} = \begin{bmatrix} 1 & 49 \\ 1 & 69 \\ 1 & 89 \\ 1 & 99 \\ 1 & 109 \end{bmatrix} \end{aligned} \]

**Output Vector**, \(\mathbf{y}\) (`rt`):

\[ \begin{aligned} \mathbf{y} = \begin{bmatrix} 124 \\ 95 \\ 71 \\ 45 \\ 18 \end{bmatrix} \end{aligned} \]

**Weight Vector**, \(\mathbf{w}\) (Unknown coefficients including intercept):

\[ \begin{aligned} \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \end{aligned} \]

where \(w_1\) is the intercept and \(w_2\) is the weight for `experience`.

Putting it all together, with the vector of errors (residuals) \(\boldsymbol{\epsilon}\), the linear model equation becomes:

\[ \begin{aligned} \begin{bmatrix} 124 \\ 95 \\ 71 \\ 45 \\ 18 \end{bmatrix} = \begin{bmatrix} 1 & 49 \\ 1 & 69 \\ 1 & 89 \\ 1 & 99 \\ 1 & 109 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \end{bmatrix} \end{aligned} \]

At this stage, we can take the mathematicians’ word for it that this provides an *exact* solution to the best fitting parameter estimates.
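Concretely, the OLS estimate solves for the weights in one step:

\[ \mathbf{\hat{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \]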

We can demonstrate this with code:

We need to construct X and Y (must be matrices):

```
# response vector y (rt) as a one-column matrix
(response_matrix <- data %>% select(rt) %>% as.matrix())

# design matrix X: a column of 1s for the intercept, plus experience
(explanatory_matrix <- data %>% mutate(int = 1) %>% select(int, experience) %>% as.matrix())
```

```
rt
[1,] 124
[2,] 95
[3,] 71
[4,] 45
[5,] 18
```

```
int experience
[1,] 1 49
[2,] 1 69
[3,] 1 89
[4,] 1 99
[5,] 1 109
```

Then we can use our function to generate the OLS solution:
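A minimal sketch of such a function (the name `ols_solve` is ours; it just implements \(w = (X^\top X)^{-1}X^\top y\) with base R matrix operations):

```
# closed-form OLS solution: w = (X'X)^{-1} X'y
ols_solve <- function(X, Y) {
  solve(t(X) %*% X) %*% t(X) %*% Y
}

ols_solve(explanatory_matrix, response_matrix)
```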

Which is **exactly the same** as that returned by `lm()` (because `lm()` is doing this!).

- Importantly, if there are more regressors than data points, then there is no OLS solution. The intuition for the underlying math is that if there are more weights than data points, there are infinitely many solutions, all of which achieve zero error.

We can fit a model if there are fewer inputs than data points:

```
# toy data: 3 observations, 3 possible predictors
data2 <- tibble(
  y = c(2, 5, 7),
  x = c(1, 2, 3),
  z = c(2, 4, 6),
  a = c(6, 7, 8)
)

# 2 parameters (intercept + x), 3 data points: this fits fine
(model1 <- lm(y ~ 1 + x, data = data2))
```

```
Call:
lm(formula = y ~ 1 + x, data = data2)
Coefficients:
(Intercept) x
-0.3333 2.5000
```


But not if there are more inputs than data points in our model:
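The call that produces the output below (the object name `model2` is just for illustration):

```
# 4 parameters (intercept + x + z + a) but only 3 data points
(model2 <- lm(y ~ 1 + x + z + a, data = data2))
```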

```
Call:
lm(formula = y ~ 1 + x + z + a, data = data2)
Coefficients:
(Intercept) x z a
-0.3333 2.5000 NA NA
```

`lm()` is smart and fits the reduced model it *can* fit. If we try to solve this the matrix way via our homegrown function, we get an error.
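A sketch of why, using the `ols_solve()` function sketched above: with more columns than rows, \(\mathbf{X}^\top\mathbf{X}\) is rank-deficient, so `solve()` cannot invert it and throws an error:

```
# design matrix with 4 columns but only 3 rows
X_wide <- data2 %>% mutate(int = 1) %>% select(int, x, z, a) %>% as.matrix()
Y2 <- data2 %>% select(y) %>% as.matrix()

# errors: t(X_wide) %*% X_wide is singular, so it cannot be inverted
ols_solve(X_wide, Y2)
```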

To recap the full demonstration: we construct \(X\) and \(Y\) (they must be matrices) and solve with linear algebra, \(w = (X^TX)^{-1}X^TY\).

```
(response_matrix <- data %>% select(rt) %>% as.matrix())
(explanatory_matrix <- data %>% mutate(int = 1) %>% select(int, experience) %>% as.matrix())
```

```
rt
[1,] 124
[2,] 95
[3,] 71
[4,] 45
[5,] 18
```

```
int experience
[1,] 1 49
[2,] 1 69
[3,] 1 89
[4,] 1 99
[5,] 1 109
```

Then we can use our function to generate the OLS solution, which matches the coefficients returned by `lm()`:

```
Call:
lm(formula = rt ~ experience, data = data)
Coefficients:
(Intercept) experience
211.271 -1.695
```