library(tidyverse)
library(mosaic)
Applied model specification
Now that we’ve covered the terminology and concepts, let’s apply model specification to some real models.
1 “Toy” data
Let’s start with the simplest possible example, a small dataset with just two variables. Suppose you record how many hours you study on each of the next 5 days. On day 1, you study for 2 hours; on day 2, you study for 6 hours; and so on. Your dataset might look something like this.
x | y |
---|---|
1 | 2 |
2 | 6 |
3 | 7 |
4 | 12 |
5 | 13 |
toy_data <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 6, 7, 12, 13)
)

toy_data %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  theme_bw(base_size = 14)
- Specify our response variable, \(y\): the response variable (🥸 data, output, prediction) is the variable you are trying to predict or explain with your model.
y
- Specify explanatory variables, \(x_i\): the explanatory variables (🥸 regressors, inputs, predictors) are the predictors in your data that could help explain the response variable. Our data has only one possibility:
x
- Specify the functional form: the functional form describes the relationship between the response and explanatory variables with a mathematical expression. In a linear model, we express this relationship as a weighted sum of inputs:
- \(y=\sum_{i=1}^{n}w_ix_i\)
- Specify model terms: here we need to specify exactly how to express our explanatory variables in our functional form, that is, the actual variables and constants that will be included in the model. There are four kinds of terms: (1) intercept, (2) main, (3) interaction, and (4) transformation. Here we have the simplest case of an intercept and one main term (no interactions or transformations necessary); see the note on R's formula shorthand after this list.
- \(y = w_1\cdot\mathbf{1} + w_2\cdot\mathbf{x}\)
- in R:
y ~ 1 + x
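A quick note on R's formula shorthand before we fit this model: R includes the intercept by default, so writing the 1 explicitly is optional (we keep it to make the intercept term visible). A sketch of the equivalent spellings:

# equivalent: R adds the intercept automatically
lm(y ~ 1 + x, data = toy_data)
lm(y ~ x, data = toy_data)

# to fit without an intercept, replace the 1 with 0 (or use - 1)
lm(y ~ 0 + x, data = toy_data)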
Model specification:
\(y = w_1\cdot\mathbf{1} + w_2\cdot\mathbf{x}\)
Call:
lm(formula = y ~ 1 + x, data = toy_data)
Coefficients:
(Intercept) x
-0.4 2.8
Fitted model:
\(y = -0.4\cdot1 + 2.8\cdot x\)
x | y | with_formula | with_predict |
---|---|---|---|
1 | 2 | 2.4 | 2.4 |
2 | 6 | 5.2 | 5.2 |
3 | 7 | 8.0 | 8.0 |
4 | 12 | 10.8 | 10.8 |
5 | 13 | 13.6 | 13.6 |
model <- lm(y ~ 1 + x, data = toy_data)

toy_data <- toy_data %>%
  mutate(with_formula = -0.4*1 + 2.8*x) %>%
  mutate(with_predict = predict(model, toy_data))

toy_data %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
2 Swim records
2.1 One input
If our model has a single input, it is likely the intercept term: a constant (not a variable) capturing the typical value of the response variable when all explanatory variables are zero.
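A handy fact about this model: the least-squares estimate of a lone intercept is simply the mean of the response. A quick check against the SwimRecords data (from the mosaic package loaded above):

# with no other terms, the fitted intercept is just the mean response
mean(SwimRecords$time)
# 59.92419, matching the (Intercept) coefficient below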
Model specification:
\(y = w_1\cdot\mathbf{1}\)
Call:
lm(formula = time ~ 1, data = SwimRecords)
Coefficients:
(Intercept)
59.92
Fitted model:
\(y = 59.92 \cdot 1\)
year | time | sex | with_formula | with_predict |
---|---|---|---|---|
1905 | 65.80 | M | 59.92 | 59.92419 |
1908 | 65.60 | M | 59.92 | 59.92419 |
1910 | 62.80 | M | 59.92 | 59.92419 |
1912 | 61.60 | M | 59.92 | 59.92419 |
1918 | 61.40 | M | 59.92 | 59.92419 |
1920 | 60.40 | M | 59.92 | 59.92419 |
1922 | 58.60 | M | 59.92 | 59.92419 |
1924 | 57.40 | M | 59.92 | 59.92419 |
1934 | 56.80 | M | 59.92 | 59.92419 |
1935 | 56.60 | M | 59.92 | 59.92419 |
1936 | 56.40 | M | 59.92 | 59.92419 |
1944 | 55.90 | M | 59.92 | 59.92419 |
1947 | 55.80 | M | 59.92 | 59.92419 |
1948 | 55.40 | M | 59.92 | 59.92419 |
1955 | 54.80 | M | 59.92 | 59.92419 |
1957 | 54.60 | M | 59.92 | 59.92419 |
1961 | 53.60 | M | 59.92 | 59.92419 |
1964 | 52.90 | M | 59.92 | 59.92419 |
1967 | 52.60 | M | 59.92 | 59.92419 |
1968 | 52.20 | M | 59.92 | 59.92419 |
1970 | 51.90 | M | 59.92 | 59.92419 |
1972 | 51.22 | M | 59.92 | 59.92419 |
1975 | 50.59 | M | 59.92 | 59.92419 |
1976 | 49.44 | M | 59.92 | 59.92419 |
1981 | 49.36 | M | 59.92 | 59.92419 |
1985 | 49.24 | M | 59.92 | 59.92419 |
1986 | 48.74 | M | 59.92 | 59.92419 |
1988 | 48.42 | M | 59.92 | 59.92419 |
1994 | 48.21 | M | 59.92 | 59.92419 |
2000 | 48.18 | M | 59.92 | 59.92419 |
2000 | 47.84 | M | 59.92 | 59.92419 |
1908 | 95.00 | F | 59.92 | 59.92419 |
1910 | 86.60 | F | 59.92 | 59.92419 |
1911 | 84.60 | F | 59.92 | 59.92419 |
1912 | 78.80 | F | 59.92 | 59.92419 |
1915 | 76.20 | F | 59.92 | 59.92419 |
1920 | 73.60 | F | 59.92 | 59.92419 |
1923 | 72.80 | F | 59.92 | 59.92419 |
1924 | 72.20 | F | 59.92 | 59.92419 |
1926 | 70.00 | F | 59.92 | 59.92419 |
1929 | 69.40 | F | 59.92 | 59.92419 |
1930 | 68.00 | F | 59.92 | 59.92419 |
1931 | 66.60 | F | 59.92 | 59.92419 |
1933 | 66.00 | F | 59.92 | 59.92419 |
1934 | 65.40 | F | 59.92 | 59.92419 |
1936 | 64.60 | F | 59.92 | 59.92419 |
1956 | 62.00 | F | 59.92 | 59.92419 |
1958 | 61.20 | F | 59.92 | 59.92419 |
1960 | 60.20 | F | 59.92 | 59.92419 |
1962 | 59.50 | F | 59.92 | 59.92419 |
1964 | 58.90 | F | 59.92 | 59.92419 |
1972 | 58.50 | F | 59.92 | 59.92419 |
1973 | 57.54 | F | 59.92 | 59.92419 |
1974 | 56.96 | F | 59.92 | 59.92419 |
1976 | 55.65 | F | 59.92 | 59.92419 |
1978 | 55.41 | F | 59.92 | 59.92419 |
1980 | 54.79 | F | 59.92 | 59.92419 |
1986 | 54.73 | F | 59.92 | 59.92419 |
1992 | 54.48 | F | 59.92 | 59.92419 |
1994 | 54.01 | F | 59.92 | 59.92419 |
2000 | 53.77 | F | 59.92 | 59.92419 |
2004 | 53.52 | F | 59.92 | 59.92419 |
model <- lm(time ~ 1, data = SwimRecords)

SwimRecords_predict <- SwimRecords %>%
  mutate(with_formula = 59.92*1) %>%
  mutate(with_predict = predict(model, SwimRecords))

SwimRecords_predict %>%
  ggplot(aes(x = year, y = time)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
2.2 Two inputs
We can add another term to our model representing the effect of year on record time. This is a main term or main effect, which represents the direct effect of each explanatory variable on the response variable. In other words, how does record time change as a result of changes in year, holding all other explanatory variables constant?
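Once fit, we can read the estimated weights straight off the model object with coef(); a quick sketch (values rounded, matching the output below):

model <- lm(time ~ 1 + year, data = SwimRecords)

# coef() returns the fitted weights as a named vector:
# (Intercept) ~ 567.2420, year ~ -0.2599
coef(model)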
Model specification:
\(y = w_1\cdot\mathbf{1} + w_2\cdot\mathbf{year}\)
Call:
lm(formula = time ~ 1 + year, data = SwimRecords)
Coefficients:
(Intercept) year
567.2420 -0.2599
Fitted model:
\(y = 567.2420 \cdot \mathbf{1} + -0.2599 \cdot \mathbf{year}\)
year | time | sex | with_formula | with_predict |
---|---|---|---|---|
1905 | 65.80 | M | 72.1325 | 72.17614 |
1908 | 65.60 | M | 71.3528 | 71.39651 |
1910 | 62.80 | M | 70.8330 | 70.87676 |
1912 | 61.60 | M | 70.3132 | 70.35700 |
1918 | 61.40 | M | 68.7538 | 68.79774 |
1920 | 60.40 | M | 68.2340 | 68.27798 |
1922 | 58.60 | M | 67.7142 | 67.75823 |
1924 | 57.40 | M | 67.1944 | 67.23848 |
1934 | 56.80 | M | 64.5954 | 64.63971 |
1935 | 56.60 | M | 64.3355 | 64.37983 |
1936 | 56.40 | M | 64.0756 | 64.11995 |
1944 | 55.90 | M | 61.9964 | 62.04093 |
1947 | 55.80 | M | 61.2167 | 61.26130 |
1948 | 55.40 | M | 60.9568 | 61.00143 |
1955 | 54.80 | M | 59.1375 | 59.18229 |
1957 | 54.60 | M | 58.6177 | 58.66253 |
1961 | 53.60 | M | 57.5781 | 57.62302 |
1964 | 52.90 | M | 56.7984 | 56.84339 |
1967 | 52.60 | M | 56.0187 | 56.06376 |
1968 | 52.20 | M | 55.7588 | 55.80388 |
1970 | 51.90 | M | 55.2390 | 55.28413 |
1972 | 51.22 | M | 54.7192 | 54.76438 |
1975 | 50.59 | M | 53.9395 | 53.98474 |
1976 | 49.44 | M | 53.6796 | 53.72487 |
1981 | 49.36 | M | 52.3801 | 52.42548 |
1985 | 49.24 | M | 51.3405 | 51.38597 |
1986 | 48.74 | M | 51.0806 | 51.12610 |
1988 | 48.42 | M | 50.5608 | 50.60634 |
1994 | 48.21 | M | 49.0014 | 49.04708 |
2000 | 48.18 | M | 47.4420 | 47.48782 |
2000 | 47.84 | M | 47.4420 | 47.48782 |
1908 | 95.00 | F | 71.3528 | 71.39651 |
1910 | 86.60 | F | 70.8330 | 70.87676 |
1911 | 84.60 | F | 70.5731 | 70.61688 |
1912 | 78.80 | F | 70.3132 | 70.35700 |
1915 | 76.20 | F | 69.5335 | 69.57737 |
1920 | 73.60 | F | 68.2340 | 68.27798 |
1923 | 72.80 | F | 67.4543 | 67.49835 |
1924 | 72.20 | F | 67.1944 | 67.23848 |
1926 | 70.00 | F | 66.6746 | 66.71872 |
1929 | 69.40 | F | 65.8949 | 65.93909 |
1930 | 68.00 | F | 65.6350 | 65.67921 |
1931 | 66.60 | F | 65.3751 | 65.41934 |
1933 | 66.00 | F | 64.8553 | 64.89958 |
1934 | 65.40 | F | 64.5954 | 64.63971 |
1936 | 64.60 | F | 64.0756 | 64.11995 |
1956 | 62.00 | F | 58.8776 | 58.92241 |
1958 | 61.20 | F | 58.3578 | 58.40266 |
1960 | 60.20 | F | 57.8380 | 57.88290 |
1962 | 59.50 | F | 57.3182 | 57.36315 |
1964 | 58.90 | F | 56.7984 | 56.84339 |
1972 | 58.50 | F | 54.7192 | 54.76438 |
1973 | 57.54 | F | 54.4593 | 54.50450 |
1974 | 56.96 | F | 54.1994 | 54.24462 |
1976 | 55.65 | F | 53.6796 | 53.72487 |
1978 | 55.41 | F | 53.1598 | 53.20511 |
1980 | 54.79 | F | 52.6400 | 52.68536 |
1986 | 54.73 | F | 51.0806 | 51.12610 |
1992 | 54.48 | F | 49.5212 | 49.56683 |
1994 | 54.01 | F | 49.0014 | 49.04708 |
2000 | 53.77 | F | 47.4420 | 47.48782 |
2004 | 53.52 | F | 46.4024 | 46.44831 |
model <- lm(time ~ 1 + year, data = SwimRecords)

SwimRecords_predict <- SwimRecords %>%
  mutate(with_formula = 567.2420*1 + -0.2599*year) %>%
  mutate(with_predict = predict(model, SwimRecords))

SwimRecords_predict %>%
  ggplot(aes(x = year, y = time)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
2.3 Three inputs
We can see that the previous model allowed us to capture the effect of year on record time, but we still have some unexplained variation. We can include sex in the model to capture the difference in record times by sex.
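Note that sex is categorical, so it cannot be multiplied by a weight directly; lm() dummy-codes it into a 0/1 indicator column (sexM below, with F as the reference level). One way to peek at the inputs lm() actually constructs:

# model.matrix() builds the input columns used for fitting:
# an all-ones intercept column, year, and the 0/1 sexM indicator
head(model.matrix(time ~ 1 + year + sex, data = SwimRecords))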
Model specification:
\(y = w_1\cdot\mathbf{1} + w_2\cdot\mathbf{year} + w_3\cdot\mathbf{sex}\)
Call:
lm(formula = time ~ 1 + year + sex, data = SwimRecords)
Coefficients:
(Intercept) year sexM
555.7168 -0.2515 -9.7980
Fitted model:
\(y = 555.7168 \cdot \mathbf{1} + -0.2515 \cdot \mathbf{year} + -9.7980 \cdot \mathbf{sex}\)
year | time | sex | sex_numeric | with_formula | with_predict |
---|---|---|---|---|---|
1905 | 65.80 | M | 1 | 66.8113 | 66.88051 |
1908 | 65.60 | M | 1 | 66.0568 | 66.12612 |
1910 | 62.80 | M | 1 | 65.5538 | 65.62319 |
1912 | 61.60 | M | 1 | 65.0508 | 65.12026 |
1918 | 61.40 | M | 1 | 63.5418 | 63.61148 |
1920 | 60.40 | M | 1 | 63.0388 | 63.10855 |
1922 | 58.60 | M | 1 | 62.5358 | 62.60563 |
1924 | 57.40 | M | 1 | 62.0328 | 62.10270 |
1934 | 56.80 | M | 1 | 59.5178 | 59.58806 |
1935 | 56.60 | M | 1 | 59.2663 | 59.33660 |
1936 | 56.40 | M | 1 | 59.0148 | 59.08513 |
1944 | 55.90 | M | 1 | 57.0028 | 57.07343 |
1947 | 55.80 | M | 1 | 56.2483 | 56.31903 |
1948 | 55.40 | M | 1 | 55.9968 | 56.06757 |
1955 | 54.80 | M | 1 | 54.2363 | 54.30732 |
1957 | 54.60 | M | 1 | 53.7333 | 53.80440 |
1961 | 53.60 | M | 1 | 52.7273 | 52.79854 |
1964 | 52.90 | M | 1 | 51.9728 | 52.04415 |
1967 | 52.60 | M | 1 | 51.2183 | 51.28976 |
1968 | 52.20 | M | 1 | 50.9668 | 51.03830 |
1970 | 51.90 | M | 1 | 50.4638 | 50.53537 |
1972 | 51.22 | M | 1 | 49.9608 | 50.03244 |
1975 | 50.59 | M | 1 | 49.2063 | 49.27805 |
1976 | 49.44 | M | 1 | 48.9548 | 49.02659 |
1981 | 49.36 | M | 1 | 47.6973 | 47.76927 |
1985 | 49.24 | M | 1 | 46.6913 | 46.76341 |
1986 | 48.74 | M | 1 | 46.4398 | 46.51195 |
1988 | 48.42 | M | 1 | 45.9368 | 46.00902 |
1994 | 48.21 | M | 1 | 44.4278 | 44.50024 |
2000 | 48.18 | M | 1 | 42.9188 | 42.99146 |
2000 | 47.84 | M | 1 | 42.9188 | 42.99146 |
1908 | 95.00 | F | 0 | 75.8548 | 75.92408 |
1910 | 86.60 | F | 0 | 75.3518 | 75.42115 |
1911 | 84.60 | F | 0 | 75.1003 | 75.16969 |
1912 | 78.80 | F | 0 | 74.8488 | 74.91822 |
1915 | 76.20 | F | 0 | 74.0943 | 74.16383 |
1920 | 73.60 | F | 0 | 72.8368 | 72.90651 |
1923 | 72.80 | F | 0 | 72.0823 | 72.15212 |
1924 | 72.20 | F | 0 | 71.8308 | 71.90066 |
1926 | 70.00 | F | 0 | 71.3278 | 71.39773 |
1929 | 69.40 | F | 0 | 70.5733 | 70.64334 |
1930 | 68.00 | F | 0 | 70.3218 | 70.39188 |
1931 | 66.60 | F | 0 | 70.0703 | 70.14041 |
1933 | 66.00 | F | 0 | 69.5673 | 69.63749 |
1934 | 65.40 | F | 0 | 69.3158 | 69.38602 |
1936 | 64.60 | F | 0 | 68.8128 | 68.88310 |
1956 | 62.00 | F | 0 | 63.7828 | 63.85382 |
1958 | 61.20 | F | 0 | 63.2798 | 63.35090 |
1960 | 60.20 | F | 0 | 62.7768 | 62.84797 |
1962 | 59.50 | F | 0 | 62.2738 | 62.34504 |
1964 | 58.90 | F | 0 | 61.7708 | 61.84211 |
1972 | 58.50 | F | 0 | 59.7588 | 59.83040 |
1973 | 57.54 | F | 0 | 59.5073 | 59.57894 |
1974 | 56.96 | F | 0 | 59.2558 | 59.32748 |
1976 | 55.65 | F | 0 | 58.7528 | 58.82455 |
1978 | 55.41 | F | 0 | 58.2498 | 58.32162 |
1980 | 54.79 | F | 0 | 57.7468 | 57.81869 |
1986 | 54.73 | F | 0 | 56.2378 | 56.30991 |
1992 | 54.48 | F | 0 | 54.7288 | 54.80113 |
1994 | 54.01 | F | 0 | 54.2258 | 54.29820 |
2000 | 53.77 | F | 0 | 52.7168 | 52.78942 |
2004 | 53.52 | F | 0 | 51.7108 | 51.78357 |
model <- lm(time ~ 1 + year + sex, data = SwimRecords)

SwimRecords_predict <- SwimRecords %>%
  mutate(sex_numeric = case_when(
    sex == 'M' ~ 1,
    sex == 'F' ~ 0
  )) %>%
  mutate(with_formula = 555.7168*1 + -0.2515*year + -9.7980*sex_numeric) %>%
  mutate(with_predict = predict(model, SwimRecords))

SwimRecords_predict %>%
  ggplot(aes(x = year, y = time, shape = sex)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
2.4 Interaction
Notice that the previous model captures record times getting faster by year, and gives different predictions for men and women (women have slower times). But it misses another relationship we can see in our data: women are getting faster, faster. To express that the effect of one explanatory variable on the response variable differs at different values of another explanatory variable (e.g. the effect of year on record times is different for men and women), we add a term to the model in which we multiply the values of the interacting variables.
We could say that we “expand the input space” of the model, since we add terms to capture the interaction.
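In R's formula syntax, `year * sex` is shorthand for the main effects plus their interaction; the interaction term alone is written with a colon. These two calls specify the same model:

# year * sex expands to year + sex + year:sex
lm(time ~ 1 + year * sex, data = SwimRecords)
lm(time ~ 1 + year + sex + year:sex, data = SwimRecords)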
Model specification:
\(y = w_1\cdot\mathbf{1} + w_2\cdot\mathbf{year} + w_3\cdot\mathbf{sex} + w_4\cdot\mathbf{year\times{sex}}\)
Call:
lm(formula = time ~ 1 + year * sex, data = SwimRecords)
Coefficients:
(Intercept) year sexM year:sexM
697.3012 -0.3240 -302.4638 0.1499
Fitted model:
\[\begin{align}
y = &697.3012 \cdot \mathbf{1} + -0.3240 \cdot \mathbf{year} + -302.4638 \cdot \mathbf{sex}\\
&+ 0.1499 \cdot \mathbf{year\times{sex}}
\end{align}\]
year | time | sex | with_predict |
---|---|---|---|
1905 | 65.80 | M | 63.12106 |
1908 | 65.60 | M | 62.59867 |
1910 | 62.80 | M | 62.25041 |
1912 | 61.60 | M | 61.90215 |
1918 | 61.40 | M | 60.85738 |
1920 | 60.40 | M | 60.50912 |
1922 | 58.60 | M | 60.16086 |
1924 | 57.40 | M | 59.81260 |
1934 | 56.80 | M | 58.07131 |
1935 | 56.60 | M | 57.89718 |
1936 | 56.40 | M | 57.72305 |
1944 | 55.90 | M | 56.33002 |
1947 | 55.80 | M | 55.80763 |
1948 | 55.40 | M | 55.63350 |
1955 | 54.80 | M | 54.41459 |
1957 | 54.60 | M | 54.06634 |
1961 | 53.60 | M | 53.36982 |
1964 | 52.90 | M | 52.84743 |
1967 | 52.60 | M | 52.32504 |
1968 | 52.20 | M | 52.15091 |
1970 | 51.90 | M | 51.80266 |
1972 | 51.22 | M | 51.45440 |
1975 | 50.59 | M | 50.93201 |
1976 | 49.44 | M | 50.75788 |
1981 | 49.36 | M | 49.88723 |
1985 | 49.24 | M | 49.19072 |
1986 | 48.74 | M | 49.01659 |
1988 | 48.42 | M | 48.66833 |
1994 | 48.21 | M | 47.62355 |
2000 | 48.18 | M | 46.57878 |
2000 | 47.84 | M | 46.57878 |
1908 | 95.00 | F | 79.02170 |
1910 | 86.60 | F | 78.37361 |
1911 | 84.60 | F | 78.04956 |
1912 | 78.80 | F | 77.72552 |
1915 | 76.20 | F | 76.75338 |
1920 | 73.60 | F | 75.13315 |
1923 | 72.80 | F | 74.16101 |
1924 | 72.20 | F | 73.83697 |
1926 | 70.00 | F | 73.18887 |
1929 | 69.40 | F | 72.21674 |
1930 | 68.00 | F | 71.89269 |
1931 | 66.60 | F | 71.56864 |
1933 | 66.00 | F | 70.92055 |
1934 | 65.40 | F | 70.59651 |
1936 | 64.60 | F | 69.94842 |
1956 | 62.00 | F | 63.46750 |
1958 | 61.20 | F | 62.81941 |
1960 | 60.20 | F | 62.17131 |
1962 | 59.50 | F | 61.52322 |
1964 | 58.90 | F | 60.87513 |
1972 | 58.50 | F | 58.28276 |
1973 | 57.54 | F | 57.95872 |
1974 | 56.96 | F | 57.63467 |
1976 | 55.65 | F | 56.98658 |
1978 | 55.41 | F | 56.33849 |
1980 | 54.79 | F | 55.69040 |
1986 | 54.73 | F | 53.74612 |
1992 | 54.48 | F | 51.80185 |
1994 | 54.01 | F | 51.15375 |
2000 | 53.77 | F | 49.20948 |
2004 | 53.52 | F | 47.91330 |
model <- lm(time ~ 1 + year * sex, data = SwimRecords)

SwimRecords_predict <- SwimRecords %>%
  mutate(with_predict = predict(model, SwimRecords))

SwimRecords_predict %>%
  ggplot(aes(x = year, y = time, shape = sex)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
2.5 Transformation
Now our model is doing a great job at predicting our data, but there may be more we want to do. For example, we can see that the model is not predicting women's records very well around the year 2000 (it predicts they will be faster than they are). If we want to allow the model to take a curved shape, capturing that women gained on men for a while but are now slowing down, we can add a term to the model in which we square the year. This lets us capture a nonlinear curve or bend in the data (more on polynomials in the next section).
But notice that although the model now fits the data well, it still behaves a bit nonsensically toward the 2000s, predicting that record times are getting slower. Impossible!
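A syntax note: inside a formula, ^ has a special meaning (crossing terms), so the transformation must be wrapped in I() for the arithmetic to be taken literally. Alternatively, a sketch of the same model computed via a hand-made column (year2 and SwimRecords_sq are names introduced here for illustration):

# I() protects year^2 so it means "year squared", not formula crossing
lm(time ~ 1 + year * sex + I(year^2), data = SwimRecords)

# equivalent sketch: precompute the squared term as its own column
SwimRecords_sq <- SwimRecords %>% mutate(year2 = year^2)
lm(time ~ 1 + year * sex + year2, data = SwimRecords_sq)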
Model specification:
\[\begin{align}
y = &w_1\cdot\mathbf{1} + w_2\cdot\mathbf{year} + w_3\cdot\mathbf{sex} \\
&+ w_4\cdot\mathbf{year\times{sex}} +w_5\cdot\mathbf{year}^2
\end{align}\]
Call:
lm(formula = time ~ 1 + year * sex + I(year^2), data = SwimRecords)
Coefficients:
(Intercept) year sexM I(year^2) year:sexM
1.110e+04 -1.098e+01 -3.171e+02 2.729e-03 1.575e-01
Fitted model:
\[\begin{align}
y = &11100 \cdot \mathbf{1} + -10.98 \cdot \mathbf{year} + -317.1 \cdot \mathbf{sex}\\
&+ 0.1575 \cdot \mathbf{year\times{sex}} + 0.002729 \cdot \mathbf{year}^2
\end{align}\]
year | time | sex | with_predict |
---|---|---|---|
1905 | 65.80 | M | 66.81874 |
1908 | 65.60 | M | 65.55576 |
1910 | 62.80 | M | 64.74106 |
1912 | 61.60 | M | 63.94819 |
1918 | 61.40 | M | 61.70057 |
1920 | 60.40 | M | 60.99502 |
1922 | 58.60 | M | 60.31130 |
1924 | 57.40 | M | 59.64941 |
1934 | 56.80 | M | 56.66741 |
1935 | 56.60 | M | 56.39922 |
1936 | 56.40 | M | 56.13650 |
1944 | 55.90 | M | 54.23115 |
1947 | 55.80 | M | 53.60669 |
1948 | 55.40 | M | 53.40946 |
1955 | 54.80 | M | 52.18160 |
1957 | 54.60 | M | 51.87991 |
1961 | 53.60 | M | 51.34200 |
1964 | 52.90 | M | 50.99587 |
1967 | 52.60 | M | 50.69886 |
1968 | 52.20 | M | 50.61078 |
1970 | 51.90 | M | 50.45097 |
1972 | 51.22 | M | 50.31300 |
1975 | 50.59 | M | 50.14697 |
1976 | 49.44 | M | 50.10254 |
1981 | 49.36 | M | 49.96226 |
1985 | 49.24 | M | 49.94827 |
1986 | 48.74 | M | 49.95841 |
1988 | 48.42 | M | 49.99508 |
1994 | 48.21 | M | 50.23605 |
2000 | 48.18 | M | 50.67349 |
2000 | 47.84 | M | 50.67349 |
1908 | 95.00 | F | 82.16082 |
1910 | 86.60 | F | 81.03116 |
1911 | 84.60 | F | 80.47451 |
1912 | 78.80 | F | 79.92332 |
1915 | 76.20 | F | 78.30250 |
1920 | 73.60 | F | 75.71028 |
1923 | 72.80 | F | 74.22044 |
1924 | 72.20 | F | 73.73474 |
1926 | 70.00 | F | 72.77971 |
1929 | 69.40 | F | 71.38810 |
1930 | 68.00 | F | 70.93515 |
1931 | 66.60 | F | 70.48765 |
1933 | 66.00 | F | 69.60903 |
1934 | 65.40 | F | 69.17790 |
1936 | 64.60 | F | 68.33203 |
1956 | 62.00 | F | 61.07389 |
1958 | 61.20 | F | 60.46814 |
1960 | 60.20 | F | 59.88422 |
1962 | 59.50 | F | 59.32213 |
1964 | 58.90 | F | 58.78187 |
1972 | 58.50 | F | 56.83913 |
1973 | 57.54 | F | 56.62085 |
1974 | 56.96 | F | 56.40802 |
1976 | 55.65 | F | 55.99874 |
1978 | 55.41 | F | 55.61129 |
1980 | 54.79 | F | 55.24567 |
1986 | 54.73 | F | 54.27978 |
1992 | 54.48 | F | 53.51036 |
1994 | 54.01 | F | 53.29755 |
2000 | 53.77 | F | 52.79009 |
2004 | 53.52 | F | 52.56093 |
model <- lm(time ~ 1 + year * sex + I(year^2), data = SwimRecords)

SwimRecords_predict <- SwimRecords %>%
  mutate(with_predict = predict(model, SwimRecords))

SwimRecords_predict %>%
  ggplot(aes(x = year, y = time, shape = sex)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
3 Linearizing nonlinear models
When you want to linearize a nonlinear model, you’re trying to fit a linear model to data that doesn’t naturally follow a straight line. There are two common ways to approach this (see the code sketch after this list):
- Expanding the input space with polynomials. Polynomials can capture “bumps” or curves in the data. In this approach, we add terms to the model, like squares or cubes of the original variable.
- \(y = w_1 + w_2x + w_3x^2\)
- Transforming the data involves applying mathematical functions to existing inputs to alter their scale or distributions. Common transformations include taking the logarithm or square root. Taking the logarithm of a variable compresses its range and reduces skewness in the data (as in the brain size and body weight data).
- both output and input: \(log(y) = w_1 + w_2 log(x)\)
- just input: \(y = w_1 + w_2 log(x)\)
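Both approaches amount to a choice of model specification in R. A generic sketch (df, x, and y are placeholders, not objects from this page):

# expanding the input space with a squared term
lm(y ~ 1 + x + I(x^2), data = df)

# log-transforming both output and input
lm(log(y) ~ 1 + log(x), data = df)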
4 Plant heights (polynomials)
Polynomials capture “bumps” or curves in the data, and the number of these bumps depends on the degree of the polynomial. The higher the degree, the more complex the shape the polynomial can represent (see the sketch after this list).
- Degree 1 (Linear): A straight line. There are no bumps or curves. The relationship between the predictor and the response is either increasing or decreasing at a constant rate.
- Degree 2 (Quadratic): A single bump or curve. The graph is either a U-shape (bowl) or an upside-down U-shape (hill), meaning it can capture one turning point.
- Degree 3 (Cubic): Can capture two bumps (or one “S” shaped curve). The graph can have two turning points, meaning it can start by increasing, then decrease, and increase again (or the opposite).
- Degree 4 (Quartic): Can capture three bumps or changes in direction. The graph can have up to three turning points, allowing for more complex shapes and curves in the data.
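To see these shapes, here is a small illustrative sketch (the coefficients are made up for display, not fitted to anything) plotting one example polynomial per degree:

# illustrative curves only: one polynomial per degree, coefficients chosen by hand
tibble(x = seq(-2, 2, by = 0.05)) %>%
  mutate(
    `degree 1` = x,              # straight line: no turning points
    `degree 2` = x^2,            # one turning point (U shape)
    `degree 3` = x^3 - x,        # two turning points (S shape)
    `degree 4` = x^4 - 2 * x^2   # three turning points
  ) %>%
  pivot_longer(-x, names_to = "degree", values_to = "y") %>%
  ggplot(aes(x = x, y = y)) +
  geom_line() +
  facet_wrap(~ degree, scales = "free_y") +
  theme_bw(base_size = 14)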
4.1 Degree 1 (Linear)
Remember, a Degree 1 (Linear) polynomial is a straight line. There are no bumps or curves; the relationship between the predictor and the response is either increasing or decreasing at a constant rate. This doesn’t seem to capture the relationship between light_exposure and plant_height in our data.
Model specification:
\(y = w_1\cdot\mathbf{1} + w_2 \cdot \mathbf{x}\)
Call:
lm(formula = plant_height ~ 1 + light_exposure, data = poly_plants)
Coefficients:
(Intercept) light_exposure
31.346 3.619
Fitted model:
\(y = 31.346 \cdot 1 + 3.619 \cdot x\)
plant | light_exposure | plant_height | with_formula | with_predict |
---|---|---|---|---|
Sunflower | 0 | 10 | 31.346 | 31.34615 |
Sunflower | 1 | 15 | 34.965 | 34.96504 |
Sunflower | 2 | 25 | 38.584 | 38.58392 |
Rose | 3 | 40 | 42.203 | 42.20280 |
Rose | 4 | 55 | 45.822 | 45.82168 |
Rose | 5 | 70 | 49.441 | 49.44056 |
Cactus | 6 | 85 | 53.060 | 53.05944 |
Cactus | 7 | 95 | 56.679 | 56.67832 |
Cactus | 8 | 90 | 60.298 | 60.29720 |
Orchid | 9 | 70 | 63.917 | 63.91608 |
Orchid | 10 | 40 | 67.536 | 67.53496 |
Orchid | 11 | 20 | 71.155 | 71.15385 |
poly_plants <- read_csv('https://kathrynschuler.com/datasci/assests/csv/polynomial_plants.csv')

model <- lm(plant_height ~ 1 + light_exposure, data = poly_plants)

poly_plants <- poly_plants %>%
  mutate(with_formula = 31.346*1 + 3.619*light_exposure) %>%
  mutate(with_predict = predict(model, poly_plants))

poly_plants %>%
  ggplot(aes(x = light_exposure, y = plant_height)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
4.2 Degree 2 (Quadratic)
In a Degree 2 (Quadratic) polynomial, we can express a single bump or curve. The graph is either a U-shape (bowl) or an upside-down U-shape (hill), meaning it can capture one turning point. This provides a better fit for our data, allowing us to express that plant height rises and then falls again as light exposure increases. But it looks like there is another “bump” in the data, bending back upward around a light exposure of 1 or 2.
Model specification:
\(y = w_1\cdot\mathbf{1} + w_2 \cdot \mathbf{x} + w_3 \cdot \mathbf{x}^2\)
Call:
lm(formula = plant_height ~ 1 + light_exposure + I(light_exposure^2),
data = poly_plants)
Coefficients:
(Intercept) light_exposure I(light_exposure^2)
-9.245 27.973 -2.214
Fitted model:
\(y = -9.245 \cdot\mathbf{1} + 27.973 \cdot \mathbf{x} + -2.214 \cdot \mathbf{x}^2\)
plant | light_exposure | plant_height | with_predict |
---|---|---|---|
Sunflower | 0 | 10 | -9.244506 |
Sunflower | 1 | 15 | 16.514735 |
Sunflower | 2 | 25 | 37.845904 |
Rose | 3 | 40 | 54.749001 |
Rose | 4 | 55 | 67.224026 |
Rose | 5 | 70 | 75.270979 |
Cactus | 6 | 85 | 78.889860 |
Cactus | 7 | 95 | 78.080669 |
Cactus | 8 | 90 | 72.843407 |
Orchid | 9 | 70 | 63.178072 |
Orchid | 10 | 40 | 49.084665 |
Orchid | 11 | 20 | 30.563187 |
model <- lm(plant_height ~ 1 + light_exposure + I(light_exposure^2), data = poly_plants)

poly_plants <- poly_plants %>%
  mutate(with_predict = predict(model, poly_plants))

poly_plants %>%
  ggplot(aes(x = light_exposure, y = plant_height)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
4.3 Degree 3 (Cubic)
In a Degree 3 (Cubic) polynomial, we can capture two bumps (or one “S” shaped curve). The graph can have two turning points, meaning it can start by increasing, then decrease, and increase again (or the opposite). This captures the data quite nicely.
Model specification:
\(y = w_1\cdot\mathbf{1} + w_2 \cdot \mathbf{x} + w_3 \cdot \mathbf{x}^2 + w_4 \cdot \mathbf{x}^3\)
Call:
lm(formula = plant_height ~ 1 + light_exposure + I(light_exposure^2) +
I(light_exposure^3), data = poly_plants)
Coefficients:
(Intercept) light_exposure I(light_exposure^2)
8.7363 2.7276 3.7796
I(light_exposure^3)
-0.3632
Fitted model:
\(y = 8.7363 \cdot \mathbf{1} + 2.7276 \cdot \mathbf{x} + 3.7796 \cdot \mathbf{x}^2 + -0.3632 \cdot \mathbf{x}^3\)
plant | light_exposure | plant_height | with_predict |
---|---|---|---|
Sunflower | 0 | 10 | 8.736264 |
Sunflower | 1 | 15 | 14.880120 |
Sunflower | 2 | 25 | 26.403596 |
Rose | 3 | 40 | 41.127206 |
Rose | 4 | 55 | 56.871462 |
Rose | 5 | 70 | 71.456877 |
Cactus | 6 | 85 | 82.703963 |
Cactus | 7 | 95 | 88.433233 |
Cactus | 8 | 90 | 86.465202 |
Orchid | 9 | 70 | 74.620380 |
Orchid | 10 | 40 | 50.719281 |
Orchid | 11 | 20 | 12.582418 |
model <- lm(plant_height ~ 1 + light_exposure + I(light_exposure^2) + I(light_exposure^3), data = poly_plants)

poly_plants <- poly_plants %>%
  mutate(with_predict = predict(model, poly_plants))

poly_plants %>%
  ggplot(aes(x = light_exposure, y = plant_height)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
5 Brain size (log)
5.1 Untransformed
When we have a nonlinear relationship, as here, we could just try to fit a linear model to the untransformed data. It technically works (there is no mathematical reason that prevents us from fitting this model), but we can see that it is a very bad description of the data.
Model specification:
\(y = w_1\cdot\mathbf{1} + w_2\cdot\mathbf{body\_size\_kg}\)
Call:
lm(formula = brain_size_cc ~ 1 + body_size_kg, data = brain_data)
Coefficients:
(Intercept) body_size_kg
816.59014 0.05021
Fitted model:
\(y = 816.59014 \cdot\mathbf{1} + 0.05021 \cdot\mathbf{body\_size\_kg}\)
Species | brain_size_cc | body_size_kg | with_predict |
---|---|---|---|
Mouse | 0.4 | 2.0e-02 | 816.5911 |
Rat | 2.0 | 2.5e-01 | 816.6027 |
Rabbit | 12.0 | 1.5e+00 | 816.6655 |
Cat | 25.0 | 4.5e+00 | 816.8161 |
Dog | 50.0 | 1.0e+01 | 817.0923 |
Sheep | 150.0 | 7.0e+01 | 820.1049 |
Pig | 300.0 | 1.0e+02 | 821.6113 |
Goat | 450.0 | 5.0e+01 | 819.1007 |
Gorilla | 500.0 | 1.8e+02 | 825.6282 |
Horse | 600.0 | 4.0e+02 | 836.6747 |
Human | 1300.0 | 7.0e+01 | 820.1049 |
Chimpanzee | 400.0 | 6.0e+01 | 819.6028 |
Dolphin | 1500.0 | 2.0e+02 | 826.6324 |
Whale (Orca) | 6000.0 | 5.0e+03 | 1067.6469 |
Elephant | 6000.0 | 6.0e+03 | 1117.8583 |
Blue Whale | 8000.0 | 1.5e+05 | 8348.2943 |
Giraffe | 600.0 | 8.0e+02 | 856.7592 |
Rhinoceros | 450.0 | 1.2e+03 | 876.8438 |
Walrus | 400.0 | 8.0e+02 | 856.7592 |
Tiger | 90.0 | 2.2e+02 | 827.6366 |
Kangaroo | 50.0 | 6.0e+01 | 819.6028 |
Crocodile | 200.0 | 4.0e+02 | 836.6747 |
Penguin | 20.0 | 3.0e+01 | 818.0965 |
brain_data <- read_csv('https://kathrynschuler.com/datasci/assests/csv/animal_brain_body_size.csv') %>%
  rename(brain_size_cc = `Brain Size (cc)`, body_size_kg = `Body Size (kg)`)

model <- lm(brain_size_cc ~ 1 + body_size_kg, data = brain_data)

brain_data <- brain_data %>%
  mutate(with_predict = predict(model, brain_data))

brain_data %>%
  ggplot(aes(x = body_size_kg, y = brain_size_cc)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
5.2 Log transformed
We can apply a log transform directly in the model specification provided to R. This works great, but if we try to plot the fitted model on the untransformed data (e.g. if we use brain_size_cc and body_size_kg as our y and x aesthetics), something doesn’t look right: the predictions from this model are on the log scale, while the raw data are not. Instead, we should plot the log-transformed data as well, so the model predictions match the data.
Model specification:
\(log(y) = w_1\cdot\mathbf{1} + w_2 \cdot log(\mathbf{body\_size\_kg})\)
Call:
lm(formula = log(brain_size_cc) ~ 1 + log(body_size_kg), data = brain_data)
Coefficients:
(Intercept) log(body_size_kg)
2.2042 0.6687
Fitted model:
\(log(y) = 2.2042\cdot\mathbf{1} + 0.6687 \cdot log(\mathbf{body\_size\_kg})\)
Species | brain_size_cc | body_size_kg | log_brain_size_cc | log_body_size_kg | with_predict |
---|---|---|---|---|---|
Mouse | 0.4 | 2.0e-02 | -0.9162907 | -3.9120230 | -0.4116321 |
Rat | 2.0 | 2.5e-01 | 0.6931472 | -1.3862944 | 1.2772249 |
Rabbit | 12.0 | 1.5e+00 | 2.4849066 | 0.4054651 | 2.4753051 |
Cat | 25.0 | 4.5e+00 | 3.2188758 | 1.5040774 | 3.2099046 |
Dog | 50.0 | 1.0e+01 | 3.9120230 | 2.3025851 | 3.7438358 |
Sheep | 150.0 | 7.0e+01 | 5.0106353 | 4.2484952 | 5.0449906 |
Pig | 300.0 | 1.0e+02 | 5.7037825 | 4.6051702 | 5.2834853 |
Goat | 450.0 | 5.0e+01 | 6.1092476 | 3.9120230 | 4.8200046 |
Gorilla | 500.0 | 1.8e+02 | 6.2146081 | 5.1929569 | 5.6765155 |
Horse | 600.0 | 4.0e+02 | 6.3969297 | 5.9914645 | 6.2104467 |
Human | 1300.0 | 7.0e+01 | 7.1701195 | 4.2484952 | 5.0449906 |
Chimpanzee | 400.0 | 6.0e+01 | 5.9914645 | 4.0943446 | 4.9419160 |
Dolphin | 1500.0 | 2.0e+02 | 7.3132204 | 5.2983174 | 5.7469660 |
Whale (Orca) | 6000.0 | 5.0e+03 | 8.6995147 | 8.5171932 | 7.8993037 |
Elephant | 6000.0 | 6.0e+03 | 8.6995147 | 8.6995147 | 8.0212150 |
Blue Whale | 8000.0 | 1.5e+05 | 8.9871968 | 11.9183906 | 10.1735527 |
Giraffe | 600.0 | 8.0e+02 | 6.3969297 | 6.6846117 | 6.6739274 |
Rhinoceros | 450.0 | 1.2e+03 | 6.1092476 | 7.0900768 | 6.9450462 |
Walrus | 400.0 | 8.0e+02 | 5.9914645 | 6.6846117 | 6.6739274 |
Tiger | 90.0 | 2.2e+02 | 4.4998097 | 5.3936275 | 5.8106962 |
Kangaroo | 50.0 | 6.0e+01 | 3.9120230 | 4.0943446 | 4.9419160 |
Crocodile | 200.0 | 4.0e+02 | 5.2983174 | 5.9914645 | 6.2104467 |
Penguin | 20.0 | 3.0e+01 | 2.9957323 | 3.4011974 | 4.4784353 |
model <- lm(log(brain_size_cc) ~ 1 + log(body_size_kg), data = brain_data)

brain_data <- brain_data %>%
  mutate(
    log_brain_size_cc = log(brain_size_cc),
    log_body_size_kg = log(body_size_kg)
  ) %>%
  mutate(with_predict = predict(model, brain_data))

brain_data %>%
  ggplot(aes(x = log_body_size_kg, y = log_brain_size_cc)) +
  geom_point() +
  geom_line(aes(y = with_predict), color = "blue") +
  theme_bw(base_size = 14)
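If we do want to see this fit against the untransformed variables, one option is a sketch like the following: back-transform the predictions with exp() and put the axes on log scales (pred_cc is a name introduced here; exponentiating a log-scale prediction gives a convenient back-transformed curve, not exactly the mean on the original scale):

# back-transform predictions to the original units, then log-scale the axes
brain_data %>%
  mutate(pred_cc = exp(with_predict)) %>%
  ggplot(aes(x = body_size_kg, y = brain_size_cc)) +
  geom_point() +
  geom_line(aes(y = pred_cc), color = "blue") +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw(base_size = 14)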
6 Further reading
- Ch 6: Language of models in Statistical Modeling