Lab 7: Applied model specification

Not graded, just practice

Author

Katie Schuler

Published

October 17, 2024

Practice your new modeling skills with these practice exam questions! Best to open a fresh Google Colab notebook and test things out! Refer to the study guide to find answers as well.

1 Primate brains

Primates have brains of varying sizes, and one possible explanation for this variation is differences in body size. Larger-bodied primates may tend to have heavier brains, but this relationship is not always straightforward. To investigate whether body size can reliably explain differences in brain weight across primate species, let’s fit a model that predicts brain weight based on body size.

The data, in case you want to work with it yourself: primate brains

Code

library(tidyverse)  # provides read_csv(), glimpse(), ggplot(), and %>%
data <- read_csv("https://kathrynschuler.com/datasci/assests/csv/primate_brains.csv")
glimpse(data)

Rows: 144
Columns: 5
$ taxon          <chr> "Alouatta_caraya", "Alouatta_palliata", "Alouatta_pigra…
$ body_weight_g  <dbl> 5597, 6359, 8940, 6247, 1073, 870, 871, 239, 6409, 8034…
$ brain_weight_g <dbl> 52.72, 50.91, 52.97, 56.57, 21.41, 16.78, 17.21, 7.17, …
$ diet_category  <chr> "Fol", "Fol", "Fol", "Fol", "Frug/Fol", "Frug", "Frug",…
$ group_size     <dbl> 6.68, 15.55, 5.93, 6.97, 3.00, 3.50, 3.51, 1.25, 16.40,…

Code

ggplot(data, aes( x = body_weight_g, y = brain_weight_g)) +
    geom_point()

1.2 Model specification

Suppose we specify the following model for the primate brains data: \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)

What is the response variable?

brain_weight_g
body_weight_g
log(brain_weight_g)
log(body_weight_g)

What is the explanatory variable?

brain_weight_g
body_weight_g
log(brain_weight_g)
log(body_weight_g)

True or false, the functional form of this model can be expressed as a weighted sum of inputs? \(y=\sum_{i=1}^{n}w_ix_i\)

True
False

Which of the following model terms are included in the model specification above? Choose all that apply.

Intercept
Main
Interaction
Transformation

Specify the model equation in R notation.

Answer

# like this (explicit intercept)
log(brain_weight_g) ~ 1 + log(body_weight_g)

# or like this (implicit intercept)
log(brain_weight_g) ~ log(body_weight_g)
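
Both spellings describe the same model, because R includes an intercept unless you explicitly remove it. A minimal sketch to check this yourself, using a small made-up data set (`toy` and its values are invented for illustration and are not the primate data):

```r
# Toy data, made up for illustration only
set.seed(1)
toy <- data.frame(body_weight_g = c(250, 900, 1100, 5600, 6400, 8900))
toy$brain_weight_g <- exp(-2.5 + 0.78 * log(toy$body_weight_g) + rnorm(6, sd = 0.1))

# explicit intercept
fit_explicit <- lm(log(brain_weight_g) ~ 1 + log(body_weight_g), data = toy)

# implicit intercept
fit_implicit <- lm(log(brain_weight_g) ~ log(body_weight_g), data = toy)

# compare the estimated coefficients of the two fits
all.equal(coef(fit_explicit), coef(fit_implicit))
```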

1.3 Fitted model

Suppose you fit the model with lm() and it returns the following:


Call:
lm(formula = log(brain_weight_g) ~ 1 + log(body_weight_g), data = data)

Coefficients:
       (Intercept)  log(body_weight_g)  
           -2.4649              0.7752

Which of the following is \(w_1\) in the model specification \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)?

1
-2.4649
0.7752
Not enough information to determine this

Which of the following is \(w_2\) in the model specification \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)?

1
-2.4649
0.7752
Not enough information to determine this

Suppose a primate has a \(\log(body\_weight\_g)\) equal to 10. Which of the following would the model predict to be the primate’s \(\log(brain\_weight\_g)\)?

25.21
5.29
-10.7752
Not enough information to determine this

Which of the following figures could show the fitted model?

the blue line
the red line
the black line
Not enough information to determine this
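
One way to check a prediction by hand is to treat it as the weighted sum from the model specification: multiply each coefficient by its input and add the results. A small sketch in R, assuming the coefficients printed by lm() above (the names `w`, `x`, and `prediction` are just illustrative):

```r
# coefficients from the lm() output: (Intercept), log(body_weight_g)
w <- c(-2.4649, 0.7752)

# inputs: the constant 1 and a log(body_weight_g) of 10
x <- c(1, 10)

# weighted sum of inputs = predicted log(brain_weight_g)
prediction <- sum(w * x)
prediction
```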

2 Social brain hypothesis

The Social Brain Hypothesis argues that the pressures of navigating increasingly complex social environments were a significant driver in the evolution of brain size and intelligence in humans and other primates.

Let’s specify and fit a model in R that predicts (log) brain weight from (log) group size.

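primate_brains <- data  # assumes the data loaded in section 1; this section calls it primate_brains
model <- lm(log(brain_weight_g) ~ 1 + log(group_size), 
    data = primate_brains)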

Code

primate_brains <- primate_brains %>%
    mutate(y_body_group = predict(model, primate_brains))

ggplot(primate_brains, aes(
    y = log(brain_weight_g),
    x = log(group_size))
) +
geom_point(size = 2) +
geom_line(color = "blue", aes(y = y_body_group))

Fill in the blank: how many inputs does this model have?

b. Question
Answer

Specify the model as an equation

\(\log(brain\_weight\_g) = w_1 1 + w_2 \log(group\_size)\)

or, if you created new columns in your data with the log-transformed values, for example:

data <- data %>%
    mutate(log_brain_weight = log(brain_weight_g)) %>%
    mutate(log_group_size = log(group_size))

then you could have written:

\(log\_brain\_weight = w_1 1 + w_2 log\_group\_size\)
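
Either route fits the same model; only the coefficient names in the output change. A sketch with made-up numbers (the `toy` values are invented for illustration), assuming dplyr is available as in the rest of the lab:

```r
library(dplyr)

# made-up values for illustration only
toy <- tibble(
    brain_weight_g = c(7.2, 17.0, 21.4, 52.7, 56.6, 50.9),
    group_size     = c(1.25, 3.5, 3.0, 6.7, 7.0, 15.6)
)

# option 1: transform inside the formula
fit_formula <- lm(log(brain_weight_g) ~ 1 + log(group_size), data = toy)

# option 2: create the log columns first, then use them directly
toy_logged <- toy %>%
    mutate(log_brain_weight = log(brain_weight_g)) %>%
    mutate(log_group_size = log(group_size))
fit_columns <- lm(log_brain_weight ~ 1 + log_group_size, data = toy_logged)

# compare the estimates, ignoring the differing coefficient names
all.equal(unname(coef(fit_formula)), unname(coef(fit_columns)))
```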

Given the figure above, which of the following could be the free parameter estimate for \(w_1\)?

1
0.66
2.25
5
Not enough information to determine this

Given the figure above, which of the following could be the free parameter estimate for \(w_2\)?

1
0.66
2.25
5
Not enough information to determine this

Suppose we encounter a primate in a (log) group size of 4. What could be the model prediction for their (log) brain weight?

3.5
4.1
4.9
6.2
Not enough information to determine this

f. Question
Answer

Suppose we wanted to include \(\log(body\_weight\_g)\) back into the model as an additional predictor of \(\log(brain\_weight\_g)\). Specify the model in R.

log(brain_weight_g) ~ 1 + log(group_size) + log(body_weight_g)

3 Fruit v Leaf eaters

Diet may influence the relationship between brain and body size in primates because the type of food a species consumes can impact its ability to meet the energy demands of a larger brain. Fruit-eating primates have access to energy-rich, easily digestible food, which could support the metabolic costs of both a large body and a larger, more complex brain.

Let’s begin by adding diet_category to our plot mapped to the color aesthetic.

Code

primate_brains %>%
    ggplot(aes(
        y = log(brain_weight_g), 
        x = log(body_weight_g),
        color = diet_category
    )) +
    geom_point(size = 2)

Frugivorous (“Frug”) primates primarily eat fruit, while folivorous (“Fol”) primates primarily consume leaves. The “Frug/Fol” category refers to primates that combine both fruit and leaf consumption in their diet. “Om” stands for omnivores, which we might suspect is similar to “Frug/Fol” with more variation in diet. To simplify things, let’s focus our analysis on just the Fol and Frug categories.

Code

fruit_v_leaves <- primate_brains %>%
    filter(diet_category %in% c("Fol", "Frug")) 

fruit_v_leaves %>%
    ggplot(aes(
        x = log(body_weight_g), 
        y = log(brain_weight_g), 
        color = diet_category
    )) +
    geom_point()

Suppose we specify a model that predicts brain weight by body size and diet category.

Code

model <- lm(log(brain_weight_g) ~ log(body_weight_g) + diet_category, data = fruit_v_leaves) 

model


Call:
lm(formula = log(brain_weight_g) ~ log(body_weight_g) + diet_category, 
    data = fruit_v_leaves)

Coefficients:
       (Intercept)  log(body_weight_g)   diet_categoryFrug  
           -2.8047              0.7778              0.4576

a. Question
Answer

Specify the model with a mathematical expression.

\(\log(brain\_weight\_g) = w_1 1 + w_2\log(body\_weight\_g) + w_3 diet\_category\)

b. Question
Answer

Notice we did not include an interaction term between body weight and diet category. Why might a modeler make this decision?

A modeler might choose not to include an interaction term based on exploratory visualization. The scatter plot shows roughly parallel trends for frugivorous and folivorous primates, which suggests that body size influences brain weight similarly regardless of diet (the slope looks about the same in both groups).

You could have also said something relevant to model complexity: the modeler may have noticed in exploratory data analysis that body size seems to influence brain weight similarly, and decided to keep the model simpler and easier to interpret by leaving out the interaction.
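
For comparison, here is a sketch of how the two specifications differ in R notation. `terms()` lists the terms a formula expands to without needing any data. The interaction formula (`f_interaction`) is a hypothetical alternative shown only for illustration; it is not the model fit above.

```r
# the model from this section: main effects only
f_main <- log(brain_weight_g) ~ log(body_weight_g) + diet_category

# hypothetical alternative: * expands to both main effects plus their interaction
f_interaction <- log(brain_weight_g) ~ log(body_weight_g) * diet_category

# list the terms each formula contains
attr(terms(f_main), "term.labels")
attr(terms(f_interaction), "term.labels")
```

The second formula has one extra term, which means one more free parameter to estimate and interpret.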

True or false: the diet_category variable is categorical, so this is a classification problem.

True
False

c. Question
Answer

Write the fitted model as a mathematical expression.

\(\log(brain\_weight\_g) = -2.8047\times{1} + 0.7778\times{\log(body\_weight\_g)} + 0.4576\times{diet\_category}\)

Based on the fitted model returned by lm() above, which level of diet_category is the reference level?

Fol
Frug
Not enough information to determine this

e. Question
Answer

What is the model’s prediction for a primate with a (log) body weight of 7 who eats leaves? Write your answer as a mathematical expression without simplifying it.

\(\log(brain\_weight\_g) = -2.8047\times{1} + 0.7778\times{\mathbf{7}} + 0.4576\times{\mathbf{0}}\)
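
The 0 multiplying 0.4576 reflects how R turns a categorical predictor into numeric inputs: one level is absorbed into the intercept and the other gets a 0/1 indicator column. A sketch using `model.matrix()` on a few made-up rows (the `toy` data frame is invented for illustration), followed by the same weighted sum written in R:

```r
# made-up rows for illustration only
toy <- data.frame(
    body_weight_g = c(5597, 870, 6409, 239),
    diet_category = c("Fol", "Frug", "Fol", "Frug")
)

# the numeric inputs lm() would build for this specification
X <- model.matrix(~ log(body_weight_g) + diet_category, data = toy)
X

# prediction for a leaf eater with log(body_weight_g) = 7,
# using the coefficients printed by lm() above
w <- c(-2.8047, 0.7778, 0.4576)
x_leaf <- c(1, 7, 0)
sum(w * x_leaf)
```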

Fill in the blank. Of the figures below, figure ____ could be the plot of the model we specified.

4 Matching plots to equations

Match the following plots to the equations below. Each plot can be mapped to a unique expression of the linear model equation.

a. \(y = w_11\)
b. \(y = w_11 + w_2x\)
c. \(y = w_11 + w_2z\)
d. \(y = w_11 + w_2x + w_3z\)
e. \(y = w_11 + w_2x + w_3z + w_4x\times{z} + w_5x^2\)
f. \(y = w_11 + w_2x + w_3x^2\)

Which of the equations above has the most inputs? (Enter a lowercase letter a-f.)

Which of the equations above is the most complex model? (Enter a lowercase letter a-f.)

5 Polynomials

What is the purpose of including polynomial terms in a linear model?

1. To improve model interpretability
2. To model nonlinear relationships
3. To reduce overfitting in the model
4. To ensure that residuals are normally distributed

Which of the following is an example of a quadratic polynomial term in a linear model?
1. \(x\)
2. \(x^2\)
3. \(\sqrt{x}\)
4. \(\log{x}\)
Why might higher-degree polynomial terms lead to overfitting in a linear model?

1. Higher-degree terms make the model too simple
2. Higher-degree terms force the model to fit the noise in the data
3. Polynomial terms always reduce the model's flexibility
4. Polynomial terms make the model biased

Which of the following models includes both linear and quadratic terms?
1. \(y = \beta_0 + \beta_1x\)
2. \(y = \beta_0 + \beta_1x^2\)
3. \(y = \beta_0 + \beta_1x + \beta_2x^2\)
4. \(y = \beta_0 + \beta_1x + \beta_2x^3\)
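
If you want to try a polynomial term in R, note that a squared term has to be wrapped in `I()` inside a formula, because `^` otherwise has a special formula meaning. A minimal sketch with made-up, noise-free data generated from a known quadratic (the values 1, 2, and 3 are arbitrary choices for illustration):

```r
# made-up data from a known quadratic, with no noise added
x <- 1:10
y <- 1 + 2 * x + 3 * x^2

# I(x^2) tells R to square x literally rather than parse ^ as formula syntax
fit_quadratic <- lm(y ~ 1 + x + I(x^2))
coef(fit_quadratic)
```

Because the toy data contain no noise, the estimates should land on the generating values; with real data they would only approximate them.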
