Lab 7: Applied model specification

Not graded, just practice

Author

Katie Schuler

Published

October 17, 2024

Practice your new modeling skills with these practice exam questions! It is best to open a fresh Google Colab notebook and try things out as you go. You can also refer to the study guide to find answers.

1 Primate brains

Primates have brains of varying sizes, and one possible explanation for this variation is differences in body size. Larger-bodied primates may tend to have heavier brains, but this relationship is not always straightforward. To investigate whether body size can reliably explain differences in brain weight across primate species, let’s fit a model that predicts brain weight based on body size.

The data, in case you want to work with it yourself: primate brains

Code
library(tidyverse)  # provides read_csv(), glimpse(), ggplot(), and %>%
data <- read_csv("https://kathrynschuler.com/datasci/assests/csv/primate_brains.csv")
glimpse(data)
Rows: 144
Columns: 5
$ taxon          <chr> "Alouatta_caraya", "Alouatta_palliata", "Alouatta_pigra…
$ body_weight_g  <dbl> 5597, 6359, 8940, 6247, 1073, 870, 871, 239, 6409, 8034…
$ brain_weight_g <dbl> 52.72, 50.91, 52.97, 56.57, 21.41, 16.78, 17.21, 7.17, …
$ diet_category  <chr> "Fol", "Fol", "Fol", "Fol", "Frug/Fol", "Frug", "Frug",…
$ group_size     <dbl> 6.68, 15.55, 5.93, 6.97, 3.00, 3.50, 3.51, 1.25, 16.40,…
Code
ggplot(data, aes( x = body_weight_g, y = brain_weight_g)) +
    geom_point()

1.1 Type of model

  1. Is this a supervised or unsupervised learning problem?

  2. Is this regression or classification?

  3. Is the relationship between brain_weight_g and body_weight_g linear or nonlinear?

  4. Is the nonlinear relationship linearizable or non-linearizable?

  5. What function could we choose to linearize this relationship?

1.2 Model specification

Suppose we specify the following model for the primate brains data: \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)
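To see what this specification says on the original (gram) scale, one can exponentiate both sides. This is a short derivation sketch using only the symbols already defined above:

```latex
\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)
\;\Longrightarrow\;
brain\_weight\_g = e^{w_1} \cdot (body\_weight\_g)^{w_2}
```

In other words, a straight line on the log-log scale corresponds to a power-law curve on the original scale, with \(w_2\) acting as the exponent.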

  1. What is the response variable?

  2. What is the explanatory variable?

  3. True or false, the functional form of this model can be expressed as a weighted sum of inputs? \(y=\sum_{i=1}^{n}w_ix_i\)

  4. Which of the following model terms are included in the model specification above? Choose all that apply.

  5. Specify the model equation in R notation.

Answer
# like this (explicit intercept)
log(brain_weight_g) ~ 1 + log(body_weight_g)

# or like this (implicit intercept)
log(brain_weight_g) ~ log(body_weight_g)
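
A minimal sketch to check that the two notations describe the same model. The data frame `sim` is simulated here purely for illustration (it is not the primate data), and the numbers used to generate it are arbitrary assumptions.

```r
# Hypothetical data, simulated only to illustrate the formula syntax
set.seed(1)
sim <- data.frame(body_weight_g = exp(runif(50, 5, 11)))
sim$brain_weight_g <- exp(-2.5 + 0.78 * log(sim$body_weight_g) + rnorm(50, sd = 0.3))

# Explicit intercept (1 +) versus implicit intercept
fit_explicit <- lm(log(brain_weight_g) ~ 1 + log(body_weight_g), data = sim)
fit_implicit <- lm(log(brain_weight_g) ~ log(body_weight_g), data = sim)

# Both fits estimate the same two weights
all.equal(coef(fit_explicit), coef(fit_implicit))
```

R adds the intercept by default, so writing `1 +` only makes it visible. To drop the intercept you would have to ask for that explicitly, for example with `0 +` in the formula.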

1.3 Fitted model

Suppose you fit the model with lm() and return the following:


Call:
lm(formula = log(brain_weight_g) ~ 1 + log(body_weight_g), data = data)

Coefficients:
       (Intercept)  log(body_weight_g)  
           -2.4649              0.7752  
  1. Which of the following is \(w_1\) in the model specification \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)?

  2. Which of the following is \(w_2\) in the model specification \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)?

  3. Suppose a primate has a \(\log(body\_weight\_g)\) equal to 10. Which of the following would the model predict to be the primate’s \(\log(brain\_weight\_g)\)?

  4. Which of the following figures could show the fitted model?
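
One way to check a prediction by hand is to plug the fitted coefficients into the model equation. The sketch below copies the two coefficients from the `lm()` output above; the variable names (`w1`, `w2`, `log_body`) are just illustrative.

```r
# Coefficients copied from the lm() output above
w1 <- -2.4649   # (Intercept)
w2 <- 0.7752    # log(body_weight_g)

# Prediction for a primate with log(body_weight_g) equal to 10
log_body <- 10
log_brain_pred <- w1 * 1 + w2 * log_body
log_brain_pred        # predicted log(brain_weight_g), about 5.2871

# Undo the log to express the prediction in grams
exp(log_brain_pred)
```

With the fitted model object in hand, `predict()` with a `newdata` data frame would give the same kind of answer without typing the coefficients.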

2 Social brain hypothesis

The Social Brain Hypothesis argues that the pressures of navigating increasingly complex social environments were a significant driver in the evolution of brain size and intelligence in humans and other primates.

Let’s specify and fit this model in R.

primate_brains <- data  # assumption: the same data frame loaded above as `data`
model <- lm(log(brain_weight_g) ~ 1 + log(group_size), 
    data = primate_brains)
Code
primate_brains <- primate_brains %>%
    mutate(y_body_group = predict(model, primate_brains))

ggplot(primate_brains, aes(
    y = log(brain_weight_g),
    x = log(group_size))
) +
geom_point(size = 2) +
geom_line(color = "blue", aes(y = y_body_group)) 

  1. Fill in the blank: how many inputs does this model have?

Specify the model as an equation

\(\log(brain\_weight\_g) = w_1 1 + w_2 \log(group\_size)\)

or, if you created new columns in your data with the log-transformed data, for example:

data <- data %>%
    mutate(log_brain_weight = log(brain_weight_g)) %>%
    mutate(log_group_size = log(group_size))

then you could have written:

\(log\_brain\_weight = w_1 1 + w_2 log\_group\_size\)
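
Whether the log is taken inside the formula or stored in new columns first, the fitted weights should come out the same. Here is a small sketch with simulated data (not the primate data). It uses base R column assignment in place of `mutate()` so it runs without extra packages.

```r
# Hypothetical data, simulated only for illustration
set.seed(2)
sim <- data.frame(group_size = exp(runif(40, 0, 4)))
sim$brain_weight_g <- exp(2 + 0.5 * log(sim$group_size) + rnorm(40, sd = 0.5))

# Pre-compute the log-transformed columns (base R equivalent of the mutate() calls)
sim$log_brain_weight <- log(sim$brain_weight_g)
sim$log_group_size <- log(sim$group_size)

# Transform inside the formula versus use the new columns
fit_inline <- lm(log(brain_weight_g) ~ 1 + log(group_size), data = sim)
fit_columns <- lm(log_brain_weight ~ 1 + log_group_size, data = sim)

# Same estimates; only the coefficient labels differ
unname(coef(fit_inline))
unname(coef(fit_columns))
```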

  1. Given the figure above, which of the following could be the free parameter estimate for \(w_1\)?

  2. Given the figure above, which of the following could be the free parameter estimate for \(w_2\)?

  3. Suppose we encounter a primate in a (log) group size of 4. What could be the model prediction for their (log) brain weight?

Suppose we wanted to include \(\log(body\_weight\_g)\) back into the model as an additional predictor of \(\log(brain\_weight\_g)\). Specify the model in R.

log(brain_weight_g) ~ 1 + log(group_size) + log(body_weight_g)
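
A sketch of what fitting a two-predictor specification like this returns. The data are simulated stand-ins with column names taken from the `glimpse()` output earlier (`brain_weight_g`, `body_weight_g`, `group_size`); the generating values are arbitrary assumptions, not estimates from the real data.

```r
# Hypothetical data, simulated only for illustration
set.seed(6)
n <- 80
sim <- data.frame(
  body_weight_g = exp(runif(n, 5, 11)),
  group_size = exp(runif(n, 0, 4))
)
sim$brain_weight_g <- exp(
  -2.5 + 0.7 * log(sim$body_weight_g) + 0.1 * log(sim$group_size) + rnorm(n, sd = 0.3)
)

# One weight per term: intercept, log(group_size), log(body_weight_g)
fit <- lm(log(brain_weight_g) ~ 1 + log(group_size) + log(body_weight_g), data = sim)
coef(fit)
```

Each term on the right-hand side of the formula gets its own weight, so this model has three free parameters.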

3 Fruit v Leaf eaters

Diet may influence the relationship between brain and body size in primates because the type of food a species consumes can impact its ability to meet the energy demands of a larger brain. Fruit-eating primates have access to energy-rich, easily digestible food, which could support the metabolic costs of both a large body and a larger, more complex brain.

Let’s begin by adding diet_category to our plot mapped to the color aesthetic.

Code
primate_brains %>%
    ggplot(aes(
        y = log(brain_weight_g), 
        x = log(body_weight_g),
        color = diet_category
    )) +
    geom_point(size = 2) 

Frugivorous (“Frug”) primates primarily eat fruit, while folivorous (“Fol”) primates primarily consume leaves. The “Frug/Fol” category refers to primates that combine both fruit and leaf consumption in their diet. “Om” stands for omnivores, which we might suspect is similar to “Frug/Fol” with more variation in diet. To simplify things, let’s focus our analysis on just the Fol and Frug categories.

Code
fruit_v_leaves <- primate_brains %>%
    filter(diet_category %in% c("Fol", "Frug")) 

fruit_v_leaves %>%
    ggplot(aes(
        x = log(body_weight_g), 
        y = log(brain_weight_g), 
        color = diet_category
    )) +
    geom_point() 

Suppose we specify a model that predicts brain weight by body size and diet category.

Code
model <- lm(log(brain_weight_g) ~ log(body_weight_g) + diet_category, data = fruit_v_leaves) 

model

Call:
lm(formula = log(brain_weight_g) ~ log(body_weight_g) + diet_category, 
    data = fruit_v_leaves)

Coefficients:
       (Intercept)  log(body_weight_g)   diet_categoryFrug  
           -2.8047              0.7778              0.4576  

Specify the model with a mathematical expression.

\(\log(brain\_weight\_g) = w_1 1 + w_2\log(body\_weight\_g) + w_3 diet\_category\)

Notice we did not include an interaction term between body weight and diet category. Why might a modeler make this decision?

A modeler might choose not to include an interaction term based on exploratory visualization. The scatter plot shows roughly parallel lines for frugivorous and folivorous primates, which could indicate that body size influences brain weight similarly, regardless of diet.

You could have also said something relevant to model complexity: the modeler may have noticed in exploratory data analysis that body size seems to influence brain weight similarly, and decided to keep the model simpler and easier to interpret by leaving out the interaction.
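
To make the complexity point concrete, here is a sketch comparing the additive specification with one that includes the interaction. The data are simulated with parallel lines by construction (an assumption made only for illustration), and the column names `log_body` and `log_brain` are hypothetical.

```r
# Hypothetical data: two diet groups with the same slope and different intercepts
set.seed(3)
n <- 60
sim <- data.frame(
  log_body = runif(n, 5, 11),
  diet_category = rep(c("Fol", "Frug"), each = n / 2)
)
sim$log_brain <- -2.8 + 0.78 * sim$log_body +
  0.46 * (sim$diet_category == "Frug") + rnorm(n, sd = 0.2)

# Additive model: one shared slope, separate intercepts
fit_additive <- lm(log_brain ~ log_body + diet_category, data = sim)

# Interaction model: slope is also allowed to differ by diet
fit_interaction <- lm(log_brain ~ log_body * diet_category, data = sim)

names(coef(fit_additive))      # three weights
names(coef(fit_interaction))   # four weights, including the interaction term
```

The interaction adds one more free parameter. If the groups look parallel in the plot, that extra weight buys little and makes the model harder to interpret.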

  1. True or false: the diet_category variable is categorical, so this is a classification problem.

Write the fitted model as a mathematical expression.

\(\log(brain\_weight\_g) = -2.8047\times{1} + 0.7778\times{\log(body\_weight\_g)} + 0.4576\times{diet\_category}\)

  1. Based on the fitted model returned by lm() above, which level of diet_category is the reference level?

What is the model’s prediction for a primate with a (log) body weight of 7 who eats leaves? Write your answer as a mathematical expression without simplifying it.

\(\log(brain\_weight\_g) = -2.8047\times{1} + 0.7778\times{\mathbf{7}} + 0.4576\times{\mathbf{0}}\)
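
The 0 in the last term comes from how R turns a categorical variable into a number. The sketch below shows that coding with `model.matrix()` on a tiny hypothetical data frame, then repeats the arithmetic with the coefficients copied from the `lm()` output above.

```r
# How R codes a two-level categorical variable as a 0/1 indicator column
diets <- data.frame(diet_category = c("Fol", "Frug"))
model.matrix(~ diet_category, data = diets)

# Coefficients copied from the lm() output above
w <- c(-2.8047, 0.7778, 0.4576)

# Inputs are (1, log body weight, indicator for Frug)
pred_fol  <- sum(w * c(1, 7, 0))   # leaf eater
pred_frug <- sum(w * c(1, 7, 1))   # fruit eater

pred_fol               # about 2.6399
pred_frug - pred_fol   # the two predictions differ by the third weight
```

At the same body weight, the two diet groups differ only by the third weight, which shifts the line up or down without changing its slope.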

  1. Fill in the blank. Of the figures below, figure ___ could be the plot of the model we specified.

4 Matching plots to equations

Match the following plots to the equations below. Each plot can be mapped to a unique expression of the linear model equation.

  a. \(y = w_1 1\)

  b. \(y = w_1 1 + w_2x\)

  c. \(y = w_1 1 + w_2z\)

  d. \(y = w_1 1 + w_2x + w_3z\)

  e. \(y = w_1 1 + w_2x + w_3z + w_4x\times{z} + w_5x^2\)

  f. \(y = w_1 1 + w_2x + w_3x^2\)

  1. Which of the equations above has the most inputs (enter a lowercase letter a-f)?

  2. Which of the equations above is the most complex model (enter a lowercase letter a-f)?
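
If you want to check your answers, each equation can be written as an R formula and fit to some data. The sketch below uses simulated `x`, `z`, and `y` columns (hypothetical, for illustration only) and counts the weights that `lm()` estimates for each specification.

```r
# Hypothetical data, simulated only for illustration
set.seed(4)
sim <- data.frame(x = runif(30), z = runif(30))
sim$y <- 1 + 2 * sim$x - sim$z + rnorm(30, sd = 0.1)

# The six equations in R notation; I() protects the arithmetic x^2
formulas <- list(
  a = y ~ 1,
  b = y ~ 1 + x,
  c = y ~ 1 + z,
  d = y ~ 1 + x + z,
  e = y ~ 1 + x + z + x:z + I(x^2),
  f = y ~ 1 + x + I(x^2)
)

# Number of estimated weights for each specification
n_weights <- sapply(formulas, function(f) length(coef(lm(f, data = sim))))
n_weights
```

Inside a formula, `x:z` is the interaction term and `I(x^2)` is the squared term. Writing a bare `x^2` in a formula would not square `x`.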

5 Polynomials

  1. What is the purpose of including polynomial terms in a linear model?

  2. Which of the following is an example of a quadratic polynomial term in a linear model?

    1. \(x\)
    2. \(x^2\)
    3. \(\sqrt{x}\)
    4. \(\log{x}\)

  3. Why might higher-degree polynomial terms lead to overfitting in a linear model?

  4. Which of the following models includes both linear and quadratic terms?

    1. \(y = \beta_0 + \beta_1x\)
    2. \(y = \beta_0 + \beta_1x^2\)
    3. \(y = \beta_0 + \beta_1x + \beta_2x^2\)
    4. \(y = \beta_0 + \beta_1x + \beta_2x^3\)
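
A sketch for exploring the overfitting question yourself. The data are simulated from a quadratic curve plus noise (an assumption chosen for illustration), and polynomial models of increasing degree are fit with `poly()`.

```r
# Hypothetical data: a quadratic trend plus random noise
set.seed(5)
x <- seq(0, 1, length.out = 20)
y <- 1 + 2 * x - 3 * x^2 + rnorm(20, sd = 0.2)
sim <- data.frame(x = x, y = y)

# Fit polynomials of degree 1, 2, and 8 and record the training error
degrees <- c(1, 2, 8)
train_rss <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = sim)
  sum(residuals(fit)^2)   # residual sum of squares on the data used for fitting
})
setNames(train_rss, paste0("degree_", degrees))
```

The training error can only go down as the degree goes up, because each higher-degree model contains the lower-degree ones. A lower training error does not by itself mean better predictions on new data, since the extra terms may be fitting the noise.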