Week 15: Multilevel models

Data Science for Studying Language and the Mind

Katie Schuler

2024-12-05

Announcements

  • You must attempt pset06 to be eligible to drop a pset (Canvas reflects your grade now)
    • No, you can’t turn it in blank – a good faith attempt
  • Please complete the Final exam survey on canvas if you haven’t already
    • Required to receive 1% bonus to final course grade (90% becomes 91%)
  • Wesley is repairing the small hiccup in pset grading (each pset is out of a different number of points)
  • Regrade requests for Exam 2 will be considered tonight

You are here

Data science with R
  • R basics
  • Data visualization
  • Data wrangling
Stats & Model building
  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Mixed-effect models

Addressing the elephant

Conditions for inference

  • Up until now, there has been no mathematical reason we couldn’t fit a model (mostly)
  • But when we start interpreting models and using statistical inference, we have to be more careful: we want to infer something about the population.
  • Certain conditions need to be met for the results of hypothesis tests and confidence intervals to be interpretable.

What are the conditions?

  1. Linearity
  2. Normality of residuals
  3. Homogeneity of variance
  4. Independence

We can check linearity, normality, and homogeneity of variance with check_model() (from the performance package). But the independence condition can only be verified through an understanding of how the data were collected.

1. Linearity

The relationship between the response and explanatory variables must be linear. The model assumes that the response variable can be expressed as a linear combination (a weighted sum) of the inputs.
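
In equation form (a generic sketch, not specific to this dataset), with explanatory variables \(x_1, \dots, x_p\):

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i\]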

Reminder: residuals

2. Normality

The residuals should follow a normal distribution centered at zero.

This is important for hypothesis testing, because the theoretical distributions behind many tests (t-tests, F-tests, etc.) assume normally distributed residuals.
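
As a rough sketch (assuming a fitted model object called model, like the one fit later in this demo), we can eyeball this with a histogram of the residuals; check_model() produces a similar diagnostic:

library(ggplot2)

# The residuals should look roughly normal and centered at zero
ggplot(data.frame(resid = residuals(model)), aes(x = resid)) +
  geom_histogram(bins = 30)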

3. Homogeneity of variance

The residuals should have equal variance across all values of the explanatory variables (homoscedasticity). In other words, the spread of the residuals should not depend on the value of the explanatory variable.

This is important because unequal variances can distort the standard errors of our parameter estimates (unreliable estimates of reliability!).
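
A rough sketch of the same kind of check (again assuming a fitted model called model): plot the residuals against the fitted values and look for a roughly constant spread around zero.

library(ggplot2)

# The spread of residuals should not change systematically with the fitted values
ggplot(data.frame(fitted = fitted(model), resid = residuals(model)),
       aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed")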

4. Independence

The observations in our data must be independent. This is important because violations of independence can inflate our test statistics.

Multilevel models

🥸 Aliases of multilevel models

  • Mixed models
  • Mixed effect models
  • Hierarchical models
  • Mixed effect regression
  • Probably even more!

Basics of multilevel models

A great visualization: http://mfviz.com/hierarchical-models/

Demo with lexdec

library(languageR)
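
The rest of the demo also relies on a few other packages: lme4 for lmer(), performance for check_model(), and the tidyverse for wrangling and plotting. Loading them here (an assumption about the original setup chunk) keeps the later code self-contained:

library(lme4)         # lmer() for multilevel models
library(performance)  # check_model() for checking conditions
library(tidyverse)    # dplyr and ggplot2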

Lexical decision data

lexdec %>% glimpse
Rows: 1,659
Columns: 28
$ Subject        <fct> A1, A1, A1, A1, A1, A1, A1, A1, A1, A1, A1, A1, A1, A1,…
$ RT             <dbl> 6.340359, 6.308098, 6.349139, 6.186209, 6.025866, 6.180…
$ Trial          <int> 23, 27, 29, 30, 32, 33, 34, 38, 41, 42, 45, 46, 47, 48,…
$ Sex            <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F…
$ NativeLanguage <fct> English, English, English, English, English, English, E…
$ Correct        <fct> correct, correct, correct, correct, correct, correct, c…
$ PrevType       <fct> word, nonword, nonword, word, nonword, word, word, nonw…
$ PrevCorrect    <fct> correct, correct, correct, correct, correct, correct, c…
$ Word           <fct> owl, mole, cherry, pear, dog, blackberry, strawberry, s…
$ Frequency      <dbl> 4.859812, 4.605170, 4.997212, 4.727388, 7.667626, 4.060…
$ FamilySize     <dbl> 1.3862944, 1.0986123, 0.6931472, 0.0000000, 3.1354942, …
$ SynsetCount    <dbl> 0.6931472, 1.9459101, 1.6094379, 1.0986123, 2.0794415, …
$ Length         <int> 3, 4, 6, 4, 3, 10, 10, 8, 6, 6, 5, 8, 9, 5, 3, 4, 8, 6,…
$ Class          <fct> animal, animal, plant, plant, animal, plant, plant, ani…
$ FreqSingular   <int> 54, 69, 83, 44, 1233, 26, 50, 63, 11, 24, 104, 28, 7, 2…
$ FreqPlural     <int> 74, 30, 49, 68, 828, 31, 65, 47, 9, 42, 95, 28, 4, 45, …
$ DerivEntropy   <dbl> 0.7912, 0.6968, 0.4754, 0.0000, 1.2129, 0.3492, 0.0000,…
$ Complex        <fct> simplex, simplex, simplex, simplex, simplex, complex, c…
$ rInfl          <dbl> -0.31015493, 0.81450804, 0.51879379, -0.42744401, 0.397…
$ meanRT         <dbl> 6.3582, 6.4150, 6.3426, 6.3353, 6.2956, 6.3959, 6.3475,…
$ SubjFreq       <dbl> 3.12, 2.40, 3.88, 4.52, 6.04, 3.28, 5.04, 2.80, 3.12, 3…
$ meanSize       <dbl> 3.4758, 2.9999, 1.6278, 1.9908, 4.6429, 1.5831, 1.8922,…
$ meanWeight     <dbl> 3.1806, 2.6112, 1.2081, 1.6114, 4.5167, 1.1365, 1.4951,…
$ BNCw           <dbl> 12.057065, 5.738806, 5.716520, 2.050370, 74.838494, 1.2…
$ BNCc           <dbl> 0.000000, 4.062251, 3.249801, 1.462410, 50.859385, 0.16…
$ BNCd           <dbl> 6.175602, 2.850278, 12.588727, 7.363218, 241.561040, 1.…
$ BNCcRatio      <dbl> 0.000000, 0.707856, 0.568493, 0.713242, 0.679589, 0.127…
$ BNCdRatio      <dbl> 0.512198, 0.496667, 2.202166, 3.591166, 3.227765, 0.934…

Explore RT by Frequency

ggplot(lexdec, aes(x = Frequency, y = RT)) +
  geom_point()

Model RT by Frequency

model <- lm(RT ~ Frequency, data = lexdec) 
summary(model)

Call:
lm(formula = RT ~ Frequency, data = lexdec)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.55407 -0.16153 -0.03494  0.11699  1.08768 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.588778   0.022296 295.515   <2e-16 ***
Frequency   -0.042872   0.004533  -9.459   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2353 on 1657 degrees of freedom
Multiple R-squared:  0.05123,   Adjusted R-squared:  0.05066 
F-statistic: 89.47 on 1 and 1657 DF,  p-value: < 2.2e-16

Plot the model

ggplot(lexdec, aes(x = Frequency, y = RT)) +
  geom_point() +
  geom_smooth(method = "lm")

Check assumptions

check_model(model)

Is independence violated?

lexdec %>% 
  group_by(Subject) %>% 
  tally()
# A tibble: 21 × 2
   Subject     n
   <fct>   <int>
 1 A1         79
 2 A2         79
 3 A3         79
 4 C          79
 5 D          79
 6 I          79
 7 J          79
 8 K          79
 9 M1         79
10 M2         79
# ℹ 11 more rows

Can we just add subject?

model_subj <- lm(RT ~ Frequency + Subject, data = lexdec)
summary(model_subj)

Call:
lm(formula = RT ~ Frequency + Subject, data = lexdec)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42365 -0.11792 -0.02019  0.08908  1.06683 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.481422   0.026243 246.981  < 2e-16 ***
Frequency   -0.042872   0.003485 -12.301  < 2e-16 ***
SubjectA2   -0.057353   0.028791  -1.992 0.046532 *  
SubjectA3    0.120001   0.028791   4.168 3.23e-05 ***
SubjectC     0.044581   0.028791   1.548 0.121709    
SubjectD     0.128337   0.028791   4.458 8.85e-06 ***
SubjectI    -0.024612   0.028791  -0.855 0.392750    
SubjectJ    -0.023057   0.028791  -0.801 0.423335    
SubjectK    -0.084903   0.028791  -2.949 0.003234 ** 
SubjectM1   -0.104522   0.028791  -3.630 0.000292 ***
SubjectM2    0.254975   0.028791   8.856  < 2e-16 ***
SubjectP     0.151447   0.028791   5.260 1.63e-07 ***
SubjectR1    0.029529   0.028791   1.026 0.305214    
SubjectR2    0.194500   0.028791   6.756 1.97e-11 ***
SubjectR3    0.115171   0.028791   4.000 6.61e-05 ***
SubjectS     0.095937   0.028791   3.332 0.000881 ***
SubjectT1    0.085070   0.028791   2.955 0.003174 ** 
SubjectT2    0.529758   0.028791  18.400  < 2e-16 ***
SubjectV     0.309583   0.028791  10.753  < 2e-16 ***
SubjectW1   -0.013835   0.028791  -0.481 0.630907    
SubjectW2    0.182732   0.028791   6.347 2.84e-10 ***
SubjectZ     0.321141   0.028791  11.154  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1809 on 1637 degrees of freedom
Multiple R-squared:  0.4457,    Adjusted R-squared:  0.4386 
F-statistic: 62.69 on 21 and 1637 DF,  p-value: < 2.2e-16

Math says yes! But this is not a great idea: it spends 20 extra parameters on these particular subjects, and the estimates don’t generalize to new subjects drawn from the same population.

The model specification

in R:

RT ~ Frequency + Subject

in Eq:

\[\begin{align} RT_i &= \beta_0 + \beta_1 \cdot \text{Frequency}_i + \beta_2 \cdot \text{Subject}_2 + \beta_3 \cdot \text{Subject}_3 \\ &+ \beta_4 \cdot \text{Subject}_4 + \beta_5 \cdot \text{Subject}_5 + \cdots + \beta_{21} \cdot \text{Subject}_{21} + \epsilon_i \end{align}\]
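
Under the hood, R dummy-codes Subject into 20 indicator columns (one per subject beyond the reference level, A1). A quick way to see the design matrix this specification implies:

# Peek at the design matrix: an intercept, Frequency, and 20 Subject indicators
head(model.matrix(~ Frequency + Subject, data = lexdec))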

Solution: multilevel model!

Use lmer() instead of lm() and include random intercepts:

model_multi <- lmer(RT ~ Frequency + (1|Subject), data = lexdec)
summary(model_multi)
Linear mixed model fit by REML ['lmerMod']
Formula: RT ~ Frequency + (1 | Subject)
   Data: lexdec

REML criterion at convergence: -866.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3537 -0.6542 -0.1158  0.4880  5.8884 

Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept) 0.02373  0.1541  
 Residual             0.03274  0.1809  
Number of obs: 1659, groups:  Subject, 21

Fixed effects:
             Estimate Std. Error t value
(Intercept)  6.588778   0.037737   174.6
Frequency   -0.042872   0.003485   -12.3

Correlation of Fixed Effects:
          (Intr)
Frequency -0.439

Model specification

\[\begin{align} RT_{ij} &= \beta_0 + \beta_1 \cdot \text{Frequency}_i + u_j + \epsilon_{ij} \end{align}\]

In a fitted model:

  • \(\beta_0\) is the overall intercept
  • \(\beta_1\) is the fixed effect of Frequency on RT
  • \(u_j\) is the random intercept for subject \(j\), representing that subject’s deviation from the overall mean RT
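
As a quick check, the fixed-effect estimates (\(\beta_0\) and \(\beta_1\)) can be pulled out of the fitted model with fixef() from lme4:

# Fixed effects: the overall intercept and the Frequency slope
fixef(model_multi)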

Returning the random effects

We can get the value of \(u_j\) for each subject with ranef().

# Get the random effects (u_j values)
ranef(model_multi)
$Subject
    (Intercept)
A1 -0.105513481
A2 -0.161881960
A3  0.012427356
C  -0.061697510
D   0.020621048
I  -0.129703500
J  -0.128174991
K  -0.188959004
M1 -0.208241374
M2  0.145085242
P   0.043334300
R1 -0.076491194
R2  0.085648101
R3  0.007680557
S  -0.011222924
T1 -0.021903925
T2  0.415151383
V   0.198755575
W1 -0.119111253
W2  0.074082535
Z   0.210115019

with conditional variances for "Subject" 
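
Relatedly, coef() combines the two pieces: it returns each subject’s own intercept (\(\beta_0 + u_j\)) alongside the shared Frequency slope.

# Per-subject intercepts (beta_0 + u_j) plus the shared Frequency slope
head(coef(model_multi)$Subject)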

Plotting the model with predict

lexdec_w_predict <- lexdec %>%
  # prediction includes both the fixed effects and each subject's random intercept
  mutate(prediction = predict(model_multi, lexdec)) %>%
  # re.form = NA drops the random effects, giving fixed-effects-only predictions
  mutate(prediction_fixed = predict(model_multi, lexdec, re.form = NA)) %>%
  select(Subject, Word, Frequency, RT, prediction, prediction_fixed)

lexdec_w_predict %>% filter(Frequency == 4.859812)
     Subject Word Frequency       RT prediction prediction_fixed
1         A1  owl  4.859812 6.340359   6.274916         6.380429
499       A2  owl  4.859812 6.082219   6.218548         6.380429
723       A3  owl  4.859812 6.483107   6.392857         6.380429
1021       C  owl  4.859812 6.171701   6.318732         6.380429
113        D  owl  4.859812 6.309918   6.401051         6.380429
1564       I  owl  4.859812 6.109248   6.250726         6.380429
331        J  owl  4.859812 6.356108   6.252255         6.380429
791        K  owl  4.859812 6.011267   6.191470         6.380429
454       M1  owl  4.859812 6.089045   6.172188         6.380429
1593      M2  owl  4.859812 6.429719   6.525515         6.380429
187        P  owl  4.859812 6.436150   6.423764         6.380429
1357      R1  owl  4.859812 6.234411   6.303938         6.380429
1161      R2  owl  4.859812 6.378426   6.466078         6.380429
1297      R3  owl  4.859812 6.472346   6.388110         6.380429
1084       S  owl  4.859812 6.359574   6.369207         6.380429
588       T1  owl  4.859812 6.393591   6.358526         6.380429
1252      T2  owl  4.859812 6.583409   6.795581         6.380429
1477       V  owl  4.859812 6.796824   6.579185         6.380429
670       W1  owl  4.859812 6.278521   6.261318         6.380429
880       W2  owl  4.859812 6.499787   6.454512         6.380429
309        Z  owl  4.859812 6.706862   6.590545         6.380429

Plotting the model with predict

lexdec_w_predict %>%
  ggplot(aes(x = Frequency, y = RT)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(color = Subject, y = prediction)) 

Plotting the model with predict

lexdec_w_predict %>%
  ggplot(aes(x = Frequency, y = RT)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(color = Subject, y = prediction)) +
  geom_line(aes(y = prediction_fixed), linewidth = 3, color = "blue")

All of our usual stuff applies!

summary(model_multi)
Linear mixed model fit by REML ['lmerMod']
Formula: RT ~ Frequency + (1 | Subject)
   Data: lexdec

REML criterion at convergence: -866.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3537 -0.6542 -0.1158  0.4880  5.8884 

Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept) 0.02373  0.1541  
 Residual             0.03274  0.1809  
Number of obs: 1659, groups:  Subject, 21

Fixed effects:
             Estimate Std. Error t value
(Intercept)  6.588778   0.037737   174.6
Frequency   -0.042872   0.003485   -12.3

Correlation of Fixed Effects:
          (Intr)
Frequency -0.439
  • we can’t use the infer package anymore, but all the rest still applies!

Thanks for a great semester!