# Model Reliability

Data Science for Studying Language and the Mind

Katie Schuler

2023-11-07

## You are here

##### Data science with R
• Hello, world!
• R basics
• Data importing
• Data visualization
• Data wrangling
##### Stats & Model building
• Sampling distribution
• Hypothesis testing
• Model specification
• Model fitting
• Model accuracy
• Model reliability
• Classification
• Feature engineering (preprocessing)
• Inference for regression
• Mixed-effect models

## Model building overview

• Model specification: what is the form?
• Model fitting: you have the form, how do you guess the free parameters?
• Model accuracy: you’ve estimated the parameters, how well does that model describe your data?
• Model reliability: you've estimated the parameters, how much uncertainty is there in those estimates?

# Dataset

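library(tidyverse)  # read_csv() and the %>% pipe
library(infer)      # specify(), generate(), fit(), get_confidence_interval()
data_n10 <- read_csv("http://kathrynschuler.com/datasets/model-reliability-sample10.csv")
data_n200 <- read_csv("http://kathrynschuler.com/datasets/model-reliability-sample200.csv")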

#### Explore the data

##### Specify a model
• supervised learning | regression | linear
• y ~ x
• $y=w_0+w_1x_1$

##### Specify and fit with infer
data_n10 %>%
  specify(y ~ x) %>%
  fit()
# A tibble: 2 × 2
term      estimate
<chr>        <dbl>
1 intercept    1.75
2 x            0.733

How certain can we be about the parameter estimates we obtained?

observed_fit <- data_n10 %>%
  specify(y ~ x) %>%
  fit()

observed_fit
# A tibble: 2 × 2
term      estimate
<chr>        <dbl>
1 intercept    1.75
2 x            0.733

But… why is there uncertainty around the parameter estimates at all?

## Because of sampling error

We are interested in the model parameters that best describe the population from which the sample was drawn, not the parameters that happen to describe one particular sample.

• Due to sampling error, we can expect some variability in the model parameters that describe a sample of data.
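
A minimal base R sketch of sampling error, not taken from the lecture data: it assumes a hypothetical population in which y = 2 + 0.5x plus noise (`true_w0`, `true_w1`, and `fit_one_sample()` are made-up names for illustration). Each sample of 10 observations gives a somewhat different intercept and slope, even though the population never changes.

```r
# Hypothetical population: y = 2 + 0.5 * x + noise (values assumed for illustration)
set.seed(1)
true_w0 <- 2
true_w1 <- 0.5

# Draw one sample of size n from the population and return the fitted parameters
fit_one_sample <- function(n) {
  x <- runif(n, min = 0, max = 5)
  y <- true_w0 + true_w1 * x + rnorm(n, mean = 0, sd = 1)
  coef(lm(y ~ x))
}

# Five independent samples of 10 observations each, one row of estimates per sample
estimates <- t(replicate(5, fit_one_sample(10)))
colnames(estimates) <- c("intercept", "x")
estimates
```

The rows of `estimates` differ from one another only because each sample contains different observations.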

## Model reliability

• We can think of model reliability as the stability of the parameters of a fitted model.
• The more data we collect, the more reliable the parameter estimates will be: they vary less from one sample to the next.
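
To illustrate the link between sample size and stability, here is a small simulation sketch in base R. It assumes a hypothetical population (y = 2 + 0.5x plus noise) rather than the lecture datasets, and `simulate_slopes()` is a helper invented for this example. It compares how much the fitted slope varies across repeated samples of 10 versus 200 observations.

```r
set.seed(2)

# Fit y ~ x to `reps` fresh samples of size n and return the fitted slopes
# (the population values 2 and 0.5 are assumptions for illustration)
simulate_slopes <- function(n, reps = 500) {
  replicate(reps, {
    x <- runif(n, min = 0, max = 5)
    y <- 2 + 0.5 * x + rnorm(n)
    coef(lm(y ~ x))[["x"]]
  })
}

# Spread of the slope estimate across samples, for small and large samples
sd_n10 <- sd(simulate_slopes(10))
sd_n200 <- sd(simulate_slopes(200))
c(n10 = sd_n10, n200 = sd_n200)
```

The standard deviation of the slopes should be clearly smaller for the larger samples, which is what "more reliable" means here.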

### Confidence intervals via bootstrapping

We can obtain confidence intervals around parameter estimates for models in the same way we did for point estimates like the mean: bootstrapping.

1. Draw bootstrap samples from the observed data
2. Fit the model of interest to each bootstrapped sample
3. Construct the sampling distribution of parameter estimates across bootstraps
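
The three steps above can be sketched by hand in base R. This is an illustration under stated assumptions, not the infer implementation: the `observed` data frame is simulated from a hypothetical population (y = 2 + 0.5x plus noise), and `n_boot` and `boot_estimates` are names chosen for this example.

```r
set.seed(3)

# Hypothetical observed sample (simulated; stands in for real data)
n <- 200
observed <- data.frame(x = runif(n, min = 0, max = 5))
observed$y <- 2 + 0.5 * observed$x + rnorm(n)

n_boot <- 1000
boot_estimates <- t(replicate(n_boot, {
  # Step 1: draw a bootstrap sample (rows resampled with replacement)
  rows <- sample(nrow(observed), replace = TRUE)
  # Step 2: fit the model of interest to the bootstrap sample
  coef(lm(y ~ x, data = observed[rows, ]))
}))
colnames(boot_estimates) <- c("intercept", "x")

# Step 3: the rows of boot_estimates form the bootstrap distribution
# of each parameter; summarise its spread
apply(boot_estimates, 2, sd)
```

Each bootstrap sample has the same size as the observed data, so the spread of `boot_estimates` approximates the sampling variability at that sample size.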

### Confidence intervals with infer

Fit bootstraps

boot_fits <- data_n200 %>%
  specify(y ~ x) %>%
  generate(
    reps = 1000,
    type = "bootstrap"
  ) %>%
  fit()

head(boot_fits)
# A tibble: 6 × 3
# Groups:   replicate [3]
replicate term      estimate
<int> <chr>        <dbl>
1         1 intercept    1.84
2         1 x            0.485
3         2 intercept    1.95
4         2 x            0.585
5         3 intercept    1.82
6         3 x            0.332

Get confidence interval

ci <- boot_fits %>%
  get_confidence_interval(
    point_estimate = observed_fit,
    level = 0.95
  )

ci 
# A tibble: 2 × 3
term      lower_ci upper_ci
<chr>        <dbl>    <dbl>
1 intercept    1.78     2.06
2 x            0.362    0.634
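
One common way to turn a bootstrap distribution into a confidence interval is the percentile method: take the middle 95% of the bootstrapped estimates. The base R sketch below shows that calculation on simulated data; it assumes a hypothetical population (y = 2 + 0.5x plus noise) and is not a reproduction of what `get_confidence_interval()` does internally.

```r
set.seed(4)

# Hypothetical observed sample (simulated for illustration)
n <- 200
x <- runif(n, min = 0, max = 5)
y <- 2 + 0.5 * x + rnorm(n)

# Bootstrap distribution of the slope: resample rows, refit, keep the slope
boot_slopes <- replicate(1000, {
  rows <- sample(n, replace = TRUE)
  coef(lm(y[rows] ~ x[rows]))[[2]]
})

# Percentile interval: cut off alpha / 2 of the distribution in each tail
level <- 0.95
alpha <- 1 - level
slope_ci <- quantile(boot_slopes, probs = c(alpha / 2, 1 - alpha / 2))
slope_ci
```

Changing `level` to 0.99 widens the interval, because less of the bootstrap distribution is cut off in each tail.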

Visualize distribution & CI

boot_fits %>%
  visualize() +
  shade_ci(endpoints = ci)