Problem set 5
due Friday, December 8 at 11:59pm
Allocate about 1 hour per problem, though some will take longer than others. You may need more time if programming is completely new to you, or less if you have some experience already.
 Instructions

Upload your .ipynb notebook to Gradescope by 11:59pm on the due date.
 Note that each problem will be graded according to this rubric. Solutions that include packages or functions not covered in this course will receive a score no higher than 2.
 You may collaborate with any of your classmates, but you must write your own code/solutions, understand all parts of the problem, and name your collaborators.
 You should also cite any outside sources you consulted, like Stack Overflow or ChatGPT, with a comment near the relevant lines of code (see example below). Recycled code that has not been cited will be considered plagiarism and receive a zero.
# code here was inspired by user2554330 on stack overflow:
# https://stackoverflow.com/questions/69091812/iseverythingavectorinr
As always, create a new Colab R notebook. Include the title “Problem set 5”, your name, the date, and any collaborators somewhere at the top.
Dataset Information
You will work with the verbs dataset, part of the languageR package. This dataset is a simplified version of the dative dataset, which is described as follows:
Data describing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection.
Dative alternation refers to the fact that some sentences can have two different structures that convey the same basic meaning. In the verbs dataset, we are looking at ditransitive verbs (such as ‘give’ or ‘offer’), which can realize the recipient in two ways: as a noun phrase (NP), “give you a book”, or as a prepositional phrase (PP), “give a book to you”. We want to investigate which factors in the data (AnimacyOfRec, AnimacyOfTheme, LengthOfTheme) affect whether the recipient is realized as a noun phrase (NP) or a prepositional phrase (PP) (RealizationOfRec).
Problem 1: Data Exploration
 Load the dataset and perform exploratory data analysis
 Examine the dataset’s structure (glimpse), summary statistics, and handle missing values if any.
 Visualize the pairwise relationship between the response variable and each of the three explanatory variables using appropriate plots (scatter plots, histograms, etc.).
Problem 2: Model Specification
 Choose an appropriate type of model and explain why you’ve selected this type (e.g. regression, classification)
 Specify the model with an equation (e.g. \(y = \beta_0 + \beta_1x_1\)). Note that you can create math equations in Google Colab text blocks by placing the equation between two dollar signs:
$y = \beta_0 + \beta_1x_1$
 Specify the model in R code (e.g. y ~ x)
Problem 3: Model Fitting
 Fit the model you specified in R
 What does the model tell us? How is realization of NP/PP impacted by the animacy of the theme (AnimacyOfTheme), the animacy of the recipient (AnimacyOfRec), and the length of the theme (LengthOfTheme)?
 What is the meaning of the intercept? Convert the estimate for the intercept to probability.
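If the model you chose in Problem 2 is a logistic regression (an assumption here; the choice of model is yours), its coefficients are on the log-odds scale, and the conversion to probability is the inverse logit. A minimal sketch in base R, using a made-up intercept value rather than a fitted one:

```r
# Hypothetical intercept on the log-odds scale (NOT a fitted value)
intercept_log_odds <- 0.5

# Inverse logit by hand: p = exp(b0) / (1 + exp(b0))
p_manual <- exp(intercept_log_odds) / (1 + exp(intercept_log_odds))

# The same conversion with base R's logistic distribution function
p_plogis <- plogis(intercept_log_odds)

print(c(manual = p_manual, plogis = p_plogis))
```

Both lines compute the same quantity; the manual version just makes the formula explicit.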
Problem 4: Model Accuracy
 Assess the accuracy of the model using cross-validation
 Take care to use collect_metrics() to report any accuracy metrics
 Add text explaining how accurate your model is, looking at the mean and standard deviation of accuracy.
Problem 5: Model Reliability
 Assess the reliability of your parameter estimates via bootstrapping
 Get a 95% confidence interval around the parameter estimates of your model.
 Visualize your bootstrapped distribution and shade the confidence interval.
If you get the error “contrasts can be applied only to factors with 2 or more levels”, try the bootstrap with a seed of 2 (set.seed(2)) and a reduced number of replicates (reps = 500)!
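To illustrate the idea behind a bootstrapped 95% confidence interval, here is a rough sketch of a percentile bootstrap written by hand in base R. It runs on simulated toy data, not the verbs dataset, and the helper name boot_slope is hypothetical. The tools covered in class may do the resampling for you, so treat this as an explanation of the mechanics rather than the expected solution.

```r
set.seed(1)

# Simulated toy data (NOT the verbs dataset): one numeric predictor, binary outcome
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x))
toy <- data.frame(x = x, y = y)

# Hypothetical helper: resample rows with replacement, refit, return the slope
boot_slope <- function(data) {
  resample <- data[sample(nrow(data), replace = TRUE), ]
  coef(glm(y ~ x, data = resample, family = binomial))[["x"]]
}

# 500 bootstrap replicates of the slope estimate
slopes <- replicate(500, boot_slope(toy))

# Percentile interval: the middle 95% of the bootstrapped estimates
ci <- quantile(slopes, probs = c(0.025, 0.975))
print(ci)
```

The interval is read directly off the bootstrapped distribution, which is the same region you would shade in a histogram of slopes.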
Problem 6: Model Predictions
Using the coefficients returned from your model from Problem 3, get the predicted probability of PP or NP (your choice) for the following:
 AnimacyOfTheme and AnimacyOfRec are both animate, and LengthOfTheme is 5
 AnimacyOfTheme is inanimate, AnimacyOfRec is animate, and LengthOfTheme is 3
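The arithmetic behind a prediction from coefficients can be sketched as follows. This assumes a logistic regression with R's default treatment coding, where "animate" is the reference level and each animacy predictor gets a 0/1 indicator for "inanimate". The coefficient values, the names in b, and the helper predict_prob are all made up for illustration, and the scenario shown is deliberately not one of the two above.

```r
# Hypothetical coefficients on the log-odds scale (made up, NOT fitted values)
b <- c(intercept = 0.2, rec_inanimate = 1.0,
       theme_inanimate = -0.5, length_theme = -0.3)

# Hypothetical helper: build the linear predictor, then apply the inverse logit.
# rec_inanimate and theme_inanimate are 0/1 indicators (0 = animate reference level).
predict_prob <- function(rec_inanimate, theme_inanimate, length_theme) {
  log_odds <- b[["intercept"]] +
    b[["rec_inanimate"]] * rec_inanimate +
    b[["theme_inanimate"]] * theme_inanimate +
    b[["length_theme"]] * length_theme
  plogis(log_odds)
}

# Example scenario: inanimate recipient, inanimate theme, theme length 2
print(predict_prob(rec_inanimate = 1, theme_inanimate = 1, length_theme = 2))
```

Check the level order of your own factors (for example with levels()) before reusing this pattern, since the reference level determines which category the indicator represents and which outcome the probability refers to.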
A previous version of Problem 6 stated:
 Create a visualization to plot your model’s fit (geom_smooth())
You may do either version of Problem 6.