Problem set 5

due Friday, December 8 at 11:59pm

Estimated time: 6 hours

Allocate about 1 hour per problem, though some will take longer than others. You may need more time if programming is completely new to you, or less if you have some experience already.

Instructions
Upload your .ipynb notebook to gradescope by 11:59pm on the due date.
# code here was inspired by user2554330 on stack overflow: 
# https://stackoverflow.com/questions/69091812/is-everything-a-vector-in-r

As always, create a new colab R notebook. Include the title “Problem set 5”, your name, the date, and any collaborators somewhere at the top.

Dataset Information

You will work with the verbs dataset, part of the languageR package. This dataset is a simplified version of the dative dataset, which is described as follows:

Data describing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection.

Dative alternation refers to the fact that some sentences can have two different structures that convey the same basic meaning. In the verbs dataset, we are looking at ditransitive verbs (such as ‘give’ or ‘offer’) which can realize the recipient in two ways: as a noun phrase (NP) “give you a book”, or as a prepositional phrase (PP) “give a book to you”. We want to investigate what factors in the data (AnimacyofRec, AnimacyofTheme, LengthofTheme) impact whether the recipient is realized as a noun phrase (NP) or a prepositional phrase (PP) (RealizationOfRec).

Problem 1: Data Exploration

  • Load the dataset and perform exploratory data analysis
  • Examine the dataset’s structure (glimpse), summary statistics, and handle missing values if any.
  • Visualize the pairwise relationship between the response variable and each of the three explanatory variables using appropriate plots (scatter plots, histograms, etc.).

Problem 2: Model Specification

  • Choose an appropriate type of model and explain why you’ve selected this type (e.g. regression, classification)
  • Specify the model with an equation (e.g. \(y = \beta_0 + \beta_1x_1\)). Note that you can create math equations in google colab text blocks by placing the equation between two dollar signs: $y = \beta_0 + \beta_1x_1$
  • Specify the model in R code (e.g. y ~ x)

Problem 3: Model Fitting

  • Fit the model you specified in R
  • What does the model tell us? How is realization of NP/PP impacted by the animacy of the theme (AnimacyOfTheme), the animacy of the recipient (AnimacyOfRec), and the length of the theme (LengthofTheme)?
  • What is the meaning of the intercept? Convert the estimate for the intercept to probability.

Problem 4: Model Accuracy

  • Assess the accuracy of the model using cross-validation
  • Take care to use collect_metrics() to report any accuracy metrics
  • Add text explaining how accurate your model is, looking at the mean and standard deviation of accuracy.

Problem 5: Model Reliability

  • Assess the reliability of your parameter estimates via bootstrapping
  • Get a 95% confidence interval around the parameter estimates of your model.
  • Visualize your bootstrapped distribution and shade the confidence interval.

If you get the error contrasts can be applied only to factors with 2 or more levels, try the bootstrap with a seed of 2 (set.seed(2)) and a reduced number of replicates reps = 500!

Problem 6: Model Predictions

Using the coefficients returned from your model from Problem 3, get the predicted probability of PP or NP (your choice) for the following:

  • AnimacyOfTheme and AnimacyOfRec are both animate, and LengthOfTheme is 5
  • AnimacyOfTheme is inanimate, AnimacyOfRec is animate, and LengthOfTheme is 3

A previous version of Problem 6 stated:

Create a visualization to plot your model’s fit (geom_smooth())

You have a choice to do either version of Problem 6.