Hello, world!

Data Science for Studying Language and the Mind

Katie Schuler

Data science

Data science is about making decisions based on incomplete information.

Figure 1: from Kok & de Lange (2014)

This concept is not new. Brains were built for doing this!

But we have new tools and lots more data!

Figure 2: from https://www.domo.com/data-never-sleeps.

Data science workflow

The folks who wrote R for Data Science proposed the following data science workflow:

Figure 3: from R for Data Science

Overview of course

We will spend the first few weeks getting comfortable programming in R, including some useful skills for data science:

  • R basics
  • Data importing
  • Data visualization
  • Data wrangling

Then, we will spend the next several weeks building a foundation in basic statistics and model building:

  • Probability distributions
  • Sampling variability
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability

Finally, we will cover a selection of more advanced topics that are often applied in language and mind fields, with a focus on basic understanding:

  • Classification
  • Feature engineering (preprocessing)
  • Inference for regression
  • Mixed-effect models

Syllabus, briefly

Each week will include two lectures and a lab:

  • Lectures are on Tuesdays and Thursdays at 10:15am and will be a mix of conceptual overviews and R tutorials. It is a good idea to bring your laptop so you can follow along and try stuff in R!
  • Lab is on Thursday or Friday and will consist of (ungraded) practice problems and concept review with TAs. You may attend any lab section that works for your schedule.

There are 10 graded assessments:

  • 6 problem sets in which you will be asked to apply your newly acquired R programming skills.
  • 4 quizzes in which you will be tested on your understanding of lecture concepts.

There are a few policies to take note of:

  • Missed quizzes cannot be made up except in cases of genuine conflict or emergency (documentation and course action notice required).
  • You may request an extension of up to 3 days on any problem set. Extensions beyond 3 days will not be granted, because delaying solutions would negatively impact other students.
  • You may submit any missed quiz or problem set by the end of the semester for half-credit (50%).

Why R?

With many programming languages available for data science (e.g. R, Python, Julia, MATLAB), why use R?

  • Built for stats, specifically
  • Makes nice visualizations
  • Lots of people are doing it, especially in academia
  • Easier for beginners to understand
  • Free and open source (so are Python and Julia; MATLAB requires a paid license)

Many ways to use R

Google Colab

  • Google Colab is a cloud-based Jupyter notebook environment that allows you to write, execute, and share code much like a Google Doc.
  • We use Google Colab because it’s simple and accessible to everyone. You can start programming right away, no setup required!

Secretly, R!

Google Colab officially supports Python, but secretly supports R (and Julia, too!)
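
A quick way to check that a notebook is really running R is to run a short cell like the one below. This is only a minimal sketch, assuming the notebook's runtime has already been switched to R; it uses base R only and is not taken from the course materials.

```r
# Print a greeting; if this runs without error, the cell is being executed as R
print("Hello, world!")

# R.version.string is a built-in character string naming the R version in use
print(R.version.string)
```

If the runtime is still set to Python, the second line will fail, which is a handy signal that the runtime type needs to be changed.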