Data Science for Studying Language and the Mind
Data visualization
Adapted from R4DS Ch 9: Layers and some materials by Dr. Colin Rundel at Duke
Google Colab already has ggplot2 installed by default. There is no need to run install.packages().
The basic ggplot (review from last time!)
Data from the penguins dataframe in palmerpenguins
Subjective frequency ratings, ratings of estimated weight, and ratings of estimated size, averaged over subjects, for 81 concrete English nouns. – languageR
We will make use of the following variables:
- Frequency: actual word frequency
- meanFamiliarity: subjective frequency rating
- Class: word class
There are many geoms. We will start with these, and add a few additional geoms as we move through the course:
geom_histogram() | histogram, distribution of a continuous variable
geom_density() | density curve, distribution of a continuous variable
geom_bar() | bar chart, distribution of a categorical variable
geom_point() | scatterplot
geom_smooth() | smoothed line of best fit
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. – R4DS
bins - How many bins should we have?
binwidth - How wide should the bins be?
color - What should the outline color be?
fill - What should the fill color be?
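The four arguments above can be tried out in a short sketch. It assumes a small made-up data frame, toy_ratings, that only borrows the column names of ratings; the values are invented for illustration, and the colors are arbitrary choices.

```r
library(ggplot2)

# Toy stand-in for `ratings` (made-up values, same column names)
toy_ratings <- data.frame(
  Frequency = c(2.1, 3.4, 4.0, 4.8, 5.5, 6.2, 3.0, 4.4, 5.1, 6.8),
  meanFamiliarity = c(1.8, 2.5, 3.1, 3.6, 4.4, 5.0, 2.2, 3.3, 4.1, 5.6),
  Class = rep(c("animal", "plant"), each = 5)
)

# bins: ask for a number of bins and let ggplot2 work out their width
p_bins <- ggplot(toy_ratings, aes(x = Frequency)) +
  geom_histogram(bins = 5, color = "white", fill = "steelblue")

# binwidth: fix the width of each bin instead (in units of x)
p_width <- ggplot(toy_ratings, aes(x = Frequency)) +
  geom_histogram(binwidth = 1, color = "white", fill = "steelblue")

p_bins
p_width
```

Here color and fill are set to constants outside aes(), so they style every bar the same way rather than mapping a variable.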
Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. – R4DS
Map Class to color aesthetic
Set linewidth
Map Class to fill and set alpha
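A minimal sketch of these three steps, again on a made-up toy_ratings data frame that mimics the column names of ratings (the values, linewidth, and alpha are illustrative assumptions):

```r
library(ggplot2)

# Toy stand-in for `ratings` (made-up values, same column names)
toy_ratings <- data.frame(
  Frequency = c(2.1, 3.4, 4.0, 4.8, 5.5, 6.2, 3.0, 4.4, 5.1, 6.8),
  Class = rep(c("animal", "plant"), each = 5)
)

# Map Class to the color aesthetic and set linewidth:
# one outlined curve per class, drawn with a thicker line
p_color <- ggplot(toy_ratings, aes(x = Frequency, color = Class)) +
  geom_density(linewidth = 2)

# Map Class to fill and set alpha:
# one filled curve per class, made semi-transparent so overlaps stay visible
p_fill <- ggplot(toy_ratings, aes(x = Frequency, fill = Class)) +
  geom_density(alpha = 0.5)

p_color
p_fill
```

Mapping happens inside aes() and depends on the data; setting (linewidth, alpha) happens outside aes() and applies one fixed value.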
To examine the distribution of a categorical variable, you can use a bar chart. The height of the bars displays how many observations occurred with each x value. – R4DS
- stacked: We can use stacked bar plots to visualize the relationship between two categorical variables
- relative frequency: We can use relative frequency to visualize the relationship between two categorical variables (as a percentage)
- dodged: We can use a dodged bar plot to visualize the relationship between two categorical variables side-by-side, not stacked
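The three bar plot variants differ only in the position argument of geom_bar(). The sketch below assumes a made-up data frame, toy_words, with two categorical columns; the second column, Complex, and all values are hypothetical and chosen only to have something to count.

```r
library(ggplot2)

# Toy data with two categorical variables (made-up values)
toy_words <- data.frame(
  Class = c("animal", "animal", "animal", "plant",
            "plant", "plant", "plant", "animal"),
  Complex = c("simplex", "complex", "simplex", "simplex",
              "simplex", "complex", "simplex", "simplex")
)

base <- ggplot(toy_words, aes(x = Class, fill = Complex))

p_stack <- base + geom_bar()                    # stacked counts (the default)
p_fill  <- base + geom_bar(position = "fill")   # relative frequency, bars sum to 1
p_dodge <- base + geom_bar(position = "dodge")  # side-by-side counts

p_stack
p_fill
p_dodge
```

With position = "fill" the y-axis shows proportions; adding scale_y_continuous(labels = scales::label_percent()) would display them as percentages, assuming the scales package is available.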
Scatterplots are useful for displaying the relationship between two numerical variables – R4DS
ggplot(
  data = ratings,
  mapping = aes(
    x = Frequency,
    y = meanFamiliarity
  )
) +
  geom_point(
    color = "blue",
    size = 3
  ) +
  labs(
    title = "Subjective frequency ratings",
    subtitle = "for 81 English nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)
Adding geom_smooth() draws a best-fitting curve

ggplot(
  data = ratings,
  mapping = aes(
    x = Frequency,
    y = meanFamiliarity
  )
) +
  geom_point(
    color = "blue",
    size = 3
  ) +
  geom_smooth() +
  labs(
    title = "Subjective frequency ratings",
    subtitle = "for 81 English nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)
Adding geom_smooth(method = "lm") draws the best-fitting linear model

ggplot(
  data = ratings,
  mapping = aes(
    x = Frequency,
    y = meanFamiliarity
  )
) +
  geom_point(
    color = "blue",
    size = 3
  ) +
  geom_smooth(method = "lm") +
  labs(
    title = "Subjective frequency ratings",
    subtitle = "for 81 English nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)
With geom_smooth(method = "lm"), we can also map Class to color by specifying it globally

ggplot(
  data = ratings,
  mapping = aes(
    x = Frequency,
    y = meanFamiliarity,
    color = Class
  )
) +
  geom_point(
    size = 3
  ) +
  geom_smooth(method = "lm") +
  labs(
    title = "Subjective frequency ratings",
    subtitle = "for 81 English nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)
With geom_smooth(method = "lm"), we can instead keep a single smooth by specifying color in the point geom only

ggplot(
  data = ratings,
  mapping = aes(
    x = Frequency,
    y = meanFamiliarity
  )
) +
  geom_point(
    aes(color = Class),
    size = 3
  ) +
  geom_smooth(method = "lm") +
  labs(
    title = "Subjective frequency ratings",
    subtitle = "for 81 English nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)
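One way to see why the global and local mappings behave differently is to count the groups that the smooth layer receives. The sketch below uses a made-up toy_ratings data frame (invented values, same column names as ratings) and a small helper, n_smooths(), which is a hypothetical name introduced here and not part of ggplot2.

```r
library(ggplot2)

# Toy stand-in for `ratings` (made-up values, same column names)
toy_ratings <- data.frame(
  Frequency = c(2.1, 3.4, 4.0, 4.8, 5.5, 6.2, 3.0, 4.4, 5.1, 6.8),
  meanFamiliarity = c(1.8, 2.5, 3.1, 3.6, 4.4, 5.0, 2.2, 3.3, 4.1, 5.6),
  Class = rep(c("animal", "plant"), each = 5)
)

# Global mapping: every layer, including the smooth, inherits color = Class
p_global <- ggplot(
  toy_ratings,
  aes(x = Frequency, y = meanFamiliarity, color = Class)
) +
  geom_point(size = 3) +
  geom_smooth(method = "lm")

# Local mapping: only the points know about Class
p_local <- ggplot(toy_ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(aes(color = Class), size = 3) +
  geom_smooth(method = "lm")

# Hypothetical helper: number of separate fits drawn by layer 2 (the smooth)
n_smooths <- function(p) length(unique(layer_data(p, 2)$group))

n_smooths(p_global)  # 2: one line per Class
n_smooths(p_local)   # 1: a single line for all points
```

Aesthetics given in ggplot() are inherited by all layers, while aesthetics given inside a geom apply to that layer alone.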
Facets: smaller plots that display different subsets of data

facet_grid()
- just columns
- just columns, and note we can still map other aesthetics!
- just rows

facet_wrap()
- number of columns

# scale_color_brewer() swaps in a ColorBrewer palette for the color scale
ggplot(
  data = ratings,
  mapping = aes(
    x = Frequency,
    y = meanFamiliarity
  )
) +
  geom_point(
    mapping = aes(color = Class),
    size = 3
  ) +
  labs(
    title = "Subjective frequency ratings",
    subtitle = "for 81 English nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20) +
  scale_color_brewer(palette = "Paired")
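Returning to facets, the facet_grid() and facet_wrap() variants can be sketched as follows. The data frame toy_ratings is made up (invented values, same column names as ratings), and faceting by Class is an assumption chosen for illustration.

```r
library(ggplot2)

# Toy stand-in for `ratings` (made-up values, same column names)
toy_ratings <- data.frame(
  Frequency = c(2.1, 3.4, 4.0, 4.8, 5.5, 6.2, 3.0, 4.4, 5.1, 6.8),
  meanFamiliarity = c(1.8, 2.5, 3.1, 3.6, 4.4, 5.0, 2.2, 3.3, 4.1, 5.6),
  Class = rep(c("animal", "plant"), each = 5)
)

# Other aesthetics (here color) can still be mapped alongside facets
base <- ggplot(toy_ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(aes(color = Class), size = 3)

p_cols <- base + facet_grid(. ~ Class)         # facet_grid(): just columns
p_rows <- base + facet_grid(Class ~ .)         # facet_grid(): just rows
p_wrap <- base + facet_wrap(~ Class, ncol = 1) # facet_wrap(): set number of columns

p_cols
p_rows
p_wrap
```

In the facet_grid() formula the left side gives rows and the right side gives columns, with a dot meaning "nothing on this side".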
last_plot() returns the last plot
ggsave() saves the last plot
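A minimal sketch of saving a plot to a file, assuming a made-up toy_ratings data frame and an illustrative file name in a temporary directory. The plot is passed explicitly here; if the plot argument is left out, ggsave() falls back to last_plot().

```r
library(ggplot2)

# Toy stand-in for `ratings` (made-up values, same column names)
toy_ratings <- data.frame(
  Frequency = c(2.1, 3.4, 4.0, 4.8, 5.5, 6.2),
  meanFamiliarity = c(1.8, 2.5, 3.1, 3.6, 4.4, 5.0)
)

p <- ggplot(toy_ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(size = 3)

# Illustrative output path; the file extension decides the format (pdf, png, ...)
out_file <- file.path(tempdir(), "ratings-scatter.pdf")

# width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

Setting width and height explicitly keeps the saved figure the same size no matter how large the plotting window happens to be.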
ggplot2 comes with many complete themes
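As a quick sketch, a complete theme is added to a plot like any other layer. The toy_ratings data frame is made up for illustration, and the four themes shown are just a sample of those that ship with ggplot2.

```r
library(ggplot2)

# Toy stand-in for `ratings` (made-up values, same column names)
toy_ratings <- data.frame(
  Frequency = c(2.1, 3.4, 4.0, 4.8, 5.5, 6.2),
  meanFamiliarity = c(1.8, 2.5, 3.1, 3.6, 4.4, 5.0)
)

base <- ggplot(toy_ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(size = 3)

# Same plot, four different complete themes
themed <- list(
  gray    = base + theme_gray(),     # the default look
  bw      = base + theme_bw(),
  minimal = base + theme_minimal(),
  classic = base + theme_classic(base_size = 20)  # base_size scales all text
)

themed$classic
```

A complete theme changes only non-data elements such as the background, grid lines, and fonts; the data layers stay the same.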
the pipe takes the thing on its left and passes it along to the function on its right
x %>% f(y) is equivalent to f(x, y)
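The equivalence can be checked directly. This sketch assumes the magrittr package is installed (it provides %>% and is also re-exported by dplyr); the numbers and the toy_ratings data frame are made up for illustration.

```r
library(magrittr)
library(ggplot2)

x <- c(3.14159, 2.71828)

# x %>% f(y) is the same call as f(x, y)
piped  <- x %>% round(2)
direct <- round(x, 2)
identical(piped, direct)

# Pipes chain left to right: take the values, then sqrt(), then sum()
chained <- c(1, 4, 9) %>% sqrt() %>% sum()

# A data frame can be piped into ggplot() as its first argument (data);
# layers are still added with +, not with the pipe
toy_ratings <- data.frame(
  Frequency = c(2.1, 3.4, 4.0, 4.8),
  meanFamiliarity = c(1.8, 2.5, 3.1, 3.6)
)
p <- toy_ratings %>%
  ggplot(aes(x = Frequency, y = meanFamiliarity)) +
  geom_point()
```

Mixing up %>% and + is a common slip: the pipe passes data between functions, while + adds layers to an existing plot.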
The pipe and ggplot
Practice adding aesthetics and layers by creating this!
Figure 2: Data from the penguins dataframe in palmerpenguins
Need a challenge? Use the data from the datasauRus R package to create this!