Data Visualization

Data Science for Studying Language and the Mind

Katie Schuler

2023-09-07

You are here

Data science with R
  • Hello, world!
  • R basics
  • Data importing
  • Data visualization
  • Data wrangling
Stats & Model building
  • Probability distributions
  • Sampling variability
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model accuracy
  • Model reliability
More advanced
  • Classification
  • Feature engineering (preprocessing)
  • Inference for regression
  • Mixed-effect models

Acknowledgements

Adapted from R for Data Science Ch 2 and some materials by Dr. Colin Rundel at Duke

Why visualize?

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Datasaurus dozen

# A tibble: 13 × 6
   dataset    mean_x mean_y std_dev_x std_dev_y corr_x_y
   <chr>       <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
 1 away         54.3   47.8      16.8      26.9  -0.0641
 2 bullseye     54.3   47.8      16.8      26.9  -0.0686
 3 circle       54.3   47.8      16.8      26.9  -0.0683
 4 dino         54.3   47.8      16.8      26.9  -0.0645
 5 dots         54.3   47.8      16.8      26.9  -0.0603
 6 h_lines      54.3   47.8      16.8      26.9  -0.0617
 7 high_lines   54.3   47.8      16.8      26.9  -0.0685
 8 slant_down   54.3   47.8      16.8      26.9  -0.0690
 9 slant_up     54.3   47.8      16.8      26.9  -0.0686
10 star         54.3   47.8      16.8      26.9  -0.0630
11 v_lines      54.3   47.8      16.8      26.9  -0.0694
12 wide_lines   54.3   47.8      16.8      26.9  -0.0666
13 x_shape      54.3   47.8      16.8      26.9  -0.0656
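
A summary table like the one above could be computed as follows. This is a sketch that assumes the data come from the datasauRus package (its datasaurus_dozen data frame with columns dataset, x, and y); the slides do not say how the table was produced, and the name dino_summary is only illustrative.

```r
# Assumes the dplyr and datasauRus packages are installed
library(dplyr)
library(datasauRus)

# One row of summary statistics per dataset
dino_summary <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x    = mean(x),
    mean_y    = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y  = cor(x, y)
  )

dino_summary
```

The summaries are nearly identical across datasets, which is why the numbers alone cannot tell the datasets apart.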

Datasaurus dozen
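
One way to see how different these datasets are is to plot each one in its own panel. A minimal sketch, again assuming the datasauRus package; the number of columns is an arbitrary choice and dino_plot is an illustrative name.

```r
# Assumes the ggplot2 and datasauRus packages are installed
library(ggplot2)
library(datasauRus)

dino_plot <- ggplot(
    data = datasaurus_dozen,
    mapping = aes(x = x, y = y)
) +
  geom_point() +
  # one small panel per dataset; ncol = 5 is an arbitrary layout choice
  facet_wrap(~ dataset, ncol = 5)

dino_plot
```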

Why visualize?

“Visualization is a fundamentally human activity. A good visualization will show you things you did not expect or raise new questions about the data. A good visualization might also hint that you’re asking the wrong question or that you need to collect different data. Visualizations can surprise you, but they don’t scale particularly well because they require a human to interpret them.” – R4DS

ggplot2 loads with tidyverse

library(tidyverse)

Why ggplot2?

“R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.” – R4DS

Today’s data: ratings

Subjective frequency ratings, ratings of estimated weight, and ratings of estimated size, averaged over subjects, for 81 concrete English nouns. – languageR

library(languageR)
glimpse(ratings)
Rows: 81
Columns: 14
$ Word             <fct> almond, ant, apple, apricot, asparagus, avocado, badg…
$ Frequency        <dbl> 4.204693, 5.347108, 6.304449, 3.828641, 3.663562, 3.4…
$ FamilySize       <dbl> 0.0000000, 1.3862944, 1.0986123, 0.0000000, 0.0000000…
$ SynsetCount      <dbl> 1.0986123, 1.0986123, 1.0986123, 1.3862944, 1.0986123…
$ Length           <int> 6, 3, 5, 7, 9, 7, 6, 6, 3, 6, 3, 8, 10, 9, 8, 5, 9, 5…
$ Class            <fct> plant, animal, plant, plant, plant, plant, animal, pl…
$ FreqSingular     <int> 24, 69, 315, 26, 19, 24, 53, 74, 155, 37, 118, 15, 26…
$ FreqPlural       <int> 42, 140, 231, 19, 19, 6, 78, 77, 103, 14, 180, 19, 31…
$ DerivEntropy     <dbl> 0.0000, 0.5620, 0.4960, 0.0000, 0.0000, 0.0000, 0.634…
$ Complex          <fct> simplex, simplex, simplex, simplex, simplex, simplex,…
$ rInfl            <dbl> -0.54232429, -0.70026465, 0.30900484, 0.30010459, 0.0…
$ meanWeightRating <dbl> 1.4860, 3.3489, 2.1948, 1.3216, 1.4424, 1.3256, 3.047…
$ meanSizeRating   <dbl> 1.8912, 3.6275, 2.4730, 1.7597, 1.8660, 1.7737, 3.369…
$ meanFamiliarity  <dbl> 3.72, 3.60, 5.84, 4.40, 3.68, 4.12, 2.12, 5.68, 3.20,…

Today’s data: ratings

We will make use of the following variables:

  1. Frequency - actual word frequency (log-transformed)
  2. meanFamiliarity - subjective frequency rating
  3. Class - whether the word is a plant or an animal

Rows: 81
Columns: 4
$ Word            <fct> almond, ant, apple, apricot, asparagus, avocado, badge…
$ Frequency       <dbl> 4.204693, 5.347108, 6.304449, 3.828641, 3.663562, 3.43…
$ meanFamiliarity <dbl> 3.72, 3.60, 5.84, 4.40, 3.68, 4.12, 2.12, 5.68, 3.20, …
$ Class           <fct> plant, animal, plant, plant, plant, plant, animal, pla…
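
The four-column view above could be produced by selecting just those variables. This is a sketch using dplyr's select(); the slides do not show the code, and ratings_small is an illustrative name.

```r
# Assumes the dplyr and languageR packages are installed
library(dplyr)
library(languageR)

# Keep only the variables used in today's plots
ratings_small <- ratings %>%
  select(Word, Frequency, meanFamiliarity, Class)

glimpse(ratings_small)
```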

Today’s goal

Create this figure showing the relationship between actual frequency and subjective frequency rating of each word, colored by the class the word belongs to.

The basic ggplot

  1. Start with your data
  2. Define how variables in your dataset are mapped to visual properties (aesthetics)
  3. Determine the geometric object that the plot uses to represent the data (geom)

1 data

Use ratings data

ggplot(
    data = ratings
 )

2 aesthetic mapping

Map Frequency to x-axis and meanFamiliarity to y-axis.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 )

3 geom

Represent each observation (word) with a point.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point()

Adding to the basics

Mapping categorical variables

When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here color) … a process known as scaling. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity,
        color = Class
    )
 ) +
  geom_point()

Global vs. local aesthetics

Aesthetic mappings can be defined:

  • globally in ggplot(), in which case they are passed down to all geoms
  • locally in geom_*(), in which case they are used by that geom only

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class)
    ) 

Mapping vs. setting aesthetics

  • mapping allows us to determine a geom’s aesthetics based on a variable, and is passed as an argument in aes()
  • setting allows us to set a geom’s aesthetics to a constant value (not based on any variable), and is passed as an argument to geom_*() directly

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) 
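
A common slip is to put a constant inside aes(). The sketch below contrasts the two cases; the object names p_set and p_map are illustrative.

```r
# Assumes the ggplot2 and languageR packages are installed
library(ggplot2)
library(languageR)

# Setting: color is a constant passed to the geom, so the points are blue
p_set <- ggplot(ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(color = "blue", size = 3)

# Mapping a constant by mistake: "blue" is treated as a one-level
# categorical variable, so ggplot2 picks a color from its default
# scale and adds a legend labelled "blue"
p_map <- ggplot(ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(aes(color = "blue"), size = 3)
```

Printing p_set and p_map side by side shows the difference: only the first plot has blue points.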

Labels

labels: title and subtitle

Add title “Subjective frequency ratings” with subtitle “for 81 english nouns”

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns"
  ) 

labels: x and y axis

Label x-axis “Actual frequency” and y-axis “Frequency rating”

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating"
  ) 

labels: legend

Label the legend “word class”.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) 

Theme and adjusting scaling

themes

Apply classic theme with base_size 20.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20) 

scales: changing color

Remember: When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here color) … a process known as scaling. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20) +
  scale_color_brewer(palette = "Paired")

Aesthetics

color

Map the color aesthetic to a variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class),
    size = 3
    ) +
 theme_classic(base_size = 20)

color

Set a constant value for the color aesthetic

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    color = "blue",
    size = 3
    ) +
 theme_classic(base_size = 20)

size

Set a constant value for the size aesthetic

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class),
    size = 3
    ) +
 theme_classic(base_size = 20)

size

Map the size aesthetic to a variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        size = Complex
    )
    ) +
 theme_classic(base_size = 20)

shape

Map the shape aesthetic to a different variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        shape = Complex),
    size = 3
    ) +
 theme_classic(base_size = 20)

shape

Map the shape aesthetic to the same variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        shape = Class),
    size = 3
    ) +
 theme_classic(base_size = 20)

alpha

Set a constant value for the alpha aesthetic

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        shape = Class),
    alpha = 0.5,
    size = 3
    ) +
 theme_classic(base_size = 20)

alpha

Map the alpha aesthetic to a variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        shape = Class,
        alpha = Length),
    size = 3
    ) +
 theme_classic(base_size = 20)

Geometric objects

geom_*() aka geoms

There are many. We will start with these, and add a few additional geoms as we move through the course:

geom_histogram() - histogram, distribution of a continuous variable
geom_density() - density curve, distribution of a continuous variable
geom_bar() - bar chart, distribution of a categorical variable
geom_point() - scatterplot
geom_smooth() - smoothed line of best fit

geom_histogram()

A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram() 

geom_histogram()

bins - How many bins should we have?

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram(
        bins = 10
    )

geom_histogram()

binwidth - How wide should the bins be?

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram(
        binwidth = 0.25
    )

geom_histogram()

color - What should the outline color be?

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram(
        binwidth = 0.25,
        color = "red"
    )

geom_histogram()

fill - What should the fill color be?

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram(
        binwidth = 0.25,
        color = "red",
        fill = "lightblue"
    )

geom_density()

Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_density() 
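
To connect the analogy to the data, a histogram and a density curve can be drawn on the same plot. This is a sketch, not part of the original slides: after_stat(density) rescales the bar heights so that the bars and the curve share a y-axis, and the binwidth and colors are arbitrary choices.

```r
# Assumes the ggplot2 (>= 3.4.0) and languageR packages are installed
library(ggplot2)
library(languageR)

p_overlay <- ggplot(
    data = ratings,
    mapping = aes(x = meanFamiliarity)
) +
  # after_stat(density) rescales bar heights so the total bar area is 1
  geom_histogram(
    mapping = aes(y = after_stat(density)),
    binwidth = 0.25,
    fill = "lightblue"
  ) +
  # the density curve is drawn on top of the bars
  geom_density(linewidth = 1)

p_overlay
```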

geom_density()

Map Class to color aesthetic

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity,
        color = Class
    )
) + 
    geom_density() 

geom_density()

Set linewidth

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity,
        color = Class
    )
) + 
    geom_density(linewidth = 2) 

geom_density()

Map Class to fill and set alpha

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity,
        fill = Class
    )
) + 
    geom_density(alpha = 0.5) 

geom_bar()

To examine the distribution of a categorical variable, you can use a bar chart. The height of the bars displays how many observations occurred with each x value. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = Class
    )
) + 
    geom_bar()

geom_bar() - stacked

We can use stacked bar plots to visualize the relationship between two categorical variables

ggplot(
    data = ratings,
    mapping = aes(
        x = Class,
        fill = Complex
    )
) + 
    geom_bar()

geom_bar() - relative frequency

We can use relative frequency to visualize the relationship between two categorical variables (as a percentage)

ggplot(
    data = ratings,
    mapping = aes(
        x = Class,
        fill = Complex
    )
) + 
    geom_bar(position = "fill")
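
The proportions that position = "fill" displays can also be computed directly. A sketch using dplyr (class_props is an illustrative name): count each Class and Complex combination, then divide by the total within each Class.

```r
# Assumes the dplyr and languageR packages are installed
library(dplyr)
library(languageR)

class_props <- ratings %>%
  count(Class, Complex) %>%       # number of words per combination
  group_by(Class) %>%
  mutate(prop = n / sum(n)) %>%   # share of each Complex level within a Class
  ungroup()

class_props
```

Within each Class the prop values sum to 1, which matches the full-height bars in the plot.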

geom_bar() - dodged

We can use a dodged bar plot to visualize the relationship between two categorical variables side-by-side, not stacked

ggplot(
    data = ratings,
    mapping = aes(
        x = Class,
        fill = Complex
    )
) + 
    geom_bar(position = "dodge")

geom_point()

Scatterplots are useful for displaying the relationship between two numerical variables – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    color = "blue", 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating"
  ) +
  theme_classic(base_size = 20) 

geom_point() with geom_smooth()

Draws a best-fitting smooth curve (by default, a loess smoother for small datasets like this one)

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    color = "blue", 
    size = 3
    ) +
  geom_smooth() +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating"
  ) +
  theme_classic(base_size = 20) 

geom_point() with geom_smooth(method="lm")

Draws the best-fitting straight line from a linear model

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    color = "blue", 
    size = 3
    ) +
  geom_smooth(method="lm") +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating"
  ) +
  theme_classic(base_size = 20) 
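
With method = "lm", geom_smooth() fits a linear model of y on x (its default formula is y ~ x). The same model can be fit directly with lm() to see the intercept and slope of the line. A minimal sketch; fit is an illustrative name.

```r
# Assumes the languageR package is installed
library(languageR)

# Same variables as the plot: y = meanFamiliarity, x = Frequency
fit <- lm(meanFamiliarity ~ Frequency, data = ratings)

coef(fit)  # intercept and slope of the fitted line
```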

geom_point() with geom_smooth(method="lm")

We can also map Class to color globally, so each class gets its own points and its own fitted line

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity,
        color = Class
    )
 ) +
  geom_point( 
    size = 3
    ) +
  geom_smooth(method="lm") +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20) 

geom_point() with geom_smooth(method="lm")

Or draw a single smooth for all the data, by mapping color in the point geom only

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    aes(color = Class),
    size = 3
    ) +
  geom_smooth(method="lm") +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20) 

Facets

smaller plots that display different subsets of data

facet_grid()

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_grid(Class ~ Complex) +
  theme_classic(base_size = 20) 

facet_grid()

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_grid(Class ~ Complex) +
  theme_classic(base_size = 20) 

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point(
    aes(color = Class, shape = Complex)
  ) +
  theme_classic(base_size = 20) 

facet_grid() - just columns

Use . in place of the row variable to facet by columns only

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_grid(. ~ Complex) +
  theme_classic(base_size = 20) 

facet_grid() - just columns

and note we can still map other aesthetics!

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point(
    aes(color = Class),
    shape = "triangle"
  ) +
  facet_grid(. ~ Complex) +
  theme_classic(base_size = 20) 

facet_grid() - just rows

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_grid(Class ~ .) +
  theme_classic(base_size = 20) 

facet_wrap()

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_wrap(~ Class) +
  theme_classic(base_size = 20) 

facet_wrap() - number of columns

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_wrap(~ Class, ncol = 1) +
  theme_classic(base_size = 20) 

Helper functions

remember our goal plot?

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20) +
  scale_color_brewer(palette = "Paired")

last_plot()

returns the last plot

last_plot()

ggsave()

saves the last plot to a file (width and height are in inches by default)

ggsave("plot.png", width=5, height=5)
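
ggsave() can also save a specific plot object rather than the last plot displayed. A sketch with an illustrative object name and a hypothetical file name; units and dpi are shown explicitly here, although "in" and 300 are the defaults.

```r
# Assumes the ggplot2 and languageR packages are installed
library(ggplot2)
library(languageR)

p <- ggplot(ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point()

# Hypothetical file name, written to a temporary directory
out_file <- file.path(tempdir(), "ratings-scatter.png")

# plot = p saves this object even if another plot was drawn more recently
ggsave(out_file, plot = p, width = 5, height = 5, units = "in", dpi = 300)
```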

Themes

ggplot2 comes with many complete themes

Default theme

last_plot() + theme_gray(base_size=20)

Sample themes

last_plot() + theme_bw(base_size=20)

last_plot() + theme_classic(base_size=20)

last_plot() + theme_minimal(base_size=20)

last_plot() + theme_void(base_size=20)

Shortcuts

ggplot2 calls

Explicit argument names:

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
) + 
   geom_point()

Implied argument names:

ggplot(
    ratings,
    aes(
        x = Frequency,
        y = meanFamiliarity
    )
) + 
   geom_point()

the pipe %>%

the pipe takes the thing on its left and passes it along to the function on its right

x %>% f(y) is equivalent to f(x, y)

x <- c(1.0, 2.245, 3, 4.22222)
x
[1] 1.00000 2.24500 3.00000 4.22222
# pass x as an argument to function usual way
round(x, digits = 2)
[1] 1.00 2.24 3.00 4.22
# pass x as an argument to function with pipe
x %>% round(digits = 2)
[1] 1.00 2.24 3.00 4.22

the pipe %>% and ggplot

Explicit argument names:

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
) + 
   geom_point()

Implied argument names + pipe:

ratings %>% 
ggplot(
    aes(
        x = Frequency,
        y = meanFamiliarity
    )
) + 
   geom_point()
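
The pipe is most useful when the data need a step or two of preparation before plotting. The sketch below (not from the original slides) keeps only the animal words and then plots them. Note that %>% passes data into ggplot(), while layers are still added with +. The names animals and p_animals are illustrative.

```r
# Assumes the dplyr, ggplot2, and languageR packages are installed
library(dplyr)
library(ggplot2)
library(languageR)

# Keep only the rows for animal words
animals <- ratings %>%
  filter(Class == "animal")

# Pipe the filtered data into ggplot(), then add layers with +
p_animals <- animals %>%
  ggplot(aes(x = Frequency, y = meanFamiliarity)) +
  geom_point()

p_animals
```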

Exercise 1

The basic ggplot

Figure 1: Data from penguins dataframe in palmerpenguins package

Exercise 2

Adding aesthetics and layers

Figure 2: Data from penguins dataframe in palmerpenguins package