ggplot2 examples

Kith Pradhan

2024-05-14

A Basic Scatter Plot with geom_point

The first example will use the builtin mtcars dataset to show off a simple scatterplot.

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#this library makes the output look a little better
#if it's not installed yet call
#install.packages("pander")
library("pander")

#load the data
data(mtcars)

#look at the first few rows
#mtcars %>% head() %>% pander()

#here's another way to do it
pander(head(mtcars))
Table continues below
  mpg cyl disp hp drat wt qsec vs
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0
Valiant 18.1 6 225 105 2.76 3.46 20.22 1
  am gear carb
Mazda RX4 1 4 4
Mazda RX4 Wag 1 4 4
Datsun 710 1 4 1
Hornet 4 Drive 0 3 1
Hornet Sportabout 0 3 2
Valiant 0 3 1

Let’s look at the relationship between mpg and hp with a simple scatterplot using geom_point().

Remember these three key points to using ggplot2.

  1. the data
  2. the geom
  3. the mapping from data to geom.
ggplot(mtcars, aes(x=mpg, y=hp)) +
    geom_point()

aesthetics

We can assign variables to be visualized through the geom’s features. In the code below, the gear of the car is shown off as color. gear is being treated as a numeric and ranges over 3 distinct values.

ggplot(mtcars, aes(x=mpg, y=hp, color=gear)) +
    geom_point()

By default, gear is stored as a floating point number. If you change its class to a factor, ggplot will treat it as a categorical variable, and assign attributes based on a discrete scale rather than continuous.

#turn gear into a factor
mtcars$gear2 = as.factor(mtcars$gear)
#Here's another way to do it
mtcars <- mtcars %>% mutate(gear2 = as.factor(gear))

pander(head(mtcars))
Table continues below
  mpg cyl disp hp drat wt qsec vs
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0
Valiant 18.1 6 225 105 2.76 3.46 20.22 1
  am gear carb gear2
Mazda RX4 1 4 4 4
Mazda RX4 Wag 1 4 4 4
Datsun 710 1 4 1 4
Hornet 4 Drive 0 3 1 3
Hornet Sportabout 0 3 2 3
Valiant 0 3 1 3
ggplot(mtcars, aes(x=mpg, y=hp, color=gear2)) +
    geom_point()

Displaying Heatmaps with geom_tile

In these examples we’ll be using the builtin starwars dataset to show off how to make a heatmap in ggplot2. If you’ve made heatmaps in base R with the image() function, you’re already familiar with how to display data in matrix form. This is a very intuitive way to handle 2D data.

ggplot2 uses a different way to handle 2D data for heatmaps. Here, we treat each individual heatmap tile as a separate observation. Instead of a rectangular matrix with nSubjects rows and nVariables columns, we need to make a new data structure that has nSubjects * nVariables rows, with one or more columns to distinguish the variables and their values.

Luckily for us, there is a tidyverse function called unnest() that can rearrange the data for us automatically.

For the first example, let’s look at the films that each character had an appearance in using geom_tile()

data("starwars")
head(starwars$name)
## [1] "Luke Skywalker" "C-3PO"          "R2-D2"          "Darth Vader"   
## [5] "Leia Organa"    "Owen Lars"
head(starwars$films)
## [[1]]
## [1] "The Empire Strikes Back" "Revenge of the Sith"    
## [3] "Return of the Jedi"      "A New Hope"             
## [5] "The Force Awakens"      
## 
## [[2]]
## [1] "The Empire Strikes Back" "Attack of the Clones"   
## [3] "The Phantom Menace"      "Revenge of the Sith"    
## [5] "Return of the Jedi"      "A New Hope"             
## 
## [[3]]
## [1] "The Empire Strikes Back" "Attack of the Clones"   
## [3] "The Phantom Menace"      "Revenge of the Sith"    
## [5] "Return of the Jedi"      "A New Hope"             
## [7] "The Force Awakens"      
## 
## [[4]]
## [1] "The Empire Strikes Back" "Revenge of the Sith"    
## [3] "Return of the Jedi"      "A New Hope"             
## 
## [[5]]
## [1] "The Empire Strikes Back" "Revenge of the Sith"    
## [3] "Return of the Jedi"      "A New Hope"             
## [5] "The Force Awakens"      
## 
## [[6]]
## [1] "Attack of the Clones" "Revenge of the Sith"  "A New Hope"
#This is the original form of the data
starwars %>% select(name, films) %>% head() %>% pander()
name films
Luke Skywalker The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens
C-3PO The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope
R2-D2 The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens
Darth Vader The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope
Leia Organa The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens
Owen Lars Attack of the Clones, Revenge of the Sith, A New Hope
#this is how it needs to be to use ggplot's heatmap
starwars_films <- starwars %>% select(name, films) %>% unnest(films)
starwars_films %>% head(30) %>% pander()
name films
Luke Skywalker The Empire Strikes Back
Luke Skywalker Revenge of the Sith
Luke Skywalker Return of the Jedi
Luke Skywalker A New Hope
Luke Skywalker The Force Awakens
C-3PO The Empire Strikes Back
C-3PO Attack of the Clones
C-3PO The Phantom Menace
C-3PO Revenge of the Sith
C-3PO Return of the Jedi
C-3PO A New Hope
R2-D2 The Empire Strikes Back
R2-D2 Attack of the Clones
R2-D2 The Phantom Menace
R2-D2 Revenge of the Sith
R2-D2 Return of the Jedi
R2-D2 A New Hope
R2-D2 The Force Awakens
Darth Vader The Empire Strikes Back
Darth Vader Revenge of the Sith
Darth Vader Return of the Jedi
Darth Vader A New Hope
Leia Organa The Empire Strikes Back
Leia Organa Revenge of the Sith
Leia Organa Return of the Jedi
Leia Organa A New Hope
Leia Organa The Force Awakens
Owen Lars Attack of the Clones
Owen Lars Revenge of the Sith
Owen Lars A New Hope
#now show the heatmap
ggplot(starwars_films, aes(x=films, y=name)) +
    geom_tile()

Let’s look at another heatmap. The types of starships piloted by each character, broken down by the pilot’s species. Here, we’re rearranging the necessary info into a tiled data structure, then we’re adding extra info from the original data with left_join()

#make a new data.frame that has the info we want to visualize
starwars_ships <- starwars %>% select(name, starships) %>% unnest(starships)
pander(starwars_ships) 
name starships
Luke Skywalker X-wing
Luke Skywalker Imperial shuttle
Darth Vader TIE Advanced x1
Biggs Darklighter X-wing
Obi-Wan Kenobi Jedi starfighter
Obi-Wan Kenobi Trade Federation cruiser
Obi-Wan Kenobi Naboo star skiff
Obi-Wan Kenobi Jedi Interceptor
Obi-Wan Kenobi Belbullab-22 starfighter
Anakin Skywalker Trade Federation cruiser
Anakin Skywalker Jedi Interceptor
Anakin Skywalker Naboo fighter
Chewbacca Millennium Falcon
Chewbacca Imperial shuttle
Han Solo Millennium Falcon
Han Solo Imperial shuttle
Wedge Antilles X-wing
Jek Tono Porkins X-wing
Boba Fett Slave 1
Lando Calrissian Millennium Falcon
Arvel Crynyd A-wing
Nien Nunb Millennium Falcon
Ric Olié Naboo Royal Starship
Darth Maul Scimitar
Plo Koon Jedi starfighter
Gregar Typho Naboo fighter
Grievous Belbullab-22 starfighter
Poe Dameron T-70 X-wing fighter
Padmé Amidala H-type Nubian yacht
Padmé Amidala Naboo star skiff
Padmé Amidala Naboo fighter
#add info from the original data.frame to the new data.frame
starwars_shipsSpecies <- left_join(starwars_ships, select(starwars, name, species), by="name")
pander(starwars_shipsSpecies)
name starships species
Luke Skywalker X-wing Human
Luke Skywalker Imperial shuttle Human
Darth Vader TIE Advanced x1 Human
Biggs Darklighter X-wing Human
Obi-Wan Kenobi Jedi starfighter Human
Obi-Wan Kenobi Trade Federation cruiser Human
Obi-Wan Kenobi Naboo star skiff Human
Obi-Wan Kenobi Jedi Interceptor Human
Obi-Wan Kenobi Belbullab-22 starfighter Human
Anakin Skywalker Trade Federation cruiser Human
Anakin Skywalker Jedi Interceptor Human
Anakin Skywalker Naboo fighter Human
Chewbacca Millennium Falcon Wookiee
Chewbacca Imperial shuttle Wookiee
Han Solo Millennium Falcon Human
Han Solo Imperial shuttle Human
Wedge Antilles X-wing Human
Jek Tono Porkins X-wing Human
Boba Fett Slave 1 Human
Lando Calrissian Millennium Falcon Human
Arvel Crynyd A-wing Human
Nien Nunb Millennium Falcon Sullustan
Ric Olié Naboo Royal Starship NA
Darth Maul Scimitar Zabrak
Plo Koon Jedi starfighter Kel Dor
Gregar Typho Naboo fighter Human
Grievous Belbullab-22 starfighter Kaleesh
Poe Dameron T-70 X-wing fighter Human
Padmé Amidala H-type Nubian yacht Human
Padmé Amidala Naboo star skiff Human
Padmé Amidala Naboo fighter Human
#color the heatmap elements by species
ggplot(starwars_shipsSpecies, aes(x=starships, y=name, fill=species)) +
    geom_tile() + 
    theme(axis.text.x = element_text(angle = 90))

Exploring the Olympics Dataset

Here are a few more ggplot2 examples using the olympics dataset.
We’ll be looking for associations between the number/type of medals won and other interesting variables.

events <- read_csv("athlete_events.csv")
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl  (5): ID, Age, Height, Weight, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This dataset contains 271116 rows and 15 columns. To make things more manageable, we’ll be restricting our exploration to only the most popular games and teams.

Question 1 (review)

How do you find the sports & teams with the largest number of players?

#sports with the most number of players
events  %>% count(Sport) %>% arrange(desc(n)) %>% top_n(n=10, wt=n) 
## # A tibble: 10 × 2
##    Sport                    n
##    <chr>                <int>
##  1 Athletics            38624
##  2 Gymnastics           26707
##  3 Swimming             23195
##  4 Shooting             11448
##  5 Cycling              10859
##  6 Fencing              10735
##  7 Rowing               10595
##  8 Cross Country Skiing  9133
##  9 Alpine Skiing         8829
## 10 Wrestling             7154
#from here you can extract the names from the Sport column
top_sports <- events  %>% 
  count(Sport) %>% 
  arrange(desc(n)) %>% 
  top_n(n=10, wt=n) %>%
  pull(Sport)

top_sports
##  [1] "Athletics"            "Gymnastics"           "Swimming"            
##  [4] "Shooting"             "Cycling"              "Fencing"             
##  [7] "Rowing"               "Cross Country Skiing" "Alpine Skiing"       
## [10] "Wrestling"
#another way to do it
top_sports <- names(tail(sort(table(events$Sport)), 10))
top_teams <- names(tail(sort(table(events$Team)), 10))

top_sports
##  [1] "Wrestling"            "Alpine Skiing"        "Cross Country Skiing"
##  [4] "Rowing"               "Fencing"              "Cycling"             
##  [7] "Shooting"             "Swimming"             "Gymnastics"          
## [10] "Athletics"
top_teams
##  [1] "Hungary"       "Australia"     "Sweden"        "Japan"        
##  [5] "Canada"        "Germany"       "Italy"         "Great Britain"
##  [9] "France"        "United States"

Question 2 (review)

Let’s also restrict ourselves to only using the players that won a medal. We want the players that won a medal in any of the top sports from any of the top teams. How do you set up this smaller dataset?

events_top <- events %>% 
    filter(!is.na(Medal )) %>%
    filter(Sport %in% top_sports, Team %in% top_teams)

events_top
## # A tibble: 9,881 × 15
##       ID Name     Sex     Age Height Weight Team  NOC   Games  Year Season City 
##    <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
##  1    56 Ren Aba… M        21     NA     NA Fran… FRA   1956…  1956 Summer Melb…
##  2    62 Giovann… M        21    198     90 Italy ITA   2016…  2016 Summer Rio …
##  3    84 Stephen… M        26    165     55 Unit… USA   2004…  2004 Summer Athi…
##  4    99 Pter Ab… M        30    181     79 Hung… HUN   1992…  1992 Summer Barc…
##  5   100 Oszkr A… M        22     NA     NA Hung… HUN   1936…  1936 Summer Berl…
##  6   106 Agostin… M        22    188     96 Italy ITA   1988…  1988 Summer Seoul
##  7   106 Agostin… M        29    188     96 Italy ITA   1996…  1996 Summer Atla…
##  8   106 Agostin… M        34    188     96 Italy ITA   2000…  2000 Summer Sydn…
##  9   107 Carmine… M        22    182     90 Italy ITA   1984…  1984 Summer Los …
## 10   107 Carmine… M        26    182     90 Italy ITA   1988…  1988 Summer Seoul
## # ℹ 9,871 more rows
## # ℹ 3 more variables: Sport <chr>, Event <chr>, Medal <chr>

Question 3

Which team had the most number of medals? Show the results in a bar graph using geom_col.

pander(count(events_top, Team, Medal))
Team Medal n
Australia Bronze 252
Australia Gold 194
Australia Silver 286
Canada Bronze 200
Canada Gold 95
Canada Silver 160
France Bronze 355
France Gold 276
France Silver 294
Germany Bronze 319
Germany Gold 267
Germany Silver 300
Great Britain Bronze 354
Great Britain Gold 270
Great Britain Silver 406
Hungary Bronze 201
Hungary Gold 196
Hungary Silver 176
Italy Bronze 263
Italy Gold 400
Italy Silver 319
Japan Bronze 142
Japan Gold 133
Japan Silver 132
Sweden Bronze 245
Sweden Gold 276
Sweden Silver 210
United States Bronze 690
United States Gold 1602
United States Silver 868
#bar plot of number of medals by team
count(events_top, Team, Medal) %>%
    ggplot(aes(x=Team, y=n, fill=Medal)) +
        geom_col()

#stack them side by side
count(events_top, Team, Medal) %>%
    ggplot(aes(x=Team, y=n, fill=Medal)) +
        geom_col(position="dodge")

#let's change the colors
count(events_top, Team, Medal) %>%
    ggplot(aes(x=Team, y=n, fill=Medal)) +
        geom_col(position="dodge") + 
        scale_fill_manual(values = c(
            "Gold" = "gold", 
            "Bronze" = "goldenrod", 
            "Silver" = "gray"))

#what if you want to order the teams by the total number of medals?
count(events_top, Team, Medal) %>%
    ggplot(aes(x=fct_rev(fct_reorder(Team, n)), y=n, fill=Medal)) +
        geom_col(position="dodge") + 
        scale_fill_manual(values = c(
            "Gold" = "gold", 
            "Bronze" = "goldenrod", 
            "Silver" = "gray"))

#Change the order, gold-silver-bronze
events_top$Medal <- factor(events_top$Medal, levels=c("Gold", "Silver", "Bronze"))
count(events_top, Team, Medal) %>%
    ggplot(aes(x=fct_rev(fct_reorder(Team, n)), y=n, fill=Medal)) +
        geom_col(position="dodge") + 
        scale_fill_manual(values = c(
            "Gold" = "gold", 
            "Bronze" = "goldenrod", 
            "Silver" = "gray")) +
        theme(axis.text.x = element_text(angle = 90)) +
        labs(x="Teams", y="Medal Count")

Question 4

Look at the total number of medals over the years for each team. Do you notice any weird patterns? Can you correct them?

#total number of medals over the years
events_top %>% count(Team, Year) 
## # A tibble: 296 × 3
##    Team       Year     n
##    <chr>     <dbl> <int>
##  1 Australia  1896     2
##  2 Australia  1900     5
##  3 Australia  1904     4
##  4 Australia  1906     3
##  5 Australia  1920     7
##  6 Australia  1924     9
##  7 Australia  1928     4
##  8 Australia  1932     5
##  9 Australia  1936     1
## 10 Australia  1948    16
## # ℹ 286 more rows
#look at the number of medals by team and year
events_top %>% 
    count(Team, Year) %>% 
    ggplot(aes(x=Year, y=n, color=Team)) +
        geom_point() +
        geom_line()

#another way to look at the number of medals by team and year
events_top %>% 
    count(Team, Year) %>% 
    ggplot(aes(x=Year, y=n, color=Team)) +
        geom_point() +
        geom_line() +
        facet_wrap(~Team)

The spiky pattern in the number of medals won through the years is because many of the top sports only exist in the summer olympics. Inconsistencies like are very hard to spot if you’re only reading the numbers instead of visualizing them.
Let’s see what it looks like if we restrict ourselves to the summer games.

#total number of medals over the years
events_top %>% 
    filter(Season == "Summer") %>%
    count(Team, Year) 
## # A tibble: 261 × 3
##    Team       Year     n
##    <chr>     <dbl> <int>
##  1 Australia  1896     2
##  2 Australia  1900     5
##  3 Australia  1904     4
##  4 Australia  1906     3
##  5 Australia  1920     7
##  6 Australia  1924     9
##  7 Australia  1928     4
##  8 Australia  1932     5
##  9 Australia  1936     1
## 10 Australia  1948    16
## # ℹ 251 more rows
#look at the number of medals by team and year
events_top %>% 
    filter(Season == "Summer") %>%
    count(Team, Year) %>% 
    ggplot(aes(x=Year, y=n, color=Team)) +
        geom_point() +
        geom_line() +
        facet_wrap(~Team)

Question 5

Do you see any patterns in the weights of medal winning participants over the years? Show the weight distributions with geom_histogram and geom_density_ridges

#distribution of weights by medal
ggplot(events_top, aes(x=Weight)) + geom_histogram()
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(events_top, aes(x=Weight, fill=Sex)) + geom_histogram()
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(events_top, aes(x=Weight, fill=Sex)) + geom_histogram() + facet_wrap(~Year)
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggridges is another very useful package for displaying multiple histograms. It’s not part of tidyverse, but can be installed very easily and uses the standard ggplot2 calling form. Be sure to specify a categorical variable to the group aesthetic if you want to visualize distinct ridges.

#if it's not installed yet.
#install.packages("ggridges")
library("ggridges")

ggplot(events_top, aes(x=Weight, group=Year, y=Year)) + 
  geom_density_ridges() + 
  facet_wrap(~Sex, ncol=2)
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_density_ridges()`).

#separate it by sport
ggplot(events_top, aes(x=Weight, group=Year, y=Year, fill=Sex)) + 
  geom_density_ridges() + 
  facet_wrap(Sport~Sex, ncol=4)
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_density_ridges()`).