Kith Pradhan
2024-05-14
The first example will use the builtin mtcars
dataset to
show off a simple scatterplot.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#this library makes the output look a little better
#if it's not installed yet call
#install.packages("pander")
library("pander")
#load the data
data(mtcars)
#look at the first few rows
#mtcars %>% head() %>% pander()
#here's another way to do it
pander(head(mtcars))
 | mpg | cyl | disp | hp | drat | wt | qsec | vs |
---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 |
Mazda RX4 Wag | 21 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 |
 | am | gear | carb |
---|---|---|---|
Mazda RX4 | 1 | 4 | 4 |
Mazda RX4 Wag | 1 | 4 | 4 |
Datsun 710 | 1 | 4 | 1 |
Hornet 4 Drive | 0 | 3 | 1 |
Hornet Sportabout | 0 | 3 | 2 |
Valiant | 0 | 3 | 1 |
Let’s look at the relationship between mpg and hp with a simple scatterplot using geom_point().
Remember these three key points to using ggplot2.
We can assign variables to be visualized through the geom’s features. In the code below, the gear of the car is shown off as color. gear is being treated as a numeric and ranges over 3 distinct values.
By default, gear is stored as a floating point number. If you change its class to a factor, ggplot will treat it as a categorical variable, and assign attributes based on a discrete scale rather than continuous.
#turn gear into a factor
mtcars$gear2 = as.factor(mtcars$gear)
#Here's another way to do it
mtcars <- mtcars %>% mutate(gear2 = as.factor(gear))
pander(head(mtcars))
 | mpg | cyl | disp | hp | drat | wt | qsec | vs |
---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 |
Mazda RX4 Wag | 21 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 |
 | am | gear | carb | gear2 |
---|---|---|---|---|
Mazda RX4 | 1 | 4 | 4 | 4 |
Mazda RX4 Wag | 1 | 4 | 4 | 4 |
Datsun 710 | 1 | 4 | 1 | 4 |
Hornet 4 Drive | 0 | 3 | 1 | 3 |
Hornet Sportabout | 0 | 3 | 2 | 3 |
Valiant | 0 | 3 | 1 | 3 |
In these examples we’ll be using the builtin starwars dataset to show off how to make a heatmap in ggplot2. If you’ve made heatmaps in base R with the image() function, you’re already familiar with how to display data in matrix form. This is a very intuitive way to handle 2D data.
ggplot2 uses a different way to handle 2D data for heatmaps. Here, we treat each individual heatmap tile as a separate observation. Instead of a rectangular matrix with nSubjects rows and nVariables columns, we need to make a new data structure that has nSubjects * nVariables rows, with one or more columns to distinguish the variables and their values.
Luckily for us, there is a tidyverse function called unnest() that can rearrange the data for us automatically.
For the first example, let’s look at the films that each character had an appearance in using geom_tile()
## [1] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader"
## [5] "Leia Organa" "Owen Lars"
## [[1]]
## [1] "The Empire Strikes Back" "Revenge of the Sith"
## [3] "Return of the Jedi" "A New Hope"
## [5] "The Force Awakens"
##
## [[2]]
## [1] "The Empire Strikes Back" "Attack of the Clones"
## [3] "The Phantom Menace" "Revenge of the Sith"
## [5] "Return of the Jedi" "A New Hope"
##
## [[3]]
## [1] "The Empire Strikes Back" "Attack of the Clones"
## [3] "The Phantom Menace" "Revenge of the Sith"
## [5] "Return of the Jedi" "A New Hope"
## [7] "The Force Awakens"
##
## [[4]]
## [1] "The Empire Strikes Back" "Revenge of the Sith"
## [3] "Return of the Jedi" "A New Hope"
##
## [[5]]
## [1] "The Empire Strikes Back" "Revenge of the Sith"
## [3] "Return of the Jedi" "A New Hope"
## [5] "The Force Awakens"
##
## [[6]]
## [1] "Attack of the Clones" "Revenge of the Sith" "A New Hope"
name | films |
---|---|
Luke Skywalker | The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens |
C-3PO | The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope |
R2-D2 | The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens |
Darth Vader | The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope |
Leia Organa | The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens |
Owen Lars | Attack of the Clones, Revenge of the Sith, A New Hope |
#this is how it needs to be to use ggplot's heatmap
starwars_films <- starwars %>% select(name, films) %>% unnest(films)
starwars_films %>% head(30) %>% pander()
name | films |
---|---|
Luke Skywalker | The Empire Strikes Back |
Luke Skywalker | Revenge of the Sith |
Luke Skywalker | Return of the Jedi |
Luke Skywalker | A New Hope |
Luke Skywalker | The Force Awakens |
C-3PO | The Empire Strikes Back |
C-3PO | Attack of the Clones |
C-3PO | The Phantom Menace |
C-3PO | Revenge of the Sith |
C-3PO | Return of the Jedi |
C-3PO | A New Hope |
R2-D2 | The Empire Strikes Back |
R2-D2 | Attack of the Clones |
R2-D2 | The Phantom Menace |
R2-D2 | Revenge of the Sith |
R2-D2 | Return of the Jedi |
R2-D2 | A New Hope |
R2-D2 | The Force Awakens |
Darth Vader | The Empire Strikes Back |
Darth Vader | Revenge of the Sith |
Darth Vader | Return of the Jedi |
Darth Vader | A New Hope |
Leia Organa | The Empire Strikes Back |
Leia Organa | Revenge of the Sith |
Leia Organa | Return of the Jedi |
Leia Organa | A New Hope |
Leia Organa | The Force Awakens |
Owen Lars | Attack of the Clones |
Owen Lars | Revenge of the Sith |
Owen Lars | A New Hope |
Let’s look at another heatmap. The types of starships piloted by each character, broken down by the pilot’s species. Here, we’re rearranging the necessary info into a tiled data structure, then we’re adding extra info from the original data with left_join()
#make a new data.frame that has the info we want to visualize
starwars_ships <- starwars %>% select(name, starships) %>% unnest(starships)
pander(starwars_ships)
name | starships |
---|---|
Luke Skywalker | X-wing |
Luke Skywalker | Imperial shuttle |
Darth Vader | TIE Advanced x1 |
Biggs Darklighter | X-wing |
Obi-Wan Kenobi | Jedi starfighter |
Obi-Wan Kenobi | Trade Federation cruiser |
Obi-Wan Kenobi | Naboo star skiff |
Obi-Wan Kenobi | Jedi Interceptor |
Obi-Wan Kenobi | Belbullab-22 starfighter |
Anakin Skywalker | Trade Federation cruiser |
Anakin Skywalker | Jedi Interceptor |
Anakin Skywalker | Naboo fighter |
Chewbacca | Millennium Falcon |
Chewbacca | Imperial shuttle |
Han Solo | Millennium Falcon |
Han Solo | Imperial shuttle |
Wedge Antilles | X-wing |
Jek Tono Porkins | X-wing |
Boba Fett | Slave 1 |
Lando Calrissian | Millennium Falcon |
Arvel Crynyd | A-wing |
Nien Nunb | Millennium Falcon |
Ric Olié | Naboo Royal Starship |
Darth Maul | Scimitar |
Plo Koon | Jedi starfighter |
Gregar Typho | Naboo fighter |
Grievous | Belbullab-22 starfighter |
Poe Dameron | T-70 X-wing fighter |
Padmé Amidala | H-type Nubian yacht |
Padmé Amidala | Naboo star skiff |
Padmé Amidala | Naboo fighter |
#add info from the original data.frame to the new data.frame
starwars_shipsSpecies <- left_join(starwars_ships, select(starwars, name, species), by="name")
pander(starwars_shipsSpecies)
name | starships | species |
---|---|---|
Luke Skywalker | X-wing | Human |
Luke Skywalker | Imperial shuttle | Human |
Darth Vader | TIE Advanced x1 | Human |
Biggs Darklighter | X-wing | Human |
Obi-Wan Kenobi | Jedi starfighter | Human |
Obi-Wan Kenobi | Trade Federation cruiser | Human |
Obi-Wan Kenobi | Naboo star skiff | Human |
Obi-Wan Kenobi | Jedi Interceptor | Human |
Obi-Wan Kenobi | Belbullab-22 starfighter | Human |
Anakin Skywalker | Trade Federation cruiser | Human |
Anakin Skywalker | Jedi Interceptor | Human |
Anakin Skywalker | Naboo fighter | Human |
Chewbacca | Millennium Falcon | Wookiee |
Chewbacca | Imperial shuttle | Wookiee |
Han Solo | Millennium Falcon | Human |
Han Solo | Imperial shuttle | Human |
Wedge Antilles | X-wing | Human |
Jek Tono Porkins | X-wing | Human |
Boba Fett | Slave 1 | Human |
Lando Calrissian | Millennium Falcon | Human |
Arvel Crynyd | A-wing | Human |
Nien Nunb | Millennium Falcon | Sullustan |
Ric Olié | Naboo Royal Starship | NA |
Darth Maul | Scimitar | Zabrak |
Plo Koon | Jedi starfighter | Kel Dor |
Gregar Typho | Naboo fighter | Human |
Grievous | Belbullab-22 starfighter | Kaleesh |
Poe Dameron | T-70 X-wing fighter | Human |
Padmé Amidala | H-type Nubian yacht | Human |
Padmé Amidala | Naboo star skiff | Human |
Padmé Amidala | Naboo fighter | Human |
#color the heatmap elements by species
ggplot(starwars_shipsSpecies, aes(x=starships, y=name, fill=species)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90))
Here are a few more ggplot2 examples using the olympics
dataset.
We’ll be looking for associations between the number/type of medals won
and other interesting variables.
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This dataset contains 271116 rows and 15 columns. To make things more manageable, we’ll be restricting our exploration to only the most popular games and teams.
How do you find the sports & teams with the largest number of players?
#sports with the most number of players
events %>% count(Sport) %>% arrange(desc(n)) %>% top_n(n=10, wt=n)
## # A tibble: 10 × 2
## Sport n
## <chr> <int>
## 1 Athletics 38624
## 2 Gymnastics 26707
## 3 Swimming 23195
## 4 Shooting 11448
## 5 Cycling 10859
## 6 Fencing 10735
## 7 Rowing 10595
## 8 Cross Country Skiing 9133
## 9 Alpine Skiing 8829
## 10 Wrestling 7154
#from here you can extract the names from the Sport column
top_sports <- events %>%
count(Sport) %>%
arrange(desc(n)) %>%
top_n(n=10, wt=n) %>%
pull(Sport)
top_sports
## [1] "Athletics" "Gymnastics" "Swimming"
## [4] "Shooting" "Cycling" "Fencing"
## [7] "Rowing" "Cross Country Skiing" "Alpine Skiing"
## [10] "Wrestling"
#another way to do it
top_sports <- names(tail(sort(table(events$Sport)), 10))
top_teams <- names(tail(sort(table(events$Team)), 10))
top_sports
## [1] "Wrestling" "Alpine Skiing" "Cross Country Skiing"
## [4] "Rowing" "Fencing" "Cycling"
## [7] "Shooting" "Swimming" "Gymnastics"
## [10] "Athletics"
## [1] "Hungary" "Australia" "Sweden" "Japan"
## [5] "Canada" "Germany" "Italy" "Great Britain"
## [9] "France" "United States"
Let’s also restrict ourselves to only using the players that won a medal. We want the players that won a medal in any of the top sports from any of the top teams. How do you set up this smaller dataset?
events_top <- events %>%
filter(!is.na(Medal )) %>%
filter(Sport %in% top_sports, Team %in% top_teams)
events_top
## # A tibble: 9,881 × 15
## ID Name Sex Age Height Weight Team NOC Games Year Season City
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 56 Ren Aba… M 21 NA NA Fran… FRA 1956… 1956 Summer Melb…
## 2 62 Giovann… M 21 198 90 Italy ITA 2016… 2016 Summer Rio …
## 3 84 Stephen… M 26 165 55 Unit… USA 2004… 2004 Summer Athi…
## 4 99 Pter Ab… M 30 181 79 Hung… HUN 1992… 1992 Summer Barc…
## 5 100 Oszkr A… M 22 NA NA Hung… HUN 1936… 1936 Summer Berl…
## 6 106 Agostin… M 22 188 96 Italy ITA 1988… 1988 Summer Seoul
## 7 106 Agostin… M 29 188 96 Italy ITA 1996… 1996 Summer Atla…
## 8 106 Agostin… M 34 188 96 Italy ITA 2000… 2000 Summer Sydn…
## 9 107 Carmine… M 22 182 90 Italy ITA 1984… 1984 Summer Los …
## 10 107 Carmine… M 26 182 90 Italy ITA 1988… 1988 Summer Seoul
## # ℹ 9,871 more rows
## # ℹ 3 more variables: Sport <chr>, Event <chr>, Medal <chr>
Which team had the most number of medals? Show the results in a bar graph using geom_col.
Team | Medal | n |
---|---|---|
Australia | Bronze | 252 |
Australia | Gold | 194 |
Australia | Silver | 286 |
Canada | Bronze | 200 |
Canada | Gold | 95 |
Canada | Silver | 160 |
France | Bronze | 355 |
France | Gold | 276 |
France | Silver | 294 |
Germany | Bronze | 319 |
Germany | Gold | 267 |
Germany | Silver | 300 |
Great Britain | Bronze | 354 |
Great Britain | Gold | 270 |
Great Britain | Silver | 406 |
Hungary | Bronze | 201 |
Hungary | Gold | 196 |
Hungary | Silver | 176 |
Italy | Bronze | 263 |
Italy | Gold | 400 |
Italy | Silver | 319 |
Japan | Bronze | 142 |
Japan | Gold | 133 |
Japan | Silver | 132 |
Sweden | Bronze | 245 |
Sweden | Gold | 276 |
Sweden | Silver | 210 |
United States | Bronze | 690 |
United States | Gold | 1602 |
United States | Silver | 868 |
#bar plot of number of medals by team
count(events_top, Team, Medal) %>%
ggplot(aes(x=Team, y=n, fill=Medal)) +
geom_col()
#stack them side by side
count(events_top, Team, Medal) %>%
ggplot(aes(x=Team, y=n, fill=Medal)) +
geom_col(position="dodge")
#let's change the colors
count(events_top, Team, Medal) %>%
ggplot(aes(x=Team, y=n, fill=Medal)) +
geom_col(position="dodge") +
scale_fill_manual(values = c(
"Gold" = "gold",
"Bronze" = "goldenrod",
"Silver" = "gray"))
#what if you want to order the teams by the total number of medals?
count(events_top, Team, Medal) %>%
ggplot(aes(x=fct_rev(fct_reorder(Team, n)), y=n, fill=Medal)) +
geom_col(position="dodge") +
scale_fill_manual(values = c(
"Gold" = "gold",
"Bronze" = "goldenrod",
"Silver" = "gray"))
#Change the order, gold-silver-bronze
events_top$Medal <- factor(events_top$Medal, levels=c("Gold", "Silver", "Bronze"))
count(events_top, Team, Medal) %>%
ggplot(aes(x=fct_rev(fct_reorder(Team, n)), y=n, fill=Medal)) +
geom_col(position="dodge") +
scale_fill_manual(values = c(
"Gold" = "gold",
"Bronze" = "goldenrod",
"Silver" = "gray")) +
theme(axis.text.x = element_text(angle = 90)) +
labs(x="Teams", y="Medal Count")
Look at the total number of medals over the years for each team. Do you notice any weird patterns? Can you correct them?
## # A tibble: 296 × 3
## Team Year n
## <chr> <dbl> <int>
## 1 Australia 1896 2
## 2 Australia 1900 5
## 3 Australia 1904 4
## 4 Australia 1906 3
## 5 Australia 1920 7
## 6 Australia 1924 9
## 7 Australia 1928 4
## 8 Australia 1932 5
## 9 Australia 1936 1
## 10 Australia 1948 16
## # ℹ 286 more rows
#look at the number of medals by team and year
events_top %>%
count(Team, Year) %>%
ggplot(aes(x=Year, y=n, color=Team)) +
geom_point() +
geom_line()
#another way to look at the number of medals by team and year
events_top %>%
count(Team, Year) %>%
ggplot(aes(x=Year, y=n, color=Team)) +
geom_point() +
geom_line() +
facet_wrap(~Team)
The spiky pattern in the number of medals won through the years is
because many of the top sports only exist in the summer olympics.
Inconsistencies like are very hard to spot if you’re only reading the
numbers instead of visualizing them.
Let’s see what it looks like if we restrict ourselves to the summer
games.
#total number of medals over the years
events_top %>%
filter(Season == "Summer") %>%
count(Team, Year)
## # A tibble: 261 × 3
## Team Year n
## <chr> <dbl> <int>
## 1 Australia 1896 2
## 2 Australia 1900 5
## 3 Australia 1904 4
## 4 Australia 1906 3
## 5 Australia 1920 7
## 6 Australia 1924 9
## 7 Australia 1928 4
## 8 Australia 1932 5
## 9 Australia 1936 1
## 10 Australia 1948 16
## # ℹ 251 more rows
#look at the number of medals by team and year
events_top %>%
filter(Season == "Summer") %>%
count(Team, Year) %>%
ggplot(aes(x=Year, y=n, color=Team)) +
geom_point() +
geom_line() +
facet_wrap(~Team)
Do you see any patterns in the weights of medal winning participants over the years? Show the weight distributions with geom_histogram and geom_density_ridges
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggridges is another very useful package for displaying multiple histograms. It’s not part of tidyverse, but can be installed very easily and uses the standard ggplot2 calling form. Be sure to specify a categorical variable to the group aesthetic if you want to visualize distinct ridges.
#if it's not installed yet.
#install.packages("ggridges")
library("ggridges")
ggplot(events_top, aes(x=Weight, group=Year, y=Year)) +
geom_density_ridges() +
facet_wrap(~Sex, ncol=2)
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_density_ridges()`).
#separate it by sport
ggplot(events_top, aes(x=Weight, group=Year, y=Year, fill=Sex)) +
geom_density_ridges() +
facet_wrap(Sport~Sex, ncol=4)
## Warning: Removed 2902 rows containing non-finite outside the scale range
## (`stat_density_ridges()`).