If you’re new to the tidyverse, I recommend that you first read part one of this two-part series on transitioning into the tidyverse. Part 1 focuses on what I feel are the most important aspects and packages of the tidyverse: tidy thinking, piping, dplyr and ggplot2.
This second part of the two-part series focuses on the remaining (less essential, but still immensely useful) packages that make up the tidyverse: tidyr, purrr, readr, tibbles, as well as some additional type-specific packages (lubridate, forcats and stringr). Additional resources include the set of tidyverse cheatsheets, as well as the R for Data Science book.
Start by loading the tidyverse package into your environment.
library(tidyverse)
Then load the gapminder data.
# to download the data directly:
gapminder_orig <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv")
# define a copy of the original dataset that we will clean and play with
gapminder <- gapminder_orig
Data shaping: tidyr
Tidyr aims to help you reshape your data, and is very useful if you receive data in a format that isn’t already “tidy”. I also find myself using tidyr functions to help me calculate specific types of summaries and plots.
For instance, tidyr helps you convert your data between (a) long-form data where each variable is in a single column
country year lifeExp
1 Australia 1992 77.560
2 Australia 1997 78.830
3 Australia 2002 80.370
4 Australia 2007 81.235
5 Canada 1992 77.950
6 Canada 1997 78.610
7 Canada 2002 79.770
8 Canada 2007 80.653
9 United States 1992 76.090
10 United States 1997 76.810
11 United States 2002 77.310
12 United States 2007 78.242
and (b) wide-form data where a single variable is separated into multiple columns based on some grouping (in this case, the life expectancy variable is separated into three columns, one for each country):
year Australia_lifeExp Canada_lifeExp United States_lifeExp
1 1992 77.560 77.950 76.090
2 1997 78.830 78.610 76.810
3 2002 80.370 79.770 77.310
4 2007 81.235 80.653 78.242
Gathering and spreading
The main tidyr functions are spread() and gather(). If you are familiar with the older reshape2 R package, you can think of tidyr as the tidyverse version, where spread() is the equivalent of dcast(), and gather() is the equivalent of melt(). If not… never mind!
Think of spread() as a function that will spread a single variable’s “values” across multiple columns based on a “key”, or grouping variable. Similarly, think of gather() as a function that will gather a variable whose “values” are spread across multiple columns (where the “key” is the grouping variable that distinguishes the columns) into a single column.
The main things you need to figure out when using spread() and gather() are which columns of your data frame are the “key” and which are the “value”. If you are spreading your data (to make it wider), then your key and value variables are existing variables in the data. If you are gathering your data (making it longer), then you will need to define key and value variables that will become variable names in your long-form data frame.
Below I’ll show how this works with a small subset of the gapminder dataset, corresponding to the life expectancy for US, Australia, and Canada for each year in the data after 1990.
Suppose that you started with the long-form data.
gapminder_sample_long <- gapminder %>%
  filter(country %in% c("Australia", "United States", "Canada"), year > 1990) %>%
  select(country, year, lifeExp)
gapminder_sample_long
country year lifeExp
1 Australia 1992 77.560
2 Australia 1997 78.830
3 Australia 2002 80.370
4 Australia 2007 81.235
5 Canada 1992 77.950
6 Canada 1997 78.610
7 Canada 2002 79.770
8 Canada 2007 80.653
9 United States 1992 76.090
10 United States 1997 76.810
11 United States 2002 77.310
12 United States 2007 78.242
A wide-form version might have the life expectancy variable spread into three variables, one for each country (it would also be perfectly feasible to separate by year). So in this case, the value that you want to spread is the lifeExp variable, and the key that you want to spread/group by is the country variable.
gapminder_sample_wide <- gapminder_sample_long %>%
  spread(key = country, value = lifeExp)
gapminder_sample_wide
year Australia Canada United States
1 1992 77.560 77.950 76.090
2 1997 78.830 78.610 76.810
3 2002 80.370 79.770 77.310
4 2007 81.235 80.653 78.242
So the columns with the country names, Australia, Canada, and United States, contain the lifeExp values corresponding to those countries for each year. Note that the year variable has been retained in the wide form. If you had tried to do this without the year variable in the data frame, you would run into an error that said "Error: Each row of output must be identified by a unique combination of keys."
Try running the following code.
gapminder_sample_long %>%
  select(-year) %>%
  spread(key = country, value = lifeExp)
This is because when the year column is missing, there is no variable that tells tidyr which values should go in the same rows together. This error message is a common source of frustration in tidyr, which is why Hadley developed replacements for gather() and spread() called pivot_longer() and pivot_wider(): https://tidyr.tidyverse.org/dev/articles/pivot.html. These have been available on CRAN since tidyr 1.0.0, and gather() and spread() are now superseded, although they still work. If you understand the principles of gather() and spread(), then the pivot functions will be easy to learn.
If you wanted to go from the wide form to the long form, you need to gather together the life expectancy values. This time, the country key and lifeExp value variable names do not currently exist in the data frame. The key and value arguments that you provide in the gather() function are what will be used as the names of the variables for the long-form version you’re about to create. Just so you can see that these variables did not need to exist in the original data, you will call the key country_var and the value lifeExp_var (previously unused names).
gapminder_sample_wide
year Australia Canada United States
1 1992 77.560 77.950 76.090
2 1997 78.830 78.610 76.810
3 2002 80.370 79.770 77.310
4 2007 81.235 80.653 78.242
gapminder_sample_wide %>%
  gather(key = country_var, value = lifeExp_var)
country_var lifeExp_var
1 year 1992.000
2 year 1997.000
3 year 2002.000
4 year 2007.000
5 Australia 77.560
6 Australia 78.830
7 Australia 80.370
8 Australia 81.235
9 Canada 77.950
10 Canada 78.610
11 Canada 79.770
12 Canada 80.653
13 United States 76.090
14 United States 76.810
15 United States 77.310
16 United States 78.242
Oh no… something went wrong! The year variable has been included as a key (country). Since there is no distinction between the three country columns (Australia, Canada, and United States) and the year column, the year column was included in the gathering process. To exclude a column from the gathering process, you can explicitly remove it using e.g. -year as an argument to the gather function.
gapminder_sample_wide %>%
  gather(key = country_var, value = lifeExp_var, -year)
year country_var lifeExp_var
1 1992 Australia 77.560
2 1997 Australia 78.830
3 2002 Australia 80.370
4 2007 Australia 81.235
5 1992 Canada 77.950
6 1997 Canada 78.610
7 2002 Canada 79.770
8 2007 Canada 80.653
9 1992 United States 76.090
10 1997 United States 76.810
11 2002 United States 77.310
12 2007 United States 78.242
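For readers on tidyr 1.0.0 or later, the same two reshaping steps can be sketched with the pivot functions. This is only an illustration: the small sample_long data frame is a stand-in built from a few of the values shown above, and its name is not part of the original tutorial.

```r
library(tidyr)

# stand-in for gapminder_sample_long (values copied from the tables above)
sample_long <- data.frame(
  country = rep(c("Australia", "Canada"), each = 2),
  year = rep(c(2002, 2007), times = 2),
  lifeExp = c(80.370, 81.235, 79.770, 80.653),
  stringsAsFactors = FALSE
)

# counterpart of spread(key = country, value = lifeExp)
sample_wide <- sample_long %>%
  pivot_wider(names_from = country, values_from = lifeExp)

# counterpart of gather(key = country_var, value = lifeExp_var, -year)
sample_long_again <- sample_wide %>%
  pivot_longer(cols = -year, names_to = "country_var", values_to = "lifeExp_var")

sample_wide
sample_long_again
```

Unlike gather(), pivot_longer() works through the wide data one row at a time, so the long result comes back sorted by year first rather than by country.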
Combining and separating variables
The unite() function combines columns into a single column. For instance, you can combine the country and year variables into a single variable, countryyear.
gapminder_sample_united <- gapminder_sample_long %>%
  unite("countryyear", country, year, sep = "_")
gapminder_sample_united
countryyear lifeExp
1 Australia_1992 77.560
2 Australia_1997 78.830
3 Australia_2002 80.370
4 Australia_2007 81.235
5 Canada_1992 77.950
6 Canada_1997 78.610
7 Canada_2002 79.770
8 Canada_2007 80.653
9 United States_1992 76.090
10 United States_1997 76.810
11 United States_2002 77.310
12 United States_2007 78.242
Conversely, you can separate single columns into multiple columns. Below, I undo the unite() that I performed above using separate().
gapminder_sample_united %>%
  separate(countryyear, c("country", "year"), sep = "_")
country year lifeExp
1 Australia 1992 77.560
2 Australia 1997 78.830
3 Australia 2002 80.370
4 Australia 2007 81.235
5 Canada 1992 77.950
6 Canada 1997 78.610
7 Canada 2002 79.770
8 Canada 2007 80.653
9 United States 1992 76.090
10 United States 1997 76.810
11 United States 2002 77.310
12 United States 2007 78.242
Tidyr also has some nice functions for dealing with missing values, including
drop_na(), which will remove every row that has a missing value (NA) in it.
replace_na(), which will replace every missing value (NA) with whatever value you specify.
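As a rough sketch of how these two functions behave, here they are applied to a tiny made-up data frame (the missing values are invented for illustration and are not in the gapminder data):

```r
library(tidyr)

# hypothetical data with missing values, for illustration only
df <- data.frame(
  country = c("Australia", "Canada", NA),
  lifeExp = c(81.235, NA, 78.242),
  stringsAsFactors = FALSE
)

# keep only the rows with no missing value in any column
complete_rows <- drop_na(df)

# replace missing values column by column using a named list
filled <- replace_na(df, list(country = "Unknown", lifeExp = 0))

complete_rows
filled
```

Passing a named list to replace_na() lets each column get a replacement of the appropriate type.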
Replacing loops: purrr
Iteration in the tidyverse is handled using purrr, a feline-friendly package for applying “map” functions (although it does a few other neat things too). If you are experienced in base R, then you’re probably familiar with the apply() family of functions that can be used in place of loops for iteratively applying a function. The most common apply functions are
apply(matrix, margin, fun) applies a function, fun, across each of the rows (if you set margin = 1) or each of the columns (if you set margin = 2) of a matrix (or array) and, when fun returns a single value, returns a vector.
sapply(object, fun) applies a function, fun, to each entry of an object (which could be a vector, a data frame or a list), and usually returns a vector, but sometimes it returns a matrix, and often it is difficult to guess what type of object it will return before you run your code.
lapply(list, fun) applies a function, fun, to each entry of a list, and returns a list. This one at least usually makes sense relative to the other apply functions.
While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the expected type of the object they return is often ambiguous (at least it is for sapply…). Each of purrr’s map functions can be applied to vectors, lists and data frames.
It is useful to remember that a data frame is a special type of list where each column of the data frame corresponds to an entry of the list. Each entry of the data frame-list is a vector of the same length (although the vectors do not need to be of the same type).
One of the primary features of purrr’s map functions is that you need to specify the form of your output as a function suffix separated by an underscore. The first argument is always the data object over which you want to iterate, and the second argument is always the function that you want to iteratively apply. For example:
map(object, fun) is the primary mapping function and returns a list
map_df(object, fun) returns a data frame
map_dbl(object, fun) returns a numeric (double) vector
map_chr(object, fun) returns a character vector
map_lgl(object, fun) returns a logical vector
The input to any map function is always either
a vector (of any type), in which case the iteration is done over the entries of the vector
a list, in which case the iteration is performed over the elements of the list
a data frame, in which case the iteration is performed over the columns of the data frame (which, since a data frame is a special kind of list, is technically the same as the previous point)
The output of each map function is specified by the term that follows the underscore in the function name.
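To make the “a data frame is a list of columns” point concrete, here is a minimal sketch that maps over the columns of a small data frame. The data frame and the variable names are made up for illustration.

```r
library(purrr)

# a small data frame; each column is one element of the underlying list
df <- data.frame(
  year = c(1992, 1997, 2002),
  lifeExp = c(77.56, 78.83, 80.37)
)

is.list(df)

# the suffix decides the output type; the iteration is over columns
col_means <- map_dbl(df, mean)          # named numeric vector
col_classes <- map_chr(df, class)       # named character vector
all_numeric <- map_lgl(df, is.numeric)  # named logical vector

col_means
col_classes
all_numeric
```

Each result keeps the column names, which makes it easy to see which output belongs to which column.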
Fundamentally, maps are for iteration. In the example below I will iterate through the vector c(1, 4, 7) by adding 10 to each entry. The following code is how you would do this using the base R apply functions:
lapply(c(1, 4, 7), function(number) {
return(number + 10)
})
[[1]]
[1] 11
[[2]]
[1] 14
[[3]]
[1] 17
sapply(c(1, 4, 7), function(number) {
return(number + 10)
})
[1] 11 14 17
For the purrr equivalents, if you want your output to be a list, you would use map(); if you want it to be a numeric vector, then you would use map_dbl(); and if you want it to be a character vector, then it is map_chr().
library(purrr)
map(c(1, 4, 7), function(number) {
return(number + 10)
})
[[1]]
[1] 11
[[2]]
[1] 14
[[3]]
[1] 17
map_dbl(c(1, 4, 7), function(number) {
return(number + 10)
})
[1] 11 14 17
map_chr(c(1, 4, 7), function(number) {
return(number + 10)
})
Warning: Automatic coercion from double to character was deprecated in purrr 1.0.0.
ℹ Please use an explicit call to `as.character()` within `map_chr()` instead.
[1] "11.000000" "14.000000" "17.000000"
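The warning above suggests converting to character yourself. A minimal sketch of that version follows; note that the strings it produces are formatted differently from the automatically coerced ones.

```r
library(purrr)

# convert explicitly so that map_chr() receives a character value each time
new_numbers <- map_chr(c(1, 4, 7), function(number) {
  as.character(number + 10)
})

new_numbers
# [1] "11" "14" "17"
```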
If you want to return a data frame, then you would use map_df() (but you need to make sure that in each iteration you’re returning a data frame which has consistent column names).
map_df(c(1, 4, 7), function(number) {
return(data.frame(old_number = number,
new_number = number + 10))
})
old_number new_number
1 1 11
2 4 14
3 7 17
map2() and pmap() are versions of map functions that work over multiple data frames/lists/vectors at once. There are also fancy things that you can do with purrr that include iterating over entire lists of data as entries to columns of a tibble, but I won’t talk about those here. My next blog post will be on purrr so keep a look out if you want to learn more. For a more comprehensive look at purrr, I recommend Jenny Bryan’s tutorial.
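As a quick taste of the multi-input versions, here is a small sketch using map2_dbl() and pmap_chr(). The input vectors are invented for illustration; the same output suffixes apply as for map().

```r
library(purrr)

# map2_dbl() iterates over two vectors in parallel
sums <- map2_dbl(c(1, 4, 7), c(10, 20, 30), function(x, y) {
  x + y
})

# pmap_chr() iterates over any number of inputs, supplied together as a list
labels <- pmap_chr(
  list(c("Australia", "Canada"), c(1992, 1997), c(77.56, 78.61)),
  function(country, year, lifeExp) {
    paste(country, year, lifeExp)
  }
)

sums
labels
```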
Loading data: readr
At face value, readr is probably the least exciting tidyverse package. It mostly appears to offer tidyverse equivalents to the classic base R data loading functions such as read.csv(). Calling a readr data loading function is usually the same as calling the base R version, except that the name uses an underscore separator (_) rather than a period separator (.), as in read_csv().
The main advantages of the readr versions are that the data is read in directly as a tibble, and that the readr loading functions do a much better job of deciding what type each variable should be (they also make it easier to specify what types the columns should be at the time of loading if you have strong opinions).
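If you do have strong opinions about column types, a sketch of specifying them at load time might look like the following. The inline CSV text is a made-up stand-in for a real file, and wrapping it in I() to mark it as literal data assumes readr 2.0.0 or later.

```r
library(readr)

# a tiny inline CSV stands in for a real file (hypothetical data)
csv_text <- "country,year,lifeExp\nAustralia,2007,81.235\nCanada,2007,80.653\n"

# declare the type of each column instead of letting readr guess
sample_data <- read_csv(
  I(csv_text),
  col_types = cols(
    country = col_character(),
    year = col_integer(),
    lifeExp = col_double()
  )
)

sample_data
```

With a real file you would pass the file path (or URL) in place of I(csv_text).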
However, a closer look reveals that readr has some hidden talents that are sure to come in handy! For instance, readr has a series of parse_ functions that convert abnormally represented data into normally represented data. For example, parse_number() will extract the numeric component of strings with numbers in them. So if your collaborators give you a file with prices that have preceding “$”s or other characters in them, parse_number() will remove them for you without you having to do anything involving regular expressions.
parse_number(c("$1,234.5", "$12.45", "99%"))
[1] 1234.50 12.45 99.00
Readr can also be used to convert dates and times coded as strings to actual date-time formats.
parse_datetime("2010-10-01 21:45")
[1] "2010-10-01 21:45:00 UTC"
But to be honest, I prefer to use the lubridate package for doing things with dates (see below).
Okay, so maybe readr isn’t the most exciting package, but that parse_number() thing is pretty neat!
Storing data: tibbles
Tibbles are the tidyverse version of a data frame. You’ve probably used tibbles before without even realizing. They look and behave a LOT like a data frame. Often when you input a data frame to a tidyverse function, it comes out the other end as a tibble. The differences are minor and you’re unlikely to notice them if you’re just starting out, so I wouldn’t worry about whether your data is stored as a data frame or a tibble.
The main difference that you might notice is in how they are printed to the console: tibbles are automatically truncated to 10 rows when printed, and if you have too many variables to fit the width of the console, the extra variables are hidden from view. I secretly sometimes view my tibbles in the console using as.data.frame(data) so that it doesn’t truncate. Probably a better thing to do would be to View(data), but that opens a whole new window which I sometimes find kind of annoying.
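Another option that stays in the console is to tell print() how much of the tibble to show. Here is a minimal sketch with a made-up tibble called big:

```r
library(tibble)

# a tibble with more rows than the default print limit
big <- tibble(id = 1:25, value = id * 2)

print(big)                        # truncated to the first 10 rows
print(big, n = 15)                # a chosen number of rows
print(big, n = Inf, width = Inf)  # every row and every column
```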
Tibbles only become important much later down the tidyverse track when you want to use list columns to do fancy stuff with purrr.
Dates, factors and strings: lubridate, forcats and stringr
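Rounding out the tidyverse are some very useful packages for manipulating type-specific variables: lubridate for dates/times, forcats for factors and stringr for strings. (forcats and stringr are loaded by library(tidyverse); lubridate is too as of tidyverse 2.0.0, and in earlier versions it needs its own library(lubridate) call.)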
Handling dates and times: lubridate
Lubridate makes it really straightforward to deal with dates. One might say… it lubricates them… one might also not say that, because it’s a bit weird.
Lubridate offers a simple way of converting dates/times stored as strings to dates/times stored as dates/times, and it makes it easy to do math with dates.
The primary set of functions are date-time-reading functions that convert strings to dates. To decide which function to use, you will need to figure out what format your dates are in (by… looking at them…). For instance, if your date is coded as "August 2nd 2019" or "08-02-19" or "08/02/19", then you would use the mdy() function because it is coded as “month-day-year”:
library(lubridate)
mdy("August 2nd 2019")
[1] "2019-08-02"
mdy("8-2-2019")
[1] "2019-08-02"
mdy("8/2/19")
[1] "2019-08-02"
If your dates were coded as “year-month-day” then you would use the ymd() function, and so on.
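To illustrate the “and so on”, the sketch below parses the same date written in three different orders; the function name simply mirrors the order of the components in the string. The example strings are my own.

```r
library(lubridate)

# the same date written in three different orders
d1 <- ymd("2019-08-02")
d2 <- dmy("2 August 2019")
d3 <- mdy("08/02/2019")

# all three are the same Date
d1 == d2
d2 == d3
```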
Strings that contain times can be parsed using hms() for “hour-minute-second”.
hms("8:45:12")
[1] "8H 45M 12S"
And date-times can be parsed using ymd_hms(), ymd_hm(), ymd_h(), and the corresponding functions for the other date orders (mdy_hms(), dmy_hms(), etc…).
mdy_hms("March 13th 2019 at 9:02:00")
[1] "2019-03-13 09:02:00 UTC"
mdy_hm("03-13-19, 9:02")
[1] "2019-03-13 09:02:00 UTC"
You can add fixed periods of time to dates easily using the years(), months(), days(), hours(), etc… functions. For instance:
mdy("August 2nd 2019") + days(42)
[1] "2019-09-13"
Once your dates are in an actual date format, you can do intuitive mathematical calculations with date-times:
mdy_hms("August 2nd 2019, 1:21:30 pm") - mdy_hms("August 1st 2019, 11:23:33 am")
Time difference of 1.08191 days
Plus ggplot2 handles lubridate-dates really well.
Handling factors: forcats
Factors are somehow simultaneously very useful and the worst thing ever. Fortunately, since I discovered the forcats package, my factors have been on their best behaviour.
The forcats package has a few really useful functions. The ones I use most often are
fct_inorder() for reordering the levels of a factor so that the levels are in the order that they appear in the factor vector.
fct_infreq() for reordering the levels of a factor so that the levels are in order of most to least frequent.
fct_rev() for reversing the order of the levels of a factor.
fct_relevel() for manually reordering the levels of the factor.
fct_reorder() for reordering the levels based on their relationship to another variable.
There are other functions too, but I rarely use them. Check out the forcats cheatsheet!
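To see what each of these does to the level order, here is a small sketch on a made-up vector of continents (the continent and score vectors are invented for illustration):

```r
library(forcats)

# hypothetical vector of continents, for illustration only
continent <- c("Europe", "Asia", "Africa", "Asia", "Asia", "Africa")

levels(factor(continent))                # default: Africa, Asia, Europe
levels(fct_inorder(continent))           # order of appearance: Europe, Asia, Africa
levels(fct_infreq(continent))            # most frequent first: Asia, Africa, Europe
levels(fct_rev(fct_inorder(continent)))  # reversed: Africa, Asia, Europe
levels(fct_relevel(continent, "Europe")) # Europe moved to the front

# order the levels by the median of another variable within each level
score <- c(2, 1, 3, 1, 1, 3)
levels(fct_reorder(continent, score))    # Asia, Europe, Africa
```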
An example of how the forcats package makes my life easier is when I want to reorder the factor levels. Factor levels are usually alphabetical by default, and I often want the factor levels to be in a specific order.
As an exercise both in ggplot2 and dplyr, I want to make a plot that shows the difference in life expectancy between 2007 and 1952 and arrange the countries in order of that difference.
gapminder_life_exp_diff <- gapminder %>%
  # filter to the starting and ending years only
  filter(year == 1952 | year == 2007) %>%
  # ensure that the data are arranged so that 1952 is first and 2007 is second
  # within each country
  arrange(country, year) %>%
  # for each country, add a variable corresponding to the difference between life
  # expectancy in 2007 and 1952
  group_by(country) %>%
  mutate(lifeExp_diff = lifeExp[2] - lifeExp[1],
         # also calculate the largest population for the country (based on the two years)
         max_pop = max(pop)) %>%
  ungroup() %>%
  # arrange from the smallest to the biggest difference in life expectancy
  arrange(lifeExp_diff) %>%
  # restrict to countries with a population of more than 30 million so we can fit
  # the plot in a reasonable space
  filter(max_pop > 30000000) %>%
  select(country, year, continent, lifeExp, lifeExp_diff)
gapminder_life_exp_diff
# A tibble: 72 × 5
country year continent lifeExp lifeExp_diff
<chr> <int> <chr> <dbl> <dbl>
1 South Africa 1952 Africa 45.0 4.33
2 South Africa 2007 Africa 49.3 4.33
3 Congo Dem. Rep. 1952 Africa 39.1 7.32
4 Congo Dem. Rep. 2007 Africa 46.5 7.32
5 United States 1952 Americas 68.4 9.80
6 United States 2007 Americas 78.2 9.80
7 United Kingdom 1952 Europe 69.2 10.2
8 United Kingdom 2007 Europe 79.4 10.2
9 Nigeria 1952 Africa 36.3 10.5
10 Nigeria 2007 Africa 46.9 10.5
# … with 62 more rows
To understand what the intermediate dplyr steps are doing in the code above, I suggest printing each step out to the console (without defining a new data frame), i.e. first print gapminder %>% filter(year == 1952 | year == 2007), then print gapminder %>% filter(year == 1952 | year == 2007) %>% arrange(country, year), etc.
The next task is to make a dot plot that shows the life expectancy in 1952 and 2007 for each country. Since the countries in our data frame are arranged in order of smallest to biggest difference in life expectancy, one would expect that the plot will be too. However, the countries in the plot still appear in alphabetical order! The problem is that ggplot2 plots factors in order of their levels (and treats a character variable like country as a factor with alphabetical levels), but the arrange() dplyr function rearranges the order of the rows in the data frame without changing the order of the factor levels.
gapminder_life_exp_diff %>%
  ggplot() +
  geom_point(aes(x = lifeExp, y = country, col = as.factor(year)))
If I tried to fix this using base R, I would undoubtedly end up messing up which country is which. Fortunately this is really, really easy to fix using forcats! The fct_inorder() function will automatically reorder the levels of a factor in the order in which they appear in the vector. So all I need to do is add one line of pre-processing code before I make my plot: mutate(country = fct_inorder(country)).
gapminder_life_exp_diff %>%
  mutate(country = fct_inorder(country)) %>%
  ggplot() +
  geom_point(aes(x = lifeExp, y = country, col = as.factor(year)))
I’m a bit pedantic about data viz, so I can’t leave this plot looking like this. I’m just going to place some ggplot2 code here for making this plot waaaaay more badass. Try to read through the code and understand what it’s doing. This isn’t a lesson in forcats, it’s a lesson in EDA!
gapminder_life_exp_diff %>%
  mutate(country = fct_inorder(country)) %>%
  # for each country define a variable for min and max life expectancy
group_by(country) %>%
mutate(max_lifeExp = max(lifeExp),
min_lifeExp = min(lifeExp)) %>%
ungroup() %>%
ggplot() +
# plot a horizontal line from min to max life expectancy for each country
geom_segment(aes(x = min_lifeExp, xend = max_lifeExp,
y = country, yend = country,
col = continent), alpha = 0.5, size = 7) +
# add a point for each life expectancy data point
geom_point(aes(x = lifeExp, y = country, col = continent), size = 8) +
# add text of the country name as well as the max and min life expectancy
geom_text(aes(x = min_lifeExp + 0.7, y = country,
label = paste(country, round(min_lifeExp))),
col = "grey50", hjust = "right") +
geom_text(aes(x = max_lifeExp - 0.7, y = country,
label = round(max_lifeExp)),
col = "grey50", hjust = "left") +
# ensure that the left-most text is not cut off
scale_x_continuous(limits = c(20, 85)) +
# choose a different colour palette
scale_colour_brewer(palette = "Pastel2") +
# set the title
labs(title = "Change in life expectancy",
subtitle = "Between 1952 and 2007",
x = "Life expectancy (in 1952 and 2007)",
y = NULL,
col = "Continent") +
# remove the grey background
theme_classic() +
# remove the axes and move the legend to the top
theme(legend.position = "top",
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank())
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Handling strings: stringr
R used to be terrible at handling strings. Stringr has made string-handling a LOT easier. The functions all start with str_ and end with what you want to do to the string.
For instance, to return a logical that specifies whether a specific pattern exists in the string (the equivalent of grepl() in base R), you can use the str_detect() function:
str_detect("I like bananas", "banana")
[1] TRUE
My friend Sara Stoudt @sastoudt wrote a very useful post for the tidyverse website comparing stringr with its base R equivalents (https://stringr.tidyverse.org/articles/from-base.html). She provides the following useful table (hers is a bit longer; I’m just showing the parts I find most useful):
Action | Base R | Tidyverse |
---|---|---|
Identify the location of a pattern | gregexpr(pattern, x) | str_locate_all(x, pattern) |
Keep strings matching a pattern | grep(pattern, x, value = TRUE) | str_subset(x, pattern) |
Identify position matching a pattern | grep(pattern, x) | str_which(x, pattern) |
Detect presence or absence of a pattern | grepl(pattern, x) | str_detect(x, pattern) |
Replace a pattern | gsub(pattern, replacement, x) | str_replace_all(x, pattern, replacement) |
Calculate the number of characters in a string | nchar(x) | str_length(x) |
Split a string into pieces | strsplit(x, pattern) | str_split(x, pattern) |
Extract a subset of a string | substr(x, start, end) | str_sub(x, start, end) |
Convert a string to lowercase | tolower(x) | str_to_lower(x) |
Convert a string to “Title Case” | tools::toTitleCase(x) | str_to_title(x) |
Convert a string to uppercase | toupper(x) | str_to_upper(x) |
Trim white space from a string | trimws(x) | str_trim(x) |
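A few of these pairs can be checked side by side. The sketch below uses a made-up character vector x; notice that stringr always takes the string first and the pattern second, which is what makes it pipe-friendly.

```r
library(stringr)

# hypothetical strings, for illustration only
x <- c("I like bananas", "  apples  ", "Banana bread")

str_detect(x, "banana")         # same result as grepl("banana", x)
str_subset(x, "banana")         # same result as grep("banana", x, value = TRUE)
str_which(x, "banana")          # same result as grep("banana", x)
str_replace_all(x, "an", "AN")  # same result as gsub("an", "AN", x)
str_trim(x)                     # same result as trimws(x)
```

Matching is case-sensitive in both versions, so "Banana bread" does not match the lowercase pattern "banana".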
If you’d like to see a little more of stringr, check out Sara’s post!