Purrr is one of those tidyverse packages that you keep hearing about, and you know you should probably learn it, but you just never seem to get around to it.
At its core, purrr is all about iteration. purrr introduces map functions (the tidyverse’s answer to base R’s apply functions, but more in line with functional programming practices) as well as some new functions for manipulating lists. To get a quick snapshot of any tidyverse package, a nice place to go is the cheatsheet. I find these particularly useful after I’ve already got the basics of a package down, because I inevitably realise that there are a bunch of functionalities I knew nothing about.
Another useful resource for learning about purrr is Jenny Bryan’s tutorial. Jenny’s tutorial is fantastic, but it is a lot longer than this post; my goal here is to get you up and running with purrr very quickly.
While the workhorse of dplyr is the data frame, the workhorse of purrr is the list. If you aren’t familiar with lists, hopefully this will help you understand what they are:
A vector is a way of storing many individual elements (a single number or a single character or string) of the same type together in a single object,
A data frame is a way of storing many vectors of the same length but possibly of different types together in a single object
A list is a way of storing many objects of any type (e.g. data frames, plots, vectors) together in a single object
Here is an example of a list that has three elements: a single number, a vector and a data frame
my_first_list <- list(my_number = 5,
                      my_vector = c("a", "b", "c"),
                      my_dataframe = data.frame(a = 1:3, b = c("q", "b", "z"), c = c("bananas", "are", "so very great")))
my_first_list
$my_number
[1] 5
$my_vector
[1] "a" "b" "c"
$my_dataframe
a b c
1 1 q bananas
2 2 b are
3 3 z so very great
Note that a data frame is actually a special case of a list where each element of the list is a vector of the same length.
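To make the distinction between these objects concrete, here is a small sketch using only base R. The object name example_list is made up for illustration; it simply mirrors the list defined above.

```r
# a list can hold objects of different types and sizes
example_list <- list(my_number = 5,
                     my_vector = c("a", "b", "c"),
                     my_dataframe = data.frame(a = 1:3, b = c("q", "b", "z")))

# extract a single element by name with $ or [[ ]]
example_list$my_vector
example_list[["my_number"]]

# single brackets return a (shorter) list rather than the element itself
class(example_list["my_number"])

# a data frame is itself a list whose elements are its columns
is.list(example_list$my_dataframe)
length(example_list$my_dataframe)  # the number of columns
```

The last two lines are the reason map functions iterate over the columns of a data frame: as far as iteration is concerned, a data frame is a list of columns.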
Map functions: beyond apply
A map function is one that applies the same action/function to every element of an object (e.g. each entry of a list or a vector, or each of the columns of a data frame).
If you’re familiar with the base R apply() functions, then it turns out that you are already familiar with map functions, even if you didn’t know it!
The apply() functions are a set of super useful base R functions for iteratively performing an action across the entries of a vector or list without having to write a for-loop. While there is nothing fundamentally wrong with the base R apply functions, the syntax is somewhat inconsistent across the different apply functions, and the type of the object they return is often ambiguous (at least it is for sapply…).
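As a rough illustration of that ambiguity, the base R sketch below calls sapply() three times on the same input and gets three different kinds of object back. The object names (res_vector, res_matrix, res_list) are made up for this example.

```r
# sapply() simplifies its result when it can, so the output type
# depends on what the function happens to return
res_vector <- sapply(1:3, function(i) i * 2)        # each result has length 1
res_matrix <- sapply(1:3, function(i) c(i, i * 2))  # each result has length 2
res_list   <- sapply(1:3, function(i) seq_len(i))   # results have different lengths

is.numeric(res_vector)  # a plain numeric vector
is.matrix(res_matrix)   # a 2 x 3 matrix
is.list(res_list)       # a list
```

With the map functions, the output type is fixed by the function you choose rather than by the data.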
The naming convention of the map functions is such that the type of the output is specified by the term that follows the underscore in the function name.
map(.x, .f) is the main mapping function and returns a list
map_df(.x, .f) returns a data frame
map_dbl(.x, .f) returns a numeric (double) vector
map_chr(.x, .f) returns a character vector
map_lgl(.x, .f) returns a logical vector
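A quick sketch of the naming convention in action, assuming purrr is installed. The input vector x and the small functions passed to each map are made up for illustration.

```r
library(purrr)

x <- c(1, 4, 7)

# same input, different output types depending on the suffix
as_list <- map(x, function(v) v > 3)                       # list
as_lgl  <- map_lgl(x, function(v) v > 3)                   # logical vector
as_dbl  <- map_dbl(x, function(v) v * 2)                   # double vector
as_chr  <- map_chr(x, function(v) paste0("value_", v))     # character vector

as_list
as_lgl
as_dbl
as_chr
```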
Consistent with the way of the tidyverse, the first argument of each mapping function is always the data object that you want to map over, and the second argument is always the function that you want to iteratively apply to each element of the input object.
The input object to any map function is always either
a vector (of any type), in which case the iteration is done over the entries of the vector,
a list, in which case the iteration is performed over the elements of the list,
a data frame, in which case the iteration is performed over the columns of the data frame (which, since a data frame is a special kind of list, is technically the same as the previous point).
Since the first argument is always the data, this means that map functions play nicely with pipes (%>%). If you’ve never seen pipes before, they’re really useful (originally from the magrittr package, but also made available by the dplyr package and thus by the tidyverse). Piping allows you to string together many functions by piping an object (which itself might be the output of a function) into the first argument of the next function. If you’d like to learn more about pipes, check out my tidyverse blog posts.
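Here is a minimal sketch of what the pipe does with a map function, assuming the magrittr and purrr packages are installed. The vector x and the result names are made up for illustration.

```r
library(magrittr)  # provides the %>% pipe
library(purrr)

x <- c(1, 4, 7)

# these two calls are equivalent: the pipe passes x as the first argument
piped  <- x %>% map_dbl(sqrt)
nested <- map_dbl(x, sqrt)
identical(piped, nested)

# pipes can be chained: take square roots, then add them up
total <- x %>% map_dbl(sqrt) %>% sum()
total
```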
Throughout this post I will demonstrate each of purrr’s functionalities using both a simple numeric example (to explain the concept) and the gapminder data (to show a more complex example).
Simplest usage: repeated looping with map
Fundamentally, maps are for iteration. In the example below I will iterate through the vector c(1, 4, 7) by adding 10 to each entry. This function applied to a single number, which we will call .x, can be defined as
addTen <- function(.x) {
  return(.x + 10)
}
The map() function below iterates addTen() across all entries of the vector, .x = c(1, 4, 7), and returns the output as a list
library(tidyverse)
map(.x = c(1, 4, 7),
.f = addTen)
[[1]]
[1] 11
[[2]]
[1] 14
[[3]]
[1] 17
Fortunately, you don’t actually need to specify the argument names
map(c(1, 4, 7), addTen)
[[1]]
[1] 11
[[2]]
[1] 14
[[3]]
[1] 17
Note that
the first element of the output is the result of applying the function to the first element of the input (1),
the second element of the output is the result of applying the function to the second element of the input (4),
and the third element of the output is the result of applying the function to the third element of the input (7).
The following code chunks show that whether the input object is a vector, a list, or a data frame, map() always returns a list.
map(list(1, 4, 7), addTen)
[[1]]
[1] 11
[[2]]
[1] 14
[[3]]
[1] 17
map(data.frame(a = 1, b = 4, c = 7), addTen)
$a
[1] 11
$b
[1] 14
$c
[1] 17
If we want the output of map() to be some other object type, we need to use a different function. For instance, to map the input to a numeric (double) vector, you can use the map_dbl() (“map to a double”) function.
map_dbl(c(1, 4, 7), addTen)
[1] 11 14 17
To map to a character vector, you can use the map_chr() (“map to a character”) function.
map_chr(c(1, 4, 7), addTen)
Warning: Automatic coercion from double to character was deprecated in purrr 1.0.0.
ℹ Please use an explicit call to `as.character()` within `map_chr()` instead.
[1] "11.000000" "14.000000" "17.000000"
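The warning above is raised because addTen() returns a number and map_chr() has to convert it to a string. A minimal sketch that follows the warning’s own advice, assuming purrr is installed (addTen() is redefined here only so that the snippet runs on its own, and chr_result is a made-up name):

```r
library(purrr)

addTen <- function(.x) {
  return(.x + 10)
}

# convert to character explicitly inside the function being mapped
chr_result <- map_chr(c(1, 4, 7), function(.x) as.character(addTen(.x)))
chr_result
```

Note that the explicit conversion gives "11", "14" and "17" rather than the zero-padded strings produced by the automatic coercion.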
If you want to return a data frame, then you would use the map_df() function. However, you need to make sure that in each iteration you’re returning a data frame which has consistent column names. map_df() will automatically bind the rows of each iteration.
For this example, I want to return a data frame whose columns correspond to the original number and the number plus ten.
map_df(c(1, 4, 7), function(.x) {
return(data.frame(old_number = .x,
new_number = addTen(.x)))
})
old_number new_number
1 1 11
2 4 14
3 7 17
Note that in this case, I defined an “anonymous” function as the function to apply in each iteration. An anonymous function is a temporary function that you define directly as the function argument to the map, without giving it a name. Here I used the argument name .x, but I could have used anything.
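One caveat if you are on a recent version of purrr: from purrr 1.0.0 onwards, map_df() is marked as superseded, and the documented alternative is to call map() and then bind the resulting list of data frames with list_rbind(). A minimal sketch of the same example written that way, assuming purrr 1.0.0 or later (addTen() is redefined so the snippet is self-contained, and result is a made-up name):

```r
library(purrr)

addTen <- function(.x) {
  return(.x + 10)
}

# map() returns a list of one-row data frames;
# list_rbind() stacks them into a single data frame
result <- list_rbind(
  map(c(1, 4, 7), function(.x) {
    data.frame(old_number = .x,
               new_number = addTen(.x))
  })
)
result
```

map_df() still works, so the rest of this post keeps using it.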
Another function to be aware of is modify(), which is just like the map functions, but always returns an object of the same type as the input object.
library(tidyverse)
modify(c(1, 4, 7), addTen)
[1] 11 14 17
modify(list(1, 4, 7), addTen)
[[1]]
[1] 11
[[2]]
[1] 14
[[3]]
[1] 17
modify(data.frame(1, 4, 7), addTen)
X1 X4 X7
1 11 14 17
modify() also has a pretty useful sibling, modify_if(), that only applies the function to elements that satisfy a specific criterion (specified by a “predicate function”, the second argument, called .p). For instance, the following example only modifies the third entry since it is greater than 5.
modify_if(.x = list(1, 4, 7),
.p = function(x) x > 5,
.f = addTen)
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 17
The tilde-dot shorthand for functions
To make the code more concise you can use the tilde-dot shorthand for anonymous functions (the functions that you create as arguments of other functions).
The notation works by replacing
function(x) {
  x + 10
}
with
~{.x + 10}
~ indicates that you have started an anonymous function, and the argument of the anonymous function can be referred to using .x (or simply .). Unlike normal function arguments that can be anything that you like, the tilde-dot function argument is always .x.
Thus, instead of defining the addTen() function separately, we could use the tilde-dot shorthand
map_dbl(c(1, 4, 7), ~{.x + 10})
[1] 11 14 17
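To see that the different spellings really are interchangeable, here is a small sketch (assuming purrr is installed) that writes the same anonymous function three ways. The result names are made up for illustration.

```r
library(purrr)

x <- c(1, 4, 7)

# a full anonymous function
with_function <- map_dbl(x, function(x) x + 10)
# tilde-dot shorthand using .x
with_tilde <- map_dbl(x, ~ .x + 10)
# tilde-dot shorthand using just .
with_dot <- map_dbl(x, ~ . + 10)

identical(with_function, with_tilde)
identical(with_tilde, with_dot)
```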
Applying map functions in a slightly more interesting context
Throughout this tutorial, we will use the gapminder dataset that can be loaded directly if you’re connected to the internet. Each function will first be demonstrated using a simple numeric example, and then will be demonstrated using a more complex practical example based on the gapminder dataset.
My general workflow involves loading the original data and saving it as an object with a meaningful name and an _orig suffix. I then define a copy of the original dataset without the _orig suffix. Having an original copy of my data in my environment means that it is easy to check that my manipulations do what I expected. I will make direct data cleaning modifications to the gapminder data frame, but will never edit the gapminder_orig data frame.
# to download the data directly:
gapminder_orig <- read.csv("https://raw.githubusercontent.com/rlbarter/personal-website-quarto/main/blog/data/gapminder.csv")
# define a copy of the original dataset that we will clean and play with
gapminder <- gapminder_orig
The gapminder dataset has 1704 rows containing information on population, life expectancy and GDP per capita by year and country.
A “tidy” data frame is one where every row is a single observational unit (in this case, indexed by country and year), and every column corresponds to a variable that is measured for each observational unit (in this case, for each country and year, a measurement is made for population, continent, life expectancy and GDP). If you’d like to learn more about “tidy data”, I highly recommend reading Hadley Wickham’s tidy data article.
dim(gapminder)
[1] 1704 6
head(gapminder)
country continent year lifeExp pop gdpPercap
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
Since gapminder is a data frame, the map_ functions will iterate over each column. An example of simple usage of the map_ functions is to summarize each column. For instance, you can identify the type of each column by applying the class() function to each column. Since the output of the class() function is a character, we will use the map_chr() function:
# apply the class() function to each column
gapminder %>% map_chr(class)
country continent year lifeExp pop gdpPercap
"character" "character" "integer" "numeric" "integer" "numeric"
I frequently do this to get a quick snapshot of each column type of a new dataset directly in the console. As a habit, I usually pipe in the data using %>%, rather than provide it as an argument. Remember that the pipe places the object to the left of the pipe in the first argument of the function to the right.
Similarly, if you wanted to identify the number of distinct values in each column, you could apply the n_distinct() function from the dplyr package to each column. Since the output of n_distinct() is a single number (strictly an integer, which map_dbl() converts to a double without complaint), you might want to use the map_dbl() function so that the results of each iteration (the application of n_distinct() to each column) are concatenated into a numeric vector:
# apply the n_distinct() function to each column
gapminder %>% map_dbl(n_distinct)
country continent year lifeExp pop gdpPercap
142 5 12 1626 1704 1704
If you want to do something a little more complicated, such as returning a few different summaries of each column in a data frame, you can use map_df(). When things are getting a little bit more complicated, you typically need to define an anonymous function that you want to apply to each column. Using the tilde-dot notation, the anonymous function below calculates the number of distinct entries and the type of the current column (which is accessible as .x), and then combines them into a two-column data frame. Once it has iterated through each of the columns, the map_df() function combines the data frames row-wise into a single data frame.
gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))))
n_distinct class
1 142 character
2 5 character
3 12 integer
4 1626 numeric
5 1704 integer
6 1704 numeric
Note that we’ve lost the variable names! The variable names correspond to the names of the objects over which we are iterating (in this case, the column names), and these are not automatically included as a column in the output data frame. You can tell map_df() to include them using the .id argument of map_df(). This will automatically take the name of the element being iterated over and include it in the column corresponding to whatever you set .id to.
gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))),
                     .id = "variable")
variable n_distinct class
1 country 142 character
2 continent 5 character
3 year 12 integer
4 lifeExp 1626 numeric
5 pop 1704 integer
6 gdpPercap 1704 numeric
If you’re having trouble thinking through these map actions, I recommend that you first figure out what the code would be to do what you want for a single element, and then paste it into the map_df() function (a nice trick I saw Hadley Wickham use a few years ago when he presented on purrr at RLadies SF).
For instance, since the first element of the gapminder data frame is the first column, let’s define .x in our environment to be this first column.
# take the first element of the gapminder data
.x <- gapminder %>% pluck(1)
# look at the first 6 rows
head(.x)
[1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
[6] "Afghanistan"
Then, you can create a data frame for this column that contains the number of distinct entries, and the class of the column.
data.frame(n_distinct = n_distinct(.x),
class = class(.x))
n_distinct class
1 142 character
Since this has done what was expected for the first column, you can paste this code into the map function using the tilde-dot shorthand.
gapminder %>% map_df(~(data.frame(n_distinct = n_distinct(.x),
                                  class = class(.x))),
                     .id = "variable")
variable n_distinct class
1 country 142 character
2 continent 5 character
3 year 12 integer
4 lifeExp 1626 numeric
5 pop 1704 integer
6 gdpPercap 1704 numeric
map_df() is definitely one of the most powerful functions of purrr in my opinion, and is probably the one that I use most.
Maps with multiple input objects
After gaining a basic understanding of purrr’s map functions, you can start to do some fancier stuff. For instance, what if you want to perform a map that iterates through two objects? The code below uses map functions to create a list of plots that compare life expectancy and GDP per capita for each continent/year combination.
The map function that maps over two objects instead of one is called map2(). The first two arguments are the two objects you want to iterate over, and the third is the function (with two arguments, one for each object).
map2(.x = object1, # the first object to iterate over
     .y = object2, # the second object to iterate over
     .f = plotFunction) # a function of two arguments, called as plotFunction(.x, .y)
First, you need to define a vector (or list) of continents and a paired vector (or list) of years that you want to iterate through. Note that in our continent/year example
the first iteration will correspond to the first continent in the continent vector and the first year in the year vector,
the second iteration will correspond to the second continent in the continent vector and the second year in the year vector.
This might seem obvious, but it is a natural instinct to incorrectly assume that map2() will automatically perform the action on all combinations that can be made from the two vectors. For instance, if you have a continent vector .x = c("Americas", "Asia") and a year vector .y = c(1952, 2007), then you might assume that map2() will iterate over the Americas for 1952 and for 2007, and then Asia for 1952 and 2007. It won’t though. The iteration will actually be first the Americas for 1952 only, and then Asia for 2007 only.
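The pairing behaviour is easy to check on the two small vectors from the paragraph above. The sketch below assumes purrr is installed; continents_demo, years_demo and all_pairs are made-up names, and expand.grid() from base R is just one possible way of building every combination if that is what you actually want.

```r
library(purrr)

continents_demo <- c("Americas", "Asia")
years_demo <- c(1952, 2007)

# map2() pairs elements by position: two iterations, not four
paired <- map2_chr(continents_demo, years_demo,
                   function(.x, .y) paste(.x, .y))
paired

# to iterate over every combination, build all the pairs first
all_pairs <- expand.grid(continent = continents_demo,
                         year = years_demo,
                         stringsAsFactors = FALSE)
combined <- map2_chr(all_pairs$continent, all_pairs$year,
                     function(.x, .y) paste(.x, .y))
combined
```

Here paired has two entries ("Americas 1952" and "Asia 2007"), while combined has four.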
First, let’s get our vectors of continents and years, starting by obtaining all distinct combinations of continents and years that appear in the data.
continent_year <- gapminder %>% distinct(continent, year)
continent_year
continent year
1 Asia 1952
2 Asia 1957
3 Asia 1962
4 Asia 1967
5 Asia 1972
6 Asia 1977
7 Asia 1982
8 Asia 1987
9 Asia 1992
10 Asia 1997
11 Asia 2002
12 Asia 2007
13 Europe 1952
14 Europe 1957
15 Europe 1962
16 Europe 1967
17 Europe 1972
18 Europe 1977
19 Europe 1982
20 Europe 1987
21 Europe 1992
22 Europe 1997
23 Europe 2002
24 Europe 2007
25 Africa 1952
26 Africa 1957
27 Africa 1962
28 Africa 1967
29 Africa 1972
30 Africa 1977
31 Africa 1982
32 Africa 1987
33 Africa 1992
34 Africa 1997
35 Africa 2002
36 Africa 2007
37 Americas 1952
38 Americas 1957
39 Americas 1962
40 Americas 1967
41 Americas 1972
42 Americas 1977
43 Americas 1982
44 Americas 1987
45 Americas 1992
46 Americas 1997
47 Americas 2002
48 Americas 2007
49 Oceania 1952
50 Oceania 1957
51 Oceania 1962
52 Oceania 1967
53 Oceania 1972
54 Oceania 1977
55 Oceania 1982
56 Oceania 1987
57 Oceania 1992
58 Oceania 1997
59 Oceania 2002
60 Oceania 2007
Then extracting the continent and year pairs as separate vectors
# extract the continent and year pairs as separate vectors
continents <- continent_year %>% pull(continent) %>% as.character
years <- continent_year %>% pull(year)
If you want to use tilde-dot short-hand, the anonymous arguments will be .x for the first object being iterated over, and .y for the second object being iterated over.
Before jumping straight into the map function, it’s a good idea to first figure out what the code will be for just the first iteration (the first continent and the first year, which happen to be Asia in 1952).
# try to figure out the code for the first example
.x <- continents[1]
.y <- years[1]
# make a scatterplot of GDP vs life expectancy in all Asian countries for 1952
gapminder %>%
  filter(continent == .x,
         year == .y) %>%
  ggplot() +
  geom_point(aes(x = gdpPercap, y = lifeExp)) +
  ggtitle(glue::glue(.x, " ", .y))
This seems to have worked. So you can then copy-and-paste the code into the map2() function
plot_list <- map2(.x = continents,
                  .y = years,
                  .f = ~{
                    gapminder %>%
                      filter(continent == .x,
                             year == .y) %>%
                      ggplot() +
                      geom_point(aes(x = gdpPercap, y = lifeExp)) +
                      ggtitle(glue::glue(.x, " ", .y))
                  })
And you can look at a few of the entries of the list to see that they make sense
plot_list[[1]]
plot_list[[22]]
pmap() allows you to iterate over an arbitrary number of objects (i.e. more than two).
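As a minimal sketch of how this looks in practice (assuming purrr is installed), pmap() takes a single list containing all of the objects to iterate over, and each iteration uses one entry from each of them. The list inputs and its element names x, y and z are made up for illustration.

```r
library(purrr)

# all of the objects to iterate over go into one list
inputs <- list(x = c(1, 2, 3),
               y = c(10, 20, 30),
               z = c(100, 200, 300))

# the argument names of the function match the names in the list;
# pmap_dbl() returns a numeric vector, just like map_dbl() and map2_dbl()
sums <- pmap_dbl(inputs, function(x, y, z) x + y + z)
sums
```

As with map2(), the iteration is position-wise: the first call receives the first entry of each of x, y and z, and so on.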
List columns and Nested data frames
Tibbles are tidyverse data frames. Some crazy stuff starts happening when you learn that tibble columns can be lists (as opposed to vectors, which is what they usually are). This is where the difference between tibbles and data frames becomes real.
For instance, a tibble can be “nested” where the tibble is essentially split into separate data frames based on a grouping variable, and these separate data frames are stored as entries of a list (that is then stored in the data column of the data frame).
Below I nest the gapminder data by continent.
gapminder_nested <- gapminder %>%
  group_by(continent) %>%
  nest()
gapminder_nested
# A tibble: 5 × 2
# Groups: continent [5]
continent data
<chr> <list>
1 Asia <tibble [396 × 5]>
2 Europe <tibble [360 × 5]>
3 Africa <tibble [624 × 5]>
4 Americas <tibble [300 × 5]>
5 Oceania <tibble [24 × 5]>
The first column is the variable that we grouped by, continent, and the second column is the rest of the data frame corresponding to that group (as if you had filtered the data frame to the specific continent). To see this, the code below shows that the first entry in the data column corresponds to the entire gapminder dataset for Asia.
gapminder_nested$data[[1]]
# A tibble: 396 × 5
country year lifeExp pop gdpPercap
<chr> <int> <dbl> <int> <dbl>
1 Afghanistan 1952 28.8 8425333 779.
2 Afghanistan 1957 30.3 9240934 821.
3 Afghanistan 1962 32.0 10267083 853.
4 Afghanistan 1967 34.0 11537966 836.
5 Afghanistan 1972 36.1 13079460 740.
6 Afghanistan 1977 38.4 14880372 786.
7 Afghanistan 1982 39.9 12881816 978.
8 Afghanistan 1987 40.8 13867957 852.
9 Afghanistan 1992 41.7 16317921 649.
10 Afghanistan 1997 41.8 22227415 635.
# ℹ 386 more rows
Using purrr’s pluck() function, this can be written as
gapminder_nested %>%
  # extract the first entry from the data column
  pluck("data", 1)
# A tibble: 396 × 5
country year lifeExp pop gdpPercap
<chr> <int> <dbl> <int> <dbl>
1 Afghanistan 1952 28.8 8425333 779.
2 Afghanistan 1957 30.3 9240934 821.
3 Afghanistan 1962 32.0 10267083 853.
4 Afghanistan 1967 34.0 11537966 836.
5 Afghanistan 1972 36.1 13079460 740.
6 Afghanistan 1977 38.4 14880372 786.
7 Afghanistan 1982 39.9 12881816 978.
8 Afghanistan 1987 40.8 13867957 852.
9 Afghanistan 1992 41.7 16317921 649.
10 Afghanistan 1997 41.8 22227415 635.
# ℹ 386 more rows
Similarly, the 5th entry in the data column corresponds to the entire gapminder dataset for Oceania.
gapminder_nested %>% pluck("data", 5)
# A tibble: 24 × 5
country year lifeExp pop gdpPercap
<chr> <int> <dbl> <int> <dbl>
1 Australia 1952 69.1 8691212 10040.
2 Australia 1957 70.3 9712569 10950.
3 Australia 1962 70.9 10794968 12217.
4 Australia 1967 71.1 11872264 14526.
5 Australia 1972 71.9 13177000 16789.
6 Australia 1977 73.5 14074100 18334.
7 Australia 1982 74.7 15184200 19477.
8 Australia 1987 76.3 16257249 21889.
9 Australia 1992 77.6 17481977 23425.
10 Australia 1997 78.8 18565243 26998.
# ℹ 14 more rows
You might be asking at this point why you would ever want to nest your data frame? It just doesn’t seem like that useful a thing to do… until you realise that you now have the power to use dplyr manipulations on more complex objects that can be stored in a list.
However, since actions such as mutate() are applied directly to the entire column (which is usually a vector, so is fine), we run into issues when we try to mutate a list. For instance, since columns are usually vectors, normal vectorized functions work just fine on them
tibble(vec_col = 1:10) %>%
mutate(vec_sum = sum(vec_col))
# A tibble: 10 × 2
vec_col vec_sum
<int> <int>
1 1 55
2 2 55
3 3 55
4 4 55
5 5 55
6 6 55
7 7 55
8 8 55
9 9 55
10 10 55
but when the column is a list, vectorized functions don’t know what to do with them, and we get an error that says Error in sum(x) : invalid 'type' (list) of argument. Try
tibble(list_col = list(c(1, 5, 7),
5,
c(10, 10, 11))) %>%
mutate(list_sum = sum(list_col))
To apply mutate functions to a list-column, you need to wrap the function you want to apply in a map function.
tibble(list_col = list(c(1, 5, 7),
5,
c(10, 10, 11))) %>%
mutate(list_sum = map(list_col, sum))
# A tibble: 3 × 2
list_col list_sum
<list> <list>
1 <dbl [3]> <dbl [1]>
2 <dbl [1]> <dbl [1]>
3 <dbl [3]> <dbl [1]>
Since map() returns a list itself, the list_sum column is thus itself a list
tibble(list_col = list(c(1, 5, 7),
5,
c(10, 10, 11))) %>%
mutate(list_sum = map(list_col, sum)) %>%
pull(list_sum)
[[1]]
[1] 13
[[2]]
[1] 5
[[3]]
[1] 31
What could we do if we wanted it to be a vector? We could use the map_dbl() function instead!
tibble(list_col = list(c(1, 5, 7),
5,
c(10, 10, 11))) %>%
mutate(list_sum = map_dbl(list_col, sum))
# A tibble: 3 × 2
list_col list_sum
<list> <dbl>
1 <dbl [3]> 13
2 <dbl [1]> 5
3 <dbl [3]> 31
Nesting the gapminder data
Let’s return to the nested gapminder dataset. I want to calculate the average life expectancy within each continent and add it as a new column using mutate(). Based on the example above, can you explain why the following code doesn’t work?
gapminder_nested %>%
  mutate(avg_lifeExp = mean(data$lifeExp))
I was hoping that this code would extract the lifeExp column from each data frame. But I’m applying the mutate to the data column, which itself doesn’t have an entry called lifeExp since it’s a list of data frames. How could I get access to the lifeExp column of the data frames stored in the data list? Using a map function of course!
Think of an individual data frame as .x. Again, I will first figure out the code for calculating the mean life expectancy for the first entry of the column. The following code defines .x to be the first entry of the data column (this is the data frame for Asia).
# the first entry of the "data" column
.x <- gapminder_nested %>% pluck("data", 1)
.x
# A tibble: 396 × 5
country year lifeExp pop gdpPercap
<chr> <int> <dbl> <int> <dbl>
1 Afghanistan 1952 28.8 8425333 779.
2 Afghanistan 1957 30.3 9240934 821.
3 Afghanistan 1962 32.0 10267083 853.
4 Afghanistan 1967 34.0 11537966 836.
5 Afghanistan 1972 36.1 13079460 740.
6 Afghanistan 1977 38.4 14880372 786.
7 Afghanistan 1982 39.9 12881816 978.
8 Afghanistan 1987 40.8 13867957 852.
9 Afghanistan 1992 41.7 16317921 649.
10 Afghanistan 1997 41.8 22227415 635.
# ℹ 386 more rows
Then to calculate the average life expectancy for Asia, I could write
mean(.x$lifeExp)
[1] 60.0649
So copy-pasting this into the tilde-dot anonymous function argument of the map_dbl() function within mutate(), I get what I wanted!
gapminder_nested %>%
  mutate(avg_lifeExp = map_dbl(data, ~{mean(.x$lifeExp)}))
# A tibble: 5 × 3
# Groups: continent [5]
continent data avg_lifeExp
<chr> <list> <dbl>
1 Asia <tibble [396 × 5]> 60.1
2 Europe <tibble [360 × 5]> 71.9
3 Africa <tibble [624 × 5]> 48.9
4 Americas <tibble [300 × 5]> 64.7
5 Oceania <tibble [24 × 5]> 74.3
This code iterates through the data frames stored in the data column, returns the average life expectancy for each data frame, and concatenates the results into a numeric vector (which is then stored as a column called avg_lifeExp).
I hear what you’re saying… this is something that we could have done a lot more easily using standard dplyr commands (such as summarise()). True, but hopefully it helped you understand why you need to wrap mutate functions inside map functions when applying them to list columns.
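For comparison, here is a minimal sketch of the two approaches side by side, assuming dplyr, tidyr and purrr are installed. It uses a small made-up tibble (toy, with hypothetical columns group and value) rather than gapminder so that it runs on its own.

```r
library(dplyr)
library(tidyr)
library(purrr)

toy <- tibble(group = c("a", "a", "b", "b"),
              value = c(1, 3, 10, 20))

# nested approach: one row per group, map over the list-column
nested_means <- toy %>%
  group_by(group) %>%
  nest() %>%
  mutate(avg_value = map_dbl(data, ~{mean(.x$value)})) %>%
  ungroup()

# standard dplyr approach: summarise within each group
summarised_means <- toy %>%
  group_by(group) %>%
  summarise(avg_value = mean(value))

nested_means
summarised_means
```

Both give the same group averages. The nested version keeps the per-group data frames around in the data column, which is what makes the model-fitting example that follows possible.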
Even if this example was less than inspiring, I promise the next example will knock your socks off!
The next example will demonstrate how to fit a model separately for each continent, and evaluate it, all within a single tibble. First, I will fit a linear model for each continent and store it as a list-column. If the data frame for a single continent is .x, then the model I want to fit is lm(lifeExp ~ pop + gdpPercap + year, data = .x) (check for yourself that this does what you expect). So I can copy-paste this command into the map() function within the mutate():
# fit a model separately for each continent
gapminder_nested <- gapminder_nested %>%
  mutate(lm_obj = map(data, ~lm(lifeExp ~ pop + gdpPercap + year, data = .x)))
gapminder_nested
# A tibble: 5 × 3
# Groups: continent [5]
continent data lm_obj
<chr> <list> <list>
1 Asia <tibble [396 × 5]> <lm>
2 Europe <tibble [360 × 5]> <lm>
3 Africa <tibble [624 × 5]> <lm>
4 Americas <tibble [300 × 5]> <lm>
5 Oceania <tibble [24 × 5]> <lm>
Where the first linear model (for Asia) is
gapminder_nested %>% pluck("lm_obj", 1)
Call:
lm(formula = lifeExp ~ pop + gdpPercap + year, data = .x)
Coefficients:
(Intercept) pop gdpPercap year
-7.833e+02 4.228e-11 2.510e-04 4.251e-01
I can then predict the response for the data stored in the data column using the corresponding linear model. So I have two objects I want to iterate over: the data and the linear model object. This means I want to use map2(). When things get a little more complicated I like to have multiple function arguments, so I’m going to use a full anonymous function rather than the tilde-dot shorthand.
# predict the response for each continent
gapminder_nested <- gapminder_nested %>%
  mutate(pred = map2(lm_obj, data, function(.lm, .data) predict(.lm, .data)))
gapminder_nested
# A tibble: 5 × 4
# Groups: continent [5]
continent data lm_obj pred
<chr> <list> <list> <list>
1 Asia <tibble [396 × 5]> <lm> <dbl [396]>
2 Europe <tibble [360 × 5]> <lm> <dbl [360]>
3 Africa <tibble [624 × 5]> <lm> <dbl [624]>
4 Americas <tibble [300 × 5]> <lm> <dbl [300]>
5 Oceania <tibble [24 × 5]> <lm> <dbl [24]>
And I can then calculate the correlation between the predicted response and the true response, this time using the map2_dbl() function since I want the output to be a numeric vector rather than a list of single elements.
# calculate the correlation between observed and predicted response for each continent
gapminder_nested <- gapminder_nested %>%
  mutate(cor = map2_dbl(pred, data, function(.pred, .data) cor(.pred, .data$lifeExp)))
gapminder_nested
# A tibble: 5 × 5
# Groups: continent [5]
continent data lm_obj pred cor
<chr> <list> <list> <list> <dbl>
1 Asia <tibble [396 × 5]> <lm> <dbl [396]> 0.723
2 Europe <tibble [360 × 5]> <lm> <dbl [360]> 0.834
3 Africa <tibble [624 × 5]> <lm> <dbl [624]> 0.645
4 Americas <tibble [300 × 5]> <lm> <dbl [300]> 0.779
5 Oceania <tibble [24 × 5]> <lm> <dbl [24]> 0.987
Holy guacamole, that is so awesome!
Advanced exercise
The goal of this exercise is to fit a separate linear model for each continent without splitting up the data. Create the following data frame that has the continent, each term in the model for the continent, its linear model coefficient estimate, and standard error.
# A tibble: 20 × 6
continent term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Asia (Intercept) -7.83e+ 2 4.83e+1 -16.2 1.22e-45
2 Asia pop 4.23e-11 2.04e-9 0.0207 9.83e- 1
3 Asia year 4.25e- 1 2.44e-2 17.4 1.13e-50
4 Asia gdpPercap 2.51e- 4 3.01e-5 8.34 1.31e-15
5 Europe (Intercept) -1.61e+ 2 2.28e+1 -7.09 7.44e-12
6 Europe pop -8.18e- 9 7.80e-9 -1.05 2.95e- 1
7 Europe year 1.16e- 1 1.16e-2 9.96 8.88e-21
8 Europe gdpPercap 3.25e- 4 2.15e-5 15.2 2.21e-40
9 Africa (Intercept) -4.70e+ 2 3.39e+1 -13.9 2.17e-38
10 Africa pop -3.68e- 9 1.89e-8 -0.195 8.45e- 1
11 Africa year 2.61e- 1 1.71e-2 15.2 1.07e-44
12 Africa gdpPercap 1.12e- 3 1.01e-4 11.1 2.46e-26
13 Americas (Intercept) -5.33e+ 2 4.10e+1 -13.0 6.40e-31
14 Americas pop -2.15e- 8 8.62e-9 -2.49 1.32e- 2
15 Americas year 3.00e- 1 2.08e-2 14.4 3.79e-36
16 Americas gdpPercap 6.75e- 4 7.15e-5 9.44 1.13e-18
17 Oceania (Intercept) -2.10e+ 2 5.12e+1 -4.10 5.61e- 4
18 Oceania pop 8.37e- 9 3.34e-8 0.251 8.05e- 1
19 Oceania year 1.42e- 1 2.65e-2 5.34 3.19e- 5
20 Oceania gdpPercap 2.03e- 4 8.47e-5 2.39 2.66e- 2
Hint: starting from the gapminder dataset, use group_by() and nest() to nest by continent, use a mutate together with map to fit a linear model for each continent, use another mutate with broom::tidy() to get a data frame of model coefficients for each model, and a transmute to get just the columns you want, followed by an unnest() to re-expand the nested tibble.
The solution code is at the end of this post.
If you want to stop here, you will already know more than most purrr users. The remainder of this blog post involves little-used features of purrr for manipulating lists.
Additional purrr functionalities for lists
To demonstrate how to use purrr to manipulate lists, we will split the gapminder dataset into a list of data frames (which is kind of like the converse of a data frame containing a list-column). To make sure it’s easy to follow, we will only keep 5 rows from each continent.
set.seed(23489)
gapminder_list <- gapminder %>% split(gapminder$continent) %>%
  map(~sample_n(., 5))
gapminder_list
$Africa
country continent year lifeExp pop gdpPercap
1 Gambia Africa 1967 35.857 439593 734.7829
2 Sierra Leone Africa 1967 34.113 2662190 1206.0435
3 Namibia Africa 1997 58.909 1774766 3899.5243
4 Equatorial Guinea Africa 1992 47.545 387838 1132.0550
5 Cote d'Ivoire Africa 2002 46.832 16252726 1648.8008
$Americas
country continent year lifeExp pop gdpPercap
1 Dominican Republic Americas 1997 69.957 7992357 3614.101
2 Puerto Rico Americas 1987 74.630 3444468 12281.342
3 Honduras Americas 1992 66.399 5077347 3081.695
4 Uruguay Americas 2007 76.384 3447496 10611.463
5 Costa Rica Americas 1962 62.842 1345187 3460.937
$Asia
country continent year lifeExp pop gdpPercap
1 Lebanon Asia 1967 63.870 2186894 6006.9830
2 Nepal Asia 1962 39.393 10332057 652.3969
3 Yemen, Rep. Asia 1992 55.599 13367997 1879.4967
4 India Asia 1972 50.651 567000000 724.0325
5 Cambodia Asia 1952 39.417 4693836 368.4693
$Europe
country continent year lifeExp pop gdpPercap
1 United Kingdom Europe 2002 78.471 59912431 29479.00
2 Greece Europe 1997 77.869 10502372 18747.70
3 Belgium Europe 2002 78.320 10311970 30485.88
4 Croatia Europe 2002 74.876 4481020 11628.39
5 Netherlands Europe 1967 73.820 12596822 15363.25
$Oceania
country continent year lifeExp pop gdpPercap
1 Australia Oceania 1982 74.740 15184200 19477.01
2 New Zealand Oceania 1997 77.550 3676187 21050.41
3 New Zealand Oceania 2007 80.204 4115771 25185.01
4 Australia Oceania 2007 81.235 20434176 34435.37
5 New Zealand Oceania 1952 69.390 1994794 10556.58
Keep/Discard: select_if for lists
keep() only keeps elements of a list that satisfy a given condition, much like select_if() selects columns of a data frame that satisfy a given condition.
The following code only keeps the gapminder continent data frames (the elements of the list) that have an average (among the sample of 5 rows) life expectancy greater than 70.
gapminder_list %>%
  keep(~{mean(.x$lifeExp) > 70})
$Americas
country continent year lifeExp pop gdpPercap
1 Dominican Republic Americas 1997 69.957 7992357 3614.101
2 Puerto Rico Americas 1987 74.630 3444468 12281.342
3 Honduras Americas 1992 66.399 5077347 3081.695
4 Uruguay Americas 2007 76.384 3447496 10611.463
5 Costa Rica Americas 1962 62.842 1345187 3460.937
$Europe
country continent year lifeExp pop gdpPercap
1 United Kingdom Europe 2002 78.471 59912431 29479.00
2 Greece Europe 1997 77.869 10502372 18747.70
3 Belgium Europe 2002 78.320 10311970 30485.88
4 Croatia Europe 2002 74.876 4481020 11628.39
5 Netherlands Europe 1967 73.820 12596822 15363.25
$Oceania
country continent year lifeExp pop gdpPercap
1 Australia Oceania 1982 74.740 15184200 19477.01
2 New Zealand Oceania 1997 77.550 3676187 21050.41
3 New Zealand Oceania 2007 80.204 4115771 25185.01
4 Australia Oceania 2007 81.235 20434176 34435.37
5 New Zealand Oceania 1952 69.390 1994794 10556.58
discard() does the opposite of keep(): it discards any elements that satisfy your logical condition.
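To see keep() and discard() side by side, here is a minimal sketch. The toy_list object is a hypothetical stand-in for gapminder_list (it is not part of the gapminder data), so the block runs on its own.

```r
library(purrr)

# Hypothetical toy list standing in for gapminder_list
toy_list <- list(
  low  = data.frame(lifeExp = c(40, 50)),
  high = data.frame(lifeExp = c(75, 80))
)

# keep() retains the elements for which the condition is TRUE
kept <- toy_list %>% keep(~ mean(.x$lifeExp) > 70)

# discard() drops those same elements and returns the rest
dropped <- toy_list %>% discard(~ mean(.x$lifeExp) > 70)

names(kept)
names(dropped)
```

Between them, the two calls split the list into the elements that pass the condition and the elements that fail it.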
Reduce
reduce() is designed to combine (reduce) all of the elements of a list into a single object by iteratively applying a binary function (a function that takes two inputs).
For instance, applying a reduce function to add up all of the elements of the vector c(1, 2, 3) is like doing sum(sum(1, 2), 3): first it applies sum to 1 and 2, then it applies sum again to the output of sum(1, 2) and 3.
reduce(c(1, 2, 3), sum)
[1] 6
accumulate() also returns the intermediate values.
accumulate(c(1, 2, 3), sum)
[1] 1 3 6
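If it helps to picture what is happening, here is a rough sketch of the same idea written as a for loop. my_reduce() is a hypothetical helper written just for this illustration; it is not how purrr implements reduce(), and it ignores edge cases such as empty inputs.

```r
library(purrr)

# Hypothetical helper, for illustration only: a hand-rolled left-to-right fold
my_reduce <- function(x, f) {
  out <- x[[1]]            # start with the first element
  for (el in x[-1]) {      # then fold in each remaining element in turn
    out <- f(out, el)
  }
  out
}

by_hand    <- my_reduce(c(1, 2, 3), sum)
with_purrr <- reduce(c(1, 2, 3), sum)
```

Both by_hand and with_purrr hold the same total, because each applies sum to the running result and the next element.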
An example of when reduce() might come in handy is when you want to perform many left_join()s in a row, or to do repeated rbind()s (e.g. to bind the rows of the list back together into a single data frame):
gapminder_list %>%
  reduce(rbind)
country continent year lifeExp pop gdpPercap
1 Gambia Africa 1967 35.857 439593 734.7829
2 Sierra Leone Africa 1967 34.113 2662190 1206.0435
3 Namibia Africa 1997 58.909 1774766 3899.5243
4 Equatorial Guinea Africa 1992 47.545 387838 1132.0550
5 Cote d'Ivoire Africa 2002 46.832 16252726 1648.8008
6 Dominican Republic Americas 1997 69.957 7992357 3614.1013
7 Puerto Rico Americas 1987 74.630 3444468 12281.3419
8 Honduras Americas 1992 66.399 5077347 3081.6946
9 Uruguay Americas 2007 76.384 3447496 10611.4630
10 Costa Rica Americas 1962 62.842 1345187 3460.9370
11 Lebanon Asia 1967 63.870 2186894 6006.9830
12 Nepal Asia 1962 39.393 10332057 652.3969
13 Yemen, Rep. Asia 1992 55.599 13367997 1879.4967
14 India Asia 1972 50.651 567000000 724.0325
15 Cambodia Asia 1952 39.417 4693836 368.4693
16 United Kingdom Europe 2002 78.471 59912431 29478.9992
17 Greece Europe 1997 77.869 10502372 18747.6981
18 Belgium Europe 2002 78.320 10311970 30485.8838
19 Croatia Europe 2002 74.876 4481020 11628.3890
20 Netherlands Europe 1967 73.820 12596822 15363.2514
21 Australia Oceania 1982 74.740 15184200 19477.0093
22 New Zealand Oceania 1997 77.550 3676187 21050.4138
23 New Zealand Oceania 2007 80.204 4115771 25185.0091
24 Australia Oceania 2007 81.235 20434176 34435.3674
25 New Zealand Oceania 1952 69.390 1994794 10556.5757
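The left_join() use case works the same way. Below is a small sketch, assuming a list of data frames that all share a key column. The tables list and its id column are made up for illustration and have nothing to do with gapminder.

```r
library(purrr)
library(dplyr)

# Hypothetical toy tables that share an "id" column
tables <- list(
  data.frame(id = 1:3, a = c("x", "y", "z")),
  data.frame(id = 1:3, b = c(10, 20, 30)),
  data.frame(id = 1:3, c = c(TRUE, FALSE, TRUE))
)

# Join the first two tables, then join that result to the third, and so on
joined <- tables %>%
  reduce(~ left_join(.x, .y, by = "id"))

joined
```

Here .x is the result accumulated so far and .y is the next data frame in the list, so the same line works however many tables the list contains.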
Logical statements for lists
Asking logical questions of a list can be done using every() and some(). For instance, to ask whether every continent has average life expectancy greater than 70, you can use every():
gapminder_list %>% every(~{mean(.x$lifeExp) > 70})
[1] FALSE
To ask whether some continents have average life expectancy greater than 70, you can use some():
gapminder_list %>% some(~{mean(.x$lifeExp) > 70})
[1] TRUE
An equivalent of %in% for lists is has_element().
list(1, c(2, 5, 1), "a") %>% has_element("a")
[1] TRUE
Most of these functions also work on vectors.
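As a quick illustration of that last point, here is a sketch that runs a few of these functions on a plain numeric vector. The ages vector is made up for the example.

```r
library(purrr)

# Hypothetical numeric vector
ages <- c(12, 25, 40)

all_adults  <- every(ages, ~ .x >= 18)   # is every element at least 18?
some_adults <- some(ages, ~ .x >= 18)    # is at least one element at least 18?
adults      <- keep(ages, ~ .x >= 18)    # the elements that are at least 18
minors      <- discard(ages, ~ .x >= 18) # the elements that are not
total       <- reduce(ages, `+`)         # all elements added together
```

keep() and discard() return a shorter vector of the same type rather than a list.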
Now go forth and purrr!
Answer to advanced exercise
The following code produces the table from the exercise above
gapminder %>%
  group_by(continent) %>%
  nest() %>%
  mutate(lm_obj = map(data, ~lm(lifeExp ~ pop + year + gdpPercap, data = .))) %>%
  mutate(lm_tidy = map(lm_obj, broom::tidy)) %>%
  ungroup() %>%
  transmute(continent, lm_tidy) %>%
  unnest(cols = c(lm_tidy))
# A tibble: 20 × 6
continent term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Asia (Intercept) -7.83e+ 2 4.83e+1 -16.2 1.22e-45
2 Asia pop 4.23e-11 2.04e-9 0.0207 9.83e- 1
3 Asia year 4.25e- 1 2.44e-2 17.4 1.13e-50
4 Asia gdpPercap 2.51e- 4 3.01e-5 8.34 1.31e-15
5 Europe (Intercept) -1.61e+ 2 2.28e+1 -7.09 7.44e-12
6 Europe pop -8.18e- 9 7.80e-9 -1.05 2.95e- 1
7 Europe year 1.16e- 1 1.16e-2 9.96 8.88e-21
8 Europe gdpPercap 3.25e- 4 2.15e-5 15.2 2.21e-40
9 Africa (Intercept) -4.70e+ 2 3.39e+1 -13.9 2.17e-38
10 Africa pop -3.68e- 9 1.89e-8 -0.195 8.45e- 1
11 Africa year 2.61e- 1 1.71e-2 15.2 1.07e-44
12 Africa gdpPercap 1.12e- 3 1.01e-4 11.1 2.46e-26
13 Americas (Intercept) -5.33e+ 2 4.10e+1 -13.0 6.40e-31
14 Americas pop -2.15e- 8 8.62e-9 -2.49 1.32e- 2
15 Americas year 3.00e- 1 2.08e-2 14.4 3.79e-36
16 Americas gdpPercap 6.75e- 4 7.15e-5 9.44 1.13e-18
17 Oceania (Intercept) -2.10e+ 2 5.12e+1 -4.10 5.61e- 4
18 Oceania pop 8.37e- 9 3.34e-8 0.251 8.05e- 1
19 Oceania year 1.42e- 1 2.65e-2 5.34 3.19e- 5
20 Oceania gdpPercap 2.03e- 4 8.47e-5 2.39 2.66e- 2