# remotes::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)
library(tidyverse)
I often find that I want to use a dplyr function on multiple columns at once. For instance, perhaps I want to scale all of the numeric variables at once using a mutate function, or I want to provide the same summary for three of my variables.
While it’s been possible to do such tasks for a while using scoped verbs, it’s now even easier - and more consistent - using dplyr’s new across() function.
To demonstrate across(), I’m going to use the Palmer Penguins dataset, which was originally collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, but has recently been made popular in the R community by Allison Horst as an alternative to the over-used Iris dataset.
To start with, let’s load the penguins dataset (via the palmerpenguins package) and the tidyverse package. If you’re new to the tidyverse (primarily to dplyr and piping, %>%), I suggest taking a look at my post on the tidyverse before reading this post.
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
There are 344 rows in the penguins dataset, one for each penguin, and 8 columns. The first two columns, species and island, specify the species and island of the penguin; the next four specify numeric traits of the penguin (bill length, bill depth, flipper length and body mass); and the last two, sex and year, record the penguin’s sex and the study year.
The new across() function lets dplyr verbs such as summarise() and mutate() behave like “scoped” versions of themselves, which means you can specify multiple columns that your dplyr function will apply to.
Ordinarily, if we want to summarise a single column, such as species, by calculating the number of distinct entries it contains (using n_distinct()), we would write
penguins %>%
  summarise(distinct_species = n_distinct(species))
# A tibble: 1 × 1
distinct_species
<int>
1 3
If we wanted to calculate n_distinct() not only across species, but also across island and sex, we would need to write out the n_distinct() function three separate times:
penguins %>%
  summarise(distinct_species = n_distinct(species),
            distinct_island = n_distinct(island),
            distinct_sex = n_distinct(sex))
# A tibble: 1 × 3
distinct_species distinct_island distinct_sex
<int> <int> <int>
1 3 3 3
Wouldn’t it be nice if we could just write which columns we want to apply n_distinct() to, and then specify n_distinct() once, rather than having to apply n_distinct() to each column separately?
This is where across() comes in. It is used inside your favourite dplyr function and the syntax is across(.cols, .fns), where .cols specifies the columns that you want the dplyr function to act on. When a dplyr function involves an external function that you’re applying to columns, e.g. n_distinct() in the example above, this external function is placed in the .fns argument. For example, if we wanted to apply n_distinct() to species, island, and sex, we would write across(c(species, island, sex), n_distinct) inside the summarise() parentheses.
Note that we are specifying which variables we want to involve in the summarise() using c(), as if we’re listing the variable names in a vector, but because we’re in dplyr-land, we don’t need to put them in quotes:
penguins %>%
  summarise(across(c(species, island, sex),
                   n_distinct))
# A tibble: 1 × 3
species island sex
<int> <int> <int>
1 3 3 3
Something else that’s really neat is that you can also use !c() to negate a set of variables (i.e. to apply the function to all variables except those that you specified in c()):
penguins %>%
  summarise(across(!c(species, island, sex),
                   n_distinct))
# A tibble: 1 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<int> <int> <int> <int> <int>
1 165 81 56 95 3
I want to emphasize here that the function n_distinct() is an argument of across(), rather than being an argument of the dplyr function (summarise()).
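One difference from the hand-written version is that across() names the output columns after the input columns, so we lose the distinct_ prefix. As a small sketch (assuming dplyr 1.0.0 or later), across() also takes a .names argument, a glue-style template in which {.col} stands for the name of the column being processed. The object name distinct_counts is just something I made up for this illustration.

```r
library(dplyr)
library(palmerpenguins)

# .names is a glue-style template; {.col} is replaced by each input column name
distinct_counts <- penguins %>%
  summarise(across(c(species, island, sex),
                   n_distinct,
                   .names = "distinct_{.col}"))

# the columns are now called distinct_species, distinct_island, distinct_sex
names(distinct_counts)
```

Like n_distinct, .names is passed to across(), not to summarise().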
Select helpers: selecting columns to apply the function to
So far we’ve seen how to apply a dplyr function to a set of columns using a vector notation c(col1, col2, col3, ...). However, there are many other ways to specify the columns that you want to apply the dplyr function to.
everything(): apply the function to all of the columns
penguins %>%
  summarise(across(everything(), n_distinct))
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex year
<int> <int> <int> <int> <int> <int> <int> <int>
1 3 3 165 81 56 95 3 3
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
starts_with(): apply the function to all columns whose name starts with a specific string
penguins %>%
  summarise(across(starts_with("bill"), n_distinct))
# A tibble: 1 × 2
bill_length_mm bill_depth_mm
<int> <int>
1 165 81
contains(): apply the function to all columns whose name contains a specific string
penguins %>%
  summarise(across(contains("length"), n_distinct))
# A tibble: 1 × 2
bill_length_mm flipper_length_mm
<int> <int>
1 165 56
where(): apply the function to all columns that satisfy a logical condition, such as is.numeric()
penguins %>%
  summarise(across(where(is.numeric), n_distinct))
# A tibble: 1 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<int> <int> <int> <int> <int>
1 165 81 56 95 3
The full list of select helpers can be found here.
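The helpers can also be combined with each other and with bare column names. The sketch below relies on tidyselect’s set operators (& for intersection, ! for complement) and on c() for unions; the object names measure_counts and bill_and_mass are only there for illustration.

```r
library(dplyr)
library(palmerpenguins)

# numeric columns, but not year: intersect where(is.numeric) with "everything except year"
measure_counts <- penguins %>%
  summarise(across(where(is.numeric) & !year, n_distinct))

# union of a helper and a bare column name
bill_and_mass <- penguins %>%
  summarise(across(c(starts_with("bill"), body_mass_g), n_distinct))

names(measure_counts)
names(bill_and_mass)
```

The first call keeps the four measurement columns and drops year; the second keeps the two bill columns plus body_mass_g.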
Using in-line functions with across
Let’s look at an example of summarizing the columns using a custom function (rather than n_distinct()). I usually do this using the tilde-dot shorthand for inline functions. The notation works by replacing
function(x) {
  x + 10
}
with
~{.x + 10}
The ~ indicates that you have started an anonymous function, and the argument of the anonymous function can be referred to using .x (or simply .). Unlike normal function arguments, which can be named anything you like, the tilde-dot function argument is always .x.
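To convince yourself that the three spellings really are interchangeable inside across(), here is a quick check on a tiny made-up tibble (toy and the with_* names are hypothetical, not part of the penguins data):

```r
library(dplyr)

# a tiny made-up tibble, used only for this comparison
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

# the same "add 10" operation written three ways
with_function <- toy %>% mutate(across(everything(), function(x) x + 10))
with_formula  <- toy %>% mutate(across(everything(), ~ .x + 10))
with_dot      <- toy %>% mutate(across(everything(), ~ . + 10))

# compare the results column by column
all.equal(with_function, with_formula)
all.equal(with_formula, with_dot)
```

Which one you use is mostly a matter of taste; I find the tilde version the quickest to type for one-liners.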
For instance, to identify how many missing values there are in every column, we could specify the inline function ~sum(is.na(.)), which flags the NA values in each column (where the column is represented by .) and adds them up:
penguins %>%
  summarise(across(everything(),
                   ~sum(is.na(.))))
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex year
<int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 2 2 2 2 11 0
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
This shows that sex has the most missing values (11), each of the four numeric measurement columns has 2, and species, island and year have none.
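If you want more than one summary per column, across() will also accept a named list of functions in place of a single function. This is a minimal sketch, assuming dplyr 1.0.0 or later; by default the output columns are named by joining the column name and the list name with an underscore. The list names mean and n_missing and the object bill_summary are my own choices for the example.

```r
library(dplyr)
library(palmerpenguins)

# each function in the named list is applied to each selected column
bill_summary <- penguins %>%
  summarise(across(starts_with("bill"),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        n_missing = ~sum(is.na(.x)))))

# one output column per (input column, function) pair,
# e.g. bill_length_mm_mean and bill_length_mm_n_missing
names(bill_summary)
```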
A mutate example
What if we want to replace the missing values in the numeric columns with 0 (clearly a terrible choice)? Without the across() function, we would apply an if_else() function separately to each numeric column, which will replace all NA values with 0 and leave all non-NA values as they are:
# define a function to replace NA with 0
replace0 <- function(x) {
  if_else(condition = is.na(x),
          true = 0,
          false = as.numeric(x))
}

penguins %>%
  mutate(bill_length_mm = replace0(bill_length_mm),
         bill_depth_mm = replace0(bill_depth_mm),
         flipper_length_mm = replace0(flipper_length_mm),
         body_mass_g = replace0(body_mass_g))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen 0 0 0 0 <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
But fortunately, we can do this a lot more efficiently with across().
# apply replace0() to every numeric column
# (where(is.numeric) also matches year, which is converted to a double)
penguins %>%
  mutate(across(where(is.numeric), replace0))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen 0 0 0 0 <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
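You may have noticed that year has changed from <int> to <dbl> in the output above. That is because where(is.numeric) also picks up year, and replace0() converts everything it touches with as.numeric(). If you would rather leave year alone, one option (a sketch, relying on the four measurement columns sitting next to each other in the data) is to select them as a range with the : operator. The name measurements_filled is just for illustration.

```r
library(dplyr)
library(palmerpenguins)

# same helper as before: replace NA with 0, returning a double vector
replace0 <- function(x) {
  if_else(condition = is.na(x),
          true = 0,
          false = as.numeric(x))
}

# bill_length_mm:body_mass_g selects the four adjacent measurement columns,
# so year is not passed through replace0() and keeps its integer type
measurements_filled <- penguins %>%
  mutate(across(bill_length_mm:body_mass_g, replace0))

class(measurements_filled$year)
```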
Although obviously 0 isn’t a great choice, so perhaps we can replace the missing values with the mean value of the column. This time, rather than define a new function (in place of replace0()), we’ll be a bit more concise and use the tilde-dot notation to specify the function we want to apply.
penguins %>%
  mutate(across(where(is.numeric),
                ~if_else(is.na(.), mean(., na.rm = TRUE), as.numeric(.))))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen 43.9 17.2 201. 4202. <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Or better yet, perhaps we can replace the missing values with the average value within the relevant species and island.
penguins %>%
  group_by(species, island) %>%
  mutate(across(where(is.numeric),
                ~if_else(condition = is.na(.),
                         true = mean(., na.rm = TRUE),
                         false = as.numeric(.)))) %>%
  ungroup()
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen 39.0 18.4 191. 3706. <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
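It is worth checking that the imputation did what we expected. The sketch below repeats the grouped imputation, this time written with dplyr’s coalesce() as an alternative to if_else() (my substitution, not a change to the approach above), and then reuses across() inside summarise() to count any NA values left in the measurement columns. The name imputed is hypothetical.

```r
library(dplyr)
library(palmerpenguins)

# fill NAs with the species-by-island mean; coalesce() keeps the first non-missing value
imputed <- penguins %>%
  group_by(species, island) %>%
  mutate(across(bill_length_mm:body_mass_g,
                ~coalesce(as.numeric(.x), mean(.x, na.rm = TRUE)))) %>%
  ungroup()

# count the missing values that remain in each measurement column
imputed %>%
  summarise(across(bill_length_mm:body_mass_g, ~sum(is.na(.x))))
```

Every count should be zero as long as each species and island combination has at least one observed value to average.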
A select example
When you’re using select(), you don’t have to include the across() function, because the select helpers have always worked with select(). This means that you can just write
penguins %>%
  select(where(is.numeric))
# A tibble: 344 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<dbl> <dbl> <int> <int> <int>
1 39.1 18.7 181 3750 2007
2 39.5 17.4 186 3800 2007
3 40.3 18 195 3250 2007
4 NA NA NA NA 2007
5 36.7 19.3 193 3450 2007
6 39.3 20.6 190 3650 2007
7 38.9 17.8 181 3625 2007
8 39.2 19.6 195 4675 2007
9 34.1 18.1 193 3475 2007
10 42 20.2 190 4250 2007
# … with 334 more rows
rather than
penguins %>%
  select(across(where(is.numeric)))
which will throw an error.
Hopefully across() will make your life easier, as it has mine!