# load in the only library you ever really need
library(tidyverse)
library(lubridate)
# load in survey data
av_survey <- read_csv("data/bikepghpublic.csv")
Note: Scoped verbs have now essentially been superseded by across() (available in dplyr 1.0.0 and later). See http://www.rebeccabarter.com/blog/2020-07-09-across/ for details.
I often find myself wishing that I could apply the same mutate function to several columns in a data frame at once, such as converting all factors to characters, doing something to all columns that have missing values, or selecting all variables whose names end with _important. When I first googled these problems around a year ago, I started to see solutions that use weird extensions of the basic mutate(), select(), rename(), and summarise() dplyr functions that look like summarise_all(), filter_at(), mutate_if(), and so on. I have since learned that these functions are called “scoped verbs” (where “scoped” means that they operate only on a selection of variables).
Unfortunately, despite my extensive googling, I never really found a satisfactory description of how to use these functions in general. I think this is primarily because the documentation for these functions is not particularly useful (try ?mutate_at()).
Fortunately, I recently attended a series of lightning talks hosted by the RLadies SF chapter where Sara Altman pointed us towards a summary document that Hadley Wickham wrote for the Data Science class he helped create at Stanford in 2017 (this class is now taught by Sara Altman herself).
To summarise what I will demonstrate below, there are three scoped variants of the standard mutate, summarise, rename and select (and transmute) dplyr functions that can be specified by the following suffixes:
_if: allows you to pick variables that satisfy some logical criteria such as is.numeric() or is.character() (e.g. summarising only the numeric columns)
_at: allows you to perform an operation only on variables specified by name (e.g. mutating only the columns whose name ends with “_date”)
_all: allows you to perform an operation on all variables at once (e.g. calculating the number of missing values in every column)
To explain how these functions all work, I will use the dataset from a survey of 800 Pittsburgh residents on whether or not they approve of self-driving car companies testing their autonomous vehicles on the streets of Pittsburgh (there have been several articles on this issue in recent times in case you missed them: 1, 2). The data can usually be downloaded from data.gov (but is currently unavailable due to the current Government Shutdown - I will update this with an actual link to the data one day). For now you can download the data from here.
A random sample of 10 rows of this dataset is shown below. To make it easy to see what’s going on, I’ll restrict my analysis below to these 10 rows.
set.seed(45679)
av_survey_sample <- av_survey %>%
  # select just a few columns and give some more intuitive column names
  select(id = `Response ID`,
         start_date = `Start Date`,
         end_date = `End Date`,
         interacted_with_av_as_pedestrian = InteractPedestrian,
         interacted_with_av_as_cyclist = InteractBicycle,
         circumstanses_of_interaction = CircumstancesCoded, # lol @ typo in data
         approve_av_testing_pgh = FeelingsProvingGround) %>%
  # take a random sample of 10 rows
  sample_n(10) %>%
  # make data frame so that we can view the whole thing
  as.data.frame()
av_survey_sample
id start_date end_date
1 260381029 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 260822947 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 260907069 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 261099035 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 260332379 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 260355021 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 260350676 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 261092370 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 260332519 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 260351560 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
1 Yes Yes
2 No Yes
3 Yes Yes
4 No Yes
5 No Yes
6 Yes Yes
7 No No
8 Not sure Not sure
9 Yes No
10 No No
circumstanses_of_interaction approve_av_testing_pgh
1 2 Approve
2 4 Disapprove
3 NA Approve
4 3 Somewhat Approve
5 NA Somewhat Approve
6 1 Approve
7 NA Approve
8 NA Somewhat Disapprove
9 2 Approve
10 NA Somewhat Disapprove
A quick useful aside: Using shorthand for functions
For many of the examples below, I will be using the ~fun(.x) shorthand for writing temporary functions. If you’ve never seen this shorthand before, it’s incredibly useful. As an example, here are three ways of counting the number of missing values in each column of a data frame.
The first approach uses the traditional sapply() function and temporary function syntax.
# using sapply and the normal temporary function syntax
sapply(av_survey_sample, function(x) sum(is.na(x)))
id start_date
0 0
end_date interacted_with_av_as_pedestrian
0 0
interacted_with_av_as_cyclist circumstanses_of_interaction
0 5
approve_av_testing_pgh
0
The second still uses the temporary function syntax, but uses the map_dbl() function from the purrr package instead of the old-school sapply() function.
# using purrr::map_dbl and the normal temporary function syntax
av_survey_sample %>% map_dbl(function(x) sum(is.na(x)))
id start_date
0 0
end_date interacted_with_av_as_pedestrian
0 0
interacted_with_av_as_cyclist circumstanses_of_interaction
0 5
approve_av_testing_pgh
0
The third uses the map_dbl() function with the ~fun(.x) syntax.
# using purrr::map_dbl and the `~fun(.x)` temporary function syntax
av_survey_sample %>% map_dbl(~sum(is.na(.x)))
id start_date
0 0
end_date interacted_with_av_as_pedestrian
0 0
interacted_with_av_as_cyclist circumstanses_of_interaction
0 5
approve_av_testing_pgh
0
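If it helps to see that these approaches really are interchangeable, here is a minimal sketch on a made-up data frame (toy and count_na are hypothetical names, not part of the survey data). It assumes purrr is installed, and it uses purrr's as_mapper() to show the ordinary function that a ~fun(.x) formula stands for.

```r
library(purrr)

# hypothetical toy data frame, standing in for av_survey_sample
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", NA, NA),
  c = c(TRUE, FALSE, TRUE)
)

# as_mapper() converts the formula shorthand into an ordinary function of .x
count_na <- as_mapper(~ sum(is.na(.x)))

# three equivalent ways of counting the missing values in each column
na_classic <- map_dbl(toy, function(x) sum(is.na(x)))
na_formula <- map_dbl(toy, ~ sum(is.na(.x)))
na_mapper  <- map_dbl(toy, count_na)

na_formula
```

All three calls should return the same named vector, so which one you write is purely a matter of taste.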
The _if() scoped variant: perform an operation on variables that satisfy a logical criterion
_if allows you to perform an operation on variables that satisfy some logical criteria such as is.numeric() or is.character().
select_if()
For instance, we can use select_if() to extract only the numeric columns of the data frame.
av_survey_sample %>% select_if(is.numeric)
id circumstanses_of_interaction
1 260381029 2
2 260822947 4
3 260907069 NA
4 261099035 3
5 260332379 NA
6 260355021 1
7 260350676 NA
8 261092370 NA
9 260332519 2
10 260351560 NA
We could also use more complex logical statements, for example by selecting columns that have at least one missing value.
av_survey_sample %>%
  # select columns with at least one NA
  # the expression evaluates to TRUE if there is one or more missing values
  select_if(~sum(is.na(.x)) > 0)
circumstanses_of_interaction
1 2
2 4
3 NA
4 3
5 NA
6 1
7 NA
8 NA
9 2
10 NA
rename_if()
We could rename columns that satisfy a logical expression using rename_if(). For instance, we can add a num_ prefix to all numeric column names.
av_survey_sample %>%
  # only rename numeric columns by adding a "num_" prefix
  rename_if(is.numeric, ~paste0("num_", .x))
num_id start_date end_date
1 260381029 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 260822947 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 260907069 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 261099035 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 260332379 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 260355021 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 260350676 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 261092370 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 260332519 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 260351560 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
1 Yes Yes
2 No Yes
3 Yes Yes
4 No Yes
5 No Yes
6 Yes Yes
7 No No
8 Not sure Not sure
9 Yes No
10 No No
num_circumstanses_of_interaction approve_av_testing_pgh
1 2 Approve
2 4 Disapprove
3 NA Approve
4 3 Somewhat Approve
5 NA Somewhat Approve
6 1 Approve
7 NA Approve
8 NA Somewhat Disapprove
9 2 Approve
10 NA Somewhat Disapprove
mutate_if()
We could similarly use mutate_if() to mutate columns that satisfy specified logical conditions. In the example below, we mutate all columns that have at least one missing value by replacing NA with "missing" (which also converts those columns to character).
av_survey_sample %>%
  # only mutate columns with at least one NA
  # replace each NA value with the character "missing"
  mutate_if(~sum(is.na(.x)) > 0,
            ~if_else(is.na(.x), "missing", as.character(.x)))
id start_date end_date
1 260381029 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 260822947 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 260907069 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 261099035 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 260332379 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 260355021 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 260350676 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 261092370 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 260332519 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 260351560 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
1 Yes Yes
2 No Yes
3 Yes Yes
4 No Yes
5 No Yes
6 Yes Yes
7 No No
8 Not sure Not sure
9 Yes No
10 No No
circumstanses_of_interaction approve_av_testing_pgh
1 2 Approve
2 4 Disapprove
3 missing Approve
4 3 Somewhat Approve
5 missing Somewhat Approve
6 1 Approve
7 missing Approve
8 missing Somewhat Disapprove
9 2 Approve
10 missing Somewhat Disapprove
summarise_if()
Similarly, summarise_if() will summarise columns that satisfy the specified logical conditions. Below, we summarise each character column by reporting the most common value (R's built-in mode() function reports an object's storage mode rather than its most common value, so we need to write our own).
# function to calculate the mode (most common) observation
# sort the counts in decreasing order so that the first entry is the most common
# (ties are broken in favour of the value that comes first alphabetically)
get_mode <- function(x) {
  names(sort(table(x), decreasing = TRUE))[1]
}
# summarise character columns
av_survey_sample %>%
  summarise_if(is.character, get_mode)
start_date end_date
1 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
1 No Yes
approve_av_testing_pgh
1 Approve
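Since the note at the top of this post says that scoped verbs have been superseded by across(), here is a rough sketch of how I would translate the _if() examples. It assumes dplyr 1.0.0 or later, and it runs on a made-up data frame (toy is a hypothetical stand-in for av_survey_sample). The where() helper plays the role of the logical condition, and rename_with() takes over from rename_if().

```r
library(dplyr)

# hypothetical toy data frame, standing in for av_survey_sample
toy <- data.frame(
  id = 1:4,
  score = c(2, NA, 3, NA),
  answer = c("Yes", "No", "Yes", "Yes")
)

# select_if(is.numeric) becomes select(where(is.numeric))
numeric_cols <- toy %>% select(where(is.numeric))

# rename_if(is.numeric, ~paste0("num_", .x)) becomes rename_with()
renamed <- toy %>% rename_with(~ paste0("num_", .x), where(is.numeric))

# mutate_if(<has an NA>, <replace NA>) becomes mutate(across(where(...), ...))
filled <- toy %>%
  mutate(across(where(~ any(is.na(.x))),
                ~ if_else(is.na(.x), "missing", as.character(.x))))

# summarise_if(is.character, n_distinct) becomes summarise(across(where(...), ...))
summarised <- toy %>% summarise(across(where(is.character), n_distinct))
```

The predicate inside where() can be a plain function like is.numeric or a ~fun(.x) formula, just as with the _if() verbs.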
The _at() scoped variant: perform an operation only on variables specified by name
_at allows you to perform an operation only on variables specified by name.
To specify which variables you want to operate on, you need to include the variable names inside the vars() function as the first argument. I think of vars() as being like c(): it provides multiple values (in this case variable names) as a single argument. For example, av_survey_sample %>% mutate_at(vars(start_date, end_date), mdy_hms) will only mutate the start_date and end_date variables, converting them to date-times using the mdy_hms() function from lubridate.
These variables can be specified explicitly by name within the vars() function, or by using select helpers within the vars() function.
Select helpers
Select helpers are functions that you can use within select() to help specify which variables you want to select. The options are:
starts_with(): select all variables that start with a specified character string
ends_with(): select all variables that end with a specified character string
contains(): select all variables that contain a specified character string
matches(): select variables whose names match a specified regular expression
one_of(): select variables that match any entries in the specified character vector
num_range(): select variables that are numbered (e.g. columns named V1, V2, V3 would be selected by select(num_range("V", 1:3)))
There are many ways that we could select the date variables, either by name or using the ends_with() and contains() select helpers:
# selecting the date columns by providing their names
av_survey_sample %>% select(start_date, end_date)
start_date end_date
1 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
# selecting the columns that end with "_date"
av_survey_sample %>% select(ends_with("_date"))
start_date end_date
1 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
# selecting the columns that contain "date"
av_survey_sample %>% select(contains("date"))
start_date end_date
1 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
If you ever find yourself wanting to provide variable names as characters, the matches() and one_of() select helpers can help you do that.
# provide matches() with a single character variable
variable <- "start_date"
av_survey_sample %>% select(matches(variable))
start_date
1 02/24/2017 3:14:19 AM PST
2 03/03/2017 7:08:33 AM PST
3 03/06/2017 5:57:07 PM PST
4 03/08/2017 3:05:41 PM PST
5 02/23/2017 9:09:11 AM PST
6 02/23/2017 10:11:52 PM PST
7 02/23/2017 6:10:42 PM PST
8 03/08/2017 11:22:43 AM PST
9 02/23/2017 9:16:14 AM PST
10 02/23/2017 6:40:54 PM PST
# provide one_of() with a vector of character variables
variables <- c("start_date", "end_date")
av_survey_sample %>% select(one_of(variables))
start_date end_date
1 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
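In more recent versions of dplyr (1.0.0 and later, which is an assumption about your setup), one_of() has been superseded by all_of() and any_of(). The sketch below uses a made-up data frame (toy and wanted are hypothetical names) to show the difference between the two, and also that matches() treats its argument as a regular expression.

```r
library(dplyr)

# hypothetical toy data frame, standing in for av_survey_sample
toy <- data.frame(
  id = 1:2,
  start_date = c("a", "b"),
  end_date = c("c", "d")
)

# a character vector that includes a name that is not in the data
wanted <- c("start_date", "end_date", "not_a_column")

# any_of() silently ignores names that do not exist
lenient <- toy %>% select(any_of(wanted))

# all_of() is strict: a missing name is an error, caught here for illustration
strict <- tryCatch(
  toy %>% select(all_of(wanted)),
  error = function(e) "error"
)

# matches() takes a regular expression rather than a fixed string
by_regex <- toy %>% select(matches("^(start|end)_"))
```

I would reach for all_of() when a missing column should stop the analysis, and any_of() when the columns are optional.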
Note that technically there does exist a select_at() function that requires a vars() input, but I can’t really think of a good use of this function…
# this is the same as av_survey_sample %>% select(start_date, end_date)
av_survey_sample %>%
  select_at(vars(start_date, end_date))
start_date end_date
1 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
The syntax of this select_at() example, though, can be useful for understanding how the vars() function can be used in the other _at() functions.
rename_at()
You can rename specified variables using the rename_at() function. For instance, we could replace all column names that contain the character string “av” with the same column name but an uppercase “AV” instead of the original lowercase “av”.
To do this, we use the select helper contains() within the vars() function.
# use a select helper to only apply to columns whose name contains "av"
# then rename these columns with "AV" in place of "av"
av_survey_sample %>%
  rename_at(vars(contains("av")),
            ~gsub("av", "AV", .x))
id start_date end_date
1 260381029 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 260822947 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 260907069 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 261099035 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 260332379 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 260355021 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 260350676 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 261092370 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 260332519 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 260351560 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
interacted_with_AV_as_pedestrian interacted_with_AV_as_cyclist
1 Yes Yes
2 No Yes
3 Yes Yes
4 No Yes
5 No Yes
6 Yes Yes
7 No No
8 Not sure Not sure
9 Yes No
10 No No
circumstanses_of_interaction approve_AV_testing_pgh
1 2 Approve
2 4 Disapprove
3 NA Approve
4 3 Somewhat Approve
5 NA Somewhat Approve
6 1 Approve
7 NA Approve
8 NA Somewhat Disapprove
9 2 Approve
10 NA Somewhat Disapprove
mutate_at()
To mutate only the date variables, normally we would apply the mdy_hms() transformation to each variable separately as follows:
# use the standard (unscoped) approach
av_survey_sample %>%
  mutate(start_date = mdy_hms(start_date),
         end_date = mdy_hms(end_date))
id start_date end_date
1 260381029 2017-02-24 03:14:19 2017-02-24 03:18:05
2 260822947 2017-03-03 07:08:33 2017-03-03 07:19:15
3 260907069 2017-03-06 17:57:07 2017-03-06 17:59:08
4 261099035 2017-03-08 15:05:41 2017-03-09 07:17:53
5 260332379 2017-02-23 09:09:11 2017-02-23 09:11:07
6 260355021 2017-02-23 22:11:52 2017-02-23 22:20:02
7 260350676 2017-02-23 18:10:42 2017-02-23 18:13:59
8 261092370 2017-03-08 11:22:43 2017-03-08 11:25:22
9 260332519 2017-02-23 09:16:14 2017-02-23 09:21:40
10 260351560 2017-02-23 18:40:54 2017-02-23 18:42:02
interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
1 Yes Yes
2 No Yes
3 Yes Yes
4 No Yes
5 No Yes
6 Yes Yes
7 No No
8 Not sure Not sure
9 Yes No
10 No No
circumstanses_of_interaction approve_av_testing_pgh
1 2 Approve
2 4 Disapprove
3 NA Approve
4 3 Somewhat Approve
5 NA Somewhat Approve
6 1 Approve
7 NA Approve
8 NA Somewhat Disapprove
9 2 Approve
10 NA Somewhat Disapprove
However, using mutate_at() and supplying these column names as arguments to the vars() function, we could specify the function only once.
# specifying specific variables to apply the same function to
av_survey_sample %>%
  mutate_at(vars(start_date, end_date), mdy_hms)
id start_date end_date
1 260381029 2017-02-24 03:14:19 2017-02-24 03:18:05
2 260822947 2017-03-03 07:08:33 2017-03-03 07:19:15
3 260907069 2017-03-06 17:57:07 2017-03-06 17:59:08
4 261099035 2017-03-08 15:05:41 2017-03-09 07:17:53
5 260332379 2017-02-23 09:09:11 2017-02-23 09:11:07
6 260355021 2017-02-23 22:11:52 2017-02-23 22:20:02
7 260350676 2017-02-23 18:10:42 2017-02-23 18:13:59
8 261092370 2017-03-08 11:22:43 2017-03-08 11:25:22
9 260332519 2017-02-23 09:16:14 2017-02-23 09:21:40
10 260351560 2017-02-23 18:40:54 2017-02-23 18:42:02
interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
1 Yes Yes
2 No Yes
3 Yes Yes
4 No Yes
5 No Yes
6 Yes Yes
7 No No
8 Not sure Not sure
9 Yes No
10 No No
circumstanses_of_interaction approve_av_testing_pgh
1 2 Approve
2 4 Disapprove
3 NA Approve
4 3 Somewhat Approve
5 NA Somewhat Approve
6 1 Approve
7 NA Approve
8 NA Somewhat Disapprove
9 2 Approve
10 NA Somewhat Disapprove
Moreover, we can use the select helpers to specify which columns we want to mutate, without having to write out the entire column names.
# use a "select helper" to specify the variables that end with "_date"
av_survey_sample %>%
  mutate_at(vars(ends_with("_date")), mdy_hms)
id start_date end_date
1 260381029 2017-02-24 03:14:19 2017-02-24 03:18:05
2 260822947 2017-03-03 07:08:33 2017-03-03 07:19:15
3 260907069 2017-03-06 17:57:07 2017-03-06 17:59:08
4 261099035 2017-03-08 15:05:41 2017-03-09 07:17:53
5 260332379 2017-02-23 09:09:11 2017-02-23 09:11:07
6 260355021 2017-02-23 22:11:52 2017-02-23 22:20:02
7 260350676 2017-02-23 18:10:42 2017-02-23 18:13:59
8 261092370 2017-03-08 11:22:43 2017-03-08 11:25:22
9 260332519 2017-02-23 09:16:14 2017-02-23 09:21:40
10 260351560 2017-02-23 18:40:54 2017-02-23 18:42:02
interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
1 Yes Yes
2 No Yes
3 Yes Yes
4 No Yes
5 No Yes
6 Yes Yes
7 No No
8 Not sure Not sure
9 Yes No
10 No No
circumstanses_of_interaction approve_av_testing_pgh
1 2 Approve
2 4 Disapprove
3 NA Approve
4 3 Somewhat Approve
5 NA Somewhat Approve
6 1 Approve
7 NA Approve
8 NA Somewhat Disapprove
9 2 Approve
10 NA Somewhat Disapprove
summarise_at()
The summarise_at() scoped verb behaves very similarly to the mutate_at() scoped verb, in that we can easily specify which variables we want to apply the same summary function to.
For instance, the following example summarises all variables that contain the word “interacted” by counting the number of “Yes” entries.
av_survey_sample %>%
  summarise_at(vars(contains("interacted")), ~sum(.x == "Yes"))
interacted_with_av_as_pedestrian interacted_with_av_as_cyclist
1 4 6
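For comparison, here is how I would sketch the _at() examples using across(), which the note at the top of this post points to. This assumes dplyr 1.0.0 or later plus lubridate, and it uses a made-up data frame (toy and date_cols are hypothetical names). The select helpers go directly inside across(), so vars() is no longer needed.

```r
library(dplyr)
library(lubridate)

# hypothetical toy data frame, standing in for av_survey_sample
toy <- data.frame(
  id = 1:2,
  start_date = c("02/24/2017 3:14:19 AM", "03/06/2017 5:57:07 PM"),
  end_date = c("02/24/2017 3:18:05 AM", "03/06/2017 5:59:08 PM"),
  interacted_as_pedestrian = c("Yes", "No"),
  interacted_as_cyclist = c("Yes", "Yes")
)

# mutate_at(vars(ends_with("_date")), mdy_hms) becomes mutate(across(...))
parsed <- toy %>% mutate(across(ends_with("_date"), mdy_hms))

# column names stored as strings can be passed with all_of()
date_cols <- c("start_date", "end_date")
parsed_by_name <- toy %>% mutate(across(all_of(date_cols), mdy_hms))

# summarise_at(vars(contains("interacted")), ...) becomes summarise(across(...))
yes_counts <- toy %>%
  summarise(across(contains("interacted"), ~ sum(.x == "Yes")))
```

Both mutate() calls should give the same result. The only difference is whether the columns are picked by a pattern or by an explicit character vector.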
The _all() scoped variant: perform an operation on all variables at once
_all allows you to perform an operation on all variables at once (e.g. calculating the number of missing values in every column).
rename_all()
On its own, select_all() is quite redundant (it would simply return all columns). Its friend rename_all(), however, can be very useful.
For instance, we could rename all variables by replacing underscores _ with dots . (although I would advise against this: underscores are way better than dots!).
av_survey_sample %>%
  rename_all(~gsub("_", ".", .x))
id start.date end.date
1 260381029 02/24/2017 3:14:19 AM PST 02/24/2017 3:18:05 AM PST
2 260822947 03/03/2017 7:08:33 AM PST 03/03/2017 7:19:15 AM PST
3 260907069 03/06/2017 5:57:07 PM PST 03/06/2017 5:59:08 PM PST
4 261099035 03/08/2017 3:05:41 PM PST 03/09/2017 7:17:53 AM PST
5 260332379 02/23/2017 9:09:11 AM PST 02/23/2017 9:11:07 AM PST
6 260355021 02/23/2017 10:11:52 PM PST 02/23/2017 10:20:02 PM PST
7 260350676 02/23/2017 6:10:42 PM PST 02/23/2017 6:13:59 PM PST
8 261092370 03/08/2017 11:22:43 AM PST 03/08/2017 11:25:22 AM PST
9 260332519 02/23/2017 9:16:14 AM PST 02/23/2017 9:21:40 AM PST
10 260351560 02/23/2017 6:40:54 PM PST 02/23/2017 6:42:02 PM PST
interacted.with.av.as.pedestrian interacted.with.av.as.cyclist
1 Yes Yes
2 No Yes
3 Yes Yes
4 No Yes
5 No Yes
6 Yes Yes
7 No No
8 Not sure Not sure
9 Yes No
10 No No
circumstanses.of.interaction approve.av.testing.pgh
1 2 Approve
2 4 Disapprove
3 NA Approve
4 3 Somewhat Approve
5 NA Somewhat Approve
6 1 Approve
7 NA Approve
8 NA Somewhat Disapprove
9 2 Approve
10 NA Somewhat Disapprove
mutate_all()
We could apply the same mutate function to every column at once using mutate_all(). For instance, the code below converts every column to a numeric (although this results in mostly missing values for the character variables).
av_survey_sample %>%
  mutate_all(as.numeric)
id start_date end_date interacted_with_av_as_pedestrian
1 260381029 NA NA NA
2 260822947 NA NA NA
3 260907069 NA NA NA
4 261099035 NA NA NA
5 260332379 NA NA NA
6 260355021 NA NA NA
7 260350676 NA NA NA
8 261092370 NA NA NA
9 260332519 NA NA NA
10 260351560 NA NA NA
interacted_with_av_as_cyclist circumstanses_of_interaction
1 NA 2
2 NA 4
3 NA NA
4 NA 3
5 NA NA
6 NA 1
7 NA NA
8 NA NA
9 NA 2
10 NA NA
approve_av_testing_pgh
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
summarise_all()
We could also apply the same summary function to every column at once using summarise_all(). For instance, the example below calculates the number of distinct entries in each column.
av_survey_sample %>%
  summarise_all(n_distinct)
id start_date end_date interacted_with_av_as_pedestrian
1 10 10 10 3
interacted_with_av_as_cyclist circumstanses_of_interaction
1 3 5
approve_av_testing_pgh
1 4
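The _all() verbs also have across() counterparts. The sketch below assumes dplyr 1.0.0 or later and uses a made-up data frame (toy is a hypothetical stand-in for av_survey_sample). Here everything() selects every column, and rename_with() applies to all columns by default.

```r
library(dplyr)

# hypothetical toy data frame, standing in for av_survey_sample
toy <- data.frame(
  resp_id = 1:4,
  av_score = c(2, NA, 3, NA),
  av_answer = c("Yes", "No", "Yes", "Yes")
)

# rename_all(~gsub("_", ".", .x)) becomes rename_with(), which defaults to all columns
dotted <- toy %>% rename_with(~ gsub("_", ".", .x))

# mutate_all(as.character) becomes mutate(across(everything(), ...))
as_text <- toy %>% mutate(across(everything(), as.character))

# summarise_all(n_distinct) becomes summarise(across(everything(), ...))
n_unique <- toy %>% summarise(across(everything(), n_distinct))
```

Note that n_distinct() counts NA as its own value, which is why a column with missing entries can have one more distinct value than you might expect.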
Conclusion
Hopefully this summary is useful to you in your data manipulation adventures!