Having to apply the same pre-processing steps to training, testing and validation data to do some machine learning can be surprisingly frustrating. But thanks to the recipes R package, it’s now super-duper easy. Instead of having five functions and maybe hundreds of lines of code, you can preprocess multiple datasets using a single ‘recipe’ in fewer than 10 lines of code.
R
workflow
machine learning
Author
Rebecca Barter
Published
June 6, 2019
Pre-processing data in R used to be the bane of my existence. For something that should be fairly straightforward, it often really wasn’t. Often my frustrations stemmed from simple things such as factor variables having different levels in the training data and test data, or a variable having missing values in the test data but not in the training data. I’d write a function that would pre-process the training data, and when I’d try to apply it to the test data, R would cry and yell and just be generally unpleasant.
Thankfully most of the pain of pre-processing is now in the past thanks to the recipes R package, which is a part of the new "tidymodels" package ecosystem (which, I guess, is supposed to be equivalent to the data-focused "tidyverse" package ecosystem that includes dplyr, tidyr, and other super awesome packages like that). Recipes was developed by Max Kuhn and Hadley Wickham.
So let’s get baking!
The fundamentals of pre-processing your data using recipes
Creating a recipe has four steps:
Get the ingredients (recipe()): specify the response variable and predictor variables
Write the recipe (step_zzz()): define the pre-processing steps, such as imputation, creating dummy variables, scaling, and more
Prepare the recipe (prep()): provide a dataset to base each step on (e.g. if one of the steps is to remove variables that only have one unique value, you need to give it a dataset so it can decide which variables meet this criterion, ensuring that it does the same thing to every dataset you apply it to)
Bake the recipe (bake()): apply the pre-processing steps to your datasets
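Put together, the four steps chain like this. This is a schematic sketch using the built-in iris data as a stand-in (the object names and steps here are mine, just to show the shape of the workflow, not the cupcake analysis):

```r
library(recipes)

# 1. get the ingredients: declare the outcome and predictors
iris_train <- iris[1:100, ]
iris_test  <- iris[101:150, ]

prepped <- recipe(Species ~ ., data = iris_train) %>%
  # 2. write the recipe: chain pre-processing steps
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  # 3. prepare the recipe: estimate the means and SDs from the training data
  prep(training = iris_train)

# 4. bake the recipe: apply the same trained steps to every dataset
baked_train <- bake(prepped, iris_train)
baked_test  <- bake(prepped, iris_test)
```

Because the means and standard deviations are estimated once during prep(), the test data is scaled using the *training* data's statistics, which is exactly the behavior you want.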
In this blog post I’ll walk you through these four steps, touching on the wide range of things that recipes can do, while hopefully convincing you that recipes makes life really easy and that you should use it next time you need to do some pre-processing.
A simple example: cupcakes or muffins?
To keep things in the theme, I’m going to use a dataset from Alice Zhao’s git repo that I found when I typed “cupcake dataset” into Google. Our goal will be to classify recipes as either cupcakes or muffins based on the quantities used for each of the ingredients. So perhaps we will learn two things today: (1) how to use the recipes package, and (2) the difference between cupcakes and muffins.
# set up so that all variables of tibbles are printed
options(dplyr.width = Inf)

# load useful libraries
library(tidyverse)
library(recipes)  # could also load the tidymodels package

# load in the data
muffin_cupcake_data_orig <- read_csv("https://raw.githubusercontent.com/adashofdata/muffin-cupcake/master/recipes_muffins_cupcakes.csv")

# look at data
muffin_cupcake_data_orig
Since the space in the column name Baking Powder is going to really annoy me, I’m going to do a quick clean where I convert all of the column names to lower case and replace the space with an underscore.
As a side note, I’ve started naming all of my temporary function arguments (lambda functions?) with a period preceding the name. I find it makes it a lot easier to read. As another side note, if you’ve never seen the rename_all() function before, check out my blog post on scoped verbs!
muffin_cupcake_data <- muffin_cupcake_data_orig %>%
  # rename all columns
  rename_all(function(.name) {
    .name %>%
      # replace all names with the lowercase versions
      tolower %>%
      # replace all spaces with underscores
      str_replace(" ", "_")
  })

# check that this did what I wanted
muffin_cupcake_data
Since recipes does a lot of useful stuff for categorical variables as well as with missing values, I’m going to modify the data a little bit so that it’s a bit more interesting (for educational purposes only - don’t ever actually modify your data so it’s more interesting, in science that’s called “fraud”, and fraud is bad).
# add an additional ingredients column that is categorical
muffin_cupcake_data <- muffin_cupcake_data %>%
  mutate(additional_ingredients = c("fruit", "fruit", "none", "nuts", "fruit",
                                    "fruit", "nuts", "none", "none", "nuts",
                                    "icing", "icing", "fruit", "none", "fruit",
                                    "icing", "none", "fruit", "icing", "icing"))

# add some random missing values here and there just for fun
set.seed(26738)
muffin_cupcake_data <- muffin_cupcake_data %>%
  # only add missing values to numeric columns
  mutate_if(is.numeric, function(x) {
    # randomly decide if 0, 1, 2, or 3 values will be missing from each column
    n_missing <- sample(0:3, 1)
    # replace n_missing randomly selected values from each column with NA
    x[sample(1:20, n_missing)] <- NA
    return(x)
  })
muffin_cupcake_data
# A tibble: 5 × 10
type flour milk sugar butter egg baking_powder vanilla salt
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Muffin 55 28 3 7 5 2 0 0
2 Muffin 47 24 12 6 9 1 NA 0
3 Muffin 54 27 7 5 5 2 0 0
4 Cupcake NA 17 20 20 5 2 1 0
5 Cupcake 38 15 31 8 6 1 1 0
additional_ingredients
<chr>
1 fruit
2 fruit
3 nuts
4 fruit
5 fruit
Writing and applying the recipe
Now that we’ve set up our data, we’re ready to write some recipes and do some baking! The first thing we need to do is get the ingredients. We can use formula notation within the recipe() function to do this: the thing we’re trying to predict is the variable to the left of the ~, and the predictor variables are the things to the right of it (since I’m including all of my variables, I could have written type ~ .).
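Note that the recipe below is defined on a training set, muffin_cupcake_train, whose creation isn’t shown above. A simple random split along these lines would produce it (the 15/5 split, the seed, and the test-set name are assumptions for illustration):

```r
# split the 20 recipes into training and test sets
# (the 15/5 proportion here is an assumption, not from the original post)
set.seed(26738)
train_index <- sample(1:nrow(muffin_cupcake_data), size = 15)
muffin_cupcake_train <- muffin_cupcake_data[train_index, ]
muffin_cupcake_test  <- muffin_cupcake_data[-train_index, ]
```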
# define the recipe (it looks a lot like applying the lm function)
model_recipe <- recipe(type ~ flour + milk + sugar + butter + egg +
                         baking_powder + vanilla + salt + additional_ingredients,
                       data = muffin_cupcake_train)
If we print a summary of the model_recipe object, it just shows us the variables we’ve specified, their type, and whether they’re a predictor or an outcome.
summary(model_recipe)
# A tibble: 10 × 4
variable type role source
<chr> <list> <chr> <chr>
1 flour <chr [2]> predictor original
2 milk <chr [2]> predictor original
3 sugar <chr [2]> predictor original
4 butter <chr [2]> predictor original
5 egg <chr [2]> predictor original
6 baking_powder <chr [2]> predictor original
7 vanilla <chr [2]> predictor original
8 salt <chr [2]> predictor original
9 additional_ingredients <chr [3]> predictor original
10 type <chr [3]> outcome original
Writing the recipe steps
So now that we have our ingredients, we are ready to write the recipe (i.e. describe our pre-processing steps). We write the recipe one step at a time. We have many steps to choose from, including:
step_dummy(): creating dummy variables from categorical variables.
step_impute_zzz(): where instead of “zzz” it is the name of a method, such as step_impute_knn(), step_impute_mean(), step_impute_mode(). I find that the fancier imputation methods are reeeeally slow for decently large datasets, so I would probably do this step outside of the recipes package unless you just want to do a quick mean or mode impute (which, to be honest, I often do).
step_scale(): normalize to have a standard deviation of 1.
step_center(): center to have a mean of 0.
step_range(): normalize numeric data to be within a pre-defined range of values.
step_pca(): create principal component variables from your data.
step_nzv(): remove variables that have (or almost have) the same value for every data point.
You can also create your own step (which I’ve never felt the need to do, but the details of which can be found here https://tidymodels.github.io/recipes/articles/Custom_Steps.html).
In each step, you need to specify which variables you want to apply it to. There are many ways to do this:
Specifying the variable name(s) as the first argument
Standard dplyr selectors:
everything() applies the step to all columns,
contains() allows you to specify column names that contain a specific string,
starts_with() allows you to specify column names that start with a specific string,
etc.
Functions that specify the role of the variables:
all_predictors() applies the step to the predictor variables only
all_outcomes() applies the step to the outcome variable(s) only
Functions that specify the type of the variables:
all_nominal() applies the step to all variables that are nominal (categorical)
all_numeric() applies the step to all variables that are numeric
To ignore a specific column, you can specify its name with a negative sign as a variable (just like you would in select())
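To see a few of these selection styles side by side, here’s a toy recipe on the built-in mtcars data (not part of the cupcake pipeline; the particular steps are arbitrary and just for illustration):

```r
library(recipes)

# the same kinds of steps, selecting variables three different ways
selector_demo <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric()) %>%            # by type: every numeric column
  step_scale(all_predictors()) %>%          # by role: predictor variables only
  step_range(disp, hp, min = 0, max = 1)    # by name: just disp and hp
```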
# define the steps we want to apply
model_recipe_steps <- model_recipe %>%
  # mean impute numeric variables
  step_impute_mean(all_numeric()) %>%
  # convert the additional ingredients variable to dummy variables
  step_dummy(additional_ingredients) %>%
  # rescale all numeric variables except for vanilla, salt and baking powder
  # to lie between 0 and 1
  step_range(all_numeric(), min = 0, max = 1, -vanilla, -salt, -baking_powder) %>%
  # remove predictor variables that are almost the same for every entry
  step_nzv(all_predictors())
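Defining the steps doesn’t apply them yet; the recipe first has to be prepped on the training data (step 3 above), which estimates everything each step needs. The trained-step printout below comes from printing the prepped recipe. A sketch, assuming the object names used so far:

```r
# prep the recipe: estimate the imputation means, dummy variable levels,
# ranges, and near-zero-variance filters from the training data
prepped_recipe <- prep(model_recipe_steps, training = muffin_cupcake_train)
prepped_recipe
```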
• Range scaling to [0,1] for: flour, milk, sugar, butter, egg, ... | Trained
• Sparse, unbalanced variable filter removed: salt | Trained
Bake the recipe
Next, you apply your recipe to your datasets.
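Concretely, baking looks like this, assuming the prepped recipe and split names used above (the *_preprocessed names are mine):

```r
# apply the same trained pre-processing (means, ranges, and dummy encodings
# all learned from the training data) to both the training and test sets
muffin_cupcake_train_preprocessed <- bake(prepped_recipe, muffin_cupcake_train)
muffin_cupcake_test_preprocessed  <- bake(prepped_recipe, muffin_cupcake_test)
```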
So what did our recipe do?
step_impute_mean(all_numeric()) imputed all of the missing values with the mean value for that variable
step_dummy(additional_ingredients) converted the additional_ingredients into three dummy variables corresponding to three of the four levels of the original variable
step_range(all_numeric(), min = 0, max = 1, -vanilla, -salt, -baking_powder) converted the range of all of the numeric variables except for those specified to lie between 0 and 1
step_nzv(all_predictors()) removed the salt variable since it was 0 across all rows (except where it was missing)