Having to apply the same pre-processing steps to training, testing and validation data to do some machine learning can be surprisingly frustrating. But thanks to the recipes R package, it’s now super-duper easy. Instead of having five functions and maybe hundreds of lines of code, you can preprocess multiple datasets using a single ‘recipe’ in fewer than 10 lines of code.
R
workflow
machine learning
Author
Rebecca Barter
Published
June 6, 2019
Pre-processing data in R used to be the bane of my existence. For something that should be fairly straightforward, it often really wasn’t. Often my frustrations stemmed from simple things such as factor variables having different levels in the training data and test data, or a variable having missing values in the test data but not in the training data. I’d write a function that would pre-process the training data, and when I’d try to apply it to the test data, R would cry and yell and just be generally unpleasant.
Thankfully most of the pain of pre-processing is now in the past thanks to the recipes R package, which is a part of the new "tidymodels" package ecosystem (which, I guess, is supposed to be equivalent to the data-focused "tidyverse" package ecosystem that includes dplyr, tidyr, and other super awesome packages like that). Recipes was developed by Max Kuhn and Hadley Wickham.
So let’s get baking!
The fundamentals of pre-processing your data using recipes
Creating a recipe has four steps:
Get the ingredients (recipe()): specify the response variable and predictor variables
Write the recipe (step_zzz()): define the pre-processing steps, such as imputation, creating dummy variables, scaling, and more
Prepare the recipe (prep()): provide a dataset to base each step on (e.g. if one of the steps is to remove variables that only have one unique value, you need to give it a dataset so it can decide which variables meet this criterion, ensuring that it does the same thing to every dataset you apply it to)
Bake the recipe (bake()): apply the pre-processing steps to your datasets
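Put together, the four steps chain like this. This is a schematic sketch using the built-in iris data as a stand-in (the object names and steps here are mine, just to show the shape of the workflow, not the cupcake analysis):

```r
library(recipes)

# 1. get the ingredients: declare the outcome and predictors
iris_train <- iris[1:100, ]
iris_test  <- iris[101:150, ]

prepped <- recipe(Species ~ ., data = iris_train) %>%
  # 2. write the recipe: chain pre-processing steps
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  # 3. prepare the recipe: estimate the means and SDs from the training data
  prep(training = iris_train)

# 4. bake the recipe: apply the same trained steps to every dataset
baked_train <- bake(prepped, iris_train)
baked_test  <- bake(prepped, iris_test)
```

Because the means and standard deviations are estimated once during prep(), the test data is scaled using the *training* data's statistics, which is exactly the behavior you want.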
In this blog post I’ll walk you through these four steps, touching on the wide range of things that recipes can do, while hopefully convincing you that recipes makes life really easy and that you should use it next time you need to do some pre-processing.
A simple example: cupcakes or muffins?
To keep things in the theme, I’m going to use a dataset from Alice Zhao’s git repo that I found when I typed “cupcake dataset” into Google. Our goal will be to classify recipes as either cupcakes or muffins based on the quantities used for each of the ingredients. So perhaps we will learn two things today: (1) how to use the recipes package, and (2) the difference between cupcakes and muffins.
# set up so that all variables of tibbles are printed
options(dplyr.width = Inf)

# load useful libraries
library(tidyverse)
library(recipes)  # could also load the tidymodels package

# load in the data
muffin_cupcake_data_orig <- read_csv("https://raw.githubusercontent.com/adashofdata/muffin-cupcake/master/recipes_muffins_cupcakes.csv")

# look at data
muffin_cupcake_data_orig
Since the space in the column name Baking Powder is going to really annoy me, I’m going to do a quick clean where I convert all of the column names to lower case and replace the space with an underscore.
As a side note, I’ve started naming all of my temporary function arguments (lambda functions?) with a period preceding the name. I find it makes it a lot easier to read. As another side note, if you’ve never seen the rename_all() function before, check out my blog post on scoped verbs!
muffin_cupcake_data <- muffin_cupcake_data_orig %>%
  # rename all columns
  rename_all(function(.name) {
    .name %>%
      # replace all names with the lowercase versions
      tolower %>%
      # replace all spaces with underscores
      str_replace(" ", "_")
  })

# check that this did what I wanted
muffin_cupcake_data
Since recipes does a lot of useful stuff for categorical variables as well as with missing values, I’m going to modify the data a little bit so that it’s a bit more interesting (for educational purposes only - don’t ever actually modify your data so it’s more interesting, in science that’s called “fraud”, and fraud is bad).
# add an additional ingredients column that is categorical
muffin_cupcake_data <- muffin_cupcake_data %>%
  mutate(additional_ingredients = c("fruit", "fruit", "none", "nuts", "fruit",
                                    "fruit", "nuts", "none", "none", "nuts",
                                    "icing", "icing", "fruit", "none", "fruit",
                                    "icing", "none", "fruit", "icing", "icing"))

# add some random missing values here and there just for fun
set.seed(26738)
muffin_cupcake_data <- muffin_cupcake_data %>%
  # only add missing values to numeric columns
  mutate_if(is.numeric, function(x) {
    # randomly decide if 0, 1, 2, or 3 values will be missing from each column
    n_missing <- sample(0:3, 1)
    # replace n_missing randomly selected values from each column with NA
    x[sample(1:20, n_missing)] <- NA
    return(x)
  })
muffin_cupcake_data
# A tibble: 5 × 10
type flour milk sugar butter egg baking_powder vanilla salt
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Muffin 55 28 3 7 5 2 0 0
2 Muffin 47 24 12 6 9 1 NA 0
3 Muffin 54 27 7 5 5 2 0 0
4 Cupcake NA 17 20 20 5 2 1 0
5 Cupcake 38 15 31 8 6 1 1 0
additional_ingredients
<chr>
1 fruit
2 fruit
3 nuts
4 fruit
5 fruit
Writing and applying the recipe
Now that we’ve set up our data, we’re ready to write some recipes and do some baking! The first thing we need to do is get the ingredients. We can use formula notation within the recipe() function to do this: the thing we’re trying to predict is the variable to the left of the ~, and the predictor variables are the things to the right of it (since I’m including all of my variables, I could have written type ~ .).
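Note that the recipe below is defined on a training set, muffin_cupcake_train, whose creation isn’t shown above. A simple random split along these lines would produce it (the 15/5 split, the seed, and the test-set name are assumptions for illustration):

```r
# split the 20 recipes into training and test sets
# (the 15/5 proportion here is an assumption, not from the original post)
set.seed(26738)
train_index <- sample(1:nrow(muffin_cupcake_data), size = 15)
muffin_cupcake_train <- muffin_cupcake_data[train_index, ]
muffin_cupcake_test  <- muffin_cupcake_data[-train_index, ]
```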
# define the recipe (it looks a lot like applying the lm function)
model_recipe <- recipe(type ~ flour + milk + sugar + butter + egg +
                         baking_powder + vanilla + salt + additional_ingredients,
                       data = muffin_cupcake_train)
If we print a summary of the model_recipe object, it just shows us the variables we’ve specified, their type, and whether they’re a predictor or an outcome.
summary(model_recipe)
# A tibble: 10 × 4
variable type role source
<chr> <list> <chr> <chr>
1 flour <chr [2]> predictor original
2 milk <chr [2]> predictor original
3 sugar <chr [2]> predictor original
4 butter <chr [2]> predictor original
5 egg <chr [2]> predictor original
6 baking_powder <chr [2]> predictor original
7 vanilla <chr [2]> predictor original
8 salt <chr [2]> predictor original
9 additional_ingredients <chr [3]> predictor original
10 type <chr [3]> outcome original
Writing the recipe steps
So now that we have our ingredients, we are ready to write the recipe (i.e. describe our pre-processing steps). We write the recipe one step at a time. We have many steps to choose from, including:
step_dummy(): creating dummy variables from categorical variables.
step_impute_zzz(): where instead of “zzz” it is the name of a method, such as step_impute_knn(), step_impute_mean(), step_impute_mode(). I find that the fancier imputation methods are reeeeally slow for decently large datasets, so I would probably do this step outside of the recipes package unless you just want to do a quick mean or mode impute (which, to be honest, I often do).
step_scale(): normalize to have a standard deviation of 1.
step_center(): center to have a mean of 0.
step_range(): normalize numeric data to be within a pre-defined range of values.
step_pca(): create principal component variables from your data.
step_nzv(): remove variables that have (or almost have) the same value for every data point.
You can also create your own step (which I’ve never felt the need to do, but the details of which can be found here https://tidymodels.github.io/recipes/articles/Custom_Steps.html).
In each step, you need to specify which variables you want to apply it to. There are many ways to do this:
Specifying the variable name(s) as the first argument
Standard dplyr selectors:
everything() applies the step to all columns,
contains() allows you to specify column names that contain a specific string,
starts_with() allows you to specify column names that start with a specific string,
etc.
Functions that specify the role of the variables:
all_predictors() applies the step to the predictor variables only
all_outcomes() applies the step to the outcome variable(s) only
Functions that specify the type of the variables:
all_nominal() applies the step to all variables that are nominal (categorical)
all_numeric() applies the step to all variables that are numeric
To ignore a specific column, you can specify its name with a negative sign as a variable (just like you would in select())
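To see a few of these selection styles side by side, here’s a toy recipe on the built-in mtcars data (not part of the cupcake pipeline; the particular steps are arbitrary and just for illustration):

```r
library(recipes)

# the same kinds of steps, selecting variables three different ways
selector_demo <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric()) %>%            # by type: every numeric column
  step_scale(all_predictors()) %>%          # by role: predictor variables only
  step_range(disp, hp, min = 0, max = 1)    # by name: just disp and hp
```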
# define the steps we want to apply
model_recipe_steps <- model_recipe %>%
  # mean impute numeric variables
  step_impute_mean(all_numeric()) %>%
  # convert the additional ingredients variable to dummy variables
  step_dummy(additional_ingredients) %>%
  # rescale all numeric variables except for vanilla, salt and baking powder
  # to lie between 0 and 1
  step_range(all_numeric(), min = 0, max = 1, -vanilla, -salt, -baking_powder) %>%
  # remove predictor variables that are almost the same for every entry
  step_nzv(all_predictors())
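Defining the steps doesn’t apply them yet; the recipe first has to be prepped on the training data (step 3 above), which estimates everything each step needs. The trained-step printout below comes from printing the prepped recipe. A sketch, assuming the object names used so far:

```r
# prep the recipe: estimate the imputation means, dummy variable levels,
# ranges, and near-zero-variance filters from the training data
prepped_recipe <- prep(model_recipe_steps, training = muffin_cupcake_train)
prepped_recipe
```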
• Range scaling to [0,1] for: flour, milk, sugar, butter, egg, ... | Trained
• Sparse, unbalanced variable filter removed: salt | Trained
Bake the recipe
Next, you apply your recipe to your datasets.
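Concretely, baking looks like this, assuming the prepped recipe and split names used above (the *_preprocessed names are mine):

```r
# apply the same trained pre-processing (means, ranges, and dummy encodings
# all learned from the training data) to both the training and test sets
muffin_cupcake_train_preprocessed <- bake(prepped_recipe, muffin_cupcake_train)
muffin_cupcake_test_preprocessed  <- bake(prepped_recipe, muffin_cupcake_test)
```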
So what did our recipe do?
step_impute_mean(all_numeric()) imputed all of the missing values with the mean value for that variable
step_dummy(additional_ingredients) converted the additional_ingredients into three dummy variables corresponding to three of the four levels of the original variable
step_range(all_numeric(), min = 0, max = 1, -vanilla, -salt, -baking_powder) converted the range of all of the numeric variables except for those specified to lie between 0 and 1
step_nzv(all_predictors()) removed the salt variable since it was 0 across all rows (except where it was missing)