I do not understand how to apply step_pca to preprocess my data

I am trying to understand how to apply step_pca to preprocess my data. Suppose I want to build a K-Nearest Neighbor classifier for the iris dataset. For the sake of simplicity, I will not split the original iris dataset into train and test sets. I will assume iris is the train dataset and that I have some other observations as my test dataset.

I want to apply three transformations to the predictors in my train dataset:

1. Center all predictor variables
2. Scale all predictor variables
3. PCA transform all predictor variables and keep as many components as are needed to explain at least 80% of the variance in my data

So this is what I have:

library(tidymodels)

iris_rec <- 
  recipe(Species ~ ., 
         data = iris) %>%
  # center/scale
  step_center(-Species) %>%
  step_scale(-Species) %>%
  # pca
  step_pca(-Species, threshold = 0.8) %>%
  # apply data transformation
  prep()

iris_rec
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          4
#> 
#> Training data contained 150 data points and no missing data.
#> 
#> Operations:
#> 
#> Centering for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> Scaling for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> PCA extraction with Sepal.Length, Sepal.Width, Petal.Length, Petal.W... [trained]

Created on 2022-10-13 with reprex v2.0.2

Ok, so far, so good. All the transformations are applied to my dataset. When I prepare my train dataset using juice, everything goes as expected:

# transformed training set

iris_train_t <- juice(iris_rec)

iris_train_t
#> # A tibble: 150 × 3
#>    Species   PC1     PC2
#>    <fct>   <dbl>   <dbl>
#>  1 setosa  -2.26 -0.478 
#>  2 setosa  -2.07  0.672 
#>  3 setosa  -2.36  0.341 
#>  4 setosa  -2.29  0.595 
#>  5 setosa  -2.38 -0.645 
#>  6 setosa  -2.07 -1.48  
#>  7 setosa  -2.44 -0.0475
#>  8 setosa  -2.23 -0.222 
#>  9 setosa  -2.33  1.11  
#> 10 setosa  -2.18  0.467 
#> # … with 140 more rows


So, I have two predictors based on PCA (PC1 and PC2) and my response variable. However, when I proceed with my modelling, I get an error: all the models I test fail, as you can see below:

# cross validation

set.seed(2022)

iris_train_cv <- vfold_cv(iris_train_t, v = 5)

# tuning

iris_knn_tune <-
  nearest_neighbor(
    neighbors = tune(),
    weight_func = tune(),
    dist_power = tune()
  ) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# grid search

iris_knn_grid <- 
  grid_regular(neighbors(range = c(3, 9)),
               weight_func(),
               dist_power(),
               levels = c(22, 2, 2))

# workflow creation

iris_wflow <- 
  workflow() %>% 
  add_recipe(iris_rec) %>%
  add_model(iris_knn_tune)

# model assessment

iris_knn_fit_tune <- 
  iris_wflow %>% 
  tune_grid(
    resamples = iris_train_cv,
    grid = iris_knn_grid
  )
#> x Fold1: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> x Fold2: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> x Fold3: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> x Fold4: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> x Fold5: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.

# cv results

collect_metrics(iris_knn_fit_tune)
#> Error in `estimate_tune_results()`:
#> ! All of the models failed. See the .notes column.

#> Backtrace:
#>     ▆
#>  1. ├─tune::collect_metrics(iris_knn_fit_tune)
#>  2. └─tune:::collect_metrics.tune_results(iris_knn_fit_tune)
#>  3.   └─tune::estimate_tune_results(x)
#>  4.     └─rlang::abort("All of the models failed. See the .notes column.")


I suspect my problem is with the formula I defined in my iris_rec recipe. The formula there is

Species ~ ., data = iris

which means

Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris

However, when I run my models, the predictor variables are PC1 and PC2, so I guess the formula should be

Species ~ ., data = iris_train_t

Species ~ PC1 + PC2, data = iris_train_t

How can I tell my model that my variables and dataset have changed? All the other step_* functions I have used in tidymodels have worked, but I am struggling specifically with step_pca.

Answer:

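Two things are causing confusion here.

First, you don't need to prep() or juice() a recipe before using it in a model or workflow. The tuning and resampling functions do that within each resample. This is also where your error comes from: iris_train_cv was built from the juiced data, which contains only Species, PC1 and PC2, so the recipe cannot find the original columns (Sepal.Length and so on) that it was defined on.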

You can prep() and juice() the recipe if you want the processed training set in order to troubleshoot, visualize, or otherwise explore it, but modelling does not require it.
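For that kind of exploration, a minimal sketch could look like the following. It rebuilds the recipe without a trailing prep() so that it stands alone, and it uses bake(new_data = NULL), which is the current equivalent of juice(). The prcomp() call is not part of the recipe; it is only an independent base-R cross-check, added here as an illustration, of why threshold = 0.8 ends up keeping two components.

```r
library(tidymodels)

# Untrained recipe: note there is no prep() at the end of the pipeline
iris_rec <-
  recipe(Species ~ ., data = iris) %>%
  step_center(-Species) %>%
  step_scale(-Species) %>%
  step_pca(-Species, threshold = 0.8)

# prep() estimates the means, standard deviations and PCA loadings on iris;
# bake(new_data = NULL) returns the processed training set
iris_prepped <- prep(iris_rec)
iris_explore <- bake(iris_prepped, new_data = NULL)
names(iris_explore)

# Cross-check with base R: cumulative proportion of variance per component
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
round(cum_var, 3)

# Smallest number of components whose cumulative variance reaches 80%
n_comp <- which(cum_var >= 0.8)[1]
n_comp
```

The first component alone explains less than 80% of the variance, so the second one is needed to cross the threshold, which matches the PC1 and PC2 columns in the juiced data.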

Second, the recipe is basically a replacement for the formula. It knows which columns are predictors and which is the outcome, so there is rarely a need for an additional formula on top of it.

(The only exception is models that require special formulas.)
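To see what the recipe knows, one option is to inspect it with summary(), as in the sketch below. The object names roles_before and roles_after are only illustrative. The point is that the roles assigned by the formula carry over to the derived PC columns once the recipe is prepped, so no second formula mentioning PC1 and PC2 is needed.

```r
library(tidymodels)

iris_rec <-
  recipe(Species ~ ., data = iris) %>%
  step_center(-Species) %>%
  step_scale(-Species) %>%
  step_pca(-Species, threshold = 0.8)

# Roles as declared by the formula: four predictors and one outcome
roles_before <- summary(iris_rec)
roles_before

# Roles after training the recipe: the original predictors have been
# replaced by the principal components, which inherit the predictor role
roles_after <- summary(prep(iris_rec))
roles_after
```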

Here is updated code for you:

library(tidymodels)

iris_rec <- 
  recipe(Species ~ ., 
         data = iris) %>%
  # center/scale
  step_center(-Species) %>%
  step_scale(-Species) %>%
  # pca
  step_pca(-Species, threshold = 0.8)

set.seed(2022)

iris_train_cv <- vfold_cv(iris, v = 5)  # changed: resample the raw data, not iris_train_t

# tuning

iris_knn_tune <-
  nearest_neighbor(
    neighbors = tune(),
    weight_func = tune(),
    dist_power = tune()
  ) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# grid search

iris_knn_grid <- 
  grid_regular(neighbors(range = c(3, 9)),
               weight_func(),
               dist_power(),
               levels = c(22, 2, 2))

# workflow creation

iris_wflow <- 
  workflow() %>% 
  add_recipe(iris_rec) %>%
  add_model(iris_knn_tune)

# model assessment

iris_knn_fit_tune <- 
  iris_wflow %>% 
  tune_grid(
    resamples = iris_train_cv,
    grid = iris_knn_grid
  )

show_best(iris_knn_fit_tune, metric = "roc_auc")
#> # A tibble: 5 × 9
#>   neighbors weight_func dist_power .metric .estima…¹  mean     n std_err .config
#>       <int> <chr>            <dbl> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
#> 1         9 rectangular          1 roc_auc hand_till 0.976     5 0.00580 Prepro…
#> 2         7 triangular           1 roc_auc hand_till 0.975     5 0.00688 Prepro…
#> 3         9 triangular           2 roc_auc hand_till 0.975     5 0.00571 Prepro…
#> 4         8 triangular           1 roc_auc hand_till 0.975     5 0.00655 Prepro…
#> 5         9 triangular           1 roc_auc hand_till 0.975     5 0.00655 Prepro…
#> # … with abbreviated variable name ¹.estimator

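Since the question mentions other observations that serve as a test set, here is a hedged sketch of how the tuned workflow might be finalized and used for prediction. To keep it short and self-contained it tunes only neighbors over a small hand-written grid, which is a simplification of the grid above, and the new_obs values are made up for illustration. The new observations are supplied as raw measurements; the fitted workflow applies the centering, scaling and PCA learned from iris before predicting.

```r
library(tidymodels)

iris_rec <-
  recipe(Species ~ ., data = iris) %>%
  step_center(-Species) %>%
  step_scale(-Species) %>%
  step_pca(-Species, threshold = 0.8)

# Simplified model spec: only neighbors is tuned in this sketch
iris_knn_tune <-
  nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

iris_wflow <-
  workflow() %>%
  add_recipe(iris_rec) %>%
  add_model(iris_knn_tune)

set.seed(2022)
iris_train_cv <- vfold_cv(iris, v = 5)

iris_knn_fit_tune <-
  tune_grid(
    iris_wflow,
    resamples = iris_train_cv,
    grid = tibble(neighbors = c(3L, 5L, 7L, 9L))
  )

# Pick the best candidate and refit the whole workflow on the full training data
best_params <- select_best(iris_knn_fit_tune, metric = "roc_auc")

final_fit <-
  iris_wflow %>%
  finalize_workflow(best_params) %>%
  fit(data = iris)

# Hypothetical new observations with the same raw column names as iris
new_obs <- tibble(
  Sepal.Length = c(5.0, 6.5),
  Sepal.Width  = c(3.4, 3.0),
  Petal.Length = c(1.5, 5.5),
  Petal.Width  = c(0.2, 2.0)
)

preds <- predict(final_fit, new_data = new_obs)
preds
```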