I am trying to understand how to apply step_pca to preprocess my data. Suppose I want to fit a K-Nearest Neighbor classifier to the iris dataset. For the sake of simplicity, I will not split the original iris dataset into train and test. I will assume iris is the train dataset and that I have some other observations as my test dataset.
I want to apply three transformations to the predictors in my train dataset:
- Center all predictor variables
- Scale all predictor variables
- PCA-transform all predictor variables and keep enough components to explain at least 80% of the variance in my data
So this is what I have:
library(tidymodels)
iris_rec <-
  recipe(Species ~ .,
         data = iris) %>%
  # center/scale
  step_center(-Species) %>%
  step_scale(-Species) %>%
  # pca
  step_pca(-Species, threshold = 0.8) %>%
  # apply data transformation
  prep()
iris_rec
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 4
#>
#> Training data contained 150 data points and no missing data.
#>
#> Operations:
#>
#> Centering for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> Scaling for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> PCA extraction with Sepal.Length, Sepal.Width, Petal.Length, Petal.W... [trained]
Created on 2022-10-13 with reprex v2.0.2
OK, so far, so good. All the transformations are applied to my dataset. When I prepare my train dataset using juice, everything goes as expected:
# transformed training set
iris_train_t <- juice(iris_rec)
iris_train_t
#> # A tibble: 150 × 3
#> Species PC1 PC2
#> <fct> <dbl> <dbl>
#> 1 setosa -2.26 -0.478
#> 2 setosa -2.07 0.672
#> 3 setosa -2.36 0.341
#> 4 setosa -2.29 0.595
#> 5 setosa -2.38 -0.645
#> 6 setosa -2.07 -1.48
#> 7 setosa -2.44 -0.0475
#> 8 setosa -2.23 -0.222
#> 9 setosa -2.33 1.11
#> 10 setosa -2.18 0.467
#> # … with 140 more rows
So, I have two predictors based on PCA (PC1 and PC2) and my response variable. However, when I proceed with my modelling, every model I try fails with the same error, as you can see below:
# cross validation
set.seed(2022)
iris_train_cv <- vfold_cv(iris_train_t, v = 5)
# tuning
iris_knn_tune <-
  nearest_neighbor(
    neighbors = tune(),
    weight_func = tune(),
    dist_power = tune()
  ) %>%
  set_engine("kknn") %>%
  set_mode("classification")
# grid search
iris_knn_grid <-
  grid_regular(neighbors(range = c(3, 9)),
               weight_func(),
               dist_power(),
               levels = c(22, 2, 2))
# workflow creation
iris_wflow <-
  workflow() %>%
  add_recipe(iris_rec) %>%
  add_model(iris_knn_tune)
# model assessment
iris_knn_fit_tune <-
  iris_wflow %>%
  tune_grid(
    resamples = iris_train_cv,
    grid = iris_knn_grid
  )
#> x Fold1: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold2: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold3: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold4: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold5: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
# cv results
collect_metrics(iris_knn_fit_tune)
#> Error in `estimate_tune_results()`:
#> ! All of the models failed. See the .notes column.
#> Backtrace:
#> ▆
#> 1. ├─tune::collect_metrics(iris_knn_fit_tune)
#> 2. └─tune:::collect_metrics.tune_results(iris_knn_fit_tune)
#> 3. └─tune::estimate_tune_results(x)
#> 4. └─rlang::abort("All of the models failed. See the .notes column.")
I suspect my problem is with the formula I defined in my iris_rec recipe. The formula there is
Species ~ ., data = iris
which means
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris
However, when I run my models, the predictor variables are PC1 and PC2, so I guess the formula should be
Species ~ ., data = iris_train_t
or
Species ~ PC1 + PC2, data = iris_train_t
How can I inform my model that my variables and dataset have changed? All the other step_* functions I have used in tidymodels have worked, but I am struggling specifically with step_pca.
Answer:
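Two things seem to be causing the confusion.
First, you don't need to prep() or juice() a recipe before using it in a model or workflow. The tuning and resampling functions do that within each resample. In your code, the resamples were built from the juiced data (iris_train_t), which only contains Species, PC1 and PC2, so when the recipe is re-applied within each fold it cannot find the original Sepal.* and Petal.* columns. That is what the check_training_set() error is telling you.
You can prep() and juice() if you want the processed training set to troubleshoot, visualize, or otherwise explore, but otherwise you don't need to.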
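If you do want to look at the processed data, one option is to prep a separate copy of the recipe and leave the unprepped one for the workflow. The following is only a sketch: the object names iris_prepped and new_obs are made up for illustration, and new_obs is just three rows of iris standing in for your "other observations".

```r
library(tidymodels)

# Unprepped recipe: this is the object that goes into the workflow
iris_rec <-
  recipe(Species ~ ., data = iris) %>%
  step_center(-Species) %>%
  step_scale(-Species) %>%
  step_pca(-Species, threshold = 0.8)

# Prep a separate copy, purely for inspection
iris_prepped <- prep(iris_rec, training = iris)

# Processed training set (same result as juice(iris_prepped))
iris_train_t <- bake(iris_prepped, new_data = NULL)

# Variance information for the PCA step (the third step in the recipe)
tidy(iris_prepped, number = 3, type = "variance")

# Stand-in for new observations (assumption: same columns as iris)
new_obs <- iris[c(1, 51, 101), ]

# Apply the statistics learned on the training data to the new rows
new_obs_t <- bake(iris_prepped, new_data = new_obs)
new_obs_t
```

The prepped copy is only for exploration; iris_rec itself stays untrained and is what you pass to add_recipe().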
Second, the recipe is basically a replacement for the formula. It knows which variables are predictors and which are outcomes, so there is rarely any need to use an additional formula on top of it.
(The exception is models that require special formulas.)
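To see what the recipe "knows", you can look at its variable roles before and after training. This is a small sketch; roles_before and roles_after are names I made up for the example.

```r
library(tidymodels)

iris_rec <-
  recipe(Species ~ ., data = iris) %>%
  step_center(-Species) %>%
  step_scale(-Species) %>%
  step_pca(-Species, threshold = 0.8)

# Roles as declared by the formula: four predictors and one outcome
roles_before <- summary(iris_rec)
roles_before

# Roles after training: the PCA step has swapped the original
# predictors for the retained components
roles_after <- summary(prep(iris_rec))
roles_after
```

The recipe tracks the change from the original columns to PC1 and PC2 on its own, so the model never needs a second formula.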
Here is updated code for you:
library(tidymodels)
iris_rec <-
  recipe(Species ~ .,
         data = iris) %>%
  # center/scale
  step_center(-Species) %>%
  step_scale(-Species) %>%
  # pca
  step_pca(-Species, threshold = 0.8)
set.seed(2022)
iris_train_cv <- vfold_cv(iris, v = 5) #<- changes here
# tuning
iris_knn_tune <-
  nearest_neighbor(
    neighbors = tune(),
    weight_func = tune(),
    dist_power = tune()
  ) %>%
  set_engine("kknn") %>%
  set_mode("classification")
# grid search
iris_knn_grid <-
  grid_regular(neighbors(range = c(3, 9)),
               weight_func(),
               dist_power(),
               levels = c(22, 2, 2))
# workflow creation
iris_wflow <-
  workflow() %>%
  add_recipe(iris_rec) %>%
  add_model(iris_knn_tune)
# model assessment
iris_knn_fit_tune <-
  iris_wflow %>%
  tune_grid(
    resamples = iris_train_cv,
    grid = iris_knn_grid
  )
show_best(iris_knn_fit_tune, metric = "roc_auc")
#> # A tibble: 5 × 9
#> neighbors weight_func dist_power .metric .estima…¹ mean n std_err .config
#> <int> <chr> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 9 rectangular 1 roc_auc hand_till 0.976 5 0.00580 Prepro…
#> 2 7 triangular 1 roc_auc hand_till 0.975 5 0.00688 Prepro…
#> 3 9 triangular 2 roc_auc hand_till 0.975 5 0.00571 Prepro…
#> 4 8 triangular 1 roc_auc hand_till 0.975 5 0.00655 Prepro…
#> 5 9 triangular 1 roc_auc hand_till 0.975 5 0.00655 Prepro…
#> # … with abbreviated variable name ¹.estimator
Created on 2022-10-13 with reprex v2.0.2
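Once tuning works, a possible next step for the "other observations" you mention is to finalize the workflow with the best parameters and fit it on the full training set. The sketch below is self-contained and, to keep it short, tunes only neighbors over a small hand-written grid (that simplification is mine, not part of the code above). The names best_params, final_fit, new_obs and preds are hypothetical, and new_obs is again just a few rows of iris without the outcome column.

```r
library(tidymodels)

iris_rec <-
  recipe(Species ~ ., data = iris) %>%
  step_center(-Species) %>%
  step_scale(-Species) %>%
  step_pca(-Species, threshold = 0.8)

# Simplified model spec: only the number of neighbors is tuned here
iris_knn_tune <-
  nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

iris_wflow <-
  workflow() %>%
  add_recipe(iris_rec) %>%
  add_model(iris_knn_tune)

# Resample the raw data; the recipe is applied inside each fold
set.seed(2022)
iris_train_cv <- vfold_cv(iris, v = 5)

iris_knn_fit_tune <-
  tune_grid(
    iris_wflow,
    resamples = iris_train_cv,
    grid = tibble(neighbors = c(3L, 5L, 7L, 9L))
  )

# Pick the best candidate and plug it into the workflow
best_params <- select_best(iris_knn_fit_tune, metric = "roc_auc")

final_fit <-
  iris_wflow %>%
  finalize_workflow(best_params) %>%
  fit(data = iris)

# Stand-in for new, unlabeled observations (raw columns, no PCs)
new_obs <- iris[c(1, 51, 101), -5]

# The fitted workflow centers, scales and projects new_obs itself
preds <- predict(final_fit, new_data = new_obs)
preds
```

Note that new_obs contains the raw measurements: the fitted workflow applies the trained recipe before predicting, so you never bake the test data by hand.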