I have the following prediction model:
library(tidymodels)
data(ames)
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.1)
# Preprocessing
norm_obj <- prep(norm_trans, training = ames_train)
rf_ames_train <- bake(norm_obj, ames_train) %>%
  dplyr::select(Sale_Price, everything()) %>%
  as.data.frame()
dim(rf_ames_train)
rf_xy_fit <- rand_forest(mode = "regression") %>%
  set_engine("ranger") %>%
  fit_xy(
    x = rf_ames_train,
    y = log10(rf_ames_train$Sale_Price)
  )
Note that after the preprocessing step the number of features is reduced from 74 to 33.
dim(rf_ames_train)
# [1] 2197   33
Currently, I have to pass the predictors to the function explicitly:
preds <- colnames(rf_ames_train)
my_pred_function <- function(fit = NULL, test_data = NULL, predictors = NULL) {
  test_results <- test_data %>%
    select(Sale_Price) %>%
    mutate(Sale_Price = log10(Sale_Price)) %>%
    bind_cols(
      predict(fit, new_data = test_data[, predictors])
    )
  test_results
}
my_pred_function(fit = rf_xy_fit, test_data = ames_test, predictors = preds)
This is shown as predictors = preds in the function call above. In practice, I have to save rf_xy_fit and preds as two RDS files and then read them back in, which is error-prone and troublesome.
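For context, the save/load round trip currently looks roughly like this (the file names are hypothetical):
# Two separate objects have to be saved and restored together
saveRDS(rf_xy_fit, "rf_xy_fit.rds")  # the fitted model
saveRDS(preds, "preds.rds")          # the predictor names
rf_xy_fit <- readRDS("rf_xy_fit.rds")
preds <- readRDS("preds.rds")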
I would like to bypass this explicit passing. Is there a way I can extract the predictor names from rf_xy_fit directly?
CodePudding user response:
This is a case where you would benefit from using the workflows package. It allows you to combine the preprocessing code with the model-fitting code:
library(tidymodels)
data(ames)
set.seed(4595)
# Notice how the log transformation is done before the splitting, so that it
# is applied to both the training and testing data sets.
ames <- ames %>%
  mutate(Sale_Price = log10(Sale_Price))
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.1)

rf_spec <- rand_forest(mode = "regression") %>%
  set_engine("ranger")

rf_wf <- workflow() %>%
  add_recipe(norm_trans) %>%
  add_model(rf_spec)
rf_fit <- fit(rf_wf, ames_train)
predict(rf_fit, new_data = ames_train)
#> # A tibble: 2,197 × 1
#> .pred
#> <dbl>
#> 1 5.09
#> 2 5.12
#> 3 5.01
#> 4 4.99
#> 5 5.12
#> 6 5.07
#> 7 4.90
#> 8 5.09
#> 9 5.13
#> 10 5.08
#> # … with 2,187 more rows
Created on 2022-11-21 with reprex v2.0.2
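Because the preprocessing and the model now live in one fitted workflow, a single object is enough to save and reload; a minimal sketch (the file name is hypothetical):
# One object holds both the recipe and the model
saveRDS(rf_fit, "rf_fit.rds")
rf_fit_reloaded <- readRDS("rf_fit.rds")

# predict() on the raw test data; the embedded recipe handles the preprocessing
predict(rf_fit_reloaded, new_data = ames_test)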
CodePudding user response:
Supplementing Emil's answer based on your comment...
Keep in mind that most R modeling functions will expect the original feature set, even if some of the features are not used at all. This is a by-product of R's formula/model.matrix() machinery.
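For illustration, here is a minimal sketch (not from the original answer) with a plain lm() fit on the same split, showing the formula machinery insisting on every variable at prediction time:
# lm() records the formula terms; predict() rebuilds the design matrix via
# model.matrix(), so every variable named in the formula must be present
lm_fit <- lm(Sale_Price ~ Lot_Area + Year_Built, data = ames_train)

# Errors ("object 'Year_Built' not found") because a formula variable is missing
try(predict(lm_fit, newdata = ames_test[, "Lot_Area", drop = FALSE]))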
For recipes, it depends on which steps you use.
You could refit the final model without them, but you might not get exactly the same model; in many cases, the process of getting to the subset of features depends on how many were originally passed.
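If the goal is just to see which columns the recipe removed, the tidy() methods of a prepped recipe report them; a small sketch using the norm_obj you already prepped (step numbers follow the order in which the steps were added):
# Each prepped step lists the columns it removed in its tidy() output;
# numbers 1-3 correspond to step_zv(), step_nzv(), and step_corr()
tidy(norm_obj, number = 1)
tidy(norm_obj, number = 2)
tidy(norm_obj, number = 3)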
I'm working on a tidymodels API for this, but caret has one to get the list of predictors that were actually used by the model. See the example:
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(tidymodels)
tidymodels_prefer()
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)
data(ames)
set.seed(4595)
ames <- ames %>%
  mutate(Sale_Price = log10(Sale_Price))
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rec <- recipe(Sale_Price ~ ., data = ames_train)
norm_trans <- rec %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.1)

rf_spec <- rand_forest(mode = "regression") %>%
  set_engine("ranger")

rf_wf <- workflow() %>%
  add_recipe(norm_trans) %>%
  add_model(rf_spec)
rf_fit <- fit(rf_wf, ames_train)
# get predictor set:
rf_features <-
  rf_fit %>%
  extract_fit_engine() %>%
  predictors() # <- the caret function
head(rf_features)
#> [1] "MS_SubClass" "MS_Zoning" "Lot_Frontage" "Lot_Shape" "Lot_Config"
#> [6] "Neighborhood"
# You get an error here:
ames_test %>%
  select(all_of(rf_features)) %>%
  predict(rf_fit, new_data = .)
#> Error in `validate_column_names()`:
#> ! The following required columns are missing: 'Lot_Area',
#> 'Street', 'Alley', 'Land_Contour', 'Utilities', 'Land_Slope',
#> 'Condition_2', 'Year_Built', 'Year_Remod_Add', 'Roof_Matl',
#> 'Mas_Vnr_Area', 'Bsmt_Cond', 'BsmtFin_SF_1', 'BsmtFin_Type_2',
#> 'BsmtFin_SF_2', 'Bsmt_Unf_SF', 'Total_Bsmt_SF', 'Heating',
#> 'First_Flr_SF', 'Second_Flr_SF', 'Gr_Liv_Area', 'Bsmt_Full_Bath',
#> 'Full_Bath', 'Half_Bath', 'Bedroom_AbvGr', 'Kitchen_AbvGr',
#> 'TotRms_AbvGrd', 'Functional', 'Fireplaces', 'Garage_Cars',
#> 'Garage_Area', 'Wood_Deck_SF', 'Open_Porch_SF', 'Enclosed_Porch',
#> 'Three_season_porch', 'Screen_Porch', 'Pool_Area', 'Pool_QC',
#> 'Misc_Feature', 'Misc_Val', 'Mo_Sold', 'Latitude'.
Created on 2022-11-21 by the reprex package (v2.0.1)
This error comes from the workflows package, but the underlying modeling package would also error.
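By contrast, a fitted workflow expects the full original feature set and lets its recipe drop the unused columns, so predicting on the untouched test data works:
# Pass the complete original columns; the recipe inside the workflow
# removes the unused ones before the model sees them
predict(rf_fit, new_data = ames_test)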