Background: I'm creating a recipe to clean and transform time-series data that will be used by multiple models. One of the steps in the recipe is to remove correlated predictors using the step_corr() function.
However, due to the nature of the data set, some of the variables can have a constant value for the entire set of training data when doing cross-validation using a rolling window and thus cause the step_corr() function to throw a warning.
Problem Statement: In such cases, is it possible to exclude such variables from the correlation step? Or perhaps remove the variable entirely?
P.S. I know I can easily ignore the warning and proceed. But I'm looking for a cleaner approach / best practice advice.
CodePudding user response:
There are two steps for you to consider:
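step_zv() will remove variables that contain only a single value (zero variance).
step_nzv() will remove variables that almost all have the same value (near-zero variance: highly sparse and unbalanced).
Place either one before step_corr() in the recipe, so the constant columns are already gone when the correlations are computed.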
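As a minimal sketch of that ordering, the example below builds a small made-up training window in which one predictor is constant and two others are nearly collinear. The data frame `train`, its column names, and the `threshold = 0.9` value are assumptions for illustration only, not taken from the question.

```r
library(recipes)

# Hypothetical training window: x3 is constant, x2 is nearly collinear with x1.
set.seed(1)
train <- data.frame(
  y  = rnorm(50),
  x1 = rnorm(50),
  x3 = rep(5, 50)
)
train$x2 <- train$x1 + rnorm(50, sd = 0.01)

rec <- recipe(y ~ ., data = train) %>%
  # Drop zero-variance predictors first, so step_corr() never sees them.
  step_zv(all_numeric_predictors()) %>%
  # Then filter highly correlated predictors among the remaining columns.
  step_corr(all_numeric_predictors(), threshold = 0.9)

# The columns to drop are determined from the training window only.
prepped <- prep(rec, training = train)
baked <- bake(prepped, new_data = NULL)
names(baked)
```

Because the steps are estimated inside prep(), the set of removed columns is recomputed for each resample, so a variable that is constant in one rolling window but not in another is only dropped where it is actually constant. Swapping step_zv() for step_nzv() would apply the stricter near-zero-variance filter instead.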