I'm building a GBM classifier to predict a certain target variable.
My data contains many continuous variables, and I want to scale only one of them (age) using the scale function. To avoid information leakage, I should scale this variable in the train set and then scale it in the test set using the train set's parameters. My question is: how do I do this in R?
Currently I scale the age feature separately in the train set and in the test set, which is not quite right. Here is my code (I use the caret package):
library(caret)
library(pROC)  # assumed source of roc() and auc(); the post only names caret

for (i in 1:10) {
  print(i)
  set.seed(i)
  IND = createDataPartition(y = MYData$Target_feature, p = 0.8, list = FALSE)
  TRAIN_set = MYData[IND, ]
  TEST_set = MYData[-IND, ]
  # Each set is scaled with its own mean and sd here; this is the part in question
  TRAIN_set$age = scale(TRAIN_set$age)
  TEST_set$age = scale(TEST_set$age)
  GBMModel <- train(Target_feature ~ ., data = TRAIN_set,
                    method = "gbm",
                    metric = "ROC",
                    trControl = ctrlCV,
                    tuneGrid = gbmGRID,
                    verbose = FALSE
  )
  AUCs_Trn[i] = auc(roc(TRAIN_set$Target_feature, predict(GBMModel, TRAIN_set, type = 'prob')[, 1]))
  AUCs_Tst[i] = auc(roc(TEST_set$Target_feature, predict(GBMModel, TEST_set, type = 'prob')[, 1]))
}
NOTE: I only want to scale the age feature.
CodePudding user response:
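One way to do it is to manually scale the test data using the mean and standard deviation of the training set. This is the same transformation scale() applies by default (subtract the mean, divide by the standard deviation), only with the training set's statistics instead of the test set's own: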
test$age_scaled = (test$age - mean(train$age) ) / sd(train$age)
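The same idea can also be written with scale() itself, since scale() stores the centre and spread it used as attributes on its result and accepts explicit center and scale arguments. Below is a minimal sketch, assuming toy train and test data frames with made-up age values (they stand in for TRAIN_set and TEST_set from the question):

```r
# Toy data standing in for the real train/test split (values are made up).
train <- data.frame(age = c(20, 30, 40, 50, 60))
test  <- data.frame(age = c(25, 45, 65))

# Scale the training column; scale() keeps the statistics it used as attributes.
age_scaled_train <- scale(train$age)
train_mean <- attr(age_scaled_train, "scaled:center")
train_sd   <- attr(age_scaled_train, "scaled:scale")

# as.numeric() drops the one-column matrix structure that scale() returns.
train$age_scaled <- as.numeric(age_scaled_train)

# Apply the *training* mean and sd to the test column.
test$age_scaled <- as.numeric(
  scale(test$age, center = train_mean, scale = train_sd)
)
```

Inside the loop from the question, the two scale() lines would follow the same pattern: scale TRAIN_set$age first, read the two attributes, then pass them to scale() for TEST_set$age. Because this happens after each createDataPartition() call, the statistics are recomputed for every split.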