While doing feature selection using SelectKBest, can we use the same code for the training and testing data?

x_tr = SelectKBest(chi2, k=25).fit_transform(x_tr,y_tr)
x_ts = SelectKBest(chi2, k=25).fit_transform(x_ts, y_ts)

This is the code I have. I'm worried that it will select different features for the training and testing data. Should I change the code or will it give the same features?

CodePudding user response：

Short answer: You will obtain different features (unless you are lucky).

Why? Basically, because you are obtaining information from different data:

x_tr = SelectKBest(chi2, k=25).fit_transform(x_tr,y_tr)
x_ts = SelectKBest(chi2, k=25).fit_transform(x_ts, y_ts)

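In the first line the features are selected using x_tr and y_tr; in the second line they are selected using x_ts and y_ts. Each call fits a new selector, so the chi2 scores are computed on different data and the top 25 can differ. Both outputs will have 25 columns, but those columns need not correspond to the same original features. To sum up, if the input is different, the output is very likely to be different too.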

The only case in which you will obtain the same features is when the training and test data are extremely homogeneous and carry exactly the same information. In your case you are asking for 25 features, and it is quite unlikely that both calls pick exactly the same 25.
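To make this concrete, here is a minimal sketch on synthetic data. The data, the seed, and the split are assumptions for illustration only (they are not from the question); the features are kept non-negative because chi2 requires that. It fits one selector per split, as in the question, and counts how many columns both fits agree on.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Hypothetical data: 100 non-negative count features, mostly noise.
X = rng.integers(0, 10, size=(400, 100))
y = rng.integers(0, 2, size=400)
X[:, :5] += 3 * y[:, None]  # make the first 5 columns depend on the label

x_tr, x_ts = X[:300], X[300:]
y_tr, y_ts = y[:300], y[300:]

# Two independent fits, as in the question
mask_tr = SelectKBest(chi2, k=25).fit(x_tr, y_tr).get_support()
mask_ts = SelectKBest(chi2, k=25).fit(x_ts, y_ts).get_support()

# Count the columns that both selectors kept
shared = int(np.sum(mask_tr & mask_ts))
print("columns chosen by both fits:", shared, "of 25")
```

Any count below 25 means the two transformed matrices contain different original features, so a model trained on one cannot be meaningfully evaluated on the other.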

To apply the same selection to both sets, fit the selector on the training data only and reuse it for the test data:

from sklearn.feature_selection import SelectKBest, chi2

select = SelectKBest(score_func=chi2, k=25)  # define the selector
x_tr_selected = select.fit_transform(x_tr, y_tr)  # fit on x_tr and y_tr, then transform x_tr
x_ts_selected = select.transform(x_ts)  # only transform x_ts, reusing the features chosen during the fit
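If you want to confirm which columns were kept, the fitted selector exposes them through get_support. The sketch below uses made-up non-negative data (an assumption, since the original x_tr and x_ts are not shown) and checks that the transformed test set is exactly the test set restricted to the columns chosen on the training data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)

# Hypothetical non-negative data (chi2 rejects negative values)
x_tr = rng.integers(0, 10, size=(300, 100))
y_tr = rng.integers(0, 2, size=300)
x_ts = rng.integers(0, 10, size=(100, 100))

select = SelectKBest(score_func=chi2, k=25)
x_tr_selected = select.fit_transform(x_tr, y_tr)
x_ts_selected = select.transform(x_ts)

# Column indices chosen during the fit on the training data
kept = select.get_support(indices=True)
print("kept columns:", kept)

# The test set is reduced to exactly those training-selected columns
same_columns = np.array_equal(x_ts_selected, x_ts[:, kept])
print("test set uses the training-selected columns:", same_columns)
```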

CodePudding user response：

To get the same features, you have to fit on training data, and then transform both training and testing data.

select = SelectKBest(chi2, k=25).fit(x_tr, y_tr)

X_train_new = select.transform(x_tr)
X_test_new = select.transform(x_ts)
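One way to make the "fit on training, transform both" rule hard to get wrong is to put the selector in a scikit-learn Pipeline together with the model. This is only a sketch under assumptions: the data is synthetic, and MultinomialNB is picked here just because it accepts non-negative count features; it is not something the question mentions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(2)

# Hypothetical non-negative data with a few label-dependent columns
X = rng.integers(0, 10, size=(400, 100))
y = rng.integers(0, 2, size=400)
X[:, :5] += 3 * y[:, None]
x_tr, x_ts, y_tr, y_ts = X[:300], X[300:], y[:300], y[300:]

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=25)),  # fitted on the training data only
    ("clf", MultinomialNB()),             # placeholder classifier (assumption)
])

pipe.fit(x_tr, y_tr)               # fits the selector, then the classifier
accuracy = pipe.score(x_ts, y_ts)  # the selector only transforms x_ts here
print(f"test accuracy: {accuracy:.2f}")
```

Calling pipe.fit fits the selector on x_tr and y_tr, and pipe.score or pipe.predict only applies the already fitted selection to the test data.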