x_tr = SelectKBest(chi2, k=25).fit_transform(x_tr,y_tr)
x_ts = SelectKBest(chi2, k=25).fit_transform(x_ts, y_ts)
This is the code I have. I'm worried that it will select different features for the training and testing data. Should I change the code or will it give the same features?
CodePudding user response:
Short answer: You will obtain different features (unless you are lucky).
Why? Because each call fits a new selector on different data:
x_tr = SelectKBest(chi2, k=25).fit_transform(x_tr,y_tr)
x_ts = SelectKBest(chi2, k=25).fit_transform(x_ts, y_ts)
In the first line the features are selected using x_tr and y_tr; in the second line they are selected using x_ts and y_ts. Each selector is fitted on different data, so the chi2 scores differ and the selected features can differ too. To sum up: if the input is different, the output is very likely to be different as well.
The only case in which you will obtain the same features is when the training and test data are so homogeneous that they rank the features identically. In your case you are asking for 25 features, so it is quite unlikely that both calls select exactly the same 25.
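If you want to see this for yourself, here is a minimal sketch that compares the columns chosen by two separately fitted selectors. The data is synthetic and only stands in for your x_tr, y_tr, x_ts and y_ts (the shapes, the seed and the value range are assumptions; chi2 needs non-negative features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real data (assumption): 50 non-negative features
x_tr = rng.integers(0, 10, size=(200, 50)).astype(float)
y_tr = rng.integers(0, 2, size=200)
x_ts = rng.integers(0, 10, size=(80, 50)).astype(float)
y_ts = rng.integers(0, 2, size=80)

# Two independent selectors, each fitted on its own data (as in the question)
idx_tr = SelectKBest(chi2, k=25).fit(x_tr, y_tr).get_support(indices=True)
idx_ts = SelectKBest(chi2, k=25).fit(x_ts, y_ts).get_support(indices=True)

# Column indices picked by both selectors
shared = np.intersect1d(idx_tr, idx_ts)

print("selected on train:", idx_tr)
print("selected on test: ", idx_ts)
print("columns in common:", len(shared), "of 25")
```

get_support(indices=True) returns the indices of the columns each selector kept, so comparing the two arrays shows directly whether both calls agree.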
To apply the same selection to both sets, fit the selector on the training data only and reuse it to transform the test data:
from sklearn.feature_selection import SelectKBest, chi2
select = SelectKBest(score_func=chi2, k=25)  # Define the selector
x_tr_selected = select.fit_transform(x_tr, y_tr)  # Fit on x_tr and y_tr, then transform x_tr
x_ts_selected = select.transform(x_ts)  # Only transform x_ts, using the features chosen in the previous fit
CodePudding user response:
To get the same features, you have to fit on training data, and then transform both training and testing data.
select = SelectKBest(chi2, k=25).fit(x_tr, y_tr)
X_train_new = select.transform(x_tr)
X_test_new = select.transform(x_ts)
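As a sanity check, a sketch like the following could confirm that a single fitted selector keeps the same columns in both sets. The data is synthetic (shapes and seed are assumptions), and the variable cols is introduced here only for illustration:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real data (assumption): 50 non-negative features
x_tr = rng.integers(0, 10, size=(200, 50)).astype(float)
y_tr = rng.integers(0, 2, size=200)
x_ts = rng.integers(0, 10, size=(80, 50)).astype(float)

# Fit once on the training data, then reuse the fitted selector for both sets
select = SelectKBest(chi2, k=25).fit(x_tr, y_tr)
X_train_new = select.transform(x_tr)
X_test_new = select.transform(x_ts)

# Indices of the columns the fitted selector keeps
cols = select.get_support(indices=True)

# Both transformed sets are exactly the original data restricted to those columns
assert np.array_equal(X_train_new, x_tr[:, cols])
assert np.array_equal(X_test_new, x_ts[:, cols])

print(X_train_new.shape, X_test_new.shape)  # (200, 25) (80, 25)
```

Note that y_ts is never used here: the test labels play no part in choosing the features.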