I am working on a binary classification using random forest with a dataset size of 977 records and 6 columns. class ratio is 77:23 (imbalanced dataset)
Since, my dataset is small, I learnt that it is not advisable to split using regular train_test split of 70 and 30.
So, I was thinking to do repeatedKfold CV. Please find my code below
Approach 1 - Full data - X, y
rf_boruta = RandomForestClassifier(class_weight='balanced',max_depth=3,max_features='sqrt',n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta,X,y, scoring='f1', cv=cv)
print('mean f1: %.3f' % mean(scores))
But I see that we have full input data X
passed at once to the model. Doesn't this lead to data leakage? Meaning, if I am doing categorical encoding, we have to do based on all categories encountered in full dataset. Similarly, consider if a dataset ranges from the year 2017 to 2022. It is possible that model uses 2021 data in one of the folds and validate it on the 2020 data.
So, is it right to use repeatedKfold
like the below?
Approach 2 - only train data - X_train, y_train
rf_boruta = RandomForestClassifier(class_weight='balanced',max_depth=3,max_features='sqrt',n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta,X_train,y_train, scoring='f1', cv=cv)
print('mean f1: %.3f' % mean(scores))
Can help me understand which will be the best approach to use?
CodePudding user response:
I'd say that there are two ways to do it. The first way is to write the code for training and validating manually. Here is an example of a code for it:
scores = []
folds = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
for fold_n, (train_index, valid_index) in enumerate(folds.split(train, y, groups=train['breath_id'])):
X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
encoder = LabelEncoder()
encoder.fit(X_train[:, 0])
X_train[:, 0] = encoder.transform(X_train[:, 0])
X_valid [:, 0] = encoder.transform(X_valid [:, 0])
rf_boruta = RandomForestClassifier(class_weight='balanced',max_depth=3,max_features='sqrt',n_estimators=300)
rf_boruta .fit(X_train, y_train)
score = metrics.f1_score(y_valid, rf_boruta .predict(X_valid))
scores.append(score)
The second way is to use Pipeline from sklearn:
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# creating artificial data
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4, n_redundant=2)
# making one of the column categorical
X[:, 0] = np.random.randint(0, 10, 1000)
# converting into DataFrame so that we can use column names
X = pd.DataFrame(X, columns = [str(i) for i in range(6)])
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), ['1', '2']),
('cat', OneHotEncoder(), ['0']),
]
)
rf = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10)
pipe = Pipeline([('transformer', preprocessor), ('rf', rf)])
scores = cross_val_score(rf, X, y, scoring='f1', cv=cv)
print(f'Mean f1: {np.mean(scores):.3f}')