I have a DataFrame in Python pandas like the one below:
Input data:
Y - binary target
X1...X5 - predictors
Y    X1   X2   X3   X4   X5
1    111  22   1    0    150
0    12   33   1    0    222
1    150  44   0    0    230
0    270  55   0    1    500
...  ...  ...  ...  ...  ...
Requirements: I need to:
- loop over the variables so that each iteration creates a new XGBoost classification model and then discards one variable before the next model is created
- So, if I have for example 5 predictors (X1...X5), I need to create 5 XGBoost classification models, and each successive model must use 1 fewer variable
- Each model should be evaluated by roc_auc_score
- As an output I need:
list_of_models = []
where the created models will be saved, plus a DataFrame with AUC on train and test
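The loop described in these requirements boils down to a sequence of shrinking column lists. A minimal sketch, assuming the variables are discarded in column order starting from the first (the requirements do not say which one to drop); `shrinking_feature_sets` is a hypothetical helper name:

```python
def shrinking_feature_sets(columns):
    """Yield the full column list, then the list with one more leading column removed each time."""
    columns = list(columns)
    for start in range(len(columns)):
        # Assumption: discard from the front; the drop order is not fixed by the requirements.
        yield columns[start:]


feature_sets = list(shrinking_feature_sets(["X1", "X2", "X3", "X4", "X5"]))
# 5 predictors give 5 feature sets, of sizes 5, 4, 3, 2 and 1
```

Each list can then be used to select the columns for one model, for example `X_train[feature_set]`.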
Desired output:
So, as a result I need to have something like below
Model - position of model in list_of_models
Num_var - number of predictors used in model
AUC_train - roc_auc_score on train dataset
AUC_test - roc_auc_score on test dataset
Model  Num_var  AUC_train  AUC_test
0      5        0.887      0.884
1      4        0.875      0.845
2      3        0.854      0.843
3      2        0.965      0.928
4      1        0.922      0.921
My draft, which is wrong because it never discards a variable. It should loop through the variables so that each iteration creates a new XGBoost classification model and then discards one variable before the next model is created:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("Y", axis=1),
    df.Y,
    train_size=0.70,
    test_size=0.30,
    random_state=1,
    stratify=df.Y,
)

results = []
list_of_models = []
for val in X_train:
    # every iteration still fits on all columns; nothing is discarded yet
    model = XGBClassifier()
    model.fit(X_train, y_train)
    list_of_models.append(model)
    preds_train = model.predict(X_train)
    preds_test = model.predict(X_test)
    preds_prob_train = model.predict_proba(X_train)[:, 1]
    preds_prob_test = model.predict_proba(X_test)[:, 1]
    results.append({"AUC_train": round(metrics.roc_auc_score(y_train, preds_prob_train), 3),
                    "AUC_test": round(metrics.roc_auc_score(y_test, preds_prob_test), 3)})
results = pd.DataFrame(results)
How can I do that in Python?
CodePudding user response:
You want to make your data narrower during each loop? If I understand this correctly you could do something like this:
# assumes X_train, X_test, y_train, y_test from the question,
# with pd, metrics and XGBClassifier already imported
results = []
list_of_models = []
for i in X_train.columns:
    model = XGBClassifier()
    model.fit(X_train, y_train)
    list_of_models.append(model)
    preds_train = model.predict(X_train)
    preds_test = model.predict(X_test)
    # AUC is scored on the positive-class probabilities
    preds_prob_train = model.predict_proba(X_train)[:, 1]
    preds_prob_test = model.predict_proba(X_test)[:, 1]
    results.append({"AUC_train": round(metrics.roc_auc_score(y_train, preds_prob_train), 3),
                    "AUC_test": round(metrics.roc_auc_score(y_test, preds_prob_test), 3)})
    # drop the current column so the next model sees one fewer predictor
    X_train = X_train.drop(i, axis=1)
    X_test = X_test.drop(i, axis=1)
results = pd.DataFrame(results)
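To also get the Model and Num_var columns from the desired output, and to leave X_train and X_test unmodified, the same loop could be wrapped in a function. This is only a sketch: `evaluate_shrinking_models` is a hypothetical name, the model factory and the scoring function are passed in as arguments (for example `XGBClassifier` and `roc_auc_score`) rather than hard-coded, and columns are dropped from the front, as in the loop above.

```python
def evaluate_shrinking_models(X_train, X_test, y_train, y_test, make_model, auc):
    """Fit one model per shrinking feature set and return (models, rows).

    make_model: zero-argument callable returning an unfitted classifier
                with fit() and predict_proba() (assumption: e.g. XGBClassifier).
    auc:        callable (y_true, y_score) -> float (assumption: e.g. roc_auc_score).
    """
    columns = list(X_train.columns)
    models, rows = [], []
    for position in range(len(columns)):
        used = columns[position:]  # one fewer leading column on each pass
        model = make_model()
        model.fit(X_train[used], y_train)
        models.append(model)
        # score on the positive-class probability, not on hard labels
        prob_train = model.predict_proba(X_train[used])[:, 1]
        prob_test = model.predict_proba(X_test[used])[:, 1]
        rows.append({
            "Model": position,        # index of the model in the returned list
            "Num_var": len(used),     # number of predictors the model was fitted on
            "AUC_train": round(auc(y_train, prob_train), 3),
            "AUC_test": round(auc(y_test, prob_test), 3),
        })
    return models, rows


# Possible usage with the objects from the question (not executed here):
# list_of_models, rows = evaluate_shrinking_models(
#     X_train, X_test, y_train, y_test, XGBClassifier, metrics.roc_auc_score)
# results = pd.DataFrame(rows)
```

Selecting columns with `X_train[used]` instead of reassigning `X_train` keeps the original split intact, so the function can be called again with a different column order.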