How to loop through all variables and build a new XGBoost classification model after each iteration, with one variable fewer each time

I have DataFrame in Python Pandas like below:

Input data:

Y - binary target
X1...X5 - predictors

Y	X1	X2	X3	X4	X5
1	111	22	1	0	150
0	12	33	1	0	222
1	150	44	0	0	230
0	270	55	0	1	500
...	...	...	...	...	...

Requirements:

Run a loop through all the variables so that each iteration creates a new XGBoost classification model, and then discards one of the variables before the next model is created.
So, if I have for example 5 predictors (X1...X5), I need to create 5 XGBoost classification models, and each successive model must use 1 fewer variable.
Each model should be evaluated with roc_auc_score.
As output I need list_of_models = [], where the created models are saved, and a DataFrame with the AUC on train and test.
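
To make the requirement concrete, here is a small sketch of which predictors each successive model would see. The function name shrinking_feature_sets is hypothetical, and dropping the first remaining variable is an assumption for illustration only, since the requirement does not say which variable to discard.

```python
def shrinking_feature_sets(columns):
    """Return the predictor lists for successive models, each one variable shorter."""
    remaining = list(columns)
    subsets = []
    while remaining:
        subsets.append(list(remaining))
        # Assumption: discard the first remaining variable before the next model
        remaining.pop(0)
    return subsets


feature_sets = shrinking_feature_sets(["X1", "X2", "X3", "X4", "X5"])
for position, features in enumerate(feature_sets):
    # position of the model, number of predictors, and the predictors themselves
    print(position, len(features), features)
```

With 5 predictors this gives 5 feature lists, so 5 models, the last of which uses a single variable.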

Desired output:

As a result, I need something like the table below:

Model - position of model in list_of_models
Num_var - number of predictors used in model
AUC_train - roc_auc_score on train dataset
AUC_test - roc_auc_score on test dataset

Model	Num_var	AUC_train	AUC_test
0	5	0.887	0.884
1	4	0.875	0.845
2	3	0.854	0.843
3	2	0.965	0.928
4	1	0.922	0.921

My draft, which is wrong: it should loop through all the variables so that each iteration creates a new XGBoost classification model and then discards one variable before creating the next model:

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(df.drop("Y", axis=1),
                                                    df.Y,
                                                    train_size=0.70,
                                                    test_size=0.30,
                                                    random_state=1,
                                                    stratify=df.Y)

results = []
list_of_models = []

# Problem: every iteration refits on the full X_train, so no variable is ever discarded
for val in X_train:

    model = XGBClassifier()
    model.fit(X_train, y_train)
    list_of_models.append(model)

    preds_train = model.predict(X_train)
    preds_test = model.predict(X_test)
    preds_prob_train = model.predict_proba(X_train)[:, 1]
    preds_prob_test = model.predict_proba(X_test)[:, 1]

    results.append({"AUC_train": round(metrics.roc_auc_score(y_train, preds_prob_train), 3),
                    "AUC_test": round(metrics.roc_auc_score(y_test, preds_prob_test), 3)})

results = pd.DataFrame(results)

How can I do that in Python?

CodePudding user response:

You want to make your data narrower on each pass through the loop? If I understand this correctly, you could do something like this:

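import pandas as pd
from sklearn import metrics
from xgboost import XGBClassifier

results = []
list_of_models = []

# X_train.columns is evaluated once, so reassigning X_train inside the loop is safe
for i in X_train.columns:
    # Fit on the predictors that are still left
    model = XGBClassifier()
    model.fit(X_train, y_train)
    list_of_models.append(model)

    # roc_auc_score needs the probability of the positive class, not the predicted labels
    preds_prob_train = model.predict_proba(X_train)[:, 1]
    preds_prob_test = model.predict_proba(X_test)[:, 1]
    results.append({"Model": len(list_of_models) - 1,
                    "Num_var": X_train.shape[1],
                    "AUC_train": round(metrics.roc_auc_score(y_train, preds_prob_train), 3),
                    "AUC_test": round(metrics.roc_auc_score(y_test, preds_prob_test), 3)})

    # Discard the current variable from both sets before the next iteration
    X_train = X_train.drop(i, axis=1)
    X_test = X_test.drop(i, axis=1)

results = pd.DataFrame(results)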
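
If you want to try the idea end to end without your own data, here is a self-contained sketch. Several things in it are assumptions rather than part of the question: the synthetic data from make_classification stands in for df, the helper name fit_shrinking_models is hypothetical, the variable dropped each round is simply the first remaining column, and the fallback to scikit-learn's GradientBoostingClassifier is only there so the sketch still runs where xgboost is not installed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier as Classifier
except ImportError:
    # Assumption: fallback used only so the sketch runs without xgboost
    from sklearn.ensemble import GradientBoostingClassifier as Classifier


def fit_shrinking_models(X_train, X_test, y_train, y_test):
    """Fit one model per predictor count, dropping the first remaining column each round."""
    models, rows = [], []
    cols = list(X_train.columns)
    while cols:
        model = Classifier()
        model.fit(X_train[cols], y_train)
        models.append(model)
        # Score with positive-class probabilities on both splits
        prob_train = model.predict_proba(X_train[cols])[:, 1]
        prob_test = model.predict_proba(X_test[cols])[:, 1]
        rows.append({
            "Model": len(models) - 1,
            "Num_var": len(cols),
            "AUC_train": round(roc_auc_score(y_train, prob_train), 3),
            "AUC_test": round(roc_auc_score(y_test, prob_test), 3),
        })
        cols = cols[1:]  # discard one variable before the next model
    return models, pd.DataFrame(rows)


# Synthetic stand-in for df (assumption): binary Y and five predictors X1..X5
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
df = pd.DataFrame(X, columns=["X1", "X2", "X3", "X4", "X5"])
df["Y"] = y

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("Y", axis=1), df["Y"], train_size=0.70, random_state=1, stratify=df["Y"]
)
list_of_models, results = fit_shrinking_models(X_train, X_test, y_train, y_test)
print(results)
```

Selecting columns with X_train[cols] leaves the original X_train and X_test untouched, so you can rerun the loop with a different drop order (for example, least important feature first) without redoing the split.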
