I have train and test data sets, each containing 30 independent features and 1 target feature.
All the features are numerical. An example of the train data set is shown below; the test data set has the same columns.
Target | col1 | col2 | ... | col29 | col30 |
---|---|---|---|---|---|
20 | 12 | 14 | ... | 15 | 12 |
25 | 13 | 25 | ... | 19 | 19 |
I want to write efficient code that fits a LightGBM regressor on every combination of the features and evaluates it on the test data set, to find the combination of features that gives the best (lowest) MAE.
An example of the result output that I am looking for:
Rank | Features_used | MAE |
---|---|---|
1 | col1,col2,col14,col17,col18 | 2.40 |
2 | col4,col5,col15,col19,col24 | 2.50 |
3 | col4,col5,col15,col19,col24,col29,col18,col13 | 2.50 |
-- | ---- | --- |
-- | ---- | --- |
-- | ---- | --- |
-- | ---- | --- |
n | worst combination of features | Worst MAE |
I have tried passing each combination of features individually and computing the MAE, but this is inefficient when trying out all the combinations.
import lightgbm

Predict = 'Target'
train = train[['Target','col1','col2','col3','col4','col5']]
test = test[['Target','col1','col2','col3','col4','col5']]
X_train = train[train.columns.difference([Predict])]
X_test = test[test.columns.difference([Predict])]
y_train = train[Predict]
y_test = test[Predict]
regressor = lightgbm.LGBMRegressor()
regressor= regressor.fit(X_train, y_train,eval_metric = ["MAE"])
y_pred = regressor.predict(X_test)
Is there an efficient way to run all the combinations of features and rank the output by MAE?
CodePudding user response:
The first step is to generate every combination of the features, keeping "Target" in every combination.
The second step is to iterate over every combination, train, predict, calculate the MAE, and store it in a dataframe along with the features used.
The final step is to sort the dataframe by MAE.
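One thing to keep in mind about the second step is how many models it trains: n features give 2^n - 1 non-empty combinations, so a handful of columns is cheap, but the full 30 columns would mean over a billion fits. A minimal sketch of that count follows; `count_subsets` and its `max_size` cap are hypothetical helpers for illustration and are not part of LightGBM or pandas.

```python
from math import comb

def count_subsets(n_features, max_size=None):
    """Number of non-empty feature subsets, optionally capped at max_size features."""
    if max_size is None:
        max_size = n_features
    # Sum of "n choose r" for every allowed subset size r.
    return sum(comb(n_features, r) for r in range(1, max_size + 1))

print(count_subsets(5))      # 31
print(count_subsets(30))     # 1073741823
print(count_subsets(30, 3))  # 4525
```

Capping the subset size, as in the last call, is one way to keep an exhaustive search feasible; whether a cap is acceptable depends on the problem.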
import lightgbm
import pandas as pd
from itertools import compress, product
from sklearn.metrics import mean_absolute_error as mae

# This function yields every combination (subset) of the given items; the first one is empty
def combinations(items):
    return (list(compress(items, mask)) for mask in product([0, 1], repeat=len(items)))

def lgbm(train, test, all_columns):
    Predict = 'Target'
    train = train[all_columns]
    test = test[all_columns]
    X_train = train[train.columns.difference([Predict])]
    X_test = test[test.columns.difference([Predict])]
    y_train = train[Predict]
    y_test = test[Predict]
    regressor = lightgbm.LGBMRegressor()
    regressor = regressor.fit(X_train, y_train, eval_metric=["MAE"])
    y_pred = regressor.predict(X_test)
    # Calculate the MAE
    return mae(y_test, y_pred)

all_columns = ['Target', 'col1', 'col2', 'col3', 'col4', 'col5']

# Every combination of feature indices (1..n); [1:] drops the empty combination
combi_col = list(combinations(range(1, len(all_columns))))[1:]

# Train the model on each combination and collect the MAE with the features used
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
rows = []
for idx in combi_col:
    # Index 0 refers to the target column, which must be included in every combination
    columns = [all_columns[i] for i in [0] + idx]
    error = lgbm(train, test, columns)
    rows.append({"Features_used": ",".join(columns[1:]), "MAE": error})

d = pd.DataFrame(rows)
# Rank 1 is the lowest (best) MAE
d['Rank'] = d['MAE'].rank(method='min').astype(int)
d = d.sort_values("MAE")
d = d[["Rank", "Features_used", "MAE"]]
d
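The same three steps (enumerate, score, sort) can also be separated from the model by passing the scoring function in, which makes it easy to cap the subset size and to try the ranking logic without training anything. The sketch below uses only the standard library; `rank_subsets`, `max_size`, and `toy_score` are hypothetical names introduced for illustration, and `toy_score` is an invented stand-in for "fit LightGBM, predict, compute MAE", not a real error measure.

```python
from itertools import combinations

def rank_subsets(features, score_fn, max_size=None):
    """Score every non-empty subset of `features` (up to max_size) and rank by lowest score."""
    max_size = len(features) if max_size is None else max_size
    results = []
    for size in range(1, max_size + 1):
        for subset in combinations(features, size):
            results.append((subset, score_fn(subset)))
    results.sort(key=lambda item: item[1])  # lowest error first
    return [(rank, ",".join(subset), score)
            for rank, (subset, score) in enumerate(results, start=1)]

# Toy scorer (assumption for illustration): pretends col1 and col2 reduce the error
# and that every extra column costs a little.
def toy_score(subset):
    useful = {"col1": 1.0, "col2": 0.5}
    return 3.0 - sum(useful.get(c, 0.0) for c in subset) + 0.1 * len(subset)

ranking = rank_subsets(["col1", "col2", "col3"], toy_score, max_size=2)
for row in ranking:
    print(row)
```

In practice `score_fn` would wrap the train/predict/MAE function, taking a tuple of feature names and returning the error for that subset.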