Run and rank all combinations of features for a machine learning model

I have train and test data sets, each containing 30 independent features and 1 target feature.

All the features are numerical variables. An example of the train data set is shown below; the test data set has the same columns.

Target	col1	col2	...	col29	col30
20	12	14	...	15	12
25	13	25	...	19	19

I want to write efficient code that runs every combination of the features through a LightGBM regressor and evaluates it on the test data set, to find the combination of features that gives the best MAE.

The output I am looking for should look like this:

Rank	Features_used	MAE
1	col1,col2,col14,col17,col18	2.40
2	col4,col5,col15,col19,col24	2.50
3	col4,col5,col15,col19,col24,col29,col18,col13	2.50
--	----	---
--	----	---
--	----	---
--	----	---
n	worst combination of features	Worst MAE

I have tried passing each combination of features individually and computing the MAE, but this is inefficient when trying out all the combinations.

import lightgbm

Predict = 'Target'
train = train[['Target','col1','col2','col3','col4','col5']]
test = test[['Target','col1','col2','col3','col4','col5']]
X_train = train[train.columns.difference([Predict])]
X_test = test[test.columns.difference([Predict])]
y_train = train[Predict]
y_test = test[Predict]
regressor = lightgbm.LGBMRegressor()
regressor = regressor.fit(X_train, y_train, eval_metric=["MAE"])
y_pred = regressor.predict(X_test)

Is there an efficient way to run all the combinations of features and rank the output by MAE?

CodePudding user response：

The first step is to build every combination of the features, keeping the "Target" column alongside every combination.
The second step is to iterate over every combination, train the model, predict, calculate the MAE, and store it in a dataframe together with the features used.

The final step is to sort the dataframe by MAE, so that the best (lowest) MAE comes first.
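
To make the first step concrete, here is a minimal sketch of how every non-empty combination can be enumerated with `itertools.combinations` from the standard library. The helper name `feature_subsets` and the three-column list are illustrative assumptions, not part of the question's data.

```python
from itertools import combinations

def feature_subsets(features):
    """Yield every non-empty subset of `features` as a list, smallest first."""
    for size in range(1, len(features) + 1):
        for combo in combinations(features, size):
            yield list(combo)

# Three features give 2**3 - 1 = 7 non-empty subsets
subsets = list(feature_subsets(["col1", "col2", "col3"]))
print(len(subsets))             # 7
print(subsets[0], subsets[-1])  # ['col1'] ['col1', 'col2', 'col3']
```

Because the helper is a generator, the subsets are produced one at a time and do not all have to be held in memory unless they are collected into a list as above.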

  from itertools import combinations

  import lightgbm
  import pandas as pd
  from sklearn.metrics import mean_absolute_error as mae

  # This function yields every non-empty combination of features for the model
  def feature_combinations(items):
      for size in range(1, len(items) + 1):
          for combo in combinations(items, size):
              yield list(combo)

  def lgbm(train, test, all_columns):
      Predict = 'Target'
      train = train[all_columns]
      test = test[all_columns]
      X_train = train[train.columns.difference([Predict])]
      X_test = test[test.columns.difference([Predict])]
      y_train = train[Predict]
      y_test = test[Predict]
      regressor = lightgbm.LGBMRegressor()
      regressor = regressor.fit(X_train, y_train, eval_metric=["MAE"])
      y_pred = regressor.predict(X_test)
      # Calculate the MAE on the test set
      mae_error = mae(y_test, y_pred)
      return mae_error

  all_columns = ['Target','col1','col2','col3','col4','col5']
  # Iterate over every combination of features, train the model,
  # get the MAE and store it together with the features used
  rows = []
  for features in feature_combinations(all_columns[1:]):
      # index 0 refers to the target column, which must be included in every combination
      columns = [all_columns[0]] + features
      error = lgbm(train, test, columns)
      rows.append({"Features_used": ",".join(features), "MAE": error})
  d = pd.DataFrame(rows, columns=["Features_used", "MAE"])
  # Rank 1 is the lowest (best) MAE; ties are ranked in order of appearance
  d['Rank'] = d['MAE'].rank(method='first').astype(int)
  d = d.sort_values("Rank")
  d = d[["Rank","Features_used","MAE"]]
  d
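
One caveat on cost: the number of non-empty combinations is 2^n - 1, so the 5 features used above give 31 model fits, while the 30 features in the question would give over a billion. The following is a minimal sketch of one possible way to keep the search tractable by capping the number of features per combination. The names `count_subsets`, `limited_subsets`, and `max_size` are illustrative assumptions and not part of the original answer, and a cap means the ranking covers only the smaller combinations.

```python
from itertools import combinations
from math import comb

def count_subsets(n_features, max_size=None):
    """Number of non-empty feature subsets with at most `max_size` features."""
    if max_size is None:
        max_size = n_features
    return sum(comb(n_features, k) for k in range(1, max_size + 1))

def limited_subsets(features, max_size):
    """Yield subsets containing between 1 and `max_size` features."""
    for size in range(1, max_size + 1):
        for combo in combinations(features, size):
            yield list(combo)

print(count_subsets(5))      # 31
print(count_subsets(30))     # 1073741823
print(count_subsets(30, 3))  # 4525
```

A generator such as `limited_subsets` could stand in for the full enumeration in the loop above; whether a cap of 3, or any other value, is appropriate depends on the data and the time available.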