Run and rank all combinations of features for a machine learning model

Time:12-28

I have a train and a test data set, each containing 30 independent features and 1 target feature.

All the features are numerical variables. An example of the train data set is shown below. The test data set has the same columns.

Target col1 col2 ... col29 col30
20 12 14 ... 15 12
25 13 25 ... 19 19

I want to write efficient code that trains a LightGBM regressor on every combination of the features and evaluates it on the test data set, in order to find the combination of features that gives the best MAE.

The output I am looking for should look like this:

Rank Features_used MAE
1 col1,col2,col14,col17,col18 2.40
2 col4,col5,col15,col19,col24 2.50
3 col4,col5,col15,col19,col24,col29,col18,col13 2.50
-- ---- ---
-- ---- ---
-- ---- ---
-- ---- ---
n worst combination of features Worst MAE

I have tried passing each combination of features individually and computing the MAE, but this is inefficient when trying out all the combinations.

import lightgbm

Predict = 'Target'
train = train[['Target','col1','col2','col3','col4','col5']]
test = test[['Target','col1','col2','col3','col4','col5']]
X_train = train[train.columns.difference([Predict])]
X_test = test[test.columns.difference([Predict])]
y_train = train[Predict]
y_test = test[Predict]
regressor = lightgbm.LGBMRegressor()
regressor = regressor.fit(X_train, y_train, eval_metric=["MAE"])
y_pred = regressor.predict(X_test)

Is there an efficient way to run all the combinations of features and rank the output by MAE?

Answer:

  • The first step is to generate every combination of the features, keeping the "Target" column in every combination.

  • The second step is to iterate over every combination: train, predict, calculate the MAE, and store it in a dataframe along with the features used.

  • The final step is to sort the dataframe by MAE.

      from itertools import compress, product

      import lightgbm
      import pandas as pd
      from sklearn.metrics import mean_absolute_error as mae

      # This function yields every subset of the features (the empty subset comes first)
      def combinations(items):
          return (list(compress(items, mask)) for mask in product([0, 1], repeat=len(items)))

      def lgbm(train, test, all_columns):
          Predict = 'Target'
          train = train[all_columns]
          test = test[all_columns]
          X_train = train[train.columns.difference([Predict])]
          X_test = test[test.columns.difference([Predict])]
          y_train = train[Predict]
          y_test = test[Predict]
          regressor = lightgbm.LGBMRegressor()
          regressor = regressor.fit(X_train, y_train, eval_metric=["MAE"])
          y_pred = regressor.predict(X_test)
          # Calculate the MAE on the test set
          mae_error = mae(y_test, y_pred)
          return mae_error

      all_columns = ['Target','col1','col2','col3','col4','col5']
      features = all_columns[1:]  # index 0 is the target column
      # Iterate over every combination of features, train the model,
      # and record the MAE together with the features used
      rows = []
      combi_col = list(combinations(features))[1:]  # starting from index 1 to drop the empty list
      for columns in combi_col:
          # the target column must always be included in every combination
          error = lgbm(train, test, ['Target'] + columns)
          rows.append({"Features_used": ",".join(columns), "MAE": error})
      d = pd.DataFrame(rows, columns=["Features_used", "MAE"])
      # Rank 1 is the lowest (best) MAE; ties are broken by order of appearance
      d['Rank'] = d['MAE'].rank(method='first', ascending=True).astype(int)
      d = d.sort_values("Rank")
      d = d[["Rank","Features_used","MAE"]]
      d
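
Note that with the 30 features from the question, trying every non-empty combination means 2^30 - 1 = 1,073,741,823 model fits, which is unlikely to be practical. One possible workaround, sketched below, is to cap the number of features per combination. This is only an illustration under stated assumptions: `count_subsets`, `rank_subsets`, `max_size` and `score_fn` are hypothetical names that do not appear in the code above, and `score_fn` stands in for whatever function returns the MAE for a list of feature names.

```python
from itertools import combinations
from math import comb


def count_subsets(n_features, max_size):
    """Number of non-empty feature subsets with at most max_size features."""
    return sum(comb(n_features, k) for k in range(1, max_size + 1))


def rank_subsets(features, score_fn, max_size):
    """Score every subset of up to max_size features; lowest score ranks first.

    score_fn is a hypothetical callable (an assumption of this sketch): it
    takes a list of feature names and returns an error such as the test MAE.
    """
    results = []
    for k in range(1, max_size + 1):
        for subset in combinations(features, k):
            results.append((score_fn(list(subset)), subset))
    # Sort by score only; the sort is stable, so ties keep enumeration order
    results.sort(key=lambda item: item[0])
    return [
        (rank, ",".join(subset), score)
        for rank, (score, subset) in enumerate(results, start=1)
    ]


# How many model fits each cap would require for 30 features
for cap in (2, 3, 5):
    print(cap, count_subsets(30, cap))
```

To plug this into the answer's code, `score_fn` could be something like `lambda cols: lgbm(train, test, ['Target'] + cols)`, and the returned tuples can be passed to `pd.DataFrame(..., columns=["Rank", "Features_used", "MAE"])`. Capping the subset size trades completeness for run time: the best combination overall may contain more features than the cap allows.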
    