How do I get the important features and eliminate the features that are not selected after performing PCA?

I have tried to perform PCA on my dataset, but I don't know how to get the important features and eliminate the features that are not selected. I have added a condition: if the data contains 10 or more features, perform PCA; otherwise, don't perform PCA.

from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x): 
    no_of_col = len(x.columns)
    percent = 90
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return selected_var
    else:
        print("Less than 10 columns found no PCA performed")
        return x
        
        
x = Perform_PCA(x)
x

CodePudding user response：

I will first review your function:

from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x): 
    no_of_col = len(x.columns) 
    percent = 90 
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return selected_var
    else:
        print("Less than 10 columns found no PCA performed")
        return x

You perform PCA only if there are ten or more columns, but in that branch your function returns selected_var, which is never defined, so the call raises a NameError.

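Also, PCA does not "select features": it transforms the input data into a lower-dimensional representation in which every new column (a principal component) is a linear combination of all the original columns, so no original column is kept or removed as such. To get the reduced data, use the array returned by pca.fit_transform(x), or call pca.transform(x) on an already fitted PCA.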
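To make that point concrete, here is a minimal sketch on made-up data (the DataFrame x_demo and its column names f0..f11 are assumptions for illustration, not your dataset). It shows that the PCA output has fewer columns and that each of them is a projection of all the original columns:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data: 100 rows, 12 numeric columns named f0..f11
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(100, 12)),
                      columns=[f"f{i}" for i in range(12)])

pca = PCA(n_components=5)
x_reduced = pca.fit_transform(x_demo)

# The result is a NumPy array with 5 columns; the original names are gone
print(x_reduced.shape)

# Each component holds one weight per original feature
print(pca.components_.shape)

# Each reduced column is the centered data projected onto one component
manual = (x_demo.to_numpy() - pca.mean_) @ pca.components_.T
print(np.allclose(manual, x_reduced))
```

Since every component mixes all twelve inputs, there is no list of "selected" columns to read off the result.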

Here is your code modified (it would be possible to optimise it further, but I tried to change it as little as possible):

from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x):
    no_of_col = len(x.columns)
    percent = 90
    # Number of components: 90% of the column count, rounded down
    my_num = int((percent/100)*no_of_col)

    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        # x_new is a NumPy array of principal components, not original columns
        x_new = pca.fit_transform(x)
        print("10 or more columns found, performing PCA")
        return x_new
    else:
        print("Fewer than 10 columns found, no PCA performed")
        return x
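If you would like to go one step further, a possible variant is sketched below. The function name perform_pca, its parameters percent and min_cols, the PC1, PC2, ... column labels, and the toy data are my own choices for illustration. It returns a DataFrame instead of a bare array, and also hands back the fitted PCA object so you can inspect it afterwards:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def perform_pca(x, percent=90, min_cols=10):
    """Return (data, fitted_pca); fitted_pca is None when PCA is skipped."""
    n_cols = x.shape[1]
    if n_cols < min_cols:
        # Too few columns: hand the data back unchanged
        return x, None
    n_components = int(percent / 100 * n_cols)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(x)
    # Label the new columns as components, since they are not original features
    names = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, columns=names, index=x.index), pca

# Hypothetical toy data with 12 columns
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(50, 12)),
                      columns=[f"f{i}" for i in range(12)])

x_out, fitted = perform_pca(x_demo)
print(x_out.shape)
print(list(x_out.columns))
```

Returning the fitted object matters if you later need to apply the same transformation to new data with fitted.transform(...).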

Hope this will help you.

CodePudding user response：

In your current code you create my_num components, but only if you have 10 or more columns.

If you want to have a look and decide for yourself how much to keep, you could modify your code:

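import pandas as pd
from sklearn.decomposition import PCA

# Fit with all components so nothing is discarded yet
pca = PCA()
x_new = pca.fit_transform(x)
# Share of the total variance captured by each principal component
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
# One row per component, one column per original feature (the loadings)
print(pd.DataFrame(pca.components_, columns=x.columns))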

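This will give you the explained variance ratio of every principal component (not of every original feature), and the pca.components_ table shows how strongly each original feature contributes to each component. From here you can set the bar for how many components to keep.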
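As a rough sketch of how that bar could be set, the snippet below keeps the smallest number of components whose cumulative explained variance ratio reaches 90%, then ranks the original features by their loadings. The toy data, the 0.90 threshold, and the "importance" score (absolute loadings weighted by explained variance ratio) are assumptions for illustration. The score is a heuristic, not something PCA itself defines:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: 3 latent signals mixed into 12 columns, plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 12))
x_demo = pd.DataFrame(latent @ mixing + 0.05 * rng.normal(size=(200, 12)),
                      columns=[f"f{i}" for i in range(12)])

pca = PCA().fit(x_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_keep)

# scikit-learn can apply the same rule itself when n_components is a
# float between 0 and 1 (with the "full" SVD solver)
auto = PCA(n_components=0.90, svd_solver="full").fit(x_demo)
print(auto.n_components_)

# Loadings of the kept components: rows are components, columns are features
loadings = pd.DataFrame(pca.components_[:n_keep], columns=x_demo.columns,
                        index=[f"PC{i + 1}" for i in range(n_keep)])

# Heuristic ranking: absolute loadings weighted by explained variance ratio
weights = pca.explained_variance_ratio_[:n_keep]
importance = loadings.abs().mul(weights, axis=0).sum().sort_values(ascending=False)
print(importance.head())
```

If the goal really is to drop original columns, a ranking like this can guide which ones to look at first, but the decision is then yours and not the output of PCA.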