Here I have tried to perform PCA on my dataset, but I have no idea how to get the important features and eliminate the features that are not selected. I have added a condition: if the data contains more than 10 features, perform PCA; otherwise, don't perform PCA.
from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x):
    no_of_col = len(x.columns)
    percent = 90
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return selected_var
    else:
        print("Less than 10 columns found no PCA performed")
        return x
x = Perform_PCA(x)
x
CodePudding user response:
I will first review your function:
from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x):
    no_of_col = len(x.columns)
    percent = 90
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return selected_var
    else:
        print("Less than 10 columns found no PCA performed")
        return x
You are performing PCA only if there are ten or more columns (the check is no_of_col >= 10), but in that branch your function returns selected_var, which is never defined, so the call fails with a NameError.
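Also, PCA does not "select features": it transforms the input data into a lower-dimensional representation in which every new column (principal component) is a linear combination of all the original columns. If you want fewer columns, use the transformed data, which is what pca.fit_transform(x) returns (or pca.transform(x) once the PCA has been fitted).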
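To make that distinction concrete, here is a small sketch on synthetic data (the array, the column names f0..f11 and the choice of 3 components are made up for illustration). It shows that the PCA output has fewer columns, but each of them carries a weight for every original column, so no original column is "kept" or "dropped".

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data for illustration only: 50 rows, 12 hypothetical feature columns.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(
    rng.normal(size=(50, 12)),
    columns=[f"f{i}" for i in range(12)],
)

pca_demo = PCA(n_components=3)
x_reduced = pca_demo.fit_transform(x_demo)

# The result is a NumPy array with one column per principal component.
print(x_reduced.shape)

# components_ has one row per component and one column per original feature,
# i.e. every component has a weight for every original column.
weights = pd.DataFrame(pca_demo.components_, columns=x_demo.columns)
print(weights.shape)
```

Note that the original column names are gone after the transformation; the new columns are usually just referred to as the first, second, third component, and so on.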
Here is your code modified (it would be possible to optimise it further, but I tried to change it as little as possible):
from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x):
    no_of_col = len(x.columns)
    percent = 90
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return x_new
    else:
        print("Less than 10 columns found no PCA performed")
        return x
Hope this will help you.
CodePudding user response:
In your current code you create my_num components (90% of the number of columns), but only if you have 10 or more columns.
If you want to have a look and choose how much to keep yourself, you could modify your code:
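import pandas as pd
from sklearn.decomposition import PCA

pca = PCA()  # no n_components given, so all components are kept
x_new = pca.fit_transform(x)
# share of the total variance explained by each principal component
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
# one row per component, one column per original feature
print(pd.DataFrame(pca.components_,columns=x.columns))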
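This will give you the explained variance ratio for every principal component (not for every original feature), and the printed table shows how much each original feature contributes to each component. From here you can set the bar for how many components should be kept.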
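One way to decide how many components to keep is to take the smallest number whose cumulative explained variance reaches a threshold. The sketch below assumes synthetic data and a 90% threshold, both chosen only for illustration. It also uses the fact that scikit-learn's PCA accepts a float between 0 and 1 as n_components and then picks the number of components itself (svd_solver="full" is passed explicitly here to be safe across versions).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data for illustration only: 12 correlated columns built from
# 4 underlying signals plus a little noise.
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 4))
mixing = rng.normal(size=(4, 12))
noise = 0.05 * rng.normal(size=(200, 12))
x_demo = pd.DataFrame(
    signals @ mixing + noise,
    columns=[f"f{i}" for i in range(12)],
)

# Fit with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(x_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%.
threshold = 0.90  # assumed threshold, adjust to your needs
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print("components needed:", n_keep)

# Same idea, letting scikit-learn choose the number of components.
pca_90 = PCA(n_components=threshold, svd_solver="full")
x_reduced = pca_90.fit_transform(x_demo)
print("components chosen by PCA:", pca_90.n_components_)
print("reduced shape:", x_reduced.shape)
```

This differs from int(0.9 * no_of_col) in the question: that keeps 90% of the column count, whereas the threshold above keeps however many components are needed to explain 90% of the variance, which can be far fewer when the columns are correlated.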