I have produced a list of feature importance scores using the code below. How do I select the indexes of the features whose scores are greater than 0.00100?
from matplotlib import pyplot
# rf is the already-fitted model (not shown here)
importance = rf.feature_importances_
# summarise feature importance
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()
Feature: 0, Score: 0.00020
Feature: 1, Score: 0.00097
Feature: 2, Score: 0.00122
Feature: 3, Score: 0.00115
Feature: 4, Score: 0.00012
I have tried
X = importance.loc[:, importance.loc['importance'] <= 0.001]
and
X = importance[importance['Score'] > 0.00100]
but these raise the following errors:
AttributeError: 'numpy.ndarray' object has no attribute 'loc'
and
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
respectively.
I believe I can select the original columns later, once I know the feature indexes, using:
X = X.iloc[:,[0, 3, 18, 27, 31, 32, 39, 67, 90, 114]]
I expect there is a better way to select them directly, rather than copying and pasting the indexes into iloc.
Answer:
With pandas, you can easily select rows based on a condition. Based on the sample you provided:
import numpy as np
import pandas as pd
importance = np.array([2.04432471e-04, 9.69855344e-04, 1.22306283e-03, 1.15336387e-03, 1.21430735e-04, 3.66518341e-04, 8.84830067e-04,
                       1.82450072e-03, 2.43196633e-03, 4.63633572e-04, 2.12339471e-04, 8.13088621e-04])
# build a DataFrame with one row per feature
df = pd.DataFrame({
    'Feature': range(len(importance)),
    'Score': importance
})
print(df)
# Sample DataFrame
    Feature     Score
0         0  0.000204
1         1  0.000970
...
11       11  0.000813
This line selects only the rows matching the condition "Score greater than 0.00100".
df = df[df['Score'] > 0.00100]
print(df)
# Output
   Feature     Score
2        2  0.001223
3        3  0.001153
7        7  0.001825
8        8  0.002432
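To go from the scores straight to the matching columns of X without copying indexes by hand, a boolean mask on the NumPy array may be enough. The following is a minimal sketch, not a definitive implementation: the small X built here is a hypothetical stand-in for the original feature matrix, and it assumes the columns of X are in the same order as the importance scores.

```python
import numpy as np
import pandas as pd

# The twelve importance scores from the sample above.
importance = np.array([2.04432471e-04, 9.69855344e-04, 1.22306283e-03, 1.15336387e-03,
                       1.21430735e-04, 3.66518341e-04, 8.84830067e-04, 1.82450072e-03,
                       2.43196633e-03, 4.63633572e-04, 2.12339471e-04, 8.13088621e-04])

# Hypothetical stand-in for the original X: 3 rows, one column per feature.
X = pd.DataFrame(np.arange(36).reshape(3, 12),
                 columns=[f"f{i}" for i in range(12)])

threshold = 0.001
mask = importance > threshold        # boolean array, one entry per feature
selected_idx = np.flatnonzero(mask)  # integer positions where the mask is True

# Select the matching columns by position, with no hand-copied index list.
X_selected = X.iloc[:, selected_idx]

print(selected_idx.tolist())      # [2, 3, 7, 8]
print(list(X_selected.columns))   # ['f2', 'f3', 'f7', 'f8']
```

With the real model, the same idea would presumably read X.iloc[:, np.flatnonzero(rf.feature_importances_ > 0.001)]. This also explains the errors in the question: importance is a plain NumPy array, so it has no .loc and cannot be indexed by a column name such as 'Score', but it does accept a boolean mask.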