Select Specific Scores Based on a Criterion in Python

Time:03-04

I produced a list of feature importance scores using the code below. How do I select the indexes of the features whose scores are greater than 0.00100?

importance = rf.feature_importances_
# summarise feature importance
for i,v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
    
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

Feature: 0, Score: 0.00020
Feature: 1, Score: 0.00097
Feature: 2, Score: 0.00122
Feature: 3, Score: 0.00115
Feature: 4, Score: 0.00012

I have tried

X = importance.loc[:, importance.loc['importance'] <= 0.001]

and

X = importance[importance['Score'] > 0.00100]

but they return errors:

AttributeError: 'numpy.ndarray' object has no attribute 'loc'

and

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

respectively.

I believe I can select the original columns later, once I know the feature indexes, using:

X = X.iloc[:,[0,   3,  18,  27,  31,  32,  39,  67,  90, 114]]

Is there a better way to select them directly, rather than copying and pasting the indexes into iloc?

CodePudding user response:

With pandas you can easily select rows based on a condition. Using the sample you provided:

import numpy as np
import pandas as pd

importance = np.array([2.04432471e-04, 9.69855344e-04, 1.22306283e-03, 1.15336387e-03, 1.21430735e-04, 3.66518341e-04, 8.84830067e-04,
                       1.82450072e-03, 2.43196633e-03, 4.63633572e-04, 2.12339471e-04, 8.13088621e-04])

df = {
    'Feature': range(len(importance)),
    'Score': importance
}

df = pd.DataFrame(df)

print(df) 
# Sample DataFrame
    Feature     Score
0         0  0.000204
1         1  0.000970
...
11       11  0.000813

This line selects only the rows matching the condition "Score greater than 0.00100".

df = df[df['Score'] > 0.00100]

print(df) 
# Output
   Feature     Score
2        2  0.001223
3        3  0.001153
7        7  0.001825
8        8  0.002432
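
If only the index positions are needed, the intermediate DataFrame may not be necessary: a boolean mask on the NumPy array gives the positions directly, and they can be passed straight to iloc. The following is a minimal sketch, assuming X is a pandas DataFrame whose columns are in the same order as the importance array. The small X built here and its column names (f0 to f11) are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Importance values from the sample above (normally rf.feature_importances_).
importance = np.array([2.04432471e-04, 9.69855344e-04, 1.22306283e-03,
                       1.15336387e-03, 1.21430735e-04, 3.66518341e-04,
                       8.84830067e-04, 1.82450072e-03, 2.43196633e-03,
                       4.63633572e-04, 2.12339471e-04, 8.13088621e-04])

# Hypothetical feature matrix: 3 rows, one column per feature.
X = pd.DataFrame(np.arange(36).reshape(3, 12),
                 columns=[f"f{i}" for i in range(12)])

# Boolean mask -> integer positions of the features above the threshold.
selected = np.flatnonzero(importance > 0.001)
print(selected)  # [2 3 7 8]

# Select those columns by position, with no copy and paste of indexes.
X_selected = X.iloc[:, selected]
print(list(X_selected.columns))  # ['f2', 'f3', 'f7', 'f8']
```

The two attempts in the question fail because importance is a plain NumPy array: it has no .loc accessor and cannot be indexed with a string such as 'Score', but it does accept a boolean array such as importance > 0.001.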