I'm trying to drop columns that have too many missing values. How can I count the occurrences of specific values within each column? The missing values are represented by 99 or 90.
Here is the code that is supposed to drop the columns that exceed the threshold value:
threshold = 0.6
data = data[data.columns[[data.column == 90 or data.column == 99].count().mean() < threshold]]
I'm not very familiar with pandas, so any suggestions would be helpful.
CodePudding user response:
You're almost there. Use apply:
threshold = 0.6
# keep the rows whose fraction of 90/99 values is below the threshold
out = data[data.apply(lambda s: s.isin([90, 99])).mean(axis=1).lt(threshold)]
Example input:
    0   1   2   3   4
0   0  90   0   0   0
1   0   0   0   0   0
2   0  90   0  99   0
3  90   0   0   0   0
4  99  99   0  90  99  # to drop
5  99   0   0   0  99
6   0   0  99   0  90
7   0  90  99   0  90  # to drop
8  99  90   0  90   0  # to drop
9   0  99   0   0   0
Output:
    0   1   2   3   4
0   0  90   0   0   0
1   0   0   0   0   0
2   0  90   0  99   0
3  90   0   0   0   0
5  99   0   0   0  99
6   0   0  99   0  90
9   0  99   0   0   0
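Note that averaging along axis 1 gives one fraction per row, so the snippet above filters rows, as the example shows. Since the question asks about columns, here is a minimal sketch of the column-wise variant. It assumes the same example frame; the helper name drop_sparse_columns and its parameter names are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Example frame from above; 90 and 99 stand for missing values.
data = pd.DataFrame([
    [0, 90, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 90, 0, 99, 0],
    [90, 0, 0, 0, 0],
    [99, 99, 0, 90, 99],
    [99, 0, 0, 0, 99],
    [0, 0, 99, 0, 90],
    [0, 90, 99, 0, 90],
    [99, 90, 0, 90, 0],
    [0, 99, 0, 0, 0],
])


def drop_sparse_columns(df, missing_codes=(90, 99), threshold=0.6):
    """Hypothetical helper: keep columns whose share of missing codes is below threshold."""
    # isin gives a boolean frame; mean() with the default axis=0
    # yields one fraction per column.
    missing_share = df.isin(list(missing_codes)).mean()
    # Boolean mask on the column axis selects the columns to keep.
    return df.loc[:, missing_share < threshold]


# Fraction of 90/99 values in each column, useful for choosing a threshold.
print(data.isin([90, 99]).mean())

out = drop_sparse_columns(data, threshold=0.6)
print(out)
```

The comparison is strict, so a column whose share of missing codes equals the threshold is dropped, mirroring the lt(threshold) call in the row-wise version.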