I want to remove rows where specific columns unique value counts is less than some value.
Dataframe looks like that:
class reason bank_fees cash_advance community food_and_drink ... recreation service shops tax transfer travel
0 0 at&t 20.00 0.0 253.95 254.48 ... 19.27 629.34 842.77 0.0 -4089.51 121.23
1 0 at&t 0.00 0.0 0.00 319.55 ... 0.00 1327.53 656.84 -1784.0 -1333.20 79.60
2 1 entergy arkansas 80.00 0.0 3.39 3580.99 ... 612.36 3746.90 4990.33 0.0 -14402.54 888.67
3 1 entergy arkansas 0.00 0.0 0.00 37.03 ... 0.00 405.24 47.34 0.0 -400.01 41.12
4 1 entergy arkansas 0.00 0.0 0.00 250.18 ... 0.00 123.48 54.28 0.0 -270.15 87.00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6659 0 t-mobile 0.00 0.0 0.00 0.00 ... 0.00 0.00 50.00 0.0 -253.74 284.44
6660 0 optimum 0.00 -30.0 108.63 158.67 ... 10.11 7098.23 2657.95 0.0 -12641.89 3011.04
6661 0 optimum 0.00 0.0 0.00 267.86 ... 0.00 2459.41 1939.35 0.0 -5727.50 212.06
6662 0 state farm insurance 0.00 0.0 0.00 80.91 ... 25.00 130.27 195.42 0.0 -1189.71 48.79
6663 0 verizon 39.97 0.0 0.00 0.00 ... 0.00 110.00 0.00 0.0 0.00 0.00
[6664 rows x 15 columns]
this is the counts of the column reason
at&t 724
verizon 544
geico 341
t-mobile 309
state farm insurance 135
...
town of smyrna 1
city of hendersonville 1
duke energy 1
pima medical institute 1
gieco 1
Name: reason, Length: 649, dtype: int64
the important column there is the reason
. for example, if the unique value count is less than 5 I want to remove those rows. How can I do that? Thanks
CodePudding user response:
You can try to get the index of value counts where value is below 5
and use isin
to filter out these value
out = df[~df['reason'].isin(df['reason'].value_counts().lt(5).pipe(lambda s: s[s].index))]
To elaborate each step usage
out = df[~df['reason'].isin(
df['reason'].value_counts() # get each value count
.lt(5) # mask value lower than 5
.pipe(lambda s: s[s].index) # get the index of value which is lower than 5
)] # if value is not in the index, keep it