Home > Software engineering >  Pandas Dataframes Remove rows by unique count of values
Pandas Dataframes Remove rows by unique count of values

Time:08-14

I want to remove rows where specific columns unique value counts is less than some value.

Dataframe looks like that:

      class                reason  bank_fees  cash_advance  community  food_and_drink  ...  recreation  service    shops     tax  transfer   travel
0         0                  at&t      20.00           0.0     253.95          254.48  ...       19.27   629.34   842.77     0.0  -4089.51   121.23
1         0                  at&t       0.00           0.0       0.00          319.55  ...        0.00  1327.53   656.84 -1784.0  -1333.20    79.60
2         1      entergy arkansas      80.00           0.0       3.39         3580.99  ...      612.36  3746.90  4990.33     0.0 -14402.54   888.67
3         1      entergy arkansas       0.00           0.0       0.00           37.03  ...        0.00   405.24    47.34     0.0   -400.01    41.12
4         1      entergy arkansas       0.00           0.0       0.00          250.18  ...        0.00   123.48    54.28     0.0   -270.15    87.00
...     ...                   ...        ...           ...        ...             ...  ...         ...      ...      ...     ...       ...      ...
6659      0              t-mobile       0.00           0.0       0.00            0.00  ...        0.00     0.00    50.00     0.0   -253.74   284.44
6660      0               optimum       0.00         -30.0     108.63          158.67  ...       10.11  7098.23  2657.95     0.0 -12641.89  3011.04
6661      0               optimum       0.00           0.0       0.00          267.86  ...        0.00  2459.41  1939.35     0.0  -5727.50   212.06
6662      0  state farm insurance       0.00           0.0       0.00           80.91  ...       25.00   130.27   195.42     0.0  -1189.71    48.79
6663      0               verizon      39.97           0.0       0.00            0.00  ...        0.00   110.00     0.00     0.0      0.00     0.00

[6664 rows x 15 columns]

this is the counts of the column reason

at&t                      724
verizon                   544
geico                     341
t-mobile                  309
state farm insurance      135
                         ... 
town of smyrna              1
city of hendersonville      1
duke energy                 1
pima medical institute      1
gieco                       1
Name: reason, Length: 649, dtype: int64

the important column there is the reason. for example, if the unique value count is less than 5 I want to remove those rows. How can I do that? Thanks

CodePudding user response:

You can try to get the index of value counts where value is below 5 and use isin to filter out these value

out = df[~df['reason'].isin(df['reason'].value_counts().lt(5).pipe(lambda s: s[s].index))]

To elaborate each step usage

out = df[~df['reason'].isin(
    df['reason'].value_counts() # get each value count
    .lt(5)                      # mask value lower than 5
    .pipe(lambda s: s[s].index) # get the index of value which is lower than 5
)]                              # if value is not in the index, keep it
  • Related