I have a Pandas DataFrame with some categorical data in one of the columns. On doing value_counts
on that particular column, I get something similar to:
HR 176
Coding 81
Reject 74
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
Medical Science 9
Core Mechanical 8
Web Development 4
Puzzles 3
behavioural 3
not a question 2
civil engineering 1
Mathematics 1
Finance, Medical Science 1
Sales, HR 1
What I'd like to do is keep only the categories with a count >= some threshold (e.g. 10). All the smaller categories should get clubbed into a separate "Other" category, i.e. the result should look like:
HR 176
Coding 81
Reject 74
*Other* 33
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
I've done this in the past by hacking together a defaultdict(int) and only keeping the entries where count >= threshold. I want to know whether there is a canonical Pandas way of achieving the same thing.
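For reference, a minimal sketch of the kind of workaround I mean (the column name 'category' and the threshold are just illustrative):

from collections import defaultdict

counts = defaultdict(int)
for value in df['category']:          # count occurrences manually
    counts[value] += 1

threshold = 10
kept = {cat: n for cat, n in counts.items() if n >= threshold}
other = sum(n for n in counts.values() if n < threshold)
if other:
    kept['Other'] = other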
CodePudding user response:
I would use a mask to perform boolean indexing and concat:
import pandas as pd

# s is the value_counts Series from the question, e.g. s = df['category'].value_counts()
m = s >= 10
out = (pd.concat([s[m], pd.Series(s[~m].sum(), index=['Others'])])
         .sort_values(ascending=False)
       )
output:
HR 176
Coding 81
Reject 74
Others 33
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
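A groupby-based variant of the same idea also works: relabel every category below the threshold as 'Others', then re-aggregate (assuming s is the same value_counts Series as above):

import numpy as np

out = (s.groupby(np.where(s >= 10, s.index, 'Others'))
         .sum()
         .sort_values(ascending=False))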
CodePudding user response:
Is this the answer you're looking for:
Pandas: Selecting rows based on value counts of a particular column
Otherwise, maybe this is what you want:
import pandas as pd

data = pd.DataFrame([["researcher", 150], ["politician", 15], ["builder", 1], ["teacher", 5]])
data.columns = ["category", "count"]
filter_value = 10

# split on the threshold; .copy() avoids SettingWithCopyWarning when adding the tag column
d1 = data[data['count'] >= filter_value].copy()
d2 = data[data['count'] < filter_value].copy()
d1["tag"] = "filter_passed"
d2["tag"] = "Others"
data = pd.concat([d1, d2])
>>> data
category count tag
0 researcher 150 filter_passed
1 politician 15 filter_passed
2 builder 1 Others
3 teacher 5 Others
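If you also want the small categories collapsed into a single "Others" row rather than just tagged, a possible follow-up (reusing the same column names) is:

summary = (data
           .assign(category=data['category'].where(data['count'] >= filter_value, 'Others'))
           .groupby('category')['count']
           .sum()
           .sort_values(ascending=False))

This leaves researcher and politician unchanged and combines builder and teacher into a single Others entry with a count of 6.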