I have a Pandas DataFrame with some categorical data in one of the columns. On doing value_counts
on that particular column, I get something similar to:
HR 176
Coding 81
Reject 74
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
Medical Science 9
Core Mechanical 8
Web Development 4
Puzzles 3
behavioural 3
not a question 2
civil engineering 1
Mathematics 1
Finance, Medical Science 1
Sales, HR 1
What I'd like to do is keep only the categories with a count >= some threshold (e.g. 10). All the smaller categories should get clubbed into a separate "Other" category, i.e. the result should look like:
HR 176
Coding 81
Reject 74
*Other* 33
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
I've done this in the past by hacking together a defaultdict(int) and only keeping the entries where count >= threshold. I want to know whether there is a canonical Pandas way of achieving the same thing.
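For reference, a minimal sketch of the kind of workaround I mean (the column name 'category' and the threshold are just illustrative):

from collections import defaultdict

counts = defaultdict(int)
for value in df['category']:          # count occurrences manually
    counts[value] += 1

threshold = 10
kept = {cat: n for cat, n in counts.items() if n >= threshold}
other = sum(n for n in counts.values() if n < threshold)
if other:
    kept['Other'] = other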
CodePudding user response:
I would use a mask to perform boolean indexing and concat:
import pandas as pd

# s is the value_counts Series from the question, e.g. s = df['category'].value_counts()
m = s >= 10
out = (pd.concat([s[m], pd.Series(s[~m].sum(), index=['Others'])])
         .sort_values(ascending=False)
       )
output:
HR 176
Coding 81
Reject 74
Others 33
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
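A groupby-based variant of the same idea also works: relabel every category below the threshold as 'Others', then re-aggregate (assuming s is the same value_counts Series as above):

import numpy as np

out = (s.groupby(np.where(s >= 10, s.index, 'Others'))
         .sum()
         .sort_values(ascending=False))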
CodePudding user response:
Is this the answer you're looking for:
Pandas: Selecting rows based on value counts of a particular column
Otherwise, maybe this is what you want:
import pandas as pd

data = pd.DataFrame([["researcher", 150], ["politician", 15], ["builder", 1], ["teacher", 5]])
data.columns = ["category", "count"]
filter_value = 10

# split on the threshold; .copy() avoids SettingWithCopyWarning when adding the tag column
d1 = data[data['count'] >= filter_value].copy()
d2 = data[data['count'] < filter_value].copy()
d1["tag"] = "filter_passed"
d2["tag"] = "Others"
data = pd.concat([d1, d2])
>>> data
category count tag
0 researcher 150 filter_passed
1 politician 15 filter_passed
2 builder 1 Others
3 teacher 5 Others
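If you also want the small categories collapsed into a single "Others" row rather than just tagged, a possible follow-up (reusing the same column names) is:

summary = (data
           .assign(category=data['category'].where(data['count'] >= filter_value, 'Others'))
           .groupby('category')['count']
           .sum()
           .sort_values(ascending=False))

This leaves researcher and politician unchanged and combines builder and teacher into a single Others entry with a count of 6.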