I want to ignore the rows is the occupation has less than 2 unique names:
name value occupation
a 23 mechanic
a 24 mechanic
b 30 mechanic
c 40 mechanic
c 41 mechanic
d 30 doctor
d 20 doctor
e 70 plumber
e 71 plumber
f 30 plumber
g 50 tailor
I did:
df.groupby('ocuupation')['name'].nunique()
>>>>>>
occupation
mechanic 3
doctor 1
plumber 2
tailor 1
Name: name, dtype: int64
Is it possible to use something like df = df.drop(df[<some boolean condition>].index)
?
Desired output:
name value occupation
a 23 mechanic
a 24 mechanic
b 30 mechanic
c 40 mechanic
c 41 mechanic
e 70 plumber
e 71 plumber
f 30 plumber
CodePudding user response:
Use GroupBy.transform
with Series.ge
for get values equal of greater like 2
:
df = df[df.groupby('occupation')['name'].transform('nunique').ge(2)]
print (df)
name value occupation
0 a 23 mechanic
1 a 24 mechanic
2 b 30 mechanic
3 c 40 mechanic
4 c 41 mechanic
7 e 70 plumber
8 e 71 plumber
9 f 30 plumber
Your solution is filtered values of index in Series compared in Series.isin
:
s = df.groupby('occupation')['name'].nunique()
df = df[df['occupation'].isin(s[s.ge(2)].index)]
print (df)
name value occupation
0 a 23 mechanic
1 a 24 mechanic
2 b 30 mechanic
3 c 40 mechanic
4 c 41 mechanic
7 e 70 plumber
8 e 71 plumber
9 f 30 plumber