Home > Software engineering >  Pandas Groupby: selection where both subgroups exist
Pandas Groupby: selection where both subgroups exist

Time:10-13

My DataFrame has the following form.

id group color
i1 aa    white
i1 aa    white
i1 ab    white
i1 ab    black
...

I apply a groupby as following:

groupdf = df.groupby(['id', 'group'])['color'].value_counts()

The result of the groupby has an multiindex.

               value
id group color 

i1  aa   white  2
i1  ab   white  1
i1  ab   black  3
i1  ac   black  5
i1  ad   white  4
i1  ad   black  5

i2  aa   white  1
i2  aa   black  1
i2  bb   black  1
i2  cc   white  2
i2  cc   black  6
i2  ad   black  5  

My goal is to

  1. select entries where both categories of the last index color are present and then
  2. select the group with max value of the black color So the result would look like:
    value
id

i1  5    #only groups ab and ad have both colors; ad.black = 5 > ab.black = 3
i2  6    #only groups aa and cc have both colors; cc.black = 6 > aa.black = 1

I have tried .xs() as well as .index.get_level_values() but I am not able to achieve my goal.

EDIT 1: I see that I have provided poor information how I got the DataFrame and updated it above. I can't directly plugin .max() as the original df does not have value column.

CodePudding user response:

Let's try:

# mask the groups with more than one colors
s = df.groupby(['id','group'])['value'].transform('size') > 1


# boolean index the groups and query, then another groupby with max
df[s].query('color=="black"').groupby(['id','color'])['value'].max()

Output:

id  color
i1  black    5
i2  black    6
Name: value, dtype: int64
  • Related