I have the following df:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Cat':['tq','tb','ta','tb','ta','tq','tb','tq','ta'],
'col1':['a','a','a','b','b','c','c','c','a'],
'col2':['aa','aa','aa','aa','ba','ba','cc','cc','cc'],
'val':np.random.rand(9)})
I would like to impose the following ranking on the Cat column:
df['Cat'] = pd.Categorical(df['Cat'],['tb','tq','ta'])
However, when I try to do a group by sum:
df2 = df.groupby(['col1','Cat','col2'])['val'].sum()
I end up with a 27-row table instead of the desired 9 rows that I would get were I to omit the categorical ranking.
I understand that 27 is the product of the numbers of unique values in ['col1','Cat','col2'] (3 x 3 x 3).
I'm wondering how to prevent these unobserved combinations from appearing in the group by in the first place, instead of keeping only the rows where val != 0 afterwards.
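For reference, here is a minimal sketch of where the two row counts come from. It assumes the same data as above, with a seeded generator swapped in for np.random.rand so the run is repeatable. The names keys, n_possible and n_observed are only illustrative.

```python
import numpy as np
import pandas as pd

# Same layout as the df in the question; the seeded generator is an assumption
# made here for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame({'Cat': ['tq', 'tb', 'ta', 'tb', 'ta', 'tq', 'tb', 'tq', 'ta'],
                   'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'a'],
                   'col2': ['aa', 'aa', 'aa', 'aa', 'ba', 'ba', 'cc', 'cc', 'cc'],
                   'val': rng.random(9)})

keys = ['col1', 'Cat', 'col2']

# Every possible combination: product of the per-column unique counts.
n_possible = int(df[keys].nunique().prod())

# Combinations that actually occur in the data.
n_observed = len(df[keys].drop_duplicates())

print(n_possible, n_observed)
```

The first number is the size of the full Cartesian product of the key columns. The second is the number of key combinations actually present in the rows.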
CodePudding user response:
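You can use the observed argument of groupby. With observed=True, only the category combinations that actually occur in the data are kept in the result: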
df2 = df.groupby(['col1','Cat','col2'], observed=True)['val'].sum()
df2
# col1 Cat col2
# a tq aa 0.422378
# tb aa 0.395679
# ta aa 0.407851
# cc 0.998086
# b tb aa 0.318188
# ta ba 0.861469
# c tq ba 0.333660
# cc 0.427609
# tb cc 0.415207
# Name: val, dtype: float64
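To see the difference side by side, here is a small self-contained sketch. It assumes the same data as in the question, with a seeded generator instead of np.random.rand for repeatability. The names full and seen are just illustrative. observed=False is passed explicitly because the default for this argument has differed between pandas versions.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame; the seed is an assumption for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame({'Cat': ['tq', 'tb', 'ta', 'tb', 'ta', 'tq', 'tb', 'tq', 'ta'],
                   'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'a'],
                   'col2': ['aa', 'aa', 'aa', 'aa', 'ba', 'ba', 'cc', 'cc', 'cc'],
                   'val': rng.random(9)})
df['Cat'] = pd.Categorical(df['Cat'], ['tb', 'tq', 'ta'])

keys = ['col1', 'Cat', 'col2']

# observed=False: the result is expanded to every combination of the keys,
# and the combinations that never occur get a sum of 0.
full = df.groupby(keys, observed=False)['val'].sum()

# observed=True: only combinations present in the data are returned.
seen = df.groupby(keys, observed=True)['val'].sum()

print(len(full), len(seen))
```

The total of val is the same in both results. The only difference is whether the zero-filled placeholder rows are materialised.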