Home > Enterprise >  Aggregate group by response based on certain count (or size) Pandas
Aggregate group by response based on certain count (or size) Pandas

Time:02-22

I am looking to create a sum based on certain values obtained after a groupby count (or size). I have created a mock DataFrame and the desired output bellow. It should be self explanatory from the example what I am looking for. I checked quite a bit but it seems there is no straight answer.

     
data = {'col1' : ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C','C','C','C','C','C','C','C'], 'col2' :[ 'A', 'B', 'C', 'B', 'A', 'B', 'C', 'A', 'C', 'B', 'B', 'C', 'A','B','A','A','A','B','C','C']}
data = pd.DataFrame(data)

data.groupby(['col1', 'col2'])['col2'].count()

The output for this count is:

A     A       2
      B       2
      C       1
B     A       1
      B       3
      C       2
C     A       4
      B       2
      C       3

I would like to do a further calculation on this output and get:

A     A       2
    (B C)     3
B   (A C)     3
      B       3
C   (A B)     6
      C       3

CodePudding user response:

You could create dummy columns and groupby using those columns:

out = (data
       .assign(match=data['col1']==data['col2'], count=1)
       .groupby(['col1','match'], as_index=False)
       .agg({'col2': lambda x: ' '.join(x.unique()), 'count':'sum'})
       .drop(columns='match'))

Output:

  col1 col2  count
0    A  B C      3
1    A    A      2
2    B  C A      3
3    B    B      3
4    C  A B      6
5    C    C      3
  • Related