I am looking to create a sum based on certain values obtained after a groupby
count (or size). I have created a mock DataFrame and the desired output bellow. It should be self explanatory from the example what I am looking for. I checked quite a bit but it seems there is no straight answer.
data = {'col1' : ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C','C','C','C','C','C','C','C'], 'col2' :[ 'A', 'B', 'C', 'B', 'A', 'B', 'C', 'A', 'C', 'B', 'B', 'C', 'A','B','A','A','A','B','C','C']}
data = pd.DataFrame(data)
data.groupby(['col1', 'col2'])['col2'].count()
The output for this count is:
A A 2
B 2
C 1
B A 1
B 3
C 2
C A 4
B 2
C 3
I would like to do a further calculation on this output and get:
A A 2
(B C) 3
B (A C) 3
B 3
C (A B) 6
C 3
CodePudding user response:
You could create dummy columns and groupby
using those columns:
out = (data
.assign(match=data['col1']==data['col2'], count=1)
.groupby(['col1','match'], as_index=False)
.agg({'col2': lambda x: ' '.join(x.unique()), 'count':'sum'})
.drop(columns='match'))
Output:
col1 col2 count
0 A B C 3
1 A A 2
2 B C A 3
3 B B 3
4 C A B 6
5 C C 3