I have a dataframe as below:
A B code cumul_sum
group1 group1_1 A 1
group1 group1_1 A 2
group1 group1_1 B 1
group1 group1_1 A 1
group1 group1_1 A 2
group1 group1_1 A 3
group2 group2_1 A 1
group2 group2_1 A 2
group2 group2_1 A 3
Please assume column A and B are belonging to same category. I wanted to group and the expected result are as follows
A B code cumul_sum **groupcat**
group1 group1_1 A 1 group1
group1 group1_1 A 2 group1
group1 group1_1 B 1 group2
group1 group1_1 A 1 group3
group1 group1_1 A 2 group3
group1 group1_1 A 3 group3
group2 group2_1 A 1 group1
group2 group2_1 A 2 group1
group2 group2_1 A 3 group1
Since there is a code B that occurred in the middle, group1 is not assigned to the third row and then the next value is a fresh categorical value.
EDIT: To also reset the group-cat for group A and B.
Please advice
CodePudding user response:
You can modify previous solution:
#tested consecutive values by code column
s = df['code'].ne(df['code'].shift()).cumsum()
df['groupcat'] = 'group' s.astype(str)
print (df)
A B code cumul_sum groupcat
0 group1 group1_1 A 1 group1
1 group1 group1_1 A 2 group1
2 group1 group1_1 B 1 group2
3 group1 group1_1 A 1 group3
4 group1 group1_1 A 2 group3
5 group1 group1_1 A 3 group3
#tested consecutive groups by A, B, code
s = df[['A','B','code']].ne(df[['A','B','code']].shift()).any(axis=1).cumsum()
df['groupcat'] = 'group' s.astype(str)
print (df)
A B code cumul_sum groupcat
0 group1 group1_1 A 1 group1
1 group1 group1_1 A 2 group1
2 group1 group1_1 B 1 group2
3 group1 group1_1 A 1 group3
4 group1 group1_1 A 2 group3
5 group1 group1_1 A 3 group3
EDIT:
s = df['code'].ne(df['code'].shift()).cumsum()
df['group'] = ('group' df.assign(s=s)
.groupby(['A','B'])['s']
.rank('dense').astype(int).astype(str))
print (df)
A B code cumul_sum group
0 group1 group1_1 A 1 group1
1 group1 group1_1 A 2 group1
2 group1 group1_1 B 1 group2
3 group1 group1_1 A 1 group3
4 group1 group1_1 A 2 group3
5 group1 group1_1 A 3 group3
6 group2 group2_1 A 1 group1
7 group2 group2_1 A 2 group1
8 group2 group2_1 A 3 group1
Alternative solution:
s = (df[['A','B','code']].ne(df[['A','B','code']].shift()).any(axis=1)
.groupby([df.A, df.B]).cumsum())
df['groupcat'] = 'group' s.astype(str)
print (df)
A B code cumul_sum groupcat
0 group1 group1_1 A 1 group1
1 group1 group1_1 A 2 group1
2 group1 group1_1 B 1 group2
3 group1 group1_1 A 1 group3
4 group1 group1_1 A 2 group3
5 group1 group1_1 A 3 group3
6 group2 group2_1 A 1 group1
7 group2 group2_1 A 2 group1
8 group2 group2_1 A 3 group1
CodePudding user response:
Since your groups ignore the code, you can just restart on each 1 value:
df['groupcat'] = df['cumul_sum'].eq(1).cumsum()
As string:
df['groupcat'] = df['cumul_sum'].eq(1).cumsum().astype(str).radd('group')
output:
A B code cumul_sum groupcat
0 group1 group1_1 A 1 group1
1 group1 group1_1 A 2 group1
2 group1 group1_1 B 1 group2
3 group1 group1_1 A 1 group3
4 group1 group1_1 A 2 group3
5 group1 group1_1 A 3 group3