I have a structure like this:
Data_group Data Value
Group_x A 12
Group_x A 13
Group_x B 3
Group_x C 3
Group_x C 32
Group_x C 23
Group_y A 8
Group_y A 7
Group_y B 13
Group_y C 12
Group_y C 13
Group_y C 66
I would like to manipulate the Data column in order to have this as the output:
Data_group Data Value
Group_x A[0] 12
Group_x A[1] 13
Group_x B 3
Group_x C[0] 3
Group_x C[1] 32
Group_x C[2] 23
Group_y A[0] 8
Group_y A[1] 7
Group_y B 13
Group_y C[0] 12
Group_y C[1] 13
Group_y C[2] 66
Note that for every Group the index is reset and when there is just one entry of that Data there is no suffix. The Value column is not involved in the restructure. I have tried to exploit this:
mask = df['Data'].duplicated(keep=False)
df['Data'] = mask.cumcount().add(1).astype(str).radd('_').mask(df['Data'].transform('count')==1,'')
It will number the Data duplicate but the suffix will increment regardless of the Data Group. I started to do it with nested if and for but it appears cumbersome and not functional. Is there a clean way to use Pandas method have that result?
CodePudding user response:
You can use a single groupby
with both Data_group/Data columns. Compute the cumcount
as string (with brackets) and add only to groups that have more than one element (transform('size').gt(1)
):
g = df.groupby(['Data_group', 'Data'])
df.loc[g['Data'].transform('size').gt(1),
'Data'] = '[' g.cumcount().astype(str) ']'
output:
Data_group Data Value
0 Group_x A[0] 12
1 Group_x A[1] 13
2 Group_x B 3
3 Group_x C[0] 3
4 Group_x C[1] 32
5 Group_x C[2] 23
6 Group_y A[0] 8
7 Group_y A[1] 7
8 Group_y B 13
9 Group_y C[0] 12
10 Group_y C[1] 13
11 Group_y C[2] 66
intermediates:
'[' g.cumcount().astype(str) ']'
0 [0]
1 [1]
2 [0]
3 [0]
4 [1]
5 [2]
6 [0]
7 [1]
8 [0]
9 [0]
10 [1]
11 [2]
dtype: object
g['Data'].transform('size').gt(1)
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 True
10 True
11 True
Name: Data, dtype: bool