Home > OS >  Add suffixes to duplicate (column) cell values in Pandas depending on another column value (category
Add suffixes to duplicate (column) cell values in Pandas depending on another column value (category

Time:05-31

I have a structure like this:

   Data_group Data Value
    Group_x    A     12
    Group_x    A     13
    Group_x    B     3
    Group_x    C     3
    Group_x    C     32
    Group_x    C     23
    Group_y    A     8
    Group_y    A     7
    Group_y    B     13
    Group_y    C     12 
    Group_y    C     13
    Group_y    C     66

I would like to manipulate the Data column in order to have this as the output:

   Data_group Data  Value
    Group_x    A[0]   12
    Group_x    A[1]   13
    Group_x    B      3
    Group_x    C[0]   3
    Group_x    C[1]   32
    Group_x    C[2]   23
    Group_y    A[0]   8
    Group_y    A[1]   7
    Group_y    B      13
    Group_y    C[0]   12 
    Group_y    C[1]   13
    Group_y    C[2]   66    

Note that for every Group the index is reset and when there is just one entry of that Data there is no suffix. The Value column is not involved in the restructure. I have tried to exploit this:

mask = df['Data'].duplicated(keep=False)
df['Data']  = mask.cumcount().add(1).astype(str).radd('_').mask(df['Data'].transform('count')==1,'') 

It will number the Data duplicate but the suffix will increment regardless of the Data Group. I started to do it with nested if and for but it appears cumbersome and not functional. Is there a clean way to use Pandas method have that result?

CodePudding user response:

You can use a single groupby with both Data_group/Data columns. Compute the cumcount as string (with brackets) and add only to groups that have more than one element (transform('size').gt(1)):

g = df.groupby(['Data_group', 'Data'])
df.loc[g['Data'].transform('size').gt(1),
       'Data']  = '[' g.cumcount().astype(str) ']'

output:

   Data_group  Data  Value
0     Group_x  A[0]     12
1     Group_x  A[1]     13
2     Group_x     B      3
3     Group_x  C[0]      3
4     Group_x  C[1]     32
5     Group_x  C[2]     23
6     Group_y  A[0]      8
7     Group_y  A[1]      7
8     Group_y     B     13
9     Group_y  C[0]     12
10    Group_y  C[1]     13
11    Group_y  C[2]     66

intermediates:

'[' g.cumcount().astype(str) ']'

0     [0]
1     [1]
2     [0]
3     [0]
4     [1]
5     [2]
6     [0]
7     [1]
8     [0]
9     [0]
10    [1]
11    [2]
dtype: object

g['Data'].transform('size').gt(1)

0      True
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8     False
9      True
10     True
11     True
Name: Data, dtype: bool
  • Related