Python Groupby on String values-CodePudding

Below is the input data

A      B       C       D
R1     J1      D1     S1,S2
R1     J1      D1     S3,S4,S5 
R1     J1      D2     S5,S6,S2
R1     J1      D2     S7,S8 
P1     J2      E1     T1,T2 
P1     J2      E1     T3,T4,T5
P1     J2      E2     T5,T6,T2
P1     J2      E3     T7,T8,T5

Expected output with no repeated values in column D :

A     B        C             D
R1    J1      D1,D2         S1,S2,S3,S4,S5,S6,S7,S8
P1    J2      E1,E2,E3      T1,T2,T3,T4,T5,T6,T7,T8

The script which I have tried and that not working

df.groupby(['A','B'])[['C','D']].agg([','.join]).reset_index()

CodePudding user response：

Try using str.split to split on the commas, then use explode to flatten the Series and use unique to remove the duplicates:

df.groupby(['A','B'])[['C','D']].agg(lambda x: ','.join(x.str.split(',').explode().unique())).reset_index()

Output:

    A   B         C                        D
0  P1  J2  E1,E2,E3  T1,T2,T3,T4,T5,T6,T7,T8
1  R1  J1     D1,D2  S1,S2,S3,S4,S5,S6,S7,S8

CodePudding user response：

Use lambda function with remove duplicates by convert splitted values to dictionaries by dict.fromkeys in GroupBy.agg:

f = lambda x: ','.join(dict.fromkeys(z for y in x for z in y.split(',')))
df = df.groupby(['A','B'], sort=False)[['C','D']].agg(f).reset_index()
print (df)
    A   B         C                        D
0  R1  J1     D1,D2  S1,S2,S3,S4,S5,S6,S7,S8
1  P1  J2  E1,E2,E3  T1,T2,T3,T4,T5,T6,T7,T8