Below is the input data
A B C D
R1 J1 D1 S1,S2
R1 J1 D1 S3,S4,S5
R1 J1 D2 S5,S6,S2
R1 J1 D2 S7,S8
P1 J2 E1 T1,T2
P1 J2 E1 T3,T4,T5
P1 J2 E2 T5,T6,T2
P1 J2 E3 T7,T8,T5
Expected output with no repeated values in column D :
A B C D
R1 J1 D1,D2 S1,S2,S3,S4,S5,S6,S7,S8
P1 J2 E1,E2,E3 T1,T2,T3,T4,T5,T6,T7,T8
The script which I have tried and that not working
df.groupby(['A','B'])[['C','D']].agg([','.join]).reset_index()
CodePudding user response:
Try using str.split
to split on the commas, then use explode
to flatten the Series
and use unique
to remove the duplicates:
df.groupby(['A','B'])[['C','D']].agg(lambda x: ','.join(x.str.split(',').explode().unique())).reset_index()
Output:
A B C D
0 P1 J2 E1,E2,E3 T1,T2,T3,T4,T5,T6,T7,T8
1 R1 J1 D1,D2 S1,S2,S3,S4,S5,S6,S7,S8
CodePudding user response:
Use lambda function with remove duplicates by convert splitted values to dictionaries by dict.fromkeys
in GroupBy.agg
:
f = lambda x: ','.join(dict.fromkeys(z for y in x for z in y.split(',')))
df = df.groupby(['A','B'], sort=False)[['C','D']].agg(f).reset_index()
print (df)
A B C D
0 R1 J1 D1,D2 S1,S2,S3,S4,S5,S6,S7,S8
1 P1 J2 E1,E2,E3 T1,T2,T3,T4,T5,T6,T7,T8