I have a dataframe containing only duplicate "MainID" rows.
One MainID may have multiple secondary IDs (SecID).
I want to concatenate the values of SecID if there is a common MainID, joined by ':' in SecID col.
What is the best way of achieving this?
Yes, I know this is not best practice, however it's the structure the software wants.
Need to keep the df structure and values in rest of the df. They will always match the other duplicated row. Only SecID will be different.
Current:
data={'MainID':['NHFPL0580','NHFPL0580','NHFPL0582','NHFPL0582'],'SecID':['G12345','G67890','G11223','G34455'], 'Other':['A','A','B','B']}
df=pd.DataFrame(data)
print(df)
MainID SecID Other
0 NHFPL0580 G12345 A
1 NHFPL0580 G67890 A
2 NHFPL0582 G11223 B
3 NHFPL0582 G34455 B
Intended Structure
MainID SecID Other
NHFPL0580 G12345:G67890 A
NHFPL0582 G11223:G34455 B
CodePudding user response:
Try:
df.groupby('MainID').apply(lambda x: ':'.join(x.SecID))
the above code returns a pd.Series
, and you can convert it to a dataframe as @Guy suggested:
You need .reset_index(name='SecID') if you want it back as DataFrame
The solution to the edited question:
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')
You can then change the column order
cols = df.columns.tolist()
df = df[[cols[i] for i in [0, 2, 1]]]