Pivot a column so repeated values/records are placed in 1 cell-CodePudding

I have the following

Input:
samples = [('001', 'RENAL', 'CHROMOPHOBE', 'KICH'),
         ('002', 'OVARIAN', 'HIGH_GRADE_SEROUS_CARCINOMA', 'LGSOC'),
         ('003', 'OVARIAN', 'OTHER', 'NaN'),
         ('001', 'COLORECTAL', 'ADENOCARCINOMA', 'KICH')]
labels = ['id', 'disease_type', 'disease_sub_type', 'study_abbreviation']
df = pd.DataFrame.from_records(samples, columns=labels)
df

    id  disease_type  disease_sub_type              study_abbreviation
0   001 RENAL         CHROMOPHOBE                   KICH
1   002 OVARIAN       HIGH_GRADE_SEROUS_CARCINOMA   LGSOC
2   003 OVARIAN       OTHER                         NaN
3   001 COLORECTAL    ADENOCARCINOMA                KICH

I want to be able to compress the repeated id, say 001 in this case so that I can have the disease_type and disease_sub_type, study_abbreviation merged into 1 cell each (nested).

Output

    id  disease_type         disease_sub_type               study_abbreviation
0   001 RENAL,COLORECTAL     CHROMOPHOBE,ADENOCARCINOMA     KICH, KICH
1   002 OVARIAN              HIGH_GRADE_SEROUS_CARCINOMA    LGSOC
2   003 OVARIAN              OTHER                          NaN

This is not for anything but admin work hence the stupid ask but would help greatly when I need to merge on other datasets, thanks again.

CodePudding user response：

You could group by your 'id' column and use list as an aggregation:

df.groupby('id',as_index=False).agg(','.join)

    id      disease_type             disease_sub_type study_abbreviation
0  001  RENAL,COLORECTAL   CHROMOPHOBE,ADENOCARCINOMA          KICH,KICH
1  002           OVARIAN  HIGH_GRADE_SEROUS_CARCINOMA              LGSOC
2  003           OVARIAN                        OTHER                NaN