I have the following input:
import pandas as pd

samples = [('001', 'RENAL', 'CHROMOPHOBE', 'KICH'),
           ('002', 'OVARIAN', 'HIGH_GRADE_SEROUS_CARCINOMA', 'LGSOC'),
           ('003', 'OVARIAN', 'OTHER', 'NaN'),
           ('001', 'COLORECTAL', 'ADENOCARCINOMA', 'KICH')]
labels = ['id', 'disease_type', 'disease_sub_type', 'study_abbreviation']
df = pd.DataFrame.from_records(samples, columns=labels)
df
    id  disease_type             disease_sub_type  study_abbreviation
0  001         RENAL                  CHROMOPHOBE                KICH
1  002       OVARIAN  HIGH_GRADE_SEROUS_CARCINOMA               LGSOC
2  003       OVARIAN                        OTHER                 NaN
3  001    COLORECTAL               ADENOCARCINOMA                KICH
I want to be able to compress the repeated id (say 001 in this case) so that its disease_type, disease_sub_type and study_abbreviation values are each merged into one cell (nested).
Output
    id      disease_type             disease_sub_type  study_abbreviation
0  001  RENAL,COLORECTAL   CHROMOPHOBE,ADENOCARCINOMA          KICH, KICH
1  002           OVARIAN  HIGH_GRADE_SEROUS_CARCINOMA               LGSOC
2  003           OVARIAN                        OTHER                 NaN
This is only for admin work, hence the basic question, but it would help greatly when I need to merge with other datasets. Thanks again.
Answer:
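You could group by your 'id' column and use ','.join as the aggregation, which concatenates each group's values into one comma-separated string per column. This works here because every value is a string, including the 'NaN' placeholder: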
df.groupby('id', as_index=False).agg(','.join)
    id      disease_type             disease_sub_type  study_abbreviation
0  001  RENAL,COLORECTAL   CHROMOPHOBE,ADENOCARCINOMA           KICH,KICH
1  002           OVARIAN  HIGH_GRADE_SEROUS_CARCINOMA               LGSOC
2  003           OVARIAN                        OTHER                 NaN
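
One caveat: ','.join raises a TypeError if a group contains a real missing value (a float NaN) rather than the string 'NaN'. Below is a minimal sketch of one way to handle that, assuming the missing study_abbreviation is stored as np.nan; the helper name join_values is hypothetical and only for illustration. The last line shows the list variant, in case truly nested cells are preferred over joined strings.

```python
import numpy as np
import pandas as pd

samples = [('001', 'RENAL', 'CHROMOPHOBE', 'KICH'),
           ('002', 'OVARIAN', 'HIGH_GRADE_SEROUS_CARCINOMA', 'LGSOC'),
           ('003', 'OVARIAN', 'OTHER', np.nan),  # assumption: a real missing value
           ('001', 'COLORECTAL', 'ADENOCARCINOMA', 'KICH')]
labels = ['id', 'disease_type', 'disease_sub_type', 'study_abbreviation']
df = pd.DataFrame.from_records(samples, columns=labels)


def join_values(values):
    """Hypothetical helper: join the non-missing values of one group."""
    present = values.dropna().astype(str)
    # Keep the cell missing when the group has no values at all.
    return ','.join(present) if len(present) else np.nan


# One row per id, each other column collapsed to a comma-separated string.
result = df.groupby('id', as_index=False).agg(join_values)

# Variant: keep each group's values as a Python list instead of a string.
nested = df.groupby('id', as_index=False).agg(list)
```

The joined strings are convenient for display or export, while the list variant keeps the individual values accessible for later processing. Either way, 'id' stays a regular column, so the result can be merged with other datasets on it.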