My question is similar to this, this, and this questions.
But still cannot resolve it.
I have a dataframe with duplicated ids
ID Publication_type
1 Journal
1 Clinical study
1 Guideline
2 Journal
2 Letter
I want to make it wide, but I do not know how many publication type will I have - maybe 2, maybe 20. Thus, I do not know how many columns wide will I need. The max size of wide columns for publication_type
must be not be more than the number of types for each id.
Expected output
ID Publication_type1 Publication_type2 Publication_type 3 etc
1 Journal Clinical Study Guideline
2 Journal Letter NaN
For now I do not need to put the same publication type into the same column. I do not need all articles in the same column. Thanks!
CodePudding user response:
You can group by ID
, aggregate via list
, and then create a new DataFrame from the results:
col = 'Publication_type'
new_df = pd.DataFrame(df.groupby('ID')[col].agg(lambda x: x.tolist()).tolist()).replace({None: np.nan})
new_df.columns = [f'{col}{i}' for i in new_df.columns 1]
new_df['ID'] = df['ID'].drop_duplicates().reset_index(drop=True)
Output:
>>> df
Publication_type1 Publication_type2 Publication_type3 ID
0 Journal Clinical-study Guideline 1
1 Journal Letter NaN 2