I have a dataframe with repetitive names of people
Input:
{'name': {0: 'John Smith', 1: 'John Smith', 2: 'John Smith', 3: 'John Doo', 4: 'John Doo', 5: 'John Doo'}, 'journal': {0: 'Journal1', 1: 'Journal2', 2: 'Journal2', 3: 'Journal1', 4: 'Journal2', 5: 'Journal2'}, 'is_editor_in_chief_2019': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan}, 'is_editor_in_chief_2020': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan}, 'is_editor_in_chief_2021': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan}, 'is_editor_in_chief_2022': {0: 1.0, 1: nan, 2: nan, 3: 1.0, 4: nan, 5: nan}, 'is_editorial_board_member_2019': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan}, 'is_editorial_board_member_2020': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan}, 'is_editorial_board_member_2021': {0: nan, 1: nan, 2: 1.0, 3: nan, 4: nan, 5: 1.0}, 'is_editorial_board_member_2022': {0: nan, 1: 1.0, 2: nan, 3: nan, 4: 1.0, 5: nan}}
I want to collapse these rows and convert data from long to wide.
Expected output:
{'name': {0: 'John Smith', 1: 'John Doo'}, 'Journal1_is_editor_in_chief_2019': {0: nan, 1: nan}, 'Journal1_is_editor_in_chief_2020': {0: nan, 1: nan}, 'Journal1_is_editor_in_chief_2021': {0: nan, 1: nan}, 'Journal1_is_editor_in_chief_2022': {0: 1, 1: 1}, 'Journal1_is_editorial_board_member_2019': {0: nan, 1: nan}, 'Journal1_is_editorial_board_member_2020': {0: nan, 1: nan}, 'Journal1_is_editorial_board_member_2021': {0: 1, 1: 1}, 'Journal1_is_editorial_board_member_2022': {0: 1, 1: 1}, 'Journal2_is_editor_in_chief_2019': {0: nan, 1: nan}, 'Journal2_is_editor_in_chief_2020': {0: nan, 1: nan}, 'Journal2_is_editor_in_chief_2021': {0: nan, 1: nan}, 'Journal2_is_editor_in_chief_2022': {0: nan, 1: nan}, 'Journal2_is_editorial_board_member_2019': {0: nan, 1: nan}, 'Journal2_is_editorial_board_member_2020': {0: nan, 1: nan}, 'Journal2_is_editorial_board_member_2021': {0: nan, 1: nan}, 'Journal2_is_editorial_board_member_2022': {0: nan, 1: nan}}
The closest solution I found is here:
df_out = df.set_index(['name', df.groupby(['name']).cumcount() 1]).unstack().sort_index(level=1, axis=1)
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
But it does not give me the ideal output.
CodePudding user response:
You can use pivot_table
:
df_out = df.pivot_table(index='name', columns='journal', dropna=False)
df_out.columns = df_out.columns.map('{0[1]}_{0[0]}'.format)
df_out.reset_index()
output:
name Journal1_is_editor_in_chief_2019 Journal2_is_editor_in_chief_2019 Journal1_is_editor_in_chief_2020 Journal2_is_editor_in_chief_2020 Journal1_is_editor_in_chief_2021 Journal2_is_editor_in_chief_2021 Journal1_is_editor_in_chief_2022 Journal2_is_editor_in_chief_2022 Journal1_is_editorial_board_member_2019 Journal2_is_editorial_board_member_2019 Journal1_is_editorial_board_member_2020 Journal2_is_editorial_board_member_2020 Journal1_is_editorial_board_member_2021 Journal2_is_editorial_board_member_2021 Journal1_is_editorial_board_member_2022 Journal2_is_editorial_board_member_2022
0 John Doo NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN 1.0 NaN 1.0
1 John Smith NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN 1.0 NaN 1.0