I have a long and complicated dataset where I am trying to simplify it such that there is only one row per unique ID pairs (found in col 1 (ID1) and col 2 (ID2)). I would like to do this by making one column per unique value in the 'favorite_grades' column (in my actual dataset it is more complicated than 'favorite_grades', so I would like something that can group by unique text). Also, I noticed that sometimes there are multiple 'favorite_grades' that are lumped up into one row and separated by a space.
After each unique text is its own column, I would like the names to be sorted into the row that matches their ID pair, as well as the column that matches the favorite grade of each student. The ones that have the same ID pair and same grade can be grouped together in a list (double quotes around them). I have made a sample data frame as well as the desired result (they are both pandas data frames).
original:
ID1 ID2 name favorite_grades
01 01 John 3rd 4th
01 01 Kate 4th 5th
01 02 Emily 4th
01 03 Mark 5th
01 03 Emma 5th
desired result:
ID1 ID2 3rd_grade 4th_grade 5th_grade
01 01 "John" "John, Kate" "Kate
01 02 "Emily"
01 03 "Mark, Emma"
So far, I've tried:
df.pivot_table(index=['ID1','ID2'], columns='grade', values='name',aggfunc=', '.join).reset_index()
But it gives me this:
ID1 ID2 3rd 4th 4th 5th 4th 5th
1 1 John Kate NaN
2 NaN Emily Emily NaN
3 NaN NaN Mark, Emma
First off, the ID1 is blank for all other rows besides the first (I want it to be listed). Also, the '3rd 4th' and '4th 5th' should not be read as their own columns.
CodePudding user response:
Try this:
df.assign(favorite_grades=df['favorite_grades'].str.split(' '))\
.explode('favorite_grades')\
.groupby(['ID1', 'ID2', 'favorite_grades'])['name'].agg(', '.join)\
.unstack(fill_value='')\
.reset_index()
Output:
favorite_grades ID1 ID2 3rd 4th 5th
0 1 1 John John, Kate Kate
1 1 2 Emily
2 1 3 Mark, Emma
With double quotes...
df.assign(favorite_grades=df['favorite_grades'].str.split(' '))\
.explode('favorite_grades')\
.groupby(['ID1', 'ID2', 'favorite_grades'])['name']\
.agg(lambda x: f'''"{', '.join(x)}"''')\
.unstack(fill_value='')\
.reset_index())
favorite_grades ID1 ID2 3rd 4th 5th
0 1 1 "John" "John, Kate" "Kate"
1 1 2 "Emily"
2 1 3 "Mark, Emma"