I have a dataframe like this with IDs and Preferences in a string separated by ', ':
ID | Preferences |
---|---|
1 | banana, apple |
1 | banana, apple, kiwi |
1 | avocado, apple |
2 | avocado, grapes |
2 | banana, apple, kiwi |
And I want to group by the ID and get the 2 preferences that appear the most, so the result would be like:
ID | first_preference | second_preference |
---|---|---|
1 | apple | banana |
2 | avocado, grapes, banana, apple, kiwi |
with 'draws' being concatenated together.
I need to do this on an aggregated groupby because I have other columns I need to aggregate too.
Could anyone help me? Thanks!
CodePudding user response:
To obtain the preferences, first split and expand your DataFrame into a much longer Series. Then, count and rank the number of occurrences, and another groupby
agg
will allow you to join the ties for the first and second prefernences.
The index of this result will be the unique 'ID'
values in your original DataFrame, so you can concat
this with the results of your other groupby
agg
operations
Sample Data
import pandas as pd
df = pd.DataFrame({'ID': [1,1,1,2,2],
'Preferences': ['banana, apple', 'banana, apple, kiwi', 'avocado, apple', 'avocado, grapes',
'banana, apple, kiwi']})
Code
# Expand to long Series
s = df.set_index(['ID']).Preferences.str.split(', ', expand=True).stack()
# Within each ID, rank preferences based on # of occurrences
s = (s.groupby([s.index.get_level_values(0), s.rename('preference')]).size()
.groupby(level=0).rank(method='dense', ascending=False)
.map({1: 'first', 2: 'second'}).rename('order'))
res = s[s.isin(['first', 'second'])].reset_index().groupby(['ID', 'order']).agg(', '.join).unstack(-1)
# Collapse MultiIndex to get simple column labels
res.columns = [f'{y}_{x}' for x,y in res.columns]
print(res)
first_preference second_preference
ID
1 apple banana
2 apple, avocado, banana, grapes, kiwi NaN