I'm trying to use groupby and agg() function for this data processing step:
Input:
ID Fruit
1 Apples
1 Oranges
2 Strawberries
3 Blueberries
4 Blueberries
4 Apples
I plan to aggregate the data by ID. The requirement is if apples and oranges show up for the same ID, keep 'Apples'; for other combinations, keep the first observation for each ID. So wanted output:
ID Fruit
1 Apples
2 Strawberries
3 Blueberries
4 Blueberries
I could pivot the table and use np.where; however, in this particular situation, I'd prefer to do it in a long format. Please help me with this data processing step. Thank you.
CodePudding user response:
Use:
#sample data
print (df)
ID Fruit
0 1 Apples
1 1 Oranges
2 2 Strawberries
3 3 Blueberries
4 4 Blueberries
5 4 Apples
6 5 Blueberries
7 5 Oranges
8 5 Apples
You can aggregate set
and GroupBy.first
:
L = ['Apples','Oranges']
df1 = df.groupby('ID', sort=False)['Fruit'].agg([set, 'first']).reset_index()
print (df1)
ID set first
0 1 {Oranges, Apples} Apples
1 2 {Strawberries} Strawberries
2 3 {Blueberries} Blueberries
3 4 {Blueberries, Apples} Blueberries
4 5 {Oranges, Blueberries, Apples} Blueberries
And then test if all categories from list L
exist in column set
by Series.ge
also possible another categories like ID=5
- then is assigned Apple
else values from column first
:
df1['Fruit'] = np.where(df1.pop('set').ge(set(L)), 'Apple', df1.pop('first'))
print (df1)
ID Fruit
0 1 Apple
1 2 Strawberries
2 3 Blueberries
3 4 Blueberries
4 5 Apple
Or if need test if exist only unique values from list converted to sets use Series.eq
:
L = ['Apples','Oranges']
df1 = df.groupby('ID', sort=False)['Fruit'].agg([set, 'first']).reset_index()
print (df1)
ID set first
0 1 {Oranges, Apples} Apples
1 2 {Strawberries} Strawberries
2 3 {Blueberries} Blueberries
3 4 {Blueberries, Apples} Blueberries
4 5 {Oranges, Blueberries, Apples} Blueberries
df1['Fruit'] = np.where(df1.pop('set').eq(set(L)), 'Apple', df1.pop('first'))
print (df1)
ID Fruit
0 1 Apple
1 2 Strawberries
2 3 Blueberries
3 4 Blueberries
4 5 Blueberries
CodePudding user response:
Not 100% I'm catching all possible edge cases:
df = df.groupby("ID")["Fruit"].apply(list).reset_index()
df["Fruit"] = df["Fruit"].apply(
lambda x: sorted(x)[0] if "Apples" and "Oranges" in x else x[0]
)
print(df)
ID Fruit
0 1 Apples
1 2 Strawberries
2 3 Blueberries
3 4 Blueberries