Pandas: Remove rows in a group if a particular value is also in a group-CodePudding

I'm trying to use groupby and agg() function for this data processing step:

Input:

ID  Fruit 
1   Apples 
1   Oranges 
2   Strawberries 
3   Blueberries 
4   Blueberries 
4   Apples

I plan to aggregate the data by ID. The requirement is if apples and oranges show up for the same ID, keep 'Apples'; for other combinations, keep the first observation for each ID. So wanted output:

ID  Fruit 
1   Apples 
2   Strawberries 
3   Blueberries 
4   Blueberries

I could pivot the table and use np.where; however, in this particular situation, I'd prefer to do it in a long format. Please help me with this data processing step. Thank you.

CodePudding user response：

Use:

#sample data
print (df)
   ID         Fruit
0   1        Apples
1   1       Oranges
2   2  Strawberries
3   3   Blueberries
4   4   Blueberries
5   4        Apples
6   5   Blueberries
7   5       Oranges
8   5        Apples

You can aggregate set and GroupBy.first:

L = ['Apples','Oranges']

df1 = df.groupby('ID', sort=False)['Fruit'].agg([set, 'first']).reset_index()
print (df1)
   ID                             set         first
0   1               {Oranges, Apples}        Apples
1   2                  {Strawberries}  Strawberries
2   3                   {Blueberries}   Blueberries
3   4           {Blueberries, Apples}   Blueberries
4   5  {Oranges, Blueberries, Apples}   Blueberries

And then test if all categories from list L exist in column set by Series.ge also possible another categories like ID=5 - then is assigned Apple else values from column first:

df1['Fruit'] = np.where(df1.pop('set').ge(set(L)), 'Apple', df1.pop('first'))
print (df1)
   ID         Fruit
0   1         Apple
1   2  Strawberries
2   3   Blueberries
3   4   Blueberries
4   5         Apple

Or if need test if exist only unique values from list converted to sets use Series.eq:

L = ['Apples','Oranges']

df1 = df.groupby('ID', sort=False)['Fruit'].agg([set, 'first']).reset_index()
print (df1)
   ID                             set         first
0   1               {Oranges, Apples}        Apples
1   2                  {Strawberries}  Strawberries
2   3                   {Blueberries}   Blueberries
3   4           {Blueberries, Apples}   Blueberries
4   5  {Oranges, Blueberries, Apples}   Blueberries

df1['Fruit'] = np.where(df1.pop('set').eq(set(L)), 'Apple', df1.pop('first'))
print (df1)
   ID         Fruit
0   1         Apple
1   2  Strawberries
2   3   Blueberries
3   4   Blueberries
4   5   Blueberries

CodePudding user response：

Not 100% I'm catching all possible edge cases:

df = df.groupby("ID")["Fruit"].apply(list).reset_index()

df["Fruit"] = df["Fruit"].apply(
    lambda x: sorted(x)[0] if "Apples" and "Oranges" in x else x[0]
)

print(df)

   ID         Fruit
0   1        Apples
1   2  Strawberries
2   3   Blueberries
3   4   Blueberries