Home > Net >  Python How to do conditional selection after groupby
Python How to do conditional selection after groupby

Time:10-03

I have a large dataframe mostly hast unique values, still there are multiple same IDs with different values stored. I want to group the same IDs then apply a logic to those to select one among them then remove the others.

df = pd.DataFrame({'ID': [11, 11,11,11,22,22,33] ,
                   'Source': [2, 2,4,3,3,2,3],
                   'Price':[10, 20,30,40,50,60,70]})

the logic is :if in group there is a row with SOURCE==4 keep and remove the others else in group there is a row with SOURCE==2 keep and remove the others else in group there is a row with SOURCE==3 keep and remove the others so hierarchy is based on Source column and it is 4>2>3.

Expected output:

expected = pd.DataFrame({'ID': [11,22,33] ,
                   'Source': [4,2,3],
                   'Price':[30,60,70]})

A possible solution is creating a new column of hierarchy if source ==4 then hierarchy ==1... and then sort it and and select nth(1) . However I wonder most how can I do conditional select after groupby.

CodePudding user response:

I would do groupby and apply with map and idxmax:

>>> df.groupby('ID', as_index=False).apply(lambda x: x.loc[x['Source'].map({3: 1, 2: 2, 4: 3}).idxmax()])
   ID  Source  Price
0  11       4     30
1  22       2     60
2  33       3     70
>>> 

CodePudding user response:

I feel like you are hunting for even and odd numbers, hence the 4, 2, 3 order. The code below should suffice, and avoid the anonymous function, while offering some speed up (depending on the data size); it is quite verbose in my opinion though:

(df.assign(even_odd = np.where(df.Source % 2 == 0, 'even', 'odd'))
   .groupby(['ID', 'even_odd'], as_index = False)
   .max()
   .drop_duplicates('ID', keep='first')
   .filter([*df.columns])
 )
   ID  Source  Price
0  11       4     30
2  22       2     60
4  33       3     70

Of course, this would fail if you had 5, 9, 6, 12, ..., in which case, another logic is required. This only works if the numbers are restricted to 4, 2, 3

CodePudding user response:

d= {4:1,2:2, 3:3} # dict of drop hierarchy
new=(df.assign(rank=df.Source.map(d))#Create a rank column that maps the hierachy of selection
     .sort_values(by='rank')#Sort new dataframe by rank
     .drop_duplicates(subset='ID',keep='first')#Drop all the duplicated Source values
     .drop('rank',1)#Drop the temp sorting column
    )

print(new)

    ID  Source  Price
2  11       4     30
5  22       2     60
6  33       3     70
  • Related