I have a large dataframe mostly hast unique values, still there are multiple same IDs with different values stored. I want to group the same IDs then apply a logic to those to select one among them then remove the others.
df = pd.DataFrame({'ID': [11, 11,11,11,22,22,33] ,
'Source': [2, 2,4,3,3,2,3],
'Price':[10, 20,30,40,50,60,70]})
the logic is :if in group there is a row with SOURCE==4 keep and remove the others else in group there is a row with SOURCE==2 keep and remove the others else in group there is a row with SOURCE==3 keep and remove the others so hierarchy is based on Source column and it is 4>2>3.
Expected output:
expected = pd.DataFrame({'ID': [11,22,33] ,
'Source': [4,2,3],
'Price':[30,60,70]})
A possible solution is creating a new column of hierarchy if source ==4 then hierarchy ==1... and then sort it and and select nth(1) . However I wonder most how can I do conditional select after groupby.
CodePudding user response:
I would do groupby
and apply
with map
and idxmax
:
>>> df.groupby('ID', as_index=False).apply(lambda x: x.loc[x['Source'].map({3: 1, 2: 2, 4: 3}).idxmax()])
ID Source Price
0 11 4 30
1 22 2 60
2 33 3 70
>>>
CodePudding user response:
I feel like you are hunting for even and odd numbers, hence the 4, 2, 3 order. The code below should suffice, and avoid the anonymous function, while offering some speed up (depending on the data size); it is quite verbose in my opinion though:
(df.assign(even_odd = np.where(df.Source % 2 == 0, 'even', 'odd'))
.groupby(['ID', 'even_odd'], as_index = False)
.max()
.drop_duplicates('ID', keep='first')
.filter([*df.columns])
)
ID Source Price
0 11 4 30
2 22 2 60
4 33 3 70
Of course, this would fail if you had 5, 9, 6, 12, ..., in which case, another logic is required. This only works if the numbers are restricted to 4, 2, 3
CodePudding user response:
d= {4:1,2:2, 3:3} # dict of drop hierarchy
new=(df.assign(rank=df.Source.map(d))#Create a rank column that maps the hierachy of selection
.sort_values(by='rank')#Sort new dataframe by rank
.drop_duplicates(subset='ID',keep='first')#Drop all the duplicated Source values
.drop('rank',1)#Drop the temp sorting column
)
print(new)
ID Source Price
2 11 4 30
5 22 2 60
6 33 3 70