Dropping rows using multiple criteria-CodePudding

I have the following data frame:

EID	CLEAN_NAME	Start_Date	End_Date	Rank_no
E000000	DEF	3/1/1973	2/28/1978	154
E000001	GHI	6/1/1983	3/31/1988	1296
E000001	ABC	1/1/2017		80292
E000002	JKL	10/1/1980	8/31/1981	751.5
E000003	MNO	5/1/1973	11/30/1977	157
E000003	ABC	5/1/1977	11/30/1987	200
E000003	PQR	5/1/1987	11/30/1997	300
E000003	ABC	5/1/1997		1000

What I am trying to do here is I am trying to delete company ABC where rank is highest for ABC company in Rank_no column for each EID. If we find ABC record but it does not have highest rank for an EID it should not be deleted. Rest of the data should remain as it is. The expected output is as follows:

EID	CLEAN_NAME	Start_Date	End_Date	Rank_no
E000000	DEF	3/1/1973	2/28/1978	154
E000001	GHI	6/1/1983	3/31/1988	1296
E000002	JKL	10/1/1980	8/31/1981	751.5
E000003	MNO	5/1/1973	11/30/1977	157
E000003	ABC	5/1/1977	11/30/1987	200
E000003	PQR	5/1/1987	11/30/1997	300

I tried to use the following code:

result_new = result.drop(result[(result['Rank_no'] == result.Rank_no.max()) & (result['CLEAN_NAME'] == 'ABC')].index)

But it's not working. Pretty sure I am giving the conditions incorrect but not sure what exactly I am missing or writing incorrectly. I have named my data frame as result.

Any leads would be appreciated. Thanks.!

CodePudding user response：

Use groupby and idxmax to find the max index for each respective EID and CLEAN_NAME combo after filtering down to only the rows that have ABC.

df.drop(df.loc[df.CLEAN_NAME == "ABC"].groupby("EID").Rank_no.idxmax())

       EID CLEAN_NAME Start_Date    End_Date  Rank_no
0  E000000        DEF   3/1/1973   2/28/1978    154.0
1  E000001        GHI   6/1/1983   3/31/1988   1296.0
3  E000002        JKL  10/1/1980   8/31/1981    751.5
4  E000003        MNO   5/1/1973  11/30/1977    157.0
5  E000003        ABC   5/1/1977  11/30/1987    200.0
6  E000003        PQR   5/1/1987  11/30/1997    300.0

CodePudding user response：

import pandas as pd

datas = [
    ['E000000', 'DEF', '3/1/1973', '2/28/1978', 154],
    ['E000001', 'GHI', '6/1/1983', '3/31/1988', 1296],
    ['E000001', 'ABC', '1/1/2017', '', 80292],
    ['E000002', 'JKL', '10/1/1980', '8/31/1981', 751.5],
    ['E000003', 'MNO', '5/1/1973', '11/30/1977', 157],
    ['E000003', 'ABC', '5/1/1977', '11/30/1987', 200],
    ['E000003', 'PQR', '5/1/1987', '11/30/1997', 300],
    ['E000003', 'ABC', '5/1/1997', '', 1000],
]

result = pd.DataFrame(datas, columns=['EID', 'CLEAN_NAME', 'Start_Date', 'End_Date', 'Rank_no'])

new_result = result.sort_values(by='Rank_no')  # sort by lowest Rank_no
new_result = new_result.drop_duplicates(subset=['CLEAN_NAME'], keep='first')  # drop duplicates keeping the first
new_result = new_result.sort_values(by='EID')  # sort by EID

print(new_result)

Output :

   EID CLEAN_NAME Start_Date    End_Date  Rank_no
0  E000000        DEF   3/1/1973   2/28/1978    154.0
1  E000001        GHI   6/1/1983   3/31/1988   1296.0
3  E000002        JKL  10/1/1980   8/31/1981    751.5
4  E000003        MNO   5/1/1973  11/30/1977    157.0
5  E000003        ABC   5/1/1977  11/30/1987    200.0
6  E000003        PQR   5/1/1987  11/30/1997    300.0