Home > Software engineering >  Dropping rows using multiple criteria
Dropping rows using multiple criteria

Time:02-18

I have the following data frame:

EID CLEAN_NAME Start_Date End_Date Rank_no
E000000 DEF 3/1/1973 2/28/1978 154
E000001 GHI 6/1/1983 3/31/1988 1296
E000001 ABC 1/1/2017 80292
E000002 JKL 10/1/1980 8/31/1981 751.5
E000003 MNO 5/1/1973 11/30/1977 157
E000003 ABC 5/1/1977 11/30/1987 200
E000003 PQR 5/1/1987 11/30/1997 300
E000003 ABC 5/1/1997 1000

What I am trying to do here is I am trying to delete company ABC where rank is highest for ABC company in Rank_no column for each EID. If we find ABC record but it does not have highest rank for an EID it should not be deleted. Rest of the data should remain as it is. The expected output is as follows:

EID CLEAN_NAME Start_Date End_Date Rank_no
E000000 DEF 3/1/1973 2/28/1978 154
E000001 GHI 6/1/1983 3/31/1988 1296
E000002 JKL 10/1/1980 8/31/1981 751.5
E000003 MNO 5/1/1973 11/30/1977 157
E000003 ABC 5/1/1977 11/30/1987 200
E000003 PQR 5/1/1987 11/30/1997 300

I tried to use the following code:

result_new = result.drop(result[(result['Rank_no'] == result.Rank_no.max()) & (result['CLEAN_NAME'] == 'ABC')].index)

But it's not working. Pretty sure I am giving the conditions incorrect but not sure what exactly I am missing or writing incorrectly. I have named my data frame as result.

Any leads would be appreciated. Thanks.!

CodePudding user response:

Use groupby and idxmax to find the max index for each respective EID and CLEAN_NAME combo after filtering down to only the rows that have ABC.

df.drop(df.loc[df.CLEAN_NAME == "ABC"].groupby("EID").Rank_no.idxmax())

       EID CLEAN_NAME Start_Date    End_Date  Rank_no
0  E000000        DEF   3/1/1973   2/28/1978    154.0
1  E000001        GHI   6/1/1983   3/31/1988   1296.0
3  E000002        JKL  10/1/1980   8/31/1981    751.5
4  E000003        MNO   5/1/1973  11/30/1977    157.0
5  E000003        ABC   5/1/1977  11/30/1987    200.0
6  E000003        PQR   5/1/1987  11/30/1997    300.0

CodePudding user response:

import pandas as pd

datas = [
    ['E000000', 'DEF', '3/1/1973', '2/28/1978', 154],
    ['E000001', 'GHI', '6/1/1983', '3/31/1988', 1296],
    ['E000001', 'ABC', '1/1/2017', '', 80292],
    ['E000002', 'JKL', '10/1/1980', '8/31/1981', 751.5],
    ['E000003', 'MNO', '5/1/1973', '11/30/1977', 157],
    ['E000003', 'ABC', '5/1/1977', '11/30/1987', 200],
    ['E000003', 'PQR', '5/1/1987', '11/30/1997', 300],
    ['E000003', 'ABC', '5/1/1997', '', 1000],
]

result = pd.DataFrame(datas, columns=['EID', 'CLEAN_NAME', 'Start_Date', 'End_Date', 'Rank_no'])

new_result = result.sort_values(by='Rank_no')  # sort by lowest Rank_no
new_result = new_result.drop_duplicates(subset=['CLEAN_NAME'], keep='first')  # drop duplicates keeping the first
new_result = new_result.sort_values(by='EID')  # sort by EID

print(new_result)

Output :

   EID CLEAN_NAME Start_Date    End_Date  Rank_no
0  E000000        DEF   3/1/1973   2/28/1978    154.0
1  E000001        GHI   6/1/1983   3/31/1988   1296.0
3  E000002        JKL  10/1/1980   8/31/1981    751.5
4  E000003        MNO   5/1/1973  11/30/1977    157.0
5  E000003        ABC   5/1/1977  11/30/1987    200.0
6  E000003        PQR   5/1/1987  11/30/1997    300.0
  • Related