I have the following data frame:
EID | CLEAN_NAME | Start_Date | End_Date | Rank_no |
---|---|---|---|---|
E000000 | DEF | 3/1/1973 | 2/28/1978 | 154 |
E000001 | GHI | 6/1/1983 | 3/31/1988 | 1296 |
E000001 | ABC | 1/1/2017 | 80292 | |
E000002 | JKL | 10/1/1980 | 8/31/1981 | 751.5 |
E000003 | MNO | 5/1/1973 | 11/30/1977 | 157 |
E000003 | ABC | 5/1/1977 | 11/30/1987 | 200 |
E000003 | PQR | 5/1/1987 | 11/30/1997 | 300 |
E000003 | ABC | 5/1/1997 | 1000 |
What I am trying to do here is I am trying to delete company ABC where rank is highest for ABC company in Rank_no column for each EID. If we find ABC record but it does not have highest rank for an EID it should not be deleted. Rest of the data should remain as it is. The expected output is as follows:
EID | CLEAN_NAME | Start_Date | End_Date | Rank_no |
---|---|---|---|---|
E000000 | DEF | 3/1/1973 | 2/28/1978 | 154 |
E000001 | GHI | 6/1/1983 | 3/31/1988 | 1296 |
E000002 | JKL | 10/1/1980 | 8/31/1981 | 751.5 |
E000003 | MNO | 5/1/1973 | 11/30/1977 | 157 |
E000003 | ABC | 5/1/1977 | 11/30/1987 | 200 |
E000003 | PQR | 5/1/1987 | 11/30/1997 | 300 |
I tried to use the following code:
result_new = result.drop(result[(result['Rank_no'] == result.Rank_no.max()) & (result['CLEAN_NAME'] == 'ABC')].index)
But it's not working. Pretty sure I am giving the conditions incorrect but not sure what exactly I am missing or writing incorrectly. I have named my data frame as result.
Any leads would be appreciated. Thanks.!
CodePudding user response:
Use groupby
and idxmax
to find the max index for each respective EID
and CLEAN_NAME
combo after filtering down to only the rows that have ABC
.
df.drop(df.loc[df.CLEAN_NAME == "ABC"].groupby("EID").Rank_no.idxmax())
EID CLEAN_NAME Start_Date End_Date Rank_no
0 E000000 DEF 3/1/1973 2/28/1978 154.0
1 E000001 GHI 6/1/1983 3/31/1988 1296.0
3 E000002 JKL 10/1/1980 8/31/1981 751.5
4 E000003 MNO 5/1/1973 11/30/1977 157.0
5 E000003 ABC 5/1/1977 11/30/1987 200.0
6 E000003 PQR 5/1/1987 11/30/1997 300.0
CodePudding user response:
import pandas as pd
datas = [
['E000000', 'DEF', '3/1/1973', '2/28/1978', 154],
['E000001', 'GHI', '6/1/1983', '3/31/1988', 1296],
['E000001', 'ABC', '1/1/2017', '', 80292],
['E000002', 'JKL', '10/1/1980', '8/31/1981', 751.5],
['E000003', 'MNO', '5/1/1973', '11/30/1977', 157],
['E000003', 'ABC', '5/1/1977', '11/30/1987', 200],
['E000003', 'PQR', '5/1/1987', '11/30/1997', 300],
['E000003', 'ABC', '5/1/1997', '', 1000],
]
result = pd.DataFrame(datas, columns=['EID', 'CLEAN_NAME', 'Start_Date', 'End_Date', 'Rank_no'])
new_result = result.sort_values(by='Rank_no') # sort by lowest Rank_no
new_result = new_result.drop_duplicates(subset=['CLEAN_NAME'], keep='first') # drop duplicates keeping the first
new_result = new_result.sort_values(by='EID') # sort by EID
print(new_result)
Output :
EID CLEAN_NAME Start_Date End_Date Rank_no
0 E000000 DEF 3/1/1973 2/28/1978 154.0
1 E000001 GHI 6/1/1983 3/31/1988 1296.0
3 E000002 JKL 10/1/1980 8/31/1981 751.5
4 E000003 MNO 5/1/1973 11/30/1977 157.0
5 E000003 ABC 5/1/1977 11/30/1987 200.0
6 E000003 PQR 5/1/1987 11/30/1997 300.0