I have the dataframe outlined below and need to remove the second row. Rows removed should only have a Reason of 'RTE' and be directly following a Reason of 100. Therefore row 4 should remain even though row 5 has the same Emp and Salary.
Is there a way to do this with Pandas? I've tried drop_duplicates() and duplicates() but unsure of a way to define a condition where Reason must be RTE and follow a Reason of 100. Any ideas would be much appreciated. Thank you!
DF:
Emp | Reason | Date | Salary |
---|---|---|---|
2 | 100 | 01/23/2017 | 10000 |
2 | RTE | 02/26/2018 | 10000 |
2 | RTE | 05/07/2018 | 101000 |
2 | RTE | 05/06/2019 | 102000 |
2 | 800 | 01/31/2020 | 102000 |
Desired Result:
Emp | Reason | Date | Salary |
---|---|---|---|
2 | 100 | 01/23/2017 | 10000 |
2 | RTE | 05/07/2018 | 101000 |
2 | RTE | 05/06/2019 | 102000 |
2 | 800 | 01/31/2020 | 102000 |
df.drop_duplicates(subset=["Emp", "Reason", "Salary"], keep='first')
- this does not remove row 2
df.drop_duplicates(subset=["Emp", "Salary"], keep='first')
- this removes row 5 which I want to keep.
CodePudding user response:
You could create a mask for records that have RTE and prior row 100, and then use ~
to select only the rows that don't match.
Edit: Changed the input df to have examples of matches w/ and w/o the same salary.
import pandas as pd
df = pd.DataFrame({'Emp': [2, 2, 2, 2, 2],
'Reason': ['100', 'RTE', '100', 'RTE', '800'],
'Date': ['01/23/2017',
'02/26/2018',
'05/07/2018',
'05/06/2019',
'01/31/2020'],
'Salary': [10000, 10000, 101000, 102000, 102000]})
m = df['Reason'].eq('RTE') & df['Reason'].shift(1).eq('100') & df['Salary'].eq(df['Salary'].shift(1))
df = df[~m]
print(df)
Output
Emp Reason Date Salary
0 2 100 01/23/2017 10000
2 2 100 05/07/2018 101000
3 2 RTE 05/06/2019 102000
4 2 800 01/31/2020 102000