Remove duplicates in a DF when specified column is different and contains a specific value compared


I have the dataframe outlined below and need to remove the second row. A row should only be removed when its Reason is 'RTE' and it directly follows a row with a Reason of 100. Therefore rows 4 and 5 should both remain even though they share the same Emp and Salary.

Is there a way to do this with Pandas? I've tried drop_duplicates() and duplicated(), but I'm unsure how to define a condition where Reason must be RTE and directly follow a Reason of 100. Any ideas would be much appreciated. Thank you!

DF:

Emp  Reason  Date        Salary
2    100     01/23/2017  10000
2    RTE     02/26/2018  10000
2    RTE     05/07/2018  101000
2    RTE     05/06/2019  102000
2    800     01/31/2020  102000

Desired Result:

Emp  Reason  Date        Salary
2    100     01/23/2017  10000
2    RTE     05/07/2018  101000
2    RTE     05/06/2019  102000
2    800     01/31/2020  102000

df.drop_duplicates(subset=["Emp", "Reason", "Salary"], keep='first') - this does not remove row 2.
df.drop_duplicates(subset=["Emp", "Salary"], keep='first') - this removes row 5, which I want to keep.
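
For reference, a minimal sketch of those two attempts on the dataframe above (assuming Reason is stored as strings):

import pandas as pd

# The question's dataframe (Reason assumed to be stored as strings).
df = pd.DataFrame({'Emp': [2, 2, 2, 2, 2],
                   'Reason': ['100', 'RTE', 'RTE', 'RTE', '800'],
                   'Date': ['01/23/2017', '02/26/2018', '05/07/2018',
                            '05/06/2019', '01/31/2020'],
                   'Salary': [10000, 10000, 101000, 102000, 102000]})

# Removes nothing: row 2 differs from row 1 in Reason, so it is not a duplicate.
print(df.drop_duplicates(subset=["Emp", "Reason", "Salary"], keep='first'))

# Removes row 2, but also removes row 5, which shares Emp and Salary with row 4.
print(df.drop_duplicates(subset=["Emp", "Salary"], keep='first'))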

CodePudding user response:

You could create a mask for rows whose Reason is 'RTE' and whose previous row has a Reason of '100' and the same Salary, then use ~ to keep only the rows that don't match.

Edit: Changed the input df to have examples of matches w/ and w/o the same salary.

import pandas as pd

# Input with Reason stored as strings; rows 2 and 4 both follow a '100' row,
# but only row 2 keeps the same Salary as its predecessor.
df = pd.DataFrame({'Emp': [2, 2, 2, 2, 2],
                   'Reason': ['100', 'RTE', '100', 'RTE', '800'],
                   'Date': ['01/23/2017', '02/26/2018', '05/07/2018',
                            '05/06/2019', '01/31/2020'],
                   'Salary': [10000, 10000, 101000, 102000, 102000]})


# Mask: Reason is 'RTE', the previous Reason is '100', and the Salary is unchanged.
m = df['Reason'].eq('RTE') & df['Reason'].shift(1).eq('100') & df['Salary'].eq(df['Salary'].shift(1))
df = df[~m]  # keep only the rows that don't match


print(df)

Output

   Emp Reason        Date  Salary
0    2    100  01/23/2017   10000
2    2    100  05/07/2018  101000
3    2    RTE  05/06/2019  102000
4    2    800  01/31/2020  102000
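
If the real data holds more than one employee, note that a plain shift(1) compares across Emp boundaries. A minimal sketch of one way to avoid that, using a groupby shift (the two-employee frame below is made up purely for illustration):

import pandas as pd

# Hypothetical two-employee frame, only for illustration.
df = pd.DataFrame({'Emp': [1, 1, 2, 2],
                   'Reason': ['100', 'RTE', '100', 'RTE'],
                   'Date': ['01/01/2017', '02/01/2017', '01/23/2017', '02/26/2018'],
                   'Salary': [9000, 9000, 10000, 10000]})

# shift(1) within each Emp group: the first row of Emp 2 gets NaN
# instead of the last row of Emp 1.
prev_reason = df.groupby('Emp')['Reason'].shift(1)
prev_salary = df.groupby('Emp')['Salary'].shift(1)

m = df['Reason'].eq('RTE') & prev_reason.eq('100') & df['Salary'].eq(prev_salary)
print(df[~m])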