Let's say I have the data frame below. I want to find the rows with a given date value and, for each of them, compare the employee ID at the current index with the one at the next index. If they are the same, I want to delete the row at the next index. I tried the code below the table, but I am getting KeyError: 1.
EMPL_ID | AGE | EMPLOYER | END_DATE |
---|---|---|---|
12 | 23 | BHU | 2022-04-22 00:00:00 |
12 | 21 | BHU | 2022-04-22 00:00:00 |
34 | 22 | DU | 2022-04-22 00:00:00 |
36 | 21 | BHU | 2022-04-22 00:00:00 |
for index, row in df.iterrows():
    value = row['END_DATE']
    if (value == '2022-04-22 00:00:00'):
        a = index
        if (df.loc[a, 'EMPL_ID'] == df.loc[a + 1, 'EMPL_ID']):
            df.drop(a + 1, inplace = True)
        else:
            df = df
    else:
        df = df
CodePudding user response:
To keep only the rows that are unique across two columns, use drop_duplicates and pass those columns as the subset. Note that this removes every repeated combination, whether or not the rows are next to each other:
>>> import pandas as pd
>>> df = pd.DataFrame(data={'id': [1, 2, 3, 3], 'time': [1, 2, 3, 3]})
>>> df
   id  time
0   1     1
1   2     2
2   3     3
3   3     3
>>> df.drop_duplicates(subset=['id', 'time'], inplace=True)
>>> df
   id  time
0   1     1
1   2     2
2   3     3
CodePudding user response:
IIUC, here is one approach. Since you only want to remove consecutive duplicates, I added another row to your example. The second row with EMPL_ID 34 is not adjacent to the first one, so it won't be deleted.
import pandas as pd

df = pd.DataFrame({'EMPL_ID': [12, 12, 34, 36, 34],
                   'AGE': [23, 21, 22, 21, 22],
                   'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
                   'END_DATE': ['2022-04-22 00:00:00','2022-04-22 00:00:00','2022-04-22 00:00:00','2022-04-22 00:00:00','2022-04-22 00:00:00'],
                   })
print(df)
   EMPL_ID  AGE EMPLOYER             END_DATE
0       12   23      BHU  2022-04-22 00:00:00
1       12   21      BHU  2022-04-22 00:00:00
2       34   22       DU  2022-04-22 00:00:00
3       36   21      BHU  2022-04-22 00:00:00
4       34   22       DU  2022-04-22 00:00:00
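grp = ['END_DATE', 'EMPL_ID']
# Start a new group number whenever END_DATE or EMPL_ID differs from the previous row
df['group'] = (df[grp] != df[grp].shift(1)).any(axis=1).cumsum()
# Keep the first row of each run of consecutive duplicates, then drop the helper column
df = df.drop_duplicates('group', keep='first').drop('group', axis=1)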
print(df)
   EMPL_ID  AGE EMPLOYER             END_DATE
0       12   23      BHU  2022-04-22 00:00:00
2       34   22       DU  2022-04-22 00:00:00
3       36   21      BHU  2022-04-22 00:00:00
4       34   22       DU  2022-04-22 00:00:00
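As for the KeyError: 1 in the question, it most likely comes from dropping rows while iterating: once the row with label 1 has been dropped, iterrows still yields index 1, and df.loc[1, 'EMPL_ID'] can no longer find it. A boolean mask avoids the loop entirely. The sketch below is only an illustration of that idea, not a definitive implementation: the helper name drop_consecutive_repeats and its parameters are made up here, and it assumes END_DATE is stored as a string, as in the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'EMPL_ID': [12, 12, 34, 36, 34],
    'AGE': [23, 21, 22, 21, 22],
    'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
    'END_DATE': ['2022-04-22 00:00:00'] * 5,
})


def drop_consecutive_repeats(frame, date_value,
                             id_col='EMPL_ID', date_col='END_DATE'):
    """Hypothetical helper: drop a row when the row directly above it
    has the target date and the same id."""
    prev_date = frame[date_col].shift(1)  # date of the previous row (NaN for the first row)
    prev_id = frame[id_col].shift(1)      # id of the previous row (NaN for the first row)
    repeats_previous = (prev_date == date_value) & (prev_id == frame[id_col])
    # Return a filtered copy instead of mutating the frame during iteration
    return frame[~repeats_previous]


result = drop_consecutive_repeats(df, '2022-04-22 00:00:00')
print(result)
```

Unlike the grouping approach, this only looks at rows whose predecessor has the given date, which is closer to the condition in the question's loop. In a run of three or more identical consecutive IDs it keeps just the first row.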