How to resolve the following KeyError in my code while manipulating a DataFrame?

Let's say I have this DataFrame. I want to find the rows with a given END_DATE value and compare the employee ID at the current index with the one at the next index. If they are the same, I want to delete the row at the next index. I tried the code below, but I get KeyError: 1.

EMPL_ID	AGE	EMPLOYER	END_DATE
12	23	BHU	2022-04-22 00:00:00
12	21	BHU	2022-04-22 00:00:00
34	22	DU	2022-04-22 00:00:00
36	21	BHU	2022-04-22 00:00:00

for index, row in df.iterrows():
    value = row['END_DATE']
    if value == '2022-04-22 00:00:00':
        a = index
        if df.loc[a, 'EMPL_ID'] == df.loc[a + 1, 'EMPL_ID']:
            df.drop(a + 1, inplace=True)
        else:
            df = df
    else:
        df = df

CodePudding user response：

You can keep only the rows that are unique with respect to two columns by calling drop_duplicates and passing those columns as the subset:

>>> import pandas as pd
>>> df = pd.DataFrame(data={'id': [1, 2, 3, 3], 'time': [1, 2, 3, 3]})
>>> df
   id  time
0   1     1
1   2     2
2   3     3
3   3     3

>>> df.drop_duplicates(subset=['id', 'time'], inplace=True)
>>> df
   id  time
0   1     1
1   2     2
2   3     3
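
Applied to the columns from the question, this might look like the sketch below. Keep in mind that drop_duplicates removes every later repeat of an (EMPL_ID, END_DATE) pair, not only a repeat in the directly following row. To make that visible, the sample data here has a fifth row that is an assumption for illustration and not part of the question's data; the name deduped is also just illustrative.

```python
import pandas as pd

# Sample data: the question's four rows plus one extra row (index 4, an
# assumption for illustration) that repeats EMPL_ID 34 without being
# adjacent to index 2.
df = pd.DataFrame({
    'EMPL_ID': [12, 12, 34, 36, 34],
    'AGE': [23, 21, 22, 21, 22],
    'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
    'END_DATE': ['2022-04-22 00:00:00'] * 5,
})

# Keep the first row for each (EMPL_ID, END_DATE) pair, wherever the repeats are.
deduped = df.drop_duplicates(subset=['EMPL_ID', 'END_DATE'], keep='first')
print(deduped)
```

Rows 1 and 4 are both dropped here, so this only fits if non-adjacent repeats should go as well.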

CodePudding user response：

If I understand correctly, here is one approach. Since you only want to remove consecutive duplicates, I added another row to your example. The second row with "EMPL_ID" 34 won't be deleted, because it does not directly follow the first one.

import pandas as pd

df = pd.DataFrame({
    'EMPL_ID': [12, 12, 34, 36, 34],
    'AGE': [23, 21, 22, 21, 22],
    'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
    'END_DATE': ['2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00'],
})

print(df)
   EMPL_ID  AGE EMPLOYER             END_DATE
0       12   23      BHU  2022-04-22 00:00:00
1       12   21      BHU  2022-04-22 00:00:00
2       34   22       DU  2022-04-22 00:00:00
3       36   21      BHU  2022-04-22 00:00:00
4       34   22       DU  2022-04-22 00:00:00

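grp = ['END_DATE', 'EMPL_ID']
# Start a new group whenever END_DATE or EMPL_ID differs from the previous row
# (group ids for the rows above: 1, 1, 2, 3, 4)
df['group'] = (df[grp] != df[grp].shift(1)).any(axis=1).cumsum()
# Keep the first row of each run of consecutive duplicates, then drop the helper column
df = df.drop_duplicates('group', keep='first').drop('group', axis=1)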

print(df)
   EMPL_ID  AGE EMPLOYER             END_DATE
0       12   23      BHU  2022-04-22 00:00:00
2       34   22       DU  2022-04-22 00:00:00
3       36   21      BHU  2022-04-22 00:00:00
4       34   22       DU  2022-04-22 00:00:00
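
As for the KeyError itself: it most likely comes from dropping rows with inplace=True while iterrows() is still walking the original index. After row 1 is dropped, the loop still reaches label 1 and .loc can no longer find it. On the last row, the label a + 1 does not exist at all, which would fail in the same way. A loop-free sketch of the rule described in the question (drop a row when the row directly before it has the target END_DATE and the same EMPL_ID) could look like this. It assumes END_DATE is stored as strings, and the names target, is_dup and result are only illustrative.

```python
import pandas as pd

# The four rows from the question, with END_DATE assumed to be strings.
df = pd.DataFrame({
    'EMPL_ID': [12, 12, 34, 36],
    'AGE': [23, 21, 22, 21],
    'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU'],
    'END_DATE': ['2022-04-22 00:00:00'] * 4,
})

target = '2022-04-22 00:00:00'

# Compare each row with the row directly above it.
prev_date = df['END_DATE'].shift(1)
prev_id = df['EMPL_ID'].shift(1)

# True where the previous row has the target date and the same EMPL_ID.
# The first row has no predecessor (NaN), so it is never marked.
is_dup = (prev_date == target) & (prev_id == df['EMPL_ID'])

# Keep everything that is not marked; no rows are dropped while iterating.
result = df[~is_dup]
print(result)
```

Unlike the group-based version, this only looks at END_DATE of the preceding row, and a run of more than two equal rows collapses to its first row.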