Need to update duplicate with existing rows-CodePudding

I need to update(clean) my data by following conditions

Check if PS-id has duplicate values.
Check which has the latest Date

Update the old row which has the latest Date and new Cam with the old Cam

ID       Cam                PS-id             Date
A1       Alto-Car           A1222             1.1.2022     
B1       BMW-Car            A123              1.1.2022
A5       Car-Alto           A1222             5.1.2022

The ideal output should be like this

ID       Cam                PS-id             Date
A1       Car-Alto           A1222             5.1.2022     
B1       BMW-Car            A123              1.1.2022

I have tried few solutions but cant do based on my conditions

CodePudding user response：

Convert values to datetimes in column Date, then sorting per both columns, so first row is old rows created by DataFrame.drop_duplicates and converting to Series used for reassign ID by rows with maximal vaues per PS-id:

df['Date'] = pd.to_datetime(df['Date'], format='%m.%d.%Y')
df = df.sort_values(['PS-id','Date'])

s = df.drop_duplicates('PS-id').set_index('PS-id')['ID']
df = df.drop_duplicates('PS-id', keep='last').assign(ID = lambda x: x['PS-id'].map(s))
print (df)
   ID       Cam  PS-id       Date
2  A1  Car-Alto  A1222 2022-05-01
1  B1   BMW-Car   A123 2022-01-01

CodePudding user response：

IIUC, you want a combination of groupby idxmax and groupby first, that you merge:

# get indices of latest (max) dates
# here we keep the string format
# if you want to convert to datetime
# df['Date'] = pd.to_datetime(df['Date'])
idx = pd.to_datetime(df['Date']).groupby(df['PS-id']).idxmax().values

# slice the rows with max dates
(df.loc[idx]
   # and merge back first id
   .merge(df.groupby('PS-id')['ID'].first(),
          left_on='PS-id', right_index=True, suffixes=('_', ''))
   .drop('ID_', axis=1)[df.columns]
)

output:

   ID       Cam  PS-id      Date
2  A1  Car-Alto  A1222  5.1.2022
1  B1   BMW-Car   A123  1.1.2022