I need to update(clean) my data by following conditions
Check if PS-id has duplicate values.
Check which has the latest Date
Update the old row which has the latest Date and new Cam with the old Cam
ID Cam PS-id Date A1 Alto-Car A1222 1.1.2022 B1 BMW-Car A123 1.1.2022 A5 Car-Alto A1222 5.1.2022
The ideal output should be like this
ID Cam PS-id Date
A1 Car-Alto A1222 5.1.2022
B1 BMW-Car A123 1.1.2022
I have tried few solutions but cant do based on my conditions
CodePudding user response:
Convert values to datetimes in column Date
, then sorting per both columns, so first row is old rows created by DataFrame.drop_duplicates
and converting to Series
used for reassign ID
by rows with maximal vaues per PS-id
:
df['Date'] = pd.to_datetime(df['Date'], format='%m.%d.%Y')
df = df.sort_values(['PS-id','Date'])
s = df.drop_duplicates('PS-id').set_index('PS-id')['ID']
df = df.drop_duplicates('PS-id', keep='last').assign(ID = lambda x: x['PS-id'].map(s))
print (df)
ID Cam PS-id Date
2 A1 Car-Alto A1222 2022-05-01
1 B1 BMW-Car A123 2022-01-01
CodePudding user response:
IIUC, you want a combination of groupby
idxmax
and groupby
first
, that you merge
:
# get indices of latest (max) dates
# here we keep the string format
# if you want to convert to datetime
# df['Date'] = pd.to_datetime(df['Date'])
idx = pd.to_datetime(df['Date']).groupby(df['PS-id']).idxmax().values
# slice the rows with max dates
(df.loc[idx]
# and merge back first id
.merge(df.groupby('PS-id')['ID'].first(),
left_on='PS-id', right_index=True, suffixes=('_', ''))
.drop('ID_', axis=1)[df.columns]
)
output:
ID Cam PS-id Date
2 A1 Car-Alto A1222 5.1.2022
1 B1 BMW-Car A123 1.1.2022