I have the following dataframe:
Year-Week Case Team Case Info
2022-42 540 Finance OTHER
2022-42 540 IT ACCEPTED
2022-42 480 Sales OTHER
2022-42 480 Finance CHECK CASE
2022-42 480 IT OTHER
2022-42 465 IT OTHER
2022-42 465 Sales OTHER
2022-42 310 Legal VALIDATED
2022-42 310 Marketing OTHER
As you can see, there are duplicate values in the "Case" section, this is because each case can belong to a different team and each team can categorize the case differently in the "Case Info" section.
My problem is that I would like my python code to take the duplicate "Case" cases and when they have a different value in "Case Info", any other value other than "OTHER" prevails, I attach an example:
Year-Week Case Team Case Info
2022-42 540 Finance ACCEPTED
2022-42 540 IT ACCEPTED
2022-42 480 Sales CHECK CASE
2022-42 480 Finance CHECK CASE
2022-42 480 IT CHECK CASE
2022-42 465 IT OTHER
2022-42 465 Sales OTHER
2022-42 310 Legal VALIDATED
2022-42 310 Marketing VALIDATED
I've been trying using commands like:
df['Case Info'] = df.groupby('Case')['Case Info'].transform('max')
But that only leads me to replace one value over the other, leveraging the fact that the initial letter in the ASCII table is applied. Having said that, that only makes half of the job, since is only doing it for just one value vs 'OTHER' that comes at last.
CodePudding user response:
Not exactly, the most efficient way but this method chaining should work buddy
df.groupby('Case')['Case Info'].apply(lambda x : x.replace('OTHER',np.nan).bfill().ffill()) # This should return your wanted series