Home > Software engineering >  Cleaning Dataframe - Manage Duplicate Dominance
Cleaning Dataframe - Manage Duplicate Dominance

Time:10-14

I have a dataframe as following:

Year-Week   Ticket  Policy       User
2022-35     12502   COMPLIED      MX 
2022-35     12502   NOT COMPLIED  US
2022-35     12502   COMPLIED      AL
2022-35     12502   COMPLIED      BR
2022-35     12471   COMPLIED      DE
2022-35     12409   COMPLIED      US
2022-35     12409   NOT COMPLIED  BR
2022-35     12406   COMPLIED      MX 

There are cases that are duplicated and that is okay since they are from different Users, however, how could I standardize that the same ticket has the same policy? E.g: If a ticket in a week has "NOT COMPLIED", the policy should be the same for that ticket no matter the week, so, should be something like this:

Output Sample

Year-Week   Ticket  Policy       User
2022-35     12502   NOT COMPLIED  MX 
2022-35     12502   NOT COMPLIED  US
2022-35     12502   NOT COMPLIED  AL
2022-35     12502   NOT COMPLIED  BR
2022-35     12471   COMPLIED      DE
2022-35     12409   NOT COMPLIED  US
2022-35     12409   NOT COMPLIED  BR
2022-35     12406   COMPLIED      MX 

Basically what I want is the dominance of NOT COMPLIED over COMPLIED when there are duplicated Tickets.

CodePudding user response:

Leverage the fact that "N" comes after "C" on the ASCII table. So max() would take "NOT COMPLIED" over "COMPLIED". As you want to standardize it over Tickets, call max() as groupby().transform().

df['Policy'] = df.groupby('Ticket')['Policy'].transform('max')
  Year-Week  Ticket        Policy User
0   2022-35   12502  NOT COMPLIED   MX
1   2022-35   12502  NOT COMPLIED   US
2   2022-35   12502  NOT COMPLIED   AL
3   2022-35   12502  NOT COMPLIED   BR
4   2022-35   12471      COMPLIED   DE
5   2022-35   12409  NOT COMPLIED   US
6   2022-35   12409  NOT COMPLIED   BR
7   2022-35   12406      COMPLIED   MX

CodePudding user response:

Try this piece of code;

df1 = df[['Year-Week', 'Ticket']].drop_duplicates()

policy = []
for time,ticket in zip(df1['Year-Week'],df1['Ticket']):
    df_temp = df[(df["Year-Week"] == time) & (df["Ticket"] == ticket)]
    if "NOT COMPLIED" in df_temp["Policy"].to_list():
        policy.extend(["NOT COMPLIED"] * df_temp.shape[0])
    else:
        policy.extend(["COMPLIED"] * df_temp.shape[0])
        
df["Policy"] = policy

Hope this helps...

CodePudding user response:

You can do a groupby followed by a transfrom and set it to the Policy column;

df['Policy'] = df.groupby(['Year-Week','Ticket']).transform(lambda x: 'NOT COMPLIED'if 'NOT COMPLIED' in x else 'COMPLIED')
  • Related