Home > Blockchain >  Replace duplicate value with the value from another column on each row where a duplicate is located
Replace duplicate value with the value from another column on each row where a duplicate is located

Time:03-02

Say I have the following dataframe (Duplicate ID 1 and 3):

ID    Name    ALT_ID
1     Jack    111
1     James   222
2     Joe     333
3     Jim     444
3     Jen     555

How do I replace duplicate ID with ALT_ID for each occurrence? I would want the final dataframe to look like this:

ID    Name    ALT_ID
1     Jack    111
222   James   222
2     Joe     333
3     Jim     444
555   Jen     555

This will be a massive dataframe but long running time is not really an issue. Please let me know if there is any more information I can provide, thanks!

I've been using 'pandas' so far so any functions that would help me from that library would be a big bonus!

CodePudding user response:

Just use pandas.DataFrame.duplicated, a method of your dataframe to locate which values are dupes in your "ID" column. Then use the same rows but take the value in "ALT_ID":

>>> df.loc[df["ID"].duplicated(), "ID"] = df.loc[df["ID"].duplicated(), "ALT_ID"]
>>> df
    ID   Name  ALT_ID
0    1   Jack     111
1  222  James     222
2    2    Joe     333
3    3    Jim     444
4  555    Jen     555

CodePudding user response:

Without overriding anything in df:

df2 = df.assign(ID=df['ID'].where(~df.duplicated(['ID']), df['ALT_ID']))
>>> df2
    ID   Name  ALT_ID
0    1   Jack     111
1  222  James     222
2    2    Joe     333
3    3    Jim     444
4  555    Jen     555

Or, in place:

mask = df.duplicated('ID')
df.loc[mask, 'ID'] = df.loc[mask, 'ALT_ID']
  • Related