Say I have the following dataframe (Duplicate ID
1 and 3):
ID Name ALT_ID
1 Jack 111
1 James 222
2 Joe 333
3 Jim 444
3 Jen 555
How do I replace duplicate ID
with ALT_ID
for each occurrence? I would want the final dataframe to look like this:
ID Name ALT_ID
1 Jack 111
222 James 222
2 Joe 333
3 Jim 444
555 Jen 555
This will be a massive dataframe but long running time is not really an issue. Please let me know if there is any more information I can provide, thanks!
I've been using 'pandas' so far so any functions that would help me from that library would be a big bonus!
CodePudding user response:
Just use pandas.DataFrame.duplicated
, a method of your dataframe to locate which values are dupes in your "ID" column. Then use the same rows but take the value in "ALT_ID":
>>> df.loc[df["ID"].duplicated(), "ID"] = df.loc[df["ID"].duplicated(), "ALT_ID"]
>>> df
ID Name ALT_ID
0 1 Jack 111
1 222 James 222
2 2 Joe 333
3 3 Jim 444
4 555 Jen 555
CodePudding user response:
Without overriding anything in df
:
df2 = df.assign(ID=df['ID'].where(~df.duplicated(['ID']), df['ALT_ID']))
>>> df2
ID Name ALT_ID
0 1 Jack 111
1 222 James 222
2 2 Joe 333
3 3 Jim 444
4 555 Jen 555
Or, in place:
mask = df.duplicated('ID')
df.loc[mask, 'ID'] = df.loc[mask, 'ALT_ID']