How can I update duplicate rows with first record's RecID? (Python/Pandas)

I have a fairly large set of records (~200,000) that I need to dedupe, but there is a catch. Most of these records are related to another record type, and this relationship needs to be moved to the unique record left after the dedupe process is complete.

To keep things simple, I have 2 columns, 'RecID' and 'Name' and will be adding a 3rd to the results called 'ReplaceWith'. The data looks something like the following:

RecID	Name
111111	example1
222222	example1
333333	example2
444444	example2
555555	example2

When a 'Name' value is duplicated, I want to populate the 'ReplaceWith' column with the 'RecID' value of the first record that has that name. What I am looking to accomplish will look something like the following:

RecID	Name	ReplaceWith
111111	example1
222222	example1	111111
333333	example2
444444	example2	333333
555555	example2	333333

I am fairly new to programming, so this may be really simple, but I cannot figure out how to make this happen programmatically. On smaller data sets, I would simply use Excel to manually sort, filter, and update the records, but that method just won't scale to this level.

I would appreciate any recommendations on how to do this.

CodePudding user response：

If your RecID datatype is an int or float, you could do something like this:

import pandas as pd

t1 = [1111,2222,3333,4444,5555]
t1_name = ['example1', 'example1','example2','example2','example2']

df = pd.DataFrame({'rec_id': t1, 'name': t1_name})

df.sort_values(by=['rec_id', 'name'], inplace=True)

# Get values for `name` that occur more than once
names = df.groupby('name').count().reset_index()
names.rename(columns={'rec_id':'rec_count'}, inplace=True)
arr = names.loc[names.rec_count>1, 'name'].to_list()

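# For each duplicated name, point every record except the one with the
# lowest rec_id at that lowest rec_id
for n in arr:
    min_id = df.loc[df.name==n, 'rec_id'].min()
    df.loc[(df.name==n) & (df.rec_id!=min_id), 'replace_with'] = min_id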

This might be a bit slow depending on how many names you have to loop over.
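
If the loop turns out to be too slow, the same result can probably be had without looping. This is only a minimal sketch, assuming the same rec_id and name columns as above and that the lowest rec_id in each group is the one to keep. It relies on groupby(...).transform('min'), which broadcasts each group's minimum back onto every row of that group:

```python
import pandas as pd

df = pd.DataFrame({
    'rec_id': [1111, 2222, 3333, 4444, 5555],
    'name': ['example1', 'example1', 'example2', 'example2', 'example2'],
})

# Lowest rec_id within each name, repeated on every row of that name
min_id = df.groupby('name')['rec_id'].transform('min')

# Keep that value only on rows that are not themselves the lowest rec_id;
# all other rows (the records being kept) get NaN
df['replace_with'] = min_id.where(df['rec_id'] != min_id)

print(df)
```

Because of the NaN values, replace_with comes out as a float column here; names that occur only once simply get NaN.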

CodePudding user response：

You could write a function that works on each group of Names (shift is used here) and apply it to the groupby:

def add_replacement(df):
    df['ReplaceWith'] = df['RecID'].shift(1)
    return df

df.groupby('Name').apply(add_replacement)

RecID	Name	ReplaceWith
111111	example1	nan
222222	example1	111111
333333	example2	nan
444444	example2	333333
555555	example2	444444
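
Note that shift(1) gives each row the RecID of the previous row in its group, so 555555 ends up pointing at 444444 rather than 333333. If every duplicate should point at the first record of its group, as in the question, one possible variation is sketched below. It assumes the rows are already in the order you want (the first row of each Name is the one to keep), and the names groups and first_id are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'RecID': [111111, 222222, 333333, 444444, 555555],
    'Name': ['example1', 'example1', 'example2', 'example2', 'example2'],
})

groups = df.groupby('Name', sort=False)

# RecID of the first row in each Name group, repeated on every row
first_id = groups['RecID'].transform('first')

# cumcount() numbers the rows inside each group starting at 0, so
# "> 0" selects every row except the first one of its group.
# The nullable Int64 dtype keeps the IDs as integers despite the gaps.
df['ReplaceWith'] = first_id.where(groups.cumcount() > 0).astype('Int64')

print(df)
```

Rows where ReplaceWith is empty are the records to keep; the rest tell you which RecID their related records should be moved to.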