How can I update duplicate rows with first record's RecID? (Python/Pandas)


I have a fairly large set of records (~200,000) that I need to dedupe, but there is a catch. Most of these records are related to another record type, and this relationship needs to be moved to the unique record left after the dedupe process is complete.

To keep things simple, I have two columns, 'RecID' and 'Name', and will be adding a third column to the results called 'ReplaceWith'. The data looks something like the following:

RecID Name
111111 example1
222222 example1
333333 example2
444444 example2
555555 example2

When the 'Name' column is duplicated, I want to populate the 'ReplaceWith' column with the 'RecID' value of the first item. What I am looking to accomplish will look something like the following:

RecID Name ReplaceWith
111111 example1
222222 example1 111111
333333 example2
444444 example2 333333
555555 example2 333333

I am pretty new to programming, so this may be really simple, but I cannot figure out how to make this happen programmatically. On smaller data sets, I would simply use Excel to manually sort, filter, and update the records, but that method just won't scale to this level.

I would appreciate any recommendations on how to do this.

CodePudding user response:

If your RecID datatype is an int or float, you could do something like this:

import pandas as pd

t1 = [1111, 2222, 3333, 4444, 5555]
t1_name = ['example1', 'example1', 'example2', 'example2', 'example2']

df = pd.DataFrame({'rec_id': t1, 'name': t1_name})

# Sort ascending by rec_id so the earliest record appears first for each name
df.sort_values(by=['rec_id', 'name'], inplace=True)

# Get values of `name` that occur more than once
names = df.groupby('name').count().reset_index()
names.rename(columns={'rec_id': 'rec_count'}, inplace=True)
arr = names.loc[names.rec_count > 1, 'name'].to_list()

# For each duplicated name, point every later record at the lowest rec_id
for n in arr:
    min_id = df.loc[df['name'] == n, 'rec_id'].min()
    df.loc[(df['name'] == n) & (df['rec_id'] != min_id), 'replace_with'] = min_id

This might be a bit slow depending on how many names you have to loop over.
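If the loop does become a bottleneck, a minimal vectorized sketch of the same idea (assuming the lowercase `rec_id`/`name` column names used above) could look like this:

# The smallest rec_id within each name group (the record to keep)
first_id = df.groupby('name')['rec_id'].transform('min')

# Only rows in a duplicated name that are not the kept record get a replacement
dupes = df.duplicated(subset='name', keep=False) & (df['rec_id'] != first_id)
df.loc[dupes, 'replace_with'] = first_id[dupes]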

CodePudding user response:

You could define a function that works on every group of Names (using `shift` here) and apply it to the groupby:

def add_replacement(df):
    df['ReplaceWith'] = df['RecID'].shift(1)
    return df

df.groupby('Name').apply(add_replacement)
RecID Name ReplaceWith
111111 example1 nan
222222 example1 111111
333333 example2 nan
444444 example2 333333
555555 example2 444444
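Note that `shift(1)` fills `ReplaceWith` with the previous record's RecID within each group (as the 555555 row shows above), whereas the question asks for the first record's RecID every time. A small variation that matches the requested output, assuming the capitalized `RecID`/`Name` column names from the question:

# The first RecID within each Name group
first_id = df.groupby('Name')['RecID'].transform('first')

# Fill ReplaceWith only for rows that are not the first occurrence of their Name
df['ReplaceWith'] = first_id.where(df.duplicated(subset='Name', keep='first'))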