I have a fairly large set of records (~200,000) that I need to dedupe, but there is a catch. Most of these records are related to another record type, and this relationship needs to be moved to the unique record left after the dedupe process is complete.
To keep things simple, I have 2 columns, 'RecID' and 'Name', and will be adding a 3rd to the results called 'ReplaceWith'. The data looks something like the following:
RecID | Name |
---|---|
111111 | example1 |
222222 | example1 |
333333 | example2 |
444444 | example2 |
555555 | example2 |
When a 'Name' value is duplicated, I want to populate the 'ReplaceWith' column with the 'RecID' value of the first record that has that name. What I am looking to accomplish will look something like the following:
RecID | Name | ReplaceWith |
---|---|---|
111111 | example1 | |
222222 | example1 | 111111 |
333333 | example2 | |
444444 | example2 | 333333 |
555555 | example2 | 333333 |
I am pretty new to programming, so this may be really simple, but I cannot figure out how to make this happen programmatically. On smaller data sets, I would simply use Excel to manually sort, filter, and update the records, but that method just won't scale to this level.
I would appreciate any recommendations on how to do this.
CodePudding user response:
If your RecID datatype is an int or float, you could do something like this:
import pandas as pd
t1 = [1111,2222,3333,4444,5555]
t1_name = ['example1', 'example1','example2','example2','example2']
df = pd.DataFrame({'rec_id': t1, 'name': t1_name})
df.sort_values(by=['rec_id', 'name'], inplace=True)
# Get values for `name` that occur more than once
names = df.groupby('name').count().reset_index()
names.rename(columns={'rec_id':'rec_count'}, inplace=True)
arr = names.loc[names.rec_count>1, 'name'].to_list()
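for n in arr:
    # Lowest rec_id for this name is the record that survives the dedupe
    min_id = df.loc[df.name==n, 'rec_id'].min()
    # Point every other record with this name at the surviving record
    df.loc[(df.name==n) & (df.rec_id!=min_id), 'replace_with'] = min_id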
This might be a bit slow depending on how many names you have to loop over.
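If the loop turns out to be too slow, the same "lowest rec_id per name" idea can probably be expressed without looping. The following is only a sketch under the same assumptions as above (numeric `rec_id`, column names `rec_id`, `name`, and `replace_with` taken from the snippet above); it relies on `groupby(...).transform('min')` to broadcast each group's minimum back onto every row.

```python
import pandas as pd

df = pd.DataFrame({
    'rec_id': [1111, 2222, 3333, 4444, 5555],
    'name': ['example1', 'example1', 'example2', 'example2', 'example2'],
})

# Smallest rec_id within each name, repeated on every row of that name
min_id = df.groupby('name')['rec_id'].transform('min')

# Keep the value only on rows that are not the surviving record;
# the surviving rows become NaN, so the column ends up as float
df['replace_with'] = min_id.where(df['rec_id'] != min_id)

print(df)
```

Names that occur only once are handled automatically, since their single row is its own minimum and is left empty.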
CodePudding user response:
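You could write a function that works on each group of names (shift is used here) and apply it on the groupby. Note that shift(1) fills each row with the RecID of the previous row in its group rather than the first one, so in a group of three the last row points at 444444 instead of 333333: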
def add_replacement(df):
    # Within one Name group, take the RecID from the row directly above
    df['ReplaceWith'] = df['RecID'].shift(1)
    return df
df.groupby('Name').apply(add_replacement)
RecID | Name | ReplaceWith |
---|---|---|
111111 | example1 | nan |
222222 | example1 | 111111 |
333333 | example2 | nan |
444444 | example2 | 333333 |
555555 | example2 | 444444 |
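To get the table from the question, where every duplicate points at the first record of its group, one possible variation is to take the first 'RecID' per 'Name' instead of shifting. This is a sketch rather than the answerer's method: it assumes the rows are already in the order that decides which record is "first", and the nullable `Int64` dtype is an optional choice made here so the IDs stay integers.

```python
import pandas as pd

df = pd.DataFrame({
    'RecID': [111111, 222222, 333333, 444444, 555555],
    'Name': ['example1', 'example1', 'example2', 'example2', 'example2'],
})

# First RecID seen for each Name, in the current row order
first_id = df.groupby('Name')['RecID'].transform('first')

# duplicated() is False for the first occurrence of a Name, True for later ones
is_dup = df['Name'].duplicated()

# Fill ReplaceWith only on the duplicate rows; first occurrences stay missing
df['ReplaceWith'] = first_id.astype('Int64').where(is_dup)

print(df)
```

With the sample data, 444444 and 555555 should both be mapped to 333333, and the first record of each name is left empty.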