Home > Blockchain >  De-duplication with merge of data
De-duplication with merge of data

Time:08-27

I have a dataset with duplicates, triplicates and more and I want to keep only one record of each unique with merge of data, for example:

id   name     address    age      city
1     Alex    123,blv    
1     Alex               13     
3     Alex               24     Florida
1     Alex                      Miami

Merging data using the id field: Output:

id   name     address    age      city
1     Alex    123,blv    13       Miami
3     Alex               24     Florida

CodePudding user response:

I've changed a bit the code from this answer.

Code to create the initial dataframe:

import pandas as pd
import numpy as np

d = {'id': [1,1,3,1], 
     'name': ["Alex", "Alex", "Alex", "Alex"], 
     'address': ["123,blv" , None, None, None], 
     'age': [None, 13, 24, None], 
     'city': [None, None, "Florida", "Miami"]
}
df = pd.DataFrame(data=d, index=d["id"])
print(df)

Output:

   id  name  address   age     city
1   1  Alex  123,blv   NaN     None
1   1  Alex     None  13.0     None
3   3  Alex     None  24.0  Florida
1   1  Alex     None   NaN    Miami

Aggregation code:

def get_notnull(x):
    if x.notnull().any():
        return  x[x.notnull()]
    else:
        return np.nan

aggregation_functions = {'name': 'first', 
                         'address': get_notnull, 
                         'age': get_notnull, 
                         'city': get_notnull
}
df = df.groupby(df['id']).aggregate(aggregation_functions)
print(df)

Output:

    name  address   age     city
id                              
1   Alex  123,blv  13.0    Miami
3   Alex      NaN  24.0  Florida

CodePudding user response:

(
    df
    .reset_index(drop=True)      # set unique index for eash record
    .drop('id', axis=1)          # exclude 'id' column from processing
    .groupby(df['id'])           # group by 'id'
    .agg(
        # return first non-NA/None value for each column
        lambda s: s.get(s.first_valid_index())
    )
    .reset_index()               # get back the 'id' value for each record
)

ps. As an option:

df.replace([None, ''], pd.NA).groupby('id').first().reset_index()
  • Related