Find duplicates in a pandas dataframe column that have any different value in other columns


I have a dataframe with a unique ID per person, along with their name, street, email and date. The dataframe may contain blanks and inconsistent values, as in this example:

ID  Name    Street        Email       Date
1   Paulo   street #1   [email protected]    2001-01-20
1   Paulo   street #1   [email protected]    20-Jan
1   Paulo   street #1   [email protected]    2001-01-20
2   Maria   street #2   [email protected]    2020-01-01
2   Mari    street #2   [email protected]    2020-01-01
3   Peter                  xx
4   Josh    street #4   [email protected]    
4   Josh    street #4   [email protected]    

I need a way to filter this dataframe by finding the IDs that repeat in the "ID" column and returning every row where any of the other columns differ. When that is the case, all rows with that ID should be returned, resulting in this dataframe:

ID  Name    Street        Email       Date
1   Paulo   street #1   [email protected]    2001-01-20
1   Paulo   street #1   [email protected]    20-Jan
1   Paulo   street #1   [email protected]    2001-01-20
2   Maria   street #2   [email protected]    2020-01-01
2   Mari    street #2   [email protected]    2020-01-01

What would be the best solution for it? Thank you for your help!

CodePudding user response:

Use DataFrameGroupBy.nunique via GroupBy.transform on all columns, keep rows where the count is greater than 1 for any column, and finally remove duplicates across all columns with DataFrame.drop_duplicates:

df = df[df.groupby('ID').transform('nunique').gt(1).any(axis=1)].drop_duplicates()
print (df)
   ID   Name     Street     Email        Date
0   1  Paulo  street #1  [email protected]  2001-01-20
1   1  Paulo  street #1  [email protected]      20-Jan
3   2  Maria  street #2  [email protected]  2020-01-01
4   2   Mari  street #2  [email protected]  2020-01-01
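As a self-contained sketch of the same one-liner, using placeholder data in the shape of the example (the emails are redacted in the original post, so the names and dates here are stand-ins):

```python
import pandas as pd

# Placeholder data mirroring the example; only a few columns are needed
# to demonstrate the idea.
df = pd.DataFrame({
    'ID':   [1, 1, 2, 2, 3, 4, 4],
    'Name': ['Paulo', 'Paulo', 'Maria', 'Mari', 'Peter', 'Josh', 'Josh'],
    'Date': ['2001-01-20', '20-Jan', '2020-01-01', '2020-01-01', 'xx', '', ''],
})

# nunique per group for every non-key column, broadcast back to each row;
# keep rows whose group has more than one distinct value in any column,
# then drop exact duplicate rows.
out = df[df.groupby('ID').transform('nunique').gt(1).any(axis=1)].drop_duplicates()
print(out)
```

Only the groups for IDs 1 and 2 survive: ID 3 appears once, and the two rows for ID 4 are identical in every column.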

CodePudding user response:

You can use nunique to count distinct elements, so that ID 4 is not returned:

df = df.groupby('ID').filter(
    lambda x: x['ID'].count() > 1 and (
        x['Name'].nunique() > 1
        or x['Street'].nunique() > 1
        or x['Email'].nunique() > 1
        or x['Date'].nunique() > 1
    ))
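If you would rather not list each column by hand, the same filter can be written generically: within a group the ID column always has exactly one distinct value, so checking every column is harmless. A runnable sketch with dummy data (the post's emails are redacted, so they are omitted here):

```python
import pandas as pd

# Dummy data standing in for the example in the question.
df = pd.DataFrame({
    'ID':   [1, 1, 2, 2, 4, 4],
    'Name': ['Paulo', 'Paulo', 'Maria', 'Mari', 'Josh', 'Josh'],
    'Date': ['2001-01-20', '20-Jan', '2020-01-01', '2020-01-01', '', ''],
})

# A group qualifies when it has more than one row and at least one
# column holds more than one distinct value within the group.
out = df.groupby('ID').filter(lambda g: len(g) > 1 and g.nunique().gt(1).any())
print(out)
```

This keeps both rows for IDs 1 and 2 and drops ID 4, whose rows agree in every column.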