I have a dataframe that has a unique ID per person, plus their name, street, email and date. The dataframe may contain blanks, and duplicate rows for the same ID may have slightly different values, as in this example:
ID  Name   Street     Email            Date
1   Paulo  street #1  [email protected]  2001-01-20
1   Paulo  street #1  [email protected]  20-Jan
1   Paulo  street #1  [email protected]  2001-01-20
2   Maria  street #2  [email protected]  2020-01-01
2   Mari   street #2  [email protected]  2020-01-01
3   Peter  xx
4   Josh   street #4  [email protected]
4   Josh   street #4  [email protected]
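For reproducibility, the frame above can be rebuilt like this (a minimal sketch; the blank cells are assumed to be NaN):

import pandas as pd
import numpy as np

# Example data copied from the table above; blanks are assumed to be NaN
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3, 4, 4],
    'Name': ['Paulo', 'Paulo', 'Paulo', 'Maria', 'Mari', 'Peter', 'Josh', 'Josh'],
    'Street': ['street #1', 'street #1', 'street #1', 'street #2', 'street #2',
               'xx', 'street #4', 'street #4'],
    'Email': ['[email protected]', '[email protected]', '[email protected]',
              '[email protected]', '[email protected]', np.nan,
              '[email protected]', '[email protected]'],
    'Date': ['2001-01-20', '20-Jan', '2001-01-20', '2020-01-01', '2020-01-01',
             np.nan, np.nan, np.nan],
})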
I need a way to filter this dataframe by finding the IDs that repeat in the "ID" column and checking whether their rows differ in any of the other columns. If they do, return all rows with that ID, resulting in this dataframe:
ID  Name   Street     Email            Date
1   Paulo  street #1  [email protected]  2001-01-20
1   Paulo  street #1  [email protected]  20-Jan
1   Paulo  street #1  [email protected]  2001-01-20
2   Maria  street #2  [email protected]  2020-01-01
2   Mari   street #2  [email protected]  2020-01-01
What would be the best solution for this? Thank you for your help!
CodePudding user response:
Use DataFrameGroupBy.nunique in GroupBy.transform for all columns, keep the rows where the number of distinct values is greater than 1 in any column, and finally remove duplicates across all columns with DataFrame.drop_duplicates:
df = df[df.groupby('ID').transform('nunique').gt(1).any(axis=1)].drop_duplicates()
print (df)
   ID   Name     Street            Email        Date
0   1  Paulo  street #1  [email protected]  2001-01-20
1   1  Paulo  street #1  [email protected]      20-Jan
3   2  Maria  street #2  [email protected]  2020-01-01
4   2   Mari  street #2  [email protected]  2020-01-01
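If the one-liner is hard to parse, the same logic can be split into its intermediate steps (nothing new, just the code above unpacked):

# Distinct values per column (excluding 'ID') within each ID group,
# broadcast back to every row of that group
counts = df.groupby('ID').transform('nunique')

# True for rows whose group differs in at least one column
mask = counts.gt(1).any(axis=1)

# Keep those rows, then drop exact duplicate rows
df = df[mask].drop_duplicates()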
CodePudding user response:
You can use nunique to count the distinct elements per column within each group, so that ID 4 (whose rows are identical) is not returned:
df = df.groupby('ID').filter(
    lambda x: x['ID'].count() > 1 and (
        x['Name'].nunique() > 1
        or x['Street'].nunique() > 1
        or x['Email'].nunique() > 1
        or x['Date'].nunique() > 1
    ))
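If you prefer not to list every column by hand, an equivalent version (a sketch, same idea) checks all non-ID columns at once. Note that, unlike the first answer, this keeps duplicated rows within a group (three rows for ID 1), which matches the expected output in the question:

df = df.groupby('ID').filter(
    # keep groups with more than one row where any non-ID column has >1 distinct value
    lambda x: len(x) > 1 and x.drop(columns='ID').nunique().gt(1).any())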