Home > Software design >  Identify duplicated rows with different value in another column pandas dataframe
Identify duplicated rows with different value in another column pandas dataframe

Time:12-13

Suppose I have a dataframe of names and countries:

ID  FirstName   LastName    Country
1   Paulo       Cortez      Brasil
2   Paulo       Cortez      Brasil
3   Paulo       Cortez      Espanha
4   Maria       Lurdes      Espanha
5   Maria       Lurdes      Espanha
6   John        Page        USA
7   Felipe      Cardoso     Brasil
8   John        Page        USA
9   Felipe      Cardoso     Espanha
10  Steve       Xis         UK

I need a way to identify all people that have the same firstname and lastname that appears more than once in the dataframe but at least one of the records appears belonging to another country and return all duplicated rows. This way resulting in this dataframe:

ID  FirstName   LastName    Country
1   Paulo       Cortez      Brasil
2   Paulo       Cortez      Brasil
3   Paulo       Cortez      Espanha
7   Felipe      Cardoso     Brasil
9   Felipe      Cardoso     Espanha

What would be the best way to achieve it?

CodePudding user response:

A possible solution, based on DataFrameGroupBy.filter:

(df.groupby(['FirstName', 'LastName'])
 .filter(lambda x: x['Country'].nunique() > 1)
 .reset_index(drop=True))

Output:

   ID FirstName LastName  Country
0   1     Paulo   Cortez   Brasil
1   2     Paulo   Cortez   Brasil
2   3     Paulo   Cortez  Espanha
3   7    Felipe  Cardoso   Brasil
4   9    Felipe  Cardoso  Espanha

CodePudding user response:

First drop duplicates from your pandas dataframe:

df = df.drop_duplicates()

Group by FirstName and LastName to count the number of times a given first and last name pair is associated with a different country:

new_df = df.groupby(['FirstName', 'LastName']).size().reset_index(name='counts')

Then keep only rows for which count is larger than 1:

new_df=new_df[new_df.counts > 1]

You can then merge your initial df with the new_df on FirstName and LastName:

pd.merge(df, new_df, on=['FirstName', 'LastName'])

This returns:

    FirstName   LastName    Country     counts
0   Paulo       Cortez      Brasil           3
1   Paulo       Cortez      Brasil           3
2   Paulo       Cortez      Espanha          3
3   Felipe      Cardoso     Brasil           2
4   Felipe      Cardoso     Espanha          2

CodePudding user response:

Use boolean indexing:

# is the name present in several countries?
m = df.groupby(['FirstName', 'LastName'])['Country'].transform('nunique').gt(1)

out = df.loc[m]

Output:

   ID FirstName LastName  Country
0   1     Paulo   Cortez   Brasil
1   2     Paulo   Cortez   Brasil
2   3     Paulo   Cortez  Espanha
6   7    Felipe  Cardoso   Brasil
8   9    Felipe  Cardoso  Espanha
  • Related