Suppose I have a dataframe of names and countries:
ID FirstName LastName Country
1 Paulo Cortez Brasil
2 Paulo Cortez Brasil
3 Paulo Cortez Espanha
4 Maria Lurdes Espanha
5 Maria Lurdes Espanha
6 John Page USA
7 Felipe Cardoso Brasil
8 John Page USA
9 Felipe Cardoso Espanha
10 Steve Xis UK
I need a way to identify all people that have the same firstname and lastname that appears more than once in the dataframe but at least one of the records appears belonging to another country and return all duplicated rows. This way resulting in this dataframe:
ID FirstName LastName Country
1 Paulo Cortez Brasil
2 Paulo Cortez Brasil
3 Paulo Cortez Espanha
7 Felipe Cardoso Brasil
9 Felipe Cardoso Espanha
What would be the best way to achieve it?
CodePudding user response:
A possible solution, based on DataFrameGroupBy.filter
:
(df.groupby(['FirstName', 'LastName'])
.filter(lambda x: x['Country'].nunique() > 1)
.reset_index(drop=True))
Output:
ID FirstName LastName Country
0 1 Paulo Cortez Brasil
1 2 Paulo Cortez Brasil
2 3 Paulo Cortez Espanha
3 7 Felipe Cardoso Brasil
4 9 Felipe Cardoso Espanha
CodePudding user response:
First drop duplicates from your pandas dataframe:
df = df.drop_duplicates()
Group by FirstName
and LastName
to count the number of times a given first and last name pair is associated with a different country:
new_df = df.groupby(['FirstName', 'LastName']).size().reset_index(name='counts')
Then keep only rows for which count is larger than 1:
new_df=new_df[new_df.counts > 1]
You can then merge your initial df
with the new_df
on FirstName
and LastName
:
pd.merge(df, new_df, on=['FirstName', 'LastName'])
This returns:
FirstName LastName Country counts
0 Paulo Cortez Brasil 3
1 Paulo Cortez Brasil 3
2 Paulo Cortez Espanha 3
3 Felipe Cardoso Brasil 2
4 Felipe Cardoso Espanha 2
CodePudding user response:
Use boolean indexing:
# is the name present in several countries?
m = df.groupby(['FirstName', 'LastName'])['Country'].transform('nunique').gt(1)
out = df.loc[m]
Output:
ID FirstName LastName Country
0 1 Paulo Cortez Brasil
1 2 Paulo Cortez Brasil
2 3 Paulo Cortez Espanha
6 7 Felipe Cardoso Brasil
8 9 Felipe Cardoso Espanha