Home > database >  Finding a mismatch between two correlated columns in pandas dataframe
Finding a mismatch between two correlated columns in pandas dataframe

Time:06-06

I have a large pandas dataframe which has two columns. One column is the unique code for an item and the second column contains the name of that item.

For example purposes lets say 'ID' has a unique code for an animal and 'Name' has the name of that animal.

df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE'], 
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog']})


    ID      Name
0   AA      Ape
1   BB      Bull
2   CC      Chimp
3   DD      Dog
4   EE      Dog

When I run the code below I will get false because there is one more unique item in the ID column than in the Name column

len(df['Code'].unique()) == len(df['Name'].unique())

My question is how can I find out which animal is being represented by two codes. Ideally I would like to end up with a dataframe that has:

    ID  Name
3   DD  Dog
4   EE  Dog

But as long as I can identify the animal being represented by more than one ID, I am happy with whatever solution gets the desired result.

Thanks

EDIT:

The dataframe contains duplicate rows, so a more realistic example that resembles the actual problem is a dataframe that looks like the one below:

    ID      Name
0   AA      Ape
1   BB      Bull
2   CC      Chimp
3   DD      Dog
4   EE      Dog
5   CC      Chimp
6   AA      Ape
7   DD      Dog
8   FF      Frog

df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE', 'CC', 'AA', 'DD', 'FF'], 
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog', 'Chimp', 'Ape', 'Dog', 'Frog']})

Desired output is still to find which ID has been matched to two different animal names

    ID  Name
3   DD  Dog
4   EE  Dog

CodePudding user response:

For a dataframe like the second example (containing duplicate rows) with a slightly augmented example

import pandas as pd

df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE', 'CC', 'AA', 'DD', 'FF', 'AA'], 
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog', 'Chimp', 'Ape', 'Dog', 'Frog','Bull']})
#    ID   Name
# 0  AA    Ape
# 1  BB   Bull
# 2  CC  Chimp
# 3  DD    Dog
# 4  EE    Dog
# 5  CC  Chimp
# 6  AA    Ape
# 7  DD    Dog
# 8  FF   Frog
# 9  AA   Bull

we can use the following. The result is sorted to make it easier to identify ID by Name.

df.groupby('Name').filter(lambda x: x['ID'].nunique() > 1).drop_duplicates(subset=['ID','Name']).sort_values('Name')

Output

   ID  Name
1  BB  Bull
9  AA  Bull
3  DD   Dog
4  EE   Dog

For dataframes with unique rows @ShubhamSharma's answer is much faster.

CodePudding user response:

Use groupby.filter to filter which groups of names have ID count greater than or equal to 2.

>>> df.groupby('Name').filter(lambda s: s['ID'].count() >= 2)

    ID  Name
3   DD  Dog
4   EE  Dog

CodePudding user response:

If the column ID always contains unique values, then you can simply use:

df[df.duplicated('Name', keep=False)]

   ID Name
3  DD  Dog
4  EE  Dog

CodePudding user response:

You can do group by value counts

df.Name.value_counts()
Dog      2
Ape      1
Bull     1
Chimp    1
Name: Name, dtype: int64

CodePudding user response:

If you just want all rows where a Name has more than one ID you can use:

df.groupby(['Name']).filter(lambda count: len(count) > 1)

CodePudding user response:

aaa = df['Name'].value_counts()
m = aaa >= 2
index = aaa[m].index[0]
print(df[df['Name'] == index])

Output

   ID Name
3  DD  Dog
4  EE  Dog
  • Related