I have a large pandas dataframe with two columns: one column holds a unique code for an item, and the second column holds the name of that item.
For example purposes, let's say 'ID' has a unique code for an animal and 'Name' has the name of that animal.
df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE'],
'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog']})
ID Name
0 AA Ape
1 BB Bull
2 CC Chimp
3 DD Dog
4 EE Dog
When I run the code below I get False, because there is one more unique item in the ID column than in the Name column:
len(df['ID'].unique()) == len(df['Name'].unique())
My question is: how can I find out which animal is represented by two codes? Ideally I would like to end up with a dataframe that has:
ID Name
3 DD Dog
4 EE Dog
But as long as I can identify the animal being represented by more than one ID, I am happy with whatever solution gets the desired result.
Thanks
EDIT:
The dataframe contains duplicate rows, so a more realistic example that resembles the actual problem is a dataframe that looks like the one below:
ID Name
0 AA Ape
1 BB Bull
2 CC Chimp
3 DD Dog
4 EE Dog
5 CC Chimp
6 AA Ape
7 DD Dog
8 FF Frog
df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE', 'CC', 'AA', 'DD', 'FF'],
'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog', 'Chimp', 'Ape', 'Dog', 'Frog']})
Desired output is still to find which animal name has been matched to two different IDs:
ID Name
3 DD Dog
4 EE Dog
CodePudding user response:
For a dataframe like the second example (containing duplicate rows), here is a slightly augmented example:
import pandas as pd
df = pd.DataFrame({'ID': ['AA','BB', 'CC', 'DD', 'EE', 'CC', 'AA', 'DD', 'FF', 'AA'],
'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog', 'Chimp', 'Ape', 'Dog', 'Frog','Bull']})
# ID Name
# 0 AA Ape
# 1 BB Bull
# 2 CC Chimp
# 3 DD Dog
# 4 EE Dog
# 5 CC Chimp
# 6 AA Ape
# 7 DD Dog
# 8 FF Frog
# 9 AA Bull
we can use the following. The result is sorted by Name so that the IDs sharing a name appear together.
df.groupby('Name').filter(lambda x: x['ID'].nunique() > 1).drop_duplicates(subset=['ID','Name']).sort_values('Name')
Output
ID Name
1 BB Bull
9 AA Bull
3 DD Dog
4 EE Dog
For dataframes with unique rows @ShubhamSharma's answer is much faster.
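If duplicate rows are the only complication, another option (a sketch, not taken from the answers above) is to collapse the repeated (ID, Name) pairs first and then apply the faster `duplicated`-based approach to the deduplicated frame:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['AA', 'BB', 'CC', 'DD', 'EE', 'CC', 'AA', 'DD', 'FF'],
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog', 'Chimp', 'Ape', 'Dog', 'Frog']})

# Collapse repeated (ID, Name) rows, then keep every Name that
# still appears with more than one distinct ID.
unique_pairs = df.drop_duplicates(subset=['ID', 'Name'])
result = unique_pairs[unique_pairs.duplicated('Name', keep=False)]
print(result)
#    ID Name
# 3  DD  Dog
# 4  EE  Dog
```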
CodePudding user response:
Use groupby.filter to keep the groups of names whose ID count is greater than or equal to 2.
>>> df.groupby('Name').filter(lambda s: s['ID'].count() >= 2)
ID Name
3 DD Dog
4 EE Dog
CodePudding user response:
If the column ID always contains unique values, then you can simply use:
df[df.duplicated('Name', keep=False)]
ID Name
3 DD Dog
4 EE Dog
CodePudding user response:
You can use value_counts to count the occurrences of each name:
df.Name.value_counts()
Dog 2
Ape 1
Bull 1
Chimp 1
Name: Name, dtype: int64
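The counts above can then be turned into the desired row subset. A short sketch (assuming, like the other count-based answers, that rows are unique):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog']})

# Names that occur more than once
counts = df['Name'].value_counts()
dup_names = counts[counts > 1].index

# Select every row whose Name is among the duplicated names
result = df[df['Name'].isin(dup_names)]
print(result)
#    ID Name
# 3  DD  Dog
# 4  EE  Dog
```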
CodePudding user response:
If you just want all rows where a Name has more than one ID, you can use:
df.groupby(['Name']).filter(lambda group: len(group) > 1)
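Applied to the question's first example, this gives (a quick check, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Name': ['Ape', 'Bull', 'Chimp', 'Dog', 'Dog']})

# Keep only the Name groups that contain more than one row
result = df.groupby(['Name']).filter(lambda group: len(group) > 1)
print(result)
#    ID Name
# 3  DD  Dog
# 4  EE  Dog
```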
CodePudding user response:
counts = df['Name'].value_counts()
m = counts >= 2
name = counts[m].index[0]
print(df[df['Name'] == name])
Output
ID Name
3 DD Dog
4 EE Dog
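Note that taking the first index only reports one duplicated name. A sketch that handles several duplicated names at once, using `isin` on hypothetical example data (and again assuming rows are unique):

```python
import pandas as pd

# Illustrative data in which two names (Bull and Dog) each have two IDs
df = pd.DataFrame({'ID': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF'],
                   'Name': ['Ape', 'Bull', 'Dog', 'Dog', 'Bull', 'Frog']})

counts = df['Name'].value_counts()
dup_names = counts[counts >= 2].index  # all duplicated names, not just the first

result = df[df['Name'].isin(dup_names)]
print(result)
#    ID  Name
# 1  BB  Bull
# 2  CC   Dog
# 3  DD   Dog
# 4  EE  Bull
```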