I have a DF with columns 'host_id' and 'host_name'. I need to find combinations that have more than one host_name associated with each host_id where host_id is the key.
grouped = df.groupby('host_id').aggregate({'host_name':list})
print(grouped)
Returns:
host_name
host_id
2787 [John, John, John, John, John]
2881 [Loli, Loli]
3151 [Eric]
3211 [Catherine]
3415 [Nataraj]
... ...
274195458 [Abayomi]
274273284 [Anastasia]
274298453 [Adrien]
274307600 [Jonathan]
274321313 [Kat]
So I need to find host_ids with unique host names > 1 in the host_name list
nunique() cannot work on a list. Is there another way to do this in python without looping through each value in every list for host_name?
CodePudding user response:
You can use drop_duplicates
:
sr = (df.drop_duplicates(['host_id', 'host_name'])
.value_counts('host_id').loc[lambda x: x > 1])
An output could be:
>>> sr
host_id
2718 3
1151 2
2415 2
dtype: int64
CodePudding user response:
Try:
df = pd.DataFrame(
{'host_id':[0, 1, 2, 3, 0],
'host_name':['John', 'John', 'Loli', 'Adrien', 'John']
})
df[~df.duplicated()]
Which gives:
host_id host_name
0 0 John
1 1 John
2 2 Loli
3 3 Adrien
Notice that the first two occurrences of John are preserved (different host_ids) but the last occurrence of John is dropped as the host_id is duplicated.
If you want to filter only on certain columns you can use the following syntax:
df[~df[['host_id', 'host_name']].duplicated()]