How to find unique occurences in a grouped DF column?-CodePudding

I have a DF with columns 'host_id' and 'host_name'. I need to find combinations that have more than one host_name associated with each host_id where host_id is the key.

grouped = df.groupby('host_id').aggregate({'host_name':list})
print(grouped)

Returns:

                                host_name
host_id                                  
2787       [John, John, John, John, John]
2881                         [Loli, Loli]
3151                               [Eric]
3211                          [Catherine]
3415                            [Nataraj]
...                                   ...
274195458                       [Abayomi]
274273284                     [Anastasia]
274298453                        [Adrien]
274307600                      [Jonathan]
274321313                           [Kat]

So I need to find host_ids with unique host names > 1 in the host_name list

nunique() cannot work on a list. Is there another way to do this in python without looping through each value in every list for host_name?

CodePudding user response：

You can use drop_duplicates:

sr = (df.drop_duplicates(['host_id', 'host_name'])
        .value_counts('host_id').loc[lambda x: x > 1])

An output could be:

>>> sr
host_id
2718    3
1151    2
2415    2
dtype: int64

CodePudding user response：

Try:

df = pd.DataFrame(
    {'host_id':[0, 1, 2, 3, 0],
    'host_name':['John', 'John', 'Loli', 'Adrien', 'John']
    })   

df[~df.duplicated()]

Which gives:

   host_id host_name
0        0      John
1        1      John
2        2      Loli
3        3    Adrien

Notice that the first two occurrences of John are preserved (different host_ids) but the last occurrence of John is dropped as the host_id is duplicated.

If you want to filter only on certain columns you can use the following syntax:

df[~df[['host_id', 'host_name']].duplicated()]