Consider this pandas dataframe:
df = pd.DataFrame({'animal': ["dog", "dog", "dog", "cat", "snake", "dog", "snake", "cat", "goat", "bird"],
'label': [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]})
I want to remove all entries wherever the label does not match all animals. For example, dog is labelled with '0' three times, but the fourth time it is labelled '1', so I would then like to remove all dog entries. For the other animals, they are all labelled correspondingly, and thus I would like to keep the rest, so that the remaining dataframe will be:
df2 = pd.DataFrame({'Column1': ["cat", "snake", "snake", "cat", "goat", "bird"],
'label': [1, 0, 0, 1, 1, 1]})
Any help is appreciated.
CodePudding user response:
Use groupby
transform('nunique')
to get the count of unique values per group.
If the count is 1, then keep the row using boolean indexing:
df2 = df[df.groupby('animal')['label'].transform('nunique').eq(1)]
output:
animal label
3 cat 1
4 snake 0
6 snake 0
7 cat 1
8 goat 1
9 bird 1
CodePudding user response:
I'd use transform
to get a series of unique values per group and compute a boolean mask by checking equality with 1.
mask = df.groupby('animal')['label'].transform('nunique').eq(1)
result = df[mask]
Details:
>>> df.groupby('animal')['label'].transform('nunique')
0 2
1 2
2 2
3 1
4 1
5 2
6 1
7 1
8 1
9 1
Name: label, dtype: int64
>>> df.groupby('animal')['label'].transform('nunique').eq(1)
0 False
1 False
2 False
3 True
4 True
5 False
6 True
7 True
8 True
9 True
Name: label, dtype: bool