Home > Blockchain >  Delete all dataframe entries which are identical in one column but different in another
Delete all dataframe entries which are identical in one column but different in another

Time:02-16

Consider this pandas dataframe:

df = pd.DataFrame({'animal': ["dog", "dog", "dog", "cat", "snake", "dog", "snake", "cat", "goat", "bird"],
               'label': [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]})

I want to remove all entries wherever the label does not match all animals. For example, dog is labelled with '0' three times, but the fourth time it is labelled '1', so I would then like to remove all dog entries. For the other animals, they are all labelled correspondingly, and thus I would like to keep the rest, so that the remaining dataframe will be:

df2 = pd.DataFrame({'Column1': ["cat", "snake", "snake", "cat", "goat", "bird"],
               'label': [1, 0, 0, 1, 1, 1]})

Any help is appreciated.

CodePudding user response:

Use groupby transform('nunique') to get the count of unique values per group.

If the count is 1, then keep the row using boolean indexing:

df2 = df[df.groupby('animal')['label'].transform('nunique').eq(1)]

output:

  animal  label
3    cat      1
4  snake      0
6  snake      0
7    cat      1
8   goat      1
9   bird      1

CodePudding user response:

I'd use transform to get a series of unique values per group and compute a boolean mask by checking equality with 1.

mask = df.groupby('animal')['label'].transform('nunique').eq(1)
result = df[mask]

Details:

>>> df.groupby('animal')['label'].transform('nunique')
0    2
1    2
2    2
3    1
4    1
5    2
6    1
7    1
8    1
9    1
Name: label, dtype: int64
>>> df.groupby('animal')['label'].transform('nunique').eq(1)
0    False
1    False
2    False
3     True
4     True
5    False
6     True
7     True
8     True
9     True
Name: label, dtype: bool
  • Related