Pandas check if two values appear at the same time but in different order and store them-CodePudding

I have a dataframe like

animal, small_animal, count
cat, dog, 1
cat, moo, 2
dog, cat, 3
moo, moo, 5
squirrel, moo, 1
moo, cat, 3

I would like to have stored together cat, dog and dog, cat

So I need to check if something both appears but in different "orders" in the column and keep the third column. I thought of multiple different dataframes or a dictionary. So far I did a groupby but I still can't work around other things.

CodePudding user response：

You could create a new column with the label

df["label_col"] = df[["animal", "small_animal"]].apply(
      lambda x: "-".join(sorted(x)), axis=1
)

"""
Output
     animal small_animal count    label_col
0       cat          dog     1       cat-dog
1       cat          moo     2       cat-moo
2       dog          cat     3       cat-dog
3       moo          moo     5       moo-moo
4  squirrel          moo     1  moo-squirrel
5       moo          cat     3       cat-moo
"""

Then you can group by or do whatever you want with the ordered key label_col

CodePudding user response：

You can try to compare the two columns joined in different order and filter out the same animals in two columns.

m = (df['animal'] df['small_animal']).isin(df['small_animal'] df['animal'])

out = df[m & df['animal'].ne(df['small_animal'])]

print(out)

  animal small_animal  count
0    cat          dog      1
1    cat          moo      2
2    dog          cat      3
5    moo          cat      3

CodePudding user response：

Records with different names which occur in reflected form

names = ['animal', 'small_animal']

# include all pairs of animal names which occur in reflected form
is_reflected = pd.Index(df[names]).isin(pd.Index(df[reversed(names)]))

# exclude records where names are duplicated, sort of ('moo', 'moo') pairs
is_different = df.animal != df.small_animal

# extract counts for records with reflected and different names
df[is_reflected & is_different]['count']