Home > Back-end >  Pandas check if two values appear at the same time but in different order and store them
Pandas check if two values appear at the same time but in different order and store them

Time:05-17

I have a dataframe like

animal, small_animal, count
cat, dog, 1
cat, moo, 2
dog, cat, 3
moo, moo, 5
squirrel, moo, 1
moo, cat, 3

I would like to have stored together cat, dog and dog, cat

So I need to check if something both appears but in different "orders" in the column and keep the third column. I thought of multiple different dataframes or a dictionary. So far I did a groupby but I still can't work around other things.

CodePudding user response:

You could create a new column with the label

df["label_col"] = df[["animal", "small_animal"]].apply(
      lambda x: "-".join(sorted(x)), axis=1
)

"""
Output
     animal small_animal count    label_col
0       cat          dog     1       cat-dog
1       cat          moo     2       cat-moo
2       dog          cat     3       cat-dog
3       moo          moo     5       moo-moo
4  squirrel          moo     1  moo-squirrel
5       moo          cat     3       cat-moo
"""

Then you can group by or do whatever you want with the ordered key label_col

CodePudding user response:

You can try to compare the two columns joined in different order and filter out the same animals in two columns.

m = (df['animal'] df['small_animal']).isin(df['small_animal'] df['animal'])

out = df[m & df['animal'].ne(df['small_animal'])]
print(out)

  animal small_animal  count
0    cat          dog      1
1    cat          moo      2
2    dog          cat      3
5    moo          cat      3

CodePudding user response:

Records with different names which occur in reflected form

names = ['animal', 'small_animal']

# include all pairs of animal names which occur in reflected form
is_reflected = pd.Index(df[names]).isin(pd.Index(df[reversed(names)]))

# exclude records where names are duplicated, sort of ('moo', 'moo') pairs
is_different = df.animal != df.small_animal

# extract counts for records with reflected and different names
df[is_reflected & is_different]['count']
  • Related