Home > database >  Pandas remove equivalent rows from dataframe where xy = yx
Pandas remove equivalent rows from dataframe where xy = yx

Time:01-31

I'm a beginner with pandas and I'm having a little problem.

I would like to remove equivalent entries in a pandas dataframe which uses repeated columns where the column _x and column _y are repeating data by changing it's order.

For example, i have the following code that merges a data frame with itself:

import pandas as pd

df = pd.DataFrame([
    ["mary","algebra"],
    ["mary","calculus"],
    ["john","algebra"],
    ["robert","calculus"]
    ],
    columns=["name","class"])

df = pd.merge(df,df,on="class").query("name_x != name_y")

print(df)

And the output is this:

   name_x     class  name_y
1    mary   algebra    john
2    john   algebra    mary
5    mary  calculus  robert
6  robert  calculus    mary

The problem is that some information is repeated, for example, line 1 and line 2 are equivalent because the ordering of the variable _x and _y doesn't matter for my problem, i would like if there's a way to remove the equivalent rows and turn the previous output into this one:

   name_x     class  name_y
1    mary   algebra    john
2    mary  calculus  robert

CodePudding user response:

Here is a possible solution using lambda:

df = pd.merge(df,df,on="class").query("name_x != name_y")
df["pair"] = df[["name_x", "name_y"]].apply(lambda x: tuple(sorted(x)), axis=1)
df = df.drop_duplicates(subset='pair').drop(columns='pair')

print(df)

  name_x     class  name_y
1   mary   algebra    john
5   mary  calculus  robert

CodePudding user response:

You can use duplicated with boolean indexing :

#is the pair of names duplicated ?
m = df.filter(like="name").apply(sorted, axis=1).duplicated()

df = df.loc[m]

Output :

print(df)

   name_x     class name_y
2    john   algebra   mary
6  robert  calculus   mary
  • Related