I'm a beginner with pandas and I'm having a little problem.
I would like to remove equivalent entries in a pandas dataframe which uses repeated columns where the column _x and column _y are repeating data by changing it's order.
For example, i have the following code that merges a data frame with itself:
import pandas as pd
df = pd.DataFrame([
["mary","algebra"],
["mary","calculus"],
["john","algebra"],
["robert","calculus"]
],
columns=["name","class"])
df = pd.merge(df,df,on="class").query("name_x != name_y")
print(df)
And the output is this:
name_x class name_y
1 mary algebra john
2 john algebra mary
5 mary calculus robert
6 robert calculus mary
The problem is that some information is repeated, for example, line 1 and line 2 are equivalent because the ordering of the variable _x and _y doesn't matter for my problem, i would like if there's a way to remove the equivalent rows and turn the previous output into this one:
name_x class name_y
1 mary algebra john
2 mary calculus robert
CodePudding user response:
Here is a possible solution using lambda
:
df = pd.merge(df,df,on="class").query("name_x != name_y")
df["pair"] = df[["name_x", "name_y"]].apply(lambda x: tuple(sorted(x)), axis=1)
df = df.drop_duplicates(subset='pair').drop(columns='pair')
print(df)
name_x class name_y
1 mary algebra john
5 mary calculus robert
CodePudding user response:
You can use duplicated
with boolean indexing :
#is the pair of names duplicated ?
m = df.filter(like="name").apply(sorted, axis=1).duplicated()
df = df.loc[m]
Output :
print(df)
name_x class name_y
2 john algebra mary
6 robert calculus mary