Pandas drop duplicate pair data in different columns

Time:04-06

Below is my data table, taken from my code output:

| columnA|ColumnB|ColumnC|
| ------ | ----- | ------|
|   12   | 8     | 1.34  |
|   8    | 12    | 1.34  |
|   1    | 7     | 0.25  |

I want to dedupe it and keep only:

| columnA|ColumnB|ColumnC|
| ------ | ----- | ------|
|   12   | 8     | 1.34  |
|   1    | 7     | 0.25  |

Usually, when I drop duplicates, I use .drop_duplicates(subset=). This time, though, I want to drop rows that contain the same pair in either order, i.e. treat (columnA, columnB) and (columnB, columnA) as duplicates. After some research, I found that some people use set((a,b) if a<=b else (b,a) for a,b in pairs) to remove duplicate pairs from a list, but I don't know how to apply this method to my pandas data frame. Please help, and thank you in advance!

CodePudding user response:

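You can combine the two columns into an order-independent tuple and call drop_duplicates on that combined column:

# sort each pair so that (12, 8) and (8, 12) produce the same key;
# tuple(set(row)) would not guarantee a consistent element order
t = df[["columnA", "ColumnB"]].apply(lambda row: tuple(sorted(row)), axis=1)
out = df.assign(t=t).drop_duplicates("t").drop(columns="t")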
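
For reference, the expression from the question can be reused almost as-is to build the key. This is only a sketch: the data frame below is rebuilt from the table in the question, and the names `key` and `out` are made up for illustration.

```python
import pandas as pd

# Data frame rebuilt from the table in the question (assumed values).
df = pd.DataFrame({
    "columnA": [12, 8, 1],
    "ColumnB": [8, 12, 7],
    "ColumnC": [1.34, 1.34, 0.25],
})

# Put each pair in ascending order, as in the question's expression,
# so that (12, 8) and (8, 12) both become (8, 12).
key = pd.Series(
    [(a, b) if a <= b else (b, a) for a, b in zip(df["columnA"], df["ColumnB"])],
    index=df.index,
)

# Keep the first row for every distinct key; the original values are untouched.
out = df[~key.duplicated()]
print(out)
```

Because the key lives in a separate Series, the kept row still shows its original order (12, 8) instead of the sorted one.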

CodePudding user response:

Convert relevant columns to frozenset:

out = df[~df[['columnA', 'ColumnB']].apply(frozenset, axis=1).duplicated()]
print(out)

# Output
   columnA  ColumnB  ColumnC
0       12        8     1.34
2        1        7     0.25

Details:

>>> set([8, 12])
{8, 12}

>>> set([12, 8])
{8, 12}
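
A self-contained version of the same idea, in case it helps to run it end to end. The data frame is rebuilt from the question's table, and `key` is just an illustrative name. frozenset is used instead of set because duplicated() needs hashable values, and a plain set is not hashable.

```python
import pandas as pd

# Data frame rebuilt from the table in the question (assumed values).
df = pd.DataFrame({
    "columnA": [12, 8, 1],
    "ColumnB": [8, 12, 7],
    "ColumnC": [1.34, 1.34, 0.25],
})

# A frozenset ignores element order and is hashable,
# so both orderings of a pair compare equal and hash the same.
print(frozenset([8, 12]) == frozenset([12, 8]))

# One frozenset per row, built from the two pair columns.
key = df[["columnA", "ColumnB"]].apply(frozenset, axis=1)

# duplicated() marks the second and later occurrences of each key.
out = df[~key.duplicated()]
print(out)
```

One thing to keep in mind: a row such as (5, 5) collapses to frozenset({5}), which is still a valid key but no longer has two elements.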

CodePudding user response:

A possible solution is the following:

# pip install pandas

import pandas as pd

# create test dataframe
df = pd.DataFrame({"colA": [12,8,1],"colB": [8,12,1],"colC": [1.34,1.34,0.25]})
df


df.loc[df.colA > df.colB, df.columns] = df.loc[df.colA > df.colB, df.columns[[1,0,2]]].values
df.drop_duplicates()

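This returns the rows at index 0 and 2. Note that the first row is kept in its swapped form (colA=8, colB=12), because the assignment above reorders the values in place.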
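
If the original order of the values matters, a variant that leaves df unchanged might look like the sketch below. It is not the code from the answer above: it builds a separate helper frame (the names `key`, `lo` and `hi` are made up here) from the row-wise minimum and maximum and uses it only to find the duplicates.

```python
import pandas as pd

# Same test data frame as in the answer above.
df = pd.DataFrame({
    "colA": [12, 8, 1],
    "colB": [8, 12, 1],
    "colC": [1.34, 1.34, 0.25],
})

pair = df[["colA", "colB"]]

# Helper frame: smaller value, larger value, plus colC, so that rows
# only count as duplicates when colC matches as well.
key = pd.DataFrame({
    "lo": pair.min(axis=1),
    "hi": pair.max(axis=1),
    "colC": df["colC"],
})

# Filter df with the duplicate mask; df itself is never modified.
out = df[~key.duplicated()]
print(out)
```

Here the kept row stays as (12, 8, 1.34), which is what the question asked for.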
