How to Compare two columns of CSV simultaneously in Python?-CodePudding

I have a csv file with a huge dataset conatining two columns. I want to compare the data of these two columns such that, if a dupplicated pair is present then it gets deleted. For example, if my data file looks something like this:

Column A             Column B
DIP-1N                  DIP-1N
DIP-2N                  DIP-3N
DIP-3N                  DIP-2N
DIP-4N                  DIP-5N

Then the first entry gets deleted because I don't want two "DIP-1Ns". Also, the order of occurrence of pair is not an issue as far as the entry is unique. For example, here, DIP-2N & DIP-3N and DIP-3N & DIP-3N are paired. But both the entries mean the same thing. So I want to keep one entry and delete the rest.

I have written the following code, but I don't know how to compare simultaneously the entry of both the columns.

import csv
import pandas as pd

file = pd.read_csv("/home/staph.csv")
for i in range(len(file['Column A'])):
    for j in range(len(file['Column B'])):
        list1 = []
        list2 = []
        list1.append(file[file['Column A'].str.contains('DIP-' str(i) 'N')])
        list2.append(file[file['Column B'].str.contains('DIP-' str(i) 'N')])
        for ele1,ele2 in list1,list2:
            if(list1[ele1]==list2[ele2]):
                print("Duplicate")
            else:
                print("The 1st element is :", ele1)
                print("The 2nd element is :", ele2)

Seems like something is wrong, as there is no output. The program just ends without any output or error. Any help would be much appreciated in terms of whether my code is wrong or if I can optimize the process in a better way. Thanks :)

CodePudding user response：

It might not be the best way to get what you need but, it works.

df['temp'] = df['Column A']   " "   df['Column B']
df['temp'] = df['temp'].str.split(" ")
df['temp'] = df['temp'].apply(lambda list_: " ".join(sorted(list_)))
df.drop_duplicates(subset=['temp'], inplace=True)
df = df[df['Column A'] != df['Column B']]
df.drop('temp', axis=1, inplace=True)

Output:

index	Column A	Column B
1	DIP-2N	DIP-3N
3	DIP-4N	DIP-5N

CodePudding user response：

With some tweaking you could use the pandas method:

# get indices of duplicate-free (except first occurence) combined sets of col A and B
keep_ind = pd.Series(df[["Column A", "Column B"]].values.tolist()).apply(set).drop_duplicates(keep="first").index
# use these indices to filter the DataFrame
df = df.loc[keep_ind]