Compare two columns in csv files-CodePudding

There are two csv files, in the first file the third column has a certain number of rows with data, and in the second file the first column has similar data, also in some indefinite amount, these are presented in the form of md5, for example:

file_1

column_1	column_2	column_3
blah blah blah	blah blah blah	aa7744226c695c0b2e440419848cf700
blah blah blah	blah blah blah	9b34939b137e24f8c6603a54b2305f07
blah blah blah	blah blah blah	ad1172b28f277eab7ca91f96f13a242b
`etc`

file_2

column_1	column_2	column_3
49269f413284abfa58f41687b6f631e0	blah blah blah	blah blah blah
a0879ff97178e03eb18470277fbc7056	blah blah blah	blah blah blah
9e5b91c360d6be29d556db7e1241ce82	blah blah blah	blah blah blah
`etc`

Could you tell me please, how can i compare these two columns from two files, i.e. find duplicate values, and if the values are repeated, then display what value is in the first and second csv file.

I tried to take something from this example:

import csv    
interesting_cols = [0, 2, 3, 4, 5]    
with open("/root/file1.csv", 'r') as file1,\
     open("/root/file2.csv", 'r') as file2:    
    reader1, reader2 = csv.reader(file1), csv.reader(file2)    
    for line1, line2 in zip(reader1, reader2):
        equal = all(x == y for n, (x, y) in enumerate(zip(line1, line2)) if n in interesting_cols)
        print(equal)

this example would work well if two files would have only one column each. According to my requirements, I could not implement it in any way, I am very weak in Python. Thank you very much!

CodePudding user response：

If you are allowed, you can use Pandas to do that. Firstly install the package using pip: python -m pip install pandas

or conda: conda install pandas

And then read and compare with pandas:

import pandas as pd
interesting_cols = [0, 2, 3, 4, 5]    
file1 = pd.read_csv("/root/file1.csv")
file2 = pd.read_csv("/root/file2.csv")
comp = file1.compare(file2)
print(comp.to_markdown())

OR, if you prefer to keep the 'with' statement, you should create a class and define the __enter__ and __exit__ methods:

import pandas as pd
interesting_cols = [0, 2, 3, 4, 5]    
class DataCSV:
    def __init__(self, file) -> None:
        self.filename = file
    def __enter__(self):
        self.file = pd.read_csv(self.filename)
        return self.file
    def __exit__(self, exc_type, exc_value, traceback):
        pass
with DataCSV("/root/file1.csv") as file1, DataCSV("/root/file2.csv") as file2:
    comp = file1.compare(file2)
    print(comp.to_markdown())

The output should be something like:

	('column_1', 'self')	('column_1', 'other')	('column_3', 'self')	('column_3', 'other')
0	blah blah blah	49269f413284abfa58f41687b6f631e0	aa7744226c695c0b2e440419848cf700	blah blah blah
1	blah blah blah	a0879ff97178e03eb18470277fbc7056	9b34939b137e24f8c6603a54b2305f07	blah blah blah
2	blah blah blah	9e5b91c360d6be29d556db7e1241ce82	ad1172b28f277eab7ca91f96f13a242b	blah blah blah

CodePudding user response：

You can reorder a list and check it quickly using a generator.

import csv

def parse_csv(filename, header=False, delim=',', quotechar='"'):
    with open(filename, 'r') as f:
        csvfile = csv.reader(f, delimiter=delim, quotechar=quotechar)
        if header:
            csvfile.__next__()
        for row in csvfile:
            yield row

def diff(l1, l2, reorder=None):
    if reorder:
        for i,line in enumerate(l2):
            l2[i] = [line[x] for x in line]
    for i, line in enumerate(l1):
        if line not in l2:
            yield i,  line

filename1 = ''
filename2 = ''
reorder = [2,1,0]

missing = [(i, line) for i,line in diff(parse_csv(filename1, header=False), list(parse_csv(filename2, header=False)), reorder=reorder) or (None, None)]
print(missing)

CodePudding user response：

Here, it says Each row read from the csv file is returned as a list of strings. You can read individual columns from those lines.

For example:

Using two simple csv files
addresses.csv

Doe,John,120 jefferson st.,Riverside, NJ, 08075
McGinnis,Jack,220 hobo Av.,Phila, PA,09119
Repici,"John ""Da Man""",120 Jefferson St.,Riverside, NJ,08075
Tyler,Stephen,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234

and

phones.csv

John,Doe,19122
Jack,McGinnis,20220
"John ""Da Man""",Repici,1202134
Stephen,Tyler,72384


>>> with open('addresses.csv') as file1, open('phones.csv') as file2:
...     r1, r2 = csv.reader(file1), csv.reader(file2)
...     for line1, line2 in zip(r1, r2):
...             if line1[1] == line2[0]:
...                     print('found a duplicate', line1[1])
...
found a duplicate John
found a duplicate Jack
found a duplicate John "Da Man"
found a duplicate Stephen

we get the rows which have same values in specified columns. In our case, these are second column of first csv file and first column of second csv file. In order to get row numbers, you can use enumerate(zip()) just like the example code you provided.

You can check Python list comprehensions for the syntax used in your example.

CodePudding user response：

My answer will work with all records in files. It will find match in all records in file1 and file2.

Reverse list reader1 = [i[::-1] for i in reader1] for ordering it.
Making a list of theese two reader = reader1 reader2
Making a dictionary, which will find all matches by number.
Just printing result of our searching

import csv

interesting_cols = [0, 2, 3, 4, 5]
with open("file1.csv", 'r') as file1,\
     open("file2.csv", 'r') as file2:
    reader1, reader2 = csv.reader(file1), csv.reader(file2)

    reader1 = [i[::-1] for i in reader1]
    reader2 = [i for i in reader2]
    reader = reader1   reader2

    dictionary_of_records = dict()

    for i, item in enumerate(reader):
        key = item[0]
        if key in dictionary_of_records:
            dictionary_of_records[key].append(i)
        else:
            dictionary_of_records[key] = list()
            dictionary_of_records[key].append(i)

    for key, value in dictionary_of_records.items():
        if len(value) > 1:
            print(f"Match for {key}")
            for index in value:
                print(' '.join(reader[index]))
        else:
            print(f"No match for {key}")
        print("-----------------------------")

P.S. It's quite hardcode, I think. You can also watch pandas library or itertools to find more beatiful way to do it.