There are two csv files
, in the first
file the third column
has a certain number of rows with data
, and in the second file
the first column
has similar data, also in some indefinite amount,
these are presented in the form of md5
, for example:
file_1
column_1 | column_2 | column_3 |
---|---|---|
blah blah blah | blah blah blah | aa7744226c695c0b2e440419848cf700 |
blah blah blah | blah blah blah | 9b34939b137e24f8c6603a54b2305f07 |
blah blah blah | blah blah blah | ad1172b28f277eab7ca91f96f13a242b |
etc |
file_2
column_1 | column_2 | column_3 |
---|---|---|
49269f413284abfa58f41687b6f631e0 | blah blah blah | blah blah blah |
a0879ff97178e03eb18470277fbc7056 | blah blah blah | blah blah blah |
9e5b91c360d6be29d556db7e1241ce82 | blah blah blah | blah blah blah |
etc |
Could you tell me please, how can i compare these two columns from two files, i.e. find duplicate
values, and if the values are repeated, then display what value is in the first and second csv file
.
I tried to take something from this example:
import csv
interesting_cols = [0, 2, 3, 4, 5]
with open("/root/file1.csv", 'r') as file1,\
open("/root/file2.csv", 'r') as file2:
reader1, reader2 = csv.reader(file1), csv.reader(file2)
for line1, line2 in zip(reader1, reader2):
equal = all(x == y for n, (x, y) in enumerate(zip(line1, line2)) if n in interesting_cols)
print(equal)
this example would work well if two files would have only one column each. According to my requirements, I could not implement it in any way, I am very weak in Python. Thank you very much!
CodePudding user response:
If you are allowed, you can use Pandas to do that. Firstly install the package using pip:
python -m pip install pandas
or conda:
conda install pandas
And then read and compare with pandas:
import pandas as pd
interesting_cols = [0, 2, 3, 4, 5]
file1 = pd.read_csv("/root/file1.csv")
file2 = pd.read_csv("/root/file2.csv")
comp = file1.compare(file2)
print(comp.to_markdown())
OR, if you prefer to keep the 'with' statement, you should create a class and define the __enter__
and __exit__
methods:
import pandas as pd
interesting_cols = [0, 2, 3, 4, 5]
class DataCSV:
def __init__(self, file) -> None:
self.filename = file
def __enter__(self):
self.file = pd.read_csv(self.filename)
return self.file
def __exit__(self, exc_type, exc_value, traceback):
pass
with DataCSV("/root/file1.csv") as file1, DataCSV("/root/file2.csv") as file2:
comp = file1.compare(file2)
print(comp.to_markdown())
The output should be something like:
('column_1', 'self') | ('column_1', 'other') | ('column_3', 'self') | ('column_3', 'other') | |
---|---|---|---|---|
0 | blah blah blah | 49269f413284abfa58f41687b6f631e0 | aa7744226c695c0b2e440419848cf700 | blah blah blah |
1 | blah blah blah | a0879ff97178e03eb18470277fbc7056 | 9b34939b137e24f8c6603a54b2305f07 | blah blah blah |
2 | blah blah blah | 9e5b91c360d6be29d556db7e1241ce82 | ad1172b28f277eab7ca91f96f13a242b | blah blah blah |
CodePudding user response:
You can reorder a list and check it quickly using a generator.
import csv
def parse_csv(filename, header=False, delim=',', quotechar='"'):
with open(filename, 'r') as f:
csvfile = csv.reader(f, delimiter=delim, quotechar=quotechar)
if header:
csvfile.__next__()
for row in csvfile:
yield row
def diff(l1, l2, reorder=None):
if reorder:
for i,line in enumerate(l2):
l2[i] = [line[x] for x in line]
for i, line in enumerate(l1):
if line not in l2:
yield i, line
filename1 = ''
filename2 = ''
reorder = [2,1,0]
missing = [(i, line) for i,line in diff(parse_csv(filename1, header=False), list(parse_csv(filename2, header=False)), reorder=reorder) or (None, None)]
print(missing)
CodePudding user response:
Here, it says Each row read from the csv file is returned as a list of strings. You can read individual columns from those lines.
For example:
Using two simple csv files
addresses.csv
Doe,John,120 jefferson st.,Riverside, NJ, 08075
McGinnis,Jack,220 hobo Av.,Phila, PA,09119
Repici,"John ""Da Man""",120 Jefferson St.,Riverside, NJ,08075
Tyler,Stephen,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
and
phones.csv
John,Doe,19122
Jack,McGinnis,20220
"John ""Da Man""",Repici,1202134
Stephen,Tyler,72384
>>> with open('addresses.csv') as file1, open('phones.csv') as file2:
... r1, r2 = csv.reader(file1), csv.reader(file2)
... for line1, line2 in zip(r1, r2):
... if line1[1] == line2[0]:
... print('found a duplicate', line1[1])
...
found a duplicate John
found a duplicate Jack
found a duplicate John "Da Man"
found a duplicate Stephen
we get the rows which have same values in specified columns. In our case, these are second column of first csv file and first column of second csv file. In order to get row numbers, you can use enumerate(zip())
just like the example code you provided.
You can check Python list comprehensions for the syntax used in your example.
CodePudding user response:
My answer will work with all records in files. It will find match in all records in file1 and file2.
- Reverse list
reader1 = [i[::-1] for i in reader1]
for ordering it. - Making a list of theese two
reader = reader1 reader2
- Making a dictionary, which will find all matches by number.
- Just printing result of our searching
import csv
interesting_cols = [0, 2, 3, 4, 5]
with open("file1.csv", 'r') as file1,\
open("file2.csv", 'r') as file2:
reader1, reader2 = csv.reader(file1), csv.reader(file2)
reader1 = [i[::-1] for i in reader1]
reader2 = [i for i in reader2]
reader = reader1 reader2
dictionary_of_records = dict()
for i, item in enumerate(reader):
key = item[0]
if key in dictionary_of_records:
dictionary_of_records[key].append(i)
else:
dictionary_of_records[key] = list()
dictionary_of_records[key].append(i)
for key, value in dictionary_of_records.items():
if len(value) > 1:
print(f"Match for {key}")
for index in value:
print(' '.join(reader[index]))
else:
print(f"No match for {key}")
print("-----------------------------")
P.S. It's quite hardcode, I think. You can also watch pandas library or itertools to find more beatiful way to do it.