Without using the diff option.
I want to compare two different CSV files with a different number of rows and columns that can include numbers, strings, chars, and special characters and compare them by 2-3 keywords (it's not needed it's just an extra) and print out to a new CSV file what is different. The CSV file might contain thousands of rows and columns.
file1.csv
round date first second third fourth fifth sixth
1 2 2021.04 2 45e69 10 16 4565 37
2 3 2021.04 4 15 456as df924 35 4N320
4 5 2021.03 4 43!d9 23 26 29 33
file2.csv
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
1 2 2021.04 2 45e69 10 16 4565 37
2 3 2021.04 4 15 456as df924 35 4N320
3 4 2021.03 10 11 20 21 24325 41
5 6 2021.03 4321 9 2#@6 28 34350 41
Line 1 and 2 from both CSV are the same.
Now, I want to search by the headline keyword for example: "first" and "fifth"- where both headline columns are not the same print them out. (again, it's not needed just an extra) The results for that will be printed in a new CSV file:
result.csv
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
3 4 2021.03 10 11 20 21 24325 41
4 5 2021.03 4 43!d9 23 26 29 33
5 6 2021.03 4321 9 2#@6 28 34350 41
Counting the lines was for easier reading.
I have tried those basic options but most of them were only if the rows and columns from both CSV files are equal.
option1:
with open('file1.csv', 'r') as t1, open('file2.csv', 'r') as t2:
fileone = t1.readlines()
filetwo = t2.readlines()
with open('result.csv', 'w') as outFile:
for line in filetwo:
if line not in fileone:
outFile.write(line)
option 2:
import pandas as pd
df1 = pd.read_csv('file1.csv')
df1 = pd.read_csv('file2.csv')
a=df1[df1.eq(df2).all(axis=1)==False]
a.index =1
print(a.to_string(index=False))
Those examples are not working when the rows and columns are not the same, and even when they are the same number of rows and columns it's not catching special characters/numbers from time to time.
CodePudding user response:
There is a library you can find useful: csv-diff
You can use it from command line:
csv-diff one.csv two.csv --key=id --json
or as a python import
from csv_diff import load_csv, compare
diff = compare(
load_csv(open("one.csv"), key="id"),
load_csv(open("two.csv"), key="id")
)
CodePudding user response:
Use sets instead of lists after reading two files:
fileone = set(fileone)
filetwo = set(filetwo)
then in result you can get a symmetric difference - all elements of both sets except elements that are in both fileone and filetwo sets.
result = fileone.symmetric_difference(filetwo)
Attention: header row is not included in result because it is the same in both sets