Home > Mobile >  How to compare two different CSV files with different number of columns and rows by keyword?
How to compare two different CSV files with different number of columns and rows by keyword?

Time:07-13

Without using the diff option.

I want to compare two different CSV files with a different number of rows and columns that can include numbers, strings, chars, and special characters and compare them by 2-3 keywords (it's not needed it's just an extra) and print out to a new CSV file what is different. The CSV file might contain thousands of rows and columns.

file1.csv

     round    date  first  second  third  fourth  fifth  sixth
1     2  2021.04      2    45e69     10    16      4565   37
2     3  2021.04      4      15    456as  df924     35   4N320
4     5  2021.03      4    43!d9    23      26      29     33

file2.csv

     round    date  first  second  third  fourth  fifth  sixth
0     1  2021.04      1      14     15      24      40     41
1     2  2021.04      2    45e69    10      16     4565    37
2     3  2021.04      4      15    456as   df924    35    4N320
3     4  2021.03     10      11     20      21     24325   41
5     6  2021.03    4321       9   2#@6     28     34350   41

Line 1 and 2 from both CSV are the same.

Now, I want to search by the headline keyword for example: "first" and "fifth"- where both headline columns are not the same print them out. (again, it's not needed just an extra) The results for that will be printed in a new CSV file:

result.csv

     round    date  first  second  third  fourth  fifth  sixth
0     1  2021.04      1     14     15      24     40     41
3     4  2021.03     10     11     20      21     24325  41
4     5  2021.03      4   43!d9    23      26     29     33
5     6  2021.03    4321    9     2#@6     28     34350  41

Counting the lines was for easier reading.

I have tried those basic options but most of them were only if the rows and columns from both CSV files are equal.

option1:

with open('file1.csv', 'r') as t1, open('file2.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('result.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

option 2:

import pandas as pd
df1 = pd.read_csv('file1.csv')
df1 = pd.read_csv('file2.csv')

a=df1[df1.eq(df2).all(axis=1)==False]
a.index  =1
print(a.to_string(index=False))

Those examples are not working when the rows and columns are not the same, and even when they are the same number of rows and columns it's not catching special characters/numbers from time to time.

CodePudding user response:

There is a library you can find useful: csv-diff

You can use it from command line:

csv-diff one.csv two.csv --key=id --json

or as a python import

from csv_diff import load_csv, compare
diff = compare(
    load_csv(open("one.csv"), key="id"),
    load_csv(open("two.csv"), key="id")
)

CodePudding user response:

Use sets instead of lists after reading two files:

fileone = set(fileone)
filetwo = set(filetwo)

then in result you can get a symmetric difference - all elements of both sets except elements that are in both fileone and filetwo sets.

result = fileone.symmetric_difference(filetwo)

Attention: header row is not included in result because it is the same in both sets

  • Related