Comparing two CSV files and writing the rows missing from the second to a third file

I've seen many questions like this one here, but none of the solutions helped me. Let's say we have two CSV files like this:

File1 (original)

3L261   6/27/2022 1:15  AUH SLL ACT 320
3L122   4/6/2022 23:35  CCJ AUH ACT 320
3L133   4/5/2022 8:45   AUH TRV ACT 320

File2 (system generated)

3L122   4/6/2022 23:35  CCJ AUH ACT 320

What I am trying to achieve is to compare row 1 of file1 against every row of file2. If the row exists in file2, move on to the next row of file1 and repeat the comparison; if it does not exist, write the row to another file, Comparison.csv.

Expected output:

3L261   6/27/2022 1:15  AUH SLL ACT 320
3L133   4/5/2022 8:45   AUH TRV ACT 320

I tried the following code, but it compares row 1 of file1 with row 1 of file2, row 2 of file1 with row 2 of file2, and so on. How can I achieve my use case?

import csv
import pandas as pd
import numpy as np

with open('file1.csv', 'r') as f, open('file.csv', 'r') as s, open('Comparison.csv', 'w') as o:
    diffs = 0
    for i, (first, second) in enumerate(zip(f, s), start=1):
        if first != second:
            print((f'row #{i}\n'
                   f'in first file: {first.strip()}\n'
                   f'in second file: {second.strip()}'), file=o)
            diffs += 1
    print(f'Different values on {diffs} row(s), same values on {i-diffs} row(s)')

CodePudding user response：

If I understand correctly, your problem is about how to compare the rows of two DataFrames, right?

Before that, excuse me for my English.

In my opinion the simplest way to do this is using the == operator. I will demonstrate with a simple example how this can be done.


import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print(df_1 == df_2)

Result:

   Player  Goals
0    True  False
1    True   True
2    True   True
3    True  False
4    True   True

This compares each element of df_1 with the element at the same position in df_2 and returns True where they are equal, otherwise False.

We can use pandas.DataFrame.all() method to know which rows are the same in both df_1 and df_2.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print((df_1 == df_2).all(axis=1))

Result:

0    False
1     True
2     True
3    False
4     True
dtype: bool

Rows marked True have identical values in every column of df_1 and df_2. Rows marked False differ in at least one column.

We can use indexing to list all rows whose values differ in df_1 and df_2.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

print(df_1[~(df_1 == df_2).all(axis=1)])

Result:

        Player  Goals
0  Lewandowski     10
3        Messi      5

This lists all rows in df_1 whose values differ from the corresponding rows in df_2.
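As a side note, pandas also ships a DataFrame.compare() method that reports the differing cells side by side. The snippet below is only a minimal sketch, assuming pandas 1.1 or newer (where compare() was added) and reusing the same example data; like ==, it expects both frames to carry identical labels.

```python
import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)

# compare() keeps only the rows and columns that differ and labels the
# values from df_1 as "self" and the values from df_2 as "other".
diff = df_1.compare(df_2)
print(diff)
```

Only the Goals column for rows 0 and 3 should remain, since every other cell matches.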

If we have different indices for df_1 and df_2, we get an error saying ValueError: Can only compare identically-labeled DataFrame objects.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=['a', 'b', 'c', 'd', 'e'])

print(df_1 == df_2)

Result:

Traceback (most recent call last):
...
ValueError: Can only compare identically-labeled DataFrame objects

We can use the pandas.DataFrame.reset_index() method to reset the indexes in order to overcome the above mentioned problem.

import pandas as pd

data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [10, 8, 6, 5, 4]}

data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                "Goals": [7, 8, 6, 7, 4]}

df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=['a', 'b', 'c', 'd', 'e'])
df_2.reset_index(drop=True, inplace=True)

print(df_1 == df_2)

Result:


   Player  Goals
0    True  False
1    True   True
2    True   True
3    True  False
4    True   True

This resets the index of df_2 before comparing df_1 and df_2, so that the two DataFrames have the same indices and the comparison becomes possible.
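Keep in mind that == compares position by position, so it expects both frames to have the same labels, while the two files in the question have different numbers of rows. For that case a membership check may fit better. The following is a minimal sketch using only the standard library, not a definitive solution: the in-memory strings FILE1 and FILE2 and the helper rows_missing_from are hypothetical stand-ins, and the data is assumed to be comma-separated.

```python
import csv
import io

# Hypothetical in-memory stand-ins for file1.csv and file2.csv
# (assumed to be comma-separated, without a header row).
FILE1 = """3L261,6/27/2022 1:15,AUH,SLL,ACT,320
3L122,4/6/2022 23:35,CCJ,AUH,ACT,320
3L133,4/5/2022 8:45,AUH,TRV,ACT,320
"""
FILE2 = """3L122,4/6/2022 23:35,CCJ,AUH,ACT,320
"""


def rows_missing_from(first, second):
    """Return the rows of `first` that appear nowhere in `second`."""
    # Store every row of the second file as a tuple so lookups are fast.
    seen = {tuple(row) for row in csv.reader(second)}
    # Keep the rows of the first file, in order, that were never seen.
    return [row for row in csv.reader(first) if tuple(row) not in seen]


missing = rows_missing_from(io.StringIO(FILE1), io.StringIO(FILE2))

# Write the result as CSV text; with real files this would go to Comparison.csv.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(missing)
print(out.getvalue(), end="")
```

With real files, the StringIO objects would be replaced by files opened with newline='', as the csv module documentation recommends.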

That's it.

CodePudding user response：

You can perform an outer merge with indicator=True and use it to export new files:

import pandas as pd

df1 = pd.read_csv('filename1.csv', header=None) # or set headers accordingly
df2 = pd.read_csv('filename2.csv', header=None)

# '_merge' marks each row as 'left_only', 'right_only' or 'both'
for name, g in df1.merge(df2, how='outer', indicator=True).groupby('_merge'):
    g.drop(columns='_merge').to_csv(f'{name}.csv', header=False, index=False)

output:

# left_only.csv
3L261,6/27/2022 1:15,AUH,SLL,ACT,320
3L133,4/5/2022 8:45,AUH,TRV,ACT,320

# right_only.csv
[empty file]

# both.csv
3L122,4/6/2022 23:35,CCJ,AUH,ACT,320
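If only the rows missing from the second file are needed, as in the question, the same indicator column can be filtered directly. This is a minimal sketch under stated assumptions: the in-memory strings FILE1 and FILE2 are hypothetical stand-ins for the two comma-separated files, and the second file is assumed to contain no duplicate rows (a left merge would otherwise repeat matching rows).

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the two CSV files in the question.
FILE1 = """3L261,6/27/2022 1:15,AUH,SLL,ACT,320
3L122,4/6/2022 23:35,CCJ,AUH,ACT,320
3L133,4/5/2022 8:45,AUH,TRV,ACT,320
"""
FILE2 = """3L122,4/6/2022 23:35,CCJ,AUH,ACT,320
"""

df1 = pd.read_csv(io.StringIO(FILE1), header=None)
df2 = pd.read_csv(io.StringIO(FILE2), header=None)

# A left merge keeps every row of df1; '_merge' tells whether df2 has it too.
merged = df1.merge(df2, how="left", indicator=True)

# Rows flagged 'left_only' exist in df1 but nowhere in df2.
missing = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Passing a path such as 'Comparison.csv' instead would write a file.
csv_text = missing.to_csv(header=False, index=False)
print(csv_text, end="")
```

The left merge is used here instead of the outer merge because rows that exist only in the second file are not needed for this output.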