I've seen many questions like this one here, but none of the solutions helped me. So let's say we have two CSV files like this:
File1 (original)
3L261 6/27/2022 1:15 AUH SLL ACT 320
3L122 4/6/2022 23:35 CCJ AUH ACT 320
3L133 4/5/2022 8:45 AUH TRV ACT 320
File2 (system generated)
3L122 4/6/2022 23:35 CCJ AUH ACT 320
What I am trying to achieve is to compare row 1 of file1 against every row of file2. If the row exists, move on to the next row of file1 and repeat the comparison. If the row does not exist, write it to another file, comparison.csv.
I tried the following code, but it compares row 1 of file1 with row 1 of file2, row 2 of file1 with row 2 of file2, and so on. I need to know how I can achieve my use case.
Expected output:
3L261 6/27/2022 1:15 AUH SLL ACT 320
3L133 4/5/2022 8:45 AUH TRV ACT 320
import csv
import pandas as pd
import numpy as np
with open('file1.csv', 'r') as f, open('file.csv', 'r') as s, open('Comparison.csv', 'w') as o:
    diffs = 0
    for i, (first, second) in enumerate(zip(f, s), start=1):
        if first != second:
            print((f'row #{i}\n'
                   f'in first file: {first.strip()}\n'
                   f'in second file: {second.strip()}'), file=o)
            diffs += 1
    print(f'Different values on {diffs} row(s), same values on {i-diffs} row(s)')
CodePudding user response:
If I understand correctly, your problem is about how to compare the rows of two DataFrames, right?
Before that, please excuse my English.
In my opinion, the simplest way to do this is to use the == operator.
I will demonstrate how this can be done with a simple example.
import pandas as pd
data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4]}
data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4]}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)
print(df_1 == df_2)
Result:
Player Goals
0 True False
1 True True
2 True True
3 True False
4 True True
This compares each element of df_1 with the corresponding element of df_2 and returns True if the elements at that position are equal, and False otherwise.
We can use the pandas.DataFrame.all() method to find out which rows are the same in both df_1 and df_2.
import pandas as pd
data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4]}
data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4]}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)
print((df_1 == df_2).all(axis=1))
Result:
0 False
1 True
2 True
3 False
4 True
dtype: bool
Rows with a value of True in the output are identical in both DataFrames at that position. Rows with a value of False differ in at least one column.
We can use boolean indexing to list all rows whose values differ between df_1 and df_2.
import pandas as pd
data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4]}
data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4]}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)
print(df_1[(df_1 == df_2).all(axis=1) == False])
Result:
Player Goals
0 Lewandowski 10
3 Messi 5
This lists all rows in df_1 that differ from the corresponding rows in df_2.
If df_1 and df_2 have different indices, we get the error ValueError: Can only compare identically-labeled DataFrame objects.
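As a hedged aside, the same selection can be written with the ~ operator rather than == False, and DataFrame.compare (available in pandas 1.1 and later) reports only the cells that differ. The sketch below reuses the data from the examples above; the names mask and differing are illustrative and not part of the original answer.

```python
import pandas as pd

df_1 = pd.DataFrame({"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                     "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
                     "Goals": [7, 8, 6, 7, 4]})

# True where every column matches at the same position
mask = (df_1 == df_2).all(axis=1)

# ~ inverts the boolean mask, selecting the rows of df_1 that differ
differing = df_1[~mask]
print(differing)

# compare() keeps only the differing cells, labelled "self" and "other"
print(df_1.compare(df_2))
```

Like the == comparison, compare() expects both DataFrames to carry the same labels, so it is still a position-by-position check.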
import pandas as pd
data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4]}
data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4]}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=['a', 'b', 'c', 'd', 'e'])
print(df_1 == df_2)
Result:
Traceback (most recent call last):
...
ValueError: Can only compare identically-labeled DataFrame objects
We can use the pandas.DataFrame.reset_index() method to reset the index and overcome the problem described above.
import pandas as pd
data_season1 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4]}
data_season2 = {"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4]}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=['a', 'b', 'c', 'd', 'e'])
df_2.reset_index(drop=True, inplace=True)
print(df_1 == df_2)
Result:
Player Goals
0 True False
1 True True
2 True True
3 True False
4 True True
This resets the index of df_2 before comparing df_1 and df_2, so that the two DataFrames have the same index and the comparison becomes possible.
That's it.
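Note that == compares rows position by position, whereas the question asks whether each row of file1 appears anywhere in file2. A minimal sketch of that membership test, using only the standard library, might look like the following. The helper name rows_missing_from and the in-memory sample data are illustrative assumptions, as is the comma delimiter; real files would be opened with open(..., newline='') instead.

```python
import csv
import io

def rows_missing_from(first, second):
    """Return the rows of `first` that do not appear anywhere in `second`."""
    # Tuples are hashable, so the rows of the second file can go in a set
    seen = {tuple(row) for row in csv.reader(second)}
    return [row for row in csv.reader(first) if tuple(row) not in seen]

# Sample data from the question, assumed here to be comma-separated
file1 = io.StringIO(
    "3L261,6/27/2022 1:15,AUH,SLL,ACT,320\n"
    "3L122,4/6/2022 23:35,CCJ,AUH,ACT,320\n"
    "3L133,4/5/2022 8:45,AUH,TRV,ACT,320\n"
)
file2 = io.StringIO("3L122,4/6/2022 23:35,CCJ,AUH,ACT,320\n")

missing = rows_missing_from(file1, file2)

# Write the unmatched rows; a real file handle would replace this buffer
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(missing)
print(out.getvalue())
```

Building the set once means each row of the first file is checked in roughly constant time, instead of rescanning the second file for every row.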
CodePudding user response:
You can perform an outer merge with indicator=True and use it to export new files:
import pandas as pd
df1 = pd.read_csv('filename1.csv', header=None) # or set headers accordingly
df2 = pd.read_csv('filename2.csv', header=None)
# _merge is 'left_only', 'right_only' or 'both'; write one file per value
for name, g in df1.merge(df2, how='outer', indicator=True).groupby('_merge'):
    g.drop(columns='_merge').to_csv(f'{name}.csv', header=False, index=False)
output:
# left_only.csv
3L261,6/27/2022 1:15,AUH,SLL,ACT,320
3L133,4/5/2022 8:45,AUH,TRV,ACT,320
# right_only.csv
3L122,4/6/2022 23:35,CCJ,AUH,ACT,320
# both.csv
[empty file]
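If only the rows missing from the second file are needed, as in the question, the _merge column can be filtered directly instead of grouping. The following is a sketch under the assumption that both files are comma-separated and have no header row; the in-memory StringIO buffers stand in for the real files, and the name only_in_first is illustrative.

```python
import io
import pandas as pd

# Sample data from the question, assumed here to be comma-separated
file1 = io.StringIO(
    "3L261,6/27/2022 1:15,AUH,SLL,ACT,320\n"
    "3L122,4/6/2022 23:35,CCJ,AUH,ACT,320\n"
    "3L133,4/5/2022 8:45,AUH,TRV,ACT,320\n"
)
file2 = io.StringIO("3L122,4/6/2022 23:35,CCJ,AUH,ACT,320\n")

df1 = pd.read_csv(file1, header=None)
df2 = pd.read_csv(file2, header=None)

# A left merge keeps every row of df1; _merge records whether df2 has it too.
# drop_duplicates() prevents repeated rows in df2 from multiplying matches.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

# Keep the rows found only in the first file and drop the helper column
only_in_first = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Pass a path such as 'Comparison.csv' instead of the buffer to write a file
out = io.StringIO()
only_in_first.to_csv(out, header=False, index=False)
print(out.getvalue())
```

With no on= argument, merge joins on all columns the two DataFrames share, which here means a row matches only if every field is equal.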