I want to compare two CSVs. Both contain nearly identical data, but the second CSV has 2 identical rows that the first does not. I would like the program to output both of those rows so I can see which row is present in CSV 2 but not in CSV 1, and how many times it appears.
Here is my current logic:
import csv
import pandas as pd
import numpy as np
data1 = {"Col1": [0,1,1,2],
"Col2": [1,2,2,3],
"Col3": [5,2,1,1],
"Col4": [1,2,2,3]}
data2 = {"Col1": [0,1,1,2,4,4],
"Col2": [1,2,2,3,4,4],
"Col3": [5,2,1,1,4,4],
"Col4": [1,2,2,3,4,4]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
ds1 = set(tuple(line) for line in df1.values)
ds2 = set(tuple(line) for line in df2.values)
df = pd.DataFrame(list(ds2.difference(ds1)), columns=df2.columns)
print(df)
Here is my current outcome:
Col1 Col2 Col3 Col4
0 4 4 4 4
Here is my desired outcome:
Col1 Col2 Col3 Col4
0 4 4 4 4
1 4 4 4 4
As of right now, it only outputs the row once even though CSV 2 has the row twice. What can I do so that it shows the missing row once for each time it appears in the second CSV? Thanks in advance!
CodePudding user response:
You can use:
df2[~df2.eq(df1).all(axis=1)]
Result:
Col1 Col2 Col3 Col4
4 4 4 4 4
5 4 4 4 4
Or (if you want the index to be 0 and 1):
df2[~df2.eq(df1).all(axis=1)].reset_index(drop=True)
Result:
Col1 Col2 Col3 Col4
0 4 4 4 4
1 4 4 4 4
N.B.
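You can also use df2[df2.ne(df1).any(axis=1)] instead of df2[~df2.eq(df1).all(axis=1)]. Note that df2[df2.ne(df1).all(axis=1)] gives the same result here only because the extra rows have no counterpart in df1, so every column compares as not equal; a row that differs in just one column would be missed by .all(axis=1).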
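One caveat worth knowing: eq compares the two frames by index and column labels, not by row content. A minimal sketch of what that means, assuming a made-up pair of small frames where the extra row comes first in df2 (the name flagged is only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": [0, 1], "Col2": [1, 2]})
# Same two rows as df1 plus one extra row, but the extra row is first,
# so every row of df2 sits at a different index than its match in df1.
df2 = pd.DataFrame({"Col1": [4, 0, 1], "Col2": [4, 1, 2]})

# eq aligns on index labels: row 0 is compared with row 0, row 1 with row 1,
# and row 2 of df2 has no partner in df1, so it compares as not equal.
mask = ~df2.eq(df1).all(axis=1)
flagged = df2[mask]
print(flagged)
```

In this sketch all three rows of df2 are flagged, even though two of them exist in df1. So the approach above fits when the shared rows appear in the same order in both files and the extra rows come at the end, as in the question's sample data.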
CodePudding user response:
There is almost always a built-in pandas function meant to do what you want that will be better than trying to reinvent the wheel.
# Keep the rows of df2 that do not match df1 at the same index and column labels
df = df2[~df2.isin(df1).all(axis=1)]
# OR df = df2[df2.ne(df1).any(axis=1)]
print(df)
Output:
Col1 Col2 Col3 Col4
4 4 4 4 4
5 4 4 4 4
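If the rows of the two CSVs are not guaranteed to be in the same order, a merge-based comparison may be a safer option, since it matches rows by content rather than by position. A minimal sketch, assuming rows should be matched on all columns; the names merged, only_in_df2 and counts are just illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": [0, 1, 1, 2],
                    "Col2": [1, 2, 2, 3],
                    "Col3": [5, 2, 1, 1],
                    "Col4": [1, 2, 2, 3]})
df2 = pd.DataFrame({"Col1": [0, 1, 1, 2, 4, 4],
                    "Col2": [1, 2, 2, 3, 4, 4],
                    "Col3": [5, 2, 1, 1, 4, 4],
                    "Col4": [1, 2, 2, 3, 4, 4]})

# Left-merge df2 against the unique rows of df1 on all shared columns.
# indicator=True adds a "_merge" column saying where each row was found;
# dropping duplicates from df1 first avoids multiplying rows of df2.
merged = df2.merge(df1.drop_duplicates(), how="left", indicator=True)

# Rows marked "left_only" exist in df2 but not in df1; duplicates are kept.
only_in_df2 = (merged.loc[merged["_merge"] == "left_only"]
                     .drop(columns="_merge")
                     .reset_index(drop=True))
print(only_in_df2)

# Count how many times each such row appears in df2.
counts = (only_in_df2.groupby(list(only_in_df2.columns))
                     .size()
                     .reset_index(name="count"))
print(counts)
```

With the sample data from the question, only_in_df2 holds the row 4, 4, 4, 4 twice, and counts reports that row with a count of 2, which covers both the "which row" and the "how many times" parts of the question.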