How to output each missing row when comparing two CSVs using pandas in Python

Time:07-19

I am looking to compare two CSVs. Both will have nearly identical data, but the second CSV will have 2 identical rows that CSV 1 does not have. I would like the program to output both of those rows so I can see which row is present in CSV 2 but not in CSV 1, and how many times it is present.

Here is my current logic:

import csv
import pandas as pd
import numpy as np

data1 = {"Col1": [0,1,1,2],
         "Col2": [1,2,2,3],
         "Col3": [5,2,1,1],
         "Col4": [1,2,2,3]}

data2 = {"Col1": [0,1,1,2,4,4],
         "Col2": [1,2,2,3,4,4],
         "Col3": [5,2,1,1,4,4],
         "Col4": [1,2,2,3,4,4]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

ds1 = set(tuple(line) for line in df1.values)
ds2 = set(tuple(line) for line in df2.values)
df = pd.DataFrame(list(ds2.difference(ds1)), columns=df2.columns)

print(df)

Here is my current outcome:

   Col1  Col2  Col3  Col4  
0     4     4     4     4

Here is my desired outcome:

   Col1  Col2  Col3  Col4  
0     4     4     4     4
1     4     4     4     4

As of right now, it only outputs the row once even though CSV 2 has the row twice. What can I do so that it shows the missing row once for each time it appears in the second CSV? Thanks in advance!

CodePudding user response:

You can use:

df2[~df2.eq(df1).all(axis=1)]

Result:

   Col1  Col2  Col3  Col4
4     4     4     4     4
5     4     4     4     4

Or (if you want the index to be 0 and 1):

df2[~df2.eq(df1).all(axis=1)].reset_index(drop=True)

Result:

   Col1  Col2  Col3  Col4
0     4     4     4     4
1     4     4     4     4

N.B.

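You can also use df2[df2.ne(df1).any(axis=1)] instead of df2[~df2.eq(df1).all(axis=1)]. (df2[df2.ne(df1).all(axis=1)] gives the same result for this data, but it only selects rows that differ in every column.)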
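Note that eq compares rows that share the same index label, so the one-liners above work here because the extra rows sit at the end of df2. If the rows could appear in a different order, a left merge with indicator=True is one possible alternative. The following is a minimal sketch under that assumption, not part of the original answer; the names merged and missing are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": [0, 1, 1, 2],
                    "Col2": [1, 2, 2, 3],
                    "Col3": [5, 2, 1, 1],
                    "Col4": [1, 2, 2, 3]})
df2 = pd.DataFrame({"Col1": [0, 1, 1, 2, 4, 4],
                    "Col2": [1, 2, 2, 3, 4, 4],
                    "Col3": [5, 2, 1, 1, 4, 4],
                    "Col4": [1, 2, 2, 3, 4, 4]})

# Left-join df2 against the distinct rows of df1 on all shared columns.
# Deduplicating df1 keeps the join from multiplying rows of df2.
merged = df2.merge(df1.drop_duplicates(), how="left", indicator=True)

# "left_only" marks rows of df2 that have no matching row anywhere in df1.
missing = (merged[merged["_merge"] == "left_only"]
           .drop(columns="_merge")
           .reset_index(drop=True))
print(missing)
```

Because df2 itself is not deduplicated, a row that appears twice in df2 and never in df1 is returned twice.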

CodePudding user response:

There is almost always a built-in pandas function for what you want, and it will usually be better than reinventing the wheel.

# isin with a DataFrame matches on both index and column labels
df = df2[~df2.isin(df1).all(axis=1)]
# OR df = df2[df2.ne(df1).any(axis=1)]
print(df)

Output:

   Col1  Col2  Col3  Col4
4     4     4     4     4
5     4     4     4     4
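The question also asks how many times each missing row is present. One way to get that is to group the filtered rows by all of their columns and count the group sizes. This is a minimal sketch that builds on the isin filter shown in this answer; the names missing and counts and the "count" column label are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": [0, 1, 1, 2],
                    "Col2": [1, 2, 2, 3],
                    "Col3": [5, 2, 1, 1],
                    "Col4": [1, 2, 2, 3]})
df2 = pd.DataFrame({"Col1": [0, 1, 1, 2, 4, 4],
                    "Col2": [1, 2, 2, 3, 4, 4],
                    "Col3": [5, 2, 1, 1, 4, 4],
                    "Col4": [1, 2, 2, 3, 4, 4]})

# Rows of df2 that do not match the row of df1 with the same index label.
missing = df2[~df2.isin(df1).all(axis=1)]

# Group identical rows together and count how often each one occurs.
counts = (missing.groupby(list(missing.columns))
                 .size()
                 .reset_index(name="count"))
print(counts)
```

For the sample data this yields a single row (4, 4, 4, 4) with a count of 2.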