How do I compare 2 csv files for differences in Pandas?

Time:05-04

I want to compare two CSV files (File1.csv and File2.csv), both of which consist of hyperlinks, and generate a CSV (diff.csv) that contains all the differences. diff.csv should contain the differences found in both directions:

  1. when a cell in file1.csv is compared against all the cells in file2.csv
  2. when a cell in file2.csv is compared against all the cells in file1.csv

Currently, I have the following script. It produces results, but as a beginner I am not sure whether they are correct, or whether this is the right approach for what I want to achieve.

import pandas as pd

# Read each file
df = pd.read_csv('file1.csv')
dz = pd.read_csv('file2.csv')

# Method 1: using isin on whole rows converted to tuples
# Rows of file1.csv that do not appear in file2.csv
dg = df[~df.apply(tuple, axis=1).isin(dz.apply(tuple, axis=1))]
print(dg)

# Rows of file2.csv that do not appear in file1.csv
da = dz[~dz.apply(tuple, axis=1).isin(df.apply(tuple, axis=1))]
print(da)

# drop_duplicates and dropna return new frames, so assign the result
dg = dg.drop_duplicates().dropna()

CodePudding user response:

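You can try this:

import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")

# Stack both files, then keep only the rows that occur exactly once.
# Note: a row repeated within a single file is dropped as well.
fullDf = pd.concat([df1, df2])
fullDf = fullDf[~fullDf.duplicated(keep=False)]
fullDf.to_csv("answer.csv", index=False)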
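If diff.csv should also record which file each differing row came from, an outer merge with indicator=True is one way to get that. This is only a minimal sketch, assuming both files have the same columns; the in-memory CSV text and the `url` column name are made up for illustration and stand in for file1.csv and file2.csv.

```python
import io
import pandas as pd

# Hypothetical stand-ins for file1.csv and file2.csv; pass real paths instead.
file1 = io.StringIO("url\nhttps://a.example\nhttps://b.example\n")
file2 = io.StringIO("url\nhttps://b.example\nhttps://c.example\n")

# Drop repeats inside each file so they cannot multiply rows in the merge.
df1 = pd.read_csv(file1).drop_duplicates()
df2 = pd.read_csv(file2).drop_duplicates()

# Outer merge on all shared columns; the _merge column says where a row was found:
# "left_only" (file 1 only), "right_only" (file 2 only) or "both".
merged = df1.merge(df2, how="outer", indicator=True)

# Keep only the rows that are missing from one of the two files.
diff = merged[merged["_merge"] != "both"]
print(diff)
```

Calling diff.to_csv("diff.csv", index=False) afterwards would write both directions of the comparison into one file, with the `_merge` column telling them apart.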

CodePudding user response:

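import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 2, 3], [11, 15, 16], [17, 18, 19]], columns=['A', 'B', 'C'])

# Stack both frames and drop every row that appears more than once,
# which leaves only the rows unique to one of the frames
df = pd.concat([df1, df2])
df.drop_duplicates(keep=False, inplace=True)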
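The approach above compares whole rows. Since the question talks about comparing each cell against all cells of the other file, a cell-level variant might look like the sketch below. It assumes there are no missing values and that all cells are of one sortable type (all strings or all numbers); the `value` and `only_in` column names are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["A", "B", "C"])
df2 = pd.DataFrame([[1, 2, 3], [11, 15, 16], [17, 18, 19]], columns=["A", "B", "C"])

# Flatten every cell of each frame into a set, ignoring row and column position.
cells1 = set(df1.to_numpy().ravel().tolist())
cells2 = set(df2.to_numpy().ravel().tolist())

# Set differences give the cells that occur in one file but nowhere in the other.
only_in_1 = sorted(cells1 - cells2)
only_in_2 = sorted(cells2 - cells1)

# Collect both directions in one frame, labelled by the file they came from.
diff = pd.DataFrame({
    "value": only_in_1 + only_in_2,
    "only_in": ["file1"] * len(only_in_1) + ["file2"] * len(only_in_2),
})
print(diff)
```

Because the cells go through sets, a value repeated within one file is reported once.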

CodePudding user response:

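Update:

import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")

# Compare on a single column only; rows whose value in that column
# occurs more than once are removed
df = pd.concat([df1, df2])
df = df[~df.duplicated(subset='your_column_name', keep=False)]
df.to_csv("result.csv", index=False)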

Or:

df = pd.concat([df1, df2])

df.drop_duplicates(subset='your_column_name', keep=False, inplace=True)

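You can also do it without pandas, like this:

with open('data1.csv', 'r') as csv1, open('data2.csv', 'r') as csv2:
    import1 = csv1.readlines()
    import2 = csv2.readlines()

# Write every line of data2.csv that does not appear in data1.csv.
# This covers one direction only; swap the roles of the files for the other.
with open('data_diff.csv', 'w') as outFile:
    for row in import2:
        if row not in import1:
            outFile.write(row)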
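The loop above only reports lines of the second file that are missing from the first, and a final line without a trailing newline will not match its twin in the other file. A possible two-way version is sketched below; `rows_only_in` is a hypothetical helper name, and the sample rows stand in for what readlines() would return.

```python
def rows_only_in(rows, other_rows):
    """Return the rows (line endings stripped) that never occur in other_rows."""
    other = {r.rstrip("\r\n") for r in other_rows}
    # dict.fromkeys removes repeats while keeping first-seen order.
    cleaned = dict.fromkeys(r.rstrip("\r\n") for r in rows)
    return [r for r in cleaned if r not in other]


# Made-up sample data; the last line of rows1 has no trailing newline.
rows1 = ["a,1\n", "b,2"]
rows2 = ["b,2\n", "c,3\n"]

# Both directions: lines unique to file 1 first, then lines unique to file 2.
diff = rows_only_in(rows1, rows2) + rows_only_in(rows2, rows1)
print(diff)
```

Using a set for the lookup also avoids rescanning the whole first file for every line of the second, which matters once the files get large.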