I want to compare two CSV files (File1.csv and File2.csv), which consist of hyperlinks, and generate a CSV (diff.csv) that contains all the differences. diff.csv should contain the differences in both directions:
- when a cell in file1.csv is compared against all the cells in file2.csv
- when a cell in file2.csv is compared against all the cells in file1.csv
Currently, I have the following script. I do get results, but as a beginner I am not sure whether they are correct, or whether this is the right approach for what I want to achieve.
import sys
import collections
import pandas as pd
#"read" each file
df = pd.read_csv('file1.csv')
dz = pd.read_csv('file2.csv')
# method1: Using the isin function method
dg = df[~df.apply(tuple,1).isin(dz.apply(tuple,1))]
print (dg)
da = dz[~dz.apply(tuple,1).isin(df.apply(tuple,1))]
print (da)
# drop_duplicates() returns a new DataFrame, so the result has to be assigned
dg = dg.drop_duplicates().dropna()
CodePudding user response:
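You can try this:
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
# stack both files, then keep only the rows that occur exactly once
# (a row repeated inside a single file is dropped as well)
fullDf = pd.concat([df1, df2])
fullDf = fullDf[~fullDf.duplicated(keep=False)]
fullDf.to_csv("answer.csv", index=False)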
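One thing the concat approach does not tell you is which file each leftover row came from. A minimal sketch of an alternative, assuming both files share the same columns: an outer merge with indicator=True, which is a different technique from the one above. The helper name diff_rows and the "source" column are made up for illustration.

```python
import pandas as pd

def diff_rows(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Return the rows found in only one of the two frames, tagged by origin."""
    # With no `on` argument, merge joins on all columns the frames share.
    # indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
    merged = df1.drop_duplicates().merge(
        df2.drop_duplicates(), how="outer", indicator=True
    )
    diff = merged[merged["_merge"] != "both"].copy()
    # Translate the merge labels into file names (hypothetical labels).
    diff["source"] = diff["_merge"].astype(str).map(
        {"left_only": "file1", "right_only": "file2"}
    )
    return diff.drop(columns="_merge").reset_index(drop=True)
```

Calling diff_rows(pd.read_csv("file1.csv"), pd.read_csv("file2.csv")).to_csv("diff.csv", index=False) would then write both directions of the diff into one file.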
CodePudding user response:
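import pandas as pd
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns = ['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 2, 3], [11, 15, 16], [17, 18, 19]], columns = ['A', 'B', 'C'])
# concatenate, then drop every row that occurs more than once
df = pd.concat([df1,df2])
df.drop_duplicates(keep=False, inplace=True)
# only the shared row [1, 2, 3] is removed; the other four rows remain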
CodePudding user response:
Update, to compare on a single column only:
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df = pd.concat([df1,df2])
df = df[~df.duplicated(subset='your_column_name', keep=False)]
df.to_csv("result.csv", index=False)
or
df = pd.concat([df1,df2])
df.drop_duplicates(keep=False, inplace=True, subset='your_column_name')
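If the comparison is meant to be cell against all cells, as the question describes, rather than row against row, something like the following might be closer. This is only a sketch, assuming every cell holds a string (a hyperlink) and that the position of a cell does not matter; cell_diff and the output column names are hypothetical.

```python
import pandas as pd

def cell_diff(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """List the cell values that appear in one frame but nowhere in the other."""
    def cells(df):
        # Flatten the frame into a set of its non-missing cell values.
        return {v for v in df.to_numpy().ravel() if pd.notna(v)}

    only1 = sorted(cells(df1) - cells(df2))  # in file1, absent from file2
    only2 = sorted(cells(df2) - cells(df1))  # in file2, absent from file1
    return pd.DataFrame({
        "value": only1 + only2,
        "only_in": ["file1"] * len(only1) + ["file2"] * len(only2),
    })
```

The result can be saved with cell_diff(df1, df2).to_csv("diff.csv", index=False).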
You can also do it without pandas, like this:
with open('data1.csv', 'r') as csv1, open('data2.csv', 'r') as csv2:
    import1 = csv1.readlines()
    import2 = csv2.readlines()
# write every line of data2.csv that does not appear in data1.csv
# (this covers one direction only)
with open('data_diff.csv', 'w') as outFile:
    for row in import2:
        if row not in import1:
            outFile.write(row)
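The loop above only reports rows of the second file that are missing from the first. A rough sketch that covers both directions with the standard csv module is shown below; it assumes rows must match exactly, and the function name diff_lines and the leading "file1"/"file2" tag column are inventions for this example.

```python
import csv

def diff_lines(path1, path2, out_path):
    """Write rows unique to either file to out_path, prefixed by their origin."""
    with open(path1, newline="") as f1, open(path2, newline="") as f2:
        rows1 = [tuple(r) for r in csv.reader(f1)]
        rows2 = [tuple(r) for r in csv.reader(f2)]
    # Sets give fast membership tests; dict.fromkeys removes repeated rows
    # while keeping the original order.
    set1, set2 = set(rows1), set(rows2)
    only1 = [r for r in dict.fromkeys(rows1) if r not in set2]
    only2 = [r for r in dict.fromkeys(rows2) if r not in set1]
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for r in only1:
            writer.writerow(("file1", *r))
        for r in only2:
            writer.writerow(("file2", *r))
    return len(only1) + len(only2)
```

A header row that is identical in both files counts as a common row, so it does not show up in the output.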