I want to make the code that compare two csv files!
import pandas as pd
import numpy as np
df = pd.read_csv("E:\Dupfile.csv")
df1 = pd.read_csv("E:\file.csv")
df['Correct'] = None
def Result(x):
if ....:
return int(1)
else:
return int(0)
df.loc[:,"Correct"]=df.apply(Result,axis=1)
print(df["Correct"])
df.to_csv("E:\file.csv")
print(df.head(20))
For example, file.csv format seems like below:
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
1 2 2021.04 2 9 10 16 35 37
2 3 2021.04 4 15 24 35 36 40
3 4 2021.03 10 11 20 21 25 41
4 5 2021.03 4 9 23 26 29 33
5 6 2021.03 1 9 26 28 30 41
Dupfile.csv seems like below:
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
0 1 2021.04 1 2 3 4 5 6
1 2 2021.04 2 9 10 16 35 37
1 2 2021.04 1 2 3 4 5 6
2 3 2021.04 4 15 24 35 36 40
2 3 2021.04 1 2 3 4 5 6
3 4 2021.03 10 11 20 21 25 41
3 4 2021.03 1 2 3 4 5 6
4 5 2021.03 4 9 23 26 29 33
4 5 2021.03 1 2 3 4 5 6
it has one more same round, but value is different.
check the file's round value with Dupfile's round and if the first to sixth value is equal, make the another "Correct" column in Dupfile and put 1. If not correct, put 0 to the "Correct" Column.
I tried to compare two different csv file but, I don't know how to do it. Can someone help me?
my expectation answer:
round date first second third fourth fifth sixth Correct
0 1 2021.04 1 14 15 24 40 41 1
0 1 2021.04 1 2 3 4 5 6 0
1 2 2021.04 2 9 10 16 35 37 1
1 2 2021.04 1 2 3 4 5 6 0
2 3 2021.04 4 15 24 35 36 40 1
2 3 2021.04 1 2 3 4 5 6 0
3 4 2021.03 10 11 20 21 25 41 1
3 4 2021.03 1 2 3 4 5 6 0
4 5 2021.03 4 9 23 26 29 33 1
4 5 2021.03 1 2 3 4 5 6 0
CodePudding user response:
If you use pandas
module, it will be better to gain the methods that provide in the module. I suggest you, try to use merge
for comparing 2 different DataFrames. I rewrite your code as follows.
import pandas as pd
df = pd.read_csv("E:\Dupfile.csv")
df1 = pd.read_csv("E:\file.csv")
df1['Correct'] = 1
df = df.merge(
df1,
how='left',
on=['round',
'date',
'first',
'second',
'third',
'fourth',
'fifth',
'sixth']).fillna(0)
print(df)
print(df['Correct'])
df.to_csv("E:\file.csv")
print(df.head(20))
How does it work?
The merge
method tries to match the columns in df
and df1
with the same names that exist in on
array. When you select left
for how
argument, no values on the left side of merging (df
) would be removed (Left Join). In another way, the correct
column that we create in file.csv
appends to Dupfil.csv
data, and non-match is assigned as nan
value. The fillna(0)
method helps us to replace nan
values with 0.
pandas.DataFrame.merge API reference
CodePudding user response:
You can do it with pure pandas using df.merge
.
Check out the example:
import pandas as pd
# file.csv
file_df = pd.DataFrame(
columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
data=[
("1", "2021.04", "1", "14", "15", "24", "40", "41"),
("2", "2021.04", "2", "9", "10", "16", "35", "37"),
("3", "2021.04", "4", "15", "24", "35", "36", "40"),
("4", "2021.03", "10", "11", "20", "21", "25", "41"),
("5", "2021.03", "4", "9", "23", "26", "29", "33"),
("6", "2021.03", "1", "9", "26", "28", "30", "41"),
],
)
# adding control column (we already know that those are the right values)
file_df["correct"] = 1
# Dupfile.csv
dup_file_df = pd.DataFrame(
columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
data=[
("1", "2021.04", "1", "14", "15", "24", "40", "41"),
("1", "2021.04", "1", "2", "3", "4", "5", "6"),
("2", "2021.04", "2", "9", "10", "16", "35", "37"),
("2", "2021.04", "1", "2", "3", "4", "5", "6"),
("3", "2021.04", "4", "15", "24", "35", "36", "40"),
("3", "2021.04", "1", "2", "3", "4", "5", "6"),
("4", "2021.03", "10", "11", "20", "21", "25", "41"),
("4", "2021.03", "1", "2", "3", "4", "5", "6"),
("5", "2021.03", "4", "9", "23", "26", "29", "33"),
("5", "2021.03", "1", "2", "3", "4", "5", "6"),
],
)
# We extract the column names to use in the merging process
cols = [x for x in dup_file_df.columns]
# We merge the 2 dataframes.
# The data frames are to match on every column (round, date and first to sixth).
# The "correct" column will be populated only if all the columns are matching
merged = dup_file_df.merge(file_df, how="outer", left_on=cols, right_on=cols)
# We put "0" where correct is None and cast to integer (it was float)
merged["correct"] = merged["correct"].fillna(0).astype(int)
# Done!
print(merged)