How can I compare two different csv file using python?-CodePudding

I want to make the code that compare two csv files!

import pandas as pd
import numpy as np

    df = pd.read_csv("E:\Dupfile.csv")
    df1 = pd.read_csv("E:\file.csv")
    
    df['Correct'] = None
    
    def Result(x):
       if ....:
         return int(1)
       else:
         return int(0)
    
    
    df.loc[:,"Correct"]=df.apply(Result,axis=1)
    
    print(df["Correct"])
    
    df.to_csv("E:\file.csv")
    print(df.head(20))

For example, file.csv format seems like below:

     round    date  first  second  third  fourth  fifth  sixth  
0     1  2021.04      1      14     15      24     40     41     
1     2  2021.04      2       9     10      16     35     37      
2     3  2021.04      4      15     24      35     36     40      
3     4  2021.03     10      11     20      21     25     41     
4     5  2021.03      4       9     23      26     29     33     
5     6  2021.03      1       9     26      28     30     41

Dupfile.csv seems like below:

    round    date  first  second  third  fourth  fifth  sixth  
0     1  2021.04      1      14     15      24     40     41  
0     1  2021.04      1       2      3       4      5      6    
1     2  2021.04      2       9     10      16     35     37   
1     2  2021.04      1       2      3       4      5      6      
2     3  2021.04      4      15     24      35     36     40    
2     3  2021.04      1       2      3       4      5      6     
3     4  2021.03     10      11     20      21     25     41  
3     4  2021.03      1       2      3       4      5      6     
4     5  2021.03      4       9     23      26     29     33  
4     5  2021.03      1       2      3       4      5      6

it has one more same round, but value is different.

check the file's round value with Dupfile's round and if the first to sixth value is equal, make the another "Correct" column in Dupfile and put 1. If not correct, put 0 to the "Correct" Column.

I tried to compare two different csv file but, I don't know how to do it. Can someone help me?

my expectation answer:

    round    date  first  second  third  fourth  fifth  sixth Correct
0     1  2021.04      1      14     15      24     40     41    1
0     1  2021.04      1       2      3       4      5      6    0
1     2  2021.04      2       9     10      16     35     37    1
1     2  2021.04      1       2      3       4      5      6    0  
2     3  2021.04      4      15     24      35     36     40    1
2     3  2021.04      1       2      3       4      5      6    0 
3     4  2021.03     10      11     20      21     25     41    1
3     4  2021.03      1       2      3       4      5      6    0 
4     5  2021.03      4       9     23      26     29     33    1
4     5  2021.03      1       2      3       4      5      6    0

CodePudding user response：

If you use pandas module, it will be better to gain the methods that provide in the module. I suggest you, try to use merge for comparing 2 different DataFrames. I rewrite your code as follows.

import pandas as pd

df = pd.read_csv("E:\Dupfile.csv")
df1 = pd.read_csv("E:\file.csv")

df1['Correct'] = 1

df = df.merge(
        df1,
        how='left',
        on=['round',
            'date',
            'first',
            'second',
            'third',
            'fourth',
            'fifth',
            'sixth']).fillna(0)

print(df)

print(df['Correct'])

df.to_csv("E:\file.csv")
print(df.head(20))

How does it work?

The merge method tries to match the columns in df and df1 with the same names that exist in on array. When you select left for how argument, no values on the left side of merging (df) would be removed (Left Join). In another way, the correct column that we create in file.csv appends to Dupfil.csv data, and non-match is assigned as nan value. The fillna(0) method helps us to replace nan values with 0.

pandas.DataFrame.merge API reference

CodePudding user response：

You can do it with pure pandas using df.merge.

Check out the example:

import pandas as pd


# file.csv
file_df = pd.DataFrame(
    columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
    data=[
        ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
        ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
        ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
        ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
        ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
        ("6", "2021.03", "1", "9", "26", "28", "30", "41"),
    ],
)

# adding control column (we already know that those are the right values)
file_df["correct"] = 1

# Dupfile.csv
dup_file_df = pd.DataFrame(
    columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
    data=[
        ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
        ("1", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
        ("2", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
        ("3", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
        ("4", "2021.03", "1", "2", "3", "4", "5", "6"),
        ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
        ("5", "2021.03", "1", "2", "3", "4", "5", "6"),
    ],
)

# We extract the column names to use in the merging process
cols = [x for x in dup_file_df.columns]

# We merge the 2 dataframes.
# The data frames are to match on every column (round, date and first to sixth). 
# The "correct" column will be populated only if all the columns are matching
merged = dup_file_df.merge(file_df, how="outer", left_on=cols, right_on=cols)

# We put "0" where correct is None and cast to integer (it was float)
merged["correct"] = merged["correct"].fillna(0).astype(int)

# Done!
print(merged)