How to match two dataframes precisely and get the output as 1 if matched and 0 if not matched?-CodePudding

The dataframe is as follows:

df1:

name     |   age    |   state    |   number   | score
------------------------------------------------------
A            23         AZ         5434567        92.1
B            54         AZ         1234543        87.6
C            32         AZ         7654344        89.9
D            44         GA         8765433        72.4

df2:

name     |   age    |   state    |   number   | score
------------------------------------------------------
A            23         GA         5434567       92.1
D            54         AZ         1234543       76.4
C            33         AZ         7654344       99.9
D            46         GA         8765433       72.4

The desired dataframe is as follows:

name     |   age    |   state    |   number   | score
-------------------------------------------------------
1            1            0            1          1
0            1            1            1          0
1            0            1            1          0
1            0            1            1          1

The code I tried is:

outputdf = df1.eq(df2)

and

outputdf = df1.ne(df2)

But neither of them seem to work correctly.

wrong output after using the eq line:

name     |   age    |   state    |   number   | score
-------------------------------------------------------
1            1            0            1          0
0            1            1            1          1
1            0            1            1          1
1            0            1            1          1

wrong output after using the ne line:

name     |   age    |   state    |   number   | score
-------------------------------------------------------
1            1            0            1          0
0            1            1            1          1
1            0            0            0          1
0            0            0            0          1

Could anyone please help me out here? Thank you

CodePudding user response：

direct comparison of the dataframes should work, just cast from bool to integer:

df1.eq(df2).astype(int)
# or (df1 == df2).astype(int)

output:

   name  age  state  number
0     1    1      0       1
1     0    1      1       1
2     1    0      1       1
3     1    0      1       1

CodePudding user response：

Could your issue be due to floating point approximation?

You can round the numerical columns before comparison:

out = (df1.select_dtypes('number').round(2) # use the desired precision
 .eq(df2.select_dtypes('number').round(2))
 .astype(int)
)

Output:

   age  number  score
0    1       1      1
1    1       1      0
2    0       1      0
3    0       1      1

If this was the issue and you want the output with all columns, you can correct your initial output with:

# initial output
out = df1.eq(df2).astype(int)

# correction to account for floating point approximation
# use the atol/rtol parameters if needed
cols = list(df1.select_dtypes('number'))
out[cols] = np.isclose(df1[cols], df2[cols]).astype(int)

# or correction with round
# out[cols] = df1[cols].round(2).eq(df2[cols].round(2)).astype(int)

Output:

   name  age  state  number  score
0     1    1      0       1      1
1     0    1      1       1      0
2     1    0      1       1      0
3     1    0      1       1      1

CodePudding user response：

Because float columns precision problem is possible exctract them and compare separately with numpy.isclose, then add all another columns in concat:

cols = df1.select_dtypes('floating').columns
cols1 = df1.columns.difference(cols)

df3 = pd.DataFrame(np.isclose(df1[cols], df2[cols]).astype(int), columns=cols)
df4 = df1[cols1].eq(df2[cols1]).astype(int)

df = pd.concat([df3, df4], axis=1).reindex(df1.columns, axis=1)
print (df)
   name  age  state  number  score
0     1    1      0       1      1
1     0    1      1       1      0
2     1    0      1       1      0
3     1    0      1       1      1

CodePudding user response：

The problem you are having is likely to be due to the way the dataframes are constructed. For example, if I generate the dataframes as indicated in the example inputs above - then I get the expected outputs using:

outputdf = df1.eq(df2).astype(int)

Here is an example of the code:

df1 = pd.DataFrame({ 'name': ['A', 'B', 'C', 'D'],
                    'age': [23, 54, 32, 44],
                    'state': ['AZ', 'AZ', 'AZ', 'GA'],
                    'number': [5434567, 1234543, 7654344, 8765433],
                    'score': [92.1, 87.6, 89.9, 72.4]})

df2 = pd.DataFrame({ 'name': ['A', 'D', 'C', 'D'],
                    'age': [23, 54, 33, 46],
                    'state': ['GA', 'AZ', 'AZ', 'GA'],
                    'number': [5434567, 1234543, 7654344, 8765433],
                    'score': [92.1, 76.4, 99.9, 72.4]})

outputdf = df1.eq(df2).astype(int)
print(outputdf)

OUTPUT:

   name  age  state  number  score
0     1    1      0       1      1
1     0    1      1       1      0
2     1    0      1       1      0
3     1    0      1       1      1