The dataframe is as follows:
df1:
name | age | state | number | score
------------------------------------------------------
A 23 AZ 5434567 92.1
B 54 AZ 1234543 87.6
C 32 AZ 7654344 89.9
D 44 GA 8765433 72.4
df2:
name | age | state | number | score
------------------------------------------------------
A 23 GA 5434567 92.1
D 54 AZ 1234543 76.4
C 33 AZ 7654344 99.9
D 46 GA 8765433 72.4
The desired dataframe is as follows:
name | age | state | number | score
-------------------------------------------------------
1 1 0 1 1
0 1 1 1 0
1 0 1 1 0
1 0 1 1 1
The code I tried is:
outputdf = df1.eq(df2)
and
outputdf = df1.ne(df2)
But neither of them seem to work correctly.
wrong output after using the eq line:
name | age | state | number | score
-------------------------------------------------------
1 1 0 1 0
0 1 1 1 1
1 0 1 1 1
1 0 1 1 1
wrong output after using the ne line:
name | age | state | number | score
-------------------------------------------------------
1 1 0 1 0
0 1 1 1 1
1 0 0 0 1
0 0 0 0 1
Could anyone please help me out here? Thank you
CodePudding user response:
direct comparison of the dataframes should work, just cast from bool to integer:
df1.eq(df2).astype(int)
# or (df1 == df2).astype(int)
output:
name age state number
0 1 1 0 1
1 0 1 1 1
2 1 0 1 1
3 1 0 1 1
CodePudding user response:
Could your issue be due to floating point approximation?
You can round the numerical columns before comparison:
out = (df1.select_dtypes('number').round(2) # use the desired precision
.eq(df2.select_dtypes('number').round(2))
.astype(int)
)
Output:
age number score
0 1 1 1
1 1 1 0
2 0 1 0
3 0 1 1
If this was the issue and you want the output with all columns, you can correct your initial output with:
# initial output
out = df1.eq(df2).astype(int)
# correction to account for floating point approximation
# use the atol/rtol parameters if needed
cols = list(df1.select_dtypes('number'))
out[cols] = np.isclose(df1[cols], df2[cols]).astype(int)
# or correction with round
# out[cols] = df1[cols].round(2).eq(df2[cols].round(2)).astype(int)
Output:
name age state number score
0 1 1 0 1 1
1 0 1 1 1 0
2 1 0 1 1 0
3 1 0 1 1 1
CodePudding user response:
Because float columns precision problem is possible exctract them and compare separately with numpy.isclose
, then add all another columns in concat
:
cols = df1.select_dtypes('floating').columns
cols1 = df1.columns.difference(cols)
df3 = pd.DataFrame(np.isclose(df1[cols], df2[cols]).astype(int), columns=cols)
df4 = df1[cols1].eq(df2[cols1]).astype(int)
df = pd.concat([df3, df4], axis=1).reindex(df1.columns, axis=1)
print (df)
name age state number score
0 1 1 0 1 1
1 0 1 1 1 0
2 1 0 1 1 0
3 1 0 1 1 1
CodePudding user response:
The problem you are having is likely to be due to the way the dataframes are constructed. For example, if I generate the dataframes as indicated in the example inputs above - then I get the expected outputs using:
outputdf = df1.eq(df2).astype(int)
Here is an example of the code:
df1 = pd.DataFrame({ 'name': ['A', 'B', 'C', 'D'],
'age': [23, 54, 32, 44],
'state': ['AZ', 'AZ', 'AZ', 'GA'],
'number': [5434567, 1234543, 7654344, 8765433],
'score': [92.1, 87.6, 89.9, 72.4]})
df2 = pd.DataFrame({ 'name': ['A', 'D', 'C', 'D'],
'age': [23, 54, 33, 46],
'state': ['GA', 'AZ', 'AZ', 'GA'],
'number': [5434567, 1234543, 7654344, 8765433],
'score': [92.1, 76.4, 99.9, 72.4]})
outputdf = df1.eq(df2).astype(int)
print(outputdf)
OUTPUT:
name age state number score
0 1 1 0 1 1
1 0 1 1 1 0
2 1 0 1 1 0
3 1 0 1 1 1