Home > Back-end >  np.where to compare whether two columns are identical is inconsistent with results from Excel
np.where to compare whether two columns are identical is inconsistent with results from Excel

Time:01-21

I am trying to use np.where to compare whether the values from two columns are equal, but I am getting inconsistent results.

df['compare'] = np.where(df['a'] == df['b'], '0', '1')

Output:

a       b               compare
1B      NaN             1
NaN     NaN             1
NaN     NaN             1
32X     NaN             1
NaN     NaN             1
NaN     NaN             1
NaN     NaN             1
NaN     NaN             1
NaN     NaN             1
NaN     NaN             1
NaN     321             1
NaN     Z51             1
NaN     3Y              1   

It seemed strange that the command would return pairs of NaN as non-matches. I confirmed that column 'a' and column 'b' are both string data types.

I double checked the original CSV files. Using the 'if' formula in Excel, I found several additional pairs of non-matches. The NaN matches were not identified in matches in Excel.

Any tips on troubleshooting this issue?

CodePudding user response:

nan is a special value which is not equal to itself and should not be used in equality test. You need to fill the df with comparable values beforehand:

df_ = df.fillna(0)
df['compare'] = np.where(df_['a'] == df_['b'], '0', '1')

      a    b compare
0    1B  NaN       1
1   NaN  NaN       0
2   NaN  NaN       0
3   32X  NaN       1
4   NaN  NaN       0
5   NaN  NaN       0
6   NaN  NaN       0
7   NaN  NaN       0
8   NaN  NaN       0
9   NaN  NaN       0
10  NaN  321       1
11  NaN  Z51       1
12  NaN   3Y       1
  • Related