I am trying to use np.where to compare whether the values from two columns are equal, but I am getting inconsistent results.
df['compare'] = np.where(df['a'] == df['b'], '0', '1')
Output:
a b compare
1B NaN 1
NaN NaN 1
NaN NaN 1
32X NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN 321 1
NaN Z51 1
NaN 3Y 1
It seemed strange that the command would return pairs of NaN as non-matches. I confirmed that column 'a' and column 'b' are both string data types.
I double checked the original CSV files. Using the 'if' formula in Excel, I found several additional pairs of non-matches. The NaN matches were not identified in matches in Excel.
Any tips on troubleshooting this issue?
CodePudding user response:
nan
is a special value which is not equal to itself and should not be used in equality test. You need to fill the df
with comparable values beforehand:
df_ = df.fillna(0)
df['compare'] = np.where(df_['a'] == df_['b'], '0', '1')
a b compare
0 1B NaN 1
1 NaN NaN 0
2 NaN NaN 0
3 32X NaN 1
4 NaN NaN 0
5 NaN NaN 0
6 NaN NaN 0
7 NaN NaN 0
8 NaN NaN 0
9 NaN NaN 0
10 NaN 321 1
11 NaN Z51 1
12 NaN 3Y 1