I have the DataFrame below, which has some missing values.
df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])
Since df['Z']
is supposed to be an integer column, I changed its data type to pandas'
new experimental nullable integer type, as below.
df['Z'] = df['Z'].astype(pd.Int32Dtype())
df
X Y Z
0 A 1 <NA>
1 B 2 5
Now I am trying to use a simple np.where
call to replace the non-null values in the column df['Z']
with a fixed integer value (say 1
), using the code below.
np.where(pd.isna(df['Z']), pd.NA, np.where(df['Z'] > 0, 1, 0))
But I get the following error, and I cannot understand why, since I am already checking for the rows with null values in the first condition.
TypeError: boolean value of NA is ambiguous
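To see where the error comes from, here is a minimal reproduction (a sketch of the setup above): the comparison on the nullable Int32 column produces pandas' three-valued boolean dtype, and the <NA> entry is what np.where trips over.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])
df['Z'] = df['Z'].astype(pd.Int32Dtype())

# The comparison keeps the missing value as <NA> instead of coercing it.
mask = df['Z'] > 0
print(mask.dtype)  # boolean (pandas' nullable boolean dtype)

# np.where has to evaluate each element's truth value, and bool(pd.NA) raises.
try:
    np.where(pd.isna(df['Z']), pd.NA, np.where(df['Z'] > 0, 1, 0))
    err = None
except TypeError as e:
    err = e
print(err)  # boolean value of NA is ambiguous
```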
CodePudding user response:
np.where
expects an array of booleans. With a NumPy float
dtype, using >
on the Series returns False
for NaNs. With the Int32
dtype (note the capital I
), >
doesn't coerce missing values to False; it returns <NA>, hence the error.
One solution is to use df['Z'].gt(0).fillna(False)
instead of df['Z'] > 0
. (They're the same comparison; the second one just changes <NA> to False):
np.where(pd.isna(df['Z']), pd.NA, np.where(df['Z'].gt(0).fillna(False), 1, 0))
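A runnable sketch of this approach (the DataFrame is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])
df['Z'] = df['Z'].astype(pd.Int32Dtype())

# .gt(0) yields the nullable 'boolean' dtype; .fillna(False) collapses the
# <NA> entries to plain False, so np.where receives ordinary booleans.
result = np.where(pd.isna(df['Z']), pd.NA,
                  np.where(df['Z'].gt(0).fillna(False), 1, 0))
print(result)  # [<NA> 1]
```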
CodePudding user response:
Let's use element-wise "OR" (|) logic, which follows Kleene three-valued semantics on nullable dtypes.
np.where(df['Z'].isna() | (df['Z'] <= 0), 0, 1)
Output:
array([0, 1])
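As a self-contained sketch: under Kleene logic, True | <NA> evaluates to True, so the combined mask has no <NA> left and np.where accepts it. Note that, unlike the other answers, this maps the missing rows to 0 rather than keeping <NA>.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])
df['Z'] = df['Z'].astype(pd.Int32Dtype())

# Row 0: isna() is True, so True | <NA> resolves to True (Kleene logic).
# Row 1: False | False is False. No <NA> survives in the mask.
mask = df['Z'].isna() | (df['Z'] <= 0)
out = np.where(mask, 0, 1)
print(out)  # [0 1]
```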
CodePudding user response:
As suggested by @user17242583, np.where
need an array of boolean values only but your comparison return a tri-state array: True
, False
and <NA>
.
>>> df['Z'] > 0
0 <NA>
1 True
Name: Z, dtype: boolean
In this case, np.where
can't decide whether the <NA> value should be interpreted as True
or False
.
Just cast your column on the fly:
>>> np.where(pd.isna(df['Z']), pd.NA, np.where(df['Z'].astype(float) > 0, 1, 0))
array([<NA>, 1], dtype=object)
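A self-contained sketch of the cast: astype(float) turns <NA> into np.nan, and nan > 0 is plain False, so the inner condition is an ordinary two-valued boolean array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                  columns=['X', 'Y', 'Z'])
df['Z'] = df['Z'].astype(pd.Int32Dtype())

# After the cast, the missing entry becomes np.nan, and nan > 0 == False,
# so np.where never has to evaluate the truth of <NA>.
result = np.where(pd.isna(df['Z']), pd.NA,
                  np.where(df['Z'].astype(float) > 0, 1, 0))
print(result)  # [<NA> 1]
```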