Suppose I have two 2D numpy arrays: x of type float64 and mask of type bool. I want to find the variance of every column in x, taking into account only the values selected by mask. Here's what I did:
np.var(x, axis=0, where=mask)
Unfortunately, it produced an error:
FloatingPointError: overflow encountered in multiply
So I came up with this code:
x[~mask] = np.nan
np.nanvar(x, axis=0)
which works fine. However, I'd like to avoid this approach because it requires an unnecessary assignment of NaNs, which is a waste of time. I'd also like to avoid explicitly specifying a dtype, like this:
np.var(x, axis=0, where=mask, dtype=np.float128)
since I consider the overflow error to be the result of my poor code, and np.float64 should be more than enough for my task.
So please help me understand why the seemingly equivalent first and second snippets yield such different results.
EDIT: An important thing to note here is that I work in a Jupyter notebook, and the error seems to be reproducible only after a kernel restart. Simply running the cell twice fixes it for some reason.
CodePudding user response:
There is an intermediate calculation in numpy.var that operates on all the input values, even those where the where argument is False. This can result in spurious warnings from numpy.var. See the numpy issue that I created about this.
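For illustration, here is a minimal sketch of that behaviour. The sample values and the use of np.errstate to escalate the overflow to an error (as it apparently was in your session) are assumptions made for the demo:

import numpy as np

x = np.array([[1.0, 2.0],
              [1e200, 3.0]])      # 1e200 squared exceeds the float64 range
mask = np.array([[True, True],
                 [False, True]])  # the huge value is excluded by the mask

# The excluded value is still squared in np.var's intermediate (x - mean)**2 step,
# so the overflow occurs even though where=mask keeps it out of the result.
with np.errstate(over="raise"):   # turn the overflow warning into an error
    np.var(x, axis=0, where=mask) # FloatingPointError: overflow encountered in multiply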
Your nan-based workaround is fine. It might also be sufficient to set x[~mask] = 0 and then use numpy.var as you already are.
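For example, here is a minimal sketch of that zero-fill suggestion; using np.where to build a filled copy instead of assigning in place is just an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
mask = rng.random((5, 3)) > 0.3

x_filled = np.where(mask, x, 0.0)           # excluded entries become 0.0; x itself is untouched
var = np.var(x_filled, axis=0, where=mask)  # zeros keep the intermediate squares finite,
                                            # while where=mask still excludes them from the statistics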
CodePudding user response:
You can shift your data to avoid the overflow without changing the variance:
np.var(x - np.max(x), axis=0, where=mask)
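This works because the variance is shift-invariant: Var(X - c) = Var(X) for any constant c. A quick check on synthetic data (the array and mask here are assumptions for the demo):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
mask = np.ones_like(x, dtype=bool)

a = np.var(x, axis=0, where=mask)
b = np.var(x - np.max(x), axis=0, where=mask)
assert np.allclose(a, b)   # subtracting a constant does not change the variance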