Suppose I have two 2D numpy arrays: x of type float64 and mask of type bool. I want to find the variance of every column in x, taking into account only the values selected by mask. Here's what I did:
np.var(x, axis=0, where=mask)
Unfortunately, it produced an error:
FloatingPointError: overflow encountered in multiply
So I came up with this code:
x[~mask] = np.nan
np.nanvar(x, axis=0)
which works fine. However, I'd like to avoid this approach because it requires an unnecessary assignment of NaNs, which is a waste of time. I'd also like to avoid explicitly specifying a dtype, like this:
np.var(x, axis=0, where=mask, dtype=np.float128)
since I consider the overflow error to be the result of my poor code, and np.float64 should be more than enough for my task.
So please help me understand why the seemingly equivalent first and second snippets yield such different results.
EDIT: An important thing to note here is that I work in a Jupyter notebook, and the error seems to be reproducible only after a kernel restart. Simply running the cell twice fixes it for some reason.
CodePudding user response:
There is an intermediate calculation in numpy.var that operates on all the input values, even those where the where argument is False. This can result in spurious warnings from numpy.var. See the numpy issue that I created about this.
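For illustration, here is a minimal sketch of that behaviour. The sample values and the use of np.errstate to escalate the overflow to an error (as it apparently was in your session) are assumptions made for the demo:

import numpy as np

x = np.array([[1.0, 2.0],
              [1e200, 3.0]])      # 1e200 squared exceeds the float64 range
mask = np.array([[True, True],
                 [False, True]])  # the huge value is excluded by the mask

# The excluded value is still squared in np.var's intermediate (x - mean)**2 step,
# so the overflow occurs even though where=mask keeps it out of the result.
with np.errstate(over="raise"):   # turn the overflow warning into an error
    np.var(x, axis=0, where=mask) # FloatingPointError: overflow encountered in multiply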
Your nan-based workaround is fine. It might also be sufficient to set x[~mask] = 0 and then use numpy.var as you already are.
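For example, here is a minimal sketch of that zero-fill suggestion; using np.where to build a filled copy instead of assigning in place is just an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
mask = rng.random((5, 3)) > 0.3

x_filled = np.where(mask, x, 0.0)           # excluded entries become 0.0; x itself is untouched
var = np.var(x_filled, axis=0, where=mask)  # zeros keep the intermediate squares finite,
                                            # while where=mask still excludes them from the statistics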
CodePudding user response:
You can shift your data to avoid the overflow without changing the variance:
np.var(x - np.max(x), axis=0, where=mask)
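This works because the variance is shift-invariant: Var(X - c) = Var(X) for any constant c. A quick check on synthetic data (the array and mask here are assumptions for the demo):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
mask = np.ones_like(x, dtype=bool)

a = np.var(x, axis=0, where=mask)
b = np.var(x - np.max(x), axis=0, where=mask)
assert np.allclose(a, b)   # subtracting a constant does not change the variance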