numpy: how to keep datatype of single precision array scalar input when adding a number

I have a single precision input that may be an array scalar, e.g. x = np.float32(3). I'd like to add one to it: y = x + 1. However, y will be float64 instead of float32.

I can only think of two workarounds:

convert the input to a 1d array: x = np.float32([3]) so that y = x + 1 is float32
convert 1 into a lower precision type: y = np.float32(3) + np.float16(1) is float32
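
A minimal sketch to check both workarounds. The result dtype of the plain x + 1 case depends on the installed NumPy version (newer releases changed how Python scalars are promoted), so it is only printed here and no expected value is given for it.

```python
import numpy as np

x = np.float32(3)

# Workaround 1: wrap the value in a 1d float32 array before adding.
y1 = np.float32([3]) + 1

# Workaround 2: give the constant a NumPy type that is narrower than float32.
y2 = x + np.float16(1)

print(y1.dtype)  # float32
print(y2.dtype)  # float32

# Array scalar plus Python int: dtype depends on the NumPy version in use.
print((x + 1).dtype)
```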

However, I have a lot of functions, so the above fixes require me to add if-else statements to each function... Are there any better ways? Thanks!

CodePudding user response：

0x5 "Adding Integer to half-float not producing the expected result" Why is half the size? 0x6a0100 "float64 cannot be cast to numpy.complex64" in ufuncs. Numpy should have known

Type conversion for this case has been uncertain since numpy 1.13. It was discussed in 0x67 "Quick fix for integer operation with half dtype in NumPy", where a decision was made to resolve it as follows: "compatibility with Matlab, always convert to float16 before operation". The bug reported in 0x6e "sum(a) where a = float32(1) is float64" backtracked on that decision, but without a clear understanding of the following:

The issue is with how datatypes propagate through scalar inputs, which is a bigger issue than just summing. Mixing scalars with arrays is always a gray area, as you experienced. In some contexts (deconte abd deduce) such a mix should raise, but there is no consensus on how numpy should handle it (see 0x75 "Array scalar artifact at a ufunc boundary"). Until that's resolved, Matlab's upcasting, because it does it to 16, is not a good one for numpy. That upcasting is especially problematic for product, and might be the reason why numpy issues sometimes suggest that, but "matlab doesn't need to be revised because mathematicians are used to this surprise", which also means matlab is used by these mathematicians with warnings, and "doesn't need to be revised because C was defined this way", which also means C is used on floats as if they are integers to avoid the surprise.
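
To see how dtypes combine without the scalar special cases, np.promote_types can be queried directly. This is only a sketch of the dtype-level rules; treating a Python int like a default NumPy integer is an assumption about the mechanism, though it would be consistent with the float64 result reported in the question.

```python
import numpy as np

# float32 combined with a narrower float stays float32.
with_half = np.promote_types(np.float32, np.float16)

# float32 combined with a small integer type also stays float32.
with_int16 = np.promote_types(np.float32, np.int16)

# float32 combined with a 64-bit integer is promoted to float64.
with_int64 = np.promote_types(np.float32, np.int64)

print(with_half)   # float32
print(with_int16)  # float32
print(with_int64)  # float64
```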

CodePudding user response：

You can use numpy.add(..., dtype=np.float32) to force the result dtype, like below:

>>> import numpy as np
>>> res = np.add(np.array([3], dtype=np.float32), np.float16(1), dtype=np.float32)
>>> res
array([4.], dtype=float32)

>>> res.dtype
dtype('float32')
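
To avoid adding an if-else to every function, one option is a small decorator that casts the result back to the dtype of the first argument. This is only a sketch: preserve_dtype and add_one are hypothetical names, not part of NumPy. It also assumes the first argument already has the desired dtype, and that computing in a wider type and casting down afterwards is acceptable.

```python
import functools

import numpy as np


def preserve_dtype(func):
    """Cast the result of func back to the dtype of its first argument."""

    @functools.wraps(func)
    def wrapper(x, *args, **kwargs):
        in_dtype = np.asarray(x).dtype
        result = np.asarray(func(x, *args, **kwargs)).astype(in_dtype)
        # Indexing with () turns a 0-d array back into an array scalar
        # and leaves arrays with one or more dimensions as arrays.
        return result[()]

    return wrapper


@preserve_dtype
def add_one(x):
    # Hypothetical example function; any dtype-widening body would do.
    return x + 1


print(type(add_one(np.float32(3))))       # <class 'numpy.float32'>
print(add_one(np.float32([3, 4])).dtype)  # float32
```

The decorator works the same way for array scalars and arrays, so the functions themselves do not need to branch on the input type.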