I want the values returned by np.mean to be roughly the same before and after the dtype change. The dtype has to remain float32.
array = np.random.randint(0, high=255, size=(3, 12000, 12000), dtype="int")
array = array[:, 500:10000, 500:10000]
array = array.reshape((-1, 3))
# array.shape is now (90250000, 3)
print(np.mean(array, axis=0), array.dtype)  # Nr.1
array = array.astype("float32")
print(np.mean(array, axis=0), array.dtype)  # Nr.2
Results of the two print functions:
[127.003107 127.00156286 126.99015613] int32
[47.589664 47.589664 47.589664] float32
Adding a .copy() to the view line has no effect. The size of the view affects how far off the float mean is. Changing the slice in the last two dimensions to [500:8000]
results in:
[76.35497 76.35497 76.35497] float32
Around [500:5000]
and below, both means are actually about the same.
Changing the code starting from the reshape line:
array = array.reshape((-1, 3))
array_float = array.astype("float32")
print(np.all(array_float == array), array.dtype, array_float.dtype)
Results in:
True int32 float32
So if the values are the same, why are the results from np.mean different?
CodePudding user response:
Changing to "float64" solves the problem.
array = np.random.randint(0, high=255, size=(3, 12000, 12000), dtype="int")
array = array[:, 500:10000, 500:10000]
array = array.reshape((-1, 3))
# array.shape is now (90250000, 3)
print(array.mean(axis=0), array.dtype)  # Nr.1
array = array.astype("float64")
print(array.mean(axis=0), array.dtype)  # Nr.2
Results in:
[126.98418438 126.9969912 127.00242922] int32
[126.98418438 126.9969912 127.00242922] float64
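Converting the whole 90-million-row array to float64 also doubles its memory footprint. If that matters, the dtype keyword of mean gives the same accurate result without materializing a float64 copy of the data; a minimal sketch with smaller stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 255, size=(1_000_000, 3), dtype="int32")  # stand-in data

# mean accumulates in float64 internally; no float64 copy
# of the array is ever created.
print(arr.mean(axis=0, dtype="float64"))  # each value close to 127
```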
CodePudding user response:
Your array:
In [50]: arr.shape, arr.dtype
Out[50]: ((90250000, 3), dtype('int32'))
You could have gotten this with np.random.randint(0, high=255, size=(90250000, 3), dtype="int"). In fact, we don't need that size-3 dimension; anyway, it's just many numbers in the [0, 255) range.
The expected mean:
In [51]: np.mean(arr, axis=0)
Out[51]: array([126.9822936 , 126.99682718, 126.99214526])
But notice what we get if we just sum those numbers:
In [52]: np.sum(arr, axis=0)
Out[52]: array([-1424749891, -1423438235, -1423860778])
The int32 sum has overflowed and wrapped around; there are too many numbers. So mean must be doing something more sophisticated than simply summing and dividing by the count.
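The wrap-around can be shown deterministically (np.sum's default accumulator dtype is platform-dependent, so it is forced here), while mean stays correct because, per the NumPy docs quoted below, integer input is averaged with a float64 accumulator. A small sketch with stand-in data:

```python
import numpy as np

# Stand-in data: 20 million copies of 127, so the true sum
# (2.54e+09) exceeds the int32 maximum (about 2.15e+09).
arr = np.full(20_000_000, 127, dtype="int32")

print(arr.sum(dtype="int32"))  # negative: the int32 accumulator wrapped around
print(arr.mean())              # 127.0: integer input is averaged in float64
```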
Taking the mean on the float32 array gives the funny values:
In [53]: np.mean(arr.astype('float32'), axis=0)
Out[53]: array([47.589664, 47.589664, 47.589664], dtype=float32)
but float64 matches the int case (though with a longer conversion time):
In [54]: np.mean(arr.astype('float64'), axis=0)
Out[54]: array([126.9822936 , 126.99682718, 126.99214526])
It looks like the float mean is just doing the sum-and-divide method:
In [56]: np.sum(arr.astype('float64'), axis=0)
Out[56]: array([1.14601520e+10, 1.14614637e+10, 1.14610411e+10])
In [57]: np.sum(arr.astype('float32'), axis=0)
Out[57]: array([4.2949673e+09, 4.2949673e+09, 4.2949673e+09], dtype=float32)
In [58]: Out[56]/arr.shape[0]
Out[58]: array([126.9822936 , 126.99682718, 126.99214526])
In [59]: Out[57]/arr.shape[0]
Out[59]: array([47.58966533, 47.58966533, 47.58966533])
While the sum is within the range of float32:
In [60]: np.finfo('float32')
Out[60]: finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
the problem is precision, not range: at that magnitude the gaps between representable float32 values are larger than the values being added, so the running sum stops growing. Note that 4.2949673e+09 is exactly 2**32, and 2**32 / 90250000 ≈ 47.5897, the funny mean.
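The stall at 4.2949673e+09, which is exactly 2**32, can be checked directly with np.spacing: at that magnitude the gap between adjacent float32 values is 512, so adding any value below 256 rounds away to nothing. A quick check:

```python
import numpy as np

big = np.float32(2**32)              # where the running float32 sum stalled

print(np.spacing(big))               # 512.0: gap to the next float32 value
print(big + np.float32(127) == big)  # True: adding 127 is lost entirely
```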
Note that the python sum has problems with the int version:
In [70]: sum(arr[:,0])
C:\Users\paul\AppData\Local\Temp\ipykernel_1128\1456076714.py:1: RuntimeWarning: overflow encountered in long_scalars
sum(arr[:,0])
Out[70]: -1424749891
There is a math.fsum that handles large sums better:
In [71]: math.fsum(arr[:,0])
Out[71]: 11460151997.0
Sum on the long ints also works fine:
In [72]: np.sum(arr.astype('int64'),axis=0)
Out[72]: array([11460151997, 11461463653, 11461041110], dtype=int64)
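That int64 sum also gives a route to an accurate mean without any float conversion of the array: sum exactly in int64 and divide once at the end. A sketch with hypothetical stand-in data:

```python
import numpy as np

# Hypothetical stand-in for the (90250000, 3) image data.
arr = np.full((20_000_000, 3), 127, dtype="int32")

# int64 holds the exact column sums; a single division then
# produces an accurate float64 mean.
mean = arr.sum(axis=0, dtype="int64") / arr.shape[0]
print(mean)  # [127. 127. 127.]
```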
From the np.mean docs:
dtype : data-type, optional
Type to use in computing the mean. For integer inputs, the default
is `float64`; for floating point inputs, it is the same as the
input dtype.
Notes
-----
The arithmetic mean is the sum of the elements along the axis divided
by the number of elements.
Note that for floating-point input, the mean is computed using the
same precision the input has. Depending on the input data, this can
cause the results to be inaccurate, especially for `float32` (see
example below). Specifying a higher-precision accumulator using the
`dtype` keyword can alleviate this issue.
Playing with the dtype parameter:
In [74]: np.mean(arr, axis=0, dtype='int32')
Out[74]: array([-15, -15, -15])
In [75]: np.mean(arr, axis=0, dtype='int64')
Out[75]: array([126, 126, 126], dtype=int64)
In [76]: np.mean(arr, axis=0, dtype='float32')
Out[76]: array([47.589664, 47.589664, 47.589664], dtype=float32)
In [77]: np.mean(arr, axis=0, dtype='float64')
Out[77]: array([126.9822936 , 126.99682718, 126.99214526])
The -15 is explained by:
In [78]: -1424749891/arr.shape[0]
Out[78]: -15.786702393351801
In sum, if you want accurate results you need to use float64, either via mean's default dtype for integer input, an explicit dtype="float64" in the mean call, or the appropriate astype. Working in float32 can give problems, especially with this many elements.
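Tying this back to the question's constraint (the stored dtype has to remain float32): keep the array in float32, accumulate the mean in float64, and cast only the tiny three-element result back down. A sketch with stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)  # stand-in for the original image data
arr = rng.integers(0, 255, size=(1_000_000, 3), dtype="int32").astype("float32")

# Accumulate in float64, then cast the (tiny) result back to
# float32 so the stored dtype requirement is still met.
mean32 = arr.mean(axis=0, dtype="float64").astype("float32")

print(mean32, mean32.dtype)  # values near 127, dtype float32
```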