I want the values returned by np.mean to be roughly the same before and after the dtype change. The dtype has to remain float32.
array = np.random.randint(0, high=255, size=(3, 12000, 12000), dtype="int")
array = array[:, 500:10000, 500:10000]
array = array.reshape((-1, 3))
# array.shape is now (90250000, 3)
print(np.mean(array, axis=0), array.dtype)  # Nr.1
array = array.astype("float32")
print(np.mean(array, axis=0), array.dtype)  # Nr.2
Results of the two print functions:
[127.003107 127.00156286 126.99015613] int32
[47.589664 47.589664 47.589664] float32
Adding a .copy() to the view line has no effect. The size of the view affects how far off the float mean is. Changing the slice in the last two dimensions to [500:8000]
results in:
[76.35497 76.35497 76.35497] float32
Around [500:5000]
and below, both means are actually about the same.
Changing the code starting from the reshape line:
array = array.reshape((-1, 3))
array_float = array.astype("float32")
print(np.all(array_float == array), array.dtype, array_float.dtype)
Results in:
True int32 float32
So if the values are the same, why are the results from np.mean different?
CodePudding user response:
Changing to "float64" solves the problem.
array = np.random.randint(0, high=255, size=(3, 12000, 12000), dtype="int")
array = array[:, 500:10000, 500:10000]
array = array.reshape((-1, 3))
# array.shape is now (90250000, 3)
print(array.mean(axis=0), array.dtype)  # Nr.1
array = array.astype("float64")
print(array.mean(axis=0), array.dtype)  # Nr.2
Results in:
[126.98418438 126.9969912 127.00242922] int32
[126.98418438 126.9969912 127.00242922] float64
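Converting the whole 90-million-row array to float64 also doubles its memory footprint. If that matters, the dtype keyword of mean gives the same accurate result without materializing a float64 copy of the data; a minimal sketch with smaller stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 255, size=(1_000_000, 3), dtype="int32")  # stand-in data

# mean accumulates in float64 internally; no float64 copy
# of the array is ever created.
print(arr.mean(axis=0, dtype="float64"))  # each value close to 127
```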
CodePudding user response:
Your array:
In [50]: arr.shape, arr.dtype
Out[50]: ((90250000, 3), dtype('int32'))
You could have gotten this with np.random.randint(0, high=255, size=(90250000, 3), dtype="int"). In fact, we don't need that size-3 dimension; anyway, it's just many numbers in the [0, 255) range.
The expected mean:
In [51]: np.mean(arr, axis=0)
Out[51]: array([126.9822936 , 126.99682718, 126.99214526])
But notice what we get if we just sum those numbers:
In [52]: np.sum(arr, axis=0)
Out[52]: array([-1424749891, -1423438235, -1423860778])
The int32 sum has overflowed and wrapped around; there are too many numbers. So mean must be doing something more sophisticated than simply summing and dividing by the count.
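The wrap-around can be shown deterministically (np.sum's default accumulator dtype is platform-dependent, so it is forced here), while mean stays correct because, per the NumPy docs quoted below, integer input is averaged with a float64 accumulator. A small sketch with stand-in data:

```python
import numpy as np

# Stand-in data: 20 million copies of 127, so the true sum
# (2.54e+09) exceeds the int32 maximum (about 2.15e+09).
arr = np.full(20_000_000, 127, dtype="int32")

print(arr.sum(dtype="int32"))  # negative: the int32 accumulator wrapped around
print(arr.mean())              # 127.0: integer input is averaged in float64
```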
Taking the mean on the float32 array gives the funny values:
In [53]: np.mean(arr.astype('float32'), axis=0)
Out[53]: array([47.589664, 47.589664, 47.589664], dtype=float32)
but float64 matches the int case (though with a longer conversion time):
In [54]: np.mean(arr.astype('float64'), axis=0)
Out[54]: array([126.9822936 , 126.99682718, 126.99214526])
It looks like the float mean is just doing the sum-and-divide method:
In [56]: np.sum(arr.astype('float64'), axis=0)
Out[56]: array([1.14601520e+10, 1.14614637e+10, 1.14610411e+10])
In [57]: np.sum(arr.astype('float32'), axis=0)
Out[57]: array([4.2949673e+09, 4.2949673e+09, 4.2949673e+09], dtype=float32)
In [58]: Out[56]/arr.shape[0]
Out[58]: array([126.9822936 , 126.99682718, 126.99214526])
In [59]: Out[57]/arr.shape[0]
Out[59]: array([47.58966533, 47.58966533, 47.58966533])
While the sum is within the range of float32:
In [60]: np.finfo('float32')
Out[60]: finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
the problem is precision, not range: at that magnitude the gaps between representable float32 values are larger than the values being added, so the running sum stops growing. Note that 4.2949673e+09 is exactly 2**32, and 2**32 / 90250000 ≈ 47.5897, the funny mean.
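The stall at 4.2949673e+09, which is exactly 2**32, can be checked directly with np.spacing: at that magnitude the gap between adjacent float32 values is 512, so adding any value below 256 rounds away to nothing. A quick check:

```python
import numpy as np

big = np.float32(2**32)              # where the running float32 sum stalled

print(np.spacing(big))               # 512.0: gap to the next float32 value
print(big + np.float32(127) == big)  # True: adding 127 is lost entirely
```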
Note that the python sum has problems with the int version:
In [70]: sum(arr[:,0])
C:\Users\paul\AppData\Local\Temp\ipykernel_1128\1456076714.py:1: RuntimeWarning: overflow encountered in long_scalars
sum(arr[:,0])
Out[70]: -1424749891
There is a math.fsum that handles large sums better:
In [71]: math.fsum(arr[:,0])
Out[71]: 11460151997.0
Sum on the long ints also works fine:
In [72]: np.sum(arr.astype('int64'),axis=0)
Out[72]: array([11460151997, 11461463653, 11461041110], dtype=int64)
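That int64 sum also gives a route to an accurate mean without any float conversion of the array: sum exactly in int64 and divide once at the end. A sketch with hypothetical stand-in data:

```python
import numpy as np

# Hypothetical stand-in for the (90250000, 3) image data.
arr = np.full((20_000_000, 3), 127, dtype="int32")

# int64 holds the exact column sums; a single division then
# produces an accurate float64 mean.
mean = arr.sum(axis=0, dtype="int64") / arr.shape[0]
print(mean)  # [127. 127. 127.]
```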
From the np.mean docs:
dtype : data-type, optional
Type to use in computing the mean. For integer inputs, the default
is `float64`; for floating point inputs, it is the same as the
input dtype.
Notes
-----
The arithmetic mean is the sum of the elements along the axis divided
by the number of elements.
Note that for floating-point input, the mean is computed using the
same precision the input has. Depending on the input data, this can
cause the results to be inaccurate, especially for `float32` (see
example below). Specifying a higher-precision accumulator using the
`dtype` keyword can alleviate this issue.
Playing with the dtype parameter:
In [74]: np.mean(arr, axis=0, dtype='int32')
Out[74]: array([-15, -15, -15])
In [75]: np.mean(arr, axis=0, dtype='int64')
Out[75]: array([126, 126, 126], dtype=int64)
In [76]: np.mean(arr, axis=0, dtype='float32')
Out[76]: array([47.589664, 47.589664, 47.589664], dtype=float32)
In [77]: np.mean(arr, axis=0, dtype='float64')
Out[77]: array([126.9822936 , 126.99682718, 126.99214526])
The -15 is explained by:
In [78]: -1424749891/arr.shape[0]
Out[78]: -15.786702393351801
In sum, if you want accurate results you need to use float64, either via mean's default dtype for integer input, an explicit dtype="float64" in the mean call, or the appropriate astype. Working in float32 can give problems, especially with this many elements.
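Tying this back to the question's constraint (the stored dtype has to remain float32): keep the array in float32, accumulate the mean in float64, and cast only the tiny three-element result back down. A sketch with stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)  # stand-in for the original image data
arr = rng.integers(0, 255, size=(1_000_000, 3), dtype="int32").astype("float32")

# Accumulate in float64, then cast the (tiny) result back to
# float32 so the stored dtype requirement is still met.
mean32 = arr.mean(axis=0, dtype="float64").astype("float32")

print(mean32, mean32.dtype)  # values near 127, dtype float32
```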