Python pickling and unpickling numpy array?

I have the following dilemma. I am trying to pickle and then unpickle a numpy array that represents an image.

Executing this code:

import pickle
import sys

import numpy as np

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(sys.getsizeof(a1), a1.shape)

a2 = pickle.dumps(a1)
print(sys.getsizeof(a2), type(a2))

a3 = pickle.loads(a2)
print(sys.getsizeof(a3), a3.shape)

Produces this output:

6220928 (1080, 1920, 3)
6220995 <class 'bytes'>
128 (1080, 1920, 3)

So a1 is around 6 MB, and a2, the pickled representation of a1, is slightly larger but roughly the same size. But when I unpickle a2, sys.getsizeof reports only 128 bytes for a3, which is obviously not right.

a3 looks fine otherwise: I can call methods on it, assign values to its cells, and so on.

The result is the same if I replace the pickle calls with a1.dumps and np.loads, since these just call pickle.

So what exactly is the deal with the weird size?

CodePudding user response：

From the sys.getsizeof docs:

Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.

The important part is the final clause: the result does not have to be correct for third-party extensions. Basically, there's no guarantee that sys.getsizeof will give you consistent or correct values for numpy objects.

CodePudding user response：

Try the following code to compare sys.getsizeof with np.size(x) * x.itemsize:

import numpy as np
import sys
import pickle

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(np.size(a1)*a1.itemsize, sys.getsizeof(a1), a1.shape, type(a1))

a2 = pickle.dumps(a1)
print(sys.getsizeof(a2), type(a2))

a3 = pickle.loads(a2)
print(np.size(a3)*a3.itemsize, sys.getsizeof(a3), a3.shape, type(a3))

np.size(x) * x.itemsize gives the actual size of the array data, and by that measure a1 and a3 are exactly the same size:

6220800 6220944 (1080, 1920, 3) <class 'numpy.ndarray'>
6220998 <class 'bytes'>
6220800 144 (1080, 1920, 3) <class 'numpy.ndarray'>

a3 is effectively a "view" object: its own size is only 144 bytes, and it points to a separate buffer of 6,220,800 bytes. To verify: 6220944 = 6220800 + 144.
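
The arithmetic above can be checked directly. This is a minimal sketch, assuming NumPy's current behaviour in which sys.getsizeof counts the data buffer only for arrays that own it (flags.owndata is True). The slice a1[:] is not from the original post; it is just a stand-in for any array that borrows its buffer from somewhere else.

```python
import sys

import numpy as np

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
view = a1[:]  # same shape and dtype, but borrows a1's buffer

data_bytes = a1.nbytes             # size of the pixel data only
owner_size = sys.getsizeof(a1)     # header + data, because a1 owns its buffer
header_size = sys.getsizeof(view)  # header only, because the view does not

print(a1.flags.owndata, view.flags.owndata)
print(owner_size, "=", data_bytes, "+", header_size)
```

If the question is how much image data an array holds, nbytes is the number to look at; it does not depend on who owns the buffer.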

CodePudding user response：

Make an array, dump it, and load it back. It doesn't have to be big:

In [15]: a1 = np.zeros((10,20,30)); a2 = pickle.dumps(a1); a3 = pickle.loads(a2)

nbytes matches, as does shape:

In [16]: a1.nbytes, a3.nbytes
Out[16]: (48000, 48000)    
In [17]: a1.shape, a3.shape
Out[17]: ((10, 20, 30), (10, 20, 30))

In [18]: type(a2)
Out[18]: bytes
In [19]: len(a2)
Out[19]: 48154

Since getsizeof for a3 is so small, I suspect it's a view of something. That is, getsizeof does not 'see' its data buffer.

If an array owns its data, its base will be None; otherwise it may be a view of another array. But apparently loads has constructed this array by referencing a bytes object:

In [20]: type(a3.base)
Out[20]: bytes    
In [21]: len(a3.base)
Out[21]: 48000

That looks like a2 with its pickle header stripped off: 48154 bytes for the pickle versus 48000 bytes of raw array data.
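
One way to find out where an array's bytes actually live is to follow the base chain until it ends. This is only a sketch: buffer_owner is a hypothetical helper name, not a NumPy function, and the type of the owner for an unpickled array may depend on the NumPy version and the pickle protocol in use.

```python
import pickle

import numpy as np


def buffer_owner(arr):
    """Follow .base until reaching the object that holds the memory."""
    obj = arr
    while isinstance(obj, np.ndarray) and obj.base is not None:
        obj = obj.base
    return obj


a1 = np.zeros((10, 20, 30))
a3 = pickle.loads(pickle.dumps(a1))

owner = buffer_owner(a3)
print(type(owner))        # the session above shows bytes here
print(a3.flags.owndata)   # False when a3 borrows its buffer from that object
print(np.array_equal(a1, a3), a1.nbytes == a3.nbytes)
```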

Anyway, getsizeof is not that useful for examining arrays, or lists.
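
The same limitation shows up with plain lists, where getsizeof counts the list's own pointer table but not the objects it refers to. A small illustration; the variable names are made up for this example:

```python
import sys

# Three strings of one million characters each.
big = ["x" * 1_000_000 for _ in range(3)]

shallow = sys.getsizeof(big)                         # the list object itself
deep = shallow + sum(sys.getsizeof(s) for s in big)  # plus the strings it holds

print(shallow, deep)
```

The shallow figure stays tiny no matter how large the strings are, which is the same effect as getsizeof ignoring a buffer the array does not own.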

Here's a simpler case, with a common test array:

In [22]: x = np.arange(12).reshape(3,4)
In [23]: sys.getsizeof(x)
Out[23]: 120
In [24]: sys.getsizeof(x.base)
Out[24]: 152
In [25]: x.base
Out[25]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

x is actually a view of the array created by arange.
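
The view relationship can be confirmed without getsizeof at all. In this sketch the intermediate name flat is introduced only so that the arange result can be inspected afterwards:

```python
import numpy as np

flat = np.arange(12)    # owns its data
x = flat.reshape(3, 4)  # view onto the same buffer

print(x.base is flat)             # x refers back to the array it was reshaped from
print(np.shares_memory(x, flat))  # both use the same memory
print(flat.flags.owndata, x.flags.owndata)

x[0, 0] = 99    # writing through the view...
print(flat[0])  # ...changes the original array too
```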