Python pickling and unpickling numpy array?

Time:10-28

I have the following dilemma. I am trying to pickle and then unpickle a numpy array that represents an image.

Executing this code:

import pickle
import sys

import numpy as np

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(sys.getsizeof(a1), a1.shape)

a2 = pickle.dumps(a1)
print(sys.getsizeof(a2), type(a2))

a3 = pickle.loads(a2)
print(sys.getsizeof(a3), a3.shape)

Produces this output:

6220928 (1080, 1920, 3)
6220995 <class 'bytes'>
128 (1080, 1920, 3)

So a1 is around 6 MB, and a2, the pickled representation of a1, is slightly larger but roughly the same size. But when I unpickle a2, the reported size of a3 is only 128 bytes, which is obviously not right.

a3 looks fine: I can call methods, assign values to its cells, etc.

The result is the same if I replace the pickle calls with a1.dumps and np.loads, since these just call pickle.

So what exactly is the deal with the weird size?

CodePudding user response:

From the sys.getsizeof docs:

Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.

The important part is the last clause, about third-party extensions. Basically, there's no guarantee that sys.getsizeof will give you consistent or correct values for numpy objects.

CodePudding user response:

Try the following code to see the difference when using np.size(x) * x.itemsize:

import numpy as np
import sys
import pickle

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(np.size(a1)*a1.itemsize, sys.getsizeof(a1), a1.shape, type(a1))

a2 = pickle.dumps(a1)
print(sys.getsizeof(a2), type(a2))

a3 = pickle.loads(a2)
print(np.size(a3)*a3.itemsize, sys.getsizeof(a3), a3.shape, type(a3))

np.size(x) * x.itemsize gives the actual size of the array's data, and by that measure a1 and a3 are exactly the same size:

6220800 6220944 (1080, 1920, 3) <class 'numpy.ndarray'>
6220998 <class 'bytes'>
6220800 144 (1080, 1920, 3) <class 'numpy.ndarray'>

a3 is actually a "view" object (it only has a size of 144 bytes) pointing to a buffer of 6,220,800 bytes. To verify: 6220944 = 6220800 + 144. Magic? You might see more discussion here.
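
One way to check the "view" explanation is to look at the array flags. This is only a minimal sketch: it assumes NumPy counts the data buffer in getsizeof only for arrays that own their memory, and the header size varies between Python and NumPy versions, so no expected output is shown.

```python
import pickle
import sys

import numpy as np

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
a3 = pickle.loads(pickle.dumps(a1))

# owndata is True when the array allocated its own buffer,
# False when it borrows memory from another object.
print(a1.flags.owndata, a3.flags.owndata)

# For an owning array, getsizeof is roughly header + data buffer,
# so subtracting nbytes leaves the header size.
header = sys.getsizeof(a1) - a1.nbytes
print(header, sys.getsizeof(a3))

# The data itself is the same size either way.
print(a1.nbytes, a3.nbytes)
```

If the second print shows two matching small numbers, that supports the idea that a3 is just a header pointing at memory held elsewhere.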

CodePudding user response:

Make the arrays and dump them. They don't have to be big.

In [15]: a1 = np.zeros((10,20,30)); a2 = pickle.dumps(a1); a3 = pickle.loads(a2)

The nbytes values match, as do the shapes:

In [16]: a1.nbytes, a3.nbytes
Out[16]: (48000, 48000)    
In [17]: a1.shape, a3.shape
Out[17]: ((10, 20, 30), (10, 20, 30))

In [18]: type(a2)
Out[18]: bytes
In [19]: len(a2)
Out[19]: 48154

Since the getsizeof for a3 is so small, I suspect it's a view of something. That is, getsizeof does not 'see' its databuffer.

If an array owns its data, its base is None; otherwise it may be a view of another array. But apparently loads has constructed this array by referencing a bytes object:

In [20]: type(a3.base)
Out[20]: bytes    
In [21]: len(a3.base)
Out[21]: 48000

That looks like a2 minus some sort of header information (len(a2) is 48154, so about 154 bytes of overhead).
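
If you want the unpickled array to own its memory, copying it is one option. A small sketch, with the caveat that what type(a3.base) prints may depend on the NumPy version and pickle protocol, so treat that line as exploratory:

```python
import pickle
import sys

import numpy as np

a1 = np.zeros((10, 20, 30))
a3 = pickle.loads(pickle.dumps(a1))

# The object a3 borrows its memory from (a bytes object in the session above).
print(type(a3.base))

# copy() allocates a fresh buffer, so the copy has no base
# and getsizeof now includes the 48000 data bytes.
a4 = a3.copy()
print(a4.base is None, a4.flags.owndata)
print(sys.getsizeof(a3), sys.getsizeof(a4))
```

The copy costs another 48000 bytes, so it is only worth doing if you actually need an independent buffer.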

Anyway, getsizeof is not that useful when examining arrays, or lists.
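
The list case is easy to see for yourself: getsizeof counts the list's own pointer table, not the objects it refers to. A quick sketch (the exact numbers depend on the Python build, so none are shown):

```python
import sys

# Ten distinct 10,000-character strings.
items = [str(i) * 10_000 for i in range(10)]

shallow = sys.getsizeof(items)                # the list object only
deep = sum(sys.getsizeof(s) for s in items)   # the strings it points to

print(shallow, deep)
```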


Here's a simpler case, with a common test array:

In [22]: x = np.arange(12).reshape(3,4)
In [23]: sys.getsizeof(x)
Out[23]: 120
In [24]: sys.getsizeof(x.base)
Out[24]: 152
In [25]: x.base
Out[25]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

x is actually a view of the array created by arange.
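
To confirm that x shares memory with the arange result, np.shares_memory and a write through x can be used. A small sketch:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# x does not own its buffer; it borrows the 1-D array made by arange.
print(x.flags.owndata, x.base.flags.owndata)
print(np.shares_memory(x, x.base))

# Writing through the view is visible in the base array.
x[0, 0] = 99
print(x.base[0])
```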
