I have the following dilemma. I am trying to pickle and then unpickle a numpy array that represents an image.
Executing this code:
import sys
import pickle
import numpy as np

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(sys.getsizeof(a1), a1.shape)
a2 = pickle.dumps(a1)
print(sys.getsizeof(a2), type(a2))
a3 = pickle.loads(a2)
print(sys.getsizeof(a3), a3.shape)
Produces this output:
6220928 (1080, 1920, 3)
6220995 <class 'bytes'>
128 (1080, 1920, 3)
Now, a1 is thus around 6 MB, a2 is the pickle representation of a1 and is a bit longer but still roughly the same size. And then I try to unpickle a2 and I get... something obviously not right. Still, a3 looks fine: I can call methods on it, I can assign values to its cells, etc.
The result is the same if I replace the pickle calls with a1.dumps and np.loads, since these just call pickle.
So what exactly is the deal with the weird size?
CodePudding user response:
From the sys.getsizeof docs:

Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.

Emphasis mine. Basically there's no guarantee that sys.getsizeof will give you consistent or correct values for numpy objects.
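A minimal illustration of that inconsistency (a sketch; the exact byte counts are implementation-dependent and will vary by numpy version and platform):

import sys
import numpy as np

a = np.zeros((1080, 1920, 3), dtype=np.uint8)
v = a[::2]                 # a view sharing a's data buffer

print(sys.getsizeof(a))    # ~6 MB: the array header plus the owned data buffer
print(sys.getsizeof(v))    # only a small header: getsizeof does not count the shared buffer
print(a.nbytes, v.nbytes)  # 6220800 and 3110400: the data each array actually refers to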
CodePudding user response:
Try the following code to see the difference when using np.size(x) * x.itemsize:
import numpy as np
import sys
import pickle
a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(np.size(a1)*a1.itemsize, sys.getsizeof(a1), a1.shape, type(a1))
a2 = pickle.dumps(a1)
print(sys.getsizeof(a2), type(a2))
a3 = pickle.loads(a2)
print(np.size(a3)*a3.itemsize, sys.getsizeof(a3), a3.shape, type(a3))
Using np.size(x) * x.itemsize gives the actual array data size, so a1 and a3 report exactly the same size:
6220800 6220944 (1080, 1920, 3) <class 'numpy.ndarray'>
6220998 <class 'bytes'>
6220800 144 (1080, 1920, 3) <class 'numpy.ndarray'>
a3 is actually a "view" object (it is only 144 bytes itself) pointing to a data buffer of 6,220,800 bytes. To verify: 6220944 = 6220800 + 144, i.e. a1's reported size is the data buffer plus a 144-byte array header. You might see more discussion here.
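One way to check this yourself (a sketch; whether the unpickled array wraps the pickled bytes object or owns a fresh copy can depend on your numpy and pickle versions):

import pickle
import numpy as np

a1 = np.zeros((1080, 1920, 3), dtype=np.uint8)
a3 = pickle.loads(pickle.dumps(a1))

print(a1.flags['OWNDATA'], a1.base)        # True None: a1 owns its buffer
print(a3.flags['OWNDATA'], type(a3.base))  # here: False <class 'bytes'>, a3 wraps the unpickled bytes
print(a1.nbytes, a3.nbytes)                # 6220800 6220800: the underlying data is the same size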
CodePudding user response:
Make the arrays and the dump. The array doesn't have to be big.
In [15]: a1 = np.zeros((10,20,30)); a2 = pickle.dumps(a1); a3 = pickle.loads(a2)
nbytes match, as does shape:
In [16]: a1.nbytes, a3.nbytes
Out[16]: (48000, 48000)
In [17]: a1.shape, a3.shape
Out[17]: ((10, 20, 30), (10, 20, 30))
In [18]: type(a2)
Out[18]: bytes
In [19]: len(a2)
Out[19]: 48154
Since the getsizeof for a3 is so small, I suspect it's a view of something. That is, getsizeof does not 'see' its data buffer. If an array has its own data, its base will be None. Or it may be a view of another array. But apparently loads has constructed this array by referencing a bytes object:
In [20]: type(a3.base)
Out[20]: bytes
In [21]: len(a3.base)
Out[21]: 48000
That looks like a2 without some sort of information header. Anyway, getsizeof is not that useful when examining arrays - or lists.
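The same caveat applies to lists: getsizeof counts only the list object and its pointer array, not the elements it references. A quick sketch:

import sys

big = [b'x' * 1_000_000 for _ in range(3)]    # three ~1 MB bytes objects
print(sys.getsizeof(big))                     # small: just the list header and 3 pointers
print(sum(sys.getsizeof(b) for b in big))     # ~3 MB when counted element by element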
Here's a simpler case, with a common test array:
In [22]: x = np.arange(12).reshape(3,4)
In [23]: sys.getsizeof(x)
Out[23]: 120
In [24]: sys.getsizeof(x.base)
Out[24]: 152
In [25]: x.base
Out[25]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
x is actually a view of the array created by arange.
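So nbytes is the more reliable way to reason about an array's memory, possibly combined with following base up to whatever actually owns the buffer. A rough sketch (the helper owner is hypothetical, not part of numpy):

import numpy as np

def owner(arr):
    # Follow .base until reaching the object that actually owns the buffer.
    while isinstance(arr, np.ndarray) and arr.base is not None:
        arr = arr.base
    return arr

x = np.arange(12).reshape(3, 4)
print(x.nbytes)            # 96 on a typical 64-bit Linux build (12 * 8-byte ints)
print(owner(x))            # the flat arange array that owns the data
print(x.base is owner(x))  # True here: only one level of view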