Numpy names and assignment handling

Time:06-23

I was coding with numpy and have a question about how things are actually implemented in numpy: how do assignment and naming behave for numpy arrays compared to Python lists? Here is a code snippet that illustrates the issue.

import numpy as np


a = np.array([1, 2, 3, 4])
b = a[1:3]
a[1] = 5

After this, a is [1, 5, 3, 4] and b is [5, 3].

On the other hand, if you consult Ned Batchelder's blog post about Python names and values, you could conclude that Python assignments just bind names to values. For instance, if we choose a to be a list with the same elements and b the same slice, b is then a name for the value held by a[1:3], and a[1:3] is itself another name for the same value. Changing a[1] would then just make the name a[1] refer to another value. However, b[0] would still refer to the old value 2. So, basically, there is no such thing in Python as a variable pointing to another variable, that is, a name that points to another name. The blog post also confirms that Python assignments never copy data.

How, then, can the different behavior of numpy be explained?

CodePudding user response:

Hmm I never knew that before.

I googled it and this link seems to cover it.

https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html

Basically, basic numpy slicing (like a[1:3]) returns a view, not a new copy; indexing with an integer or boolean array returns a copy. Slicing a Python list, on the other hand, always returns a new list (a shallow copy of the original).
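
To make the view/copy difference concrete, here is a minimal sketch. The names view, fancy, lst and lst_slice are mine, not from the question; np.shares_memory is the numpy helper that reports whether two arrays overlap in memory.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

view = a[1:3]        # basic slicing: a new array object on a's buffer
fancy = a[[1, 2]]    # integer-array indexing: an independent copy

lst = [1, 2, 3, 4]
lst_slice = lst[1:3]  # list slicing: a new list holding the same references

# Same assignment on both containers.
a[1] = 5
lst[1] = 5

print(np.shares_memory(a, view))    # True
print(np.shares_memory(a, fancy))   # False
print(view, fancy, lst_slice)       # [5 3] [2 3] [2, 3]
```

Only the view sees the change; the fancy-indexed copy and the list slice keep the old value 2.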

CodePudding user response:

A list contains references; here b has references to some of the same objects that are contained in a:

In [54]: a= [1,2,3,4]; b=a[1:3]
In [55]: b
Out[55]: [2, 3]
In [56]: a[1] = 5;a,b
Out[56]: ([1, 5, 3, 4], [2, 3])

But the a[1] assignment changes a reference in a. It does not change any references in b.

Using integers here is convenient, but not totally diagnostic, since CPython caches small integers, so equal values share the same id. It might be better if I used a custom class. But integers are enough to illustrate the point.
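
For what it's worth, here is a rough sketch of the custom-class version. Box is a hypothetical class made up for this illustration; it just gives the list elements an identity that can be checked with "is" and a value that can be mutated in place.

```python
class Box:
    """Hypothetical mutable element, used only for this illustration."""
    def __init__(self, value):
        self.value = value

a = [Box(1), Box(2), Box(3), Box(4)]
b = a[1:3]                    # new list, but it holds the same Box objects

shared_before = b[0] is a[1]  # True: both slots reference one object

a[1].value = 20               # mutate the shared object in place
seen_through_b = b[0].value   # 20: the mutation is visible via b

a[1] = Box(5)                 # rebind the slot in a to a new object
shared_after = b[0] is a[1]   # False: b still references the old Box
still_old = b[0].value        # 20

print(shared_before, seen_through_b, shared_after, still_old)
# True 20 False 20
```

So mutating an element is visible through both lists, while assigning to a[1] only rebinds one slot of a.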

But arrays don't contain references (unless they are object dtype). They actually 'store' values:

In [64]: a= np.array([1,2,3,4]); b=a[1:3]
In [65]: b
Out[65]: array([2, 3])

One way of displaying the key attributes of an array is:

In [66]: a.__array_interface__
Out[66]: 
{'data': (2457987642000, False),
 'strides': None,
 'descr': [('', '<i4')],
 'typestr': '<i4',
 'shape': (4,),
 'version': 3}

In [67]: b.__array_interface__
Out[67]: 
{'data': (2457987642004, False),
 'strides': None,
 'descr': [('', '<i4')],
 'typestr': '<i4',
 'shape': (2,),
 'version': 3}

The data pointer for b is almost the same as a's, just 4 bytes higher. That's one int32 element. b is a view: a new array, with its own shape, but sharing the data buffer.

a owns its data, so its base is None (nothing is displayed), but a is b's base:

In [68]: a.base    
In [69]: b.base
Out[69]: array([1, 2, 3, 4])

tobytes gives an idea of what the data buffer contents look like:

In [70]: a.tobytes()
Out[70]: b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00'
In [71]: b.tobytes()
Out[71]: b'\x02\x00\x00\x00\x03\x00\x00\x00'

b's bytes are a subset of a's.

Modifying a modifies their shared data buffer (see the new \x05 in both):

In [72]: a[1]=5
In [73]: a.tobytes()
Out[73]: b'\x01\x00\x00\x00\x05\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00'
In [74]: b.tobytes()
Out[74]: b'\x05\x00\x00\x00\x03\x00\x00\x00'

The distinction between view and copy is important. I'm too familiar with Python and numpy to say whether it's more or less intuitive than list references.
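
If you want b to be independent of a, ask for a copy explicitly. A small sketch (b_view and b_copy are just illustrative names):

```python
import numpy as np

a = np.array([1, 2, 3, 4])

b_view = a[1:3]          # shares a's data buffer
b_copy = a[1:3].copy()   # owns a separate buffer

a[1] = 5                 # shows up in the view only
b_view[1] = 30           # writing through the view changes a as well

print(a)        # [ 1  5 30  4]
print(b_view)   # [ 5 30]
print(b_copy)   # [2 3]
print(b_view.base is a, b_copy.base is None)   # True True
```

The base attribute is a quick way to check which case you have: a view points back at the array that owns the data, a copy has base None.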

unboxing

The different data storage means that accessing an element of a list is actually faster than accessing an element of an array. The same goes for the a[1]=5 assignment. For lists, element access just returns the reference that's stored in the list. No translation or new object creation is needed. But for an array, a[1] actually has to create a new object with a copy of the value.
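
A quick way to see this object creation is an identity check. This is only a sketch; the values are arbitrary, and x and y are names I made up to hold two separate accesses of the same array element.

```python
import numpy as np

lst = [1000, 2000, 3000]
arr = np.array(lst)

# A list hands back the very object it stores, every time.
list_same = lst[1] is lst[1]

# An array builds a fresh numpy scalar from the raw bytes on each access.
x = arr[1]
y = arr[1]
array_same = x is y

print(list_same, array_same)        # True False
print(isinstance(x, np.integer))    # True: a numpy scalar, not a Python int
print(x == y)                       # True: equal values, distinct objects
```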

That means that simply substituting arrays for lists in your code might actually slow it down. You have to learn the numpy basics, and use them as intended, to get actual performance gains.
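
Here is a rough way to check this on your own machine. The function names (loop_list, loop_array, vectorized) and the size n are arbitrary choices for this sketch, and the timings will vary by machine and numpy version, so I'm not quoting any numbers.

```python
import timeit
import numpy as np

n = 100_000
lst = list(range(n))
arr = np.arange(n, dtype=np.int64)   # explicit int64 so the sum cannot overflow

def loop_list():
    # Python-level loop over list elements (stored references).
    return sum(x * 2 for x in lst)

def loop_array():
    # Same loop over an array: each x is a newly created numpy scalar.
    return sum(x * 2 for x in arr)

def vectorized():
    # Whole-array operation: the loop runs inside numpy's compiled code.
    return int((arr * 2).sum())

for fn in (loop_list, loop_array, vectorized):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__:<12} {seconds:.4f} s")
```

All three compute the same total; the point is to compare how long each style takes, not to replace a list with an array and keep the same loop.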
