Why does np.split create a copy when passing into an existing array?

Time:02-04

I am new to Python and trying to understand the behaviour of views and copies, so apologies if this is an obvious question! In the example below, I use np.split() to split an array x. When I pass np.split into a new object (either x1, a list of 3 1D arrays or x2, x3, x4, three separate 1D arrays) the objects are views of x, as expected:

import numpy as np

x = np.arange(1, 10)                        # create an array of length 9

x1 = np.split(x, (3,6))                     # split array into 1 new object (list of arrays)
print(x1[0].base)                           # each array in list is a view

x2, x3, x4 = np.split(x, (3, 6))            # split array into 3 new objects
print(x2.base)                              # these objects are views

However if I create an empty (3,3) array x5 and pass np.split into each row of this array (I know this is a silly thing to do, I'm just trying to figure out how splitting works), a copy is created:

x5 = np.empty((3,3), dtype = np.int32)      # create an uninitialised array
x5[0], x5[1], x5[2] = np.split(x, (3, 6))   # split x into each row of x5
print(x5.base)                              # this object is a COPY

I thought perhaps that the slicing of x5 was causing a copy to be made, but if I slice x2, x3, x4 they are still views:

x2[:], x3[:], x4[:] = np.split(x, (3, 6))   # split array into 3 existing objects using indexing
print(x2.base)                              # these objects are views

I haven't managed to find an explanation for this in any explanations of views and copies or np.split - what am I missing?

CodePudding user response:

The behavior you're seeing is only loosely related to split. A numpy array has a single "rectangular block" of data behind it; it can't refer to multiple independent allocations. That block might be viewed in unusual ways (by slicing, striding, transposing, etc.), so the individual entries are not necessarily contiguous, but they're all backed by one, and only one, contiguous bulk allocation.

That allocation can be unique to the array, or a view of some other array, but a given array object doesn't change between owning storage and viewing storage, it's one or the other. The name the array is bound to can be reassigned, but all names can be reassigned (x = 1 doesn't make x always that specific int, or even always some int; x = ANYTHING later on would rebind it to a completely new object of arbitrary type, ignoring what it used to be).
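As a minimal sketch of the owning/viewing distinction (the names v and old_v are made up for illustration), the flags.owndata and base attributes show which kind of array you have, and that rebinding a name leaves the old array object untouched:

```python
import numpy as np

x = np.arange(1, 10)      # owns its buffer
v = x[:3]                 # basic slice: a view into x's buffer

print(x.flags.owndata, x.base is None)   # True True
print(v.flags.owndata, v.base is x)      # False True

# Rebinding the name does not alter the array object it used to refer to.
old_v = v
v = np.array([0, 0, 0])   # v now names a brand-new owning array
print(old_v.base is x)    # True
print(v.base is None)     # True
```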

Knowing this makes it clear that it's impossible to reassign part of an array to be a view of some other array. So to explain your various observations:

x2, x3, x4 = np.split(x, (3, 6))            # split array into 3 new objects
print(x2.base)                              # these objects are views

As you noted, this is the expected behavior. np.split returned a list containing three new arrays of view type, each backed by part of x.

x5 = np.empty((3,3), dtype = np.int32)      # create an uninitialised array
x5[0], x5[1], x5[2] = np.split(x, (3, 6))   # split x into each row of x5
print(x5.base)                              # this object is a COPY

Line 1 here creates a new array of owning type. Even if it were possible to replace the underlying storage buffer with a different one (either to own a new buffer, or to view some other array's buffer), and I won't swear this is impossible (I haven't explored numpy enough to say for sure), it's definitely impossible to make it a mix of owning and viewing. Consider the simpler code:

x5[0] = x[:3]

If that code worked, and didn't copy the data from the view of x into x5[0], then somehow x5[0] would need to be a view of x while the rest of x5's data remained owned. What would happen to the three elements in the buffer that backed x5[0]? You might think "But I'm replacing all of them at once", but you're actually not. x5[0], x5[1], x5[2] = np.split(x, (3, 6)) is effectively equivalent to __unnamedtemp = [x[:3], x[3:6], x[6:]], followed by x5[0] = __unnamedtemp[0], then x5[1] = __unnamedtemp[1], then x5[2] = __unnamedtemp[2]. The unpacking work must be done piecemeal (unpacking is a general feature of Python that numpy can't hook), so even if, in theory, the end result could leave x5 viewing x, in practice it can't, even if numpy wanted to, because the intermediate stages would be an illegal state.
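Here is a rough sketch of that desugared form, with np.shares_memory used to check that x5 ends up with copied values rather than a window onto x. The name parts stands in for the unnamed temporary and is only illustrative:

```python
import numpy as np

x = np.arange(1, 10)
x5 = np.empty((3, 3), dtype=x.dtype)

parts = np.split(x, (3, 6))   # list of three views of x
x5[0] = parts[0]              # each line is a separate x5.__setitem__ call,
x5[1] = parts[1]              # which copies values into x5's own buffer
x5[2] = parts[2]

print(np.shares_memory(x5, x))   # False

x[0] = 100                       # write through to x
print(x5[0, 0])                  # 1   (x5 kept its copied value)
print(parts[0][0])               # 100 (the view sees the change)
```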

By comparison,

x2[:], x3[:], x4[:] = np.split(x, (3, 6))   # split array into 3 existing objects using indexing
print(x2.base)

"works" only because x2, x3 and x4 were already views of x. But copies are still being made; x2 was already a view of x[:3], and you just told numpy to copy the contents of x[:3] to x2[:]. Under-the-hood, once x2.__setitem__ is invoked with an argument of the complete slice and another numpy view, and numpy has enough information and control, it might notice that the raw memory addresses are the same and avoid the copy, but I wouldn't be at all surprised if it just blindly copied the data from every address in the view to itself.

You'd be able to see that it is not making new views if you hadn't reused x2 through x4:

x2, x3, x4 = np.arange(1, 4), np.arange(1, 4), np.arange(1, 4)  # Three owning arrays
x2[:], x3[:], x4[:] = np.split(x, (3, 6))  # Makes three views, then copies from views to owned buffers
print(x2.base)  # Does not have a base, because it's still not a view
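If you'd rather not rely on reading base, np.shares_memory gives a more direct check. This is just a sketch: a is a view from split and owned is a hypothetical separately allocated array, and slice assignment leaves each of them attached to whatever buffer it already had:

```python
import numpy as np

x = np.arange(1, 10)
a, b, c = np.split(x, (3, 6))        # views of x
owned = np.zeros(3, dtype=x.dtype)   # separately allocated array

a[:] = np.split(x, (3, 6))[0]        # copies x[:3] onto itself
owned[:] = np.split(x, (3, 6))[0]    # copies x[:3] into owned's buffer

print(np.shares_memory(a, x))        # True:  a was, and still is, a view
print(np.shares_memory(owned, x))    # False: owned only received the values
```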

The short version here is that:

  1. Assignment of anything to plain names (no indexing/slicing) will rebind that name, without copying data (ignoring whatever used to be bound to that name); if you assigned a viewing array, it's now a view, if you assigned an owning array, it's now an owning array.
  2. Assignment of either views or owning arrays to an index or slice of an existing array (view or owning) will copy the data from one array to the other (copying into the viewed buffer if applicable)

This is how it works for all types in Python. The only thing unusual about the numpy case is that you can make views; the built-in Python sequences (save the odd memoryview, which is slightly numpy-like) don't have a concept of views, so slices on the right-hand side of an equals sign are shallow copies, but assignment to slices on the left-hand side still copies (whether the right-hand side is a view or a copy).
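The same two rules can be seen with plain lists, no numpy involved. A small sketch (the variable names are arbitrary):

```python
a = [1, 2, 3, 4, 5, 6]

head = a[:3]          # slicing a list makes a new list (shallow copy), not a view
head[0] = 99
print(a[0])           # 1, the original is unaffected

b = [0, 0, 0]
alias = b             # second name for the same list object
b[:] = a[:3]          # slice assignment: copies values into the existing object
print(alias)          # [1, 2, 3]
print(b is alias)     # True

b = a[:3]             # plain-name assignment: rebinds b to a different object
print(b is alias)     # False
```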

CodePudding user response:

In an IPython session, create your x and the splits:

In [23]: x=np.arange(1,10);x
Out[23]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [24]: x2,x3,x4 = np.split(x,(3,6))

Examine one:

In [25]: x2
Out[25]: array([1, 2, 3])    
In [26]: x2.base
Out[26]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])

I find the following display of array information helpful. The 'data' field "points" to the underlying databuffer. It can't be used in code, but the number is a useful identifier of where the values are actually stored.

In [28]: x.__array_interface__
Out[28]: 
{'data': (2389644026224, False),
 'strides': None,
 'descr': [('', '<i4')],
 'typestr': '<i4',
 'shape': (9,),
 'version': 3}

x2 has the same 'data' value:

In [29]: x2.__array_interface__['data']
Out[29]: (2389644026224, False)

x3 and x4 will have slightly different values, pointing to bytes further in the buffer (e.g. 2389644026236 for x3, 3*4 further on).
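The offsets can be checked without hard-coding addresses. In this sketch, addr is a small helper I'm introducing for illustration; it uses x.itemsize rather than assuming 4-byte integers, since the default integer size depends on platform and numpy version:

```python
import numpy as np

x = np.arange(1, 10)
x2, x3, x4 = np.split(x, (3, 6))

def addr(arr):
    # First element of the 'data' tuple is the buffer address as an integer.
    return arr.__array_interface__['data'][0]

offsets = [addr(part) - addr(x) for part in (x2, x3, x4)]
print(offsets == [0, 3 * x.itemsize, 6 * x.itemsize])   # True
```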

x5 is a new array with its own databuffer:

In [30]: x5 = np.empty((3,3),int);x5
Out[30]: 
array([[4128860, 6029375, 3801155],
       [5570652, 6619251, 7536754],
       [7340124, 7667809,     108]])    
In [31]: x5.__array_interface__['data']
Out[31]: (2389645593712, False)

Assigning values to x5 copies those values, but doesn't change its databuffer location:

In [32]: x5[:] = x2,x3,x4    
In [33]: x5
Out[33]: 
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])    
In [34]: x5.__array_interface__['data']
Out[34]: (2389645593712, False)

x and x5 remain their own 'bases':

In [35]: x.base, x5.base
Out[35]: (None, None)

We could index another 3 values from x and assign them to the x2 view:

In [38]: x[-2:-5:-1]
Out[38]: array([8, 7, 6])    
In [39]: x2[:]=x[-2:-5:-1]      # not x2=x[...]

x2 will be changed, but so will its base:

In [40]: x2.base
Out[40]: array([8, 7, 6, 4, 5, 6, 7, 8, 9])    
In [41]: x
Out[41]: array([8, 7, 6, 4, 5, 6, 7, 8, 9])

The assignment in [39] is just as "expensive" (time-wise) as copying 3 values to x5. It's still copying. But the difference lies in where the values are copied to.
