I have a list of 2D arrays of different shapes: lst (see the example below). I would like to concatenate them into a 3D array of shape (len(lst), max_0, max_1), where max_0 is the maximum .shape[0] of all the arrays and max_1 is the maximum .shape[1]. If an array's shape is smaller than (max_0, max_1), then this array should start at the top left corner, and all the missing values should be filled with some value of choice (e.g. 0 or np.nan).
Example:
lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]
# max_0 == 3
# max_1 == 3
result = np.array([[[1, 2, 0],
                    [3, 4, 0],
                    [0, 0, 0]],
                   [[1, 2, 3],
                    [4, 5, 6],
                    [0, 0, 0]],
                   [[1, 2, 0],
                    [3, 4, 0],
                    [5, 6, 0]]])
Notes:
np.concatenate requires the shapes of all arrays to match.
A similar question exists, but it only covers 1D arrays.
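For illustration, stacking the example arrays directly fails because the shapes differ (a minimal check; np.stack is used here since a new leading axis is needed):

import numpy as np

a = np.zeros((2, 2))
b = np.zeros((2, 3))
try:
    np.stack([a, b])            # needs identical shapes
except ValueError as err:
    print(err)                  # complains that the input shapes do not match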
A sub-problem:
As a special case, you may assume that .shape[1] is the same for all arrays (and therefore equal to max_1). For example:
lst = [np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2, 3]]),
       np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])]
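For this special case only axis 0 needs padding, so one straightforward approach (a rough sketch of my own, using the hypothetical name fill_rows) is to allocate the target array and copy each array's rows into it:

import numpy as np

def fill_rows(lst, fill=0):
    # Assumes every array has the same .shape[1]; only axis 0 is padded.
    max_0 = max(a.shape[0] for a in lst)
    out = np.full((len(lst), max_0, lst[0].shape[1]), fill)
    for i, a in enumerate(lst):
        out[i, :a.shape[0]] = a
    return out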
Bonus (a hard question):
Can this be applied to more dimensions? E.g., while concatenating 3D arrays into a 4D array, all 3D arrays (rectangular parallelepipeds) would start at the same corner, and if their shapes are too small, the missing values (up to the edges) would be filled with 0 or np.nan.
How to do this at all? How to do this efficiently (potentially for thousands of arrays, each with thousands of elements)? Maybe by creating an array of the final shape and filling it in some vectorized way? Or by converting all arrays into DataFrames and concatenating them with pd.concat? Maybe SciPy has some helpful functions for this?
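For reference, the pandas route can be made to work, although via reindex rather than pd.concat directly; a rough sketch (my own, using the hypothetical name pandas_fill, and probably slower than a pure NumPy fill):

import numpy as np
import pandas as pd

def pandas_fill(lst, fill=0):
    max_0 = max(a.shape[0] for a in lst)
    max_1 = max(a.shape[1] for a in lst)
    # reindex pads the missing rows/columns with fill_value
    dfs = [pd.DataFrame(a).reindex(index=range(max_0), columns=range(max_1),
                                   fill_value=fill)
           for a in lst]
    return np.stack([df.to_numpy() for df in dfs])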
CodePudding user response:
You can use numpy.pad to do this.
import numpy as np
lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]

maxx = max(x.shape[0] for x in lst)
maxy = max(x.shape[1] for x in lst)
lst = [np.pad(k, [(0, maxx - k.shape[0]), (0, maxy - k.shape[1])]) for k in lst]
print(lst)
Output:
[array([[1, 2, 0],
        [3, 4, 0],
        [0, 0, 0]]),
 array([[1, 2, 3],
        [4, 5, 6],
        [0, 0, 0]]),
 array([[1, 2, 0],
        [3, 4, 0],
        [5, 6, 0]])]
This process will work with any number of dimensions. You'd have to use a loop over the axes instead of the separate maxx/maxy computations.
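A sketch of that generalization (my own adaptation of the above, with the hypothetical name pad_fill_nd, not part of the original answer):

import numpy as np

def pad_fill_nd(lst, fill=0):
    # Maximum extent along each axis (assumes all arrays have the same ndim).
    max_shape = np.max([a.shape for a in lst], axis=0)
    padded = [np.pad(a,
                     [(0, m - s) for s, m in zip(a.shape, max_shape)],
                     constant_values=fill)
              for a in lst]
    return np.array(padded)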
CodePudding user response:
A solution for general dimensions: non-vectorized, but it avoids the slow np.pad calls (~20x faster, benchmarked with the example lst * 10000).
import numpy as np

def fill_axis(lst):
    shapes = np.array([arr.shape for arr in lst])
    # Result shape: (number of arrays, max extent along each axis), filled with 0.
    res = np.zeros((len(shapes), *shapes.max(0)), int)
    for x, arr in enumerate(lst):
        # Select the x-th slot, restricted to this array's extent along each axis.
        slices = [x]
        slices += (slice(None, shape) for shape in arr.shape)
        res[tuple(slices)] = arr
    return res
lst = lst * 10000

%timeit fill_axis(lst)
# 77.3 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# solution by @TimRoberts https://stackoverflow.com/a/73536898/14277722
def pad_fill(lst):
    maxx = max(x.shape[0] for x in lst)
    maxy = max(x.shape[1] for x in lst)
    res = [np.pad(k, [(0, maxx - k.shape[0]), (0, maxy - k.shape[1])]) for k in lst]
    return np.array(res)

%timeit pad_fill(lst)
# 1.82 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

np.testing.assert_equal(pad_fill(lst), fill_axis(lst))
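If np.nan is wanted as the fill value instead of 0, the result has to be a float array; a small variant of fill_axis along those lines (a sketch, using the hypothetical name fill_axis_nan):

def fill_axis_nan(lst):
    shapes = np.array([arr.shape for arr in lst])
    # np.full with np.nan yields a float array; the unfilled gaps stay NaN.
    res = np.full((len(shapes), *shapes.max(0)), np.nan)
    for x, arr in enumerate(lst):
        slices = [x]
        slices += (slice(None, shape) for shape in arr.shape)
        res[tuple(slices)] = arr
    return res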
An example with stacked 4D arrays (giving a 5D result):
lst_4D = [np.arange(1*2*3*3).reshape(1,2,3,3),
          np.arange(2*3*2*2).reshape(2,3,2,2)]
fill_axis(lst_4D)
Output:
array([[[[[ 0, 1, 2],
          [ 3, 4, 5],
          [ 6, 7, 8]],
         [[ 9, 10, 11],
          [12, 13, 14],
          [15, 16, 17]],
         [[ 0, 0, 0],
          [ 0, 0, 0],
          [ 0, 0, 0]]],
        [[[ 0, 0, 0],
          [ 0, 0, 0],
          [ 0, 0, 0]],
         [[ 0, 0, 0],
          [ 0, 0, 0],
          [ 0, 0, 0]],
         [[ 0, 0, 0],
          [ 0, 0, 0],
          [ 0, 0, 0]]]],
       [[[[ 0, 1, 0],
          [ 2, 3, 0],
          [ 0, 0, 0]],
         [[ 4, 5, 0],
          [ 6, 7, 0],
          [ 0, 0, 0]],
         [[ 8, 9, 0],
          [10, 11, 0],
          [ 0, 0, 0]]],
        [[[12, 13, 0],
          [14, 15, 0],
          [ 0, 0, 0]],
         [[16, 17, 0],
          [18, 19, 0],
          [ 0, 0, 0]],
         [[20, 21, 0],
          [22, 23, 0],
          [ 0, 0, 0]]]]])
An adaptation of @divakar's solution for 2D arrays. ~2x faster with larger lists than the more general solution in my benchmarks, but harder to generalize to more dimensions.
def einsum_fill(lst):
    shapes = np.array([arr.shape for arr in lst])
    # Per-axis masks: True where the row / column index lies within the array's extent.
    a = np.arange(shapes[:, 0].max()) < shapes[:, [0]]
    b = np.arange(shapes[:, 1].max()) < shapes[:, [1]]
    # Outer product of the two masks marks the filled region of each output slice.
    mask = np.einsum('ij,ik->ijk', a, b)
    res = np.zeros_like(mask, int)
    res[mask] = np.concatenate([arr.ravel() for arr in lst])
    return res
%timeit einsum_fill(lst)
# 46.7 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.testing.assert_equal(einsum_fill(lst), fill_axis(lst))
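If the mask idea ever needs to go beyond 2D, the per-axis masks can be combined by broadcasting instead of einsum; a rough sketch of such a generalization (my own adaptation, with the hypothetical name mask_fill_nd, not part of the answer above):

def mask_fill_nd(lst, fill=0):
    shapes = np.array([a.shape for a in lst])
    max_shape = shapes.max(0)
    n = len(lst)
    # Combine one boolean "within bounds" mask per axis via broadcasting.
    mask = np.ones((n, *max_shape), bool)
    for ax, m in enumerate(max_shape):
        within = np.arange(m) < shapes[:, [ax]]        # shape (n, m)
        new_shape = [n] + [m if d == ax else 1 for d in range(len(max_shape))]
        mask &= within.reshape(new_shape)
    res = np.full(mask.shape, fill)
    res[mask] = np.concatenate([a.ravel() for a in lst])
    return res

np.testing.assert_equal(mask_fill_nd(lst), fill_axis(lst))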