Concatenate many nd-arrays of different shapes (filling values until the edges)

Time:08-30

I have a list of 2d arrays of different shapes: lst (see the example below).

I would like to concatenate them into a 3d array of shape (len(lst), max_0, max_1), where max_0 is the maximum .shape[0] of all the arrays, and max_1 is the maximum .shape[1].

If an array's shape is smaller than (max_0, max_1), then this array should start at the top left corner, and all the missing values should be filled with some value of choice (e.g. 0 or np.nan).

Example:

lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]

# max_0 == 3
# max_1 == 3

result = np.array([[[1, 2, 0],
                    [3, 4, 0],
                    [0, 0, 0]],

                   [[1, 2, 3],
                    [4, 5, 6],
                    [0, 0, 0]],

                   [[1, 2, 0],
                    [3, 4, 0],
                    [5, 6, 0]]])

Notes:

np.concatenate requires the shapes of all arrays to match.

This question is similar - but it is only for 1d arrays.


A sub-problem:

As a special case, you may assume that .shape[1] == max_1 is the same for all arrays. For example:

lst = [np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2, 3]]),
       np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])]

Bonus (a hard question):

Can this be applied to more dimensions? E.g., while concatenating 3d arrays into a 4d array, all 3d arrays (rectangular parallelepipeds) will start at the same corner, and if their shapes are too small - the missing values (until the edges) will be filled with 0 or np.nan.

 


How to do this at all? How to do this efficiently (potentially for thousands of arrays, each with thousands of elements)?

  • Maybe creating an array of the final shape and filling it somehow in a vectorized way?

  • Or converting all arrays into dataframes and concatenating them with pd.concat?

  • Maybe SciPy has some helpful functions for this?

CodePudding user response:

You can use numpy.pad to do this.

import numpy as np

lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]

maxx = max(x.shape[0] for x in lst)
maxy = max(x.shape[1] for x in lst)

lst = [np.pad( k, [(0,maxx-k.shape[0]),(0,maxy-k.shape[1])] ) for k in lst]
print(lst)

Output:

[array([[1, 2, 0],
       [3, 4, 0],
       [0, 0, 0]]),
 array([[1, 2, 3],
       [4, 5, 6],
       [0, 0, 0]]),
 array([[1, 2, 0],
       [3, 4, 0],
       [5, 6, 0]])]

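This approach works with any number of dimensions; you would compute the per-axis maxima in a loop instead of the separate maxx/maxy lines. Calling np.array(lst) (or np.stack(lst)) on the padded list then gives the single 3d array asked for in the question.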
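As a rough sketch of that generalization, the per-axis maxima can be computed in a comprehension and passed to np.pad for every axis. The function name pad_fill_nd and its fill_value parameter are made up for this illustration, and the sketch assumes all arrays have the same number of dimensions.

```python
import numpy as np

def pad_fill_nd(arrays, fill_value=0):
    """Pad each array at the end of every axis up to the common maximum shape, then stack."""
    # Assumption: all arrays have the same ndim.
    max_shape = [max(a.shape[d] for a in arrays) for d in range(arrays[0].ndim)]
    padded = [
        np.pad(a,
               [(0, m - s) for s, m in zip(a.shape, max_shape)],  # pad only after the data
               mode="constant",
               constant_values=fill_value)
        for a in arrays
    ]
    # Stack along a new leading axis: shape (len(arrays), *max_shape).
    return np.stack(padded)

lst = [np.array([[1, 2], [3, 4]]),
       np.array([[1, 2, 3], [4, 5, 6]]),
       np.array([[1, 2], [3, 4], [5, 6]])]

result = pad_fill_nd(lst)
print(result.shape)  # (3, 3, 3)
```

To fill with np.nan, the integer arrays would first need converting to float (for example with a.astype(float)), since an integer array cannot hold NaN.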

CodePudding user response:

A solution for arbitrary dimensions. It is not vectorized, but it avoids the slow np.pad calls (roughly 20x faster, benchmarked with the example lst * 10000).

import numpy as np

def fill_axis(lst):
    shapes = np.array([arr.shape for arr in lst])
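    # Allocate the output: one slot per array, sized to the per-axis maximum shape.
    res = np.zeros((len(shapes),) + (*shapes.max(0),), int)
    for x, arr in enumerate(lst):
        # Index slot x, then the leading arr.shape[d] entries along every axis d.
        slices = [x]
        slices += (slice(None, shape) for shape in arr.shape)
        res[tuple(slices)] = arr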
    return res

lst = lst * 10000

%timeit fill_axis(lst)
# 77.3 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# solution by @TimRoberts https://stackoverflow.com/a/73536898/14277722
def pad_fill(lst):
    maxx = max(x.shape[0] for x in lst)
    maxy = max(x.shape[1] for x in lst)

    res = [np.pad( k, [(0,maxx-k.shape[0]),(0,maxy-k.shape[1])] ) for k in lst]
    return np.array(res)

%timeit pad_fill(lst)
# 1.82 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

np.testing.assert_equal(pad_fill(lst), fill_axis(lst))

An example with stacked 4D arrays (the result is 5D):

lst_4D = [np.arange(1*2*3*3).reshape(1,2,3,3),
          np.arange(2*3*2*2).reshape(2,3,2,2)]

fill_axis(lst_4D)

Output

array([[[[[ 0,  1,  2],
          [ 3,  4,  5],
          [ 6,  7,  8]],

         [[ 9, 10, 11],
          [12, 13, 14],
          [15, 16, 17]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]]],


        [[[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]],

         [[ 0,  0,  0],
          [ 0,  0,  0],
          [ 0,  0,  0]]]],



       [[[[ 0,  1,  0],
          [ 2,  3,  0],
          [ 0,  0,  0]],

         [[ 4,  5,  0],
          [ 6,  7,  0],
          [ 0,  0,  0]],

         [[ 8,  9,  0],
          [10, 11,  0],
          [ 0,  0,  0]]],


        [[[12, 13,  0],
          [14, 15,  0],
          [ 0,  0,  0]],

         [[16, 17,  0],
          [18, 19,  0],
          [ 0,  0,  0]],

         [[20, 21,  0],
          [22, 23,  0],
          [ 0,  0,  0]]]]])

An adaptation of @divakar's solution for 2D arrays. ~2x faster with larger lists than the more general solution in my benchmarks, but harder to generalize to more dimensions.

def einsum_fill(lst):
    shapes = np.array([arr.shape for arr in lst])
    a = np.arange(shapes[:,0].max()) < shapes[:,[0]]
    b = np.arange(shapes[:,1].max()) < shapes[:,[1]]
    mask = np.einsum('ij,ik->ijk', a, b)
    res = np.zeros_like(mask, int)
    res[mask] = np.concatenate([arr.ravel() for arr in lst])
    return res
                 
%timeit einsum_fill(lst)
# 46.7 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

np.testing.assert_equal(einsum_fill(lst), fill_axis(lst))
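To see why the mask-based assignment in einsum_fill works, it may help to build the same kind of mask with plain broadcasting. This is only an illustrative sketch (the names rows and cols are made up here): for each array, the mask is True exactly in the top-left block that the array occupies, and boolean assignment fills those positions in row-major order, which matches the order of the concatenated arr.ravel() values.

```python
import numpy as np

# Shapes of the three example arrays from the question.
shapes = np.array([[2, 2], [2, 3], [3, 2]])

# rows[i, j] is True when row j exists in array i; cols[i, k] likewise for columns.
rows = np.arange(shapes[:, 0].max()) < shapes[:, [0]]  # shape (n, max_0)
cols = np.arange(shapes[:, 1].max()) < shapes[:, [1]]  # shape (n, max_1)

# Outer "and" per array: True where both the row and the column exist.
mask = rows[:, :, None] & cols[:, None, :]             # shape (n, max_0, max_1)

print(mask[0].astype(int))
# [[1 1 0]
#  [1 1 0]
#  [0 0 0]]
```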