I have a list of 2D arrays of different shapes: lst (see the example below). I would like to concatenate them into a 3D array of shape (len(lst), max_0, max_1), where max_0 is the maximum .shape[0] of all the arrays and max_1 is the maximum .shape[1]. If an array's shape is smaller than (max_0, max_1), then this array should start at the top left corner, and all the missing values should be filled with some value of choice (e.g. 0 or np.nan).
Example:
lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]
# max_0 == 3
# max_1 == 3
result = np.array([[[1, 2, 0],
                    [3, 4, 0],
                    [0, 0, 0]],
                   [[1, 2, 3],
                    [4, 5, 6],
                    [0, 0, 0]],
                   [[1, 2, 0],
                    [3, 4, 0],
                    [5, 6, 0]]])
Notes:
np.concatenate requires the shapes of all arrays to match.
A similar question exists, but it only covers 1D arrays.
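For illustration, stacking the example arrays directly fails because the shapes differ (a minimal check; np.stack is used here since a new leading axis is needed):

import numpy as np

a = np.zeros((2, 2))
b = np.zeros((2, 3))
try:
    np.stack([a, b])            # needs identical shapes
except ValueError as err:
    print(err)                  # complains that the input shapes do not match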
A sub-problem:
As a special case, you may assume that .shape[1] is the same for all arrays (and therefore equal to max_1). For example:
lst = [np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2, 3]]),
       np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])]
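For this special case only axis 0 needs padding, so one straightforward approach (a rough sketch of my own, using the hypothetical name fill_rows) is to allocate the target array and copy each array's rows into it:

import numpy as np

def fill_rows(lst, fill=0):
    # Assumes every array has the same .shape[1]; only axis 0 is padded.
    max_0 = max(a.shape[0] for a in lst)
    out = np.full((len(lst), max_0, lst[0].shape[1]), fill)
    for i, a in enumerate(lst):
        out[i, :a.shape[0]] = a
    return out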
Bonus (a hard question):
Can this be applied to more dimensions? E.g., while concatenating 3D arrays into a 4D array, all 3D arrays (rectangular parallelepipeds) would start at the same corner, and if their shapes are too small, the missing values (up to the edges) would be filled with 0 or np.nan.
How to do this at all? How to do this efficiently (potentially for thousands of arrays, each with thousands of elements)? Maybe by creating an array of the final shape and filling it in some vectorized way? Or by converting all arrays into DataFrames and concatenating them with pd.concat? Maybe SciPy has some helpful functions for this?
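For reference, the pandas route can be made to work, although via reindex rather than pd.concat directly; a rough sketch (my own, using the hypothetical name pandas_fill, and probably slower than a pure NumPy fill):

import numpy as np
import pandas as pd

def pandas_fill(lst, fill=0):
    max_0 = max(a.shape[0] for a in lst)
    max_1 = max(a.shape[1] for a in lst)
    # reindex pads the missing rows/columns with fill_value
    dfs = [pd.DataFrame(a).reindex(index=range(max_0), columns=range(max_1),
                                   fill_value=fill)
           for a in lst]
    return np.stack([df.to_numpy() for df in dfs])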
CodePudding user response:
You can use numpy.pad to do this.
import numpy as np
lst = [np.array([[1, 2],
                 [3, 4]]),
       np.array([[1, 2, 3],
                 [4, 5, 6]]),
       np.array([[1, 2],
                 [3, 4],
                 [5, 6]])]

maxx = max(x.shape[0] for x in lst)
maxy = max(x.shape[1] for x in lst)
lst = [np.pad(k, [(0, maxx - k.shape[0]), (0, maxy - k.shape[1])]) for k in lst]
print(lst)
Output:
[array([[1, 2, 0],
        [3, 4, 0],
        [0, 0, 0]]),
 array([[1, 2, 3],
        [4, 5, 6],
        [0, 0, 0]]),
 array([[1, 2, 0],
        [3, 4, 0],
        [5, 6, 0]])]
This process will work with any number of dimensions. You'd have to use a loop over the axes instead of the separate maxx/maxy computations.
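A sketch of that generalization (my own adaptation of the above, with the hypothetical name pad_fill_nd, not part of the original answer):

import numpy as np

def pad_fill_nd(lst, fill=0):
    # Maximum extent along each axis (assumes all arrays have the same ndim).
    max_shape = np.max([a.shape for a in lst], axis=0)
    padded = [np.pad(a,
                     [(0, m - s) for s, m in zip(a.shape, max_shape)],
                     constant_values=fill)
              for a in lst]
    return np.array(padded)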
CodePudding user response:
A solution for general dimensions: non-vectorized, but it avoids the slow np.pad calls (~20x faster, benchmarked with the example lst * 10000).
import numpy as np

def fill_axis(lst):
    shapes = np.array([arr.shape for arr in lst])
    # Result shape: (number of arrays, max extent along each axis), filled with 0.
    res = np.zeros((len(shapes), *shapes.max(0)), int)
    for x, arr in enumerate(lst):
        # Select the x-th slot, restricted to this array's extent along each axis.
        slices = [x]
        slices += (slice(None, shape) for shape in arr.shape)
        res[tuple(slices)] = arr
    return res
lst = lst * 10000

%timeit fill_axis(lst)
# 77.3 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# solution by @TimRoberts https://stackoverflow.com/a/73536898/14277722
def pad_fill(lst):
    maxx = max(x.shape[0] for x in lst)
    maxy = max(x.shape[1] for x in lst)
    res = [np.pad(k, [(0, maxx - k.shape[0]), (0, maxy - k.shape[1])]) for k in lst]
    return np.array(res)

%timeit pad_fill(lst)
# 1.82 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

np.testing.assert_equal(pad_fill(lst), fill_axis(lst))
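If np.nan is wanted as the fill value instead of 0, the result has to be a float array; a small variant of fill_axis along those lines (a sketch, using the hypothetical name fill_axis_nan):

def fill_axis_nan(lst):
    shapes = np.array([arr.shape for arr in lst])
    # np.full with np.nan yields a float array; the unfilled gaps stay NaN.
    res = np.full((len(shapes), *shapes.max(0)), np.nan)
    for x, arr in enumerate(lst):
        slices = [x]
        slices += (slice(None, shape) for shape in arr.shape)
        res[tuple(slices)] = arr
    return res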
An example with stacked 4D arrays (giving a 5D result):
lst_4D = [np.arange(1*2*3*3).reshape(1,2,3,3),
          np.arange(2*3*2*2).reshape(2,3,2,2)]
fill_axis(lst_4D)
Output:
array([[[[[ 0, 1, 2],
          [ 3, 4, 5],
          [ 6, 7, 8]],
         [[ 9, 10, 11],
          [12, 13, 14],
          [15, 16, 17]],
         [[ 0, 0, 0],
          [ 0, 0, 0],
          [ 0, 0, 0]]],
        [[[ 0, 0, 0],
          [ 0, 0, 0],
          [ 0, 0, 0]],
         [[ 0, 0, 0],
          [ 0, 0, 0],
          [ 0, 0, 0]],
         [[ 0, 0, 0],
          [ 0, 0, 0],
          [ 0, 0, 0]]]],
       [[[[ 0, 1, 0],
          [ 2, 3, 0],
          [ 0, 0, 0]],
         [[ 4, 5, 0],
          [ 6, 7, 0],
          [ 0, 0, 0]],
         [[ 8, 9, 0],
          [10, 11, 0],
          [ 0, 0, 0]]],
        [[[12, 13, 0],
          [14, 15, 0],
          [ 0, 0, 0]],
         [[16, 17, 0],
          [18, 19, 0],
          [ 0, 0, 0]],
         [[20, 21, 0],
          [22, 23, 0],
          [ 0, 0, 0]]]]])
An adaptation of @divakar's solution for 2D arrays. ~2x faster with larger lists than the more general solution in my benchmarks, but harder to generalize to more dimensions.
def einsum_fill(lst):
    shapes = np.array([arr.shape for arr in lst])
    # Per-axis masks: True where the row / column index lies within the array's extent.
    a = np.arange(shapes[:, 0].max()) < shapes[:, [0]]
    b = np.arange(shapes[:, 1].max()) < shapes[:, [1]]
    # Outer product of the two masks marks the filled region of each output slice.
    mask = np.einsum('ij,ik->ijk', a, b)
    res = np.zeros_like(mask, int)
    res[mask] = np.concatenate([arr.ravel() for arr in lst])
    return res
%timeit einsum_fill(lst)
# 46.7 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.testing.assert_equal(einsum_fill(lst), fill_axis(lst))
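If the mask idea ever needs to go beyond 2D, the per-axis masks can be combined by broadcasting instead of einsum; a rough sketch of such a generalization (my own adaptation, with the hypothetical name mask_fill_nd, not part of the answer above):

def mask_fill_nd(lst, fill=0):
    shapes = np.array([a.shape for a in lst])
    max_shape = shapes.max(0)
    n = len(lst)
    # Combine one boolean "within bounds" mask per axis via broadcasting.
    mask = np.ones((n, *max_shape), bool)
    for ax, m in enumerate(max_shape):
        within = np.arange(m) < shapes[:, [ax]]        # shape (n, m)
        new_shape = [n] + [m if d == ax else 1 for d in range(len(max_shape))]
        mask &= within.reshape(new_shape)
    res = np.full(mask.shape, fill)
    res[mask] = np.concatenate([a.ravel() for a in lst])
    return res

np.testing.assert_equal(mask_fill_nd(lst), fill_axis(lst))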