I have to load a dataset into a big array with p instances, where each instance has two dimensions (n_i, m). The length of the first dimension, n_i, is variable.
My first approach was to pad all the instances to max_len over the first dimension, initialize an array of size (p, max_len, m), and then copy each instance into the big array as follows: big_array[i*max_len:i*max_len + max_len] = padded_i_instance. This is fast and works well. The problem is that I only have 8 GB of RAM, and I get an (interrupted by signal 9: SIGKILL) error when I try to load the whole dataset. It also feels very wasteful, since the shortest instance is almost 10 times shorter than max_len, so some instances are 90% padding.
My second approach was to use np.vstack and build big_array iteratively. Something like this:
big_array = np.zeros([1,l])
for i in range(1,n):
    big_array = np.vstack([big_array, np.full([i,l], i)])
This feels less "wasteful", but it takes about 100x longer to execute for only 10,000 instances, so it is infeasible for 100k.
So I was wondering whether there is a method that is both more memory-efficient than approach 1 and faster than approach 2. I read about np.append and np.insert, but they seem to be variants of np.vstack, so I assume they would take roughly as much time.
CodePudding user response:
The slow repeated vstack:
In [200]: n=5; l=2
     ...: big_array = np.zeros([1,l])
     ...: for i in range(1,n):
     ...:     big_array = np.vstack([big_array, np.full([i,l], i)])
     ...:
In [201]: big_array
Out[201]:
array([[0., 0.],
       [1., 1.],
       [2., 2.],
       [2., 2.],
       [3., 3.],
       [3., 3.],
       [3., 3.],
       [4., 4.],
       [4., 4.],
       [4., 4.],
       [4., 4.]])
list append is faster:
In [202]: alist = []
In [203]: for i in range(1,n):
     ...:     alist.append(np.full([i,l], i))
     ...:
In [204]: alist
Out[204]:
[array([[1, 1]]),
 array([[2, 2],
        [2, 2]]),
 array([[3, 3],
        [3, 3],
        [3, 3]]),
 array([[4, 4],
        [4, 4],
        [4, 4],
        [4, 4]])]
In [205]: np.vstack(alist)
Out[205]:
array([[1, 1],
       [2, 2],
       [2, 2],
       [3, 3],
       [3, 3],
       [3, 3],
       [4, 4],
       [4, 4],
       [4, 4],
       [4, 4]])
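Once the instances are stacked without padding, the instance boundaries are no longer visible in the array itself. One way to keep them, sketched here as an illustration only, is to store a cumulative row-offset array next to the stacked data. The helper name stack_instances and the variable offsets are made up for this example; np.concatenate along axis 0 does the same job as np.vstack for 2-D inputs.

```python
import numpy as np

def stack_instances(instances):
    """Concatenate variable-length (n_i, m) arrays and return row offsets."""
    lengths = [len(a) for a in instances]
    # offsets[i] is the first row of instance i; offsets[-1] is the total row count
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    flat = np.concatenate(instances, axis=0)  # a single copy, done once
    return flat, offsets

n, l = 5, 2
alist = [np.full([i, l], i) for i in range(1, n)]
flat, offsets = stack_instances(alist)

# Rows belonging to the third instance (index 2), recovered as a view
third = flat[offsets[2]:offsets[3]]
```

The list is appended to in place, so the only full-size copy happens in the final concatenate, whereas the repeated vstack copies the whole growing array on every iteration.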
filling a preallocated array:
In [210]: arr = np.zeros((10,2),int)
     ...: cnt=0
     ...: for i in range(0,n):
     ...:     arr[cnt:cnt+i,:] = np.full([i,l],i)
     ...:     cnt += i
     ...:
In [211]: arr
Out[211]:
array([[1, 1],
       [2, 2],
       [2, 2],
       [3, 3],
       [3, 3],
       [3, 3],
       [4, 4],
       [4, 4],
       [4, 4],
       [4, 4]])
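For the real dataset, the preallocated version could be done in two passes: first collect only the lengths n_i, then allocate one (sum of n_i, m) array and fill it one instance at a time. This is a rough sketch under assumptions, not a tested solution for your data: load_instance is a hypothetical stand-in for however you read one instance, and float32 is assumed to be precise enough (it halves the memory compared with the default float64).

```python
import numpy as np

def load_instance(i, m):
    # Hypothetical loader: pretend instance i has i + 1 rows and m columns
    return np.full((i + 1, m), i, dtype=np.float32)

def pack(lengths, m, dtype=np.float32):
    """Fill one preallocated array with all instances, without padding."""
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    big = np.empty((offsets[-1], m), dtype=dtype)
    for i in range(len(lengths)):
        # Only one instance is held in memory besides the big array
        big[offsets[i]:offsets[i + 1]] = load_instance(i, m)
    return big, offsets

lengths = [1, 2, 3, 4]  # first pass: the n_i of every instance
big, offsets = pack(lengths, m=2)
last = big[offsets[3]:offsets[4]]  # rows of the last instance
```

Compared with appending to a list and stacking at the end, this never holds two full copies of the data at once, which may matter with 8 GB of RAM.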