Efficient way to load big heterogeneous dataset in a numpy array?

Time:12-10

I have to load a dataset into a big array with p instances where each instance has 2 dimensions (n_i, m). The length of the first dimension n_i is variable.

My first approach was to pad all the instances to max_len along the first dimension, initialize an array of size (p, max_len, m), and then assign each instance into the big array as follows: big_array[i*max_len:i*max_len + max_len] = padded_i_instance. This is fast and works well, but I only have 8 GB of RAM and I get an (interrupted by signal 9: SIGKILL) error when I try to load the whole dataset. It also feels very wasteful, since the shortest instance is almost 10 times shorter than max_len, so some instances are 90% padding.

My second approach was to use np.vstack and then build the big_array iteratively. Something like this:

big_array = np.zeros([1,l])
for i in range(1,n):
    big_array = np.vstack([big_array, np.full([i,l], i)])

This feels less "wasteful", but it takes about 100x longer to execute for only 10000 instances, so it is infeasible for 100k.

So I was wondering if there is a method that is both more memory efficient than approach 1 and more computationally efficient than approach 2. I read about np.append and np.insert, but they seem to be other versions of np.vstack, so I assume they would take roughly as much time.

CodePudding user response:

The slow repeated vstack:

In [200]: n=5; l=2
     ...: big_array = np.zeros([1,l])
     ...: for i in range(1,n):
     ...:     big_array = np.vstack([big_array, np.full([i,l], i)])
     ...: 
In [201]: big_array
Out[201]: 
array([[0., 0.],
       [1., 1.],
       [2., 2.],
       [2., 2.],
       [3., 3.],
       [3., 3.],
       [3., 3.],
       [4., 4.],
       [4., 4.],
       [4., 4.],
       [4., 4.]])
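
To see why the loop above slows down as it grows: every np.vstack call allocates a new array and copies all rows accumulated so far, so the total number of rows written grows roughly quadratically with the number of instances. The following is only a back-of-the-envelope sketch; the function names are made up for illustration, and the count ignores the initial zeros row.

```python
def rows_copied_by_repeated_vstack(lengths):
    """Count rows written when the big array is rebuilt once per instance."""
    rows_so_far = 0
    rows_copied = 0
    for n_i in lengths:
        rows_so_far += n_i
        # Each vstack writes a brand-new array holding every row so far.
        rows_copied += rows_so_far
    return rows_copied


def rows_copied_by_single_vstack(lengths):
    """A single vstack over a list writes each row exactly once."""
    return sum(lengths)


lengths = [1, 2, 3, 4]  # same instance lengths as the n=5 example
repeated = rows_copied_by_repeated_vstack(lengths)
single = rows_copied_by_single_vstack(lengths)
print(repeated, single)
```

For the four instances of the small example the gap is only 20 versus 10 rows, but with equal-length instances the repeated version copies about p/2 times more rows than the single call, which becomes very large for 10000 or 100k instances.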

list append is faster:

In [202]: alist = []
In [203]: for i in range(1,n):
     ...:     alist.append(np.full([i,l], i))
     ...: 
     ...: 
In [204]: alist
Out[204]: 
[array([[1, 1]]),
 array([[2, 2],
        [2, 2]]),
 array([[3, 3],
        [3, 3],
        [3, 3]]),
 array([[4, 4],
        [4, 4],
        [4, 4],
        [4, 4]])]
In [205]: np.vstack(alist)
Out[205]: 
array([[1, 1],
       [2, 2],
       [2, 2],
       [3, 3],
       [3, 3],
       [3, 3],
       [4, 4],
       [4, 4],
       [4, 4],
       [4, 4]])
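
Once the instances are stacked without padding, the boundaries between them are no longer visible in the array, so it may help to keep an offsets array alongside it. The sketch below is one possible way to do that; pack_instances and get_instance are hypothetical helper names, not NumPy functions.

```python
import numpy as np


def pack_instances(instances):
    """Stack variable-length (n_i, m) arrays and record where each one starts."""
    lengths = np.array([a.shape[0] for a in instances])
    # offsets[i] is the first row of instance i; offsets[-1] is the total row count.
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    packed = np.vstack(instances)  # one allocation, each row copied once
    return packed, offsets


def get_instance(packed, offsets, i):
    """Return a view (no copy) of instance i inside the packed array."""
    return packed[offsets[i]:offsets[i + 1]]


instances = [np.full((i, 2), i) for i in range(1, 5)]
packed, offsets = pack_instances(instances)
third = get_instance(packed, offsets, 2)
```

Note that the list and the stacked result exist in memory at the same time while np.vstack runs, so the peak usage is roughly twice the size of the unpadded data.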

filling a preallocated array:

In [210]: arr = np.zeros((10,2),int)
     ...: cnt=0
     ...: for i in range(0,n):
     ...:     arr[cnt:cnt+i,:] = np.full([i,l],i)
     ...:     cnt += i
     ...: 
In [211]: arr
Out[211]: 
array([[1, 1],
       [2, 2],
       [2, 2],
       [3, 3],
       [3, 3],
       [3, 3],
       [4, 4],
       [4, 4],
       [4, 4],
       [4, 4]])
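
For a dataset that does not fit in memory twice, the preallocated version can be written so that only one instance is held besides the big array. This is a minimal sketch under two assumptions: the lengths n_i can be obtained up front (for example from a cheap first pass over the files), and load_instance is a hypothetical stand-in for whatever actually reads instance i from disk.

```python
import numpy as np


def load_instance(i, m):
    # Hypothetical loader: replace with the real per-instance read.
    return np.full((i + 1, m), i + 1, dtype=np.float32)


def fill_preallocated(lengths, m, dtype=np.float32):
    """Allocate the exact total size once, then fill it instance by instance."""
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    big = np.empty((offsets[-1], m), dtype=dtype)  # no padding rows
    for i in range(len(lengths)):
        # Only one instance is alive at a time besides the big array.
        big[offsets[i]:offsets[i + 1]] = load_instance(i, m)
    return big, offsets


lengths = [1, 2, 3, 4]
big, offsets = fill_preallocated(lengths, m=2)
```

Using float32 instead of NumPy's default float64 halves the memory of the big array, if that precision is enough for the data.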