How to build an object NumPy array from an iterator?

I want to create a NumPy array of np.ndarray from an iterable. This is because I have a function that will return np.ndarray of some constant shape, and I need to create an array of results from this function, something like this:

OUTPUT_SHAPE = some_constant

def foo(x) -> np.ndarray:
  # processing
  # generates an np.ndarray of shape OUTPUT_SHAPE
  return output

inputs = list(range(100000))

iterable = (foo(x) for x in inputs)
arr = np.fromiter(iterable, np.ndarray)

This obviously gives an error: "cannot create object arrays from iterator".

I cannot first create a list and then convert it to an array, because that would copy every output array, so for a while almost double the memory would be occupied, and I have very limited memory.

Can anyone help me?

CodePudding user response:

You probably shouldn't make an object array. You should probably make an ordinary array of non-object dtype with one more dimension than each output (a 2D array if the outputs are 1D). As long as you know in advance how many results the iterator will give, you can avoid most of the copying you're worried about by doing it like this:

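# Assumes OUTPUT_SHAPE is a tuple such as (k,); if it is a plain int,
# write (num_iterator_outputs, OUTPUT_SHAPE) instead.
arr = numpy.empty((num_iterator_outputs, *OUTPUT_SHAPE), dtype=whatever_appropriate_dtype)
for i, output in enumerate(iterable):
    arr[i] = output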

This only needs to hold arr and a single output in memory at once, instead of arr and every output.
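As a rough, self-contained sketch of the preallocation approach above: the function `foo`, the shape `(4,)`, the count `NUM_OUTPUTS`, and the `float64` dtype are all placeholders invented for illustration, not details from the question.

```python
import numpy as np

OUTPUT_SHAPE = (4,)    # hypothetical shape standing in for the real constant
NUM_OUTPUTS = 1000     # number of results, assumed to be known in advance

def foo(x):
    # Hypothetical stand-in for the real function: returns an array of OUTPUT_SHAPE.
    return np.full(OUTPUT_SHAPE, x, dtype=np.float64)

# A generator, so only one output array is alive at any moment.
iterable = (foo(x) for x in range(NUM_OUTPUTS))

# Allocate the result once, then copy each output into its own row.
arr = np.empty((NUM_OUTPUTS, *OUTPUT_SHAPE), dtype=np.float64)
for i, output in enumerate(iterable):
    arr[i] = output

print(arr.shape)
```

Because the generator yields one array at a time, each temporary output can be freed as soon as its values have been copied into `arr`.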

If you really want an object array, you can get one. The simplest way would be to go through a list, which will not perform the copying you're worried about as long as you do it right:

outputs = list(iterable)
arr = numpy.empty(len(outputs), dtype=object)
arr[:] = outputs

Note that if you just try to call numpy.array on outputs, it will try to build a 2D array, which will cause the copying you're worried about. This is true even if you specify dtype=object - it'll try to build a 2D array of object dtype, and that'll be even worse, for both usability and memory.
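To make the difference concrete, here is a small sketch with three made-up arrays standing in for the real outputs. It compares the preallocate-and-assign route with a plain numpy.array call:

```python
import numpy as np

# Hypothetical outputs, all of the same shape.
outputs = [np.arange(3), np.ones(3), np.zeros(3)]

# Preallocating a 1D object array and assigning into it stores references.
arr = np.empty(len(outputs), dtype=object)
arr[:] = outputs

# Calling np.array on the list builds a 2D array instead, even with dtype=object.
as_numeric = np.array(outputs)
as_object = np.array(outputs, dtype=object)

print(arr.shape, as_numeric.shape, as_object.shape)
print(all(a is b for a, b in zip(arr, outputs)))  # same objects, nothing copied
```

The 1D object array holds the very same array objects as the list, whereas both numpy.array calls produce a new 3x3 array.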

CodePudding user response:

An object dtype array contains references, just like a list.

Define 3 arrays:

In [589]: a,b,c = np.arange(3), np.ones(3), np.zeros(3)

put them in a list:

In [590]: alist = [a,b,c]

and in an object dtype array:

In [591]: arr = np.empty(3,object)
In [592]: arr[:] = alist
In [593]: arr
Out[593]: 
array([array([0, 1, 2]), array([1., 1., 1.]), array([0., 0., 0.])],
      dtype=object)
In [594]: alist
Out[594]: [array([0, 1, 2]), array([1., 1., 1.]), array([0., 0., 0.])]

Modify one, and see the change in the list and array:

In [595]: b[:] = [1,2,3]
In [596]: b
Out[596]: array([1., 2., 3.])
In [597]: alist
Out[597]: [array([0, 1, 2]), array([1., 2., 3.]), array([0., 0., 0.])]
In [598]: arr
Out[598]: 
array([array([0, 1, 2]), array([1., 2., 3.]), array([0., 0., 0.])],
      dtype=object)

Creating a numeric dtype array from these copies all the values:

In [599]: arr1 = np.stack(arr)
In [600]: arr1
Out[600]: 
array([[0., 1., 2.],
       [1., 2., 3.],
       [0., 0., 0.]])
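The same contrast can be checked in a script rather than an interactive session. This is only a sketch using the three small arrays from above; the name `stacked` is introduced here for illustration:

```python
import numpy as np

a, b, c = np.arange(3), np.ones(3), np.zeros(3)

# The object array stores references to a, b and c.
arr = np.empty(3, dtype=object)
arr[:] = [a, b, c]

# np.stack allocates a new numeric array and copies the values into it.
stacked = np.stack(arr)

# Modify b in place after stacking.
b[:] = [1, 2, 3]

print(arr[1])                        # the object array sees the change
print(stacked[1])                    # the stacked copy does not
print(np.shares_memory(stacked, b))  # the copy has its own buffer
```

The object array element is still `b` itself, while `stacked` kept the values it copied before the modification.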

So even if your use of fromiter worked, it would be no different, memory-wise, from accumulating the results in a list:

alist = []
for i in range(n):
    alist.append(constant_array)
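One way to see this is to compare the size of the object array itself with the size of the data it points to. The sketch below uses made-up sizes (1000 arrays of 100 floats each) purely for illustration:

```python
import numpy as np

n = 1000
# Hypothetical stand-ins for the function outputs.
alist = [np.zeros(100) for _ in range(n)]

# Object array built from the list: it holds one reference per element.
arr = np.empty(n, dtype=object)
arr[:] = alist

pointer_bytes = arr.nbytes                    # n * size of one reference
payload_bytes = sum(x.nbytes for x in alist)  # the actual array data

print(pointer_bytes, payload_bytes)
print(arr[0] is alist[0])  # same object in both containers
```

The object array only adds one reference per element on top of the data that the list already keeps alive, so building it from the list does not duplicate the output arrays.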