Element wise mean of numpy arrays of different sizes-CodePudding

So there is a csv file I'm reading where I'm focusing on col3 where the rows have the values of different lengths where initially it was being read as a type str but was fixed using pd.eval.

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})


row e.g. [0, 100, -200, 300, -150...]

There are many rows of different sizes and I want to calculate the element wise average, where I have followed this

And the error is pointing towards the last line

The error I can see if pointing towards the arrays being of type object but I'm not sure how to fix it.

CodePudding user response：

Make a ragged array:

In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]: 
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
       array([0., 0., 0.])], dtype=object)

Note the shape and dtype.

Try to use mean on it:

In [25]: np.mean(arr)
Traceback (most recent call last):
  Input In [25] in <cell line: 1>
    np.mean(arr)
  File <__array_function__ internals>:180 in mean
  File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,)

Apply mean to each element array works:

In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]

Trying to use zip_longest:

In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]: 
[(array([0, 1, 2, 3, 4]),),
 (array([1., 1., 1., 1., 1.]),),
 (array([0., 0., 0.]),)]

No change. We can use it by unpacking the arr - but it has padded the arrays in the wrong way:

In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]

zip_longest can be used to pad lists, but it takes more thought than this.

If we make an array from that list:

In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]: 
array([[ 0.,  1.,  0.],
       [ 1.,  1.,  0.],
       [ 2.,  1.,  0.],
       [ 3.,  1., nan],
       [ 4.,  1., nan]])

and transpose it, we can take the nanmean:

In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]: 
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])