So there is a csv file I'm reading where I'm focusing on col3 where the rows have the values of different lengths where initially it was being read as a type str but was fixed using pd.eval.
df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})
row e.g. [0, 100, -200, 300, -150...]
There are many rows of different sizes and I want to calculate the element wise average, where I have followed this
And the error is pointing towards the last line
The error I can see if pointing towards the arrays being of type object but I'm not sure how to fix it.
CodePudding user response:
Make a ragged array:
In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]:
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
array([0., 0., 0.])], dtype=object)
Note the shape and dtype.
Try to use mean on it:
In [25]: np.mean(arr)
Traceback (most recent call last):
Input In [25] in <cell line: 1>
np.mean(arr)
File <__array_function__ internals>:180 in mean
File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
return _methods._mean(a, axis=axis, dtype=dtype,
File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,)
Apply mean to each element array works:
In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]
Trying to use zip_longest:
In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]:
[(array([0, 1, 2, 3, 4]),),
(array([1., 1., 1., 1., 1.]),),
(array([0., 0., 0.]),)]
No change. We can use it by unpacking the arr - but it has padded the arrays in the wrong way:
In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]
zip_longest
can be used to pad lists, but it takes more thought than this.
If we make an array from that list:
In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]:
array([[ 0., 1., 0.],
[ 1., 1., 0.],
[ 2., 1., 0.],
[ 3., 1., nan],
[ 4., 1., nan]])
and transpose it, we can take the nanmean
:
In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]:
array([[ 0., 1., 2., 3., 4.],
[ 1., 1., 1., 1., 1.],
[ 0., 0., 0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])