Through a Spark pipeline I am retrieving a matrix (ArrayType(ArrayType(FloatType()))) and using .toPandas() to bring the data into memory for analysis.
from pyspark.sql.types import ArrayType, FloatType
from pyspark.sql.functions import from_json
schema = ArrayType(ArrayType(FloatType()))
sdf = sdf.withColumn("my_data", from_json("my_column", schema))
pdf = sdf.select("my_data").toPandas()
The dataframe contains N rows, and each entry of my_data contains a matrix of shape (M, D). I would like to end up with a numpy array of shape (N, M, D) and dtype float.
The issue is that .toPandas() converts arrays into numpy arrays with dtype object, so I end up with a nested structure: each element of the my_data column has shape (M,) and dtype object, with each child element therein having shape (D,) and dtype float.
(I guess the reasoning behind this design choice is that there is no inherent guarantee that the inner lists have same length, but in my case I know they do.)
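For illustration (a minimal sketch, not from the original post): equal-length nested Python lists convert cleanly, while an object array of object arrays, which is what .toPandas() hands back, keeps its nesting:

```python
import numpy as np

# Equal-length nested Python lists: np.array infers a clean 3-D float array.
nested_lists = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
clean = np.array(nested_lists)
print(clean.shape, clean.dtype)  # (2, 2, 2) float64

# An object array of object arrays keeps its nesting: np.array and .astype
# cannot "see through" the object elements.
outer = np.empty(2, dtype=object)
for i in range(2):
    inner = np.empty(2, dtype=object)
    for j in range(2):
        inner[j] = np.array([1.0, 2.0])
    outer[i] = inner
print(outer.shape, outer.dtype)  # (2,) object
```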
One naive solution would be a nested .tolist() followed by np.array(), but this seems inefficient (and does not generalise well to deeper structures):
my_data = np.array([
    [inner.tolist() for inner in row]
    for row in pdf["my_data"]
])
There must be a better way? Is there anything in the numpy API I'm missing?
Update 1: Some debug info
my_data = pdf["my_data"]
example = my_data[0] # first row
print(type(example)) # <class 'numpy.ndarray'>
print(example.shape) # (15,)
print(example.dtype) # object
element = example[0]
print(type(element)) # <class 'numpy.ndarray'>
print(element.shape) # (17,)
print(element.dtype) # float32
example.astype(float)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
analysis.ipynb Cell 20 in <cell line: 1>()
----> 1 example.astype(float)
ValueError: setting an array element with a sequence.
Update 2: This snippet recreates the input data structure.
import numpy as np

N, M, D = 100, 15, 17
rows = np.ndarray(N, dtype=object)
for i in range(N):
    example = np.ndarray(M, dtype=object)
    for j in range(M):
        example[j] = np.random.rand(D).astype(np.float32)
    rows[i] = example
CodePudding user response:
Considering your rows array, I can't see a way to avoid a for loop for now, but the following does the job:
>>> res = np.array([np.stack(r) for r in rows])
>>> res.shape
(100, 15, 17)
>>> res.dtype
dtype('float32')
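If the per-row loop bothers you, one variant (a sketch, assuming every inner array really has the same length D, as in your case) is to concatenate the outer object arrays first and reshape; it still iterates internally and is unlikely to be faster, but it avoids the explicit Python-level loop:

```python
import numpy as np

# Rebuild the nested object-array structure from the question.
N, M, D = 100, 15, 17
rows = np.empty(N, dtype=object)
for i in range(N):
    example = np.empty(M, dtype=object)
    for j in range(M):
        example[j] = np.random.rand(D).astype(np.float32)
    rows[i] = example

# np.concatenate merges the (M,) object arrays into one (N*M,) object array;
# np.stack then builds a (N*M, D) float array, and reshape restores (N, M, D).
flat = np.concatenate(rows)           # shape (1500,), dtype object
res = np.stack(flat).reshape(N, M, D)
print(res.shape, res.dtype)           # (100, 15, 17) float32
```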
Before edit
Not exactly sure how your data is formatted, but if you have a list of numpy arrays of same shape and dtype object:
import numpy as np
N, M, D = 3, 5, 7
arrays = [np.random.randn(M, D).astype(object) for _ in range(N)]
You can just do np.array(arrays).astype(float):
>>> np.array(arrays).astype(float).shape
(3, 5, 7)
CodePudding user response:
You have object arrays within an object array.
In [29]: rows.shape, rows.dtype
Out[29]: ((100,), dtype('O'))
In [30]: rows[0].shape, rows[0].dtype
Out[30]: ((15,), dtype('O'))
In [31]: rows[0][0].shape, rows[0][0].dtype
Out[31]: ((17,), dtype('float32'))
A double conversion is required:
In [32]: arr = np.array([a.tolist() for a in rows.tolist()])
In [33]: arr.shape, arr.dtype
Out[33]: ((100, 15, 17), dtype('float32'))
You could substitute np.stack for that inner tolist, but it doesn't gain any speed.
In [34]: arr = np.array([np.stack(a) for a in rows.tolist()])
In [35]: arr.shape, arr.dtype
Out[35]: ((100, 15, 17), dtype('float32'))
One way or another the object dtype of the inner arrays has to be 'flattened' - one at a time.
I haven't followed your pandas
work, but trying to put a 3d structure into a frame is bound to create this kind of nesting. A dataframe is a 2d structure - rows and columns. That means the 3rd dimension has to be an array or list in each cell - hence object dtypes.
pd.DataFrame(arr) # error
pd.DataFrame(arr.tolist()) # [100 rows x 15 columns]
though the values of such a frame is a 2d object-dtype array.
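To make that concrete (a sketch under the same N, M, D assumptions, using hypothetical names): each cell of the .tolist() frame holds a length-D list, so stacking the raveled values recovers the 3-D array:

```python
import numpy as np
import pandas as pd

N, M, D = 100, 15, 17
arr = np.random.rand(N, M, D).astype(np.float32)

# Building the frame from arr.tolist() puts a length-D list in each cell,
# so .to_numpy() yields a (N, M) object array.
df = pd.DataFrame(arr.tolist())
vals = df.to_numpy()
print(vals.shape, vals.dtype)  # (100, 15) object

# Stacking the raveled cells and reshaping recovers the original values
# (as float64, since tolist() produced Python floats).
back = np.stack(vals.ravel()).reshape(N, M, D)
print(np.array_equal(back, arr))  # True
```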