Through a Spark pipeline I am retrieving a matrix (ArrayType(ArrayType(FloatType()))) and using .toPandas() to bring the data into memory for analysis.
from pyspark.sql.types import ArrayType, FloatType
from pyspark.sql.functions import from_json
schema = ArrayType(ArrayType(FloatType()))
sdf = sdf.withColumn("my_data", from_json("my_column", schema))
pdf = sdf.select("my_data").toPandas()
The dataframe contains N rows, and each entry of my_data contains a matrix of shape (M, D). I would like to end up with a numpy array of shape (N, M, D) and dtype float.
The issue is that .toPandas() converts arrays into numpy arrays with dtype object, so I end up with a nested structure: each element of the my_data column has shape (M,) and dtype object, with each child element therein having shape (D,) and dtype float.
(I guess the reasoning behind this design choice is that there is no inherent guarantee that the inner lists have same length, but in my case I know they do.)
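For illustration (a minimal sketch, not from the original post): equal-length nested Python lists convert cleanly, while an object array of object arrays, which is what .toPandas() hands back, keeps its nesting:

```python
import numpy as np

# Equal-length nested Python lists: np.array infers a clean 3-D float array.
nested_lists = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
clean = np.array(nested_lists)
print(clean.shape, clean.dtype)  # (2, 2, 2) float64

# An object array of object arrays keeps its nesting: np.array and .astype
# cannot "see through" the object elements.
outer = np.empty(2, dtype=object)
for i in range(2):
    inner = np.empty(2, dtype=object)
    for j in range(2):
        inner[j] = np.array([1.0, 2.0])
    outer[i] = inner
print(outer.shape, outer.dtype)  # (2,) object
```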
One naive solution would be a nested .tolist() followed by np.array(), but this seems inefficient (and does not generalise well to deeper structures):
my_data = np.array([
    [inner.tolist() for inner in row]
    for row in pdf["my_data"]
])
There must be a better way? Is there anything in the numpy API I'm missing?
Update 1: Some debug info
my_data = pdf["my_data"]
example = my_data[0] # first row
print(type(example)) # <class 'numpy.ndarray'>
print(example.shape) # (15,)
print(example.dtype) # object
element = example[0]
print(type(element)) # <class 'numpy.ndarray'>
print(element.shape) # (17,)
print(element.dtype) # float32
example.astype(float)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
analysis.ipynb Cell 20 in <cell line: 1>()
----> 1 example.astype(float)
ValueError: setting an array element with a sequence.
Update 2: This snippet recreates the input data structure.
import numpy as np

N, M, D = 100, 15, 17
rows = np.ndarray(N, dtype=object)
for i in range(N):
    example = np.ndarray(M, dtype=object)
    for j in range(M):
        example[j] = np.random.rand(D).astype(np.float32)
    rows[i] = example
CodePudding user response:
Considering your rows array, I can't see a way to avoid a for loop for now, but the following does the job:
>>> res = np.array([np.stack(r) for r in rows])
>>> res.shape
(100, 15, 17)
>>> res.dtype
dtype('float32')
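If the per-row loop bothers you, one variant (a sketch, assuming every inner array really has the same length D, as in your case) is to concatenate the outer object arrays first and reshape; it still iterates internally and is unlikely to be faster, but it avoids the explicit Python-level loop:

```python
import numpy as np

# Rebuild the nested object-array structure from the question.
N, M, D = 100, 15, 17
rows = np.empty(N, dtype=object)
for i in range(N):
    example = np.empty(M, dtype=object)
    for j in range(M):
        example[j] = np.random.rand(D).astype(np.float32)
    rows[i] = example

# np.concatenate merges the (M,) object arrays into one (N*M,) object array;
# np.stack then builds a (N*M, D) float array, and reshape restores (N, M, D).
flat = np.concatenate(rows)           # shape (1500,), dtype object
res = np.stack(flat).reshape(N, M, D)
print(res.shape, res.dtype)           # (100, 15, 17) float32
```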
Before edit
Not exactly sure how your data is formatted, but if you have a list of numpy arrays of same shape and dtype object:
import numpy as np
N, M, D = 3, 5, 7
arrays = [np.random.randn(M, D).astype(object) for _ in range(N)]
You can just do np.array(arrays).astype(float):
>>> np.array(arrays).astype(float).shape
(3, 5, 7)
CodePudding user response:
You have object arrays within an object array.
In [29]: rows.shape, rows.dtype
Out[29]: ((100,), dtype('O'))
In [30]: rows[0].shape, rows[0].dtype
Out[30]: ((15,), dtype('O'))
In [31]: rows[0][0].shape, rows[0][0].dtype
Out[31]: ((17,), dtype('float32'))
A double conversion is required:
In [32]: arr = np.array([a.tolist() for a in rows.tolist()])
In [33]: arr.shape, arr.dtype
Out[33]: ((100, 15, 17), dtype('float32'))
You could substitute np.stack for that inner tolist, but it doesn't gain any speed.
In [34]: arr = np.array([np.stack(a) for a in rows.tolist()])
In [35]: arr.shape, arr.dtype
Out[35]: ((100, 15, 17), dtype('float32'))
One way or another the object dtype of the inner arrays has to be 'flattened' - one at a time.
I haven't followed your pandas
work, but trying to put a 3d structure into a frame is bound to create this kind of nesting. A dataframe is a 2d structure - rows and columns. That means the 3rd dimension has to be an array or list in each cell - hence object dtypes.
pd.DataFrame(arr) # error
pd.DataFrame(arr.tolist()) # [100 rows x 15 columns]
though the values of such a frame is a 2d object-dtype array.
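To make that concrete (a sketch under the same N, M, D assumptions, using hypothetical names): each cell of the .tolist() frame holds a length-D list, so stacking the raveled values recovers the 3-D array:

```python
import numpy as np
import pandas as pd

N, M, D = 100, 15, 17
arr = np.random.rand(N, M, D).astype(np.float32)

# Building the frame from arr.tolist() puts a length-D list in each cell,
# so .to_numpy() yields a (N, M) object array.
df = pd.DataFrame(arr.tolist())
vals = df.to_numpy()
print(vals.shape, vals.dtype)  # (100, 15) object

# Stacking the raveled cells and reshaping recovers the original values
# (as float64, since tolist() produced Python floats).
back = np.stack(vals.ravel()).reshape(N, M, D)
print(np.array_equal(back, arr))  # True
```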