I'm trying to insert the entirety of a numpy 2D array into a single PySpark row... does anyone know how to achieve this?
Ultimately I would like to achieve something like the below, where my numpy array ends up in a single row.
I have tried to use a higher-order function to do this, but haven't been able to get it working so far. Does anyone have any advice?
import pyspark.sql.functions as f
import numpy as np
df = spark.createDataFrame(np.array([[0. , 0.67235401, 0.35767577],
                                     [0.67235401, 0. , 0.2981656 ],
                                     [0.35767577, 0.2981656 , 0. ]]))
expr = "TRANSFORM(arrays_zip(*), x -> struct(*))"
df = df.withColumn('array', f.expr(expr))
df.show(truncate=False)
CodePudding user response:
Given a numpy.array, it can be converted into a PySpark DataFrame after converting the array into a Python list.
Working Example
import numpy as np
np_array = np.array([[0. , 0.67235401, 0.35767577],
                     [0.67235401, 0. , 0.2981656 ],
                     [0.35767577, 0.2981656 , 0. ]])
# Wrap the nested list in a one-element tuple so the whole matrix lands in a single row
df = spark.createDataFrame([(np_array.tolist(), )], ("array", ))
df.show(truncate=False)
Output
+-------------------------------------------------------------------------------------------+
|array                                                                                      |
+-------------------------------------------------------------------------------------------+
|[[0.0, 0.67235401, 0.35767577], [0.67235401, 0.0, 0.2981656], [0.35767577, 0.2981656, 0.0]]|
+-------------------------------------------------------------------------------------------+
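If you prefer to pin down the column type rather than let Spark infer it, or you want the matrix back as numpy on the driver, something along these lines should work. This is just a sketch (not part of the original answer); it reuses the spark session and np_array defined above, and the explicit schema is one reasonable choice, not the only one.
from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType

# Explicit schema: a single column holding an array of arrays of doubles
schema = StructType([StructField("array", ArrayType(ArrayType(DoubleType())), False)])
df = spark.createDataFrame([(np_array.tolist(), )], schema)

# Round trip: collect the single row and rebuild the numpy array on the driver
restored = np.array(df.first()["array"])
print(restored.shape)  # (3, 3)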