I'm trying to convert an int ID and an array of 3 ints into a DataFrame with 2 columns, to then union with another DataFrame in PySpark.
However, I keep getting schema-related errors and nothing seems to work. I'm not sure why this is.
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

emp_rdd = spark.sparkContext.emptyRDD()
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("data", ArrayType(IntegerType()), True),
])
df = spark.createDataFrame(data=emp_rdd, schema=schema)
columns = ['id', 'data']
for i in range(10):
    data = [id, data1]
    newRows = spark.createDataFrame(data, columns)
    df = df.union(newRows)
This is giving me this error:
Can not infer schema for type: <class 'int'>
Any help would be appreciated
CodePudding user response:
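The reason you get this error is that, in your for loop, you're passing data as a flat list, while spark.createDataFrame expects an iterable of rows (lists or tuples). Spark treats each element of [id, data1] as a separate row, so it tries to infer a schema from a bare int and fails.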
Try changing it to:
    data = [(id, data1)]
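To make the difference in shape concrete, here is a small plain-Python sketch (no Spark needed). The names row_id and values are hypothetical stand-ins for the question's id and data1. It only prints the type of each top-level element, which is what createDataFrame would see as a row.

```python
# Hypothetical values standing in for the question's `id` and `data1`.
row_id, values = 7, [1, 2, 3]

flat = [row_id, values]        # two top-level elements: an int and a list
wrapped = [(row_id, values)]   # one top-level element: a tuple with two fields

# Each top-level element is what createDataFrame would treat as one row.
for name, data in (("flat", flat), ("wrapped", wrapped)):
    print(name, [type(row).__name__ for row in data])
```

In the flat version the first "row" is an int, which matches the type named in the error message.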
Example:
for i in range(5):
    data = [(i, [i + 1, i + 2, i + 3])]
    newRows = spark.createDataFrame(data, columns)
    df = df.union(newRows)
df.show()
#+---+---------+
#| id|     data|
#+---+---------+
#|  0|[1, 2, 3]|
#|  1|[2, 3, 4]|
#|  2|[3, 4, 5]|
#|  3|[4, 5, 6]|
#|  4|[5, 6, 7]|
#+---+---------+
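If the rows can all be computed up front, a possible alternative is to collect them in a Python list and call createDataFrame once, instead of unioning inside the loop. This is only a sketch: build_rows and rows_to_dataframe are hypothetical helper names, the row values follow the example above, and it assumes an active SparkSession named spark, as in the question.

```python
def build_rows(n):
    """Return n (id, data) tuples shaped like the example: (i, [i + 1, i + 2, i + 3])."""
    return [(i, [i + 1, i + 2, i + 3]) for i in range(n)]


def rows_to_dataframe(spark, rows):
    """Create one DataFrame from all rows, using the question's explicit schema."""
    # Imported here so build_rows can be used without pyspark installed.
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("data", ArrayType(IntegerType()), True),
    ])
    return spark.createDataFrame(rows, schema)


rows = build_rows(5)
print(rows[0], rows[-1])

# With an active SparkSession named `spark` (assumed, as in the question):
# df = rows_to_dataframe(spark, rows)
# df.show()
```

Passing the schema explicitly keeps the column types as declared in the question rather than leaving them to inference.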