Cannot convert a list of int array(int) into a pyspark dataframe

Time:11-15

I'm trying to convert an int ID and an array of 3 ints into a DataFrame with 2 columns, so that I can union it with another DataFrame in PySpark.

However, I keep getting schema-related errors and nothing seems to work. I'm not sure why.

from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

emp_rdd = spark.sparkContext.emptyRDD()
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("data", ArrayType(IntegerType()), True),
])
df = spark.createDataFrame(data=emp_rdd, schema=schema)

columns = ['id', 'data']
for i in range(10):
    data = [id, data1]
    newRows = spark.createDataFrame(data, columns)
    df = df.union(newRows)

This gives me the following error:

Can not infer schema for type: <class 'int'>

Any help would be appreciated.

CodePudding user response:

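The reason you get this error is that, inside your for-loop, you pass data as a flat list, [id, data1], while spark.createDataFrame expects an iterable of rows, where each row is itself a list or tuple. With a flat list, the first "row" Spark sees is the int itself, so it cannot infer a schema from it.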

Try changing it to:

data = [(id, data1)]
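
To see why the extra tuple matters, here is a minimal plain-Python sketch (no Spark needed) of what each list looks like when it is read element by element as rows. The values `row_id` and `data1` and the helper `row_types` are hypothetical stand-ins chosen for illustration, not part of the original question.

```python
# Hypothetical stand-ins for the question's `id` and `data1` values.
row_id = 7
data1 = [1, 2, 3]

flat = [row_id, data1]        # read as two "rows": an int, then a list
wrapped = [(row_id, data1)]   # read as one row with two fields


def row_types(rows):
    """Return the type name of each element that would be treated as a row."""
    return [type(row).__name__ for row in rows]


print(row_types(flat))     # ['int', 'list']
print(row_types(wrapped))  # ['tuple']
```

The first element of the flat list is an `int`, which matches the type named in the error message.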

Example:

for i in range(5):
    data = [(i, [i + 1, i + 2, i + 3])]
    newRows = spark.createDataFrame(data, columns)
    df = df.union(newRows)

df.show()

#+---+---------+
#| id|     data|
#+---+---------+
#|  0|[1, 2, 3]|
#|  1|[2, 3, 4]|
#|  2|[3, 4, 5]|
#|  3|[4, 5, 6]|
#|  4|[5, 6, 7]|
#+---+---------+
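
As a possible alternative to calling union once per iteration, the rows can be collected in a plain Python list first and turned into a DataFrame in a single call. The sketch below assumes an existing SparkSession passed in as `spark`; the helper names `build_rows` and `rows_to_dataframe` are hypothetical. Passing the explicit schema from the question should also keep the columns as integers, whereas types inferred from Python ints come out as longs.

```python
def build_rows(n):
    """Build n rows shaped as (id, [three ints]), mirroring the example above."""
    return [(i, [i + 1, i + 2, i + 3]) for i in range(n)]


def rows_to_dataframe(spark, rows):
    """Create a DataFrame from all rows in one call, using an explicit schema.

    `spark` is assumed to be an existing SparkSession.
    """
    # Imported here so that build_rows can be used without PySpark installed.
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("data", ArrayType(IntegerType()), True),
    ])
    return spark.createDataFrame(rows, schema=schema)
```

Given an active session named `spark`, `rows_to_dataframe(spark, build_rows(5)).show()` should display the same five rows as the loop-based example.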