I was trying to convert a big pandas dataframe (6,151,291 rows × 3 columns) to a Spark dataframe.
Here's my code:
import numpy as np
from pyspark.sql.types import *
df_schema = StructType([StructField("author", ArrayType(StringType()), True),
                        StructField("title", StringType(), True),
                        StructField("year", StringType(), True)])
#rdd = spark.sparkContext.parallelize(data)
#Pandas dataframes cannot be converted directly to an RDD.
sparkDF = spark.createDataFrame(data, df_schema)  # data is the pandas dataframe
sparkDF.printSchema()
And I got this error:
TypeError: field author: ArrayType(StringType(), True) can not accept object 'SQL/Data System for VSE: A Relational Data System for Application Development.' in type <class 'str'>
Oddly, the same code works fine when I convert a small pandas dataframe.
What should I do?
CodePudding user response:
'SQL/Data System for VSE: A Relational Data System for Application Development.' is a title value in your data, right? It looks like that title value is being matched against the author column: createDataFrame assigns values to schema fields by position, not by name. I can reproduce your error with the following code sample:
data = [
    ("SQL/Data System for VSE: A Relational Data System for Application Development.", "1981", ["name1", "name 2", "name3"]),
]
df_schema = StructType([StructField("author", ArrayType(StringType()), True),
                        StructField("title", StringType(), True),
                        StructField("year", StringType(), True)])
# Fails: the first tuple element (a title string) is matched against
# the "author" field, which expects an array of strings.
sparkDF = spark.createDataFrame(data, df_schema)
sparkDF.printSchema()
sparkDF.show()
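Running this sample raises the same error as in your question:
TypeError: field author: ArrayType(StringType(), True) can not accept object 'SQL/Data System for VSE: A Relational Data System for Application Development.' in type <class 'str'>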
If all your data rows follow the same ordering, you can reorder the fields in the schema definition to match, as follows:
data = [
    ("SQL/Data System for VSE: A Relational Data System for Application Development.", "1981", ["name1", "name 2", "name3"]),
]
# Field order now matches the tuple order: title, year, author.
df_schema = StructType([StructField("title", StringType(), True),
                        StructField("year", StringType(), True),
                        StructField("author", ArrayType(StringType()), True)])
sparkDF = spark.createDataFrame(data, df_schema)
sparkDF.printSchema()
sparkDF.show()
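Equivalently, you could keep the author-first schema from your question and instead reorder the pandas columns to match it before converting. A rough sketch, assuming your pandas dataframe is named data, uses these column names, and that its author column really does hold lists of strings:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# The original author-first schema from the question.
df_schema = StructType([StructField("author", ArrayType(StringType()), True),
                        StructField("title", StringType(), True),
                        StructField("year", StringType(), True)])

# Reorder the pandas columns to match the schema's field order
# (assumes `data` is the pandas dataframe with these column names).
data = data[["author", "title", "year"]]
sparkDF = spark.createDataFrame(data, df_schema)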
Hope it helps.
CodePudding user response:
A TypeError means that something is not right with the data types in your code.
Apparently, in the column "author", your big dataframe has the string value "SQL/Data System for VSE: A Relational Data System for Application Development.". Your small dataframe may have had other values, but not this one. This value is a problem because you explicitly declared that column as ArrayType(StringType()), and a plain string cannot be converted into an array of strings. It can, however, stay a string. Use StringType() for the column "author" instead:
df_schema = StructType([StructField("author", StringType(), True),
StructField("title", StringType(), True),
StructField("year", StringType(), True)])
sparkDF = spark.createDataFrame(data, df_schema)
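If you later need author as an array of strings after all, you can split the loaded string column. A sketch, assuming the author names are separated by ", " (the separator is a guess; adjust it to your data):
from pyspark.sql import functions as F

# Split the author string into an array on ", " (assumed separator).
sparkDF = sparkDF.withColumn("author", F.split("author", ", "))
sparkDF.printSchema()  # author is now array<string>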