I was trying to convert a big pandas dataframe (6,151,291 rows × 3 columns) to a Spark dataframe.
Here's my code:
import numpy as np
from pyspark.sql.types import *
df_schema = StructType([StructField("author", ArrayType(StringType()), True),
                        StructField("title", StringType(), True),
                        StructField("year", StringType(), True)])
#rdd = spark.sparkContext.parallelize(data)
#Pandas dataframes cannot be converted directly to an RDD.
sparkDF = spark.createDataFrame(data, df_schema)  # data is the pandas dataframe
sparkDF.printSchema()
And I got this error:
TypeError: field author: ArrayType(StringType(), True) can not accept object 'SQL/Data System for VSE: A Relational Data System for Application Development.' in type <class 'str'>
Oddly, the same code works fine when I convert a small pandas dataframe.
What should I do?
CodePudding user response:
'SQL/Data System for VSE: A Relational Data System for Application Development.' is a title value in your data, right? It looks like that title value is being matched against the author column: createDataFrame assigns values to schema fields by position, not by name. I can reproduce your error with the following code sample:
data = [
    ("SQL/Data System for VSE: A Relational Data System for Application Development.", "1981", ["name1", "name 2", "name3"]),
]
df_schema = StructType([StructField("author", ArrayType(StringType()), True),
                        StructField("title", StringType(), True),
                        StructField("year", StringType(), True)])
# Fails: the first tuple element (a title string) is matched against
# the "author" field, which expects an array of strings.
sparkDF = spark.createDataFrame(data, df_schema)
sparkDF.printSchema()
sparkDF.show()
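Running this sample raises the same error as in your question:
TypeError: field author: ArrayType(StringType(), True) can not accept object 'SQL/Data System for VSE: A Relational Data System for Application Development.' in type <class 'str'>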
If all your data rows follow the same ordering, you can reorder the fields in the schema definition to match, as follows:
data = [
    ("SQL/Data System for VSE: A Relational Data System for Application Development.", "1981", ["name1", "name 2", "name3"]),
]
# Field order now matches the tuple order: title, year, author.
df_schema = StructType([StructField("title", StringType(), True),
                        StructField("year", StringType(), True),
                        StructField("author", ArrayType(StringType()), True)])
sparkDF = spark.createDataFrame(data, df_schema)
sparkDF.printSchema()
sparkDF.show()
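Equivalently, you could keep the author-first schema from your question and instead reorder the pandas columns to match it before converting. A rough sketch, assuming your pandas dataframe is named data, uses these column names, and that its author column really does hold lists of strings:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# The original author-first schema from the question.
df_schema = StructType([StructField("author", ArrayType(StringType()), True),
                        StructField("title", StringType(), True),
                        StructField("year", StringType(), True)])

# Reorder the pandas columns to match the schema's field order
# (assumes `data` is the pandas dataframe with these column names).
data = data[["author", "title", "year"]]
sparkDF = spark.createDataFrame(data, df_schema)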
Hope it helps.
CodePudding user response:
A TypeError means that something is not right with the data types in your code.
Apparently, in the column "author", your big dataframe has the string value "SQL/Data System for VSE: A Relational Data System for Application Development.". Your small dataframe may have had other values, but not this one. This value is a problem because you explicitly declared that column as ArrayType(StringType()), and a plain string cannot be converted into an array of strings. It can, however, stay a string. Use StringType() for the column "author" instead:
df_schema = StructType([StructField("author", StringType(), True),
StructField("title", StringType(), True),
StructField("year", StringType(), True)])
sparkDF = spark.createDataFrame(data, df_schema)
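If you later need author as an array of strings after all, you can split the loaded string column. A sketch, assuming the author names are separated by ", " (the separator is a guess; adjust it to your data):
from pyspark.sql import functions as F

# Split the author string into an array on ", " (assumed separator).
sparkDF = sparkDF.withColumn("author", F.split("author", ", "))
sparkDF.printSchema()  # author is now array<string>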