Given this code, which I seem to have done successfully in the past, but...:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val arrayStructData2 = Seq(
  Row("James", 2),
  Row("Alex", 3)
)

val arrayStructSchema2 = new StructType()
  .add("names", new StructType()
    .add("name", StringType)
    .add("extraField", IntegerType)
  )

val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData2), arrayStructSchema2)
df.printSchema()
df.show()
I get this:
...
Caused by: RuntimeException: java.lang.String is not a valid external type for schema of struct<name:string,extraField:int>
I can't see the problem immediately.
CodePudding user response:
For others, as a reminder: the rows needed to be Row(Row(...)), as in:
val arrayStructData2 = Seq(
  Row(Row("James", 2)),
  Row(Row("Alex", 3))
)
Not such an obvious error, imho.
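For completeness, here is a minimal sketch of the corrected snippet end to end (assuming a spark-shell or an existing SparkSession named spark; the schema is unchanged from the question):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Each outer Row wraps an inner Row that matches the nested struct column "names"
val arrayStructData2 = Seq(
  Row(Row("James", 2)),
  Row(Row("Alex", 3))
)

val arrayStructSchema2 = new StructType()
  .add("names", new StructType()
    .add("name", StringType)
    .add("extraField", IntegerType)
  )

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructData2),
  arrayStructSchema2
)

df.printSchema()  // shows "names" as a struct with fields name and extraField
df.show()         // one row per person, with the nested struct in the "names" column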
CodePudding user response:
When you create the DataFrame with createDataFrame, you register the schema, but nothing is actually evaluated, which is why df.printSchema() works as expected. When you execute df.show(), the DataFrame is evaluated and Spark tries to load the first value you have given it (in this case a String) into a struct, which results in the RuntimeException you're seeing. Here is the scaladoc for Spark 3.1.1:
Creates a DataFrame from a java.util.List containing Rows using the given schema. It is important to make sure that the structure of every Row of the provided List matches the provided schema. Otherwise, there will be runtime exception.
It's telling you that you are trying to force a string into a struct.
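To make the lazy-evaluation point concrete, here is a small sketch (again assuming a SparkSession named spark is in scope). With the flat rows from the question, printSchema still succeeds because it only reports the registered schema; the exception only surfaces once an action such as show forces Spark to convert the rows:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType()
  .add("names", new StructType()
    .add("name", StringType)
    .add("extraField", IntegerType)
  )

// Flat rows that do NOT match the nested schema (same shape as in the question)
val badRows = spark.sparkContext.parallelize(Seq(Row("James", 2), Row("Alex", 3)))
val badDf = spark.createDataFrame(badRows, schema)

badDf.printSchema()  // works: only the registered schema is printed, nothing is evaluated
// badDf.show()      // fails at this point with:
//                   // java.lang.String is not a valid external type for schema of
//                   // struct<name:string,extraField:int>

// Nested rows that DO match the schema evaluate without error
val goodRows = spark.sparkContext.parallelize(Seq(Row(Row("James", 2)), Row(Row("Alex", 3))))
spark.createDataFrame(goodRows, schema).show()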