Spark Error With Schema Validation Failure

Time:09-30

I'm hitting a strange error with the simplest possible Spark load-CSV example:

%spark

import org.apache.spark.sql.types._

val filePath = "/path/to/my/csv/file.csv"

val rawDF = spark.read.format("csv")
  .option("delimiter"," ")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("inferSchema","true")
  .option("mode", "DROPMALFORMED")
  .option("header", "true")
  .option("multiLine", true)
  .load(filePath)
   //only extract the values we need

val schema = StructType(Array(
           StructField("id", DoubleType),
           StructField("name", StringType)))
   
val df = spark.createDataFrame(rawDF.rdd, schema)

println("Printing the schema ********************* ")
df.show()

Here is the CSV file:

id,name
1,name1
2,name2
3,name3
4,name4
5,name5
6,name6

I hit this error and I don't understand why:

Printing the schema ********************* 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 38) (192.168.0.35 executor driver): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of double
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, id), DoubleType) AS id#569
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, name), StringType), true, false) AS name#570
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:213)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:195)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    ...

CodePudding user response:

Change

.option("delimiter"," ")

to

.option("delimiter",",")

Your file is comma-separated, but the delimiter option is set to a single space, so Spark reads each whole line (e.g. "1,name1") as one string column. When you then apply a schema whose first field is DoubleType via createDataFrame, the encoder rejects that string, which is exactly the "java.lang.String is not a valid external type for schema of double" in the stack trace.
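
A more robust variant of the fix (a sketch, reusing the question's filePath): pass the schema directly to the reader instead of re-applying it with createDataFrame. Note that even with the correct delimiter, inferSchema would type id as an integer, so encoding it against a DoubleType schema could still fail; letting the reader parse against the declared schema avoids that re-encoding step entirely.

```scala
import org.apache.spark.sql.types._

// Declare the target schema up front.
val schema = StructType(Array(
  StructField("id", DoubleType),
  StructField("name", StringType)))

val df = spark.read.format("csv")
  .option("delimiter", ",")   // the file is comma-separated
  .option("header", "true")
  .schema(schema)             // parse each column to the declared type
  .load(filePath)

df.printSchema()
df.show()
```

With this, df.printSchema() should report id as double and name as string, and df.show() displays the six rows without any encoder error.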