I am hitting a strange error with the simplest possible Spark load-CSV example.
%spark
import org.apache.spark.sql.types._
val filePath = "/path/to/my/csv/file.csv"
val rawDF = spark.read.format("csv")
.option("delimiter"," ")
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.option("inferSchema","true")
.option("mode", "DROPMALFORMED")
.option("header", "true")
.option("multiLine", true)
.load(filePath)
//only extract the values we need
val schema = StructType(Array(
StructField("id", DoubleType),
StructField("name", StringType)))
val df = spark.createDataFrame(rawDF.rdd, schema)
println("Printing the schema ********************* ")
df.show()
Here is the CSV file:
id,name
1,name1
2,name2
3,name3
4,name4
5,name5
6,name6
I hit this error, and I do not understand why:
Printing the schema *********************
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 38) (192.168.0.35 executor driver): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of double
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, id), DoubleType) AS id#569
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, name), StringType), true, false) AS name#570
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:213)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:195)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
......
CodePudding user response:
Change

.option("delimiter"," ")

to

.option("delimiter",",")

Your file is comma-separated, but you told the reader the delimiter is a space. Each line is therefore parsed as a single string column, so when you apply a schema whose first field is DoubleType, encoding fails with "java.lang.String is not a valid external type for schema of double".
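A further sketch (not from the original answer): you can also pass the schema straight to the reader with .schema(...) and skip the createDataFrame(rawDF.rdd, schema) round trip entirely. The reader casts "1" to 1.0 for you, whereas createDataFrame expects each Row field to already be a Double. filePath is the question's placeholder, and the SparkSession setup is only needed outside a notebook, where `spark` already exists:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Assumption: standalone local run; in Zeppelin the `spark` session is predefined.
val spark = SparkSession.builder()
  .appName("csv-example")
  .master("local[*]")
  .getOrCreate()

val filePath = "/path/to/my/csv/file.csv"  // placeholder path from the question

val schema = StructType(Array(
  StructField("id", DoubleType),
  StructField("name", StringType)))

// Giving the reader the schema up front means no inferSchema pass
// and no Row-level re-encoding afterwards.
val df = spark.read.format("csv")
  .option("delimiter", ",")
  .option("header", "true")
  .schema(schema)
  .load(filePath)

df.show()
```

This also avoids the extra job that inferSchema triggers, since Spark no longer has to scan the file once just to guess the column types.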