I have the following dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val simpleData = Seq(Row("James ","","Smith","36636","M",3000),
Row("Michael ","Rose","","40288","M",4000),
Row("Robert ","","Williams","42114","M",4000),
Row("Maria ","Anne","Jones","39192","F",4000),
Row("Jen","Mary","Brown","Rose ","F",-1)
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("id", StringType, true),
StructField("gender", StringType, true),
StructField("salary", IntegerType, true)
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData), simpleSchema)
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|Rose |     F|    -1|
+---------+----------+--------+-----+------+------+
I am running the sample code below, where I want to cast the string column id to an integer.
df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")
+-----+
|   id|
+-----+
|36636|
|40288|
|42114|
|39192|
| null|
+-----+
Here all the numeric values are converted correctly; however, "Rose" is converted to null.
How can I make Spark throw an exception whenever there is a bad record? Is there a Spark config setting for this?
Also, how can I find the exact column name that has the issue when there are multiple casts in the query?
CodePudding user response:
Since Spark 3.0 and the fix for SPARK-30292, setting the spark.sql.ansi.enabled config to true will raise an exception when you try to cast an invalid string to a number:

spark.conf.set("spark.sql.ansi.enabled", "true")
df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")

This throws a NumberFormatException when an action (for example df2.show()) is executed. See https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast for more details.
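The exception message does not always tell you which column failed. To locate the offending column when a query contains several casts, here is a sketch assuming Spark 3.2+, where the try_cast function is available: try_cast returns null instead of throwing, even with ANSI mode on, so you can probe each cast column separately.

// Rows where the original value is non-null but try_cast yields null
// are the bad records for that specific column.
val badIds = spark.sql(
  "SELECT id FROM EMP WHERE id IS NOT NULL AND try_cast(id AS INT) IS NULL")
badIds.show()
// +-----+
// |   id|
// +-----+
// |Rose |
// +-----+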
CodePudding user response:
By default, Spark does not throw when a cast fails; it silently produces null.
As a custom approach to catch those errors, you could write a UDF that throws whenever the value cannot be parsed. This hurts the performance of your job though, since Spark cannot optimise UDF execution. A sketch follows below.
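A minimal sketch of that approach, in Scala; strictIntCast and its error message are illustrative, not a standard Spark API:

import org.apache.spark.sql.functions.udf

// Parse the string ourselves and fail loudly, naming the column,
// instead of silently returning null like CAST does.
def strictIntCast(colName: String) = udf { (s: String) =>
  if (s == null) None
  else scala.util.Try(s.trim.toInt).toOption.orElse {
    throw new NumberFormatException(
      s"Column '$colName' contains a non-integer value: '$s'")
  }
}

val df2 = df.withColumn("id_int", strictIntCast("id")(df("id")))
df2.show()  // the exception only surfaces here, when an action runs

This also answers the second part of the question: since the UDF receives the column name, the exception message can point at the exact column that contains the bad value.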