spark.read reads an empty string as null when data is read from a part file

Let's consider a CSV file with the following data:

Id,Job,year

1,,2000

CSV Reader code:

import org.apache.spark.sql.Row

var inputDFRdd = spark.emptyDataFrame.rdd
inputDFRdd = spark.read.format("com.databricks.spark.csv")
        .option("mode", "FAILFAST")
        .option("delimiter", ",")
        .option("header", "false")
        .option("inferSchema", "false")
        .option("escape", "\"").load().rdd.zipWithIndex()
        // Prepend a 1-based row number to each row's values
        .map(line => Row.fromSeq(Seq(line._2 + 1) ++ line._1.toSeq))

When this code reads the incoming file, the DataFrame keeps the empty string as an empty string. When the same code reads a part file, the DataFrame returns null in place of the empty string.

I am looking for a way to read the empty string as an empty string from the part file.

CodePudding user response：

By default, an empty string is read as null when reading a CSV file.

You can change that behavior with the nullValue option:

.option("nullValue", "null") // Only the literal string "null" is read as null.
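
For context, here is a minimal sketch of how that option could sit in a complete read. It assumes a local SparkSession and the built-in "csv" format rather than com.databricks.spark.csv. The object name, the temp directory, and the file name part-00000.csv are made up for illustration. How unquoted empty fields are handled has changed between Spark versions, so whether nullValue alone is enough is something to verify on your own cluster. The sketch therefore also shows a fallback, na.fill(""), which replaces nulls in string columns with empty strings.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object ReadEmptyAsEmptyString {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-empty-as-empty-string")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // Write the sample row from the question to a throwaway file
    // (hypothetical location and file name).
    val dir = Files.createTempDirectory("csv-demo")
    val file = dir.resolve("part-00000.csv")
    Files.write(file, "1,,2000\n".getBytes(StandardCharsets.UTF_8))

    // Read every column as a string and only treat the literal
    // text "null" as the null marker.
    val df = spark.read
      .format("csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("nullValue", "null")
      .load(file.toString)

    // Fallback: turn any remaining nulls in string columns into "".
    val filled = df.na.fill("")

    df.show()     // inspect what the reader produced on this Spark version
    filled.show() // string columns contain no nulls after the fill

    spark.stop()
  }
}
```

Note that na.fill("") cannot tell a field that was genuinely null apart from one that was empty, so it only fits when both can be treated as an empty string.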

CodePudding user response：

In 1,,2000 the second value is empty, which conforms to null, so the behavior is justified.