I have a dataframe which contains double quote (") and comma in value. I am trying to export the dataframe into csv but unfortunately double quote is not displayed properly in exported csv.
I managed to handle all other special characters by setting "quoteAll" to true.
In csv, if you replace the single double quote with two double quotes, it works fine. But when I export to csv with below code, it replaces " with " in exported csv.
%scala
val df = Seq((1, "A,B,C,\"DEF\""), (2, "DEF")).toDF("ID", "Val")
val updatedDf = df.columns.foldLeft(df)((acc, colname) => acc.withColumn(colname,regexp_replace(acc(s"`$colname`"), "\"", "\"\"")))
deltaDS.coalesce(1).write
.option("header", true)
.option("encoding", "utf-8")
.option("quoteAll", true)
.mode("Overwrite").csv("[Location to store csv]")
Output:
Expected Output:
How can I handle this ?
CodePudding user response:
Based on the input, the following seems to be the dataframe outputs.
df:
updatedDF:
Since quoteAll is set to true, it will work as mentioned in the spark documentation.
When converted to CSV, following seems to be the output.
df.csv:
"ID","Val"
"1","A,B,C,\"DEF\""
"2","DEF"
updatedDF.csv:
"ID","Val"
"1","A,B,C,\"\"DEF\"\""
"2","DEF"
One suggestion to deal with the escape character is to use the original dataframe and make the escape character as empty. But please be wary of the impact it might have on the other special characters.
%scala
df.coalesce(1).write
.option("header", true)
.option("encoding", "utf-8")
.option("quoteAll", true)
.option("escape", "")
.mode("Overwrite").csv("<path>/tst_1dq_noesc.csv")
"ID","Val"
"1","A,B,C,"DEF""
"2","DEF"
Hope this is the expected output...
CodePudding user response:
You haven't provided exact plaintext csv you expect, so it's hard to guess. Are you looking for this?
val df = Seq((1, "A,B,C,\"DEF\""), (2, "DEF")).toDF("ID", "Val")
df.coalesce(1).write
.option("header", true)
.option("encoding", "utf-8")
.option("quoteAll", true)
.option("escape", "\"") // escapes quotes inside quote using "
.mode("Overwrite").csv("xxx")
Resulting csv:
"ID","Val"
"1","A,B,C,""DEF"""
"2","DEF"