Home > Net >  Handle double quote while exporting dataframe to CSV
Handle double quote while exporting dataframe to CSV

Time:11-14

I have a dataframe which contains double quote (") and comma in value. I am trying to export the dataframe into csv but unfortunately double quote is not displayed properly in exported csv.

I managed to handle all other special characters by setting "quoteAll" to true.

In csv, if you replace the single double quote with two double quotes, it works fine. But when I export to csv with below code, it replaces " with " in exported csv.

%scala

val df = Seq((1, "A,B,C,\"DEF\""), (2, "DEF")).toDF("ID", "Val")

val updatedDf = df.columns.foldLeft(df)((acc, colname) => acc.withColumn(colname,regexp_replace(acc(s"`$colname`"), "\"", "\"\"")))


deltaDS.coalesce(1).write
.option("header", true)
.option("encoding", "utf-8")
.option("quoteAll", true)
.mode("Overwrite").csv("[Location to store csv]")

Output:

enter image description here

Expected Output:

enter image description here

How can I handle this ?

CodePudding user response:

Based on the input, the following seems to be the dataframe outputs.

df:

df

updatedDF:

updatedDF

Since quoteAll is set to true, it will work as mentioned in the spark documentation.

enter image description here

When converted to CSV, following seems to be the output.

df.csv:

"ID","Val"
"1","A,B,C,\"DEF\""
"2","DEF"

updatedDF.csv:

"ID","Val"
"1","A,B,C,\"\"DEF\"\""
"2","DEF"

One suggestion to deal with the escape character is to use the original dataframe and make the escape character as empty. But please be wary of the impact it might have on the other special characters.

%scala
df.coalesce(1).write
.option("header", true)
.option("encoding", "utf-8")
.option("quoteAll", true)
.option("escape", "")
.mode("Overwrite").csv("<path>/tst_1dq_noesc.csv")
"ID","Val"
"1","A,B,C,"DEF""
"2","DEF"

Hope this is the expected output...

CodePudding user response:

You haven't provided exact plaintext csv you expect, so it's hard to guess. Are you looking for this?

val df = Seq((1, "A,B,C,\"DEF\""), (2, "DEF")).toDF("ID", "Val")

df.coalesce(1).write
.option("header", true)
.option("encoding", "utf-8")
.option("quoteAll", true)
.option("escape", "\"")  // escapes quotes inside quote using "
.mode("Overwrite").csv("xxx")

Resulting csv:

"ID","Val"
"1","A,B,C,""DEF"""
"2","DEF"
  • Related