Pyspark write json value as string in csv column

Time:07-18

I'm using Databricks and Pyspark.

I have a notebook that loads data from csv files into a dataframe.

dataframe = spark.read.option("inferSchema", "true")\
                              .option("header", "true")\
                              .csv(csv_files)

The csv files can have columns that contain JSON values.

Example of csv file:

Name,Age,Value,Value2
Alex,23,0,3
Tom,25,1,4
Jeff,25,"{ ""property"" : ""value"",""property2"" : ""value2"" }",5
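The doubled quotes in the last row are standard CSV escaping. A minimal sketch using plain Python's csv module (not Spark) shows that such a row parses into a single field holding valid JSON:

```python
import csv
import io
import json

# Sample row reproduced from the csv file above: the JSON value is wrapped
# in quotes, and every inner quote is doubled ("" -> ").
raw = (
    "Name,Age,Value,Value2\n"
    'Jeff,25,"{ ""property"" : ""value"",""property2"" : ""value2"" }",5\n'
)

rows = list(csv.reader(io.StringIO(raw)))
header, jeff = rows

# The whole JSON object lands in one field, with "" collapsed to ".
print(jeff[2])  # { "property" : "value","property2" : "value2" }

# And the field is valid JSON.
print(json.loads(jeff[2]))
```

This is exactly what Spark's CSV reader does by default, which is why the quotes are no longer doubled once the value is in the dataframe.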

Then I apply some logic to the dataframe, such as aggregations.

Example of dataframe:

Name Age Value Value2
Alex 23 0 3
Tom 25 1 4
Jeff 25 { "property" : "value","property2" : "value2" } 5

And finally I save the dataframe into a csv file, let's say newfile.csv.

dataframe.repartition(1).write.mode("overwrite")\
                        .option("header", "true")\
                        .option("escape", '"')\
                        .csv(destination_folder)

The problem is that newfile.csv is like the below:

Name,Age,Value,Value2
Alex,23,0,3
Tom,25,1,4
Jeff,25,{ "property" : "value","property2" : "value2" },5

When we load a csv file as a dataframe, Spark strips the escaping double quotes, which is why the value no longer contains them when we save it.

Spark is then treating the JSON's commas as column separators in the output csv file. I tried using the escape option, but it is not working.

Any ideas on how to solve this? I want to save the JSON values as string values, just like in the first csv example.

CodePudding user response:

Here, you would need to save your .csv file with a separator other than a comma, since the comma splits the json value across columns. Here is how you can do it:

dataframe.repartition(1).write.mode("overwrite").option("header", "true").option("sep", "|").csv(destination_folder)
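A minimal sketch of the same idea using plain Python's csv module (so it runs without Spark): with "|" as the delimiter, the commas inside the JSON are just data, and the value survives a write/read round trip intact.

```python
import csv
import io

# JSON value as it sits in the dataframe (no doubled quotes).
value = '{ "property" : "value","property2" : "value2" }'

# Write the row using "|" as the separator, mirroring option("sep", "|").
buf = io.StringIO()
csv.writer(buf, delimiter="|").writerow(["Jeff", "25", value, "5"])

# Read it back with the same separator: the JSON comes back as one field.
row = next(csv.reader(io.StringIO(buf.getvalue()), delimiter="|"))
print(row[2] == value)  # True
```

The trade-off is that whoever consumes the file must also read it with sep="|".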

CodePudding user response:

You can escape the quotes inside your json with \ and then quote the entire json value in your csv. That is, use \ to escape quotes instead of doubling them.
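In Spark this corresponds to writing with .option("escape", "\\"). A sketch of the same backslash-escaping style with plain Python's csv module (the Spark option and this stdlib demo should produce equivalent quoting):

```python
import csv
import io

value = '{ "property" : "value","property2" : "value2" }'

# doublequote=False plus escapechar="\\" writes inner quotes as \" instead
# of "", while the field as a whole is still wrapped in quotes.
buf = io.StringIO()
csv.writer(buf, escapechar="\\", doublequote=False).writerow(
    ["Jeff", "25", value, "5"]
)
print(buf.getvalue().strip())

# Reading back with the same settings recovers the original JSON string.
row = next(
    csv.reader(io.StringIO(buf.getvalue()), escapechar="\\", doublequote=False)
)
print(row[2] == value)  # True
```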
