I am using a Spark Databricks cluster in Azure. My requirement is to generate JSON and save the JSON file to Databricks storage, but I am getting the error below:
Object of type RDD is not JSON serializable
code:
import json

df = spark.read.format("csv") \
    .option("inferSchema", False) \
    .option("header", True) \
    .option("sep", ",") \
    .load("path-to-file")

df_json = df.toJSON()

file_out = "out.json"
with open(file_out, 'w') as f:
    json.dump(df_json, f)
How can I fix this issue?
CodePudding user response:
The issue arises with json.dump(): this function needs a JSON-serializable object (such as a dict or a list), but df.toJSON() returns an RDD of JSON strings, which is not serializable. I got the same error when I ran your code.
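To illustrate, here is a minimal sketch (using a small hypothetical DataFrame, not your data) showing that df.toJSON() produces an RDD of JSON strings rather than a plain Python object:
# Hypothetical two-row DataFrame just to show what toJSON() returns
sample_df = spark.createDataFrame([("a", 1), ("b", 2)], ["col1", "col2"])
sample_json = sample_df.toJSON()

print(type(sample_json))    # an RDD (pyspark.rdd.RDD), which json.dump cannot serialize
print(sample_json.first())  # '{"col1":"a","col2":1}' -- each element is a JSON string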
- To fix the code, first bring the RDD's contents back to the driver as a Python list of JSON strings. This can be done with df_json.collect(). The following is the output of df_json.collect() for my sample data:
print(df_json.collect())
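(With a hypothetical two-column CSV read with header=True and inferSchema=False, the collected list would look roughly like the following; the exact contents depend on your data.)
['{"col1":"a","col2":"1"}', '{"col1":"b","col2":"2"}']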
- You can see that this is a list of strings, where each string is a JSON object. You can use the code below to parse each string into a Python dict and build a list that can be written out. (json.loads is preferable to eval here: eval is unsafe on untrusted input and fails on JSON values such as true, false, and null.)
output = [json.loads(i) for i in df_json.collect()]
# output now holds the generated JSON as a list of dicts
import json

file_out = "output.json"
# With a bare filename, the file is saved on the driver's local disk at /databricks/driver/
with open(file_out, 'w') as f:
    json.dump(output, f)
- Use dbutils.fs.ls() to verify the file exists. When the file path is just a filename (file_out = "output.json"), the file is written to /databricks/driver/ on the driver node.
display(dbutils.fs.ls("file:/databricks/driver"))
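- Note that /databricks/driver/ is the driver node's local disk, so the file will not survive cluster termination. If the requirement is to land the JSON in Databricks (DBFS) storage, one option is to copy it there afterwards; the destination path below is only an example:
dbutils.fs.cp("file:/databricks/driver/output.json", "dbfs:/FileStore/output.json")
Alternatively, the file can be written directly through the DBFS FUSE mount by passing a path such as /dbfs/FileStore/output.json to open().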
- When I read the same file back, you can see it loads successfully and contains the expected JSON data.
with open(file_out, 'r') as k:
    ans = json.load(k)
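- As a side note beyond the fix above: if collecting everything to the driver is not required, Spark's own JSON writer can write the DataFrame straight to DBFS, skipping json.dump entirely. A minimal sketch with an example output path (this produces a directory of part files, one JSON object per line):
df.write.mode("overwrite").json("dbfs:/FileStore/out_json")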