pyspark save json handling nulls for struct


I'm using PySpark with Spark 2.4 and Python 3. When writing the DataFrame out as a JSON file, I want a null struct column to be written as {} and a null struct field to be written as "". For example:

    >>> df.printSchema()
    root
     |-- id: string (nullable = true)
     |-- child1: struct (nullable = true)
     |    |-- f_name: string (nullable = true)
     |    |-- l_name: string (nullable = true)
     |-- child2: struct (nullable = true)
     |    |-- f_name: string (nullable = true)
     |    |-- l_name: string (nullable = true)

    >>> df.show()
    +---+------------+------------+
    | id|      child1|      child2|
    +---+------------+------------+
    |123|[John, Matt]|[Paul, Matt]|
    |111|[Jack, null]|        null|
    |101|        null|        null|
    +---+------------+------------+
    df.fillna("").coalesce(1).write.mode("overwrite").format('json').save('/home/test')

Result:


    {"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
    {"id":"111","child1":{"f_name":"jack","l_name":""}}
    {"id":"111"}

Output Required:


    {"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
    {"id":"111","child1":{"f_name":"jack","l_name":""},"child2": {}}
    {"id":"111","child1":{},"child2": {}}

I tried some map functions and UDFs but was not able to achieve what I need. I'd appreciate any help here.

CodePudding user response:

Spark 3.x

In Spark 3.x you can set the writer option ignoreNullFields to False, which produces the output below. It is not exactly the empty struct you asked for, but the null columns and fields are preserved and the schema is still correct.

df.fillna("").coalesce(1).write.mode("overwrite").format('json').option('ignoreNullFields', False).save('/home/test')
{"child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"},"id":"123"}
{"child1":{"f_name":"Jack","l_name":null},"child2":null,"id":"111"}
{"child1":null,"child2":null,"id":"101"}

Spark 2.x

Since that option does not exist in Spark 2.x, there is a "dirty fix": mimic the JSON structure yourself and bypass the null check by replacing each null struct with a struct of null fields. Again, the result is not exactly what you're asking for, but the schema is correct.

    from pyspark.sql import functions as F

    (df
        .select(F.struct(
            F.col('id'),
            # Replace a null struct with a struct of null fields so the writer keeps the key
            F.coalesce(F.col('child1'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child1'),
            F.coalesce(F.col('child2'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child2')
        ).alias('json'))
        .coalesce(1).write.mode("overwrite").format('json').save('/home/test')
    )

    {"json":{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}}
    {"json":{"id":"111","child1":{"f_name":"Jack"},"child2":{}}}
    {"json":{"id":"101","child1":{},"child2":{}}}