Problem with selecting json object in Pyspark, which may sometime have Null values-CodePudding

I have this big nested json object from which I need to make a Dataframe. One of the inner json elements sometimes come as empty and sometimes it comes with some values in it.

I am giving a simple example here:

When it is filled:

{"student_address": {"Door Number":"1234",
                     "Place":"xxxx",
                     "Zip code":"12345"}}

When it is empty:

{"student_address":""}

So, in the final DataFrame I have all the three columns Door Number, Place and Zip code. When the address is empty, I should put Null values in the respective columns and should fill them when there is data.

The code I tried:

test = test.withColumn("place",when(col("student_address") == "", lit(None)).otherwise(col("student_address.place")))\
           .withColumn("door_num",when(col("student_address") == "",lit(None)).otherwise(col("student_address.door_num")))\
           .withColumn("zip_code",when(col("student_address") == "", lit(None)).otherwise(col("student_address.zip_code")))

So, I am trying to check wether the value is empty or not.

This is the error I am getting:

AnalysisException: Can't extract value from student_address#34: need struct type but got string

I am not able to understand why PySpark is checking the statement in otherwise, when condition is met in when itself. (I tried giving simple values in otherwise instead of json path and it worked).

I am struggling to understand what is happening here and would like to know if there is any simple way to do this.

CodePudding user response：

use Get JSON object with the first object with is Student_address and the column should be the name of the column. df.withColumn("place",when(df.student_address== "", lit(None)).otherwise(get_json_object(col("student_address"),"$.student_address.place")))

CodePudding user response：

val addressSchema = StructType(StructField("Place", StringType, false) :: Nil) # Add more fields
val schema = StructType(StructField("address", addressSchema, true) :: Nil) # the point is address is nullable
val df = spark.read.schema(schema).json("example.json")