I have this big nested json object from which I need to make a Dataframe. One of the inner json elements sometimes come as empty and sometimes it comes with some values in it.
I am giving a simple example here:
When it is filled:
{"student_address": {"Door Number":"1234",
"Place":"xxxx",
"Zip code":"12345"}}
When it is empty:
{"student_address":""}
So, in the final DataFrame I have all the three columns Door Number, Place and Zip code.
When the address is empty, I should put Null
values in the respective columns and should fill them when there is data.
The code I tried:
test = test.withColumn("place",when(col("student_address") == "", lit(None)).otherwise(col("student_address.place")))\
.withColumn("door_num",when(col("student_address") == "",lit(None)).otherwise(col("student_address.door_num")))\
.withColumn("zip_code",when(col("student_address") == "", lit(None)).otherwise(col("student_address.zip_code")))
So, I am trying to check wether the value is empty or not.
This is the error I am getting:
AnalysisException: Can't extract value from student_address#34: need struct type but got string
I am not able to understand why PySpark is checking the statement in otherwise
, when condition is met in when
itself. (I tried giving simple values in otherwise
instead of json path and it worked).
I am struggling to understand what is happening here and would like to know if there is any simple way to do this.
CodePudding user response:
use Get JSON object with the first object with is Student_address and the column should be the name of the column. df.withColumn("place",when(df.student_address== "", lit(None)).otherwise(get_json_object(col("student_address"),"$.student_address.place")))
CodePudding user response:
val addressSchema = StructType(StructField("Place", StringType, false) :: Nil) # Add more fields
val schema = StructType(StructField("address", addressSchema, true) :: Nil) # the point is address is nullable
val df = spark.read.schema(schema).json("example.json")