PySpark / Spark SQL DataFrame - Error while parsing Struct Type when data is null

Time:09-23

I am trying to parse a JSON file and selectively read only 50 data elements (out of 800) into a DataFrame in PySpark. One of the data elements (issues.customfield_666) is a struct type with three fields under it (id/name/tag). Sometimes this struct field comes through as null, and when that happens the Spark job fails with the error below. How can I ignore/suppress this error for null values?

"customfield_666":{"id":"46889","name":"BBX","tag":"46889"}

"customfield_666":null

AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string

from pyspark.sql.functions import *

rawDF = spark.read.json("abfss://[email protected]/raw/MyData.json", multiLine = "true")
#rawDF.printSchema()

DF = rawDF.select(explode("issues").alias("issues")) \
                .select(
                       col("issues.id").alias("IssueId"), 
                       col("issues.key").alias("IssueKey"), 
                       col("issues.fields").alias("IssueFields"),
                       col("issues.issuetype.name").alias("IssueTypeName"),
                       col("issues.customfield_666.tag").alias("IssueCust666Tag")
                      )

CodePudding user response:

You can check whether it is null first:

DF = rawDF.select(explode("issues").alias("issues")) \
                .select(
                       col("issues.id").alias("IssueId"), 
                       col("issues.key").alias("IssueKey"), 
                       col("issues.fields").alias("IssueFields"),
                       col("issues.issuetype.name").alias("IssueTypeName"),
                       when(
                           col("issues.customfield_666").isNull(), None
                       ).otherwise(col("issues.customfield_666.tag")).alias("IssueCust666Tag")
                      )

Let me know if this works for you
