I am trying to parse a JSON file and selectively read only 50 of its 800 data elements into a PySpark DataFrame. One of the elements (issues.customfield_666) is a struct type with three fields (id/name/tag) under it. Sometimes this field comes as null, and when it does the Spark job fails with the error below. How can I ignore/suppress this error for null values?
"customfield_666":{"id":"46889","name":"BBX","tag":"46889"}
"customfield_666":null
AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string
from pyspark.sql.functions import col, explode
rawDF = spark.read.json("abfss://[email protected]/raw/MyData.json", multiLine = "true")
#rawDF.printSchema()
DF = rawDF.select(explode("issues").alias("issues")) \
.select(
col("issues.id").alias("IssueId"),
col("issues.key").alias("IssueKey"),
col("issues.fields").alias("IssueFields"),
col("issues.issuetype.name").alias("IssueTypeName"),
col("issues.customfield_666.tag").alias("IssueCust666Tag")
)
CodePudding user response:
You can check whether the struct is null first:
from pyspark.sql.functions import col, explode, when

DF = rawDF.select(explode("issues").alias("issues")) \
.select(
col("issues.id").alias("IssueId"),
col("issues.key").alias("IssueKey"),
col("issues.fields").alias("IssueFields"),
col("issues.issuetype.name").alias("IssueTypeName"),
when(
col("issues.customfield_666").isNull(), None
).otherwise(col("issues.customfield_666.tag")).alias("IssueCust666Tag")
)
Let me know if this works for you